Node.js, Tor, Puppeteer and Cheerio: anonymous web scraping

Web scraping is a method of collecting data from websites. The term usually refers to automated data collection. Today we'll talk about how to collect data from sites anonymously. One reason someone may want anonymity in web scraping is that many web servers apply rate limits or blocks to IP addresses from which more than a certain number of requests arrive over a certain period of time. Here we will use the following tools:

  • Puppeteer - for accessing web pages.
  • Cheerio - for parsing HTML code.
  • Tor - to execute each request from a different IP address.

It should be noted that the legal aspects of web scraping are a complex and often unclear issue. Therefore, respect the "Terms of Use" of the pages whose data you collect. There is plenty of good material on this topic worth reading.



Tor installation


Let's start at the beginning: first of all, install the Tor client using the following command:

sudo apt-get install tor

Tor setup


Now configure the Tor client. By default, Tor opens a single SOCKS port, which gives us one path to a single exit node (that is, one IP address). For everyday use of Tor, like ordinary web browsing, this is a good fit. But we need several IP addresses, so that we can switch between them during the web scraping process.

In order to configure Tor as we need, we simply open additional ports for listening to SOCKS connections. This is done by adding several SocksPort entries to the main configuration file of the program, which can be found in /etc/tor.

Open the file /etc/tor/torrc in a text editor and add the following entries to the end of the file:

# Open 4 SOCKS ports, each accepting connections routed through the Tor network.

SocksPort 9050
SocksPort 9052
SocksPort 9053
SocksPort 9054

Here it is worth paying attention to the following:

  • The SocksPort option defines a port that Tor will listen on for SOCKS connections, i.e., a port that acts as a SOCKS proxy.
  • Each SocksPort entry takes a number: the port to be opened.
  • 9050 is the default port that the Tor client listens on.
  • We skip 9051: this is Tor's control port, intended for controlling the Tor process from external applications.
  • After 9051, the port numbers simply continue in increments of 1 (9052, 9053, 9054).

In order to apply the changes made to the configuration file, restart the Tor client:

sudo /etc/init.d/tor restart

Creating a new Node.js project


Create a new directory for the project. Let's call it superWebScraping:

mkdir superWebScraping

Let's go to this directory and initialize an empty Node project:

cd superWebScraping && npm init -y

Install the necessary dependencies:

npm i --save puppeteer cheerio

Work with websites using Puppeteer


Puppeteer is a Node.js library for controlling a headless (no user interface) Chrome or Chromium browser over the DevTools protocol. The reason we don't use a plain request library here, such as tor-request, is that such a library cannot handle sites built as single-page web applications, whose contents are loaded dynamically with JavaScript.

Create a file index.js and put the following code into it. The main features of this code are described in the comments.

/**
 * Import puppeteer.
 */
const puppeteer = require('puppeteer');

/**
 * The main() function contains the code that works with the web page.
 * Note that it is declared async, since most of the puppeteer
 * methods we call return promises.
 */
async function main() {
  /**
   * Launch Chromium. Setting `headless` to false
   * makes the browser window visible on screen.
   */
  const browser = await puppeteer.launch({
    headless: false
  });

  /**
   * Create a new page.
   */
  const page = await browser.newPage();

  /**
   * Navigate the page to https://api.ipify.org.
   */
  await page.goto('https://api.ipify.org');

  /**
   * Close the browser 3 seconds after the page opens.
   */
  setTimeout(() => {
    browser.close();
  }, 3000);
}

/**
 * Run the script by calling main().
 */
main();

Run the script with the following command:

node index.js

After that, the Chromium browser window should appear on the screen, with the address https://api.ipify.org open.


Browser window, Tor connection is not used.

I opened https://api.ipify.org in the browser window precisely because this page shows the public IP address from which it is accessed. This is the address that the sites I visit can see if I browse them without using Tor.

Change the above code by adding the following key to the object with the parameters that is passed to puppeteer.launch:

  /**
   * Launch Chromium. Setting `headless` to false
   * makes the browser window visible on screen.
   */
  const browser = await puppeteer.launch({
    headless: false,

    // Use the Tor SOCKS proxy.
    args: ['--proxy-server=socks5://127.0.0.1:9050']
  });

We passed the --proxy-server argument to the browser. Its value tells the browser to use a socks5 proxy server running on our computer and accessible on port 9050. That port number is one of those we previously entered into the torrc file.

Run the script again:

node index.js

This time, on the open page, you can see a different IP address. This is the address used to view the site through the Tor network.


Browser window, Tor connection is used.

In my case, the address 144.217.7.33 appeared in this window. You may see a different address. Please note that if you run the script again using the same port number (9050), you will receive the same IP address as before.


Re-launched browser window, Tor connection is used.

That is why in the Tor settings we opened several ports. Try connecting your browser to a different port. This will change the IP address.

Data collection with Cheerio


Now that we have a convenient mechanism for loading pages, it's time to do some web scraping. For this we are going to use the cheerio library, an HTML parser whose API mirrors the jQuery API. Our goal is to get the 5 latest posts from the Hacker News page.

Go to the Hacker News website .


Hacker News website

We want to take the 5 freshest headlines from the open page (at the moment these are HAKMEM (1972), Larry Roberts has died, and others). Examining an article title with the browser developer tools, I noticed that each title is placed in an HTML element <a> with the class storylink.


Examining the structure of the document

To extract what we need from the page's HTML, we will perform the following sequence of actions:

  • Start a new headless browser instance connected to the Tor proxy.
  • Create a new page.
  • Navigate to https://news.ycombinator.com/.
  • Retrieve the HTML content of the page.
  • Load the HTML content of the page into cheerio.
  • Create an array to hold the article titles.
  • Select the elements with the storylink class.
  • Take the first 5 such elements using cheerio's slice() method.
  • Traverse the resulting elements using cheerio's each() method.
  • Write each title found into the array.

Here is the code that implements these actions:

const puppeteer = require('puppeteer');

/**
 * Import cheerio.
 */
const cheerio = require('cheerio');

async function main() {
  const browser = await puppeteer.launch({
    /**
     * This time the browser runs headless (without a visible window).
     */
    headless: true,
    args: ['--proxy-server=socks5://127.0.0.1:9050']
  });

  const page = await browser.newPage();

  await page.goto('https://news.ycombinator.com/');

  /**
   * Get the HTML content of the loaded page.
   */
  const content = await page.content();

  /**
   * Load the HTML into cheerio.
   */
  const $ = cheerio.load(content);

  /**
   * An array to hold the article titles.
   */
  const titles = [];

  /**
   * Select all elements with the `storylink` class.
   * The slice() method keeps only the first 5 of them.
   * Traverse the result with each().
   */
  $('.storylink').slice(0, 5).each((idx, elem) => {
    /**
     * Get the text of the HTML element that holds the title.
     */
    const title = $(elem).text();

    /**
     * Push the title into the array.
     */
    titles.push(title);
  });

  browser.close();

  /**
   * Print the array of titles to the console.
   */
  console.log(titles);
}

main();

Here's what happens after running this script.


First 5 Hacker News Headlines Successfully Extracted from Page Code

Continuous scraping using different IP addresses


Now let's talk about how to use the various SOCKS ports that we specified in the torrc file. It is pretty simple. We will declare an array whose elements are port numbers. Then we rename the main() function to scrape() and declare a new main() function that calls scrape(), passing it a new port number on each call.

Here is the finished code:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrape(port) {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=socks5://127.0.0.1:' + port]
  });

  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com/');
  const content = await page.content();

  const $ = cheerio.load(content);

  const titles = [];

  $('.storylink').slice(0, 5).each((idx, elem) => {
    const title = $(elem).text();
    titles.push(title);
  });

  browser.close();
  return titles;
}

async function main() {
  /**
   * The Tor SOCKS ports we listed in the torrc file.
   */
  const ports = [
    '9050',
    '9052',
    '9053',
    '9054'
  ];

  /**
   * In an endless loop...
   */
  while (true) {
    for (const port of ports) {
      /**
       * ...scrape through each port in turn and print the titles.
       */
      console.log(await scrape(port));
    }
  }
}

main();
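One practical refinement: the endless loop above fires a new request as soon as the previous scrape() finishes. A small sketch of pausing between requests follows; the sleep() helper is my own, not part of Node's standard library, and politeMain() takes scrape() and the port list as parameters so the snippet stands on its own:

```javascript
/**
 * sleep() resolves after the given number of milliseconds,
 * letting us await a pause inside an async function.
 */
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

/**
 * A variant of main() that waits 5 seconds between scrape() calls,
 * so the target site is not hammered with back-to-back requests.
 */
async function politeMain(scrape, ports) {
  while (true) {
    for (const port of ports) {
      console.log(await scrape(port));
      await sleep(5000);
    }
  }
}

// Example usage with the scrape() function from the article:
// politeMain(scrape, ['9050', '9052', '9053', '9054']);
```

Even with rotating IP addresses, spacing out requests is the considerate (and less detectable) way to scrape.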

Summary


Now you have tools at your disposal that allow you to do anonymous web scraping.

Dear readers! Have you ever done web scraping? If so, please tell us which tools you use for this.

Source: https://habr.com/ru/post/undefined/

