I have a list of URLs to scrape from a website that uses React, so I am using Puppeteer. To avoid being blocked by anti-bot measures, I have added puppeteer-extra-plugin-stealth. To prevent ads from loading on the pages, I am blocking them with puppeteer-extra-plugin-adblocker. To keep my IP address from being blacklisted, I route traffic through TOR nodes so that requests come from different IP addresses. Below is a simplified version of my code, and the setup works (TOR_port and webUrl are actually assigned dynamically, but I have hard-coded them here to simplify the question). There is a problem, though:

const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

const TOR_port = 13931;
const webUrl = 'https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';

// `await` is only valid inside an async function in CommonJS, so wrap everything.
(async () => {
    const browser = await puppeteer.launch({
        dumpio: false,
        headless: false,
        args: [
            `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
            `--no-sandbox`,
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(webUrl, {
            waitUntil: 'load',
            timeout: 30000,
        });

        // Await the selector so a timeout lands in the catch block below
        // instead of becoming an unhandled promise rejection.
        await page.waitForSelector('.price');
        console.log('The price is available');
    } catch (e) {
        // Navigation failed or '.price' never appeared, so this is clearly not a zillow page.
        console.error('This is not the zillow website:', e.message);
    } finally {
        // Close the browser exactly once, on success or failure.
        await browser.close();
    }
})();

The above setup works but is very unreliable, and I recently learnt about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to track my scraping tasks.

So, my question is: how do I implement Puppeteer-Cluster with the above setup? I am aware of an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) offered by the library that shows how to use plugins, but it is so bare that I didn't quite understand it.

How do I implement Puppeteer-Cluster with the above TOR, AdBlocker, and Stealth configurations?

Asked Dec 24, 2020 at 18:45 by Daggie Blanqx - Douglas Mwangi; edited Feb 20, 2021 at 9:25 by DisappointedByUnaccountableMod.

  • maybe you can read this as reference github.com/thomasdondorf/puppeteer-cluster/issues/… – Edi Imanto, Jan 1, 2021 at 22:04

2 Answers

You can just hand your puppeteer-extra instance over to Cluster.launch(), like this:

const { Cluster } = require('puppeteer-cluster');
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

(async () => {
    // The `puppeteer` option tells the cluster which library to launch browsers with.
    const cluster = await Cluster.launch({
        puppeteer,
    });
})();

Src: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions

You can just add the plugins with puppeteer.use()
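To make this concrete for the question's setup, here is a minimal sketch of how the pieces might fit together. It assumes the cluster forwards `puppeteerOptions` to the launch call of whichever library is passed as `puppeteer`, and it uses `CONCURRENCY_BROWSER` so each worker gets its own browser. The helper names `buildClusterOptions` and `scrapeAll`, the `maxConcurrency` value, and the 30-second timeout are illustrative choices, not something taken from the original post.

```javascript
// Sketch only: buildClusterOptions and scrapeAll are hypothetical helper names.
const TASK_TIMEOUT_MS = 30000;

// Builds the options object for Cluster.launch(). It has no side effects,
// so it can be inspected without starting a browser.
function buildClusterOptions(puppeteer, Cluster, torPort) {
  return {
    puppeteer, // the puppeteer-extra instance with plugins already registered
    concurrency: Cluster.CONCURRENCY_BROWSER, // one browser per worker
    maxConcurrency: 2, // assumption: tune to your machine and Tor setup
    timeout: TASK_TIMEOUT_MS,
    puppeteerOptions: {
      headless: false,
      ignoreHTTPSErrors: true,
      args: [`--proxy-server=socks5://127.0.0.1:${torPort}`, '--no-sandbox'],
    },
  };
}

async function scrapeAll(urls, torPort) {
  // Required lazily so the helper above works without the packages installed.
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
  const { Cluster } = require('puppeteer-cluster');

  puppeteer.use(StealthPlugin());
  puppeteer.use(AdblockerPlugin());

  const cluster = await Cluster.launch(buildClusterOptions(puppeteer, Cluster, torPort));

  // Errors thrown inside a task are reported here instead of crashing the run.
  cluster.on('taskerror', (err, data) => {
    console.error(`Failed to scrape ${data}: ${err.message}`);
  });

  // The cluster opens and closes the page for every queued URL.
  await cluster.task(async ({ page, data: url }) => {
    await page.setViewport({ width: 1280, height: 720 });
    await page.goto(url, { waitUntil: 'load', timeout: TASK_TIMEOUT_MS });
    await page.waitForSelector('.price');
    console.log(`The price is available on ${url}`);
  });

  for (const url of urls) cluster.queue(url);

  await cluster.idle(); // wait until the queue is empty
  await cluster.close();
}
```

Calling `scrapeAll([webUrl], TOR_port)` with the values from the question would then queue that page. The browser and page lifecycle that the original code managed by hand is left to the cluster.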

You have to use puppeteer-extra.

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());

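  // Pass this instance to the cluster, e.g. Cluster.launch({ puppeteer }),
  // then define a task and queue your URLs.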
})();
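Since the question mentions several TOR ports, one possible extension is to give each worker browser its own proxy. The sketch below assumes a puppeteer-cluster version that supports the `perBrowserOptions` launch option (an array of launch options, one per browser, used together with `CONCURRENCY_BROWSER`); check the README of your installed version before relying on it. The helper names `buildPerBrowserOptions` and `launchTorCluster` are hypothetical, and the Adblocker plugin is used in place of the Recaptcha plugin to match the question.

```javascript
// Hypothetical helper: one launch-options object per Tor SOCKS port.
function buildPerBrowserOptions(torPorts) {
  return torPorts.map((port) => ({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--proxy-server=socks5://127.0.0.1:${port}`, '--no-sandbox'],
  }));
}

async function launchTorCluster(torPorts) {
  // Required lazily so the helper above works without the packages installed.
  const { addExtra } = require('puppeteer-extra');
  const vanillaPuppeteer = require('puppeteer');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
  const { Cluster } = require('puppeteer-cluster');

  // Wrap vanilla puppeteer so plugins can be registered on it.
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(AdblockerPlugin());

  return Cluster.launch({
    puppeteer,
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: torPorts.length, // one worker per Tor port
    // Assumption: supported by your puppeteer-cluster version.
    perBrowserOptions: buildPerBrowserOptions(torPorts),
  });
}
```

For example, `launchTorCluster([13931, 13932])` would try to start two browsers, each behind a different local SOCKS port (the second port number is made up for illustration).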
