admin管理员组文章数量:1221466
I'm creating a web api that scrapes a given url and sends that back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and recieved an answer that suggested it would only work if headless was set to be false. I don't want to be constantly opening up a browser UI i don't need (I just the need the data!) so I'm looking for why headless has to be false and can I get a fix that lets headless = true.
Here's my code:
express()
.get("/*", (req, res) => {
global.notBaseURL = req.params[0];
(async () => {
const browser = await puppet.launch({ headless: false }); // Line of Interest
const page = await browser.newPage();
console.log(req.params[0]);
await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url
title = await page.$eval("title", (el) => el.innerText);
browser.close();
res.send({
title: title,
});
})();
})
.listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: ;recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
I'm creating a web api that scrapes a given url and sends that back. I am using Puppeteer to do this. I asked this question: Puppeteer not behaving like in Developer Console
and recieved an answer that suggested it would only work if headless was set to be false. I don't want to be constantly opening up a browser UI i don't need (I just the need the data!) so I'm looking for why headless has to be false and can I get a fix that lets headless = true.
Here's my code:
express()
.get("/*", (req, res) => {
global.notBaseURL = req.params[0];
(async () => {
const browser = await puppet.launch({ headless: false }); // Line of Interest
const page = await browser.newPage();
console.log(req.params[0]);
await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url
title = await page.$eval("title", (el) => el.innerText);
browser.close();
res.send({
title: title,
});
})();
})
.listen(PORT, () => console.log(`Listening on ${PORT}`));
This is the page I'm trying to scrape: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106?origin=coordinating-5460106-0-1-FTR-recbot-recently_viewed_snowplow_mvp&recs_placement=FTR&recs_strategy=recently_viewed_snowplow_mvp&recs_source=recbot&recs_page_type=category&recs_seed=0&color=BLACK
Share Improve this question edited Sep 9, 2020 at 21:25 Qrow Saki asked Sep 9, 2020 at 20:08 Qrow SakiQrow Saki 1,0521 gold badge10 silver badges26 bronze badges3 Answers
Reset to default 11The reason it might work in UI mode but not headless is that sites who aggressively fight scraping will detect that you are running in a headless browser.
Some possible workarounds:
Use puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua
-- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.puppeteer-extra-plugin-stealth
-- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
Run a "real" Chromium instance/UI
It's possible to run a single browser UI in a manner that let's you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222
(or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });
. Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The ENDPOINT_URL
is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222
option.
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies I'm sure but those are the two I'm most familiar with. Good luck!
Todd's answer is thorough, but worth trying before resorting to some of the recommendations there is to try manually setting a human user agent, based on the Puppeteer GitHub issue Different behavior between { headless: false } and { headless: true }:
const ua =
"Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Mobile Safari/537.3";
await page.setUserAgent(ua);
await page.goto(yourURL);
Now, the Nordstorm site provided by OP seems to be able to detect robots even with headless: false
, at least at the present moment. But other sites are less strict and I've found the above line to be useful on some of them as shown in Puppeteer can't find elements when Headless TRUE and Puppeteer , bringing back blank array, among many other cases.
Visit the GH issue thread above for other ideas and see useragents.me and the user-agents npm package for a rotating list of current user agents. The one provided here may not work.
https://bot.sannysoft.com/, https://bot-detector.rebrowser.net/ and https://www.browserscan.net/ are useful tools for checking to what extent your script may be seen as a bot.
on top of todd's answer I would recommend to use args: ['--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36']
while launching puppeteer so set userAgent at browser level, so that even if you open new tabs, user agent does not change.
本文标签: javascriptWhy does headless need to be false for Puppeteer to workStack Overflow
版权声明:本文标题:javascript - Why does headless need to be false for Puppeteer to work? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1739365850a2160033.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论