I tried to scrape an Amazon page to get the price of a product, but the scraped result differs from the price shown in the actual browser. I checked many times but couldn't get the right result: the scraper gives me $89.99, while on the actual site the product costs $58.95. Does Amazon confuse web scrapers and crawlers intentionally, or is it my fault? I used Puppeteer and JSDom in Node.js.
NodeJS code:
const puppeteer = require('puppeteer');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const url = 'https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1';

async function configureBrowser() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  return page;
}

async function pageContent() {
  let page = await configureBrowser();
  // await page.reload();
  let html = await page.evaluate(() => document.body.innerHTML);
  await page.close();
  console.log(new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent);
  // return new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent;
}

module.exports = pageContent;
asked Jul 4, 2021 at 20:48 by KoboldMines
edited Jun 9, 2022 at 16:12 by ggorlen
2 Answers
It's odd to combine JSDom with Puppeteer. Puppeteer already has a full suite of selectors and works on the actual, real-time DOM inside the web page, so dumping and re-parsing the entire HTML with a simulated DOM like JSDom is an unnecessary layer of indirection that can lead to confusion.
When the page is injecting the content dynamically, just use Puppeteer alone:
const puppeteer = require("puppeteer"); // ^23.0.1

const url = "<your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");

  // The price is in the static HTML, so skip JS and every request
  // except the one for the page itself.
  await page.setJavaScriptEnabled(false);
  await page.setRequestInterception(true);
  page.on("request", req => req.url() === url ? req.continue() : req.abort());

  await page.goto(url, {waitUntil: "domcontentloaded"});
  const el = await page.waitForSelector(".a-price .a-offscreen");
  const price = await el.evaluate(el => el.innerText);
  console.log(price);
})()
  .catch(err => console.error(err))
  // browser is undefined if launch() itself failed
  .finally(() => browser?.close());
Since the price you want appears to be baked into the static HTML in this case, I've disabled JS and aborted every request other than the one for the page itself. You can potentially go a step further, skip Puppeteer entirely, and use JSDom along with a basic HTTP request to get the data. Here is the relevant markup, followed by the request code:
<span class="a-price a-text-price a-size-medium apexPriceToPay" data-a-size="b" data-a-color="price">
<span class="a-offscreen">$53.00</span>
<span aria-hidden="true">$53.00</span>
</span>
const axios = require("axios"); // ^1.6.8
const {JSDOM} = require("jsdom"); // ^24.0.0

const url = "<your URL>";

(async () => {
  const {data: html} = await axios.get(url, {
    headers: { // https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja#full-set-of-headers
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "Accept-Encoding": "gzip, deflate, br",
      "Accept-Language": "en-US,en;q=0.9",
      "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
      "Sec-Ch-Ua-Mobile": "?0",
      "Sec-Fetch-Dest": "document",
      "Sec-Fetch-Mode": "navigate",
      "Sec-Fetch-Site": "none",
      "Sec-Fetch-User": "?1",
      "Upgrade-Insecure-Requests": "1",
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
      "X-Amzn-Trace-Id": "Root=1-60ff12bb-55defac340ac48081d670f9d"
    }
  });
  const price = new JSDOM(html)
    .window
    .document
    .querySelector(".a-price .a-offscreen")
    .textContent;
  console.log(price);
})()
  .catch(err => console.error(err));
Does Amazon confuse web scrapers and crawlers intentionally?
It's possible that you're offered a different price based on location or other factors, such as running the script multiple times, but some of these changes occur even when visiting the page as a normal user.
Amazon changes selectors often and may take more intensive measures to block scraping, so some of the code here will require tweaks and updates to work in the future.
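Since the selectors are likely to change, one way to soften the breakage is to try several candidates in order and report which one matched. The following is only a rough sketch, not something from the answers above: the helper name firstMatchingText and the PRICE_SELECTORS list are hypothetical, and the two selectors are simply the ones that appear in this thread, with no guarantee that either still exists on the live page.

```javascript
// Hypothetical helper (not part of Puppeteer or JSDom): try each selector
// in order and return the first non-empty text found.
// `root` is anything with a DOM-style querySelector, for example
// `new JSDOM(html).window.document`.
const PRICE_SELECTORS = [
  ".a-price .a-offscreen", // selector used in the answer above
  "#priceblock_ourprice",  // older selector from the question
];

function firstMatchingText(root, selectors = PRICE_SELECTORS) {
  for (const selector of selectors) {
    const el = root.querySelector(selector);
    const text = el?.textContent?.trim();
    if (text) {
      return {selector, text};
    }
  }
  return null; // nothing matched; the page layout has probably changed
}
```

With the JSDom approach you could call firstMatchingText(new JSDOM(html).window.document) and treat a null result as a sign that the selector list needs updating, rather than letting a TypeError surface from .textContent on a missing element.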
If ggorlen's answer didn't help, you could try this approach, which uses only Puppeteer.
const puppeteer = require("puppeteer");

const scrape = async (url) => {
  let browser, page;
  try {
    console.log('opening browser');
    browser = await puppeteer.launch();
    page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    await page.waitForSelector('#priceblock_ourprice', { visible: true });
    // Read the price text directly from the live DOM
    const price = await page.evaluate(
      () => document.getElementById('priceblock_ourprice').innerText
    );
    console.log({ price });
    return { price };
  } catch (error) {
    console.log('scrape error', error.message);
  } finally {
    if (browser) {
      await browser.close();
      console.log('closing browser');
    }
  }
}

scrape('https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1');