I'm trying to scrape an address from whitepages.com, but my scraper keeps throwing this error every time I run it.

(node:11389) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'getProperty' of undefined

Here's my code:

const puppeteer = require('puppeteer')

async function scrapeAddress(url){
    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});

    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    browser.close();

}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')

After investigating a bit, I realized that the el variable comes back as undefined, and I'm not sure why. I've used this same code to get elements from other sites, but I only get this error on this site.

I tried both the full and short XPath as well as other surrounding elements and everything on this site throws this error.

Why would this be happening and is there any way I can fix it?

Asked Jan 18, 2020 at 8:17 by Zafar Saifi

6 Answers


You can try wrapping everything in a try/catch block; otherwise, try handling the promise with then() and catch().

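const puppeteer = require('puppeteer');

// URL taken from the question
const url = 'https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs';

(async() => {
  const browser = await puppeteer.launch();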
  try {
    const page = await browser.newPage();
    await page.goto(url,{timeout: 0, waitUntil: 'networkidle0'});

    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

  } catch (err) {
    console.error(err.message);
  } finally {
    await browser.close();
  }
})();
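The TypeError itself comes from the destructuring: page.$x resolves to an array, and when nothing matches that array is empty, so el is undefined. Here is a minimal sketch of a guard, assuming a hypothetical helper name (readTextByXPath) and a page object that exposes $x as in the code above. Recent Puppeteer releases have moved away from $x toward 'xpath/'-prefixed selectors, so treat the call as version-dependent.

```javascript
// Hypothetical helper: returns the trimmed text of the first node matching
// `xpath`, or null when nothing matches. `page` is expected to be a Puppeteer
// Page (or any object exposing an async $x method).
async function readTextByXPath(page, xpath) {
  // $x resolves to an array; with no match it is empty and `el` is undefined.
  const [el] = await page.$x(xpath);
  if (!el) {
    return null;
  }
  const handle = await el.getProperty('textContent');
  const text = await handle.jsonValue();
  return typeof text === 'string' ? text.trim() : null;
}
```

Calling readTextByXPath(page, xpath) in place of the $x/getProperty/jsonValue lines returns null instead of throwing, which makes "the selector did not match" easy to tell apart from other failures.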

The reason is that the website detects Puppeteer as an automated bot. Set headless to false and you can see that it never navigates to the website.

I'd suggest using puppeteer-extra-plugin-stealth. Also, always make sure to wait for the element to appear on the page.

const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
puppeteer.use(pluginStealth());

async function scrapeAddress(url){
    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto(url,{waitUntil: 'networkidle0'});

    //wait for xpath
    await page.waitForXPath('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    const [el]= await page.$x('//*[@id="left"]/div/div[4]/div[3]/div[2]/a/h3/span[1]');
    // console.log(el)
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    await browser.close();

}

scrapeAddress('https://www.whitepages.com/business/CA/San-Diego/Cvs-Health/b-1ahg5bs')
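As a possible refinement of the wait step: waitForXPath resolves to the matching element handle, so the follow-up $x query is optional, and a finite timeout turns "the element never appeared" into a value you can inspect. The sketch below assumes a hypothetical helper name (waitForText) and an arbitrary 15-second default; it uses waitForXPath as in this answer, which newer Puppeteer versions may no longer provide.

```javascript
// Hypothetical helper: waits for `xpath` and reads its text content.
// Returns { found: true, text } on success, or { found: false, reason, url }
// when the wait fails (most commonly a timeout because the page was blocked
// or rendered differently than expected).
async function waitForText(page, xpath, timeoutMs = 15000) {
  try {
    // Resolves to the element handle once the node is in the DOM.
    const el = await page.waitForXPath(xpath, { timeout: timeoutMs });
    const handle = await el.getProperty('textContent');
    const text = await handle.jsonValue();
    return { found: true, text };
  } catch (err) {
    // page.url() shows where the browser actually ended up, e.g. a block page.
    return { found: false, reason: err.message, url: page.url() };
  }
}
```

If found is false and url is not the address you requested, the site has probably redirected the scraper, which points to bot detection rather than a wrong XPath.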

I recently ran into this error, and changing my XPath worked for me. I had been using the full XPath, and it was causing some issues.

This is most probably because the website is responsive, so when the scraper runs, the page is laid out differently and the element ends up at a different XPath.

I would suggest debugging with a non-headless (visible) browser:

const browser = await puppeteer.launch({headless: false});
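If the layout difference is the cause, one thing to try is fixing the viewport: Puppeteer's default viewport is 800x600, which is narrower than a typical desktop window, so a responsive site may serve a different layout than the one you copied the XPath from. The sketch below assumes a hypothetical helper name (openDesktopPage) and an arbitrary 1366x768 size; the optional screenshot is only there to show what the scraper actually sees.

```javascript
// Hypothetical helper: opens `url` with a desktop-sized viewport and,
// if `screenshotPath` is given, saves a full-page screenshot for debugging.
async function openDesktopPage(browser, url, screenshotPath) {
  const page = await browser.newPage();
  // Assumed size; use whatever matches the browser window the XPath came from.
  await page.setViewport({ width: 1366, height: 768 });
  await page.goto(url, { waitUntil: 'networkidle0' });
  if (screenshotPath) {
    await page.screenshot({ path: screenshotPath, fullPage: true });
  }
  return page;
}
```

The screenshot also works in headless mode, which is handy when the script runs on a machine without a display.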

I took the code that @mbit provided, modified it to my needs, and ran it with a non-headless browser. I was unable to get it working with a headless browser. If anyone has figured out how to do that, please explain. Here is my solution:

First, you must install a couple of packages from the command line, so run the following two commands:

npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth

Installing these will allow you to run the first few lines of @mbit's code. Then, in this line of code:

 const browser = await puppeteer.launch();

pass the following as an argument to puppeteer.launch():

{headless: false}

which should in turn look like this:

const browser = await puppeteer.launch({headless: false});

I also believe that the XPath @mbit was using may not exist anymore, so provide one of your own, as well as your own site. You can do this with the following three lines of code: just replace {XPath} with your own XPath and {address} with your own web address. NOTE: be mindful of your use of quotes ('' or ""), as the XPath may contain the same kind of quote you use to delimit the string, which will break your path.

await page.waitForXPath({XPath});
const [el]= await page.$x({XPath});

scrapeAddress({address})

After you do this, you should be able to run your code and retrieve values. Here's what my code looked like in the end; feel free to copy and paste it into your own file to confirm that it works on your end!

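const puppeteer = require('puppeteer-extra');
const pluginStealth = require('puppeteer-extra-plugin-stealth');
// Do not re-require plain 'puppeteer' afterwards: that would replace the
// stealth-enabled instance and the plugin would never be used.
puppeteer.use(pluginStealth());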

async function scrapeAddress(url){
    const browser = await puppeteer.launch({headless: false});

    const page = await browser.newPage();
    await page.goto(url,{waitUntil: 'networkidle0'});

    //wait for xpath
    await page.waitForXPath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
    const [el]= await page.$x('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[3]/div[2]/div[3]/div[3]');
    
    const txt = await el.getProperty('textContent');
    const rawTxt = await txt.jsonValue(); 

    console.log({rawTxt}); 

    await browser.close();
}

scrapeAddress("https://stockx.com/air-jordan-1-retro-high-unc-leather")
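On the quoting note in this answer: an XPath copied from DevTools normally contains double quotes (for example @id="root"), so delimiting the JavaScript string with single quotes or backticks avoids the clash. When the XPath has to match text that itself contains quotes, XPath 1.0 has no escape character, and the usual workaround is concat(). A small sketch, with xpathLiteral as a hypothetical helper name:

```javascript
// Both of these hold the same XPath; neither needs escaping because the
// JavaScript delimiter differs from the double quotes inside the XPath.
const singleQuoted = '//*[@id="root"]/div[1]';
const templateLiteral = `//*[@id="root"]/div[1]`;

// Hypothetical helper: turns arbitrary text into a valid XPath 1.0 string
// literal, falling back to concat() when the text has both quote types.
function xpathLiteral(text) {
  if (!text.includes('"')) {
    return `"${text}"`;
  }
  if (!text.includes("'")) {
    return `'${text}'`;
  }
  // Split on double quotes and glue the pieces back with '"' arguments.
  const parts = text.split('"').map((part) => `"${part}"`);
  return `concat(${parts.join(`, '"', `)})`;
}

// Example: an XPath that matches a span by its exact text.
const byText = `//span[text()=${xpathLiteral("Women's shoes")}]`;
```

Here byText can be passed to waitForXPath or $x like any other XPath string.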

I was able to fix it by adding {waitUntil: 'networkidle0'} to the page.goto command:

await page.goto(url, {waitUntil: 'networkidle0'});

I was running into the same issue, so I tried @mbit's solution and it worked. After some tests, I realized I didn't actually need puppeteer-extra-plugin-stealth running. Implementing the await page.goto command worked just fine!
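For context, waitUntil: 'networkidle0' makes page.goto treat navigation as finished only once there have been no network connections for at least 500 ms. The question's code already passes that option, so when it is not enough, checking the HTTP status of the navigation can show whether the site refused the request. A rough sketch, assuming a hypothetical helper name (gotoAndCheck) and an arbitrary 30-second timeout:

```javascript
// Hypothetical helper: navigates with networkidle0 and reports the HTTP
// status of the main response so that blocked requests are easy to spot.
async function gotoAndCheck(page, url, timeoutMs = 30000) {
  const response = await page.goto(url, {
    waitUntil: 'networkidle0',
    timeout: timeoutMs, // finite, unlike `timeout: 0`, which waits forever
  });
  // goto can resolve to null (for example on same-page anchor navigation).
  if (!response) {
    return { status: null, ok: false };
  }
  return { status: response.status(), ok: response.ok() };
}
```

A non-2xx status here (403, for instance) suggests the request was rejected before any XPath had a chance to match.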
