I'm new to Puppeteer and don't know its full potential. I have the following code that returns results from a scrape, but each row comes back as one long tab-delimited string. I'm trying to get proper JSON.

(async () => {
    const browser = await puppeteer.launch({headless: true});
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: 'networkidle0'});

    let data = await page.evaluate(() => {
        const table = Array.from(document.querySelectorAll('table[id="gvM"] > tbody > tr ')); 
        return table.map(td => td.innerText);
    })

    console.log(data);
})();

Here is the html table:

<table cellspacing="0" cellpadding="4" rules="all" border="1" id="gvM" >
        <tr >
            <th scope="col">#</th><th scope="col">Resource</th><th scope="col">EM #</th><th scope="col">CVO</th><th scope="col">Start</th><th scope="col">End</th><th scope="col">Status</th><th scope="col">Assignment</th><th scope="col">&nbsp;</th>
        </tr>
        <tr >
            <td>31</td><td>Smith</td><td>618</td><td align="center"><span class="aspNetDisabled"><input id="gvM_ctl00_0" type="checkbox" name="gvM$ctl02$ctl00" disabled="disabled" /></span></td><td>&nbsp;</td><td>&nbsp;</td><td>AVAILABLE EXEC</td><td style="width:800px;">6F</td><td align="center"></td>
        </tr>
        <tr style="background-color:LightGreen;">
            <td>1</td><td>John</td><td>604</td><td align="center"><span class="aspNetDisabled"></span></td><td>1400</td><td>2200</td><td>AVAILABLE</td><td style="width:800px;">&nbsp;</td><td align="center"></td>
        </tr>
</table>

This is what I get:

[ '#\tResource\tEM #\tCVO\tStart\tEnd\tStatus\tAssignment\t ', '31\tSmith\t618\t\t \t \tAVAILABLE EXEC\t6F\t', '1\tJohn\t604\t\t1400\t2200\tAVAILABLE\t \t']

and this is what I want to get:

[{'#','Resource','EM', '#','CVO','Start','tEnd','Status', 'Assignment'}, {'31','Smith', '618',' ',' ',' ',' ','AVAILABLE EXEC','6F'}, {'1','John', '604',' ',' ','1400 ','2200','AVAILABLE', ' '}]

I applied the answer below, but I wasn't able to reproduce the results. Perhaps I'm doing something wrong. Could you explain where I'm messing up?

const context = document.querySelectorAll('table[id="gvM"] > tbody > tr ');

const query = (selector, context) => Array.from(context.querySelectorAll(selector));
console.log( 
    query('tr', context).map(row => 
        query('td, th', row).map(cell => 
        cell.textContent))  
);

What does this error mean?

(node:6204) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6204) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Asked Mar 27, 2019 at 0:38 by vt2424253; edited Mar 27, 2019 at 17:15.
  • The wanted JSON in your question is invalid. – Niloct, Mar 27, 2019 at 0:45

2 Answers

If you need an array of arrays from the table, you can map every row to an array of its cells. This variant uses Array.from() with a mapping function as its second argument:

const data = await page.evaluate(
  () => Array.from(
    document.querySelectorAll('table[id="gvM"] > tbody > tr'),
    row => Array.from(row.querySelectorAll('th, td'), cell => cell.innerText)
  )
);
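If the goal is valid JSON with named fields rather than bare arrays, the array of arrays can be post-processed in Node. The following is only a minimal sketch: `rowsToObjects` and the `col<N>` fallback key are hypothetical names, not part of Puppeteer, and the sample rows are typed in by hand to mirror the table in the question. It assumes the first row is the header and needs `Object.fromEntries` (Node 12 or later).

```javascript
// Hypothetical helper (not part of Puppeteer): turns an array of row arrays
// into an array of objects keyed by the header row.
function rowsToObjects(rows) {
  const [header, ...body] = rows;
  // Trim header names; use a positional key when a header cell is blank.
  const keys = header.map((name, i) => name.trim() || `col${i}`);
  return body.map(cells =>
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] || '').trim()]))
  );
}

// Sample rows shaped like the table in the question (assumed, hand-written).
const rows = [
  ['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', ' '],
  ['31', 'Smith', '618', '', ' ', ' ', 'AVAILABLE EXEC', '6F', ''],
  ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', ' ', ''],
];

// JSON.stringify produces valid JSON text from the resulting objects.
console.log(JSON.stringify(rowsToObjects(rows), null, 2));
```

In a real script, `rows` would be the `data` value returned by `page.evaluate` above. Trimming each cell also drops the non-breaking spaces that come from `&nbsp;`.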

I don't think this is related to Puppeteer; the issue is the way you iterate over your <table>.

In your attempt, you read the text content of each entire row, which produces the tab-delimited strings you're seeing. Instead, for each <tr> you need to get all of its <td> (or <th>) elements:

const query = (selector, context) =>
  Array.from(context.querySelectorAll(selector));
  
console.log(
  query('tr', document).map(row =>
    query('td, th', row).map(cell =>
      cell.textContent))
);

<table>
  <tr>
    <th>col 1</th>
    <th>col 2</th>
    <th>col 3</th>
  </tr>
  <tr>
    <td>a</td>
    <td>b</td>
    <td>c</td>
  </tr>
  <tr>
    <td>x</td>
    <td>y</td>
    <td>z</td>
  </tr>
</table>
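As for the attempt in the question's edit, two things are likely to go wrong. First, `document.querySelectorAll(...)` returns a NodeList, which has no `querySelectorAll` method of its own, so passing it as `context` throws a TypeError. Second, `document` only exists in the browser, so the code has to run inside `page.evaluate`. An error thrown inside an async function with no catch is what Node reports as an unhandled promise rejection. The sketch below shows one way to arrange this; `scrapeRows` and `run` are hypothetical helper names, and `page` is assumed to be a Puppeteer Page created elsewhere.

```javascript
// Hypothetical wrapper: `page` is a Puppeteer Page, `selector` matches the rows.
async function scrapeRows(page, selector) {
  // The callback is executed in the browser, where `document` exists.
  // `selector` is handed over as a serializable argument to page.evaluate.
  return page.evaluate(
    sel => Array.from(
      document.querySelectorAll(sel),
      row => Array.from(row.querySelectorAll('th, td'), cell => cell.textContent.trim())
    ),
    selector
  );
}

// Hypothetical runner: awaits a task and logs any failure, so a thrown error
// does not surface as an unhandled promise rejection.
async function run(task) {
  try {
    return await task();
  } catch (err) {
    console.error('Scrape failed:', err.message);
    return null;
  }
}
```

With a real page this could be called as `const data = await run(() => scrapeRows(page, 'table[id="gvM"] > tbody > tr'));`, which leaves `data` as `null` if anything in the scrape throws.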

Tags: javascript | How to output proper JSON from a Puppeteer scraped table | Stack Overflow