I'm new to Puppeteer and don't know its full potential. I have the following code that returns the results of a scrape, but each row comes back as one long tab-delimited string. I'm trying to get proper JSON.
(async () => {
const browser = await puppeteer.launch( {headless: true} );
const page = await browser.newPage();
await page.goto(url, {waitUntil: 'networkidle0'});
let data = await page.evaluate(() => {
const table = Array.from(document.querySelectorAll('table[id="gvM"] > tbody > tr '));
return table.map(td => td.innerText);
})
console.log(data);
})();
Here is the html table:
<table cellspacing="0" cellpadding="4" rules="all" border="1" id="gvM" >
<tr >
<th scope="col">#</th><th scope="col">Resource</th><th scope="col">EM #</th><th scope="col">CVO</th><th scope="col">Start</th><th scope="col">End</th><th scope="col">Status</th><th scope="col">Assignment</th><th scope="col"> </th>
</tr>
<tr >
<td>31</td><td>Smith</td><td>618</td><td align="center"><span class="aspNetDisabled"><input id="gvM_ctl00_0" type="checkbox" name="gvM$ctl02$ctl00" disabled="disabled" /></span></td><td> </td><td> </td><td>AVAILABLE EXEC</td><td style="width:800px;">6F</td><td align="center"></td>
</tr>
<tr style="background-color:LightGreen;">
<td>1</td><td>John</td><td>604</td><td align="center"><span class="aspNetDisabled"></span></td><td>1400</td><td>2200</td><td>AVAILABLE</td><td style="width:800px;"> </td><td align="center"></td>
</tr>
</table>
This is what I get:
[ '#\tResource\tEM #\tCVO\tStart\tEnd\tStatus\tAssignment\t ',
'31\tSmith\t618\t\t \t \tAVAILABLE EXEC\t6F\t',
'1\tJohn\t604\t\t1400\t2200\tAVAILABLE\t \t']
and this is what I want to get:
[{'#','Resource','EM', '#','CVO','Start','tEnd','Status', 'Assignment'},
{'31','Smith', '618',' ',' ',' ',' ','AVAILABLE EXEC','6F'},
{'1','John', '604',' ',' ','1400 ','2200','AVAILABLE', ' '}]
I applied the answer below, but I wasn't able to reproduce the results. Perhaps I'm doing something wrong. Could you explain where I'm messing up?
const context = document.querySelectorAll('table[id="gvM"] > tbody > tr ');
const query = (selector, context) => Array.from(context.querySelectorAll(selector));
console.log(
query('tr', context).map(row =>
query('td, th', row).map(cell =>
cell.textContent))
);
What does this error mean?
(node:6204) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with. .catch(). (rejection id: 1)
(node:6204) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Comment: The wanted JSON in your question is invalid. (Niloct, Mar 27, 2019 at 0:45)
2 Answers
If you need an array of arrays from the table, you can map every row to an array of its cells. This variant uses Array.from() with a mapping function as the second argument:
const data = await page.evaluate(
() => Array.from(
document.querySelectorAll('table[id="gvM"] > tbody > tr'),
row => Array.from(row.querySelectorAll('th, td'), cell => cell.innerText)
)
);
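If the goal is valid JSON with one object per row rather than an array of arrays, the nested array can be post-processed in Node. The following is only a minimal sketch: `rowsToObjects` and the `column_N` placeholder key for blank header cells are hypothetical names, not part of Puppeteer, and the sample input is assumed to have the shape that the `page.evaluate()` call above would return for the table in the question.

```javascript
// Hypothetical helper: converts [[header cells], [row cells], ...]
// into an array of objects keyed by the header text.
function rowsToObjects(rows) {
  const [headerRow = [], ...bodyRows] = rows;
  // Blank header cells get a placeholder key (an assumption) so no column is lost.
  const keys = headerRow.map((name, i) => name.trim() || `column_${i + 1}`);
  return bodyRows.map(cells =>
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] ?? '').trim()]))
  );
}

// Sample input, assumed to mirror the scraped table from the question.
const sample = [
  ['#', 'Resource', 'EM #', 'CVO', 'Start', 'End', 'Status', 'Assignment', ' '],
  ['31', 'Smith', '618', '', ' ', ' ', 'AVAILABLE EXEC', '6F', ''],
  ['1', 'John', '604', '', '1400', '2200', 'AVAILABLE', ' ', ''],
];

// JSON.stringify produces a valid JSON string from the plain objects.
console.log(JSON.stringify(rowsToObjects(sample), null, 2));
```

Using the header row as keys assumes the header texts are unique; if two columns shared a name, the later one would overwrite the earlier one in each object.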
I don't think this is related to Puppeteer but to the way you "iterate" over your <table>: in your attempt, you're simply dumping the textual content of an entire row, which produces the result that you're observing. For each <tr> you actually need to get all its <td> (or <th>) elements:
const query = (selector, context) =>
Array.from(context.querySelectorAll(selector));
console.log(
query('tr', document).map(row =>
query('td, th', row).map(cell =>
cell.textContent))
)
<table>
<tr>
<th>col 1</th>
<th>col 2</th>
<th>col 3</th>
</tr>
<tr>
<td>a</td>
<td>b</td>
<td>c</td>
</tr>
<tr>
<td>x</td>
<td>y</td>
<td>z</td>
</tr>
</table>
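Regarding the follow-up snippet in the question: it passes the result of `document.querySelectorAll(...)` as `context`, but that result is a NodeList, which has no `querySelectorAll` method of its own. If the snippet is also run directly in Node rather than inside `page.evaluate()`, `document` is not defined there at all. Either problem throws inside the async function, and since nothing catches it, Node reports an unhandled promise rejection. Below is a rough sketch of one way to wire this together, assuming the same table id as in the question; `extractTable`, `main`, and the `SCRAPE_URL` environment variable are made-up names for illustration.

```javascript
// Runs in the browser context via page.evaluate(); `document` exists only there.
// It must be self-contained because Puppeteer serializes the function.
const extractTable = () => {
  const query = (selector, context) =>
    Array.from(context.querySelectorAll(selector));
  // `context` is a single node (document or a row), never a NodeList.
  return query('table[id="gvM"] tr', document).map(row =>
    query('td, th', row).map(cell => cell.textContent.trim()));
};

async function main(url) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when scraping
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    const rows = await page.evaluate(extractTable);
    console.log(JSON.stringify(rows));
  } finally {
    await browser.close(); // close the browser even if scraping fails
  }
}

// SCRAPE_URL is a hypothetical environment variable used only in this sketch.
if (process.env.SCRAPE_URL) {
  main(process.env.SCRAPE_URL).catch(err => {
    // Handling the rejection here avoids the UnhandledPromiseRejectionWarning.
    console.error(err);
    process.exitCode = 1;
  });
}
```

The `.catch()` on the call to `main` is what addresses the warning: any error thrown inside the async function is logged with its real message instead of surfacing as an unhandled rejection.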