javascript - Parse HTML table without IDs or CSS selectors in Node.js - Stack Overflow

IT技术

更新时间：2025-03-060

admin管理员组
文章数量:1277875

This data is from an old system and the output is as is. We cannot add CSS selectors or IDs. Most of the examples online for node.js parsing involves parsing tables, rows, data with some ID or CSS classes but so far I haven't run into anything that can help parse the page below. This includes examples for JSDOM (AFAIK).

What I would like is to extract each of the rows into [fileName, link, size, dateTime] tuples on which I can then run some queries like what was the latest timestamp in the group, etc and then extract the filename and link - was thinking of using YQL. The alternating table row attributes is also making it a bit challenging. New to node.js so some of the terminology might be wrong. Any help will be appreciated.

Thanks.

<html>
<body>
    <table width="100%" cellspacing="0" cellpadding="5" align="center">
        <tr> 
        <td align="left"><font size="+1"><strong>Filename</strong></font></td>
        <td align="center"><font size="+1"><strong>Size</strong></font></td>
        <td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
        <td align="right"><tt>86.6 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
        <td align="right"><tt>20.7 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
        <td align="right"><tt>266.5 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
        <td align="right"><tt>27.2 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
    </table>
</body>
</html>

Answer (thanks @Enragedmrt):

    res.on('data', function(data) {

        $ = cheerio.load(data.toString());
        var data = [];
        $('tr').each(function(i, tr){

            var children = $(this).children();
            var fileItem = children.eq(0);
            var linkItem = children.eq(0).children().eq(0);
            var lastModifiedItem = children.eq(2);

            var row = {
                "Filename": fileItem.text().trim(),
                "Link": linkItem.attr("href"),
                "LastModified": lastModifiedItem.text().trim()
            };
            data.push(row);
            console.log(row);
        });
    });

This data is from an old system and the output is as is. We cannot add CSS selectors or IDs. Most of the examples online for node.js parsing involves parsing tables, rows, data with some ID or CSS classes but so far I haven't run into anything that can help parse the page below. This includes examples for JSDOM (AFAIK).

What I would like is to extract each of the rows into [fileName, link, size, dateTime] tuples on which I can then run some queries like what was the latest timestamp in the group, etc and then extract the filename and link - was thinking of using YQL. The alternating table row attributes is also making it a bit challenging. New to node.js so some of the terminology might be wrong. Any help will be appreciated.

Thanks.

<html>
<body>
    <table width="100%" cellspacing="0" cellpadding="5" align="center">
        <tr> 
        <td align="left"><font size="+1"><strong>Filename</strong></font></td>
        <td align="center"><font size="+1"><strong>Size</strong></font></td>
        <td align="right"><font size="+1"><strong>Last Modified</strong></font></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file1.csv</tt></a></td>
        <td align="right"><tt>86.6 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.csv"><tt>file2.csv</tt></a></td>
        <td align="right"><tt>20.7 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr>
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file1.xml</tt></a></td>
        <td align="right"><tt>266.5 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
        <tr bgcolor="#eeeeee">
        <td align="left">&nbsp;&nbsp;
        <a href="/path_to_file.xml"><tt>file2.xml</tt></a></td>
        <td align="right"><tt>27.2 kb</tt></td>
        <td align="right"><tt>Fri, 21 Mar 2014 21:00:19 GMT</tt></td>
        </tr>
    </table>
</body>
</html>

Answer (thanks @Enragedmrt):

    res.on('data', function(data) {

        $ = cheerio.load(data.toString());
        var data = [];
        $('tr').each(function(i, tr){

            var children = $(this).children();
            var fileItem = children.eq(0);
            var linkItem = children.eq(0).children().eq(0);
            var lastModifiedItem = children.eq(2);

            var row = {
                "Filename": fileItem.text().trim(),
                "Link": linkItem.attr("href"),
                "LastModified": lastModifiedItem.text().trim()
            };
            data.push(row);
            console.log(row);
        });
    });

Share Improve this question edited Mar 21, 2014 at 22:55 asked Mar 21, 2014 at 21:15 Shawn 3531 gold badge5 silver badges17 bronze badges

Add a ment |

4 Answers 4

Sorted by: Reset to default 8

I would suggest using Cheerio over JSDOM as it's significantly faster and more lightweight. That said, you'll need to do a for each loop grabbing up the 'tr' elements and subsequently their 'td' elements. Here's a rough example (My Node.js/Cheerio is rusty, but if you dig around in JQuery you can find some decent examples):

var data = [];
$('tr').each(function(i, tr){
    var children = $(this).children();
    var row = {
        "Filename": children[0].text(),
        "Size": children[1].text(),
        "Last Modified": children[2].text()
    };
    data.push(row);
});

I don't know JSDom, but it sounds like it can parse a HTML document into a DOM (Document Object Model). From there it should be very possible to loop through the nodes and recognise them by tag name, attributes or position in the document, even if they don't have ids.

Googling for 5 seconds, please hold on...

JSDom's documentation on GitHub seems to confirm this. It shows jQuery-like selectors, like window.$("a.the-link").text(). So instead of adding a class, you can select for selectors like td, th, or probably even td[align="left"]. Using selectors like that, and convenient methods like .first and .each, to traverse over multiple results (like every row) you should be able to parse the document just fine, although it will of course be a bit more cumbersome than having convenient classnames for every different kind of cell.

I still don't think I'm a JSDom expert, but reading their project's main page for a couple of minutes already shows all the answers to your questions, and much more.

JSFiddle

var rawData = new Array();
var rows = document.getElementsByTagName('tr');
for(var cnt = 1; cnt < rows.length; cnt++) {
    var cells = rows[cnt].getElementsByTagName('tt');
    var row = [];
    for (var count = 0; count < cells.length; count++) {
        row.push(cells[count].innerText.trim());
    }    
    rawData.push(row);
}

console.log(rawData);

Additional way

var cheerio = require('cheerio'),
    cheerioTableparser = require('cheerio-tableparser');

res.on('data', function(data) {

    $ = cheerio.load(data.toString());
    cheerioTableparser($);
    var data = [];
    var array = $("table").parsetable(false, false, false)
    array[0].forEach(function(d, i) {

        var firstColumnHTMLCell = $("<div>" + array[0][i] + "</div>");
        var fileItem = firstColumnHTMLCell.text().trim();
        var linkItem = firstColumnHTMLCell.find("a").attr("href");
        var lastModifiedItem = $("<div>" + array[2][i] + "</div>").text();

        var row = {
            "Filename": fileItem,
            "Link": linkItem,
            "LastModified": lastModifiedItem
        };

        data.push(row);
        console.log(row);
    })
});

本文标签： javascriptParse HTML table without IDs or CSS selectors in NodejsStack Overflow

版权声明：本文标题：javascript - Parse HTML table without IDs or CSS selectors in Node.js - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1741269418a2368994.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

javascript - Parse HTML table without IDs or CSS selectors in Node.js - Stack Overflow

4 Answers 4

更多相关文章