javascript - How to convert HTML page to plain text in node.js? - Stack Overflow

I know this has been asked before, but I can't find a good answer for node.js.

I need to extract the plain text (no tags, scripts, etc.) server-side from an HTML page that has been fetched.

I know how to do it client-side with jQuery (get the .text() contents of the body tag), but do not know how to do this on the server side.

I've tried https://npmjs.org/package/html-to-text but this doesn't handle scripts.

var { convert } = require('html-to-text');
var request = require('request');
request.get(url, function (error, result) {
    var text = convert(result.body, {
        wordwrap: 130
    });
});

I've tried phantom.js but can't find a way to just get plain text.

Asked Nov 14, 2013 at 18:39 by metalaureate. Edited Jul 31, 2023 at 15:27 by technophyle.

5 Answers

Use jsdom and jQuery (server-side).

With jQuery you can delete all scripts, styles, templates and the like and then you can extract the text.

Example

(This is not tested with jsdom and node, only in Chrome)

jQuery('script').remove()
jQuery('noscript').remove()
jQuery('style').remove()
jQuery('body').text().replace(/\s{2,}/g, ' ')

As another answer suggested, use JSDOM, but you don't need jQuery. Try this:

JSDOM.fragment(sourceHtml).textContent

For those searching for a regex solution, here is mine:

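const HTMLPartToTextPart = (HTMLPart) => (
  HTMLPart
    // Drop source line breaks; real breaks are re-created from tags below
    .replace(/\n/ig, '')
    // Remove elements whose content is never visible text
    .replace(/<style[^>]*>[\s\S]*?<\/style[^>]*>/ig, '')
    .replace(/<head[^>]*>[\s\S]*?<\/head[^>]*>/ig, '')
    .replace(/<script[^>]*>[\s\S]*?<\/script[^>]*>/ig, '')
    // Block-level closings and <br> become line breaks
    .replace(/<\/\s*(?:p|div)>/ig, '\n')
    .replace(/<br[^>]*\/?>/ig, '\n')
    // Strip every remaining tag
    .replace(/<[^>]*>/ig, '')
    // Replace all &nbsp; entities (a string pattern would replace only the first)
    .replace(/&nbsp;/ig, ' ')
    // Collapse runs of spaces and tabs, keeping line breaks
    .replace(/[^\S\r\n][^\S\r\n]+/ig, ' ')
);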

Note: if you need to handle edge cases (see the comments below), prefer an HTML parsing solution over a regex one.
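
To make that note concrete, here is a small sketch of one such edge case. `stripTags` is a hypothetical helper that reproduces only the generic tag-stripping step (`<[^>]*>`) from the function above. It works on simple markup but breaks when an attribute value contains `>`, because the pattern stops at the first `>` it sees.

```javascript
// Hypothetical stand-in for the tag-stripping step of the regex approach.
const stripTags = (html) => html.replace(/<[^>]*>/g, '');

// Simple markup: every tag is removed cleanly.
const simple = stripTags('<p>Hello <b>world</b></p>');

// A ">" inside an attribute value ends the match too early,
// so part of the tag leaks into the output.
const tricky = stripTags('<a title="a > b">link</a>');

console.log(JSON.stringify(simple)); // "Hello world"
console.log(JSON.stringify(tricky)); // " b\">link"
```

An HTML parser tokenizes attributes properly, so it would return just `link` for the second input.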

You can use TextVersionJS (http://textversionjs.com) to generate the plain text version of an HTML string. It's pure javascript (with tons of RegExps) so you can use it in the browser and in node.js as well.

This library may work for your needs, but it's NOT the same as getting the text of an element in the browser. Its purpose is to create a text version of an HTML email. This means that things like images are included. For example, given the following HTML and code snippet:

var textVersion = require("textversionjs");
var htmlText = "<html>" +
                    "<body>" +
                        "Lorem ipsum <a href=\"http://foo.foo\">dolor</a> sic <strong>amet</strong><br />" +
                        "Lorem ipsum <img src=\"http://foo.jpg\" alt=\"foo\" /> sic <pre>amet</pre>" +
                        "<p>Lorem ipsum dolor <br /> sic amet</p>" +
                        "<script>" +
                            "alert(\"nothing\");" +
                        "</script>" +
                    "</body>" +
                "</html>";
var plainText = textVersion.htmlToPlainText(htmlText);

The variable plainText will contain this string:

Lorem ipsum [dolor] (http://foo.foo) sic amet
Lorem ipsum ![foo] (http://foo.jpg) sic amet
Lorem ipsum dolor
sic amet

Note that it does properly ignore script tags. You'll find the latest version of the source code on GitHub.

Why not just get textContent of the body tag?

var body = document.getElementsByTagName('body')[0];
var bodyText = body.textContent;
