javascript - Getting all the text content from a HTML string in NodeJS - Stack Overflow


I need to get only the text content from an HTML string, with a space or a line break separating the text content of different elements.

For example, the HTML string might be:

<ul>
  <li>First</li>
  <li>Second</li>
</ul>

What I want:

First Second

or

First
Second

I've tried wrapping the entire string in a div and reading its textContent with third-party libraries. However, that puts no space or line break between the text content of different elements, which is specifically what I need (i.e. I get FirstSecond, which is not what I want).

The only solution I can think of right now is to build a DOM tree, recurse through it to find the nodes that contain text, and append each node's text to a string, separated by spaces. Is there a cleaner, neater, and simpler solution than this?

Asked Mar 3, 2020 at 16:39 by thesamiroli; edited Apr 7, 2020 at 12:05 by Heretic Monkey.

Comment: You can use the package cheerio to do these sorts of things; it is built for scraping, navigating, and selecting HTML content. – Jon Church, Mar 3, 2020 at 20:12

5 Answers

Convert HTML to Plain Text:

In your terminal, install the html-to-text npm package:

npm install html-to-text

Then in JavaScript:

const { convert } = require('html-to-text'); // Import the library

const htmlString = `
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
`;

// wordwrap sets the column at which long lines are wrapped
const text = convert(htmlString, { wordwrap: 130 });
console.log(text);
// Out:
// First
// Second

Hope this helps!

You can try to get rid of the HTML tags using a regex. For your example, try the following:

let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`

console.log(str)

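// matches opening and closing <li> and <ul> tags, including any attributes
const re = /<\/?!?(li|ul)[^>]*>/g;

// remove the tags; the line breaks between the items are left in place
str = str.replace(re, '');
console.log(str)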
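
The pattern above only targets li and ul tags. As a rough sketch of the same idea applied to any tag, the variant below replaces every tag with a space and then collapses the whitespace, which yields the single-line output the question asks for. The function name stripTags is made up for this illustration, and a regex is not an HTML parser: this assumes simple, trusted markup and does not handle script or style contents, comments, a `>` inside an attribute value, or HTML entities.

```javascript
// Sketch only: suitable for simple markup, not a substitute for a real parser.
// stripTags is a hypothetical helper name, not part of any library.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // put a space where each tag was, so neighbours stay separated
    .replace(/\s+/g, ' ')     // collapse runs of whitespace (including newlines) into one space
    .trim();                  // drop the leading and trailing space
}

const html = `<ul>
  <li>First</li>
  <li>Second</li>
</ul>`;

console.log(stripTags(html)); // First Second
```

To get one item per line instead, the result could be split on the space, although that breaks as soon as an element contains more than one word, so a parser-based answer is the safer choice for real input.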

You can try this example, which uses the jsdom module:

https://www.npmjs.com/package/jsdom

const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);

This helped me, so I think it can help you too :)

With a DOM, you could use Node.textContent. However, Node.js has no native access to the DOM, so textContent is not available there and you should use external packages. You could install request and cheerio using npm. cheerio, suggested by Jon Church, is maybe the easiest web scraping tool to use (there are also more complex ones, like jsdom). With the power of cheerio and request in your hands, you could write:

const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");

//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
    var r = new RegExp('^(?:[a-z]+:)?//', 'i');
    return r.test(url);
}

function is_local(url)
{
    var r = new RegExp('^(?:file:)?//', 'i');
    return (r.test(url) || !is_absolute(url));
}

function send_request(URL)
{
    if(is_local(URL))
    {
        //strip the file:// prefix, if present, to get a plain file system path
        const url_tmp = URL.slice(0,7)==="file://" ? URL.slice(7) : URL;

        //taken from https://stackoverflow.com/a/20665078/10713877
        const $ = cheerio.load(fs.readFileSync(url_tmp));
        //Do something
        console.log($.text())
    }
    else
    {
        var options = {
            url: URL,
            headers: {
                'User-Agent': 'Your-User-Agent'
            }
        };

        request(options, function(error, response, html) {
            //no error
            if(!error && response.statusCode == 200)
            {
                console.log("Success");

                const $ = cheerio.load(html);
                //Do something
                console.log($.text())
            }
            else
            {
                console.log(`Failure: ${error}`);
            }
        });
    }
}

Let me explain the code. You pass a URL to the send_request function. It checks whether the URL string is a path to a local file (a relative path, or a path starting with file://). If it is a local file, it reads the file and loads it with the cheerio module; otherwise, it first sends a request to the website using the request module and then loads the response with cheerio. Regular expressions are used in is_absolute and is_local. You get the text using the text() method provided by cheerio. Under the //Do something comments, you can do whatever you want with the text. There are websites that tell you your user agent; copy and paste it in place of 'Your-User-Agent'.

The following calls will work:

//your local file
send_request("/absolute/path/to/your/local/index.html");
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html");
//website
send_request("https://stackoverflow.com/");

EDIT: I am on a Linux system.
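
To make the local-versus-remote decision described above easier to follow, here is a small sketch that runs the answer's two helper functions on a few sample inputs. The helpers are restated with regex literals so the block runs on its own; the classify wrapper and the sample strings are assumptions added for illustration and are not part of the original answer.

```javascript
// The two helpers from the answer, restated with regex literals.
function is_absolute(url) {
  // an optional scheme such as "https:" followed by "//"
  return /^(?:[a-z]+:)?\/\//i.test(url);
}

function is_local(url) {
  // "file://..." or "//...", or anything that is not an absolute URL
  return /^(?:file:)?\/\//i.test(url) || !is_absolute(url);
}

// Hypothetical wrapper, used only to label the result for printing.
function classify(url) {
  return is_local(url) ? 'local' : 'remote';
}

const samples = [
  '/absolute/path/index.html',
  'relative/path/index.html',
  'file:///absolute/path/index.html',
  'https://stackoverflow.com/',
  '//example.com/page.html',
];

for (const s of samples) {
  console.log(`${s} -> ${classify(s)}`);
}
```

Only the https URL is classified as remote. Note that the last sample, a protocol-relative URL, also matches the is_local pattern, so send_request would try to read it from disk; that may or may not be what you want.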

You can try the npm library htmlparser2. It is very simple to use:

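const htmlparser2 = require('htmlparser2');

const htmlString = ''; // your HTML string goes here
const parts = []; // collects the non-empty text chunks
const parser = new htmlparser2.Parser({
    ontext(text) {
      // skip whitespace-only chunks between tags
      if (text && text.trim().length > 0) {
        // do as you need; here the chunks are collected in an array
        parts.push(text.trim());
      }
    }
  });

parser.write(htmlString);
parser.end();

// join the collected chunks with spaces (or '\n' for line breaks)
console.log(parts.join(' '));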
