I need to get only the text content from an HTML string, with a space or a line break separating the text content of different elements.
For example, the HTML string might be:
<ul>
<li>First</li>
<li>Second</li>
</ul>
What I want:
First Second
or
First
Second
I've tried to get the text content by first wrapping the entire string inside a div and then getting the textContent using third-party libraries. But there is no spacing or line break between the text content of different elements, which I specifically require (i.e. I get FirstSecond, which is not what I want).
The only solution I can think of right now is to build a DOM tree, recurse through it to find the nodes that contain text, and append the text of each node to a string, separated by spaces. Is there a cleaner, neater, and simpler solution than this?
asked Mar 3, 2020 at 16:39 by thesamiroli; edited Apr 7, 2020 at 12:05 by Heretic Monkey
Comment: You can use the package cheerio to do these sorts of things, it is built for scraping/navigating/selecting HTML content. – Jon Church, Mar 3, 2020 at 20:12
5 Answers
Convert HTML to Plain Text:
In your terminal, install the html-to-text npm package:
npm install html-to-text
Then in JavaScript:
const { convert } = require('html-to-text'); // Import the library
var htmlString = `
<ul>
<li>First</li>
<li>Second</li>
</ul>
`;
var text = convert(htmlString, { wordwrap: 130 });
console.log(text);
// Out (the default list formatter prefixes each item with " * "):
//  * First
//  * Second
- Hope this helps!
You can try to get rid of the HTML tags using a regex. For your example, try the following:
let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`
console.log(str)
// Matches opening and closing <li> and <ul> tags; the newlines between them are kept
const re = /<\/?!?(li|ul)[^>]*>/g;
str = str.replace(re, '');
console.log(str)
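The pattern above only removes li and ul tags. A minimal sketch of a more general variant, assuming the input is simple, well-formed markup (a regex is not an HTML parser and can trip over comments, script contents, or a > inside an attribute value): it replaces every tag with a space and then collapses runs of whitespace. stripTags is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: swap every tag for a space, then normalise whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // each tag becomes a single space
    .replace(/\s+/g, ' ')     // collapse newlines and repeated spaces
    .trim();
}

const html = `<ul>
<li>First</li>
<li>Second</li>
</ul>`;

console.log(stripTags(html)); // First Second
```

Replacing tags with a space rather than an empty string is what keeps the text of neighbouring elements apart.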
Okay, you can try this example; it may help you. I used the jsdom module: https://www.npmjs.com/package/jsdom
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
console.log(dom.window.document.querySelector("p").textContent);
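Note that textContent on a single element still joins the text of nested elements without a separator, which is the problem described in the question. A hedged sketch of the recursive walk the question mentions, written against the standard DOM node properties (nodeType, nodeValue, childNodes) so that it should also accept a jsdom node; collectText is a hypothetical helper name, and the small object tree below is a stand-in so the sketch runs without any dependency.

```javascript
// Hypothetical helper: walks a DOM-like tree and collects trimmed text-node values.
const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE in the DOM standard

function collectText(node, parts = []) {
  if (node.nodeType === TEXT_NODE) {
    const value = (node.nodeValue || '').trim();
    if (value) parts.push(value); // skip whitespace-only text nodes
  }
  // childNodes is a NodeList in a real DOM; Array.from also accepts a plain array
  for (const child of Array.from(node.childNodes || [])) {
    collectText(child, parts);
  }
  return parts;
}

// Stand-in tree shaped like <ul>\n<li>First</li>\n<li>Second</li>\n</ul>
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const element = (...children) => ({ nodeType: 1, nodeValue: null, childNodes: children });
const tree = element(
  text('\n'), element(text('First')),
  text('\n'), element(text('Second')),
  text('\n')
);

console.log(collectText(tree).join(' ')); // First Second
```

With jsdom installed, the same function could be called as collectText(new JSDOM(htmlString).window.document.body).join(' '); joining with '\n' instead gives one line per text node.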
By the way, this helped me.
This code can help I think :)
In a browser, you could use the DOM's Node.textContent property. However, Node.js has no native access to the DOM, so textContent is not available and you have to use external packages. You could install request and cheerio using npm. cheerio, suggested by Jon Church, is maybe the easiest web scraping tool to use (there are also more complex ones, like jsdom).
With the power of cheerio and request in your hands, you could write:
const request = require("request");
const cheerio = require("cheerio");
const fs = require("fs");

//taken from https://stackoverflow.com/a/19709846/10713877
function is_absolute(url)
{
    var r = new RegExp('^(?:[a-z]+:)?//', 'i');
    return r.test(url);
}

function is_local(url)
{
    var r = new RegExp('^(?:file:)?//', 'i');
    return (r.test(url) || !is_absolute(url));
}

function send_request(URL)
{
    if (is_local(URL))
    {
        // Strip a leading file:// to get a plain file system path
        let url_tmp;
        if (URL.slice(0, 7) === "file://")
            url_tmp = URL.slice(7, URL.length);
        else
            url_tmp = URL;
        //taken from https://stackoverflow.com/a/20665078/10713877
        const $ = cheerio.load(fs.readFileSync(url_tmp));
        //Do something
        console.log($.text());
    }
    else
    {
        var options = {
            url: URL,
            headers: {
                'User-Agent': 'Your-User-Agent'
            }
        };
        request(options, function(error, response, html) {
            //no error
            if (!error && response.statusCode == 200)
            {
                console.log("Success");
                const $ = cheerio.load(html);
                //Do something
                console.log($.text());
            }
            else
            {
                console.log(`Failure: ${error}`);
            }
        });
    }
}
Let me explain the code. You pass a URL to the send_request function. It checks whether the URL string is a path to a local file (a plain file system path, or a path starting with file://). If it is a local file, it reads the file and loads it with the cheerio module; otherwise, it sends a request to the website using the request module and then loads the response with cheerio. Regular expressions are used in is_absolute and is_local. You get the text using the text() method provided by cheerio. Under the comments //Do something, you can do whatever you want with the text.
There are websites that tell you your user agent; copy and paste it in place of 'Your-User-Agent'.
The following calls will work:
//your local file
send_request("/absolute/path/to/your/local/index.html");
send_request("relative/path/to/your/local/index.html");
send_request("file:///absolute/path/to/your/local/index.html");
//website
send_request("https://stackoverflow.com/");
EDIT: I am on a Linux system.
You can try using the npm library htmlparser2. It makes this very simple:
const htmlparser2 = require('htmlparser2');
const htmlString = ''; //your html string goes here
const parser = new htmlparser2.Parser({
ontext(text) {
if (text && text.trim().length > 0) {
//do as you need, you can concatenate or collect as string array
}
}
});
parser.write(htmlString);
parser.end();
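One possible way to fill in the ontext handler above, as a sketch only: collect each non-empty chunk into an array and join the chunks at the end. makeTextCollector is a hypothetical helper name, and the ontext calls below are simulated by hand so the snippet runs without htmlparser2 installed; the real parser may deliver the text of one element in more than one chunk.

```javascript
// Hypothetical helper: builds a handler object with an ontext callback
// in the shape used above, plus a function that joins the collected text.
function makeTextCollector() {
  const parts = [];
  return {
    handler: {
      ontext(text) {
        const trimmed = text.trim();
        if (trimmed.length > 0) parts.push(trimmed); // ignore whitespace-only chunks
      },
    },
    result: (separator = ' ') => parts.join(separator),
  };
}

// Simulated ontext calls for <ul>\n<li>First</li>\n<li>Second</li>\n</ul>
const collector = makeTextCollector();
['\n', 'First', '\n', 'Second', '\n'].forEach((chunk) => collector.handler.ontext(chunk));

console.log(collector.result());     // First Second
console.log(collector.result('\n')); // First and Second on separate lines
```

With the library installed, collector.handler would be passed to new htmlparser2.Parser(...) in place of the inline object, and collector.result() read after parser.end().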
Tags: javascript - Getting all the text content from a HTML string in NodeJS - Stack Overflow