I am trying to build a string of the contents of a webpage, without HTML syntax (probably replace it with a space, so words are not all conjoined) or punctuation.

Say you have the code:

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any idea how to do this? Thanks.

Asked Jul 14, 2011 at 0:16 by Test Tester

4 Answers

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

    var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
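The clean-up step mentioned above could look something like this. It is only a sketch: `normalizeText` is a made-up helper name, and it assumes that "punctuation" means any Unicode punctuation or symbol character.

```javascript
// Hypothetical helper (not part of any DOM API): turns raw innerText
// output into a single line of space-separated words.
function normalizeText(raw) {
  return raw
    // Assumption: "punctuation" = Unicode punctuation (\p{P}) and symbols (\p{S}).
    .replace(/[\p{P}\p{S}]/gu, " ")
    // Collapse newlines, tabs and repeated spaces into one space.
    .replace(/\s+/g, " ")
    .trim();
}

// In a browser you would pass document.body.innerText instead of this sample.
const sample = "Content:\n\nparagraph 1\nparagraph 2\n\nThis is some text\n....and some more";
console.log(normalizeText(sample));
```

Replacing punctuation with a space rather than deleting it keeps words such as `text....and` from being joined together.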

There is the W3C DOM 3 Core textContent property supported by some browsers, or the MS/HTML5 innerText property supported by other browsers (some support both). Likely the content of the script element is unwanted, so a recursive traverse of the related part of the DOM tree seems best:

    // Get the text within an element.
    // Doesn't do any normalising; returns the text as found.
    function getTextRecursive(element) {
      var text = [];
      var el, els = element.childNodes;

      for (var i = 0, iLen = els.length; i < iLen; i++) {
        el = els[i];

        // May need to add other node types here
        // Exclude script element content
        if (el.nodeType === 1 && el.tagName && el.tagName.toLowerCase() !== 'script') {
          // Recurse by name: arguments.callee throws in strict mode
          text.push(getTextRecursive(el));

        // If working with XML, add nodeType 4 to get text from CDATA nodes
        } else if (el.nodeType === 3) {

          // Deal with extra whitespace and returns in text here.
          text.push(el.data);
        }
      }
      return text.join('');
    }
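If the whitespace handling is done during the walk rather than afterwards, the same kind of traversal can return words separated by single spaces. The sketch below is one way to do that, not the answer's own code: `collectText` and `SKIPPED_TAGS` are hypothetical names, and the plain objects at the end merely stand in for DOM nodes so that the snippet also runs outside a browser.

```javascript
// Assumption: these elements hold no reader-visible text.
const SKIPPED_TAGS = new Set(["script", "style", "noscript"]);

// Hypothetical variant of the traversal: gathers trimmed text-node
// contents into `pieces`, skipping whitespace-only nodes.
function collectText(node, pieces = []) {
  for (const child of Array.from(node.childNodes || [])) {
    if (child.nodeType === 1) {                      // element node
      if (!SKIPPED_TAGS.has(String(child.tagName).toLowerCase())) {
        collectText(child, pieces);
      }
    } else if (child.nodeType === 3) {               // text node
      const words = child.data.replace(/\s+/g, " ").trim();
      if (words) pieces.push(words);
    }
  }
  return pieces;
}

// Stand-in nodes for demonstration; in a page, pass document.body instead.
const text = (data) => ({ nodeType: 3, data });
const elem = (tagName, ...childNodes) => ({ nodeType: 1, tagName, childNodes });
const body = elem("BODY",
  elem("H1", text("Content:")),
  elem("P", text("paragraph 1")),
  elem("SCRIPT", text('alert("blah blah blah");')),
  text("\n  This is some text"), elem("BR"), text("\n  ....and some more\n"));

const result = collectText(body).join(" ");
console.log(result);
```

Joining the pieces with a space keeps block-level content apart, at the cost of also splitting text that spans inline elements (for example `<b>bo</b>ld` would come out as `bo ld`).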

You'll need a strip-tags function in JavaScript for that, plus a regex to replace consecutive newlines with a single space.
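As a rough illustration of that suggestion, here is a minimal sketch of such a function. `stripTags` is a hypothetical name rather than a built-in, and regex-based tag stripping is only an approximation: it assumes reasonably well-formed markup and does not decode entities such as `&amp;`.

```javascript
// Hypothetical helper: approximate tag stripping on an HTML string.
function stripTags(html) {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Replace every remaining tag with a space so words stay separated.
    .replace(/<[^>]*>/g, " ")
    // Replace runs of whitespace (including consecutive newlines) with one space.
    .replace(/\s+/g, " ")
    .trim();
}

const html = [
  "<body>",
  "<h1>Content:</h1>",
  "<p>paragraph 1</p>",
  "<p>paragraph 2</p>",
  "",
  '<script> alert("blah blah blah"); </script>',
  "",
  "This is some text<br />",
  "....and some more",
  "</body>",
].join("\n");

console.log(stripTags(html));
```

Removing the script element before the generic tag pattern runs is what keeps the `alert(...)` call out of the result.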

You can try using the replace statement below:

    var str = "..your HTML..";
    var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML you provided above, this gives the following string in content (note that the contents of the script element are still included):

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more  
