regex - javascript HTML from document.body.innerHTML - Stack Overflow

I am trying to build a string of the contents of a webpage, without HTML syntax (probably replace it with a space, so words are not all conjoined) or punctuation.

Say you have this markup:

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any idea how to do this? Thanks.

asked Jul 14, 2011 at 0:16 by Test Tester

4 Answers

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

    var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
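One way to remove the new lines, and the punctuation the question also mentions, is a small normalising step on the returned string. This is only a sketch: `normalizeText` is a hypothetical helper name, the `sample` string stands in for `document.body.innerText` so the code runs outside a browser, and the character class assumes ASCII text (accented letters would be treated as punctuation).

```javascript
// Hypothetical helper: turn raw page text into a single line of words.
function normalizeText(raw) {
  return raw
    .replace(/[^\w\s]|_/g, " ") // replace punctuation (ASCII-centric) with spaces
    .replace(/\s+/g, " ")       // collapse runs of whitespace, including new lines
    .trim();
}

// Stand-in for document.body.innerText (assumption, for running outside a browser).
var sample = "Content:\n\nparagraph 1\n\nparagraph 2\n\nThis is some text\n....and some more";
console.log(normalizeText(sample));
// → "Content paragraph 1 paragraph 2 This is some text and some more"
```

In a browser the call would be `normalizeText(document.body.innerText)`.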

Some browsers support the W3C DOM Level 3 Core textContent property, others support the Microsoft/HTML5 innerText property, and some support both. Since the content of the script element is probably unwanted, a recursive traversal of the relevant part of the DOM tree seems best:

    // Get the text within an element.
    // Doesn't do any normalising; returns the text
    // as found, as a single string.
    function getTextRecursive(element) {
      var text = [];
      var el, els = element.childNodes;

      for (var i = 0, iLen = els.length; i < iLen; i++) {
        el = els[i];

        // May need to add other node types here.
        // Recurse into elements, excluding script element content.
        if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
          // Call the function by name: arguments.callee throws in strict mode.
          text.push(getTextRecursive(el));

        // If working with XML, add nodeType 4 to get text from CDATA nodes.
        } else if (el.nodeType == 3) {

          // Deal with extra whitespace and returns in text here.
          text.push(el.data);
        }
      }
      return text.join('');
    }
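Because the function above joins the pieces with an empty string, text from adjacent elements can run together when the markup has no whitespace between them. A possible variant, sketched here under the assumption that putting a space between every text node is acceptable, gathers the text nodes into one flat list first. `collectText` and `mockBody` are hypothetical names; the mock tree only imitates the `nodeType`, `tagName`, `childNodes` and `data` properties so the example can run outside a browser.

```javascript
// Hypothetical variant: collect trimmed text nodes into a flat array.
// It reads only nodeType, tagName, childNodes and data, so mock objects work too.
function collectText(node, out) {
  out = out || [];
  var kids = node.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    var kid = kids[i];
    if (kid.nodeType === 1) {
      var tag = String(kid.tagName).toLowerCase();
      // Assumption: style content is as unwanted as script content.
      if (tag !== "script" && tag !== "style") collectText(kid, out);
    } else if (kid.nodeType === 3) {
      var text = kid.data.replace(/\s+/g, " ").trim();
      if (text) out.push(text); // skip whitespace-only text nodes
    }
  }
  return out;
}

// Mock tree for running outside a browser (assumption, not real DOM nodes).
var mockBody = { nodeType: 1, tagName: "BODY", childNodes: [
  { nodeType: 1, tagName: "H1", childNodes: [{ nodeType: 3, data: "Content:" }] },
  { nodeType: 1, tagName: "P", childNodes: [{ nodeType: 3, data: "paragraph 1" }] },
  { nodeType: 1, tagName: "SCRIPT", childNodes: [{ nodeType: 3, data: " alert(1); " }] },
  { nodeType: 3, data: "\nThis is some text" }
] };
console.log(collectText(mockBody).join(" "));
// → "Content: paragraph 1 This is some text"
```

In a browser the call would be `collectText(document.body).join(" ")`. The trade-off is that a word split across inline elements, such as `wo<b>rd</b>`, comes out as two words.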

You'll need a striptags function in javascript for that and a regex to replace consecutive newlines with a single space.
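JavaScript has no built-in striptags function, so one would have to be written. The following is a rough sketch using regular expressions, which only approximate HTML parsing and can be fooled by unusual markup (a `>` inside an attribute value, for instance). The name `stripTags` and the sample `html` string are made up for illustration.

```javascript
// Hypothetical stripTags helper; regex-based, so only an approximation of HTML parsing.
function stripTags(html) {
  return html
    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, " ") // drop script elements and their content
    .replace(/<[^>]*>/g, " ")                         // replace every remaining tag with a space
    .replace(/\s+/g, " ")                             // collapse whitespace and new lines
    .trim();
}

var html = '<h1>Content:</h1>\n<p>paragraph 1</p>\n<script> alert("blah"); </script>\nThis is some text<br />\n....and some more';
console.log(stripTags(html));
// → "Content: paragraph 1 This is some text ....and some more"
```

Removing the script elements before the generic tag pass is what keeps the `alert(...)` source out of the result.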

You can try the replace call below:

    var str = "..your HTML..";
    var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML that you have provided above, this will give you the following string in content:

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more
