javascript - Regexp to search/replace only text, not in HTML attribute - Stack Overflow

I'm using JavaScript to apply a regular expression. Assuming the source is well-formed, I want to remove any spaces before a comma or period ([,.]) and keep exactly one space after it, except when the [,.] is part of a number. So I use:

text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');

The problem is that this also replaces text inside HTML tag attributes. For example, my text is (always wrapped in a tag):

<p>Test,and test . Again <img src="xyz.jpg"> ...</p>

Now it adds a space, giving src="xyz. jpg", which is not what I want. How can I rewrite my regular expression? What I want is:

<p>Test, and test. Again <img src="xyz.jpg"> ...</p>

Thanks!
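
To make the problem concrete, here is a small reproduction that runs outside the browser. The function name fixPunctuationNaive and the sample strings are assumptions for illustration; the "..." from the example is replaced with a word, because consecutive dots would themselves be matched by the pattern.

```javascript
// Hypothetical wrapper around the replace call from the question.
function fixPunctuationNaive(text) {
    return text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
}

// Plain text: spacing around punctuation is normalized, the number is left alone.
var plain = fixPunctuationNaive('Pi is 3.14 ,roughly');
console.log(plain);
// Pi is 3.14, roughly

// Markup: the dot inside the src attribute is treated like sentence punctuation.
var markup = fixPunctuationNaive('<p>Test,and test . Again <img src="xyz.jpg"> done</p>');
console.log(markup);
// <p>Test, and test. Again <img src="xyz. jpg"> done</p>
```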

asked Aug 11, 2010 at 15:24 by jcisio; edited Aug 11, 2010 at 16:34 by bakkal

This isn't something regexes are good at, as HTML isn't a regular language. There is too much scope/nesting/context. – CaffGeek Commented Aug 11, 2010 at 15:26
Is that text accessible through the DOM? – Gumbo Commented Aug 11, 2010 at 15:39
Yes, I think so, though I haven't tried. I wanted to write it as a CKEditor plugin, which is why I said "well-formed" (well, I meant XHTML anyway). I have the source code, but I think I can get it as DOM elements. – jcisio Commented Aug 13, 2010 at 8:10

6 Answers

You can use a lookahead to make sure the match isn't occurring inside a tag:

text = text.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');

The usual warnings apply regarding CDATA sections, SGML comments, SCRIPT elements, and angle brackets in attribute values. But I suspect your real problems will arise from the vagaries of "plain" text; HTML's not even in the same league. :D
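
A quick way to check the lookahead pattern above is to run it on a variant of the question's sample. This is only a sketch: the wrapper name fixPunctuationOutsideTags and both input strings are assumptions, and the second input is there to illustrate the angle-bracket caveat rather than to suggest such markup is common.

```javascript
// Hypothetical wrapper around the lookahead-based replace.
function fixPunctuationOutsideTags(html) {
    // (?![^<>]*>) rejects any position where the next angle bracket ahead
    // is a ">", which is what a position inside a tag looks like.
    return html.replace(/(?![^<>]*>) *([.,]) *([^ \d])/g, '$1 $2');
}

var fixed = fixPunctuationOutsideTags('<p>Test,and test . Again <img src="xyz.jpg"> done</p>');
console.log(fixed);
// <p>Test, and test. Again <img src="xyz.jpg"> done</p>

// Caveat: a raw "<" inside an attribute value defeats the lookahead,
// so the dot before it is still rewritten.
var broken = fixPunctuationOutsideTags('<img alt="x.y<z">');
console.log(broken);
// <img alt="x. y<z">
```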

Do not try to rewrite your expression to do this. You won’t succeed and will almost certainly forget about some corner cases. In the best case, this will lead to nasty bugs and in the worst case you will introduce security problems.

Instead, when you’re already using JavaScript and have well-formed code, use a genuine XML parser to loop over the text nodes and only apply your regex to them.
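
One way to sketch the text-node approach is with a TreeWalker that visits only text nodes. This is a minimal sketch, not a definitive implementation: fixPunctuationText, fixTextNodes and SHOW_TEXT are hypothetical names, and the browser usage in the trailing comment assumes the source really is well-formed XHTML.

```javascript
// Same replacement the question uses, applied to one text string at a time.
function fixPunctuationText(text) {
    return text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
}

// Numeric value of NodeFilter.SHOW_TEXT, spelled out so the function
// does not depend on the NodeFilter global.
var SHOW_TEXT = 4;

// Hypothetical helper: rewrites every text node below `root`.
// `doc` is the Document that owns `root` (for example `document` in a browser).
function fixTextNodes(doc, root) {
    var walker = doc.createTreeWalker(root, SHOW_TEXT);
    var node;
    // nextNode() returns null once there are no more text nodes.
    while ((node = walker.nextNode())) {
        node.nodeValue = fixPunctuationText(node.nodeValue);
    }
}

// Browser usage (assumption: `source` holds well-formed XHTML):
//   var doc = new DOMParser().parseFromString(source, 'application/xhtml+xml');
//   fixTextNodes(doc, doc.documentElement);
//   var result = new XMLSerializer().serializeToString(doc);
```

Attribute values are never visited because they are not text nodes, so src="xyz.jpg" cannot be touched. Text inside elements such as script is still visited, so those may need to be skipped separately.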

If you can access that text through the DOM, you can do this:

function fixPunctuation(elem) {
    // check if parameter is an ELEMENT_NODE
    if (!(elem instanceof Node) || elem.nodeType !== Node.ELEMENT_NODE) return;
    var children = elem.childNodes, node;
    // iterate the child nodes of the element node
    for (var i=0; children[i]; ++i) {
        node = children[i];
        // check the child’s node type
        switch (node.nodeType) {
        case Node.ELEMENT_NODE:
            // call fixPunctuation if it’s also an ELEMENT_NODE
            fixPunctuation(node);
            break;
        case Node.TEXT_NODE:
            // fix punctuation if it’s a TEXT_NODE
            node.nodeValue = node.nodeValue.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
            break;
        }
    }
}

Now just pass the DOM node to that function like this:

fixPunctuation(document.body);
fixPunctuation(document.getElementById("foobar"));

HTML is not a "regular language", so regex is not the best tool for parsing it. You might be better off using an HTML parser like this one to get at the attribute, and then applying a regex to do something with the value.

Enjoy!

As stated above and many times before, HTML is not a regular language and thus cannot be parsed with regular expressions.

You will have to do this recursively; I'd suggest crawling the DOM object.

Try something like this...

function regexReplaceInnerText(curr_element) {
    if (curr_element.childNodes.length <= 0) { // termination case:
                                               // no children; this is a "leaf node"
        if (curr_element.nodeName == "#text" || curr_element.nodeType == 3) { // node is text; not an empty tag like <br />
            if (curr_element.data.replace(/^\s*|\s*$/g, '') != "") { // node isn't just white space
                                                                     // (you can skip this check if you want)
                var text = curr_element.data;
                text = text.replace(/ *(,|\.) *([^ 0-9])/g, '$1 $2');
                curr_element.data = text;
            }
        }
    } else {
        // recursive case:
        // this isn't a leaf node, so we iterate over all children and recurse
        for (var i = 0; curr_element.childNodes[i]; i++) {
            regexReplaceInnerText(curr_element.childNodes[i]);
        }
    }
}
// then get the element whose children's text nodes you want to be regex'd
regexReplaceInnerText(document.getElementsByTagName("body")[0]);
// or if you don't want to do the whole document...
regexReplaceInnerText(document.getElementById("ElementToRegEx"));

Don't parse ~~regex~~HTML with ~~HTML~~regex. If you know your HTML is well-formed, use an HTML/XML parser. Otherwise, run it through Tidy first and then use an XML parser.
