javascript - Regex strip all html tags excluding <br> & <a class='user'></a>

I am relatively new to regex, but my aim is to strip all HTML tags from a string, excluding <br>s and <a> tags with class='user'. I want to use this regex to clean unwanted HTML garbage from contenteditable fields.

Hope one of you regex masters can help...

Here is an example to test on: http://gskinner.com/RegExr/?2tpai

I think I'm close, but the closing tag of the <a class='user'> element is still being matched as garbage when it should be kept.

Asked May 17, 2011 at 7:33 by wilsonpage

Don't roll your own. This is important and nontrivial stuff. It has been done a zillion times. Re-use it. – sehe Commented May 17, 2011 at 7:34
"I want to user this regex to clean unwanted html garbage from contentedittable fields." If you're dealing with a contentEditable field, why not just walk the DOM tree? HTML is very difficult to parse with regular expressions. (In fact, I think it's technically impossible, but you can get a close approximation if you try really hard.) – T.J. Crowder Commented May 17, 2011 at 7:40
Presumably, it's JUST the tags you want to strip, not their elements' content? – Bobby Jack Commented May 17, 2011 at 7:42
What if the class='user' isn't the first attribute of the <a> tag? What if the anchor tag has other tags inside it (such as an <img>)? What if there are comments anywhere after your start tags? How will you know to skip over matches that appear inside them? I don't think you're close at all. I don't mean to be critical, merely to emphasize the point in my answer that this is technically impossible, and even getting "good enough" for casual use is very, very hard. There are undoubtedly even more issues that neither of us has considered. – Andrzej Doyle Commented May 17, 2011 at 7:49
@TJCrowder OK, if regex is the wrong choice, could you give an example of how to clean the HTML in my example using JavaScript/jQuery via DOM manipulation? – wilsonpage Commented May 17, 2011 at 8:01


3 Answers

Formally speaking, you can't parse HTML with regex, because HTML is not a regular language. See also "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for some nightmare material.

Undoubtedly you can come up with some regexes that work in "most" situations, or "sensible" situations, but if you're sanitising user input via the regex this won't be sufficient. It's not a million miles away from trying to substitute SQL parameters via elaborate textual replacement and escaping; no matter how clever you are, some weakness will still exist, so the correct answer is to use a different approach.

In this case, that approach would be to use an HTML parsing library to read the text, and then remove every element except the <br> and <a class='user'> tags from the parsed DOM. (This is much more robust than trying to remove a sequence of characters from the raw text, and probably produces more understandable source code too.) In fact, since you're probably talking about JS in the client, you already have the DOM available, pre-parsed by the browser, so this would be a simple operation.

If you're unfamiliar with JavaScript's DOM manipulation methods, I consider the quirksmode intro to be approachable and informative.
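
As a rough illustration of the DOM approach, here is a minimal sketch, assuming browser-side JavaScript and standard DOM methods only. The names stripTags and isAllowed and the "editor" id are hypothetical, not taken from the question. Disallowed elements are replaced by their own children, so their text survives.

```javascript
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in browsers

// Decide whether an element survives the clean-up (hypothetical helper).
function isAllowed(el) {
  const tag = el.tagName.toLowerCase();
  if (tag === 'br') return true;
  if (tag === 'a') {
    // Accept "user" as a whole class name, whatever the attribute order.
    const cls = el.getAttribute('class') || '';
    return cls.split(/\s+/).indexOf('user') !== -1;
  }
  return false;
}

// Walk the subtree depth-first and unwrap every disallowed element.
function stripTags(root) {
  let child = root.firstChild;
  while (child) {
    const next = child.nextSibling; // remember it before the tree changes
    if (child.nodeType === ELEMENT_NODE) {
      stripTags(child); // clean the descendants first
      if (!isAllowed(child)) {
        // Hoist the (already cleaned) children, then drop the empty shell.
        while (child.firstChild) {
          root.insertBefore(child.firstChild, child);
        }
        root.removeChild(child);
      }
    }
    child = next;
  }
  return root;
}

// Usage in a browser (the "editor" id is an assumption):
// stripTags(document.getElementById('editor'));
```

Two caveats with this sketch: the text inside removed <script> or <style> elements is kept as plain text, and a kept <a class='user'> retains all of its other attributes, so this is a tidy-up routine rather than a security filter.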

I would suggest this to you:

<(?!a class='user'|br|/a)[^>]+>

That is, every </a> tag stays in your HTML, including those whose opening tag was removed, which should not cause much trouble.

This is pretty hacky, but the regex engine will immediately skip a chunk of text starting with <a class='user' and start looking for the next <...
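
To see what this pattern keeps and removes, here is a small sketch in JavaScript. The function name stripMostTags and the sample strings are made up for illustration; the pattern itself is the one above, with only the "/" escaped for a regex literal.

```javascript
// The pattern from the answer as a JavaScript regex literal.
const UNWANTED_TAG = /<(?!a class='user'|br|\/a)[^>]+>/g;

// Remove every tag the pattern matches (hypothetical helper name).
function stripMostTags(html) {
  return html.replace(UNWANTED_TAG, '');
}

const sample =
  "<div>Hi <a class='user' href='#'>bob</a><br><b>bold</b> <a href='x'>link</a></div>";

console.log(stripMostTags(sample));
// → Hi <a class='user' href='#'>bob</a><br>bold link</a>

// The match is purely textual, so small variations defeat it:
console.log(stripMostTags('<a class="user">amy</a>'));          // → amy</a>
console.log(stripMostTags("<a href='#' class='user'>amy</a>")); // → amy</a>
```

Note the stray </a> left behind by the plain link, and that double quotes or a different attribute order cause a user anchor to be stripped. The lookahead also shields any tag whose name merely starts with "br" or whose closing tag starts with "/a" (for example </abbr>).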

In general, in my experience with transforming HTML through regexes, the only safe way is to split the process into several intermediate steps: first deal with the <a class='user'..../a> elements, then with the rest. However, I can't see an easy way to do that in your case without transforming the <a class='user'..../a> into something different as an intermediate step.
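
One possible shape for such an intermediate step is sketched below in JavaScript. It hides the tags to keep behind placeholder tokens, strips everything that still looks like a tag, then restores the hidden tags. The name cleanInSteps and the NUL-delimited token format are assumptions made for this sketch. It also assumes that user anchors are not nested and that the class attribute is exactly "user".

```javascript
// Sketch of the multi-step idea (hypothetical helper, not a full sanitizer).
function cleanInSteps(html) {
  const kept = [];
  // Store a fragment and return a token that cannot be mistaken for a tag.
  const stash = (fragment) => '\u0000' + (kept.push(fragment) - 1) + '\u0000';

  // Drop NUL characters already in the input so they cannot collide with tokens.
  let out = html.replace(/\u0000/g, '');

  // Step 1: hide the opening and closing tag of each user anchor.
  // The content in between stays visible so nested tags still get stripped.
  out = out.replace(
    /(<a\b[^>]*\bclass=(['"])user\2[^>]*>)([\s\S]*?)(<\/a>)/gi,
    (match, open, quote, inner, close) => stash(open) + inner + stash(close)
  );

  // Step 2: hide <br>, <br/> and <br />.
  out = out.replace(/<br\s*\/?>/gi, (match) => stash(match));

  // Step 3: anything that still looks like a tag is unwanted.
  out = out.replace(/<[^>]+>/g, '');

  // Step 4: put the hidden tags back.
  return out.replace(/\u0000(\d+)\u0000/g, (match, i) => kept[Number(i)]);
}

console.log(cleanInSteps(
  "<div>Hi <a href='#' class='user'><b>bob</b></a><br><i>there</i> <a href='x'>link</a></div>"
));
// → Hi <a href='#' class='user'>bob</a><br>there link
```

Unlike the single-pattern version, this copes with either quote style and any attribute order, and it removes the closing tags of plain links. It is still an approximation: an unescaped "<" in the text, or an anchor nested inside a user anchor, would confuse it.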

How about:

<?php
$new_content = strip_tags($content, '<a><br>');

This would allow all <br> and all <a> elements. Unfortunately, this function gives you no way to allow or disallow element attributes such as your class="user". It always allows or disallows the specified elements together with all of their attributes.
