I'm attempting to split a string by spaces into an array of words. If the string contains HTML tags, I would like the full tag (including content) to be treated as a single word.

For example,

I like to eat <a href="/">tasty delicious waffles</a> for breakfast

should split into

I
like
to
eat
<a href="/">tasty delicious waffles</a>
for
breakfast

I've seen a couple of related threads on Stack Overflow, but I'm having trouble adapting anything to JavaScript because they were written for languages that I'm not familiar with. Is there a regular expression that could easily do this, or will the solution require multiple regex splits and iteration?

Thanks.

asked Sep 26, 2011 at 7:44 by mcarpenter
  • Can there be nested tags like <div> foo <div> bar </div> baz </div>? – Tim Pietzcker, Sep 26, 2011 at 8:15

2 Answers

result = subject.match(/<\s*(\w+\b)(?:(?!<\s*\/\s*\1\b)[\s\S])*<\s*\/\s*\1\s*>|\S+/g);

will work if your tags can't be nested, if all tags are properly closed, and if the current tag's name doesn't occur in comments, strings, etc.

Explanation:

<\s*            # Either match a < (+ optional whitespace)
(\w+\b)         # tag name
(?:             # Then match...
 (?!            # (as long as it's impossible to match...
  <\s*\/\s*\1\b # the closing tag here
 )              # End of negative lookahead)
 [\s\S]         # ...any character
)*              # zero or more times.
<\s*\/\s*\1\s*> # Then match the closing tag.
|               # OR:
\S+             # Match a run of non-whitespace characters.
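
To see the pattern in action, here is a minimal sketch that applies it to the example string from the question. The constant `TAG_OR_WORD` and the helper `splitKeepingTags` are names made up for illustration; the regex itself is unchanged from above, and the same caveats (no nesting, properly closed tags) still apply.

```javascript
// Regex from the answer above, unchanged.
const TAG_OR_WORD = /<\s*(\w+\b)(?:(?!<\s*\/\s*\1\b)[\s\S])*<\s*\/\s*\1\s*>|\S+/g;

// Hypothetical helper: returns [] instead of null when nothing matches.
function splitKeepingTags(subject) {
  return subject.match(TAG_OR_WORD) || [];
}

const words = splitKeepingTags(
  'I like to eat <a href="/">tasty delicious waffles</a> for breakfast'
);
console.log(words);
// [ 'I', 'like', 'to', 'eat',
//   '<a href="/">tasty delicious waffles</a>', 'for', 'breakfast' ]
```

The `|| []` fallback is there because `String.prototype.match` returns `null` when a global regex finds no matches, for example on an empty or whitespace-only string.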

This is hard or impossible to do with a regexp alone (depending on the complexity of the HTML that you want/need to allow).

Instead, iterate over the children of the parent node and split them if they are text nodes or print them unmodified if they are non-text nodes.
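
A rough sketch of that idea follows. It assumes the markup has already been parsed into a parent node, and it only looks at direct children, so nested elements stay inside their parent's markup. The helper name `splitChildNodes` is made up for illustration, and using `outerHTML` to "print" an element unmodified is one possible reading of the suggestion.

```javascript
// DOM node type for text nodes (Node.TEXT_NODE in browsers).
const TEXT_NODE = 3;

// Hypothetical helper: walks the direct children of `parent`.
function splitChildNodes(parent) {
  const words = [];
  for (const child of Array.from(parent.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      // Text node: split on whitespace runs, dropping empty pieces.
      words.push(...child.nodeValue.split(/\s+/).filter(Boolean));
    } else if (child.outerHTML !== undefined) {
      // Element node: keep its full markup as a single "word".
      words.push(child.outerHTML);
    }
    // Other node types (e.g. comments) have no outerHTML and are skipped.
  }
  return words;
}
```

In a browser, one way to get a parent node is to create a detached `div`, assign the string to its `innerHTML`, and pass the `div` to the helper. Note that `outerHTML` returns the browser's serialization of the element, which may differ slightly from the original source text (attribute quoting, for instance).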
