I'm trying to write a regex that will find a string of HTML tags inside a code editor (Khan Live Editor) and give the following error:
"You can't put <h1.. 2.. 3..> inside <p> elements."
This is the string I'm trying to match:
<p> ... <h1>
This is the string I don't want to match:
<p> ... </p><h1>
Instead the expected behavior is that another error message appears in this situation.
So, in English, I want a string that:
- starts with <p>, and
- ends with <h1>, but
- does not contain </p>.
It's easy enough to make this work if I don't care about the existence of a </p>. My expression looks like this, /<p>.*<h[1-6]>/, and it works fine. But I need to make sure that </p> does not come between the <p> and <h1> tags (or any <h#> tag, hence the <h[1-6]>).
I've tried a lot of different expressions from some other posts on here:
Regular expression to match a line that doesn't contain a word?
From which I tried: <p>^((?!<\/p>).)*$</h1>
regex string does not contain substring
From which I tried: /^<p>(?!<\/p>)<h1>$/
Regular expression that doesn't contain certain string
This link suggested: aa([^a] | a[^a])aa
Which doesn't work in my case because I need the specific string "</p>", not just the characters of it, since there might be other tags between <p> ... <h1>.
I'm really stumped here. The regex I've tried seems like it should work... Any idea how I would make this work? Maybe I'm implementing the suggestions from other posts wrong?
Thanks in advance for any help.
Edit:
To answer why I need this done:
The problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception.
- Exactly. So the problem is that <p><h1></h1></p> is a syntax error since h1 closes the first <p> and there is an unmatched </p>. The original syntax error is not informative, but in most cases it is correct; my example being the exception. I'm trying to pass the syntax parser a new message to override the original message if the regex finds this exception. – Dan Fletcher Commented Nov 24, 2015 at 18:53
- This has nothing to do with your regex question, but it is actually correct and fine to have HTML content that contains an <h1>, <p>, etc. before an explicit </p> because, in HTML5 (which has this flow-content rule), the </p> is completely optional. For instance:
<p>Paragraph 1.<p>Paragraph 2.<h2>Heading</h2><p>Paragraph 3.
is completely valid HTML5 and can be authored as such intentionally. – rgthree Commented Nov 24, 2015 at 18:57
- Should we assume you don't ever have attributes or whitespace in the tags? – Alan McBee Commented Nov 24, 2015 at 19:03
- @AlanMcBee Yes that's true. – Dan Fletcher Commented Nov 24, 2015 at 19:05
- @DanFletcher You said that RegEx is your only option. However, you can cheat your validator and pass a RegEx from an IIFE in the argument list, and utilize Niet the Dark Absol's code. Please check a fiddle. – Teemu Commented Nov 24, 2015 at 19:45
5 Answers
Sometimes it's better to break a problem down.
var str = "YOUR INPUT HERE";
// Drop everything before the first <p>
str = str.substr(str.indexOf("<p>"));
// Keep only the text that precedes the last <h1>
str = str.substr(0, str.lastIndexOf("<h1>"));
if (str.indexOf("</p>") > -1) {
    // there is a <p>...</p>...<h1>
} else {
    // there isn't
}
This code doesn't handle the case of "what if there is no <p> to begin with" very well, but it does give a basic idea of how to break a problem down into simpler parts, without using regex.
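As a rough sketch of the same index-based idea that also copes with a missing <p>, something like the following could work. The function name hasHeadingInOpenParagraph is hypothetical, and the sketch assumes bare lowercase tags with no attributes or whitespace inside them (as confirmed in the comments on the question). For each heading tag it looks back for the nearest <p> and checks whether a </p> came after it.

```javascript
// Hypothetical helper (not from the original answer): returns true when some
// <h1>..<h6> tag appears while the nearest preceding <p> has not been closed.
// Assumes bare lowercase tags without attributes.
function hasHeadingInOpenParagraph(html) {
  const headingRe = /<h[1-6]>/g;
  let m;
  while ((m = headingRe.exec(html)) !== null) {
    const before = html.slice(0, m.index);
    const lastOpen = before.lastIndexOf("<p>");
    if (lastOpen === -1) continue;          // no <p> before this heading
    const lastClose = before.lastIndexOf("</p>");
    if (lastClose < lastOpen) return true;  // nearest <p> was never closed
  }
  return false;
}

console.log(hasHeadingInOpenParagraph("<p> ... <h1>"));     // true
console.log(hasHeadingInOpenParagraph("<p> ... </p><h1>")); // false
console.log(hasHeadingInOpenParagraph("<h1>"));             // false
```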
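Search for <p> followed by any number of characters, none of which starts a </p> ([^] means any character that is not nothing, which allows us to also capture newlines), eventually followed by <h[1-6]>. The negative lookahead has to come before [^] so that it is checked at every position, including the one right after <p>; otherwise <p></p><h1> would still match.
/<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi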
RegEx101 Test Case
const strings = [
  '<p> ... <h1>',
  '<p> ... </p><h1>',
  '<P> Hello <h1>',
  '<p></p><h1>',
  '<p><h1>'
];
const regex = /<p>(?:(?!<\/p>)[^])*<h[1-6]>/gi;
const test = input => ({ input, test: regex.test(input), matches: input.match(regex) });
for(let input of strings) console.log(JSON.stringify(test(input)));
// { "input": "<p> ... <h1>", "test": true, "matches": ["<p> ... <h1>"] }
// { "input": "<p> ... </p><h1>", "test": false, "matches": null }
// { "input": "<P> Hello <h1>", "test": true, "matches": ["<P> Hello <h1>"] }
// { "input": "<p></p><h1>", "test": false, "matches": null }
// { "input": "<p><h1>", "test": true, "matches": ["<p><h1>"] }
Your first regular expression was close, but you need to remove the ^ and $ anchors. If you need to match across newlines, you should use [\s\S] instead of . because the dot does not match line terminators by default.
Here's the final regex: <p>(?:(?!<\/p>)[\s\S])*<h[1-6]>
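To illustrate the newline point, here is a small check of that pattern against two multi-line inputs. The sample strings and variable names are made up for this example.

```javascript
// The pattern from the answer; [\s\S] matches any character, including newlines.
const pattern = /<p>(?:(?!<\/p>)[\s\S])*<h[1-6]>/;

// Hypothetical sample inputs that span several lines.
const unclosed = "<p>\n  Some text\n<h2>Title</h2>";
const closed = "<p>\n  Some text\n</p>\n<h2>Title</h2>";

console.log(pattern.test(unclosed)); // true
console.log(pattern.test(closed));   // false
```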
However, having a header tag (<h1> - <h6>) directly after an unclosed <p> is perfectly legal. They're just considered sibling elements, with the paragraph element ending where the header element begins.
A p element’s end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, dir, div, dl, fieldset, footer, form, h1, h2, h3, h4, h5, h6, header, hr, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is not an a element.
http://www.w3.org/TR/html-markup/p.html
I'm reaching the conclusion that using a regular expression to find the error is going to turn your one problem into two problems.
Consequently, I think a better approach is to do a very simplistic form of tree parsing. A "poor-man's HTML parser", if you will.
Use a simple regular expression to simply find all tags in the HTML, and put them into a list in the same order in which they were found. Ignore the text nodes between the tags.
Then, walk through the list in order, keeping a running tally of the open tags. Increment the P counter when you get a <p> tag, and decrement it when you get a </p> tag. Likewise, increment the H counter when you get to an <h1> (etc.) tag, and decrement it on the closing tag.
If the H counter is > 0 while the P counter is > 0, that's your error.
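The counter walk described above might be sketched like this. The function name hasHeadingWhileParagraphOpen is hypothetical, the tags are assumed to carry no attributes, and clamping the counters at zero for stray closing tags is an extra assumption that the answer does not mention.

```javascript
// Sketch of the "poor-man's parser": collect tags, then walk them with counters.
// Assumes bare tags such as <p> or <h2>, with no attributes.
function hasHeadingWhileParagraphOpen(html) {
  // All opening and closing <p> and <h1>..<h6> tags, in document order.
  const tags = html.match(/<\/?(?:p|h[1-6])>/gi) || [];
  let openP = 0;
  let openH = 0;
  for (const tag of tags) {
    const delta = tag.startsWith("</") ? -1 : 1;
    if (/^<\/?p>$/i.test(tag)) {
      openP = Math.max(0, openP + delta); // clamp: ignore stray </p>
    } else {
      openH = Math.max(0, openH + delta); // clamp: ignore stray </hN>
    }
    if (openP > 0 && openH > 0) return true; // both kinds open at once
  }
  return false;
}

console.log(hasHeadingWhileParagraphOpen("<p><h1></h1></p>")); // true
console.log(hasHeadingWhileParagraphOpen("<p> ... </p><h1>")); // false
```

Because the rule only checks that both counters are positive, this sketch also flags a <p> opened inside a heading; telling the two cases apart would need a stack rather than two counters.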
I know I'm not formatting it correctly, but I think the logic will work
(just replace the AND and NOT with the correct symbols):
/(<p>.*<h[1-6]>)/ AND NOT /(<p>.*<\/p><h[1-6]>)/
Let me know how it goes :)
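One way to spell out that AND / NOT logic in JavaScript is to run two separate tests and combine them with && and !. This is only a sketch under the answer's own assumptions; the names below are hypothetical.

```javascript
// Two separate patterns, combined with && and ! as the answer suggests.
const pThenHeading = /<p>.*<h[1-6]>/;
const closedPThenHeading = /<p>.*<\/p><h[1-6]>/;

// Hypothetical helper name.
function looksLikeHeadingInParagraph(html) {
  return pThenHeading.test(html) && !closedPThenHeading.test(html);
}

console.log(looksLikeHeadingInParagraph("<p> ... <h1>"));     // true
console.log(looksLikeHeadingInParagraph("<p> ... </p><h1>")); // false
```

Note that the second pattern only recognises a </p> sitting directly before the heading tag, so an input such as "<p>a</p> <h1>" (with a space in between) is still reported, and . does not match newlines.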