Using JavaScript, I need to parse the Content-Type text/html portion of an email message and extract just the HTML part. Here's an example of the part of the mail source in question:
------=_Part_1504541_510475628.1327512846983
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit
<html ... a bunch of html ...
</html>
I want to extract everything between (and including) the <html> tags after text/html. How do I do this?
NOTE: I'm OK with a hacky regex. I don't expect this to be bulletproof.
Asked Jul 3, 2012 at 20:25 by Ben McCormack
3 Answers
Based on the RFC/MIME documentation, the encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.
Note: JavaScript originally had no /s modifier to make the dot . match all characters, including line breaks (ES2018 later added the s flag for this). To match absolutely any character in engines without that flag, you can use a character class that contains a shorthand class and its negated version, such as [\s\S].
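To make the note above concrete, here is a small sketch comparing the plain dot, the [\s\S] class, and the ES2018 s flag. The sample string and variable names are made up for illustration, and the last line assumes an engine that supports the s flag.

```javascript
// Hypothetical two-line sample, joined by a single line break.
const sample = "first line\nsecond line";

// Without the s flag, the dot stops at the line break.
const dotMatch = /first.*second/.test(sample);        // false

// [\s\S] matches any character, including line breaks.
const classMatch = /first[\s\S]*second/.test(sample); // true

// With the ES2018 s (dotAll) flag, the dot matches line breaks too.
const dotAllMatch = /first.*second/s.test(sample);    // true
```

The [\s\S] form is the safer choice if the code may run in an older engine.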
Regex:
\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--
JavaScript:
var matches = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i.exec(mail);
var html = matches ? matches[1] : null; // capture group 1 holds the HTML body
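As a rough check of the regex above, the sketch below runs it against a small made-up message. The sample mail, the LF line endings, and the variable names are assumptions for illustration only. Note that, as written, the pattern needs a blank line between the end of the HTML and the next boundary line, so it returns null for messages that lack one.

```javascript
// Hypothetical sample message (LF line endings, blank line before each boundary).
const sampleMail = [
  'Content-Type: multipart/alternative; boundary="----=_Part_1"',
  "",
  "------=_Part_1",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "plain text",
  "",
  "------=_Part_1",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body><p>Hello</p></body></html>",
  "",
  "------=_Part_1--",
  ""
].join("\n");

// Same pattern as in the answer; group 1 captures the body of the text/html part.
const htmlPartPattern =
  /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i;

const found = htmlPartPattern.exec(sampleMail);
const htmlBody = found ? found[1] : null;

console.log(htmlBody);
```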
The answer by Ωmega is close but you can't be sure that the boundary contains the - character.
You first need to look within the headers. The headers and body of the actual email content will be separated by \r\n\r\n. You should see a header something like:
Content-Type: multipart/alternative;
    boundary="----=_Part_1504541_510475628.1327512846983"
This boundary is what you can then use to find the actual divider: each divider line is two hyphens followed by the boundary value. You can then construct a regexp just like Ωmega's but substitute in this divider.
The only thing to be aware of is that the last boundary will have -- at the end in addition to the normal boundary content.
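The approach described above could be sketched roughly as follows. This is a minimal, hacky version rather than a real MIME parser: extractHtmlPart is a hypothetical helper name, the sample message is made up, and nested multiparts, encoded bodies, and unusual header formatting are all ignored.

```javascript
// Hypothetical helper: returns the body of the first text/html part, or null.
function extractHtmlPart(mail) {
  // Read the boundary parameter from the multipart Content-Type header.
  const boundaryMatch = /boundary="?([^"\r\n;]+)"?/i.exec(mail);
  if (!boundaryMatch) return null;

  // Each divider line is "--" followed by the boundary value.
  const divider = "--" + boundaryMatch[1];

  // Split at divider lines; the first chunk holds the top-level headers.
  const chunks = mail.split("\n" + divider).slice(1);

  for (const chunk of chunks) {
    // Drop the line break that ends the divider line itself.
    const part = chunk.replace(/^\r?\n/, "");

    // A blank line separates the part's headers from its body.
    const separator = /\r?\n\r?\n/.exec(part);
    if (!separator) continue; // e.g. the "--" left over from the closing divider

    const headers = part.slice(0, separator.index);
    const body = part.slice(separator.index + separator[0].length);

    if (/^Content-Type:\s*text\/html/im.test(headers)) {
      return body.trim();
    }
  }
  return null;
}

// Usage with a made-up message:
const exampleMail = [
  'Content-Type: multipart/alternative; boundary="----=_Part_1"',
  "",
  "------=_Part_1",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "plain text",
  "------=_Part_1",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body><p>Hello</p></body></html>",
  "------=_Part_1--",
  ""
].join("\n");

const extracted = extractHtmlPart(exampleMail);
console.log(extracted);
```

Splitting on the exact divider string avoids guessing what the boundary looks like, and trimming the body means CRLF line endings should not leave stray characters behind.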
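var text = source.toString(); // assumes LF line endings and a single text/html body
var html = text.substring(text.indexOf("\n\n")).trim(); // everything after the first blank line; the whole text if none is found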