What is the best way to remove HTML tags from a string where that string can contain greater than and less than signs?
Example string:
"<h1>Hello</h1> two is < three but three is > one"
Expected result:
"Hello two is < three but three is > one"
I've tried this, but it also removes "< three but three is >", which must not happen:
Regex.Replace(b, "<.*?>", String.Empty);
HTML encode/decode is an option, but only as a last resort. My current idea is to build a concrete list of HTML tags and strip them with a string split or string replace. What is the best, most performant way to handle this situation?
Solution:
Html.Scrub(html);
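using HtmlAgilityPack;

public static class Html
{
    // Parses s with a fresh HtmlDocument and returns its text content.
    public static string Scrub(string s)
    {
        HtmlDocument d = new HtmlDocument();
        d.LoadHtml(s);
        return d.DocumentNode.InnerText;
    }

    // Same, but reuses a caller-supplied HtmlDocument instead of allocating one per call.
    public static string Scrub(string s, HtmlDocument d)
    {
        d.LoadHtml(s);
        return d.DocumentNode.InnerText;
    }
}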
This let me walk through all the string properties of an entity while reusing a single HtmlDocument:
HtmlDocument d = new HtmlDocument();
foreach (var eItem in eItems)
{
    eItem.string1 = Html.Scrub(eItem.string1, d);
    eItem.string2 = Html.Scrub(eItem.string2, d);
    eItem.string3 = Html.Scrub(eItem.string3, d);
}
asked Feb 20 at 15:45 by Ben, edited Feb 21 at 13:30
2 Answers
Use a library (as per my comments above).
Example below using HtmlAgilityPack (https://www.nuget./packages/htmlagilitypack/):
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        String html_str = @"<h1>Hello</h1> two is < three but three is > one";
        String text;
        HtmlDocument html_doc = new HtmlDocument();
        html_doc.LoadHtml( html_str );
        text = html_doc.DocumentNode.InnerText;
        Console.WriteLine( text );
    }
}
Output:
Hello two is < three but three is > one
Using Regular Expressions
using System;
using System.Text.RegularExpressions;
using System.Net;

class Program
{
    static void Main()
    {
        string htmlString = "<h1>Hello</h1> two is < three but three is > one";
        // Remove anything shaped like a tag: '<' or '</' immediately followed by a word character.
        // A '<' followed by a space (as in "< three") does not match and is left alone.
        string withoutTags = Regex.Replace(htmlString, "</?\\w+.*?>", string.Empty);
        // Decode any HTML entities
        string decodedString = WebUtility.HtmlDecode(withoutTags);
        Console.WriteLine(decodedString);
    }
}
Output
Hello two is < three but three is > one
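The question also mentions keeping a concrete list of HTML tags. A minimal sketch of that idea, using only the standard library, is below. The names TagStripper, KnownTags and Strip are hypothetical and the tag list is only an illustrative assumption; extend it to whatever your input can actually contain. It is a narrower variant of the regex above, not a replacement for a real parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical allowlist of tag names to strip (an assumption for this sketch).
    // Names are inserted into the pattern as-is, so keep them plain letters and digits.
    private static readonly string[] KnownTags =
    {
        "a", "b", "br", "div", "em", "h1", "h2", "h3",
        "i", "li", "ol", "p", "span", "strong", "ul"
    };

    // Matches an opening or closing tag whose name is in KnownTags,
    // e.g. <h1>, </h1>, <br/> or <a href="x">. The \b stops "b" from matching inside "<br>".
    private static readonly Regex TagPattern = new Regex(
        @"</?(?:" + string.Join("|", KnownTags) + @")\b[^<>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }
        return TagPattern.Replace(input, string.Empty);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Prints: Hello two is < three but three is > one
        Console.WriteLine(TagStripper.Strip("<h1>Hello</h1> two is < three but three is > one"));

        // "y" is not in the list, so nothing is removed. Prints: x <y and y> z
        Console.WriteLine(TagStripper.Strip("x <y and y> z"));
    }
}
```

This still has blind spots: text such as "x <b and c> y" is removed because "b" is a listed tag name, and an attribute value containing ">" ends the match early. If the input is not under your control, the library approach above is likely the safer choice.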
… < and > characters (instead of &lt; and &gt;) will make parsing much more difficult. – David, commented Feb 20 at 15:49