What is the best way to remove HTML tags from a string where that string can contain greater than and less than signs?

Example String

"<h1>Hello</h1> two is < three but three is > one"

Expected Answer

"Hello two is < three but three is > one"

I've tried the following, but it also removes "< three but three is >", which must not happen:

Regex.Replace(b, "<.*?>", String.Empty);

HTML encoding/decoding is an option, but only as a last resort. My current idea is to build a concrete list of HTML tags and strip them with a string split or string replace. What is the best, most performant way to handle this situation?
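The concrete-list idea mentioned above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class name `TagAllowlist`, the `Strip` method, and the short tag list are all hypothetical, and the list would need to be extended to cover every tag you want removed. Text that happens to look like a listed tag (for example "<b and c>") would still be stripped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAllowlist
{
    // Hypothetical, deliberately short list; extend it with the tags you need to strip.
    private static readonly string[] KnownTags =
        { "h1", "h2", "h3", "p", "div", "span", "a", "b", "i", "br" };

    // Matches <tag ...>, </tag> and <tag/> only when "tag" is one of KnownTags.
    // The \b stops "p" from matching the start of a longer name such as "pre".
    private static readonly Regex KnownTagPattern = new Regex(
        @"</?(?:" + string.Join("|", KnownTags) + @")\b[^<>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string input)
    {
        return KnownTagPattern.Replace(input, string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        string input = "<h1>Hello</h1> two is < three but three is > one";

        // Prints: Hello two is < three but three is > one
        Console.WriteLine(TagAllowlist.Strip(input));
    }
}
```

Building the pattern once and reusing it keeps the per-string cost low, which matters when many properties are scrubbed in a loop.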

Solution:

Html.Scrub(html);

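// Assumption: HtmlDocument is the HtmlAgilityPack type, whose API these calls match.
using HtmlAgilityPack;

public static class Html
{
    // Parses the string with a fresh HtmlDocument and returns its text content.
    public static string Scrub(string s)
    {
        HtmlDocument d = new HtmlDocument();
        d.LoadHtml(s);
        return d.DocumentNode.InnerText;
    }

    // Same, but reuses a caller-supplied HtmlDocument instead of creating one per call.
    public static string Scrub(string s, HtmlDocument d)
    {
        d.LoadHtml(s);
        return d.DocumentNode.InnerText;
    }
}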

The second overload let me reuse one HtmlDocument while walking through all the string properties of each entity:

HtmlDocument d = new HtmlDocument();
foreach (var eItem in eItems)
{
    eItem.string1 = Html.Scrub(eItem.string1, d);
    eItem.string2 = Html.Scrub(eItem.string2, d);
    eItem.string3 = Html.Scrub(eItem.string3, d);
}

Asked Feb 20 at 15:45 by Ben; edited Feb 21 at 13:30. 4 comments:
  • Is it possible to change the source data to be valid HTML? Those errant < and > characters (instead of &lt; and &gt;) will make parsing much more difficult. – David Commented Feb 20 at 15:49
  • @David It is a requirement that end users are allowed to type "<" and ">" into their strings. Whether the front end encodes them or not doesn't matter: per the audit team, I still have to check the strings for HTML tags in case the request was hijacked and changed. – Ben Commented Feb 20 at 15:58
  • See these stackoverflow posts: - stackoverflow/questions/26991134 - stackoverflow/questions/5002111 – bdcoder Commented Feb 20 at 16:02
  • @Ben Post your solution as answer and add some useful context. For example, which HTML parser/renderer is that? nuget link ....etc. – dr.null Commented Feb 21 at 14:11

2 Answers

Use a library (as per my comments above).

Example below using HtmlAgilityPack (https://www.nuget./packages/htmlagilitypack/):

using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        string htmlStr = @"<h1>Hello</h1> two is < three but three is > one";

        // Load the fragment into an HtmlAgilityPack document.
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlStr);

        // InnerText returns the text content with the tags removed.
        string text = htmlDoc.DocumentNode.InnerText;

        Console.WriteLine(text);
    }
}

Output:

Hello two is < three but three is > one

Using Regular Expressions

using System;
using System.Text.RegularExpressions;
using System.Net;

class Program
{
    static void Main()
    {
        string htmlString = "<h1>Hello</h1> two is < three but three is > one";

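        // Remove tag-like spans: '<', an optional '/', a tag name, then everything up to the next '>'.
        // A '<' followed by a space (as in "< three") does not match and stays in the text.
        string withoutTags = Regex.Replace(htmlString, "</?\\w+.*?>", string.Empty);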

        // Decode any HTML entities
        string decodedString = WebUtility.HtmlDecode(withoutTags);

        Console.WriteLine(decodedString);
    }
}

Output:

Hello two is < three but three is > one
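One thing worth checking before relying on the regular expression above: it tells tags from plain text only by whether a word character directly follows the "<". The sketch below reuses the same pattern to show both cases; the class name `RegexLimitDemo` and the second input string are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexLimitDemo
{
    // Same pattern as in the answer above.
    public static string StripTags(string input)
    {
        return Regex.Replace(input, "</?\\w+.*?>", string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        // Space after '<': not treated as a tag, so the text is preserved.
        Console.WriteLine(RegexLimitDemo.StripTags("two is < three but three is > one"));

        // No space after '<': "<three but three is>" looks like a tag and is removed.
        Console.WriteLine(RegexLimitDemo.StripTags("two is <three but three is> one"));
    }
}
```

If users may type "<" immediately followed by a letter, a pattern like this cannot distinguish that input from markup, so a parser or a fixed list of tag names may be the safer choice.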
