c# - Remove Html Tags From String - Stack Overflow

Updated: 2025-03-08

What is the best way to remove HTML tags from a string where that string can contain greater than and less than signs?

Example String

"<h1>Hello</h1> two is < three but three is > one"

Expected Answer

"Hello two is < three but three is > one"

I've tried this, but it also removes "< three but three is >", which must not happen:

Regex.Replace(b, "<.*?>", String.Empty);

HTML encoding/decoding is an option, but only as a last resort. My current approach is to keep a concrete list of HTML tags and strip them with string splits or replacements. What is the best, most performant way to handle this situation?
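To make the failure concrete, here is a minimal sketch of what the pattern from the question actually matches. The class and method names (LazyTagRegexDemo, StripNaive, NaiveMatches) are made up for illustration; only Regex.Replace and Regex.Matches come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LazyTagRegexDemo
{
    // The pattern from the question: "<", then as few characters as possible, then ">".
    private const string NaivePattern = "<.*?>";

    // Removes everything the naive pattern matches.
    public static string StripNaive(string input) =>
        Regex.Replace(input, NaivePattern, string.Empty);

    // Returns every substring the naive pattern treats as a "tag".
    public static string[] NaiveMatches(string input)
    {
        MatchCollection matches = Regex.Matches(input, NaivePattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }
}
```

For the example string, NaiveMatches returns three items: "<h1>", "</h1>" and "< three but three is >". The pattern has no notion of a tag name, so any "<" followed later by a ">" on the same line counts as a match, and StripNaive drops the comparison text along with the real tags.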

Solution:

Html.Scrub(html);

using HtmlAgilityPack;

public static class Html
{
    // Parses the input as HTML and returns only its text content.
    public static string Scrub(string s)
    {
        return Scrub(s, new HtmlDocument());
    }

    // Overload that reuses a caller-supplied HtmlDocument across many calls.
    public static string Scrub(string s, HtmlDocument d)
    {
        d.LoadHtml(s);
        return d.DocumentNode.InnerText;
    }
}

This let me reuse a single HtmlDocument while walking through all the string properties of each entity:

HtmlDocument d = new HtmlDocument();
foreach (var eItem in eItems)
{
    eItem.string1 = Html.Scrub(eItem.string1, d);
    eItem.string2 = Html.Scrub(eItem.string2, d);
    eItem.string3 = Html.Scrub(eItem.string3, d);
}
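If listing each property by hand becomes tedious, one possible alternative is to find the string properties with reflection. This is only a sketch and not part of the original solution: EntityScrubber, ScrubStringProperties and SampleEntity are hypothetical names, and the scrubbing function is passed in as a delegate so the example does not depend on any particular HTML library.

```csharp
using System;
using System.Reflection;

// Hypothetical helper, for illustration only.
public static class EntityScrubber
{
    // Applies `scrub` to every public instance string property of `entity`
    // that has both a public getter and a public setter. Null values are left alone.
    public static void ScrubStringProperties(object entity, Func<string, string> scrub)
    {
        if (entity == null) throw new ArgumentNullException(nameof(entity));
        if (scrub == null) throw new ArgumentNullException(nameof(scrub));

        PropertyInfo[] props = entity.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance);

        foreach (PropertyInfo p in props)
        {
            if (p.PropertyType != typeof(string)) continue;
            if (p.GetGetMethod() == null || p.GetSetMethod() == null) continue;
            if (p.GetIndexParameters().Length != 0) continue; // skip indexers

            var value = (string)p.GetValue(entity);
            if (value != null)
            {
                p.SetValue(entity, scrub(value));
            }
        }
    }
}

// Hypothetical entity, used only to show the shape of the input.
public class SampleEntity
{
    public string Title { get; set; }
    public string Body { get; set; }
    public int Id { get; set; }
}
```

With the Html class above, a call could look like EntityScrubber.ScrubStringProperties(eItem, s => Html.Scrub(s, d)). Reflection is slower than direct property access, so for a hot path the explicit assignments in the loop above may still be the better choice.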

asked Feb 20 at 15:45 by Ben, edited Feb 21 at 13:30

Is it possible to change the source data to be valid HTML? Those errant < and > characters (instead of &lt; and &gt;) will make parsing much more difficult. – David Commented Feb 20 at 15:49
@David It is a requirement that end users are allowed to type "<" and ">" into their strings. Whether the front end encodes it or not doesn't matter because I still have to check for html tags in the strings in case the request was hijacked and changed. -> Per the audit team. – Ben Commented Feb 20 at 15:58
See these stackoverflow posts: - stackoverflow/questions/26991134 - stackoverflow/questions/5002111 – bdcoder Commented Feb 20 at 16:02
@Ben Post your solution as answer and add some useful context. For example, which HTML parser/renderer is that? nuget link ....etc. – dr.null Commented Feb 21 at 14:11

2 Answers

Use a library (as per my comments above).

Example below using HtmlAgilityPack (https://www.nuget.org/packages/htmlagilitypack/):

using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        string htmlStr = @"<h1>Hello</h1> two is < three but three is > one";

        // Parse the string as an HTML document.
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlStr);

        // InnerText returns the text content without the markup.
        string text = htmlDoc.DocumentNode.InnerText;

        Console.WriteLine(text);
    }
}

Output:

Hello two is < three but three is > one

Using Regular Expressions

using System;
using System.Text.RegularExpressions;
using System.Net;

class Program
{
    static void Main()
    {
        string htmlString = "<h1>Hello</h1> two is < three but three is > one";

        // Remove HTML tags without affecting < and > in normal text
        string withoutTags = Regex.Replace(htmlString, "</?\\w+.*?>", string.Empty);

        // Decode any HTML entities
        string decodedString = WebUtility.HtmlDecode(withoutTags);

        Console.WriteLine(decodedString);
    }
}

Output

Hello two is < three but three is > one
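Note that \w also matches digits, so the pattern above would still remove text such as "<2 and 3>". A slightly stricter variant, sketched below as an assumption rather than as part of the original answer, requires the tag name to start with a letter and allows attributes only after whitespace. TagStripper and Strip are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class TagStripper
{
    // "<" or "</", a tag name that starts with an ASCII letter, optional attribute
    // text introduced by whitespace, an optional "/" and the closing ">".
    private static readonly Regex TagPattern = new Regex(
        @"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>",
        RegexOptions.Compiled);

    // Removes everything that looks like a tag under the pattern above.
    public static string Strip(string input) =>
        TagPattern.Replace(input, string.Empty);
}
```

This keeps "< three but three is >" and "1 <2 and 3> 2" intact while still removing tags such as <h1>, </h1>, <br/> and <a href="x">. It remains a heuristic: text like "a <b and c> d" looks exactly like a tag with attributes and is removed, and attribute values that contain ">" are not handled, which is one reason the parser-based answer is usually the safer choice.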
