从字符串中删除HTML标签，包括c#

How can I remove all the HTML tags including &nbsp using regex in C#. My string looks like

如何删除c#中的所有HTML标签，包括使用regex。我的字符串的样子

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

8 个解决方案

#1

176

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.

如果您不能使用面向HTML解析器的解决方案来过滤标记，这里有一个简单的regex。

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

You should ideally make another pass through a regex filter that takes care of multiple spaces as

理想情况下，您应该通过regex过滤器进行另一次传递，该过滤器负责多个空格

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

#2

I took @Ravi Thapliyal's code and made a method: It is simple and might not clean everything, but so far it is doing what I need it to do.

我使用了@Ravi Thapliyal的代码，并创建了一个方法:它很简单，可能不能清除所有东西，但到目前为止，它正在做我需要它做的事情。

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}

#3

I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.

我用这个函数已经有一段时间了。删除几乎任何凌乱的html，你可以扔掉它，并留下文本完整。

        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        //add characters that are should not be removed to this regex
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

#4

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

#5

this:

这样的:

(<.+?> | &nbsp;)

will match any tag or  

将匹配任何标记或

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

then x = hello

那么x =你好

#6

HTML is in its basic form just XML. You could Parse your text in an XmlDocument object, and on the root element call InnerText to extract the text. This will strip all HTML tages in any form and also deal with special characters like <   all in one go.

HTML的基本形式就是XML。您可以在XmlDocument对象中解析文本，并在根元素上调用InnerText来提取文本。这将剥离所有HTML tages在任何形式，并处理特殊字符，如<,都是一气呵成。

#7

-1

Sanitizing an Html document involves a lot of tricky things. This package maybe of help: https://github.com/mganss/HtmlSanitizer

对Html文档进行消毒涉及到很多棘手的事情。这个包可能有帮助:https://github.com/mganss/HtmlSanitizer

#8

-1

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

您可以在这里测试:https://regex101.com/r/kB0rQ4/1。

#1

176