Removing HTML tags with a regular expression

Date: 2022-08-27 17:15:51

I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.

I do not know regular expression syntax at all and fumbled my way through this. Can someone with regex knowledge please provide a pattern that will work?

Here is my code:

  string sPattern = @"<\/?!?(img|a)[^>]*>";
  Regex rgx = new Regex(sPattern);
  Match m = rgx.Match(sSummary);
  string sResult = "";
  if (m.Success)
   sResult = rgx.Replace(sSummary, "", 1);

I am looking to remove the first occurrence of the <a> and <img> tags.

10 Answers

#1


18  

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.

Here is a link to a blog post I wrote a while back which goes into more detail about this problem.

That being said, here's a solution that should fix this particular problem. It is in no way a perfect solution, though.

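// Capture the text between the first <img>/<a> opening tag and the next '<'.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) { 
  // Keeps only the captured inner text, e.g. "blah" for <a href="blah">blah</a>.
  sResult = m.Groups["content"].Value;
}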
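
If the goal is to keep the rest of the string and drop only the first <a> element (opening and closing tag) or the first <img> tag, something along these lines may be closer to what the question asks for. This is only a sketch: the class and method names are made up for illustration, it assumes anchors are not nested and that attribute values contain no '>' character, and it shares all the usual limits of regex-based HTML handling.

```csharp
using System.Text.RegularExpressions;

public static class FirstTagRemover
{
    // Hypothetical helper. Matches either a whole <a ...>text</a> element
    // or a standalone <img ...> tag. The \b keeps "a" from matching
    // longer tag names such as <abbr> or <article>.
    private static readonly Regex FirstAnchorOrImage = new Regex(
        @"<a\b[^>]*>(?<content>.*?)</a>|<img\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces only the first match. For an anchor the link text is kept;
    // for an image the "content" group is unmatched and expands to nothing.
    public static string RemoveFirst(string sSummary)
    {
        return FirstAnchorOrImage.Replace(sSummary, "${content}", 1);
    }
}
```

Calling FirstTagRemover.RemoveFirst(sSummary) leaves any later <a> or <img> tags untouched, because the replacement count is limited to 1, as in the question's own code.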

#2


14  

To turn this:

'<td>mamma</td><td><strong>papa</strong></td>'

into this:

'mamma papa'

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ')

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ')

then trim away leading and trailing spaces with:

.trim();

So your tag-removal function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}
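
Since the question is about C#, the same three steps could be ported roughly as follows. This is a sketch, not part of the original answer; the class and method names are placeholders.

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical C# port of the removeTags function above.
    public static string RemoveTags(string html)
    {
        // 1. Replace every tag with a single space.
        string spaced = Regex.Replace(html, "<[^>]*>", " ");
        // 2. Collapse runs of two or more whitespace characters into one space.
        string collapsed = Regex.Replace(spaced, @"\s{2,}", " ");
        // 3. Trim leading and trailing whitespace.
        return collapsed.Trim();
    }
}
```

For the example input above, TagStripper.RemoveTags should give the same 'mamma papa' result as the JavaScript version.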

#3


2  

So the HTML parser everyone's talking about is Html Agility Pack.

If it is clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.
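
For the clean-XHTML case, a minimal System.Xml.Linq sketch might look like this. It assumes the input is a single well-formed element with no default XML namespace (otherwise Descendants("a") would need a namespace-qualified name), and the class and method names are invented for the example.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class XhtmlText
{
    // Returns the concatenated text content, i.e. the markup with every tag dropped.
    public static string StripAllTags(string xhtml)
    {
        return XElement.Parse(xhtml).Value;
    }

    // Removes the first <a> element but keeps its child nodes (the link text) in place.
    public static string UnwrapFirstAnchor(string xhtml)
    {
        XElement root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);
        XElement anchor = root.Descendants("a").FirstOrDefault();
        if (anchor != null)
        {
            // Swap the element for its own children.
            anchor.ReplaceWith(anchor.Nodes());
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

XElement.Parse throws on markup that is not well-formed XML, so this only suits input you know to be clean; for real-world HTML a tolerant parser such as Html Agility Pack is the safer choice.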

#4


2  

To also remove the whitespace between tags, you can combine a regex with a trim of the leading and trailing whitespace of the input HTML:

    public static string StripHtml(string inputHTML)
    {
        const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
        inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();

        string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);

        return noHTML;
    }

So for the following input:

      <p>     <strong>  <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del>   test text  </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>      

The output is only the text, with no whitespace left over from between the HTML tags or from before or after the HTML: "   test text   test 1  test 2  test 3 ".

Please note that the spaces before "test text" come from inside the <del> test text </del> element, and the space after "test 3" comes from inside the <em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p> markup.

#5


1  

You can use an existing library to strip the HTML tags. One good option is the Chilkat C# library.

#6


1  

You can use:

Regex.Replace(source, "<[^>]*>", string.Empty);

#7


0  

Here is the extension method I've been using for quite some time.

public static class StringExtensions
{
     // Replaces every tag with htmlPlaceHolder, then decodes a few common entities.
     public static string StripHTML(this string htmlString, string htmlPlaceHolder) {
         const string pattern = @"<(.|\n)*?>";
         string sOut = Regex.Replace(htmlString, pattern, htmlPlaceHolder);
         sOut = sOut.Replace("&nbsp;", String.Empty);
         sOut = sOut.Replace("&gt;", ">");
         sOut = sOut.Replace("&lt;", "<");
         // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
         sOut = sOut.Replace("&amp;", "&");
         return sOut;
     }
}
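
As an alternative to replacing entities one by one, the decoding step could be handed to WebUtility.HtmlDecode, which answer #4 already uses. The sketch below is an assumption about how the two might be combined, with made-up names; note that it decodes after stripping, so escaped markup such as &lt;b&gt; survives as literal text, and that &nbsp; is turned into a non-breaking space (U+00A0) rather than removed.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextHelper
{
    // Hypothetical helper: strip tags first, then decode all HTML entities.
    public static string StripAndDecode(string html, string placeholder = "")
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", placeholder);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```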

#8


0  

Strip an image's src from a string using a regular expression in C# (the image is located by its id):

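string PRQ = "<td valign=\"top\" style=\"width: 400px;\" align=\"left\"><img id=\"llgo\" src=\"http://test.Logo.png\" alt=\"logo\"></td>";

// Group 1 is everything from "<img" up to (but not including) the src attribute.
var regex = new Regex("(<img(.+?)id=\"llgo\"(.+?))src=\"([^\"]+)\"");

// Keep group 1 and drop the matched src="..." attribute.
PRQ = regex.Replace(PRQ, match => match.Groups[1].Value);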

#9


0  

Why not try a reluctant quantifier? htmlString.replaceAll("<\\S*?>", "")

(It's Java, but the main point is to show the idea.)

#10


-1  

Here's an extension method I created using a simple regular expression to remove HTML tags from a string:

// The containing class name is a placeholder; C# extension methods must live in a static class.
public static class HtmlStringExtensions
{
    /// <summary>
    /// Converts an HTML string to plain text, and replaces all br tags with line breaks.
    /// </summary>
    public static string ToPlainText(this string s)
    {
        // "\r\n" is the CR+LF sequence that Constants.vbCrLf stood for in the VB original.
        s = s.Replace("<br>", "\r\n");
        s = s.Replace("<br />", "\r\n");
        s = s.Replace("<br/>", "\r\n");

        s = Regex.Replace(s, "<[^>]*>", string.Empty);

        return s;
    }
}

Hope that helps.
