I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.
I do not know regular expression syntax at all and fumbled my way through this. Can someone with regex knowledge please provide a pattern that will work?
Here is my code:
string sPattern = @"<\/?!?(img|a)[^>]*>";
Regex rgx = new Regex(sPattern);
Match m = rgx.Match(sSummary);
string sResult = "";
if (m.Success)
    sResult = rgx.Replace(sSummary, "", 1);
I am looking to remove the first occurrence of the <a> and <img> tags.
10 Answers
#1
18
Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
Here is a link to a blog post I wrote a while back which goes into more detail about this problem.
- http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
That being said, here's a solution that should fix this particular problem. It is in no way a perfect solution, though.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if (m.Success) {
    // Keep only the text between the first matched tag and the next '<'.
    sResult = m.Groups["content"].Value;
}
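For what it's worth, the closing tag survives in the question's code because Replace is called with a count of 1, so only the first match (the opening tag) is removed. Below is a minimal sketch of one way around that, assuming the goal is to keep the link text and drop the first <img> entirely. The class and method names are made up for illustration, and like any regex approach it will break on nested or malformed markup.

```csharp
using System.Text.RegularExpressions;

static class FirstTagRemover
{
    // Hypothetical helper (not from the original post): unwraps the first
    // <a>...</a>, keeping its inner text, and removes the first <img>.
    public static string RemoveFirstAnchorAndImage(string html)
    {
        // \b stops "a" from also matching tags such as <abbr> or <article>.
        var anchor = new Regex(@"<a\b[^>]*>(.*?)</a\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var image = new Regex(@"<img\b[^>]*>", RegexOptions.IgnoreCase);

        // The third argument (count = 1) limits each replacement to the first match.
        string result = anchor.Replace(html, "$1", 1);
        return image.Replace(result, "", 1);
    }
}
```

Matching the opening and closing tag in a single pattern is what lets one replacement remove both.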
#2
14
To turn this:
'<td>mamma</td><td><strong>papa</strong></td>'
into this:
'mamma papa'
you need to replace the tags with spaces:
.replace(/<[^>]*>/g, ' ')
reduce any runs of whitespace to a single space:
.replace(/\s{2,}/g, ' ')
and then trim away leading and trailing spaces:
.trim();
Put together, the tag-removal function looks like this:
function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
    .replace(/\s{2,}/g, ' ')
    .trim();
}
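The snippet above is JavaScript, while the question is about C#. A rough C# equivalent of the same three steps might look like the sketch below; the class and method names are placeholders, not anything from the answer.

```csharp
using System.Text.RegularExpressions;

static class TagStripper
{
    // Hypothetical C# port of the three steps: tags -> spaces,
    // collapse runs of whitespace, trim the ends.
    public static string RemoveTags(string html)
    {
        string spaced = Regex.Replace(html, "<[^>]*>", " ");
        string collapsed = Regex.Replace(spaced, @"\s{2,}", " ");
        return collapsed.Trim();
    }
}
```

Replacing tags with a space rather than an empty string is what keeps "mamma" and "papa" from running together.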
#3
2
So the HTML parser everyone's talking about is Html Agility Pack.
If it is clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.
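As a small illustration of the XHTML route, here is a sketch using XElement from System.Xml.Linq. It assumes the input is well-formed XML with a single root element; anything else (unclosed tags, or entities such as &nbsp; that XML does not define) makes XElement.Parse throw. The class and method names are invented for the example.

```csharp
using System.Xml.Linq;

static class XhtmlText
{
    // Hypothetical helper: returns the concatenated text of all descendant
    // nodes. Assumes well-formed XML; XElement.Parse throws XmlException otherwise.
    public static string ExtractText(string xhtml)
    {
        return XElement.Parse(xhtml).Value;
    }
}
```

Because a real parser does the work, attribute values that contain '>' and similar edge cases do not confuse it the way they confuse a regex.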
#4
2
To also remove whitespace between tags, you can use the following method, which combines a regex with a trim of the input HTML:
public static string StripHtml(string inputHTML)
{
    // Match a tag plus any whitespace that runs up to the next tag, or else just the tag.
    const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
    // Decode entities such as &amp; first, then trim the ends of the input.
    inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();
    string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);
    return noHTML;
}
So for the following input:
<p> <strong> <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del> test text </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>
The output contains only the text, without the whitespace that sat between HTML tags or before and after the HTML: " test text test 1 test 2 test 3 ".
Note that the space before "test text" comes from the <del> test text </del> element and the space after "test 3" comes from the <em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p> element. Whitespace inside the text itself is left untouched.
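If the leftover spaces inside the text are not wanted either, one option is to collapse whitespace after stripping. The following is only a sketch of that extra step: it reuses the same pattern as above, and the final whitespace normalization and the names are my own additions, not part of the answer.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlText
{
    // Hypothetical variant of StripHtml that also collapses the
    // whitespace left behind inside the text.
    public static string StripHtmlAndNormalize(string inputHtml)
    {
        string decoded = WebUtility.HtmlDecode(inputHtml);
        // Same pattern as above: a tag plus whitespace up to the next tag, or just a tag.
        string noTags = Regex.Replace(decoded, @"<[^>]+>\s+(?=<)|<[^>]+>", string.Empty);
        // Added step: squeeze every run of whitespace to one space, then trim.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}
```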
#5
1
You can use an existing library to strip the HTML tags. One good option is the Chilkat C# library.
#6
1
You can use:
Regex.Replace(source, "<[^>]*>", string.Empty);
#7
0
Here is the extension method I've been using for quite some time.
public static class StringExtensions
{
    public static string StripHTML(this string htmlString, string htmlPlaceHolder) {
        const string pattern = @"<(.|\n)*?>";
        string sOut = Regex.Replace(htmlString, pattern, htmlPlaceHolder);
        sOut = sOut.Replace("&nbsp;", String.Empty);
        sOut = sOut.Replace("&gt;", ">");
        sOut = sOut.Replace("&lt;", "<");
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
        sOut = sOut.Replace("&amp;", "&");
        return sOut;
    }
}
#8
0
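Remove an image's source from a string using a regular expression in C# (the img tag is found by its id). Note that this strips only the src attribute; the rest of the img tag stays in place:
string PRQ = "<td valign=\"top\" style=\"width: 400px;\" align=\"left\"><img id=\"llgo\" src=\"http://test.Logo.png\" alt=\"logo\"></td>";
// Group 1 captures everything from "<img" up to, but not including, the src attribute.
var regex = new Regex("(<img(.+?)id=\"llgo\"(.+?))src=\"([^\"]+)\"");
// Replacing the whole match with group 1 drops src="...".
PRQ = regex.Replace(PRQ, match => match.Groups[1].Value);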
#9
0
Why not try a reluctant quantifier? htmlString.replaceAll("<\\S*?>", "")
(It's Java, but the main point is to show the idea.)
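To show what the reluctant (lazy) quantifier buys you, here is a short C# sketch comparing it with the greedy form. It uses "." instead of \S, because \S does not match whitespace and so would skip any tag that has attributes. The names are illustrative only.

```csharp
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // Greedy: ".*" runs from the first '<' to the LAST '>' on the line,
    // swallowing the text between tags as well.
    public static string StripGreedy(string html) => Regex.Replace(html, "<.*>", "");

    // Lazy: ".*?" stops at the FIRST '>' after each '<', so each tag is
    // matched on its own and the text in between survives.
    public static string StripLazy(string html) => Regex.Replace(html, "<.*?>", "");
}
```

For the input "<b>bold</b> text", the greedy version returns " text" while the lazy one returns "bold text".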
#10
-1
Here's an extension method I created using a simple regular expression to remove HTML tags from a string:
// The class name is a placeholder; C# extension methods must live in a static class.
public static class HtmlStringExtensions
{
    /// <summary>
    /// Converts an HTML string to plain text, and replaces all br tags with line breaks.
    /// </summary>
    public static string ToPlainText(this string s)
    {
        // "\r\n" is the value of Constants.vbCrLf from the VB version of this code.
        s = s.Replace("<br>", "\r\n");
        s = s.Replace("<br />", "\r\n");
        s = s.Replace("<br/>", "\r\n");
        s = Regex.Replace(s, "<[^>]*>", string.Empty);
        return s;
    }
}
Hope that helps.