I have some HTML text that contains all kinds of HTML tags, such as <table>, <a>, <img>, and so on.
Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and upper-case <IMG></IMG>).
How can I do this?
UPDATE:
My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.
UPDATE AGAIN
Maybe a sample will make my question easier to understand :)
Here is some HTML text:
<html>
<head></head>
<body>
Hello, everyone. Here is my photo: <img src="xxx.jpg" />.
And, <a href="xxx">know more</a> about me!
</body>
</html>
I want to keep <img>, and remove the other tags. The following is what I want:
Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!
My current code looks like this:
html.replaceAll("<.*?>", "")
But it removes everything between < and >. I want to keep <img xxx> and </img>, and remove the other content between < and >.
Thanks, everyone!
4 answers
#1
9
I tried a lot of things, and this regular expression seems to work for me:
(?i)<(?!img|/img).*?>
My code is:
html.replaceAll("(?i)<(?!img|/img).*?>", "");
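For anyone who wants to try this out, here is a minimal, self-contained sketch of the same negative-lookahead idea in Java. The class and method names (`ImgTagFilter`, `stripTagsExceptImg`) are made up for illustration. It also differs from the pattern above in two ways that are my own assumptions about what is wanted: `[^>]*` is used instead of `.*?` so that tags spanning several lines are matched too, and a `(?=[\s/>])` check after `img` stops tags such as `<imgfoo>` from being kept by accident. Like any regex approach, it will still be confused by a `>` inside an attribute value.

```java
import java.util.regex.Pattern;

final class ImgTagFilter {
    // Matches any tag except <img ...> and </img>, case-insensitively.
    // The negative lookahead (?!...) rejects a "<" that is followed by
    // "img" or "/img" and then whitespace, "/" or ">".
    private static final Pattern NON_IMG_TAG =
            Pattern.compile("(?i)<(?!/?img(?=[\\s/>]))[^>]*>");

    // Removes every matched tag and leaves text and img tags in place.
    static String stripTagsExceptImg(String html) {
        return NON_IMG_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<body>Here is my photo: <img src=\"xxx.jpg\" />. "
                + "And, <a href=\"xxx\">know more</a> about me!</body>";
        // prints: Here is my photo: <img src="xxx.jpg" />. And, know more about me!
        System.out.println(stripTagsExceptImg(html));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.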
#2
4
Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.
Use an HTML parser for your language/platform.
- Here is a Java one (HTML Parser)
- For .NET, the HTML Agility Pack is recommended
- For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is
#3
1
A simple answer to why you should not use a regex is:
A regexp can't parse a recursive grammar such as:
S -> (S)
S -> Empty
This is because recognizing this kind of grammar requires infinitely many states.
Since HTML has a recursive grammar, you can't simply use a regexp.
SPAN -> <span>SPAN</span>
SPAN -> text
But in your case the pattern you need is not recursive, so a regular expression can express it.
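To make the "infinite state" point concrete, here is a small sketch of a recognizer for the grammar S -> (S) | Empty. The class and method names (`NestedParens`, `matchesNested`) are hypothetical. The point is that the code has to count, and the count can grow without bound, which is exactly what a finite automaton behind a classic regular expression cannot do.

```java
final class NestedParens {
    // Accepts exactly the strings generated by S -> (S) | Empty,
    // i.e. n opening parentheses followed by n closing ones.
    static boolean matchesNested(String s) {
        int i = 0;
        int opens = 0;
        // Count the leading run of '('.
        while (i < s.length() && s.charAt(i) == '(') {
            opens++;
            i++;
        }
        int closes = 0;
        // Count the run of ')' that follows.
        while (i < s.length() && s.charAt(i) == ')') {
            closes++;
            i++;
        }
        // Nothing else may remain, and the two counts must agree.
        return i == s.length() && opens == closes;
    }

    public static void main(String[] args) {
        System.out.println(matchesNested("((()))")); // prints: true
        System.out.println(matchesNested("(()"));    // prints: false
    }
}
```

Stripping individual tags needs no such counter, because each tag is matched on its own without regard to what it is nested in.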
#4
0
<(img|IMG)*>*</(img|IMG)>