How to remove all HTML tags except img?

I have some HTML text that contains all kinds of HTML tags, such as <table>, <a>, <img>, and so on.

Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and the upper-case <IMG></IMG>).

How can I do this?

UPDATE:

My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.

UPDATE AGAIN

Maybe a sample will make my question easier to understand :)

Here is some HTML text:

<html>
  <head></head>
  <body>
     Hello, everyone. Here is my photo: <img src="xxx.jpg" />. 
     And, <a href="xxx">know more</a> about me!
  </body>
</html>

I want to keep <img>, and remove the other tags. The following is what I want:

Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!

Currently my code looks like this:

html.replaceAll("<.*?>", "")

But this removes everything between < and >. I want to keep <img xxx> and </img>, and remove only the other content between < and >.

Thanks, everyone!

4 solutions

#1

I tried a lot, and this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html.replaceAll('(?i)<(?!img|/img).*?>', '');
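
The pattern above also keeps any tag whose name merely starts with "img", and `.` does not match line breaks by default. Below is a minimal Java sketch of a slightly stricter variant, not the answer's original code. The class and method names (`StripTagsExceptImg`, `stripTagsExceptImg`) are hypothetical, and it assumes no attribute value contains a `>` character.

```java
import java.util.regex.Pattern;

class StripTagsExceptImg {
    // (?i) ignores case, (?s) lets a tag span line breaks.
    // The negative lookahead rejects only <img ...>, <img/>, <img> and </img>:
    // "img" must be followed by whitespace, "/" or ">", so <imgfoo> is still removed.
    // Assumption: attribute values never contain ">".
    private static final Pattern NON_IMG_TAG =
            Pattern.compile("(?is)<(?!/?img(?:\\s|/|>))[^>]*>");

    // Removes every tag except img tags; text content is left untouched.
    static String stripTagsExceptImg(String html) {
        return NON_IMG_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Photo: <IMG src=\"xxx.jpg\" />. And, <a href=\"xxx\">know more</a>!</p>";
        // prints: Photo: <IMG src="xxx.jpg" />. And, know more!
        System.out.println(stripTagsExceptImg(html));
    }
}
```

This is still a plain text substitution: comments, script bodies, and a literal `<` in the text are not handled, so it only suits simple summaries like the one described in the question.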

#2

Do not use a RegEx to parse HTML. See here for a compelling demonstration of why.

Use an HTML parser for your language/platform.

Here is a Java one (HTML Parser).

For .NET, the HTML Agility Pack is recommended.

For Ruby, there is Nokogiri, though I am not a Ruby dev, so I don't know how good it is.

#3

A simple answer to why you should not use a RegEx is:

A regexp can't parse a recursive grammar such as:

S -> (S)
S -> Empty

Because recognising this kind of grammar requires an unbounded number of states, which a regular expression does not have.

Since HTML has a recursive grammar, you can't simply use a regexp:

SPAN -> <span>SPAN</span>
SPAN -> text

But in your case, the pattern you need is not recursive, so it can be expressed as a regular expression.
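
To illustrate the difference, here is a small Java sketch. The class and method names (`NestedSpanDemo`, `firstSpanContent`, `stripTags`) are hypothetical, and the sample input is made up. It shows a regex pairing the wrong closing tag on nested spans, while removing individual tags, which needs no pairing, works fine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedSpanDemo {
    // Non-greedy attempt to capture the content of one <span> element.
    private static final Pattern SPAN = Pattern.compile("<span>(.*?)</span>");

    // Returns what the regex takes to be the content of the first <span>,
    // or null when there is no match.
    static String firstSpanContent(String html) {
        Matcher m = SPAN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Removing tags needs no open/close pairing: each tag is a flat token.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String nested = "<span>a<span>b</span>c</span>";
        // The outer <span> gets paired with the inner </span>.
        System.out.println(firstSpanContent(nested)); // a<span>b
        System.out.println(stripTags(nested));        // abc
    }
}
```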

#4

<(img|IMG)*>*</(img|IMG)>
