Regular expression to remove HTML tags from a string [duplicate]

Date: 2022-08-27 17:16:03

Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

3 Solutions

#1


100  

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.

The following examples are in Java, but the regex will be similar, if not identical, in other languages.

String target = someString.replaceAll("<[^>]*>", "");

This assumes that your non-HTML text does not contain any < or > characters and that your input string is well-formed.
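
As a minimal sketch of the one-liner above applied to the snippet from the question (the class and method names TagStripper and stripTags are made up for illustration, and the same assumptions about < and > apply):

```java
// Hypothetical helper class; only String.replaceAll from the JDK is used.
class TagStripper {
    // Deletes every run that starts with '<', continues with non-'>' characters,
    // and ends at the first '>'. Text outside such runs is left untouched.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The <td ...> and </td> tags are removed, leaving the cell value.
        System.out.println(stripTags("<td class=\"played\">0</td>")); // prints 0
    }
}
```

Note that the pattern knows nothing about HTML structure; it only deletes bracketed runs, which is why the assumptions above matter.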

If you know the input contains only a specific tag, for example only <td> tags, you could do something like this:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", ""); // "/?" also matches the closing </td>

Edit: Ωmega brought up a good point in a comment on another post: if there are multiple tags, this squashes all of the results together.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, then collapses runs of whitespace, and finally trims any whitespace on the ends.
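
The three steps can be spelled out as in the sketch below. The class CellText and its methods are hypothetical names, and the tag pattern is written as `</?td` so that closing tags are matched as well. The second method, cellValues, is not from the answer above; it is one possible way to get the values as a list using a capture group, and it assumes the cells are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class CellText {
    // Matches an opening or closing td tag, case-insensitively.
    private static final Pattern TD_TAG = Pattern.compile("(?i)</?td\\b[^>]*>");
    // Captures the text between an opening td tag and the next closing td tag.
    private static final Pattern TD_CELL = Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td\\s*>");

    // Step 1: tag -> space. Step 2: collapse whitespace runs. Step 3: trim the ends.
    static String joinCells(String html) {
        String spaced = TD_TAG.matcher(html).replaceAll(" ");
        String collapsed = spaced.replaceAll("\\s+", " ");
        return collapsed.trim();
    }

    // Assumed alternative: collect each cell's text separately instead of joining.
    static List<String> cellValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = TD_CELL.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "<td>Something</td><td>Another Thing</td>";
        System.out.println(joinCells(row));  // prints Something Another Thing
        System.out.println(cellValues(row)); // prints [Something, Another Thing]
    }
}
```

Returning a list avoids having to guess afterwards where one cell ended and the next began, which the space-joined string cannot tell you when a cell itself contains spaces.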

#2


34  

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is, that may well fail.
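
To make "may well fail" concrete, here is a small sketch with two inputs chosen for illustration (the class name StripPitfalls is hypothetical). In both cases a > or < that is not a tag delimiter confuses the pattern:

```java
// Hypothetical demonstration class; the inputs are invented examples.
class StripPitfalls {
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(stripTags("<a title=\"a > b\">link</a>")); // prints  b">link

        // Plain text containing '<' and '>' is mistaken for a tag,
        // so "< b and c >" is deleted.
        System.out.println(stripTags("if a < b and c > d")); // prints if a  d
    }
}
```

Inputs like these are where an HTML parser is the safer choice.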

#3


3  

You could do it with jsoup (http://jsoup.org/):

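import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist
Whitelist whitelist = Whitelist.none(); // allow no tags, so only text survives
String cleanStr = Jsoup.clean(yourText, whitelist);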
