How do I strip HTML from a string using Perl?

Time: 2022-10-30 09:37:00

Is there an easier way than this to strip HTML from a string using Perl?

$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;

I would appreciate a slimmed-down regular expression, e.g. something like this:

$Error_Msg =~ s|</?[b|h1|br]>||ig;

Is there an existing Perl function that strips any/all HTML from a string, even though I only need bolds, h1 headers and br stripped?

3 solutions

#1


18  

Assuming the code is valid HTML (no stray < or > operators)

$htmlCode =~ s|<.+?>||g;

If you need to remove only bolds, h1's and br's

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##gi;
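
To make the difference from the pattern in the question concrete, here is a minimal sketch that runs the targeted substitution on a made-up message. The sample string is an assumption for illustration, not taken from the question. Note that [b|h1|br] in the question is a character class (it matches one single character out of b, |, h, 1, r), while (?:b|h1|br) is a real alternation.

```perl
use strict;
use warnings;

# Hypothetical sample message, not from the original question.
my $Error_Msg = '<H1>Error</H1><b>Disk</b> full<br>Try <i>again</i><BR />';

# (?:b|h1|br) alternates between whole tag names.
# \b keeps "b" from matching the start of longer names such as <body>.
# /i makes the match case-insensitive, /g replaces every occurrence.
(my $stripped = $Error_Msg) =~ s#</?(?:b|h1|br)\b.*?>##gi;

print "$stripped\n";    # prints: ErrorDisk fullTry <i>again</i>
```

The <i> tags survive because they are not in the alternation, which is the point of the targeted form.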

And you might want to consider the HTML::Strip module

#2


14  

You should definitely have a look at HTML::Restrict, which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'

I would recommend staying away from HTML::Strip because it breaks UTF-8 encoding.

#3


14  

From perlfaq9: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
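
As a rough sketch of the HTML::Parser route, the code below collects only text events and ignores everything else. The helper name strip_html and the sample input are assumptions made for illustration; the module has to be installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# strip_html is a hypothetical helper name, not part of HTML::Parser.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with entities already decoded.
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );

    # Skip the contents of script and style elements entirely.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;
    return $text;
}

print strip_html('<IMG SRC="foo.gif" ALT="A > B"><b>1 &lt; 2</b><!-- <A comment> -->'), "\n";
```

With this input the call should print 1 < 2: the quoted > inside the ALT attribute and the comment are handled by the parser, and no handler is registered for tags or comments, so they are simply dropped.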

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
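
For the entity point, a small sketch using decode_entities from HTML::Entities (a CPAN module shipped with HTML::Parser). The sample string is made up, and the tag removal here is deliberately the naive kind, only to show the ordering: strip tags first, then decode, so that a decoded < is never mistaken for the start of a tag.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical input, for illustration only.
my $html = '<b>1 &lt; 2 &amp;&amp; 3 &gt; 2</b>';

# Step 1: remove tags while entities are still encoded.
(my $text = $html) =~ s/<[^>]*>//g;

# Step 2: turn &lt; &gt; &amp; and friends back into characters.
$text = decode_entities($text);

print "$text\n";    # prints: 1 < 2 && 3 > 2
```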

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
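
To see what the quote-aware pattern buys you, this sketch runs it next to the naive s/<.*?>//gs on a tag whose attribute spans a line break and contains a quoted >. The input string is an assumption chosen to trigger the problem.

```perl
use strict;
use warnings;

# Multi-line tag with a quoted ">" inside an attribute value.
my $html = qq{<IMG SRC = "foo.gif"\n ALT = "A > B">Hello <b>world</b>};

# Naive pattern: stops at the first ">", even inside the quotes.
(my $naive = $html) =~ s/<.*?>//gs;

# Quote-aware pattern from above: quoted strings are consumed as a unit.
(my $quote_aware = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive:       $naive\n";          # prints: naive:        B">Hello world
print "quote-aware: $quote_aware\n";    # prints: quote-aware: Hello world
```

The naive version leaves the tail of the ALT attribute in the output; the quote-aware one does not. Neither handles comments or script bodies.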

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
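
One way to cope with this particular case in a regex-only setting is to remove comments in a first pass and tags in a second. This is only a sketch under the assumption that comments are well-formed (each <!-- has a closing -->); it is not the striphtml program mentioned above, and the sample text around the comment is made up.

```perl
use strict;
use warnings;

my $html = "Before<!-- This section commented out.\n"
         . "    <B>You can't see me!</B>\n"
         . "-->After <b>text</b>";

# Pass 1: drop whole comments, including any tags and text inside them.
# /s lets "." cross line breaks.
(my $text = $html) =~ s/<!--.*?-->//gs;

# Pass 2: drop the remaining tags (naive pattern, see the caveats above).
$text =~ s/<[^>]*>//g;

print "$text\n";    # prints: BeforeAfter text
```

Without the first pass, the text "You can't see me!" would end up in the output even though a browser never displays it.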
