Using Perl to strip everything except the HTML tags

Time: 2021-06-18 12:15:31

I have been searching for a way to strip everything out of an HTML document, leaving only the HTML tags. Is anyone aware of a method for this? I have experience with many Perl modules and have searched this site thoroughly.

I want to pass HTML as a string to my Perl script and remove everything except the tags. Here is an example:

Incoming:

<!doctype html>
<html>
<head>
<title>Example Domain</title>

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>    
</head>

<body>
<div>
    website content ....
</div>
</body>
</html>

Becomes:

<html><head><title></title><meta><meta><meta><style></style></head><body><div><h1></h1>       <p></p><p><a></a></p></div></body></html>

3 Solutions

#1


2  

#!/usr/bin/perl --
use strict;
use warnings;
use XML::Twig;

Main( @ARGV );
exit( 0 );

sub Main {
    if( @_ ){
        nothing_but_tags("$_") for @_;
    } else {
        nothing_but_tags(q{<NoTe
KunG="FoO"
ChOp="SuEy"> 
NoteKungFo0Ch0pSuEy
<To KunG="FoO">ToKungFo0 
<Person KunG="FoO">Satan</Person>
</To>
<Beef KunG="FoO"> BeefKunGFoO <SaUsAGe KunG="FoO">is Tasty
</SaUsAGe>
</Beef>
</NoTe>},
        );
    }
}

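# Parse $input with XML::Twig and print only the element tags,
# dropping all attributes and character data.
sub nothing_but_tags
{
    my( $input, %opt ) = @_;

    $opt{pretty_print}  ||= 'indented' ;

    my $t = XML::Twig->new(
        %opt,
        force_end_tag_handlers_usage => 1,
        start_tag_handlers => {
            # Clear the attributes of every element as soon as its start tag is seen.
            _all_ =>  sub {
                if( $_->has_atts ){
                    $_->set_atts ({});
                }
                return;
            },
        },
        # Flush (print and free) what has been parsed so far at each end tag.
        end_tag_handlers => { _all_ =>  sub { $_->flush; return }, },
        # Replace all character data with the empty string.
        char_handler => sub { '' },
    );
    $t->xparse( $input );
    $t->flush();
    ();
}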
__END__
<NoTe>
  <To>
    <Person></Person>
  </To>
  <Beef>
    <SaUsAGe></SaUsAGe>
  </Beef>
</NoTe>

#2


0  

Such a transform is very simple with XSLT, so here's an example using XML::LibXSLT.

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXSLT;

my $filename = $ARGV[0] or die("Usage: $0 filename\n");
my $doc      = XML::LibXML->load_html(location => $filename);

my $stylesheet_doc = XML::LibXML->load_xml(string => <<'EOF');
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
EOF

my $xslt       = XML::LibXSLT->new;
my $stylesheet = $xslt->parse_stylesheet($stylesheet_doc);
my $result     = $stylesheet->transform($doc);

print $result->serialize_html;

#3


0  

I don't know if I understood your question correctly, but to leave just the tags you could take the output of a strip-tags function (one that removes only the tags) and then replace that output with an empty string in the original text. In theory, the first step gives you the exact text that sits outside the tags, and the second step removes that text from the original.
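
The idea above could be sketched in Perl roughly as follows. This is only a minimal sketch, not a literal implementation of the two steps: instead of computing the stripped text and then substituting it out, it walks the string with a regular expression and copies the tags directly, which should give the same result on simple input. The helper name `tags_only` is made up for illustration, and the regex assumes reasonably well-formed markup with no `>` inside attribute values and no tag-like text inside comments, `<script>` or `<style>` blocks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tags_only() is a hypothetical helper, not part of any module.
# It copies each start or end tag to the output, lowercased and without
# attributes, and drops the text between tags.
# Assumption: no '>' inside attribute values and no tag-like text inside
# comments, <script> or <style>.
sub tags_only {
    my ($html) = @_;
    my $out = '';
    # $1 is the optional "/" of an end tag, $2 is the tag name.
    while ( $html =~ m{<(/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>}g ) {
        $out .= "<$1" . lc($2) . ">";
    }
    return $out;
}

print tags_only('<div class="a">Hello <b>world</b></div>'), "\n";
# prints <div><b></b></div>
```

For real-world HTML, a parser-based approach such as the XML::Twig or XML::LibXSLT answers is likely to be more robust than a regex.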
