How can I use Perl to insert content between the 2nd and 3rd paragraphs in HTML?

Date: 2022-09-01 15:21:33

I'm trying to match the point between the 2nd and 3rd paragraphs so that I can insert some content. Paragraphs are delimited either by <p> or by two newlines, and the two styles are mixed. Here's an example:

text text text text
text text text text

<p>
text text text text
text text text text
</p>
<--------------------------- want to insert text here
<p>
text text text text
text text text text
</p>

3 solutions

#1


3  

Assuming there are no nested paragraphs...

my $to_insert = get_thing_to_insert();
$text =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;

should just about do it.
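
For anyone who wants to try this, here is a minimal runnable sketch against the sample from the question. The inserted string is only a stand-in for whatever get_thing_to_insert() would return, and the s{}{} delimiters are used so the slash in </p> needs no escaping.

```perl
use strict;
use warnings;

# Sample input from the question: a bare paragraph, then two <p> blocks.
my $text = join "\n",
    'text text text text',
    'text text text text',
    '',
    '<p>',
    'text text text text',
    'text text text text',
    '</p>',
    '<p>',
    'text text text text',
    'text text text text',
    '</p>',
    '';

# Hypothetical stand-in for get_thing_to_insert().
my $to_insert = "\n<!-- inserted -->";

# Work on a copy so the original text stays available for comparison.
(my $result = $text) =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;

print $result;
```

Note that the two repetitions have to sit directly next to each other. The sample matches because the blank line comes immediately before the first <p>; input such as three <p> blocks separated by single newlines would not match at all.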

With extended formatting:

$text =~ s{
    (             # a group
        (?:       # containing ...
            <p>   # the start of a paragraph
            .*?   # to...
            </p>  # its closing tag
        |         # OR...
            \n\n  # two newlines alone
        ){2}      # twice
    )             # and take all of that...
}
{$1$to_insert}xms;  # and append $to_insert to it

Note that I used \n\n as the delimiter. If you're using a Windows-style text file, this needs to be \r\n\r\n; if the line endings might be mixed, use something like \r?\n\r?\n to make the \r optional.
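
As a small sketch of the optional-\r variant, the blank-line delimiter can be kept in its own qr// object and interpolated into the pattern. The sample text with CRLF line endings and the inserted comment are assumptions made up for the illustration.

```perl
use strict;
use warnings;

# A blank line that tolerates both LF and CRLF line endings.
my $blank = qr/\r?\n\r?\n/;

# Assumed sample: Windows-style line endings throughout.
my $text = "first paragraph\r\n\r\n"
         . "<p>second paragraph</p>\r\n"
         . "<p>third paragraph</p>\r\n";

# Hypothetical content to insert.
my $to_insert = "\r\n<!-- inserted -->";

(my $result = $text) =~ s{((?:<p>.*?</p>|$blank){2})}{$1$to_insert}s;

print $result;
```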

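Also note that the <p> blocks can have double newlines in them: a match attempt that starts at <p> runs through to </p> (the /s flag lets .*? cross newlines), so blank lines inside a block are not counted as delimiters. Swapping the two alternatives around does not change this, because <p> and \n\n can never match at the same position; to make blank lines inside a <p> block count, .*? would have to be stopped from crossing them.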
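
A quick way to check how blank lines inside a <p> block are treated is to run both orderings of the alternation on the same input. The sample below is an assumption (a first block containing a blank line, followed directly by two more blocks). In this sketch both orderings give the same result, since the two alternatives start with different characters and so never compete at the same position.

```perl
use strict;
use warnings;

# Assumed sample: the first <p> block contains a blank line.
my $text = "<p>one\n\nstill one</p><p>two</p><p>three</p>";

# Hypothetical content to insert.
my $to_insert = '<!-- inserted -->';

# <p> alternative first, as in the answer.
(my $p_first = $text) =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;

# The two alternatives swapped.
(my $nl_first = $text) =~ s{((?:\n\n|<p>.*?</p>){2})}{$1$to_insert}s;

print "$p_first\n";
print $p_first eq $nl_first ? "same result\n" : "different result\n";
```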

#2


0  

Instead of using a regular expression, use an HTML tree walker to find the second paragraph and add whatever you like. I talked about this sort of thing in my article "Process HTML with a Perl module" for InformIT.

The advantage of something like HTML::TreeBuilder is that you deal with the logical structure of the HTML rather than contending with the position and order of random characters in a regular expression. If the structure stays the same, a tree walker should keep working. If you change almost anything, the regex is probably going to break.

An HTML::TreeBuilder example looks something like this:

#!perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::Element;

my $html  = HTML::TreeBuilder->new;
my $root  = $html->parse_file( *DATA );

my $second = ( $root->find_by_tag_name('p') )[1];

my $new_para = HTML::Element->new( 'p' );
$new_para->push_content( 'Add this line' );

$second->postinsert( $new_para );

print $root->as_HTML( undef, "\t", {} );

__END__
<p>
This is the first paragraph
</p>

<p>
This is the second paragraph
</p>

<p>
This is the last paragraph
</p>

If you need to clean up your data first, you can throw in some steps to use HTML::Tidy with the enclose_text option.

#3


-1  

Text:

my $text = '
text text text text
text text text text

<p>
text text text text
text text text text
</p>
<p>
text text text text
text text text text
</p>
';

This should work with:

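our $cnt   = 0;   # delimiters seen so far
our $where = 2;   # insert after this many delimiters

my $new_stuff = '<- want to insert text here';
$text =~ s/
           (
            (?:\n|<\/p>)\n          # a delimiter: blank line, or closing tag plus newline
           )
           (?{ ++$cnt })            # count it
           (??{ $cnt==$where ? '' : '(?!)' })   # fail unless this is the wanted one
          /$1$new_stuff\n/xs;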

Result:

text text text text
text text text text

<p>
text text text text
text text text text
</p>
<- want to insert text here
<p>
text text text text
text text text text
</p>
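
The same counting idea can also be sketched without embedded code blocks in the regex, using a plain m//g loop and pos() instead. This swaps in a different technique from the substitution above; the shortened sample text is an assumption for the illustration.

```perl
use strict;
use warnings;

# Assumed, shortened version of the sample text.
my $text = "\ntext\ntext\n\n<p>\ntext\n</p>\n<p>\ntext\n</p>\n";

my $new_stuff = '<- want to insert text here';
my $where     = 2;   # insert after this many delimiters
my $cnt       = 0;   # delimiters seen so far

# Walk over the delimiters one by one; pos() is the end of the latest match.
while ( $text =~ m{(?:\n|</p>)\n}g ) {
    next unless ++$cnt == $where;
    substr( $text, pos($text), 0, "$new_stuff\n" );
    last;   # stop as soon as the text has been inserted
}

print $text;
```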

Regards

rbo
