I'm trying to match the point between the 2nd and 3rd paragraphs so that I can insert some content. Paragraphs are delimited either by <p> or by 2 newlines, mixed. Here's an example:
text text text text
text text text text

<p>
text text text text
text text text text</p>
<--------------------------- want to insert text here
<p>
text text text text
text text text text</p>
3 answers
#1
3
Assuming there are no nested paragraphs...
my $to_insert = get_thing_to_insert();
$text =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;
should just about do it.
With extended formatting:
$text =~ s{
    (               # a group
        (?:         # containing ...
            <p>     # the start of a paragraph
            .*?     # to...
            </p>    # its closing tag
        |           # OR...
            \n\n    # two newlines alone.
        ){2}        # twice
    )               # and take all of that...
}
{$1$to_insert}xms;  # and append $to_insert to it
Note, I used \n\n as the delimiter; if you're using a Windows-style text file, this needs to be \r\n\r\n, or if it might be mixed, something like \r?\n\r?\n to make the \r optional.
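As a rough sketch of the mixed line-ending variant: the sample text and the `<!-- inserted -->` marker below are made up for illustration, and the only change from the substitution above is the `\r?\n\r?\n` alternative.

```perl
use strict;
use warnings;

# Hypothetical sample with Windows (CRLF) line endings: a plain first
# paragraph ended by a blank line, then two <p> paragraphs.
my $text      = "one\r\n\r\n<p>two</p>\r\n<p>three</p>\r\n";
my $to_insert = "<!-- inserted -->";

# \r?\n\r?\n accepts both "\n\n" and "\r\n\r\n" as the blank-line delimiter.
$text =~ s{((?:<p>.*?</p>|\r?\n\r?\n){2})}{$1$to_insert}s;

print $text;
```

With this input the marker should end up directly after the closing tag of the second paragraph.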
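Also note that the <p> blocks can have double newlines in them: a match that starts at <p> runs through to the next </p> before any \n\n inside the block is considered. The order of the alternatives around the | does not change this, because <p> and \n\n can never match at the same position, so swapping them around is not enough if you want newlines inside the <p>'s to take priority; the <p>.*?</p> branch itself would have to change.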
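To check the point about blank lines inside a <p> block, here is a small sketch. The sample text and the `[INSERTED]` marker are invented for illustration; the pattern is the one from this answer.

```perl
use strict;
use warnings;

# Hypothetical sample: the second paragraph is a <p> block that itself
# contains a blank line.
my $text      = "first\n\n<p>second, with\n\na blank line inside</p>\n<p>third</p>\n";
my $to_insert = "[INSERTED]";

# The blank line after "first" is delimiter 1, the whole <p>...</p> block
# is delimiter 2; the blank line inside the block is not counted.
$text =~ s{((?:<p>.*?</p>|\n\n){2})}{$1$to_insert}s;

print $text;
```

The marker should land after the closing </p> of the second paragraph, not at the blank line inside it.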
#2
0
Instead of using a regular expression, use an HTML tree walker to find the second paragraph and add whatever you like. I talked about this sort of thing in my article "Process HTML with a Perl module" for InformIT.
The advantage of something like HTML::TreeBuilder is that you deal with the logical structure of the HTML rather than contending with the position and order of random characters in a regular expression. If the structure stays the same, a tree walker should keep working. If you change almost anything, the regex is probably going to break.
An HTML::TreeBuilder example looks something like this:
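#!perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;
# Parse the HTML that follows __END__ into a tree.
my $html = HTML::TreeBuilder->new;
my $root = $html->parse_file( *DATA );
# In list context find_by_tag_name returns every <p>; index 1 is the second one.
my $second = ( $root->find_by_tag_name('p') )[1];
# Build the new paragraph and attach it as the next sibling of the second <p>.
my $new_para = HTML::Element->new( 'p' );
$new_para->push_content( 'Add this line' );
$second->postinsert( $new_para );
# Print the modified tree, indented with tabs, with no end tags omitted.
print $root->as_HTML( undef, "\t", {} );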
__END__
<p>
This is the first paragraph
</p>
<p>
This is the second paragraph
</p>
<p>
This is the last paragraph
</p>
If you need to clean up your data first, you can throw in some steps to use HTML::Tidy with the enclose_text option.
#3
-1
Text:
my $text = '
text text text text
text text text text

<p>
text text text text
text text text text
</p>
<p>
text text text text
text text text text
</p>
';
This should work with:
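our $cnt = 0;      # package variables, shared with the (?{ }) blocks below
our $where = 2;    # insert after the 2nd delimiter
my $new_stuff = '<- want to insert text here';
$text =~ s/
    (
        (?:\n|<\/p>)\n               # a delimiter: a blank line, or a closing p tag at end of line
    )
    (?{ ++$cnt })                    # count every delimiter that is found
    (??{ $cnt==$where?'':'!$' })     # succeed only on the wanted one, otherwise force a failure
/$1$new_stuff\n/xs;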
Result:
text text text text
text text text text

<p>
text text text text
text text text text
</p>
<- want to insert text here
<p>
text text text text
text text text text
</p>
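For comparison, the same count-then-insert idea can be sketched without embedded code blocks, using a plain `m//g` loop and `pos()`. This is an alternative under the same assumptions about the input, not the code above; the shortened sample text is made up.

```perl
use strict;
use warnings;

# Shortened, hypothetical version of the sample text above.
my $text      = "text text\ntext text\n\n<p>\ntext text\n</p>\n<p>\ntext text\n</p>\n";
my $where     = 2;                               # insert after the 2nd delimiter
my $new_stuff = "<- want to insert text here";
my $cnt       = 0;

# Visit the delimiters one at a time; after each match, pos($text) is the
# offset just past that delimiter.
while ( $text =~ m{(?:\n|</p>)\n}g ) {
    next unless ++$cnt == $where;
    substr( $text, pos($text), 0, "$new_stuff\n" );  # zero-length splice = insert
    last;                                            # stop: $text was modified
}

print $text;
```

Here the first delimiter is the blank line and the second is the first "</p>" plus newline, so the new line is inserted between the two <p> blocks.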
Regards
rbo