如何计算Perl中重叠的子串？

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows

我需要实现一个程序来计算perl中字符串中子字符串的出现次数。我已经实现了如下

sub countnmstr{  $count =0;  $count++ while $_[0] =~ /$_[1]/g;  return $count;}$count = countnmstr("aaa","aa");print "$count\n";

now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.

现在这就是我通常会做的事情。但是,在上面的实现中,我想计算'aaa'中'aa'的出现次数。在这里我得到答案为1似乎是合理的,但我也需要考虑重叠的情况。因此,上述情况应该给出答案为2,因为如果我们考虑重叠,则有两个'aa'。

can anyone suggest how to implement such a function??

任何人都可以建议如何实现这样的功能?

6 个解决方案

#1

See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.

请参阅ysth的答案......我没有意识到该模式可能只包含零宽度断言,并且仍然可以用于此目的。

You can use positive lookahead as suggested by others, and write the function as:

您可以按照其他人的建议使用正向前瞻,并将函数编写为:

sub countnmstr {    my ($haystack, $needle) = @_;    my ($first, $rest) = $needle =~ /^(.)(.*)$/;    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);}

You can also use pos to adjust where the next search picks up from:

您还可以使用pos来调整下一个搜索的位置:

#!/usr/bin/perluse strict; use warnings;sub countnmstr {    my ($haystack, $needle) = @_;    my $adj = length($needle) - 1;    die "Search string cannot be empty!" if $adj < 0;    my $count = 0;    while ( $haystack =~ /\Q$needle/g ) {        pos $haystack -= $adj;        $count += 1;    }    return $count;}print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t2

#2

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.

每个人的答案都变得非常复杂了(噢!daotoad应该把他的评论作为答案!),也许是因为他们害怕山羊运营商。我没有说出来,这正是人们所说的。它使用了一种技巧,即列表赋值的结果是右侧列表中的元素数量。

The Perl idiom for counting matches is then:

用于计数匹配的Perl习语是:

 my $count = () = $_[0] =~ /($pattern)/g;

The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.

goatse部分是=()=,这是两个赋值中间的空列表。山羊的左手部分从山羊的右侧获得计数。请注意,您需要在模式中捕获,因为匹配运算符将在列表上下文中返回该列表。

Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:

现在,你的下一个技巧是你真的想要一个积极的外观(或者可能是前瞻)。外观不消耗字符,因此您无需跟踪位置:

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

你的aaa只是一个例子。如果您有可变宽度图案,则必须使用前瞻。 Perl中的Lookbehinds必须是固定的宽度。

#3

sub countnmstr{    my ($string, $substr) = @_;    return scalar( () = $string =~ /(?=\Q$substr\E)/g );}$count = countnmstr("aaa","aa");print "$count\n";

A few points:

几点:

//g in list context matches as many times as possible.

//列表上下文中的g匹配尽可能多的次数。

\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.

\ Q ... \ E用于自动转义任何元字符,因此您正在进行子字符串计数,而不是子模式计数。

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

使用前瞻(?= ...)会导致每个匹配不“消耗”任何字符串,从而允许在下一个字符处尝试以下匹配。

This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.

这使用相同的功能,其中标量上下文中的列表赋值(在本例中为空列表)返回列表赋值右侧的元素计数作为goatse / flying-lentil / spread-eagle /无论运算符,但是使用标量()而不是标量赋值来提供标量上下文。

$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().

$ _ [0]不是直接使用,而是复制到词法;如果传递的字符串有一个存储的pos(),那么天使用$ _ [0]代替$ string会导致// g在字符串的中途而不是在开头处开始。

Update: s///g is faster, though not as fast as using index:

更新:s /// g更快,但不如使用索引快:

sub countnmstr{    my ($string, $substr) = @_;    return scalar( $string =~ s/(?=\Q$substr\E)//g );}

#4

You could use a lookahead assertion in the regular expression:

您可以在正则表达式中使用前瞻断言:

sub countnmstr {    my @matches = $_[0] =~ /(?=($_[1]))/g;    return scalar @matches;}

I suspect Sinan's suggestion will be quicker though.

我怀疑思南的建议会更快。

#5

you can try this, no more regex than needed.

你可以尝试这个,不再需要正则表达式。

$haystack="aaaaabbbcc";$needle = "aa";while ( 1 ){    $ind = index($haystack,$needle);    if ( $ind == -1 ) {last};    $haystack = substr($haystack,$ind+1);    $count++;}print "Total count: $count\n";

output

$ ./perl.plTotal count: 4

#6

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.

如果速度是一个问题,ghostdog74建议的索引方法(cjm的改进)可能比正则表达式解决方案快得多。

use strict;use warnings;sub countnmstr_regex {    my ($haystack, $needle) = @_;    return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );}sub countnmstr_index {    my ($haystack, $needle) = @_;    my $i = 0;    my $tally = 0;    while (1){        $i = index($haystack, $needle, $i);        last if $i == -1;        $tally ++;        $i ++;    }    return $tally;}use Benchmark qw(cmpthese);my $size = 1;my $h = 'aaa aaaaaa' x $size;my $n = 'aa';cmpthese( -2, {    countnmstr_regex => sub { countnmstr_regex($h, $n) },    countnmstr_index => sub { countnmstr_index($h, $n) },} );__END__# Benchmarks run on Windows.# Result using a small haystack ($size = 1).                     Rate countnmstr_regex countnmstr_indexcountnmstr_regex  93701/s               --             -66%countnmstr_index 271893/s             190%               --# Result using a large haystack ($size = 100).                   Rate countnmstr_regex countnmstr_indexcountnmstr_regex  929/s               --             -81%countnmstr_index 4960/s             434%               --

#1