Which regular expression is more efficient?

I'm parsing some big log files and have some very simple string matches, for example:

if(m/Some String Pattern/o){
    #Do something
}

It seems simple enough, but in fact most of my matches could be anchored to the start of the line, although the pattern would be longer, for example:

if(m/^Initial static string that matches Some String Pattern/o){
    #Do something
}

Obviously this is a longer regular expression and so more work to match. However, I can use the start-of-line anchor, which would allow a line to be discarded as a failed match sooner.

It is my hunch that the latter would be more efficient. Can anyone back me up or shoot me down? :-)

7 Solutions

#1

I think you'll find that starting your regex with ^ will definitely be faster, because the regex engine doesn't have to look any further than the left edge of the string for a match.

This is something that you could easily test and measure, of course. Do a regex match 10 million times or so, measure how long it takes, then try again with a different regex.

#2

The line anchor makes it faster. I have to add, though, that the //o modifier is not necessary here. Because the pattern contains no interpolated variables, it does nothing. That's a code smell to me.

There used to be valid uses for //o, but these days that functionality is provided by qr//.
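As a minimal sketch of the qr// approach, assuming a script that checks each log line against a small set of rules: the patterns are compiled once with qr// and then reused for every line. The rule labels, the $prefix variable, and the classify() helper are hypothetical names made up for illustration; the pattern text is borrowed from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical prefix, interpolated into a pattern. With qr// the
# pattern is compiled once, here, which is the job //o used to do.
my $prefix = 'Initial static string';

# Hypothetical rule table: [ compiled pattern, label ].
my @rules = (
    [ qr/^\Q$prefix\E that matches Some String Pattern/, 'anchored' ],
    [ qr/Some String Pattern/,                           'floating' ],
);

# Return the label of the first rule that matches the line,
# or undef if no rule matches.
sub classify {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($re, $label) = @$rule;
        return $label if $line =~ $re;
    }
    return;
}

my $label = classify('Initial static string that matches Some String Pattern');
print defined $label ? $label : 'none', "\n";
```

Matching directly against the compiled object ($line =~ $re) avoids rebuilding the pattern on each call, so no //o is needed anywhere.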

#3

Speed of an RE depends on two factors, the RE itself and the data being passed through the RE. In general, an anchored RE (start or end) with no backtracking will be faster than others. But if you're processing a file where every line is empty, there's no speed difference between /^hello/ and /hello/ (at least if the RE engine is written correctly).

But the rule I follow is: measure, don't guess.

#4

I did some timings as recommended. Here are the results for my app. These cover the whole app, not just the regex searches. It scans 60,000 lines with 11 regular expressions. The short expressions average about 30 characters in length; the longer, anchored ones are about 120.

Short
   real    0m58.780s
   user    0m54.940s
   sys     0m0.790s

Long (anchored)
   real    0m54.260s
   user    0m53.630s
   sys     0m0.490s

Long (not anchored)
   real    0m54.705s
   user    0m54.130s
   sys     0m0.400s

So anchoring the long strings is slightly faster, although not by much. It would appear that if my strings were any longer it might be a different matter.

#5

You can gain tremendous insight into what the regex engine is doing in Perl with the use re 'debug' pragma. It is documented in the re module's documentation (perldoc re).
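As a minimal sketch of how the pragma might be applied to the two patterns from the question (the sample $line and the variable names are assumptions for illustration): the pragma is enabled only inside a block, and the engine's trace of compilation and matching is written to STDERR.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line that both patterns should match.
my $line = 'Initial static string that matches Some String Pattern';

my ($floating_hit, $anchored_hit);
{
    # The pragma is lexically scoped, so only the regexes in this
    # block are traced. The trace goes to STDERR.
    use re 'debug';
    $floating_hit = $line =~ /Some String Pattern/;
    $anchored_hit = $line =~ /^Initial static string that matches Some String Pattern/;
}

print "floating matched\n" if $floating_hit;
print "anchored matched\n" if $anchored_hit;
```

Comparing the two traces should show how the engine treats each pattern, including any literal substring it looks for before running the full match. The exact output format varies between Perl versions.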

It is always helpful to review Perl's suggested performance techniques, including the suggested timing methods.

If I run this small test:

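#!/usr/bin/perl

use strict;
use warnings;
use Benchmark;

# Literal text that every pattern below has to find.
my $target = "aeiou";

# Filler text that does not contain $target.
my $str = "lkdjflzdjfljdsflkjasdjf asldkfj lasdjf dslfj sldfj asld alskdfj lasd f";

# The subject string: filler followed by the target.
my $str2 = $str . $target;

# //o makes each interpolated pattern compile only once.
timethese(10_000_000, {
    # Unanchored literal search.
    'float' => sub {
        die "no match" unless $str2 =~ m/$target/o;
    },
    # Anchored at the start, with a greedy .* before the target.
    'anchored' => sub {
        die "no match" unless $str2 =~ m/^.*$target/o;
    },
    # Anchored, with the whole filler string as a literal prefix.
    'prefixed' => sub {
        die "no match" unless $str2 =~ m/^$str$target/o;
    },
});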

I get the output of:

Benchmark: timing 10000000 iterations of anchored, float, prefixed...
  anchored:  4 wallclock secs ( 3.46 usr +  0.01 sys =  3.47 CPU) @ 2881844.38/s 
     float:  2 wallclock secs ( 1.87 usr +  0.00 sys =  1.87 CPU) @ 5347593.58/s 
  prefixed:  4 wallclock secs ( 3.05 usr +  0.01 sys =  3.06 CPU) @ 3267973.86/s

This leads to the conclusion that the non-anchored (floating) version is considerably faster in this test. However, a different regex or different input may change that. YMMV, so test, test, test.

#6

Are you saying you can anchor the regex by adding a static prefix, like this?

/^blah blah The Real Regex/

That certainly won't hurt performance, and it will probably help, but not for the reason you think. Although they're best known for the "magical" stuff like anchors and lookarounds and capturing groups, what regex engines are best at is matching literal sequences of characters. The longer the sequence, the faster the match (up to a point, of course).

In other words, it's the addition of the static prefix, not the anchor, that's giving you the boost.
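One way to check this claim on your own data is to time three variants separately: the short pattern, the long pattern without the anchor, and the long pattern with the anchor. The following is a rough sketch, assuming two made-up sample lines; the names @lines, %candidates, and count_matches() are hypothetical, and real log lines should be substituted before drawing any conclusion.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: one matching and one non-matching line.
my @lines = (
    'Initial static string that matches Some String Pattern and more',
    'An unrelated line that does not contain the pattern at all',
);

# The same target with and without the static prefix and the anchor.
my %candidates = (
    short         => qr/Some String Pattern/,
    prefix_only   => qr/Initial static string that matches Some String Pattern/,
    prefix_anchor => qr/^Initial static string that matches Some String Pattern/,
);

# Count how many sample lines a compiled pattern matches.
sub count_matches {
    my ($re) = @_;
    return scalar grep { $_ =~ $re } @lines;
}

# Build one timing closure per candidate pattern.
my %tests;
for my $name (keys %candidates) {
    my $re = $candidates{$name};
    $tests{$name} = sub { count_matches($re) };
}

# Print a comparison table of iterations per second.
cmpthese(500_000, \%tests);
```

If prefix_only and prefix_anchor come out close to each other while both beat short, that would support the idea that the prefix matters more than the anchor. The result depends heavily on the input, so it should be treated as a measurement of your data and not as a general rule.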

#7

I vote for the one anchored at the beginning for exactly the reason you state!
