Why does adding one more alternative make my regex 600 times slower?

I noticed something weird while testing a simple Perl script that's supposed to filter out filenames beginning with certain prefixes.

Basically, I'm constructing a regex like this:

my $regex = join "|", map quotemeta, @prefixes;
$regex = qr/^($regex)/;   # anchor the regex and precompile it

Here, in the scenario I originally tested, @prefixes consists of 32-character hexadecimal strings. What I found is that everything ran nice and smoothly up to 6,552 prefixes — but as soon as I tried 6,553, the execution time of the script jumped by a factor of over 25 (from 1.8 seconds to 51 seconds)!

I played around with it, and constructed the following benchmark. I originally used 32-character hex strings, like in my original program, but found that if I cut the length of the strings down to just 8 characters, the threshold value moved higher — to 16,383, in fact — while the slowdown factor got dramatically higher yet: the regexp with 16,383 alternatives is almost 650 times slower than the one with only 16,382!

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

my $count = shift || 10;

our @items = map unpack("H8", pack "V", $_), 0..99999;

our $nA = 16382; our $reA = join "|", @items[1 .. $nA];
our $nB = 16383; our $reB = join "|", @items[1 .. $nB];

$_ = qr/^($_)/ for $reA, $reB;  # anchor and compile regex

my $results = timethese( $count, {
    $nA => q{ our (@items, $nA, $reA); $nA == grep /$reA/, @items or die; },
    $nB => q{ our (@items, $nB, $reB); $nB == grep /$reB/, @items or die; },
} );
cmpthese( $results );

Here are the results of running this benchmark on my laptop, using Perl (v5.18.2):

Benchmark: timing 10 iterations of 16382, 16383...
     16382:  2 wallclock secs ( 1.79 usr +  0.00 sys =  1.79 CPU) @  5.59/s (n=10)
     16383: 1163 wallclock secs (1157.28 usr +  2.70 sys = 1159.98 CPU) @  0.01/s (n=10)
      s/iter  16383  16382
16383    116     --  -100%
16382  0.179 64703%     --

Note the 64,703% speed difference.

My original hunch, based on the numerical coincidence that 6553 ≈ 2¹⁶ / 10, was that this might be some kind of arbitrary limit within the Perl regex engine, or maybe that there might be some kind of array of 10-byte structs that was limited to 64 kB, or something. But the fact that the threshold number of alternatives depends on their length makes things more confusing.

(On the other hand, it's clearly not just about the length of the regex, either; the original regex with 6,553 32-byte alternatives was 2 + 6,553×33 = 216,251 bytes long, while the one with 16,383 8-byte alternatives is only 2 + 16,383×9 = 147,449 bytes long.)
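
The length arithmetic above can be cross-checked with a short script. This is only a sketch: `pattern_length` is a hypothetical helper name, and the second check builds the pattern from consecutive `%08x` strings, which are stand-ins for the benchmark's items (same length, different values).

```perl
use strict;
use warnings;

# Length of "^(" . join("|", @alts) . ")" for $n alternatives of $len
# characters each: every alternative contributes $len characters plus one
# separator or closing parenthesis, and "^(" adds two more.
sub pattern_length {
    my ($n, $len) = @_;
    return 2 + $n * ($len + 1);
}

# Cross-check the formula against a pattern that is actually built.
# The alternatives are illustrative 8-character hex strings.
my @alts  = map { sprintf "%08x", $_ } 1 .. 16_383;
my $built = "^(" . join("|", @alts) . ")";

print pattern_length(6_553, 32), "\n";   # 216251
print pattern_length(16_383, 8), "\n";   # 147449
print length($built), "\n";              # 147449
```

So the pattern that hits the slowdown with 8-character alternatives really is the shorter of the two, which rules out raw pattern length as the trigger.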

What is causing this weird jump in regex matching time, and why does it happen at that specific point?

1 Answer

#1

For a long time, perl's TRIE optimization has not been applied when the initial compilation of the regex produces (I think) longjmp rather than jmp instructions. Whether that happens depends on the number of alternatives, the total length of the strings involved, and what else appears (earlier?) in the regex.

See the difference between:

perl -we'use re "debug"; qr/@{[join"|","a".."afhd"]}/'

and

perl -we'use re "debug"; qr/@{[join"|","a".."afhe"]}/'

You can break your alternation down into smaller chunks, precompile them separately, and combine them with, e.g., (??{$rxa})|(??{$rxb})|(??{$rxc}).
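
A minimal sketch of that chunking workaround, under some loudly stated assumptions: the sample prefixes, the chunk size of 10,000, and the file names are all made up for illustration, and the chunk size is merely a guess at something comfortably below the threshold seen in the question, not a documented limit. Whether each chunk actually gets the TRIE optimization on a given perl is best confirmed with `use re "debug"`.

```perl
use strict;
use warnings;

# Hypothetical sample data: 30,000 eight-character hex prefixes.
my @prefixes = map { sprintf "%08x", $_ } 1 .. 30_000;

# Assumed chunk size, chosen to stay well below the observed threshold.
my $chunk_size = 10_000;

# Compile each chunk of alternatives as its own, separate regex.
my @pending = @prefixes;
my @chunks;
while (my @part = splice @pending, 0, $chunk_size) {
    my $alt = join "|", map quotemeta, @part;
    push @chunks, qr/$alt/;
}
my ($rxa, $rxb, $rxc) = @chunks;   # exactly three chunks for this data

# (??{ ... }) defers to the precompiled chunk at match time, so the three
# alternations are never merged into one huge regex program.
my $regex = qr/^(?:(??{$rxa})|(??{$rxb})|(??{$rxc}))/;

my @names = ("00000001.txt", "00007530.log", "ffffffff.dat");
my @hits  = grep { $_ =~ $regex } @names;
print "@hits\n";   # 00000001.txt 00007530.log
```

Because the code blocks are written literally in the source rather than interpolated from a string, this needs no `use re 'eval'`. For a variable number of chunks, the fixed three-way alternation would have to be replaced by something else, such as trying each chunk's regex in a loop.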
