哪个是更有效的regex?

时间:2021-10-06 12:50:26

I recently received some job applicant test results in which one person claimed the the solution they gave was more efficient (I won't say which since I don't want to influence the answers). Needless to say, I was sceptical, but I don't know enough about the inner workings of RE compilers to comment intelligently.

我最近收到了一些求职者的测试结果,其中一个人声称他们给出的解决方案更有效(我不会说哪个,因为我不想影响答案)。毫无疑问,我对此持怀疑态度,但我对编译器的内部工作机制了解不够,无法做出明智的评论。

The question was: Give a regular expression to recognise numbers from 0 through 99 inclusive.

问题是:给出一个正则表达式来识别从0到99的数字。

The answers were:

答案是:

[0-9]{1,2}
[0-9]?[0-9]
[0-9]|([0-9][0-9])

I'd be interested as to why any of these are faster (or better in any other way). Bonus points for providing evidence rather than conjecture, but I'll still take conjecture if you make it sound convincing enough :-)

我很想知道为什么这些都更快(或者更好)。提供证据而不是猜想的额外分数,但是如果你让它听起来足够有说服力的话,我还是会猜想的。

7 个解决方案

#1


10  

The expression [0-9]{1,2} should be the fastest I would imagine, although it will depend on the specific engine.

表达式[0-9]{1,2}应该是我能想到的最快的,尽管它取决于特定的引擎。

My reasoning is:

我的理由是:

  • [0-9]{1,2} - this exactly describes what you want to match.
  • [0-9]{1,2} -这正好描述了您想匹配的内容。
  • [0-9]?[0-9] - this could result in a backtrack if the first match fails.
  • [0 - 9]吗?[0-9] -如果第一场比赛失败,这可能会导致倒退。
  • [0-9]|([0-9][0-9]) - this requires the first character to be checked twice if it fails, and the parentheses here are unnecessary and cause an unneeded capture.
  • [0-9]|([0-9][0-9])——如果第一个字符失败,则需要检查两次,这里的括号是不必要的,会导致不必要的捕获。

Here are the iterations per second I got when testing this in .NET (without RegexOptions.Compiled):

以下是我在。net中测试时得到的每秒迭代数(没有regexoptions.e汇编):

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             749086             800313           870748
"^[0-9]?[0-9]$"            731951             725984           740152
"^(?:[0-9]|([0-9][0-9]))$" 564654             687248           870378

With RegexOptions.Compiled:

RegexOptions.Compiled:

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             1486212            1592535          1831843
"^[0-9]?[0-9]$"            1301557            1448812          1559193
"^(?:[0-9]|([0-9][0-9]))$" 1131179            1303213          1394146

And as a graph:

作为一个图:

哪个是更有效的regex?

Note: I modified each regular expression to require an exact match rather than performing a search.

注意:我修改了每个正则表达式以要求精确匹配,而不是执行搜索。

#2


7  

At least in theory, identical regexes like these will yield identical automata. A DFA-based matcher is going to match one character at a time and have the different possible branches encoded in its states (as opposed to taking one branch at a time and then backtracking upon failure), so the performance of each will be the same.

至少在理论上,这些相同的正则表达式会产生相同的自动机。基于dfa的matcher将每次匹配一个字符,并在其状态中编码不同的可能分支(而不是每次只处理一个分支,然后在失败时进行回溯),因此每个分支的性能都是相同的。

All three regexes would be matched by this DFA:

这三个regex将与DFA匹配:

+---+  0-9  +---+  0-9  +---+  *  .---.
| A |  -->  | B |  -->  | C | --> (ERR)
+---+       +---+       +---+     '---'
  |          / \          |
  | *     * /   \ $       | $ 
  V        /     \        V
.---.     /       \     .---.
(ERR) <--'         '--> (ACC)
'---'                   '---'

State A: Start state. Goes to B if it sees a digit, otherwise to an ERROR state.
State B: One digit matched so far. EOL ($) is ACCEPTED. A digit moves to C. Anything else is an ERROR.
State C: Two digits matched. EOL is ACCEPTED, anything else is an ERROR.

状态:开始状态。如果它看到一个数字,就会变成B,否则就会变成一个错误状态。状态B:到目前为止有一个数字匹配。生物(美元)被接受。一个数字移到c,其他都是错误。状态C:匹配的两位数。EOL是可接受的,其他的都是错误。

That's my language theoretical answer. I can't speak to real world regex engine implementations. I'm ignoring the capturing semantics of the parentheses as I am guessing that's not the point of the question. Automata also don't handle other "non-theoretical" constructs like greediness, lookahead, etc. At least not in their textbook presentation.

这就是我的语言理论回答。我无法与真实世界的regex引擎实现对话。我忽略了圆括号的捕获语义,因为我猜这不是问题的重点。自动机也不会处理其他的“非理论”结构,比如贪婪、前瞻等等。至少在教科书的介绍中是这样。

#3


4  

Without knowing the regex engine, one cannot even decide whether those are correct.

在不知道regex引擎的情况下,我们甚至无法判断它们是否正确。

For example, a POSIX ERE is longest-leftmost, not leftmost-longest, so it will choose the longest in a series of alternatives, and therefore choose a string matching "ab" against /a|ab/ will match the whole string, "ab". But a normal backtracking NFA the way you most often see would do something else there: it would care about ordering, and so matching the same "ab" string against the same /a|ab/ pattern would in them select just the beginning part, "a".

例如,POSIX ERE是最左的,而不是最左的,所以它将在一系列的选项中选择最长的,因此选择一个字符串,将ab与|ab/匹配,从而匹配整个字符串ab。但是你经常看到的一种正常的回溯NFA会做一些其他的事情:它会关心排序,所以将相同的“ab”字符串与相同的/ |ab/ pattern匹配,只选择开头的部分“a”。

The next question is the capture group in the same pattern. If they are intentional, they are odd, since you are keeping two-digit numbers but not one-digit numbers. The other patterns do not do that, yet they are said to be identical in behavior. So I am going to assume that they are in error here. Otherwise the capture group’s memory use will of course cost more to squirrel away than it would take not to do so.

下一个问题是采用相同模式的捕获组。如果它们是故意的,它们是奇数,因为你保留的是两位数,而不是一位数。其他的模式不能做到这一点,但据说它们在行为上是相同的。我假设它们是错误的。否则,捕获组的内存使用当然要比不这样做花费更多。

The next problem is the lack of any anchors whatsoever. Again, we cannot know if it those are correct, because it is not clear what the input set would look like nor what that particular engine does with unanchored patterns. Most engines will search everywhere in the string, but some of the less programmer-friendly one will “helpfully” add beginning-of-line (BOL) and end-of-line (EOL) anchors there. In more customary engines, where this does not occur, a zip code in the middle of the line would also match, since five digits obviously contains one- and two-digit substrings. Whether you would want ^ and $ anchors, or \b anchors, I can’t guess.

下一个问题是缺乏任何锚。同样,我们也不知道它们是否正确,因为不清楚输入集是什么样子,也不清楚那个特定的引擎是如何处理无锚模式的。大多数引擎都会在字符串中搜索,但是一些不太适合程序员的引擎会“有帮助地”添加开始线(BOL)和结束线(EOL)锚。在更常见的引擎中,如果不出现这种情况,那么行中间的邮政编码也会匹配,因为五位数显然包含一和两位数的子字符串。你是否想要^和$锚,或\ b锚,我不能猜测。

So I have to make some guesses here. I’m going to leave the anchors off, but I’m going to reorder the third versions branch, because otherwise you can never match a two digit number with a the normal (non-POSIX) kinds of backtracking NFAs most things run.

这里我要做一些猜测。我将不使用锚点,但是我将重新排序第三个版本分支,因为否则,您永远无法将一个两位数与正常(非posix)类型的回溯NFAs匹配。

Before even considering timing, it can really pay to look at what sort of program the regex compiler builds out of those patterns.

在考虑时间安排之前,看看regex编译器使用这些模式构建什么样的程序是值得的。

% perl -Mre=debug -ce '@pats = ( qr/[0-9]{1,2}/, qr/[0-9]?[0-9]/, qr/[0-9][0-9]|[0-9]/ )'
Compiling REx "[0-9]{1,2}"
Final program:
   1: CURLY {1,2} (14)
   3:   ANYOF[0-9][] (0)
  14: END (0)
stclass ANYOF[0-9][] minlen 1 
Compiling REx "[0-9]?[0-9]"
synthetic stclass "ANYOF[0-9][]".
Final program:
   1: CURLY {0,1} (14)
   3:   ANYOF[0-9][] (0)
  14: ANYOF[0-9][] (25)
  25: END (0)
stclass ANYOF[0-9][] minlen 1 
Compiling REx "[0-9][0-9]|[0-9]"
Final program:
   1: BRANCH (24)
   2:   ANYOF[0-9][] (13)
  13:   ANYOF[0-9][] (36)
  24: BRANCH (FAIL)
  25:   ANYOF[0-9][] (36)
  36: END (0)
minlen 1 
-e syntax OK
Freeing REx: "[0-9]{1,2}"
Freeing REx: "[0-9]?[0-9]"
Freeing REx: "[0-9][0-9]|[0-9]"

It really is a good idea to look at the compiled patterns. It can be even more instructive to observe the compiled pattern being executed. Here we’ll watch both:

查看已编译的模式确实是一个好主意。观察正在执行的编译模式可能更有意义。在这里我们会观察这两种:

% perl -Mre=debug -e '"aabbbababbaaqcccaaaabcacabba" =~ /abc|bca|cab|caab|baac|bab|aaa|bbb/'
Compiling REx "abc|bca|cab|caab|baac|bab|aaa|bbb"
Final program:
   1: TRIEC-EXACT[abc] (25)
      <abc> 
      <bca> 
      <cab> 
      <caab> 
      <baac> 
      <bab> 
      <aaa> 
      <bbb> 
  25: END (0)
stclass AHOCORASICKC-EXACT[abc] minlen 3 
Matching REx "abc|bca|cab|caab|baac|bab|aaa|bbb" against "aabbbababbaaqcccaaaabcacabba"
Matching stclass AHOCORASICKC-EXACT[abc] against "aabbbababbaaqcccaaaabcacabba" (28 chars)
   0 <> <aabbbababb>         | Charid:  1 CP:  61 State:    1, word=0 - legal
   1 <a> <abbbababba>        | Charid:  1 CP:  61 State:    2, word=0 - legal
   2 <aa> <bbbababbaa>       | Charid:  2 CP:  62 State:   11, word=0 - fail
   2 <aa> <bbbababbaa>       | Fail transition to State:    2, word=0 - legal
   3 <aab> <bbababbaaq>      | Charid:  2 CP:  62 State:    3, word=0 - fail
   3 <aab> <bbababbaaq>      | Fail transition to State:    5, word=0 - legal
   4 <aabb> <bababbaaqc>     | Charid:  2 CP:  62 State:   13, word=0 - legal
   5 <aabbb> <ababbaaqcc>    | Charid:  1 CP:  61 State:   14, word=8 - accepting
Matches word #8 at position 2. Trying full pattern...
   2 <aa> <bbbababbaa>       |  1:TRIEC-EXACT[abc](25)
   2 <aa> <bbbababbaa>       |    State:    1 Accepted:    0 Charid:  2 CP:  62 After State:    5
   3 <aab> <bbababbaaq>      |    State:    5 Accepted:    0 Charid:  2 CP:  62 After State:   13
   4 <aabb> <bababbaaqc>     |    State:   13 Accepted:    0 Charid:  2 CP:  62 After State:   14
   5 <aabbb> <ababbaaqcc>    |    State:   14 Accepted:    1 Charid:  8 CP:   0 After State:    0
                                  got 1 possible matches
                                  only one match left: #8 <bbb>
   5 <aabbb> <ababbaaqcc>    | 25:END(0)
Match successful!
Freeing REx: "abc|bca|cab|caab|baac|bab|aaa|bbb"

Here the compiler got really clever on us, and compiled that into an Aho–Corasick trie structure. Obviously this is going to perform quite differently from how a normal backtracking NFA would on the same program.

在这里,编译器变得非常聪明,并将它编译成一个Aho-Corasick trie结构。很明显,这与正常的回溯NFA在同一个程序中的表现是完全不同的。

Anyway, here are the timing for your patterns, or close to them. I added an alternate formulation for number two, and I swapped the ordering of the alternatives in number three.

无论如何,这是你的模式的时间,或者接近它们。我增加了另一种表述二的公式,然后我交换了第三个选项的顺序。

testing against short_fail
                 Rate     second      first      third second_alt
second      9488823/s         --        -9%       -21%       -29%
first      10475308/s        10%         --       -13%       -22%
third      11998438/s        26%        15%         --       -11%
second_alt 13434377/s        42%        28%        12%         --

testing against long_fail
                 Rate     second      first      third second_alt
second     11221411/s         --        -3%        -5%        -5%
first      11618967/s         4%         --        -1%        -1%
third      11776451/s         5%         1%         --        -0%
second_alt 11786700/s         5%         1%         0%         --
testing against short_pass

                 Rate      first second_alt     second      third
first      11720379/s         --        -4%        -7%        -7%
second_alt 12199048/s         4%         --        -3%        -4%
second     12593191/s         7%         3%         --        -1%
third      12663378/s         8%         4%         1%         --

testing against long_pass
                 Rate      third     second      first second_alt
third      11135053/s         --        -1%        -5%        -8%
second     11221655/s         1%         --        -4%        -7%
first      11716924/s         5%         4%         --        -3%
second_alt 12042240/s         8%         7%         3%         --

That was produce by this program:

这是由这个项目制作的

#!/usr/bin/env perl
use Benchmark qw<cmpthese>;

$short_fail = "a" x   1;
$long_fail  = "a" x 600;

$short_pass = $short_fail . 42;
$long_pass  = $long_fail  . 42;

for my $name (qw< short_fail long_fail short_pass long_pass >) {   
    print "\ntesting against $name\n";
    $_ = $$name;    
    cmpthese 0 => {
        first       => '/[0-9]{1,2}/',
        second      => '/[0-9]?[0-9]/',
        second_alt  => '/[0-9][0-9]?/',
        third       => '/[0-9][0-9]|[0-9]/',
    }    
}

Here are numbers with anchors added:

下面是添加了锚的数字:

testing against short_fail
                 Rate      first     second second_alt      third
first      11720380/s         --        -3%        -4%       -11%
second     12058622/s         3%         --        -1%        -9%
second_alt 12180583/s         4%         1%         --        -8%
third      13217006/s        13%        10%         9%         --
testing against long_fail
                 Rate      third      first second_alt     second
third      11378120/s         --        -2%        -4%       -12%
first      11566419/s         2%         --        -2%       -10%
second_alt 11830740/s         4%         2%         --        -8%
second     12860517/s        13%        11%         9%         --
testing against short_pass
                 Rate     second      third second_alt      first
second     11540465/s         --        -5%        -5%        -7%
third      12093336/s         5%         --        -0%        -3%
second_alt 12118504/s         5%         0%         --        -2%
first      12410348/s         8%         3%         2%         --
testing against long_pass
                 Rate      first     second second_alt      third
first      11423466/s         --        -1%        -4%        -7%
second     11545540/s         1%         --        -3%        -7%
second_alt 11870086/s         4%         3%         --        -4%
third      12348377/s         8%         7%         4%         --

And here is the minimally modified program that produced the second set of numbers:

这是一个最小修改的程序产生了第二组数字:

#!/usr/bin/env perl
use Benchmark qw<cmpthese>;

$short_fail = 1  . "a";
$long_fail  = 1  . "a" x 600;

$short_pass =  2;
$long_pass  = 42;

for my $name (qw< short_fail long_fail short_pass long_pass >) {
    print "testing against $name\n";
    $_ = $$name;
    cmpthese 0 => {
        first       => '/^(?:[0-9]{1,2})$/',
        second      => '/^(?:[0-9]?[0-9])$/',
        second_alt  => '/^(?:[0-9][0-9]?)$/',
        third       => '/^(?:[0-9][0-9]|[0-9])$/',
    }
}

#4


3  

If one has to be faster (will likely depend on the regex engine used), then clearly the first one in my view (which can be a simple Morris-Pratt table DFA in contrast to the other two), as the other two are likely to require backtracking or perform additional work:

如果一个必须要更快(可能取决于使用的regex引擎),那么很明显我的第一个(相对于其他两个可以是一个简单的Morris-Pratt表DFA),因为另外两个可能需要回溯或执行额外的工作:

[0-9]?[0-9] - for the case with one digit, the engine will be greedy and match the first digit, then fail the second; backtrack and then succeed

[0 - 9]吗?[0-9] -对于一个数字的情况,引擎会贪婪地匹配第一个数字,然后在第二个数字上失败;回溯,然后成功

[0-9]|([0-9][0-9]) - a capturing group is used here, which slows things down

[0-9]|([0-9][0-9][0-9])-在这里使用一个捕获组,这将减慢速度。

#5


3  

I have no clue about the internals but what about some pseudo benching? :D

我不知道内线是什么,但是一些假的弯曲呢?:D

Python

Python

import re
import time

regs = ["^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"]
numbers = [str(n) for n in range(0, 100)]

result = None

// determine loop overhead
start = time.time()
for e in xrange(0, 10000):
    for n in numbers:
        result = n

loop = time.time() - start


for i in regs:
    r = re.compile(i)
    now = time.time()
    for e in xrange(0, 10000):
        for n in numbers:
            result = r.search(n)

    print (time.time() - now) - loop

Results in Seconds

结果在几秒钟内

0.874
0.869
0.809

JavaScript

JavaScript

var regs = ["^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"]

var numbers = [];
for(var n = 0; n < 100; n++) {
    numbers.push(''+n);
}


// determine loop overhead
var result = null;
var start = new Date().getTime();
for(var e = 0; e < 10000; e++) {
    for(var n = 0; n < 100; n++) {
        result = numbers[n];
    }
}

// test regex
var loop = new Date().getTime() - start;
for(var i = 0; i < regs.length; i++) {
    var r = new RegExp(regs[i]);
    var now = new Date().getTime();
    for(var e = 0; e < 10000; e++) {
        for(var n = 0; n < 100; n++) {
            result = r.exec(numbers[n]);
        }
    }
    console.log((new Date().getTime() - now) - loop); //using document.write here in Browsers
}

Results in Seconds

结果在几秒钟内

Node.js
0.197
0.193
0.226

Opera 11
0.836
0.408
0.372

Firefox 4
2.039
2.491
2.488

So what do we learn? Well Pythons seems to be rather slow, and V8 seems to be quite fast.
But hey benching is always fun!

那么我们能学到什么呢?蟒蛇似乎很慢,V8似乎很快。但是嘿,坐下来总是很有趣!

Update: Java Version

更新:Java版本

import java.util.regex.Pattern;

public class Test {
    public static void main(String args[]) {
        test();
        test();
        test();
        test();
    }

    public static void test() {
        String regs[] = {"^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"};
        String numbers[] = new String[100];
        for(int n = 0; n < 100; n++) {
            numbers[n] = Integer.toString(n);
        }

        // determine loop overhead
        String nresult = "";
        long start = System.nanoTime();
        for(int e = 0; e < 10000; e++) {
            for(int n = 0; n < 100; n++) {
                nresult = numbers[n];
            }
        }

        long loop = System.nanoTime() - start;

        boolean result = false;
        for(int i = 0; i < regs.length; i++) {
            Pattern p = Pattern.compile(regs[i]);

            long now = System.nanoTime();
            for(int e = 0; e < 10000; e++) {
                for(int n = 0; n < 100; n++) {
                    result = p.matcher(numbers[i]).matches();
                }
            }
            System.out.println(((System.nanoTime() - now) - loop) / 1000000);
        }
        System.out.println(result);
        System.out.println(nresult);
    }
}

Results in Seconds (times of the 4th run)

以秒为单位的结果(第4次运行的时间)

0.230
0.262
0.210

#6


1  

Those regex are so trivial it shouldn't matter. However, if I had to pick a more efficient implementation, it would be either [0-9]{1,2} or [0-9][0-9]?, which is not in your choices, since there's no backtracking necessary.

那些正则表达式太琐碎了,不重要。但是,如果我必须选择一个更有效的实现,它将是[0-9]{1,2}还是[0-9][0-9][0-9]?,这不是你的选择,因为没有必要回溯。

#7


1  

Just like C and ++i versus i=i+1, a good regex compiler should compile all three of these to exactly the same finite automaton. If it doesn't, I would consider that a bug.

就像C和+i与i=i+1一样,一个好的regex编译器应该将这三个编译成完全相同的有限自动机。如果没有的话,我会认为这是一个bug。

(Exception: If parenthesized subexpression tagging is enabled, the third would obviously compile to include the extra tagging information.)

(例外:如果启用了圆括号子表达式标记,第三个显然会编译以包含额外的标记信息。)

#1


10  

The expression [0-9]{1,2} should be the fastest I would imagine, although it will depend on the specific engine.

表达式[0-9]{1,2}应该是我能想到的最快的,尽管它取决于特定的引擎。

My reasoning is:

我的理由是:

  • [0-9]{1,2} - this exactly describes what you want to match.
  • [0-9]{1,2} -这正好描述了您想匹配的内容。
  • [0-9]?[0-9] - this could result in a backtrack if the first match fails.
  • [0 - 9]吗?[0-9] -如果第一场比赛失败,这可能会导致倒退。
  • [0-9]|([0-9][0-9]) - this requires the first character to be checked twice if it fails, and the parentheses here are unnecessary and cause an unneeded capture.
  • [0-9]|([0-9][0-9])——如果第一个字符失败,则需要检查两次,这里的括号是不必要的,会导致不必要的捕获。

Here are the iterations per second I got when testing this in .NET (without RegexOptions.Compiled):

以下是我在。net中测试时得到的每秒迭代数(没有regexoptions.e汇编):

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             749086             800313           870748
"^[0-9]?[0-9]$"            731951             725984           740152
"^(?:[0-9]|([0-9][0-9]))$" 564654             687248           870378

With RegexOptions.Compiled:

RegexOptions.Compiled:

Regex                      100% valid input   50% valid input  100% invalid input
"^[0-9]{1,2}$"             1486212            1592535          1831843
"^[0-9]?[0-9]$"            1301557            1448812          1559193
"^(?:[0-9]|([0-9][0-9]))$" 1131179            1303213          1394146

And as a graph:

作为一个图:

哪个是更有效的regex?

Note: I modified each regular expression to require an exact match rather than performing a search.

注意:我修改了每个正则表达式以要求精确匹配,而不是执行搜索。

#2


7  

At least in theory, identical regexes like these will yield identical automata. A DFA-based matcher is going to match one character at a time and have the different possible branches encoded in its states (as opposed to taking one branch at a time and then backtracking upon failure), so the performance of each will be the same.

至少在理论上,这些相同的正则表达式会产生相同的自动机。基于dfa的matcher将每次匹配一个字符,并在其状态中编码不同的可能分支(而不是每次只处理一个分支,然后在失败时进行回溯),因此每个分支的性能都是相同的。

All three regexes would be matched by this DFA:

这三个regex将与DFA匹配:

+---+  0-9  +---+  0-9  +---+  *  .---.
| A |  -->  | B |  -->  | C | --> (ERR)
+---+       +---+       +---+     '---'
  |          / \          |
  | *     * /   \ $       | $ 
  V        /     \        V
.---.     /       \     .---.
(ERR) <--'         '--> (ACC)
'---'                   '---'

State A: Start state. Goes to B if it sees a digit, otherwise to an ERROR state.
State B: One digit matched so far. EOL ($) is ACCEPTED. A digit moves to C. Anything else is an ERROR.
State C: Two digits matched. EOL is ACCEPTED, anything else is an ERROR.

状态:开始状态。如果它看到一个数字,就会变成B,否则就会变成一个错误状态。状态B:到目前为止有一个数字匹配。生物(美元)被接受。一个数字移到c,其他都是错误。状态C:匹配的两位数。EOL是可接受的,其他的都是错误。

That's my language theoretical answer. I can't speak to real world regex engine implementations. I'm ignoring the capturing semantics of the parentheses as I am guessing that's not the point of the question. Automata also don't handle other "non-theoretical" constructs like greediness, lookahead, etc. At least not in their textbook presentation.

这就是我的语言理论回答。我无法与真实世界的regex引擎实现对话。我忽略了圆括号的捕获语义,因为我猜这不是问题的重点。自动机也不会处理其他的“非理论”结构,比如贪婪、前瞻等等。至少在教科书的介绍中是这样。

#3


4  

Without knowing the regex engine, one cannot even decide whether those are correct.

在不知道regex引擎的情况下,我们甚至无法判断它们是否正确。

For example, a POSIX ERE is longest-leftmost, not leftmost-longest, so it will choose the longest in a series of alternatives, and therefore choose a string matching "ab" against /a|ab/ will match the whole string, "ab". But a normal backtracking NFA the way you most often see would do something else there: it would care about ordering, and so matching the same "ab" string against the same /a|ab/ pattern would in them select just the beginning part, "a".

例如,POSIX ERE是最左的,而不是最左的,所以它将在一系列的选项中选择最长的,因此选择一个字符串,将ab与|ab/匹配,从而匹配整个字符串ab。但是你经常看到的一种正常的回溯NFA会做一些其他的事情:它会关心排序,所以将相同的“ab”字符串与相同的/ |ab/ pattern匹配,只选择开头的部分“a”。

The next question is the capture group in the same pattern. If they are intentional, they are odd, since you are keeping two-digit numbers but not one-digit numbers. The other patterns do not do that, yet they are said to be identical in behavior. So I am going to assume that they are in error here. Otherwise the capture group’s memory use will of course cost more to squirrel away than it would take not to do so.

下一个问题是采用相同模式的捕获组。如果它们是故意的,它们是奇数,因为你保留的是两位数,而不是一位数。其他的模式不能做到这一点,但据说它们在行为上是相同的。我假设它们是错误的。否则,捕获组的内存使用当然要比不这样做花费更多。

The next problem is the lack of any anchors whatsoever. Again, we cannot know if it those are correct, because it is not clear what the input set would look like nor what that particular engine does with unanchored patterns. Most engines will search everywhere in the string, but some of the less programmer-friendly one will “helpfully” add beginning-of-line (BOL) and end-of-line (EOL) anchors there. In more customary engines, where this does not occur, a zip code in the middle of the line would also match, since five digits obviously contains one- and two-digit substrings. Whether you would want ^ and $ anchors, or \b anchors, I can’t guess.

下一个问题是缺乏任何锚。同样,我们也不知道它们是否正确,因为不清楚输入集是什么样子,也不清楚那个特定的引擎是如何处理无锚模式的。大多数引擎都会在字符串中搜索,但是一些不太适合程序员的引擎会“有帮助地”添加开始线(BOL)和结束线(EOL)锚。在更常见的引擎中,如果不出现这种情况,那么行中间的邮政编码也会匹配,因为五位数显然包含一和两位数的子字符串。你是否想要^和$锚,或\ b锚,我不能猜测。

So I have to make some guesses here. I’m going to leave the anchors off, but I’m going to reorder the third versions branch, because otherwise you can never match a two digit number with a the normal (non-POSIX) kinds of backtracking NFAs most things run.

这里我要做一些猜测。我将不使用锚点,但是我将重新排序第三个版本分支,因为否则,您永远无法将一个两位数与正常(非posix)类型的回溯NFAs匹配。

Before even considering timing, it can really pay to look at what sort of program the regex compiler builds out of those patterns.

在考虑时间安排之前,看看regex编译器使用这些模式构建什么样的程序是值得的。

% perl -Mre=debug -ce '@pats = ( qr/[0-9]{1,2}/, qr/[0-9]?[0-9]/, qr/[0-9][0-9]|[0-9]/ )'
Compiling REx "[0-9]{1,2}"
Final program:
   1: CURLY {1,2} (14)
   3:   ANYOF[0-9][] (0)
  14: END (0)
stclass ANYOF[0-9][] minlen 1 
Compiling REx "[0-9]?[0-9]"
synthetic stclass "ANYOF[0-9][]".
Final program:
   1: CURLY {0,1} (14)
   3:   ANYOF[0-9][] (0)
  14: ANYOF[0-9][] (25)
  25: END (0)
stclass ANYOF[0-9][] minlen 1 
Compiling REx "[0-9][0-9]|[0-9]"
Final program:
   1: BRANCH (24)
   2:   ANYOF[0-9][] (13)
  13:   ANYOF[0-9][] (36)
  24: BRANCH (FAIL)
  25:   ANYOF[0-9][] (36)
  36: END (0)
minlen 1 
-e syntax OK
Freeing REx: "[0-9]{1,2}"
Freeing REx: "[0-9]?[0-9]"
Freeing REx: "[0-9][0-9]|[0-9]"

It really is a good idea to look at the compiled patterns. It can be even more instructive to observe the compiled pattern being executed. Here we’ll watch both:

查看已编译的模式确实是一个好主意。观察正在执行的编译模式可能更有意义。在这里我们会观察这两种:

% perl -Mre=debug -e '"aabbbababbaaqcccaaaabcacabba" =~ /abc|bca|cab|caab|baac|bab|aaa|bbb/'
Compiling REx "abc|bca|cab|caab|baac|bab|aaa|bbb"
Final program:
   1: TRIEC-EXACT[abc] (25)
      <abc> 
      <bca> 
      <cab> 
      <caab> 
      <baac> 
      <bab> 
      <aaa> 
      <bbb> 
  25: END (0)
stclass AHOCORASICKC-EXACT[abc] minlen 3 
Matching REx "abc|bca|cab|caab|baac|bab|aaa|bbb" against "aabbbababbaaqcccaaaabcacabba"
Matching stclass AHOCORASICKC-EXACT[abc] against "aabbbababbaaqcccaaaabcacabba" (28 chars)
   0 <> <aabbbababb>         | Charid:  1 CP:  61 State:    1, word=0 - legal
   1 <a> <abbbababba>        | Charid:  1 CP:  61 State:    2, word=0 - legal
   2 <aa> <bbbababbaa>       | Charid:  2 CP:  62 State:   11, word=0 - fail
   2 <aa> <bbbababbaa>       | Fail transition to State:    2, word=0 - legal
   3 <aab> <bbababbaaq>      | Charid:  2 CP:  62 State:    3, word=0 - fail
   3 <aab> <bbababbaaq>      | Fail transition to State:    5, word=0 - legal
   4 <aabb> <bababbaaqc>     | Charid:  2 CP:  62 State:   13, word=0 - legal
   5 <aabbb> <ababbaaqcc>    | Charid:  1 CP:  61 State:   14, word=8 - accepting
Matches word #8 at position 2. Trying full pattern...
   2 <aa> <bbbababbaa>       |  1:TRIEC-EXACT[abc](25)
   2 <aa> <bbbababbaa>       |    State:    1 Accepted:    0 Charid:  2 CP:  62 After State:    5
   3 <aab> <bbababbaaq>      |    State:    5 Accepted:    0 Charid:  2 CP:  62 After State:   13
   4 <aabb> <bababbaaqc>     |    State:   13 Accepted:    0 Charid:  2 CP:  62 After State:   14
   5 <aabbb> <ababbaaqcc>    |    State:   14 Accepted:    1 Charid:  8 CP:   0 After State:    0
                                  got 1 possible matches
                                  only one match left: #8 <bbb>
   5 <aabbb> <ababbaaqcc>    | 25:END(0)
Match successful!
Freeing REx: "abc|bca|cab|caab|baac|bab|aaa|bbb"

Here the compiler got really clever on us, and compiled that into an Aho–Corasick trie structure. Obviously this is going to perform quite differently from how a normal backtracking NFA would on the same program.

在这里,编译器变得非常聪明,并将它编译成一个Aho-Corasick trie结构。很明显,这与正常的回溯NFA在同一个程序中的表现是完全不同的。

Anyway, here are the timing for your patterns, or close to them. I added an alternate formulation for number two, and I swapped the ordering of the alternatives in number three.

无论如何,这是你的模式的时间,或者接近它们。我增加了另一种表述二的公式,然后我交换了第三个选项的顺序。

testing against short_fail
                 Rate     second      first      third second_alt
second      9488823/s         --        -9%       -21%       -29%
first      10475308/s        10%         --       -13%       -22%
third      11998438/s        26%        15%         --       -11%
second_alt 13434377/s        42%        28%        12%         --

testing against long_fail
                 Rate     second      first      third second_alt
second     11221411/s         --        -3%        -5%        -5%
first      11618967/s         4%         --        -1%        -1%
third      11776451/s         5%         1%         --        -0%
second_alt 11786700/s         5%         1%         0%         --
testing against short_pass

                 Rate      first second_alt     second      third
first      11720379/s         --        -4%        -7%        -7%
second_alt 12199048/s         4%         --        -3%        -4%
second     12593191/s         7%         3%         --        -1%
third      12663378/s         8%         4%         1%         --

testing against long_pass
                 Rate      third     second      first second_alt
third      11135053/s         --        -1%        -5%        -8%
second     11221655/s         1%         --        -4%        -7%
first      11716924/s         5%         4%         --        -3%
second_alt 12042240/s         8%         7%         3%         --

That was produce by this program:

这是由这个项目制作的

#!/usr/bin/env perl
use Benchmark qw<cmpthese>;

$short_fail = "a" x   1;
$long_fail  = "a" x 600;

$short_pass = $short_fail . 42;
$long_pass  = $long_fail  . 42;

for my $name (qw< short_fail long_fail short_pass long_pass >) {   
    print "\ntesting against $name\n";
    $_ = $$name;    
    cmpthese 0 => {
        first       => '/[0-9]{1,2}/',
        second      => '/[0-9]?[0-9]/',
        second_alt  => '/[0-9][0-9]?/',
        third       => '/[0-9][0-9]|[0-9]/',
    }    
}

Here are numbers with anchors added:

下面是添加了锚的数字:

testing against short_fail
                 Rate      first     second second_alt      third
first      11720380/s         --        -3%        -4%       -11%
second     12058622/s         3%         --        -1%        -9%
second_alt 12180583/s         4%         1%         --        -8%
third      13217006/s        13%        10%         9%         --
testing against long_fail
                 Rate      third      first second_alt     second
third      11378120/s         --        -2%        -4%       -12%
first      11566419/s         2%         --        -2%       -10%
second_alt 11830740/s         4%         2%         --        -8%
second     12860517/s        13%        11%         9%         --
testing against short_pass
                 Rate     second      third second_alt      first
second     11540465/s         --        -5%        -5%        -7%
third      12093336/s         5%         --        -0%        -3%
second_alt 12118504/s         5%         0%         --        -2%
first      12410348/s         8%         3%         2%         --
testing against long_pass
                 Rate      first     second second_alt      third
first      11423466/s         --        -1%        -4%        -7%
second     11545540/s         1%         --        -3%        -7%
second_alt 11870086/s         4%         3%         --        -4%
third      12348377/s         8%         7%         4%         --

And here is the minimally modified program that produced the second set of numbers:

这是一个最小修改的程序产生了第二组数字:

#!/usr/bin/env perl
use Benchmark qw<cmpthese>;

$short_fail = 1  . "a";
$long_fail  = 1  . "a" x 600;

$short_pass =  2;
$long_pass  = 42;

for my $name (qw< short_fail long_fail short_pass long_pass >) {
    print "testing against $name\n";
    $_ = $$name;
    cmpthese 0 => {
        first       => '/^(?:[0-9]{1,2})$/',
        second      => '/^(?:[0-9]?[0-9])$/',
        second_alt  => '/^(?:[0-9][0-9]?)$/',
        third       => '/^(?:[0-9][0-9]|[0-9])$/',
    }
}

#4


3  

If one has to be faster (will likely depend on the regex engine used), then clearly the first one in my view (which can be a simple Morris-Pratt table DFA in contrast to the other two), as the other two are likely to require backtracking or perform additional work:

如果一个必须要更快(可能取决于使用的regex引擎),那么很明显我的第一个(相对于其他两个可以是一个简单的Morris-Pratt表DFA),因为另外两个可能需要回溯或执行额外的工作:

[0-9]?[0-9] - for the case with one digit, the engine will be greedy and match the first digit, then fail the second; backtrack and then succeed

[0 - 9]吗?[0-9] -对于一个数字的情况,引擎会贪婪地匹配第一个数字,然后在第二个数字上失败;回溯,然后成功

[0-9]|([0-9][0-9]) - a capturing group is used here, which slows things down

[0-9]|([0-9][0-9][0-9])-在这里使用一个捕获组,这将减慢速度。

#5


3  

I have no clue about the internals but what about some pseudo benching? :D

我不知道内线是什么,但是一些假的弯曲呢?:D

Python

Python

import re
import time

regs = ["^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"]
numbers = [str(n) for n in range(0, 100)]

result = None

// determine loop overhead
start = time.time()
for e in xrange(0, 10000):
    for n in numbers:
        result = n

loop = time.time() - start


for i in regs:
    r = re.compile(i)
    now = time.time()
    for e in xrange(0, 10000):
        for n in numbers:
            result = r.search(n)

    print (time.time() - now) - loop

Results in Seconds

结果在几秒钟内

0.874
0.869
0.809

JavaScript

JavaScript

var regs = ["^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"]

var numbers = [];
for(var n = 0; n < 100; n++) {
    numbers.push(''+n);
}


// determine loop overhead
var result = null;
var start = new Date().getTime();
for(var e = 0; e < 10000; e++) {
    for(var n = 0; n < 100; n++) {
        result = numbers[n];
    }
}

// test regex
var loop = new Date().getTime() - start;
for(var i = 0; i < regs.length; i++) {
    var r = new RegExp(regs[i]);
    var now = new Date().getTime();
    for(var e = 0; e < 10000; e++) {
        for(var n = 0; n < 100; n++) {
            result = r.exec(numbers[n]);
        }
    }
    console.log((new Date().getTime() - now) - loop); //using document.write here in Browsers
}

Results in Seconds

结果在几秒钟内

Node.js
0.197
0.193
0.226

Opera 11
0.836
0.408
0.372

Firefox 4
2.039
2.491
2.488

So what do we learn? Well Pythons seems to be rather slow, and V8 seems to be quite fast.
But hey benching is always fun!

那么我们能学到什么呢?蟒蛇似乎很慢,V8似乎很快。但是嘿,坐下来总是很有趣!

Update: Java Version

更新:Java版本

import java.util.regex.Pattern;

public class Test {
    public static void main(String args[]) {
        test();
        test();
        test();
        test();
    }

    public static void test() {
        String regs[] = {"^[0-9]{1,2}$", "^[0-9]?[0-9]$", "^[0-9]|([0-9][0-9])$"};
        String numbers[] = new String[100];
        for(int n = 0; n < 100; n++) {
            numbers[n] = Integer.toString(n);
        }

        // determine loop overhead
        String nresult = "";
        long start = System.nanoTime();
        for(int e = 0; e < 10000; e++) {
            for(int n = 0; n < 100; n++) {
                nresult = numbers[n];
            }
        }

        long loop = System.nanoTime() - start;

        boolean result = false;
        for(int i = 0; i < regs.length; i++) {
            Pattern p = Pattern.compile(regs[i]);

            long now = System.nanoTime();
            for(int e = 0; e < 10000; e++) {
                for(int n = 0; n < 100; n++) {
                    result = p.matcher(numbers[i]).matches();
                }
            }
            System.out.println(((System.nanoTime() - now) - loop) / 1000000);
        }
        System.out.println(result);
        System.out.println(nresult);
    }
}

Results in Seconds (times of the 4th run)

以秒为单位的结果(第4次运行的时间)

0.230
0.262
0.210

#6


1  

Those regex are so trivial it shouldn't matter. However, if I had to pick a more efficient implementation, it would be either [0-9]{1,2} or [0-9][0-9]?, which is not in your choices, since there's no backtracking necessary.

那些正则表达式太琐碎了,不重要。但是,如果我必须选择一个更有效的实现,它将是[0-9]{1,2}还是[0-9][0-9][0-9]?,这不是你的选择,因为没有必要回溯。

#7


1  

Just like C and ++i versus i=i+1, a good regex compiler should compile all three of these to exactly the same finite automaton. If it doesn't, I would consider that a bug.

就像C和+i与i=i+1一样,一个好的regex编译器应该将这三个编译成完全相同的有限自动机。如果没有的话,我会认为这是一个bug。

(Exception: If parenthesized subexpression tagging is enabled, the third would obviously compile to include the extra tagging information.)

(例外:如果启用了圆括号子表达式标记,第三个显然会编译以包含额外的标记信息。)