正则表达式引擎如何使用递归子模式解析正则表达式？

This regular expression matches palindromes: ^((.)(?1)\2|.?)$

这个正则表达式匹配回文:^((。)(?1)\ 2 |。?)$

Can't wrap my head around how it works. When does the recursion end, and when regex breaks from the recursive subpattern and goes to "|.?" part?

无法绕过它的工作原理。递归何时结束,并且当regex从递归子模式中断并转到“|。?”时部分?

Thanks.

edit: sorry I didn't explain \2 and (?1)

编辑:抱歉,我没有解释\ 2和(?1)

(?1) - refers to first subpattern (to itself)

(?1) - 指的是第一个子模式(对自己)

\2 - back-reference to a match of second subpattern, which is (.)

\ 2 - 反向引用第二个子模式的匹配,即(。)

Above example written in PHP. Matches both "abba" (no mid palindrome character) and "abcba" - has a middle, non-reflected character

以上用PHP编写的示例。匹配“abba”(没有中间回文字符)和“abcba” - 具有中间的,未反映的字符

3 个解决方案

#1

^((.)(?1)\2|.?)$

The ^ and $ asserts the beginning and the end of the string respectively. Let us look at the content in between, which is more interesting:

^和$分别断言字符串的开头和结尾。让我们看一下之间的内容,这更有趣:

((.)(?1)\2|.?)
1------------1 // Capturing group 1
 2-2           // Capturing group 2

Look at the first part (.)(?1)\2, we can see that it will try to match any character, and that same character at the end (back reference \2, which refers to the character matched by (.)). In the middle, it will recursively match for the whole capturing group 1. Note that there is an implicit assertion (caused by (.) matching one character at the beginning and \2 matching the same character at the end) that requires the string to be at least 2 characters. The purpose of the first part is chopping the identical ends of the string, recursively.

看看第一部分(。)(?1)\ 2,我们可以看到它将尝试匹配任何字符,并在末尾匹配相同的字符(后引用\ 2,它引用与(。)匹配的字符)。在中间,它将递归匹配整个捕获组1.请注意,有一个隐式断言(由(。)匹配开头的一个字符和\ 2匹配结尾的相同字符),需要字符串至少2个字符。第一部分的目的是递归地切割字符串的相同末端。

Look at second part .?, we can see that it will match one or 0 character. This will only be matched if the string initially has length 0 or 1, or after the leftover from the recursive match is 0 or 1 character. The purpose of the second part is to match the empty string or the single lonely character after the string is chopped from both ends.

看看第二部分。?,我们可以看到它将匹配一个或0个字符。只有当字符串最初的长度为0或1时,或者在递归匹配的剩余部分为0或1个字符之后,才会匹配。第二部分的目的是在从两端切断字符串后匹配空字符串或单个孤独字符。

The recursive matching works:

递归匹配工作:

The whole string must be palindrome to pass, asserted by ^ and $. We cannot start matching from any random position.

整个字符串必须是回文传递,由^和$断言。我们无法从任何随机位置开始匹配。

If the string is <= 1 character, it passes.

如果字符串是<= 1个字符,则它会通过。

If the string is > 2 characters, whether it is accepted is decided by the first part only. And it will be chopped by 2 ends if matches.

如果字符串> 2个字符,则是否接受仅由第一部分决定。如果匹配,它将被2个截断。

The leftover if matches, can only be chopped by the 2 ends, or passes if its length is <= 1.

剩余的如果匹配,只能被2个末端切断,或者如果长度<= 1则通过。

#2

The regex is essentially equivalent to the following pseudo-code:

正则表达式基本上等同于以下伪代码:

palin(str) {
    if (length(str) >= 2) {
      first = str[0];
      last = str[length(str)-1];
      return first == last && palin(substr(str, 1, length(str)-2));
    } else
      // empty and single-char trivially palindromes
      return true;
}

#3

I haven't found any nice debugging utility for PCRE regexps. The more I could find was how to dump the bytecode:

我没有为PCRE regexp找到任何好的调试实用程序。我能找到的更多是如何转储字节码:

$ pcretest -b
PCRE version 7.6 2008-01-28

  re> /^((.)(?1)\2|.?)$/x
------------------------------------------------------------------
  0  39 Bra
  3     ^
  4  26 CBra 1
  9   6 CBra 2
 14     Any
 15   6 Ket
 18   6 Once
 21   4 Recurse
 24   6 Ket
 27     \2
 30   5 Alt
 33     Any?
 35  31 Ket
 38     $
 39  39 Ket
 42     End
------------------------------------------------------------------

Perl has better debugging tools than PCRE, try echo 123454321 | perl -Mre=debug -ne '/^((.)(?1)\2|.?)$/x'. This gives not only some bytecode that is similar to PCRE's one, but it also shows each step, and the consumed and remaining parts of the input at each step:

Perl有比PCRE更好的调试工具,请尝试echo 123454321 | perl -Mre = debug -ne'/^((.)(?1)\2|.?)$/x'。这不仅给出了一些类似于PCRE的字节码,而且还显示了每一步,以及每一步输入的消耗和剩余部分:

Compiling REx "^((.)(?1)\2|.?)$"
Final program:
   1: BOL (2)
   2: OPEN1 (4)
   4:   BRANCH (15)
   5:     OPEN2 (7)
   7:       REG_ANY (8)
   8:     CLOSE2 (10)
  10:     GOSUB1[-8] (13)
  13:     REF2 (19)
  15:   BRANCH (FAIL)
  16:     CURLY {0,1} (19)
  18:       REG_ANY (0)
  19: CLOSE1 (21)
  21: EOL (22)
  22: END (0)
floating ""$ at 0..2147483647 (checking floating) anchored(BOL) minlen 0 
Guessing start of match in sv for REx "^((.)(?1)\2|.?)$" against "12321"
Found floating substr ""$ at offset 5...
Guessed: match at offset 0
Matching REx "^((.)(?1)\2|.?)$" against "12321"
   0 <> <12321>              |  1:BOL(2)
   0 <> <12321>              |  2:OPEN1(4)
   0 <> <12321>              |  4:BRANCH(15)
   0 <> <12321>              |  5:  OPEN2(7)
   0 <> <12321>              |  7:  REG_ANY(8)
   1 <1> <2321>              |  8:  CLOSE2(10)
   1 <1> <2321>              | 10:  GOSUB1[-8](13)
   1 <1> <2321>              |  2:    OPEN1(4)
   1 <1> <2321>              |  4:    BRANCH(15)
   1 <1> <2321>              |  5:      OPEN2(7)
   1 <1> <2321>              |  7:      REG_ANY(8)
   2 <12> <321>              |  8:      CLOSE2(10)
   2 <12> <321>              | 10:      GOSUB1[-8](13)
   2 <12> <321>              |  2:        OPEN1(4)
   2 <12> <321>              |  4:        BRANCH(15)
   2 <12> <321>              |  5:          OPEN2(7)
   2 <12> <321>              |  7:          REG_ANY(8)
   3 <123> <21>              |  8:          CLOSE2(10)
   3 <123> <21>              | 10:          GOSUB1[-8](13)
   3 <123> <21>              |  2:            OPEN1(4)
   3 <123> <21>              |  4:            BRANCH(15)
   3 <123> <21>              |  5:              OPEN2(7)
   3 <123> <21>              |  7:              REG_ANY(8)
   4 <1232> <1>              |  8:              CLOSE2(10)
   4 <1232> <1>              | 10:              GOSUB1[-8](13)
   4 <1232> <1>              |  2:                OPEN1(4)
   4 <1232> <1>              |  4:                BRANCH(15)
   4 <1232> <1>              |  5:                  OPEN2(7)
   4 <1232> <1>              |  7:                  REG_ANY(8)
   5 <12321> <>              |  8:                  CLOSE2(10)
   5 <12321> <>              | 10:                  GOSUB1[-8](13)
   5 <12321> <>              |  2:                    OPEN1(4)
   5 <12321> <>              |  4:                    BRANCH(15)
   5 <12321> <>              |  5:                      OPEN2(7)
   5 <12321> <>              |  7:                      REG_ANY(8)
                                                        failed...
   5 <12321> <>              | 15:                    BRANCH(19)
   5 <12321> <>              | 16:                      CURLY {0,1}(19)
                                                        REG_ANY can match 0 times out of 1...
   5 <12321> <>              | 19:                        CLOSE1(21)
                                                          EVAL trying tail ... 9d86dd8
   5 <12321> <>              | 13:                          REF2(19)
                                                            failed...
                                                        failed...
                                                      BRANCH failed...
   4 <1232> <1>              | 15:                BRANCH(19)
   4 <1232> <1>              | 16:                  CURLY {0,1}(19)
                                                    REG_ANY can match 1 times out of 1...
   5 <12321> <>              | 19:                    CLOSE1(21)
                                                      EVAL trying tail ... 9d86d70
   5 <12321> <>              | 13:                      REF2(19)
                                                        failed...
   4 <1232> <1>              | 19:                    CLOSE1(21)
                                                      EVAL trying tail ... 9d86d70
   4 <1232> <1>              | 13:                      REF2(19)
                                                        failed...
                                                    failed...
                                                  BRANCH failed...
   3 <123> <21>              | 15:            BRANCH(19)
   3 <123> <21>              | 16:              CURLY {0,1}(19)
                                                REG_ANY can match 1 times out of 1...
   4 <1232> <1>              | 19:                CLOSE1(21)
                                                  EVAL trying tail ... 9d86d08
   4 <1232> <1>              | 13:                  REF2(19)
                                                    failed...
   3 <123> <21>              | 19:                CLOSE1(21)
                                                  EVAL trying tail ... 9d86d08
   3 <123> <21>              | 13:                  REF2(19)
                                                    failed...
                                                failed...
                                              BRANCH failed...
   2 <12> <321>              | 15:        BRANCH(19)
   2 <12> <321>              | 16:          CURLY {0,1}(19)
                                            REG_ANY can match 1 times out of 1...
   3 <123> <21>              | 19:            CLOSE1(21)
                                              EVAL trying tail ... 9d86ca0
   3 <123> <21>              | 13:              REF2(19)
   4 <1232> <1>              | 19:              CLOSE1(21)
                                                EVAL trying tail ... 0
   4 <1232> <1>              | 13:                REF2(19)
   5 <12321> <>              | 19:                CLOSE1(21)
   5 <12321> <>              | 21:                EOL(22)
   5 <12321> <>              | 22:                END(0)
Match successful!
Freeing REx: "^((.)(?1)\2|.?)$"

As you can see, Perl first consumes all input recursing until (.) fails, then starts backtracking and trying the second branch from the alternation .? and the remainder of the first part \2, when that fails it backtracks, until it finally succeeds.

正如您所看到的,Perl首先消耗所有输入递归,直到(。)失败,然后开始回溯并尝试从交替进行第二个分支。和第一部分\ 2的其余部分,当它失败时它回溯,直到它最终成功。

#1