Matching balanced parentheses in Ruby using recursive regular expressions, as in Perl

Time: 2022-08-12 15:47:58

I have been looking for a way to match balanced parentheses with a regex, and found a way in Perl that uses a recursive regular expression:

my $re;
$re = qr{
           \(
              (?:
                 (?> [^()]+ )       # Non-parens without backtracking
                 |
                 (??{ $re })        # Group with matching parens
              )*
           \)
         }x;

This comes from the Perl regular expression site.

Is there a way to do this in Ruby or a similar language?

UPDATE:

For those interested, here are some useful links:

Oniguruma manual - from Sawa's answer.

Pragmatic Programmers' Ruby 1.9 Regular Expressions Sample Chapter

2 solutions

#1


18  

Yes. With the Oniguruma regex engine, which is built into Ruby 1.9 and can be installed on Ruby 1.8, you can do that. You name a subexpression with (?<name>...) or (?'name'...), then call it with \g<name> or \g'name' from within the same regex. Translated to Oniguruma syntax, your regex becomes:

re = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )
        |
        \g<re>
      )*
    \)
  )
}x
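
As a minimal sketch of how this pattern might be used, the snippet below wraps it in a helper that returns the first balanced group in a string. The constant name BALANCED and the helper first_balanced are hypothetical names chosen for illustration; they are not part of the original answer.

```ruby
# Recursive pattern from the answer above: the named group `re` calls itself.
BALANCED = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )   # run of non-parenthesis characters, no backtracking
        |
        \g<re>         # nested balanced group (subexpression call)
      )*
    \)
  )
}x

# Hypothetical helper: returns the first balanced group, or nil if there is none.
def first_balanced(text)
  m = BALANCED.match(text)
  m && m[:re]
end

first_balanced("foo (a (b) c) bar")   # => "(a (b) c)"
first_balanced("( ( stuff )")         # => "( stuff )"
first_balanced("no parens")           # => nil
```

Note that the match is unanchored, so an unbalanced opening paren is simply skipped and the search continues at the next position.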

Also note that the multi-byte string module in PHP >= 5 uses the Oniguruma regex engine, so you should be able to do the same there.

The manual for Oniguruma is here.

#2


0  

I like the above solution, but one frequently wants to ignore escaped characters. Assuming that \ escapes the following character, the following regex handles escaped characters as well.

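# An odd-length run of backslashes, i.e. one that escapes the character after it
ESC = /(?<![\\])(?>[\\](?:[\\][\\])*)/
# Start of string or a non-backslash, followed by an even-length run of backslashes
UNESC = /(?:\A|(?<=[^\\]))(?:[\\][\\])*/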
BALANCED_PARENS = /#{UNESC}(
                   (?<bal>\(
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\g<bal>
                    )*
                    \))    ) /xm

Given the limitations of negative lookbehind, the part delimited by matching parens will be the first capture, not the whole match (the whole match may contain leading escaped backslashes).
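
The difference between the whole match and the capture can be seen in a short sketch. The patterns are copied from this answer; the sample strings and the variable names (m, m2, skipped, whole, capture) are illustrative assumptions.

```ruby
# Patterns copied from the answer above.
ESC   = /(?<![\\])(?>[\\](?:[\\][\\])*)/
UNESC = /(?:\A|(?<=[^\\]))(?:[\\][\\])*/
BALANCED_PARENS = /#{UNESC}(
                   (?<bal>\(
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\g<bal>
                    )*
                    \))    ) /xm

# Single-quoted literals keep "\(" as a backslash followed by a paren.
# The escaped "\(" is skipped, and the escaped "\)" does not close the group.
m = BALANCED_PARENS.match('x \( (a \) b) y')
skipped = m[:bal]      # the text (a \) b)

# '\\\\(a)' is the five-character string \\(a): an escaped backslash, then "(a)".
m2 = BALANCED_PARENS.match('\\\\(a)')
whole   = m2[0]        # includes the two leading backslashes
capture = m2[:bal]     # just (a)
```

Reading the named group bal rather than the whole match is what keeps the leading backslashes out of the result.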

The reason for the complexity of ESC and UNESC is the assumption that \\ is an escaped backslash. We only use the UNESC sequence before the initial paren, since any other escaped parenthesis will be matched inside the atomic group and never backtracked into. Indeed, if we tried to use the UNESC prefix for either an interior or final paren, it would fail when [^()] inside the atomic group matched the leading backslashes and refused to backtrack.

This regex scans for the first paren that opens a validly balanced parenthetical. Thus, given the string " ( ( stuff )", it will match "( stuff )". Frequently, the desired behavior is to locate the first (unescaped) parenthesis and either match the interior (if balanced) or fail to match. Unfortunately, atomic grouping won't stop the engine from backing out of the entire regex and attempting a match at a later position, so we must anchor at the start of the string and only look at the first capture. The following regex makes this change:

BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+
                  (?<match>\(
                   (?<bal>
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\(\g<bal>
                    )*
                    \))    ) /xm
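
A brief sketch of the anchored behavior follows. The patterns are copied from this answer; the helper first_parenthetical and the sample strings are hypothetical and only there to illustrate the point.

```ruby
# Patterns copied from the answer above.
ESC = /(?<![\\])(?>[\\](?:[\\][\\])*)/
BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+
                  (?<match>\(
                   (?<bal>
                    (?>
                      (?>  (?:#{ESC}\(|#{ESC}\)|[^()])+     )
                      |\(\g<bal>
                    )*
                    \))    ) /xm

# Hypothetical helper: the parenthetical opened by the first unescaped paren,
# or nil when that paren is never balanced.
def first_parenthetical(text)
  m = BALANCED_PARENS.match(text)
  m && m[:match]
end

first_parenthetical("foo (a (b)) bar")  # => "(a (b))"
first_parenthetical(" ( ( stuff )")     # => nil
first_parenthetical('\( (x)')           # => "(x)"
```

Because the prefix is anchored with \A and consumed possessively, the engine cannot retry from the second paren, which is why the unbalanced input now yields no match at all.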
