Matching text between delimiters: greedy or lazy regular expression?

Time: 2022-07-03 11:33:27

For the common problem of matching text between delimiters (e.g. < and >), there are two common patterns:

  • using the greedy * or + quantifier with a negated character class, in the form START [^END]* END, e.g. <[^>]*>, or
  • using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?
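
For reference, here is a minimal sketch of the two patterns side by side. It assumes Python's re module purely for illustration, and the sample string is made up. On simple input like this, both patterns find the same matches:

```python
import re

# Hypothetical sample input, not taken from the question.
sample = "a <b> c <i> d"

# Greedy quantifier on a negated character class.
negated_class = re.findall(r"<[^>]*>", sample)

# Lazy quantifier on the dot.
lazy_dot = re.findall(r"<.*?>", sample)

print(negated_class)
print(lazy_dot)
```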

3 answers

#1


12  

Some advantages:

[^>]*:

  • More expressive.
  • Matches newlines regardless of the /s (single-line, or "dot-all") flag.
  • Considered quicker, because the engine doesn't have to backtrack to find a successful match (with [^>] the engine doesn't make choices; we give it only one way to match the pattern against the string).

.*?:

  • No "code duplication": the end character only appears once.
  • Simpler in cases where the end delimiter is more than one character long (a single negated character class would not work in this case). A common alternative is (?:(?!END).)*. This is even worse if the END delimiter is itself another pattern.
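
Two of the points above (newline handling and multi-character end delimiters) can be illustrated with a short sketch. It assumes Python's re module, where the /s flag is spelled re.DOTALL, and the sample strings are hypothetical:

```python
import re

# Hypothetical tag that spans two lines.
multiline_tag = "<a\nhref='x'>"

# The negated class matches the newline without any flag.
class_match = re.search(r"<[^>]*>", multiline_tag)

# The dot does not match a newline unless DOTALL (the /s flag) is set.
lazy_no_flag = re.search(r"<.*?>", multiline_tag)
lazy_dotall = re.search(r"<.*?>", multiline_tag, re.DOTALL)

# Hypothetical text with a three-character end delimiter "-->".
comment = "<!-- a - b -> c -->"

# The lazy dot handles the long delimiter directly.
lazy_comment = re.search(r"<!--.*?-->", comment)

# A single negated class cannot express "anything except -->":
# [^-]* stops at the first lone hyphen, so the match fails.
class_comment = re.search(r"<!--[^-]*-->", comment)

# The lookahead-based alternative mentioned above.
tempered_comment = re.search(r"<!--(?:(?!-->).)*-->", comment)

print(class_match, lazy_no_flag, lazy_dotall)
print(lazy_comment, class_comment, tempered_comment)
```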

#2


7  

The first is more explicit, i.e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case (if the regular expression is extended to match more than just this tag).

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!
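
The example above can be reproduced with a small sketch, assuming Python's re module (any backtracking flavor should behave the same way for these two patterns):

```python
import re

text = "<tag1><tag2>Hello!"

# Lazy dot: the match starts at the first "<" and the dot is free to
# run across the first ">" until ">Hello!" can finally match.
lazy_result = re.search(r"<.*?>Hello!", text).group()

# Negated class: the tag body can never contain ">", so only the tag
# directly in front of "Hello!" can match.
class_result = re.search(r"<[^>]*>Hello!", text).group()

print(lazy_result)
print(class_result)
```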

#3


6  

What most people fail to consider when approaching questions like this is what happens when the regex is unable to find a match. That's when the killer performance sinkholes are most likely to appear. For example, take Tim's example, where you're looking for something like <tag>Hello!. Consider what happens with:

<.*?>Hello!

The regex engine finds a < and it quickly finds a closing >, but not >Hello!. So the .*? continues looking for a > that is followed by Hello!. If there isn't one, it will go all the way to the end of the document before it gives up. Then the regex engine resumes scanning until it finds another <, and tries again. We already know how that's going to turn out, but the regex engine, typically, doesn't; it goes through the same rigamarole with every < in the document. Now consider the other regex:

<[^>]*>Hello!

As before, it quickly matches from the < to the >, but fails to match Hello!. It will backtrack to the <, then quit and start scanning for another <. It will still check every < like the first regex did, but it won't search all the way to the end of the document every time it finds one.
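
To see the failure case for yourself, a rough sketch along these lines may help. It assumes Python's re module; the document contents, the tag count, and the timed_search helper are all made up for illustration, and the timings depend entirely on your machine and engine, so no figures are claimed here:

```python
import re
import time

# Hypothetical document: many tags, and no "Hello!" anywhere,
# so both searches are guaranteed to fail.
document = "<a>x" * 2000

lazy = re.compile(r"<.*?>Hello!")
negated = re.compile(r"<[^>]*>Hello!")


def timed_search(pattern, text):
    """Return (match, elapsed_seconds) for a single search."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


lazy_match, lazy_seconds = timed_search(lazy, document)
negated_match, negated_seconds = timed_search(negated, document)

print(f"lazy dot:      {lazy_match}, {lazy_seconds:.6f} s")
print(f"negated class: {negated_match}, {negated_seconds:.6f} s")
```

Increasing the tag count should make any difference between the two easier to observe, since the lazy version can rescan the rest of the document from every <.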

But it's even worse than that. If you think about it, .*? is effectively equivalent to a negative lookahead. It's saying "Before consuming the next character, make sure the remainder of the regex can't match at this position." In other words,

/<.*?>Hello!/

...is equivalent to:

/<(?:(?!>Hello!).)*(?:>Hello!|\z(*FAIL))/

So at every position you're performing, not just a normal match attempt, but a much more expensive lookahead. (It's at least twice as costly, because the lookahead has to scan at least one character, then the . goes ahead and consumes a character.)

((*FAIL) is one of Perl's backtracking-control verbs (also supported in PHP). |\z(*FAIL) means "or reach the end of the document and give up".)
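
The equivalence can be spot-checked with a small sketch. It assumes Python's re module, which does not support (*FAIL), so the \z(*FAIL) branch is dropped here; that branch only affects how quickly the engine gives up, not which text is matched. The sample strings and the spans helper are hypothetical:

```python
import re

lazy = re.compile(r"<.*?>Hello!")

# Simplified form of the lookahead version, without the
# \z(*FAIL) early-exit branch (not available in Python's re).
tempered = re.compile(r"<(?:(?!>Hello!).)*>Hello!")

# Hypothetical inputs: matching, non-matching, and one with a newline.
samples = [
    "<tag1><tag2>Hello!",
    "<tag>Hello!",
    "<a><b>Goodbye",
    "no tags here",
    "<a\n>Hello!",
]


def spans(pattern, text):
    """List the (start, end) span of every match in text."""
    return [m.span() for m in pattern.finditer(text)]


comparison = {s: (spans(lazy, s), spans(tempered, s)) for s in samples}

for text, (lazy_spans, tempered_spans) in comparison.items():
    print(repr(text), lazy_spans, tempered_spans)
```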

Finally, there's another advantage of the negated-character-class approach. While it doesn't (as @Bart pointed out) act like the quantifier is possessive, there's nothing to stop you from making it possessive, if your flavor supports it:

/<[^>]*+>Hello!/

...or wrap it in an atomic group:

/(?><[^>]*>)Hello!/

Not only will these regexes never backtrack unnecessarily, they don't have to save the state information that makes backtracking possible.
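
As a sketch of how this looks in one particular flavor: Python's re module accepts possessive quantifiers and atomic groups starting with version 3.11, so the example below guards on the interpreter version. The sample string is the one from Tim's example; the results dictionary is just an illustrative way of collecting the output:

```python
import re
import sys

text = "<tag1><tag2>Hello!"

# The plain negated-class pattern works on every Python version.
results = {"plain": re.search(r"<[^>]*>Hello!", text).group()}

# Possessive quantifiers and atomic groups need Python 3.11 or later;
# older versions reject these patterns at compile time.
if sys.version_info >= (3, 11):
    possessive = re.compile(r"<[^>]*+>Hello!")
    atomic = re.compile(r"(?><[^>]*>)Hello!")
    results["possessive"] = possessive.search(text).group()
    results["atomic"] = atomic.search(text).group()

for name, matched in results.items():
    print(name, matched)
```

All variants find the same text here; the difference lies only in how much backtracking state the engine keeps along the way.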
