我可以进一步提高这个正则表达式的性能吗?

时间:2022-10-11 03:23:00

I am trying to fetch thread names from the thread dumps file. The thread names are usually contained within "double quotes" in the first line of each thread dump. It may look as simple as follows:

我试图从线程转储文件中获取线程名称。线程名称通常包含在每个线程转储的第一行的“双引号”中。它可能看起来如下所示:

"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]

Or as big as follows:

或者大到如下:

"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]

The regular expression I wrote is simple one: "(.*)". It captures everything inside double quotes as a group. However it causes heavy backtracking thus requiring a lot of steps, as can be seen here. Verbally we can explain this regex as "capture anything that is enclosed inside double quotes as a group"

我写的正则表达式很简单:“(。*)”。它将双引号内的所有内容作为一组捕获。然而,这会导致严重的回溯,因此需要很多步骤,如此处所示。在口头上,我们可以将此正则表达式解释为“捕获任何包含在双引号内的任何内容”

So I came up with another regex which performs the same: "([^\"])". Verbally we can describe this regex as "capture any number of non-double quote characters that are enclosed inside double quotes". I did not found any fast regex than this. It does not perform any backtracking and hence it requires minimum steps as can be seen here.

所以我提出了另一个执行相同的正则表达式:“([^ \”])“。我们可以将这个正则表达式描述为”捕获任意数量的双引号括起来的非双引号字符“。我没有发现任何快速正则表达式。它不执行任何回溯,因此它需要最少的步骤,如此处所示。

I told this above to my colleague. He came up with yet another one: "(.*?)". I didnt get how it works. It performs considerable less backtracking than the first one but is a bit slower than the second one as can be seen here. However

我把这个告诉了我的同事。他想出了另一个:“(。*?)”。我没弄明白它是如何运作的。与第一个相比,它执行的回溯相当少,但比第二个慢一点,如此处所示。然而

  • I don't get why the backtracking stops early.
  • 我不明白为什么回溯会提前停止。

  • I understand ? is a quantifier which means once or not at all. However I dont understand how once or not at all is getting used here.
  • 我明白 ?是一个量词,意味着一次或根本不是。但是我不明白在这里曾经或根本没有使用过。

  • In fact I am not able to guess how can we describe this regex verbally.
  • 事实上,我无法猜测我们如何口头描述这个正则表达式。

My colleague tried explaining me but I am still not able to understand it completely. Can anyone explain?

我的同事试图解释我,但我仍然无法完全理解它。谁有人解释一下?

1 个解决方案

#1


7  

Brief explanation and a solution

The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

“(。*)”正则表达式涉及大量的回溯,因为它找到第一个“然后抓取整个字符串并回溯寻找最接近字符串结尾的”。因为你有一个更接近开头的带引号的子串,所以有更多的回溯而不是“(。*?)”作为这个懒惰的量词*?使正则表达式引擎找到最接近的“在第一个之后”找到。

The negated character class solution "([^"]*)" is the best from the 3 because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression ultimately efficient, you can use possessive quantifiers.

否定的字符类解决方案“([^”] *)“是3中最好的,因为它不必抓取所有字符,只是除了”之外的所有字符。但是,要停止任何回溯并使表达最终有效,您可以使用占有量词。

If you need to match strings like " + no quotes here + ", use

如果您需要匹配“+ no quotes here +”等字符串,请使用

"([^"]*+)"

or even you do not need to match the trailing quote in this situation:

或者甚至在这种情况下你不需要匹配尾随引用:

"([^"]*+)

See regex demo

请参阅正则表达式演示

In fact I am not able to guess how can we describe this regex verbally.

事实上,我无法猜测我们如何口头描述这个正则表达式。

The latter "([^"]*+) regex can be described as

后者“([^”] * +)正则表达式可以描述为

  • " - find the first " symbol from the left of the string
  • “ - 从字符串的左边找到第一个”符号

  • ([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.
  • ([^“] * +) - 匹配并捕获到第1组零个或多个符号而不是”,尽可能多,并且一旦引擎找到双引号,匹配将立即返回,无需回溯。

Quantifiers

More information on quantifiers from Rexegg.com:

有关量词的更多信息,请访问Rexegg.com:

A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

A *零或更多As尽可能多(贪婪),如果引擎需要回溯(温顺)A *,放弃角色?零或更多As,尽可能少地允许整个模式匹配(懒惰)A * +零或更多As尽可能多(贪婪),如果引擎试图回溯(占有),则不放弃字符

As you see, ? is not a separate quantifier, it is a part of another quantifier.

正如你看到的, ?它不是一个单独的量词,它是另一个量词的一部分。

I advise to read more about why Lazy Quantifiers are Expensive and that Negated Class Solution is really safe and fast to deal with your input string (where you just match a quote followed by non-quotes and then a final quote).

我建议阅读更多关于为什么懒惰量词是昂贵的,并且否定类解决方案是非常安全和快速处理您的输入字符串(您只需匹配报价,然后是非引号,然后是最终报价)。

Difference between .*?, .* and [^"]*+ quantifiers

  • Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.
  • 贪婪的“(。*)”解决方案的工作方式如下:从左到右检查每个符号,查找“,一旦找到,就抓住整个字符串直到结尾并检查每个符号是否等于”。因此,在输入字符串中,它会回溯160次。

我可以进一步提高这个正则表达式的性能吗?

Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.

由于下一个“不远”,回溯步骤的数量远远少于贪婪匹配。

我可以进一步提高这个正则表达式的性能吗?

  • possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the dot-star will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].
  • 具有否定字符类的占有量词解决方案“([^”] * +)“就像这样:引擎找到最左边的”,然后抓取所有不是“直到第一个”的字符。否定的字符类[^“] * +贪婪地匹配零个或多个不是双引号的字符。因此,我们保证点星将永远不会跳过第一个遇到的”。这是在某些分隔符之间进行匹配的更直接有效的方法。请注意,在这个解决方案中,我们可以完全信任量化[^“]的*。即使它是贪婪的,也不存在[^”]匹配过多的风险,因为它与“。”相互排斥。来自正则表达式风格指南的对比原理[见源代码]。

Note that the possessive quantifier does not let the regex engine backtrack into the subexpression, once matched, the symbols between " become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and it will be unable to shift any characters from and into this block of text.

注意,占有量量化器不会让正则表达式引擎回溯到子表达式,一旦匹配,“由于正则表达式引擎遇到一些”不便“而成为”无法“重新排序的一个硬块之间的符号,它将会无法将任何字符从此文本块中移出。

For the current expression, it does not make a big difference though.

对于当前表达式,它并没有产生很大的不同。

我可以进一步提高这个正则表达式的性能吗?

#1


7  

Brief explanation and a solution

The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

“(。*)”正则表达式涉及大量的回溯,因为它找到第一个“然后抓取整个字符串并回溯寻找最接近字符串结尾的”。因为你有一个更接近开头的带引号的子串,所以有更多的回溯而不是“(。*?)”作为这个懒惰的量词*?使正则表达式引擎找到最接近的“在第一个之后”找到。

The negated character class solution "([^"]*)" is the best from the 3 because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression ultimately efficient, you can use possessive quantifiers.

否定的字符类解决方案“([^”] *)“是3中最好的,因为它不必抓取所有字符,只是除了”之外的所有字符。但是,要停止任何回溯并使表达最终有效,您可以使用占有量词。

If you need to match strings like " + no quotes here + ", use

如果您需要匹配“+ no quotes here +”等字符串,请使用

"([^"]*+)"

or even you do not need to match the trailing quote in this situation:

或者甚至在这种情况下你不需要匹配尾随引用:

"([^"]*+)

See regex demo

请参阅正则表达式演示

In fact I am not able to guess how can we describe this regex verbally.

事实上,我无法猜测我们如何口头描述这个正则表达式。

The latter "([^"]*+) regex can be described as

后者“([^”] * +)正则表达式可以描述为

  • " - find the first " symbol from the left of the string
  • “ - 从字符串的左边找到第一个”符号

  • ([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.
  • ([^“] * +) - 匹配并捕获到第1组零个或多个符号而不是”,尽可能多,并且一旦引擎找到双引号,匹配将立即返回,无需回溯。

Quantifiers

More information on quantifiers from Rexegg.com:

有关量词的更多信息,请访问Rexegg.com:

A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

A *零或更多As尽可能多(贪婪),如果引擎需要回溯(温顺)A *,放弃角色?零或更多As,尽可能少地允许整个模式匹配(懒惰)A * +零或更多As尽可能多(贪婪),如果引擎试图回溯(占有),则不放弃字符

As you see, ? is not a separate quantifier, it is a part of another quantifier.

正如你看到的, ?它不是一个单独的量词,它是另一个量词的一部分。

I advise to read more about why Lazy Quantifiers are Expensive and that Negated Class Solution is really safe and fast to deal with your input string (where you just match a quote followed by non-quotes and then a final quote).

我建议阅读更多关于为什么懒惰量词是昂贵的,并且否定类解决方案是非常安全和快速处理您的输入字符串(您只需匹配报价,然后是非引号,然后是最终报价)。

Difference between .*?, .* and [^"]*+ quantifiers

  • Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.
  • 贪婪的“(。*)”解决方案的工作方式如下:从左到右检查每个符号,查找“,一旦找到,就抓住整个字符串直到结尾并检查每个符号是否等于”。因此,在输入字符串中,它会回溯160次。

我可以进一步提高这个正则表达式的性能吗?

Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.

由于下一个“不远”,回溯步骤的数量远远少于贪婪匹配。

我可以进一步提高这个正则表达式的性能吗?

  • possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the dot-star will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].
  • 具有否定字符类的占有量词解决方案“([^”] * +)“就像这样:引擎找到最左边的”,然后抓取所有不是“直到第一个”的字符。否定的字符类[^“] * +贪婪地匹配零个或多个不是双引号的字符。因此,我们保证点星将永远不会跳过第一个遇到的”。这是在某些分隔符之间进行匹配的更直接有效的方法。请注意,在这个解决方案中,我们可以完全信任量化[^“]的*。即使它是贪婪的,也不存在[^”]匹配过多的风险,因为它与“。”相互排斥。来自正则表达式风格指南的对比原理[见源代码]。

Note that the possessive quantifier does not let the regex engine backtrack into the subexpression, once matched, the symbols between " become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and it will be unable to shift any characters from and into this block of text.

注意,占有量量化器不会让正则表达式引擎回溯到子表达式,一旦匹配,“由于正则表达式引擎遇到一些”不便“而成为”无法“重新排序的一个硬块之间的符号,它将会无法将任何字符从此文本块中移出。

For the current expression, it does not make a big difference though.

对于当前表达式,它并没有产生很大的不同。

我可以进一步提高这个正则表达式的性能吗?