sun jdk 6.0.24中的java.util.regex中的错误?

时间:2023-01-07 13:57:01

The following code blocks on my system. Why?

我的系统上有以下代码块。为什么?

System.out.println( Pattern.compile( 
   "^((?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*)/\\*.*?\\*/(.*)$", 
   Pattern.MULTILINE | Pattern.DOTALL ).matcher( 
   "\n\n\n\n\n\nUPDATE \"$SCHEMA\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';"
   ).matches() );

The pattern (designed to detect comments of the form /*...*/ but not within ' or ") should be fast, as it is deterministic... Why does it take soooo long?

模式(旨在检测形式/*...*/但不在'或'内的注释)应该很快,因为它是确定性的......为什么它需要太长时间?

4 个解决方案

#1


8  

You're running into catastrophic backtracking.

你正在遭遇灾难性的回溯。

Looking at your regex, it's easy to see how .*? and (.*) can match the same content since both also can match the intervening \*/ part (dot matches all, remember). Plus (and even more problematic), they can also match the same stuff that ((?:[^'"][^'"]*|"[^"]*"|'[^']*')*) matches.

看看你的正则表达式,很容易看出如何。*?和(。*)可以匹配相同的内容,因为两者也可以匹配插入的\ * /部分(点匹配所有,记住)。加(甚至更有问题),他们也可以匹配相同的东西((?:[^'“] [^'”] * |“[^”] *“|'[^'] *')*)火柴。

The regex engine gets bogged down in trying all the permutations, especially if the string you're testing against is long.

正在尝试所有排列时,正则表达式引擎陷入困境,特别是如果您正在测试的字符串很长。

I've just checked your regex against your string in RegexBuddy. It aborts the match attempt after 1.000.000 steps of the regex engine. Java will keep churning on until it gets through all permutations or until a Stack Overflow occurs...

我刚刚在RegexBuddy中检查了你的正则表达式。它在正则表达式引擎的1.000.000步之后中止匹配尝试。 Java将继续搅拌,直到它通过所有排列或直到发生Stack Overflow ...

You can greatly improve the performance of your regex by prohibiting backtracking into stuff that has already been matched. You can use atomic groups for this, changing your regex into

您可以通过禁止回溯已经匹配的内容来大大提高正则表达式的性能。您可以使用原子组,将正则表达式更改为

^((?>[^'"]+|"[^"]*"|'[^']*')*)(?>/\*.*?\*/)(.*)$

or, as a Java string:

或者,作为Java字符串:

"^((?>[^'\"]+|\"[^\"]*\"|'[^']*')*)(?>/\\*.*?\\*/)(.*)$"

This reduces the number of steps the regex engine has to go through from > 1 million to 58.

这减少了正则表达式引擎必须经历的步数从> 100万到58。

Be advised though that this will only find the first occurrence of a comment, so you'll have to apply the regex repeatedly until it fails.

请注意,这只会发现第一次出现注释,因此您必须反复应用正则表达式,直到它失败。

Edit: I just added two slashes that were important for the expressions to work. Yet I had to change more than 6 characters.... :(

编辑:我刚刚添加了两个对表达起作用很重要的斜杠。但我不得不改变超过6个字符...... :(

#2


1  

I recommend that you read Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...).

我建议您阅读正则表达式匹配可以简单快速(但在Java,Perl,PHP,Python,Ruby等方面很慢)。

#3


0  

I think it's because of this bit:

我认为这是因为这一点:

(?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*

Removing the second and third alternatives gives you:

删除第二个和第三个替代方案可以:

(?:[^'\"][^'\"]*)*

or:

(?:[^'\"]+)*

Repeated repeats can take a long time.

重复重复可能需要很长时间。

#4


0  

For comment /* and */ detection I would suggest having a code like this:

对于评论/ *和* /检测,我建议使用这样的代码:

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" /*a comment\n\n*/ SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Pattern pt = Pattern.compile("\"[^\"]*\"|'[^']*'|(/\\*.*?\\*/)", 
                             Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
boolean found = false;
while (matcher.find()) {
   if (matcher.group(1) != null) {
      found = true;
      break;
   }
}
if (found)
   System.out.println("Found Comment: [" + matcher.group(1) + ']');
else
   System.out.println("Didn't find Comment");

For above string it prints:

上面的字符串打印:

Found Comment: [/*a comment

*/]

But if I change input string to:

但是,如果我将输入字符串更改为:

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" '/*a comment\n\n*/' SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";

OR

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" \"/*a comment\n\n*/\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";

Output is:

Didn't find Comment

#1


8  

You're running into catastrophic backtracking.

你正在遭遇灾难性的回溯。

Looking at your regex, it's easy to see how .*? and (.*) can match the same content since both also can match the intervening \*/ part (dot matches all, remember). Plus (and even more problematic), they can also match the same stuff that ((?:[^'"][^'"]*|"[^"]*"|'[^']*')*) matches.

看看你的正则表达式,很容易看出如何。*?和(。*)可以匹配相同的内容,因为两者也可以匹配插入的\ * /部分(点匹配所有,记住)。加(甚至更有问题),他们也可以匹配相同的东西((?:[^'“] [^'”] * |“[^”] *“|'[^'] *')*)火柴。

The regex engine gets bogged down in trying all the permutations, especially if the string you're testing against is long.

正在尝试所有排列时,正则表达式引擎陷入困境,特别是如果您正在测试的字符串很长。

I've just checked your regex against your string in RegexBuddy. It aborts the match attempt after 1.000.000 steps of the regex engine. Java will keep churning on until it gets through all permutations or until a Stack Overflow occurs...

我刚刚在RegexBuddy中检查了你的正则表达式。它在正则表达式引擎的1.000.000步之后中止匹配尝试。 Java将继续搅拌,直到它通过所有排列或直到发生Stack Overflow ...

You can greatly improve the performance of your regex by prohibiting backtracking into stuff that has already been matched. You can use atomic groups for this, changing your regex into

您可以通过禁止回溯已经匹配的内容来大大提高正则表达式的性能。您可以使用原子组,将正则表达式更改为

^((?>[^'"]+|"[^"]*"|'[^']*')*)(?>/\*.*?\*/)(.*)$

or, as a Java string:

或者,作为Java字符串:

"^((?>[^'\"]+|\"[^\"]*\"|'[^']*')*)(?>/\\*.*?\\*/)(.*)$"

This reduces the number of steps the regex engine has to go through from > 1 million to 58.

这减少了正则表达式引擎必须经历的步数从> 100万到58。

Be advised though that this will only find the first occurrence of a comment, so you'll have to apply the regex repeatedly until it fails.

请注意,这只会发现第一次出现注释,因此您必须反复应用正则表达式,直到它失败。

Edit: I just added two slashes that were important for the expressions to work. Yet I had to change more than 6 characters.... :(

编辑:我刚刚添加了两个对表达起作用很重要的斜杠。但我不得不改变超过6个字符...... :(

#2


1  

I recommend that you read Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...).

我建议您阅读正则表达式匹配可以简单快速(但在Java,Perl,PHP,Python,Ruby等方面很慢)。

#3


0  

I think it's because of this bit:

我认为这是因为这一点:

(?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*

Removing the second and third alternatives gives you:

删除第二个和第三个替代方案可以:

(?:[^'\"][^'\"]*)*

or:

(?:[^'\"]+)*

Repeated repeats can take a long time.

重复重复可能需要很长时间。

#4


0  

For comment /* and */ detection I would suggest having a code like this:

对于评论/ *和* /检测,我建议使用这样的代码:

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" /*a comment\n\n*/ SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Pattern pt = Pattern.compile("\"[^\"]*\"|'[^']*'|(/\\*.*?\\*/)", 
                             Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
boolean found = false;
while (matcher.find()) {
   if (matcher.group(1) != null) {
      found = true;
      break;
   }
}
if (found)
   System.out.println("Found Comment: [" + matcher.group(1) + ']');
else
   System.out.println("Didn't find Comment");

For above string it prints:

上面的字符串打印:

Found Comment: [/*a comment

*/]

But if I change input string to:

但是,如果我将输入字符串更改为:

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" '/*a comment\n\n*/' SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";

OR

String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" \"/*a comment\n\n*/\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";

Output is:

Didn't find Comment