Strange behavior of "(.*)*", "(.*)+", "(.+)*" in Java regular expressions

Date: 2021-09-02 23:03:13

To reproduce the problem described in a recent question, "Why does (.*)* make two matches and select nothing in group $1?", I tried various combinations of * and +, inside and outside the parentheses, and the results were not what I expected.

I expected the same output as the one explained in the accepted answer to that question, and in a duplicate question tagged Perl, "Why doesn't the .* consume the entire string in this Perl regex?". But Java does not behave that way.

To keep it simple, here's the code I tried:

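import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "input";
        String[] patterns = { "(.*)*", "(.*)+", "(.+)*", "(.+)+" };

        for (String pattern : patterns) {
            Matcher matcher = Pattern.compile(pattern).matcher(str);

            // Print group 1 and the start index of every match found
            while (matcher.find()) {
                System.out.print("'" + matcher.group(1) + "' : '" + matcher.start() + "'" + "\t");
            }

            System.out.println();
        }
    }
}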

And this is the output I got for the four combinations:

'' : '0'    '' : '5'            // For `(.*)*`
'' : '0'    '' : '5'            // For `(.*)+`  
'input' : '0'   'null' : '5'    // For `(.+)*`
'input' : '0'                   // For `(.+)+`

What I can't understand is why, in the 1st and 2nd outputs, I am not getting the entire string as the first result of matcher.find(). Ideally, in the 1st case, .* should first capture the entire string and then also capture the empty string at the end. The 2nd match gives the expected result, but the 1st match does not.

Also, in the 2nd case, I shouldn't even get a 2nd match, because there is a + quantifier outside the parentheses.

My expected output is:

'input' : '0'   '' : '5'  // For 1st
'input' : '0'    // For 2nd

Also, in the 3rd output, why did I get null as the 2nd match instead of the empty string? Shouldn't the 2nd match be the same for the first 3 combinations?

The 4th output is as expected, so no question there.

1 Answer

#1


7  

You're seeing the effect of the same phenomenon you see in the question you linked to:

For (.*)*:

  • The first matcher.start() is 0 because that's where the match ("input") starts.
  • The first matcher.group(1) is "" because the repeated (.*) has overwritten the captured "input" with the empty string (but matcher.group(0) does contain "input").
  • The second matcher.start() is 5 because that's where the regex engine is after the first successful match.
  • The second matcher.group(1) (as well as matcher.group(0)) is "" because that's all there was to match at the end of the string.
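
The points above can be checked with a small sketch that prints group(0) next to group(1) for every match. The class name GroupZeroDemo and the helper matches are made up for illustration; only Pattern, Matcher, find(), group() and start() come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {
    // Hypothetical helper: one "group(0)|group(1)|start" entry per match.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // group(0) is the whole match; group(1) is whatever the
            // capturing group held when the match finished.
            out.add(m.group(0) + "|" + m.group(1) + "|" + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String regex : new String[] { "(.*)*", "(.*)+" }) {
            System.out.println(regex + " -> " + matches(regex, "input"));
        }
    }
}
```

For both patterns there should be two matches, at 0 and at 5, with "input" visible in group(0) of the first one even when group(1) is empty.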

For (.*)+ it's the same. After all, the empty string can be repeated as many times as you want and still be the empty string.
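
The underlying rule is that a quantified capturing group keeps only its last iteration. A simpler pattern makes this easier to see, and wrapping the whole repetition in an outer group is one possible way to keep everything that was matched. This is only a sketch: the class name LastIterationDemo, the helper firstGroupOne and the digit patterns are illustrative assumptions, not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastIterationDemo {
    // Hypothetical helper: group(1) of the first match, or null if nothing matches.
    static String firstGroupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group is repeated three times; only the last iteration survives.
        System.out.println(firstGroupOne("(\\d)+", "123"));      // 3
        // The repetition sits inside the group, so the group holds all of it.
        System.out.println(firstGroupOne("((?:\\d)+)", "123"));  // 123
        // Same idea applied to the pattern from the question.
        System.out.println(firstGroupOne("((?:.*)*)", "input")); // input
    }
}
```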

对于(.*)+它是一样的。毕竟,空字符串可以重复多次,而且仍然是空字符串。

For (.+)* you get null because, while the second match succeeds (zero repetitions of a string of at least one character match the empty string), the capturing parentheses haven't been able to capture anything, so the group's contents are null (as in undefined, rather than the empty string).
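
To tell "the group matched the empty string" apart from "the group did not take part in the match", one option is to check matcher.start(1), which returns -1 for a group that matched nothing. A minimal sketch, assuming the hypothetical class name UnsetGroupDemo and helper describe:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnsetGroupDemo {
    // Hypothetical helper: "start: 'text'" when group 1 took part in the
    // match, "start: <unset>" when it did not.
    static List<String> describe(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // start(1) is -1 when the match succeeded without group 1.
            String g1 = m.start(1) == -1 ? "<unset>" : "'" + m.group(1) + "'";
            out.add(m.start() + ": " + g1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("(.+)*", "input")); // [0: 'input', 5: <unset>]
        System.out.println(describe("(.+)+", "input")); // [0: 'input']
    }
}
```

With (.+)+ there is no second match at all, because the group must succeed at least once and nothing is left to consume at index 5.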
