Parsing newlines with Java Scanner and regular expressions (bug?)

Date: 2021-11-12 00:50:31

I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.

Long story short, I've run into a problem when I try to match a newline with the Scanner class: matching a newline pattern through Scanner fails, almost always. But when I perform the same matching using a Matcher on the same source string, it retrieves the newline exactly as you'd expect it to. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?

FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.

Example Code:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";

    // Attempt 1: Scanner with an empty delimiter
    Scanner scan = new Scanner(sourceString);
    scan.useDelimiter("");
    int count = 0;
    while (scan.hasNext(newLinePattern)) {
        scan.next(newLinePattern);
        count++;
    }
    System.out.println("found " + count + " newlines"); // finds 7 newlines

    // Attempt 2: Matcher over the same string
    Matcher match = newLinePattern.matcher(sourceString);
    count = 0;
    while (match.find()) {
        count++;
    }
    System.out.println("found " + count + " newlines"); // finds 5 newlines

4 answers

#1


6  

Your useDelimiter() and next() combo is faulty. With useDelimiter(""), next() returns substrings of length 1, because an empty string does in fact sit between every two characters.

That is, "\r\n".equals("\r" + "" + "\n"), so "\r\n" is in fact two tokens, "\r" and "\n", delimited by "".
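
A minimal sketch of this tokenization, for illustration only: the class name EmptyDelimiterDemo and the helper tokens are hypothetical, and the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class EmptyDelimiterDemo {
    // Hypothetical helper: collects every token the Scanner yields
    // when the delimiter is the empty string.
    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = tokens("\r\n\n\r\r\n\n");
        // Each token is a single character, so "\r\n" arrives as two tokens.
        System.out.println(found.size() + " tokens"); // 7 tokens
    }
}
```

Because every token is a single character, no token pattern passed to next() can ever see "\r\n" as a unit.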

To get the Matcher-behavior, you need findWithinHorizon, which ignores delimiters.

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    int count = 0;
    while (scan.findWithinHorizon(newLinePattern, 0) != null) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines
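
As a variation on the snippet above, the strings returned by findWithinHorizon can be collected to see exactly which terminators were matched. This is only a sketch; the class name HorizonDemo and the helper terminators are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonDemo {
    static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    // Hypothetical helper: returns each line terminator in order of appearance.
    // findWithinHorizon ignores the Scanner's delimiter; horizon 0 means unbounded.
    static List<String> terminators(String source) {
        List<String> found = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            String hit;
            while ((hit = scan.findWithinHorizon(NEWLINE, 0)) != null) {
                found.add(hit);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> found = terminators("\r\n\n\r\r\n\n");
        System.out.println("found " + found.size() + " newlines"); // found 5 newlines
    }
}
```

Here "\r\n" comes back as a single two-character match, which is the Matcher-like behaviour the question was after.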

API links

  • findWithinHorizon(Pattern pattern, int horizon)

    Attempts to find the next occurrence of the specified pattern [...] ignoring delimiters [...] If no such pattern is detected then the null is returned [...] If horizon is 0, then [...] this method continues to search through the input looking for the specified pattern without bound.

Related questions

  • Scanner method to get a char
    • useDelimiter("") will tokenize into 1-length substrings

#2


3  

That is, in fact, the expected behaviour of both. The scanner primarily cares about splitting things into tokens using your delimiter. So it (lazily) takes your sourceString and sees it as the following set of tokens: \r, \n, \n, \r, \r, \n, and \n. When you then call hasNext it checks if the next token matches your pattern (which they all trivially do thanks to the ? on the \r\n?). The while loop therefore iterates over each of the 7 tokens.

On the other hand, the matcher will match the regex greedily - so it bundles the \r\ns together as you expect.

One way to emphasise the behaviour of Scanner is to change your regexp to (\\r\\n|\\n). This results in a count of 0. This is because the scanner reads the first token as \r (not \r\n), and then notices it doesn't match your pattern, so returns false when you call hasNext.
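
The two counts can be compared side by side with a small sketch like the following. The class name TokenMatchDemo and the helper countTokens are hypothetical; the loop itself is the one from the question.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenMatchDemo {
    // Hypothetical helper: with "" as the delimiter, counts how many tokens
    // in a row match the given regex, starting from the beginning of the input.
    static int countTokens(String source, String regex) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        // With the optional \n, every single-character token matches.
        System.out.println(countTokens(source, "(\\r\\n?|\\n)")); // 7
        // Without it, the very first token "\r" fails and the loop never runs.
        System.out.println(countTokens(source, "(\\r\\n|\\n)")); // 0
    }
}
```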

(Short version: the scanner tokenises using your delimiter before using your token pattern, the matcher doesn't do any form of tokenising)

#3


2  

It might be worth mentioning that your example is ambiguous. It could be:

\r
\n
\n
\r
\r
\n
\n

(seven lines)

or:

\r\n
\n
\r
\r\n
\n

(five lines)

The ? quantifier you used is greedy, which would probably make five the right answer. But because Scanner iterates over tokens (in your case individual characters, because of the delimiter you chose), the pattern is only ever applied to one character at a time, so a \n is never consumed together with the \r before it, and you arrive at the incorrect answer of seven.
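
If the five-line reading is the intended one, a line number for error reporting could be derived with a plain Matcher. The sketch below assumes that "\r\n", a lone "\r", and a lone "\n" each end exactly one line; the class name LineNumbers and the helper lineAt are hypothetical and not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineNumbers {
    static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    // Hypothetical helper: 1-based line number of the character at `offset`,
    // found by counting the terminators that end at or before that offset.
    static int lineAt(String source, int offset) {
        int line = 1;
        Matcher m = NEWLINE.matcher(source);
        while (m.find() && m.end() <= offset) {
            line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = "a\r\nb\nc\rd";
        System.out.println(lineAt(source, 0)); // 'a' is on line 1
        System.out.println(lineAt(source, 3)); // 'b' is on line 2
        System.out.println(lineAt(source, 7)); // 'd' is on line 4
    }
}
```

Rescanning from the start on every call is quadratic for large inputs; a real analyzer would more likely keep a running count as it consumes tokens.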

#4


0  

When you use the Scanner with a delimiter of "" it will produce tokens that are each one character long. This happens before your newline regex is applied. It then matches each of these characters against the newline regex; each one matches, so it produces 7 tokens. However, because it split the string into 1-character tokens, it will not group adjacent \r\n characters into one token.
