Java Scanner newline parsing with regex (Bug?)

Time: 2021-11-12 00:50:31

I'm developing a syntax analyzer by hand in Java, and I'd like to use regexes to parse the various token types. The problem is that I'd also like to be able to accurately report the current line number if the input doesn't conform to the syntax.


Long story short, I've run into a problem when I try to match a newline with the Scanner class. To be specific, when I try to match a newline with a pattern using the Scanner class, it fails. Almost always. But when I perform the same matching using a Matcher and the same source string, it retrieves the newline exactly as you'd expect it to. Is there a reason for this that I can't seem to discover, or is this a bug, as I suspect?


FYI: I was unable to find a bug in the Sun database that describes this issue, so if it is a bug, it hasn't been reported.


Example Code:

    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    scan.useDelimiter(""); // empty delimiter, as referenced in the answers
    int count = 0;
    while (scan.hasNext(newLinePattern)) {
        scan.next(newLinePattern);
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 7 newlines
    Matcher match = newLinePattern.matcher(sourceString);
    count = 0;
    while (match.find()) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines

Your useDelimiter() and next() combo is faulty. With useDelimiter(""), every call to next() returns a substring of length 1, because an empty string does in fact sit between every two characters.

That is, because "\r\n".equals("\r" + "" + "\n"), the string "\r\n" is in fact two tokens, "\r" and "\n", delimited by "".
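
The tokenisation described above can be seen directly by listing what next() returns with an empty delimiter. This is only a minimal sketch: the class name EmptyDelimiterDemo and the helpers tokens and escape are hypothetical names introduced for illustration, not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, named here only for illustration.
class EmptyDelimiterDemo {
    // Collects every token Scanner yields when the delimiter is the empty string.
    static List<String> tokens(String source) {
        List<String> result = new ArrayList<>();
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext()) {
                result.add(scan.next());
            }
        }
        return result;
    }

    // Makes carriage returns and line feeds visible when printed.
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        List<String> found = tokens("\r\n\n\r\r\n\n");
        for (String token : found) {
            System.out.println("[" + escape(token) + "]");
        }
        System.out.println(found.size() + " tokens"); // prints "7 tokens"
    }
}
```

Each token is a single character, so the leading "\r\n" never reaches the pattern as one unit.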

To get the Matcher-behavior, you need findWithinHorizon, which ignores delimiters.


    Pattern newLinePattern = Pattern.compile("(\\r\\n?|\\n)", Pattern.MULTILINE);
    String sourceString = "\r\n\n\r\r\n\n";
    Scanner scan = new Scanner(sourceString);
    int count = 0;
    // findWithinHorizon ignores delimiters; a horizon of 0 means no search limit
    while (scan.findWithinHorizon(newLinePattern, 0) != null) {
        count++;
    }
    System.out.println("found "+count+" newlines"); // finds 5 newlines

That is, in fact, the expected behaviour of both. The scanner primarily cares about splitting things into tokens using your delimiter. So it (lazily) takes your sourceString and sees it as the following set of tokens: \r, \n, \n, \r, \r, \n, and \n. When you then call hasNext it checks if the next token matches your pattern (which they all trivially do thanks to the ? on the \r\n?). The while loop therefore iterates over each of the 7 tokens.

On the other hand, the matcher will match the regex greedily - so it bundles the \r\ns together as you expect.
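
To make the bundling visible, the individual matches can be collected and printed. The following is a rough sketch under the same pattern and input as the question; the class name MatcherNewlineDemo and the method matches are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
class MatcherNewlineDemo {
    static final Pattern NEW_LINE = Pattern.compile("(\\r\\n?|\\n)");

    // Returns each newline sequence that Matcher.find() locates, in order.
    static List<String> matches(String source) {
        List<String> result = new ArrayList<>();
        Matcher match = NEW_LINE.matcher(source);
        while (match.find()) {
            result.add(match.group());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = matches("\r\n\n\r\r\n\n");
        for (String m : found) {
            // Escape CR and LF so the output is readable.
            System.out.println("[" + m.replace("\r", "\\r").replace("\n", "\\n") + "]");
        }
        System.out.println(found.size() + " matches"); // prints "5 matches"
    }
}
```

The first and fourth matches are the two-character sequence "\r\n", because the optional \n is consumed whenever it is present.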

One way to emphasise the behaviour of Scanner is to change your regexp to (\\r\\n|\\n). This results in a count of 0. This is because the scanner reads the first token as \r (not \r\n), and then notices it doesn't match your pattern, so returns false when you call hasNext.
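
A small sketch of that experiment, assuming the empty delimiter from the question; the class name StrictPatternDemo and the method countTokenMatches are hypothetical names chosen for this illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
class StrictPatternDemo {
    // Counts how many leading tokens match the pattern when the delimiter is "".
    static int countTokenMatches(String source, Pattern pattern) {
        int count = 0;
        try (Scanner scan = new Scanner(source)) {
            scan.useDelimiter("");
            while (scan.hasNext(pattern)) {
                scan.next(pattern);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String source = "\r\n\n\r\r\n\n";
        Pattern lenient = Pattern.compile("(\\r\\n?|\\n)"); // a lone \r is accepted
        Pattern strict = Pattern.compile("(\\r\\n|\\n)");   // \r must be followed by \n

        System.out.println(countTokenMatches(source, lenient)); // prints 7
        System.out.println(countTokenMatches(source, strict));  // prints 0
    }
}
```

With the strict pattern the loop stops immediately, since the very first token is a lone "\r".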

(Short version: the scanner tokenises using your delimiter before using your token pattern, the matcher doesn't do any form of tokenising)




It might be worth mentioning that your example is ambiguous. It could be:

    \r
    \n
    \n
    \r
    \r
    \n
    \n

(seven lines)

    \r\n
    \n
    \r
    \r\n
    \n

(five lines)

The ? quantifier you have used is a greedy quantifier, which would probably make five the right answer, but because Scanner iterates over tokens (in your case individual characters, due to the delimiting pattern you chose), it will match reluctantly, one character at a time, arriving at the incorrect answer of seven.




When you use the Scanner with a delimiter of "", it produces tokens that are each one character long. This happens before your newline regex is applied. It then matches each of these characters against the newline regex; each one matches, so it reports 7 tokens. However, because it has already split the string into 1-character tokens, it will not group adjacent \r\n characters into one token.



