Reading a file with the Java Scanner

Time: 2022-06-08 00:08:24

One of the lines in a Java file I'm trying to understand is shown below.

return new Scanner(file).useDelimiter("\\Z").next();

The call is expected to return everything up to "the end of the input but for the final terminator, if any", as per the java.util.regex.Pattern documentation. What actually happens is that it returns only the first 1024 characters of the file. Is this a limitation imposed by the regex Pattern matcher? Can it be overcome? For now I'm going ahead with a FileReader, but I would like to know the reason for this behaviour.

4 answers

#1


2  

Try wrapping the File object in a FileInputStream.

#2


5  

I couldn't reproduce this myself, but I think I can shed some light on what is going on.

Internally, the Scanner uses a character buffer of 1024 characters. By default it reads up to 1024 characters from your Readable, if possible, and then applies the pattern.

The problem is in your pattern: it will always match the end of the input, but that doesn't mean the end of your input stream or data. When Java applies your pattern to the buffered data, it tries to find the first occurrence of the end of input. Since 1024 characters are in the buffer, the matching engine treats position 1024 as the first match of the delimiter, and everything before it is returned as the first token.

I don't think the end-of-input anchor is valid for use in the Scanner for that reason. It could be reading from an infinite stream, after all.
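
The point about the anchor can be illustrated outside of Scanner. The sketch below is only an illustration on plain strings, not a reproduction of Scanner's internal buffering; the class and method names (EndAnchorDemo, firstEndAnchor) are hypothetical. It shows that \Z is matched relative to whatever text the matcher is given, so a 1024-character chunk of a longer input has its own "end of input" at index 1024.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows where \Z matches in a given piece of text.
final class EndAnchorDemo {

    // Returns the index of the first \Z match in the text, or -1 if none is found.
    static int firstEndAnchor(String text) {
        Matcher m = Pattern.compile("\\Z").matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // End of input with no line terminator.
        System.out.println(firstEndAnchor("abc"));   // prints 3
        // \Z matches before the final line terminator.
        System.out.println(firstEndAnchor("abc\n")); // prints 3

        // Build a 3000-character input and look only at its first 1024 characters.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sb.append('x');
        }
        String chunk = sb.substring(0, 1024);
        // The matcher only sees the chunk, so its end counts as the end of input.
        System.out.println(firstEndAnchor(chunk));   // prints 1024
    }
}
```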

#3


1  

Scanner is intended to read multiple primitives from a file. It really isn't intended to read an entire file.

If you don't want to include third-party libraries, you're better off looping over a BufferedReader that wraps a FileReader or InputStreamReader for text, or looping over a FileInputStream for binary data.
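
A minimal sketch of the BufferedReader approach for text, assuming the caller knows the file's charset. The class name WholeFileReader, the method readAll, and the 4096-character chunk size are illustrative choices, not something taken from the answer.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole text file by looping over a BufferedReader.
final class WholeFileReader {

    static String readAll(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) on exit.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            char[] chunk = new char[4096]; // chunk size is an arbitrary choice
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: WholeFileReader <file>");
            return;
        }
        // UTF-8 is an assumption here; pass the charset your file actually uses.
        System.out.print(readAll(new File(args[0]), StandardCharsets.UTF_8));
    }
}
```

Reading in character chunks rather than with readLine() keeps the original line terminators intact, which matters if the content must be returned unchanged.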

If you're OK with using a third-party library, Apache commons-io has a FileUtils class that contains the static methods readFileToString and readLines for text, and readFileToByteArray for binary data.

#4


0  

You can use the Scanner class; just specify a charset when opening the scanner, for example:

Scanner sc = new Scanner(file, "ISO-8859-1");

Java converts bytes read from the file into characters using the specified charset, which is the default one (from underlying OS) if nothing is given (source). It is still not clear to me why Scanner reads only 1024 bytes with the default one, whilst with another one it reaches the end of a file. Anyway, it works fine!
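
The answer leaves the cause open. One thing that may be worth checking, offered as an assumption rather than something established in this thread: Scanner does not propagate an IOException thrown by its source. It treats the exception as the end of input and exposes it through ioException(). If decoding with the default charset failed partway through the file, the token could be cut short without any visible error. The sketch below combines the explicit charset with that check; the class and method names (ScannerWholeFile, readAll) are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

// Hypothetical helper: reads a whole file with Scanner and an explicit charset,
// then surfaces any IOException that Scanner swallowed while reading.
final class ScannerWholeFile {

    static String readAll(File file, String charsetName) throws IOException {
        try (Scanner sc = new Scanner(file, charsetName)) {
            sc.useDelimiter("\\Z");
            // An empty file has no token, so guard the call to next().
            String content = sc.hasNext() ? sc.next() : "";
            // Scanner treats a read failure as end of input; rethrow it if present.
            IOException swallowed = sc.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
            return content;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ScannerWholeFile <file>");
            return;
        }
        // ISO-8859-1 is the charset used in the answer above.
        System.out.print(readAll(new File(args[0]), "ISO-8859-1"));
    }
}
```

Note that with the \Z delimiter the returned token excludes a final line terminator, if the file ends with one.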
