Is it safe to call File.ReadAllText() on binary files?

Time: 2022-07-20 13:26:53

I am writing a small program that iterates through all files in a directory and searches each one for a substring. It is basically something like this:

' FileName and MatchesFound are declared elsewhere in the program
Dim s As String = File.ReadAllText(FileName)
If s.Contains("Find this substring") Then
    MatchesFound += 1
End If

I also have a Regex version of this program, but it still uses File.ReadAllText() to read the files.

Should I be concerned about calling File.ReadAllText() on binary files? I don't mind getting a few false positives in the search results, but I don't want my program to crash. The MSDN docs don't list any exceptions for this method that result from being unable to read or interpret the file data.

3 answers

#1


2  

Your program won't crash. If the file is very large, it may just take up a lot of memory. ReadAllText closes the file handle before returning, so your handles are properly disposed.

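Your string will simply hold a text representation of the binary file, and most of it will probably be invalid characters. When no encoding is passed, ReadAllText looks for a byte order mark and otherwise decodes the bytes as UTF-8; bytes that are not valid in that encoding are replaced with the replacement character U+FFFD rather than raising an error. Internally the Framework stores strings as Unicode (UTF-16).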
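
To see this for yourself, here is a minimal sketch. The module and function names are made up for illustration, and it assumes the default behaviour of the ReadAllText overload that takes no encoding argument. It writes a few raw bytes to a temporary file and reads them back as text.

```vb
Imports System.IO

Module DecodeDemo
    ' Hypothetical helper: writes raw bytes to a temp file and reads them back as text.
    Function ReadBytesAsText(bytes As Byte()) As String
        Dim tempFile As String = Path.GetTempFileName()
        Try
            File.WriteAllBytes(tempFile, bytes)
            ' No encoding argument: BOM detection, then UTF-8 decoding.
            Return File.ReadAllText(tempFile)
        Finally
            File.Delete(tempFile)
        End Try
    End Function
End Module
```

Calling ReadBytesAsText(New Byte() {&H89, &H50, &H4E, &H47}) with the first four bytes of a PNG header should return a four-character String: the byte &H89 is not valid UTF-8 on its own and comes back as U+FFFD, followed by "PNG". No exception is involved.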

The only thing you should be concerned about is extremely large files, e.g. a 4 GB ISO file. If you have files that big in your directory, you should probably write a better algorithm that processes them efficiently instead of blindly calling ReadAllText.
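
One possible shape for such an algorithm, as a rough sketch rather than a tuned implementation: read the file in fixed-size chunks and carry the last few characters over to the next chunk, so a match that straddles a chunk boundary is still found. The names ChunkedSearch, ContainsText and FileContainsText, and the 65536-character buffer, are assumptions made for illustration.

```vb
Imports System
Imports System.IO

Module ChunkedSearch
    ' Hypothetical helper: True if needle occurs in the text supplied by reader.
    ' Only about bufferSize + needle.Length characters are held in memory at once.
    Function ContainsText(reader As TextReader, needle As String,
                          Optional bufferSize As Integer = 65536) As Boolean
        If needle.Length = 0 Then Return True
        Dim buffer(bufferSize - 1) As Char
        Dim carry As String = ""   ' tail of the previous chunk
        Dim count As Integer = reader.Read(buffer, 0, buffer.Length)
        While count > 0
            Dim window As String = carry & New String(buffer, 0, count)
            If window.IndexOf(needle, StringComparison.Ordinal) >= 0 Then Return True
            ' Keep needle.Length - 1 characters so a match split across chunks is seen.
            Dim keep As Integer = Math.Min(needle.Length - 1, window.Length)
            carry = window.Substring(window.Length - keep)
            count = reader.Read(buffer, 0, buffer.Length)
        End While
        Return False
    End Function

    ' Same search on a file; StreamReader decodes as UTF-8 unless it finds a BOM.
    Function FileContainsText(fileName As String, needle As String) As Boolean
        Using reader As New StreamReader(fileName)
            Return ContainsText(reader, needle)
        End Using
    End Function
End Module
```

In the loop from the question, FileContainsText(FileName, "Find this substring") would take the place of the ReadAllText and Contains pair. The Regex version would need a different approach, since a pattern has no fixed maximum match length to carry over.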

Also, before you read a file, you can check its size; if it's obvious that it's a pure binary file (for example, a 100 MB zip file), you can skip it and move on to the next one.
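
A size check like that could look roughly as follows. The 50 MB threshold and the names SizeFilter, DefaultMaxBytes and IsSmallEnough are assumptions for the sake of the example; pick a limit that fits your own data.

```vb
Imports System.IO

Module SizeFilter
    ' Assumed threshold, not a recommendation: 50 MB.
    Public Const DefaultMaxBytes As Long = 50L * 1024L * 1024L

    ' Hypothetical helper: True when the file is small enough to read in one go.
    Function IsSmallEnough(fileName As String,
                           Optional maxBytes As Long = DefaultMaxBytes) As Boolean
        ' FileInfo.Length reports the size in bytes without reading the contents.
        Return New FileInfo(fileName).Length <= maxBytes
    End Function
End Module
```

Inside the directory loop you would then write something like If Not IsSmallEnough(FileName) Then Continue For before calling ReadAllText.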

#2


1  

Your code should work. ReadAllText returns a String, so even if the file is not in a text format, you will still end up with a String.

The method is meant to throw exceptions for file-related issues, not for String format issues.

The only problem I can think of is that if you try to open a file that is too large to fit in memory, an exception will be thrown. Otherwise, your code should work just fine.
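
If you want the scan to keep going when one file cannot be read, you could wrap the read in a Try block. This is only a sketch: SafeSearch and CountMatches are invented names, and the set of exceptions caught here is my assumption about which failures are worth skipping over, not an exhaustive list.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module SafeSearch
    ' Hypothetical helper: counts files directly inside folder whose text contains needle.
    ' Files that cannot be read are added to skipped instead of stopping the scan.
    Function CountMatches(folder As String, needle As String,
                          skipped As List(Of String)) As Integer
        Dim matchesFound As Integer = 0
        For Each fileName As String In Directory.EnumerateFiles(folder)
            Try
                If File.ReadAllText(fileName).Contains(needle) Then matchesFound += 1
            Catch ex As IOException
                skipped.Add(fileName)   ' locked, removed meanwhile, or otherwise unreadable
            Catch ex As UnauthorizedAccessException
                skipped.Add(fileName)   ' no permission to read the file
            Catch ex As OutOfMemoryException
                skipped.Add(fileName)   ' too large to hold as a single String
            End Try
        Next
        Return matchesFound
    End Function
End Module
```

After the call, the skipped list tells you which files were not searched, so they can be reported rather than silently ignored.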

#3


0  

Note that ReadAllText depends on a guessed file encoding. Strings in binary files could be stored in any encoding, and the encoding cannot be guessed because the file starts with the binary format's own header. Also note that a binary file can hold correctly encoded strings in a way that keeps the reader from decoding them properly, for example because a UTF-16 string starts at an odd position in the file. And if the reader guesses UTF-8, there is even room for decoding errors that could cause the first character of the string to be decoded as garbage.
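
One way around the guessing, sketched here under the assumption that you only care about UTF-8 and little-endian UTF-16 strings, is to skip text decoding altogether: read the raw bytes and look for the byte sequence of the search term in each encoding. ByteSearch, ContainsBytes and ContainsInAnyEncoding are illustrative names, and the scan is deliberately naive.

```vb
Imports System
Imports System.Text

Module ByteSearch
    ' Naive scan: True if pattern occurs anywhere in data, at any offset (odd or even).
    Function ContainsBytes(data As Byte(), pattern As Byte()) As Boolean
        For offset As Integer = 0 To data.Length - pattern.Length
            Dim j As Integer = 0
            While j < pattern.Length AndAlso data(offset + j) = pattern(j)
                j += 1
            End While
            If j = pattern.Length Then Return True
        Next
        Return False
    End Function

    ' Hypothetical helper: looks for needle encoded as UTF-8 or as UTF-16 little-endian.
    Function ContainsInAnyEncoding(data As Byte(), needle As String) As Boolean
        Return ContainsBytes(data, Encoding.UTF8.GetBytes(needle)) OrElse
               ContainsBytes(data, Encoding.Unicode.GetBytes(needle))
    End Function
End Module
```

You would feed it File.ReadAllBytes(FileName), which has the same memory cost as ReadAllText for large files. Because the comparison works on bytes, a UTF-16 string that starts at an odd offset is found just like one at an even offset.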
