How can I verify that a file is human-readable?

Date: 2022-09-17 13:32:26

How can I make sure that a file is readable by humans?

By that I essentially mean checking whether the file is a txt, yml, doc, json file, and so on.

The issue is that, in the case where I want to perform this check, file extensions are misleading. By that I mean that a plain text file (which should be .txt) has an extension of .d, and there are various others :-(

What is the best way to verify that a file can be read by humans?

So far I have tried my luck with extensions, as follows:

// Returns true if the extension (without the leading dot) is on the allow-list.
private boolean humansCanRead(String extension) {
    switch (extension.toLowerCase()) {
    case "txt":
    case "doc":
    case "json":
    case "yml":
    case "html":
    case "htm":
    case "java":
    case "docx":
        return true;
    default:
        return false;
    }
}

But as I said, the extensions cannot be relied on.

EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I also don't mind whether the text in the file makes sense, for example if it is encoded; I don't really care at this point.

Thanks so far for all the responses! :D

2 answers

#1


1  

For some files, checking the proportion of bytes in the printable ASCII range will help. If more than 75% of the first few hundred bytes fall in that range, the file is probably 'readable'.
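
A minimal sketch of that ratio check, using only the Java standard library. The class name, the 512-byte sample size, and the decision to count tab, line feed, and carriage return as printable are assumptions made for illustration; the 75% threshold is the one suggested above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class PrintableAsciiCheck {
    // Assumed values: "first few hundred bytes" and "more than 75%".
    static final int SAMPLE_SIZE = 512;
    static final double THRESHOLD = 0.75;

    /** Fraction of the first `length` bytes that are printable ASCII or common whitespace. */
    static double printableRatio(byte[] data, int length) {
        if (length == 0) {
            return 1.0; // assumption: an empty sample counts as trivially readable
        }
        int printable = 0;
        for (int i = 0; i < length; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            boolean isPrintable = (b >= 0x20 && b <= 0x7E)
                    || b == '\t' || b == '\n' || b == '\r';
            if (isPrintable) {
                printable++;
            }
        }
        return (double) printable / length;
    }

    static boolean looksReadable(byte[] data, int length) {
        return printableRatio(data, length) > THRESHOLD;
    }

    /** Reads at most SAMPLE_SIZE bytes from the file and applies the ratio check. */
    static boolean looksReadable(Path file) throws IOException {
        byte[] buffer = new byte[SAMPLE_SIZE];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(buffer, 0, SAMPLE_SIZE); // Java 9+
            return looksReadable(buffer, read);
        }
    }
}
```

Note that this only works well for mostly-ASCII text. A file written in a non-Latin script and stored as UTF-8 or UTF-16 consists largely of bytes outside the printable ASCII range, so it would probably be rejected by this check alone.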

Some files have headers, such as the various forms of BOM on UTF files, the 0xA5EC which starts MS doc files, or the "MZ" signature at the start of an .exe, which will tell you whether the file is readable or not.
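
As a rough illustration of a header check, the sketch below looks only at the UTF-8 and UTF-16 byte order marks and the "MZ" signature. The class, enum, and method names are hypothetical, and the list of signatures is deliberately short; a real check would need many more entries.

```java
final class SignatureCheck {
    // Hypothetical classification labels for this sketch.
    enum Kind { UTF8_BOM, UTF16_BE_BOM, UTF16_LE_BOM, WINDOWS_EXECUTABLE, UNKNOWN }

    /** True if `data` begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    static Kind classify(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Kind.UTF8_BOM;
        if (startsWith(head, 0xFE, 0xFF)) return Kind.UTF16_BE_BOM;
        // Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also matches this prefix.
        if (startsWith(head, 0xFF, 0xFE)) return Kind.UTF16_LE_BOM;
        if (startsWith(head, 0x4D, 0x5A)) return Kind.WINDOWS_EXECUTABLE; // "MZ"
        return Kind.UNKNOWN;
    }
}
```

A BOM result suggests text and "MZ" suggests a binary, while `UNKNOWN` only means that none of these few signatures matched, so another check has to decide.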

A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.

Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data on which formats actually turn up, you can order the checks to look for the most common formats first.
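
One possible shape for this, sketched with the standard library only: read the first kilobyte once, then run an ordered list of checks and report the first match. The class name, the check labels, and the three example checks are all hypothetical placeholders; the point is the ordering, which a `LinkedHashMap` preserves.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

final class FirstKilobyteChecks {
    static final int CHUNK_SIZE = 1024;

    // Checks run in insertion order, so register the most common formats first.
    static final Map<String, Predicate<byte[]>> CHECKS = new LinkedHashMap<>();
    static {
        CHECKS.put("utf8-bom", c -> c.length >= 3
                && (c[0] & 0xFF) == 0xEF && (c[1] & 0xFF) == 0xBB && (c[2] & 0xFF) == 0xBF);
        CHECKS.put("windows-exe", c -> c.length >= 2 && c[0] == 'M' && c[1] == 'Z');
        CHECKS.put("ascii-text", FirstKilobyteChecks::allPrintable);
    }

    /** True if every byte is printable ASCII, tab, line feed, or carriage return. */
    static boolean allPrintable(byte[] chunk) {
        for (byte value : chunk) {
            int b = value & 0xFF;
            boolean ok = (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    /** Loads at most the first kilobyte of the file into memory. */
    static byte[] readHead(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readNBytes(CHUNK_SIZE); // Java 11+
        }
    }

    /** Returns the label of the first matching check, or "unknown". */
    static String firstMatch(byte[] chunk) {
        for (Map.Entry<String, Predicate<byte[]>> check : CHECKS.entrySet()) {
            if (check.getValue().test(chunk)) {
                return check.getKey();
            }
        }
        return "unknown";
    }
}
```

Typical use would be `FirstKilobyteChecks.firstMatch(FirstKilobyteChecks.readHead(path))`. Reordering the checks then only means changing the order of the `put` calls.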

#2


2  

In general, you cannot do that. You could use a language identification algorithm to guess whether a given text is in a language that could be spoken by humans. Since your example contains formal languages like HTML, however, you are in some deep trouble. If you really want to implement your check for (a finite set of) formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": e.g. do you include Base64?

Edit: In case you are only interested in the character set, see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
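
A minimal sketch of such a validity check with the standard `java.nio.charset` API. The class and method names are made up for illustration; the decoder is configured to report malformed input instead of silently replacing it, which is what makes the check strict.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class EncodingValidity {
    /** True if `content` decodes without errors in the given charset. */
    static boolean isValidIn(Charset charset, byte[] content) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false; // at least one byte sequence is not valid in this charset
        }
    }
}
```

For a whole file this could be called as `EncodingValidity.isValidIn(StandardCharsets.UTF_8, Files.readAllBytes(path))`. Two caveats: a single-byte charset such as ISO-8859-1 accepts every byte sequence, so the check is only meaningful for stricter encodings like UTF-8, and a binary file that happens to contain only ASCII bytes (control characters included) still passes as valid UTF-8.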
