The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files and produces different output for each.
Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
8 solutions
#1
6
The diff manual specifies that
diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
#2
11
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
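A minimal sketch of turning that into a yes/no test. The function name is_text_by_file is made up for illustration, and the result is only as good as file's own heuristics on your system:

```shell
# Hypothetical helper (not part of file itself): succeeds if file(1)
# describes "$1" with the word "text".
is_text_by_file() {
    # -b prints the description without the file name, so a file whose
    # name happens to contain "text" cannot cause a false match.
    file -b -- "$1" | grep -q 'text'
}
```

Used as `is_text_by_file somefile && echo yes || echo no`. Note that an empty file is described as "empty", so it would count as not text here.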
#3
6
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
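A rough sketch of that NUL check in shell. The function name has_no_nul and the 4096-byte default window are assumptions for illustration; the diff manual only says the number of bytes is system dependent. It relies on head -c, which is widely available but not required by POSIX:

```shell
# Hypothetical helper: succeeds if no NUL byte occurs in the first
# $2 bytes of "$1" (default 4096, an assumed window size).
has_no_nul() {
    local n nuls
    n=${2:-4096}
    # Take the prefix, delete everything except NUL bytes, count what is left.
    nuls=$(head -c "$n" -- "$1" | tr -dc '\000' | wc -c)
    [ "$nuls" -eq 0 ]
}
```

Unlike the grep-based test, this treats an empty file as text, since it contains no NUL byte at all.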
#4
3
You could try running
strings yourfile
and comparing the size of its output with the file size. I'm not totally sure, but if they are the same, the file is really a text file.
#5
1
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
#6
1
This approach uses the same criteria as grep in determining whether a file is binary or text:
is_text_file() {
    grep -qI '.' "$1"
}
grep options used:
- -q: Quiet; exit immediately with zero status if any match is found.
- -I: Process a binary file as if it did not contain matching data.
grep pattern used:
- '.': Match any single character. All files (except an empty file) will match this pattern.
Notes
- An empty file is not considered a text file according to this test.
- Symbolic links are followed.
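For illustration, here is one way the function might be used to label several files at once. The wrapper classify_files is hypothetical, and -I is a GNU/BSD grep option rather than a POSIX one. One caveat I'm fairly sure of: a file made up only of blank lines has no character for '.' to match, so it would be reported as not text as well.

```shell
# Same function as above, repeated so the snippet is self-contained.
is_text_file() {
    grep -qI '.' "$1"
}

# Hypothetical wrapper: print "text" or "not text" for each argument.
classify_files() {
    for f in "$@"; do
        if is_text_file "$f"; then
            printf '%s: text\n' "$f"
        else
            printf '%s: not text\n' "$f"
        fi
    done
}
```

Called as `classify_files *`, it prints one line per file in the current directory.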
#7
0
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The type column will show you whether it's text or binary.
#8
-1
Commands like less and grep detect it quite easily (and fast). You can have a look at their source.