How can I use GNU diff to compare UTF-16 files?

Time: 2022-02-14 10:44:45

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative?

4 solutions

#1


From the GNU diff documentation:

Handling Multibyte and Varying-Width Characters

diff, diff3 and sdiff treat each line of input as a string of unibyte characters. This can mishandle multibyte characters in some cases. For example, when asked to ignore spaces, diff does not properly ignore a multibyte space character.

Also, diff currently assumes that each byte is one column wide, and this assumption is incorrect in some locales, e.g., locales that use UTF-8 encoding. This causes problems with the -y or --side-by-side option of diff.

These problems need to be fixed without unduly affecting the performance of the utilities in unibyte environments.

The IBM GNU/Linux Technology Center Internationalization Team has proposed some patches to support internationalized diff http://oss.software.ibm.com/developer/opensource/linux/patches/i18n/diffutils-2.7.2-i18n-0.1.patch.gz. Unfortunately, these patches are incomplete and are to an older version of diff, so more work needs to be done in this area.

I never realized that myself.
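To see why a byte-oriented tool struggles with UTF-16, here is a small sketch (the sample string and the helper name `has_nul` are made up for illustration). In UTF-16 every ASCII character carries a NUL byte, and the diffutils manual says diff considers a file binary when it finds a NUL byte in the first part of the file. Splitting on the newline byte also leaves a stray NUL at the start of the next "line":

```python
import codecs

text = "ab\n"
utf8 = text.encode("utf-8")
# Little-endian UTF-16 with a byte order mark, as many Windows tools write it.
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def has_nul(data):
    """Return True if the byte string contains a NUL byte."""
    return b"\x00" in data


print(utf8)             # b'ab\n'
print(utf16)            # b'\xff\xfea\x00b\x00\n\x00'
print(has_nul(utf8))    # False
print(has_nul(utf16))   # True

# Splitting on the single byte 0x0A cuts the two-byte newline in half,
# so the second half (a NUL byte) ends up on the following "line".
print(utf16.split(b"\n"))   # [b'\xff\xfea\x00b\x00', b'\x00']
```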

It looks like Guiffy could do the job if a non-free, non-command-line tool is acceptable. I am still looking for a free command-line tool:

http://www.guiffy.com/Diff-Tool.html

#2


vimdiff works quite nicely for this purpose.

I found it while reading this Stack Overflow answer.

#3


You could maybe build something in Python with the excellent chardet: detect each file's encoding, convert the files to UTF-8, and pass the results to GNU diff.

http://chardet.feedparser.org/
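A rough sketch of that idea follows. To stay within the standard library it guesses the encoding from the byte order mark instead of calling chardet, so it is a stand-in for the detection step, not the chardet API. The helper names (`guess_encoding`, `to_utf8`, `diff_as_utf8`) are hypothetical, and the last function assumes a POSIX system with GNU `diff` on the PATH:

```python
import codecs
import subprocess
import tempfile

# BOM prefixes, checked in order. UTF-32 comes before UTF-16 because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, default="utf-8"):
    """Pick a codec from the BOM; fall back to `default` (an assumption)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def to_utf8(data):
    """Decode bytes with the guessed codec and re-encode as UTF-8, no BOM."""
    return data.decode(guess_encoding(data)).encode("utf-8")


def diff_as_utf8(path_a, path_b):
    """Convert both files to UTF-8 temp files and run `diff -u` on them."""
    temps = []
    try:
        for path in (path_a, path_b):
            with open(path, "rb") as src:
                converted = to_utf8(src.read())
            tmp = tempfile.NamedTemporaryFile(suffix=".utf8")
            temps.append(tmp)
            tmp.write(converted)
            tmp.flush()
        # diff exits with status 1 when the files differ, so the return
        # code is not treated as an error here.
        result = subprocess.run(
            ["diff", "-u", "--label", path_a, "--label", path_b,
             temps[0].name, temps[1].name],
            capture_output=True,
        )
        return result.stdout.decode("utf-8")
    finally:
        for tmp in temps:
            tmp.close()  # closing also deletes the temp file
```

Calling `print(diff_as_utf8('file1', 'file2'))` would then show an ordinary unified diff. Files without a BOM fall back to UTF-8 here, which is where a real detector such as chardet would earn its keep.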

#4


In Python, you can use difflib.HtmlDiff to create an HTML table that shows the differences between two sequences of lines, and it seems to work fine with Unicode strings (provided, of course, you read and write them with the appropriate codecs).

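>>> import difflib  # Python 3; the built-in open() handles the codec
>>> with open('file1', encoding='utf-16') as f1, open('file2', encoding='utf-16') as f2:
...     htmldiff = difflib.HtmlDiff().make_file(f1.readlines(), f2.readlines(), charset='utf-16')
...
>>> with open('diff.html', 'w', encoding='utf-16') as out:
...     _ = out.write(htmldiff)  # write() returns a character count; discard it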
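If plain-text output closer to what `diff -u` prints is preferred over an HTML table, `difflib.unified_diff` from the same module can be used. The following is a minimal sketch that assumes both files are UTF-16 with a BOM; the function name `diff_utf16` is made up for illustration:

```python
import difflib


def diff_utf16(path_a, path_b):
    """Return a unified diff of two UTF-16 text files as one string."""
    with open(path_a, encoding="utf-16") as fa:
        lines_a = fa.readlines()
    with open(path_b, encoding="utf-16") as fb:
        lines_b = fb.readlines()
    # unified_diff yields header, hunk and content lines; content lines
    # keep their trailing newline, so joining with "" is enough.
    return "".join(
        difflib.unified_diff(lines_a, lines_b, fromfile=path_a, tofile=path_b)
    )
```

`print(diff_utf16('file1', 'file2'))` prints nothing when the files have identical text, since `unified_diff` yields no lines in that case.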
