Batch-convert files of unknown encoding to UTF-8

Date: 2021-06-28 20:11:59

I need to convert some files to UTF-8 because they're being output on an otherwise UTF-8 site, and the content looks a little fugly at times.

I can either do this now, or I can do it as they're read in (through PHP, just using fopen, nothing fancy). Any suggestions welcome.

4 Answers

#1


I don't have a clear solution for PHP, but for Python I personally used the Universal Encoding Detector (chardet) library, which does a pretty good job of guessing which encoding a file is written in.

Just to get you started, here's a Python script I used to do the conversion (the original purpose was to convert a Japanese code base from a mixture of UTF-16 and Shift-JIS; I fall back to a default guess if chardet is not confident about the detected encoding):

import sys
import codecs
from chardet.universaldetector import UniversalDetector


def DetectEncoding(fileHdl):
    """Detect the encoding of an open binary file.

    Returns the chardet result dict ({'encoding': ..., 'confidence': ...})."""
    detector = UniversalDetector()
    for line in fileHdl:
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result


def ReencodeFileToUtf8(fileName, encoding):
    """Re-encode a file in place to UTF-8."""
    # TODO: This is dangerous ^^||, would need a backup option :)
    # NOTE: The 'replace' error handler tolerates erroneous characters
    with codecs.open(fileName, 'r', encoding, 'replace') as f:
        data = f.read()
    with open(fileName, 'wb') as f:
        f.write(data.encode('utf-8', 'replace'))


if __name__ == '__main__':
    # Check for arguments first
    if len(sys.argv) != 2:
        sys.exit("Invalid arguments supplied")

    fileName = sys.argv[1]
    try:
        # Open file and detect encoding
        with open(fileName, 'rb') as fileHdl:
            encResult = DetectEncoding(fileHdl)

        # Was it an empty file?
        if encResult['confidence'] == 0 and encResult['encoding'] is None:
            sys.exit("Possibly an empty file")

        # Only attempt to re-encode the file if we are confident about
        # the encoding and it is not already UTF-8
        encoding = encResult['encoding'].lower()
        if encResult['confidence'] >= 0.7:
            if encoding != 'utf-8':
                ReencodeFileToUtf8(fileName, encoding)
        else:
            # TODO: Either make a default guess and try to re-encode,
            #       or simply fail here
            pass
    except IOError:
        sys.exit('An IOError occurred')

#2


Doing it only once would improve performance and reduce the potential for future errors, but if you don't know the encoding, you cannot do a correct conversion at all.
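When you do know the source encoding up front, the one-shot batch conversion is straightforward. A minimal stdlib-only sketch (the `*.txt` glob and the Shift-JIS source encoding are illustrative assumptions, not from the question):

```python
import pathlib

def batch_convert(root, src_encoding):
    """Re-encode every *.txt file under root from src_encoding to UTF-8 in place."""
    for path in pathlib.Path(root).rglob('*.txt'):
        # Decode with the known source encoding, tolerating bad bytes,
        # then write the file back as UTF-8.
        text = path.read_bytes().decode(src_encoding, errors='replace')
        path.write_bytes(text.encode('utf-8'))
```

After a pass like this, PHP can read the files with plain fopen and no per-request conversion is needed.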

#3


My first attempt at this would be:

  1. If it is syntactically valid UTF-8, assume it's UTF-8.

  2. If it contains only bytes corresponding to valid characters in ISO 8859-1 (Latin-1), assume that.

  3. Otherwise, fail.
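The three steps above can be sketched in Python as follows. This is a rough illustration, not the answerer's code; in particular, treating the C1 control bytes 0x80–0x9F as invalid Latin-1 text is an added assumption:

```python
def guess_encoding(data: bytes) -> str:
    # 1. Syntactically valid UTF-8? Then assume UTF-8.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # 2. Only ASCII or printable Latin-1 bytes (no C1 controls in
    #    0x80-0x9F)? Then assume ISO 8859-1.
    if all(b < 0x80 or b >= 0xA0 for b in data):
        return 'iso-8859-1'
    # 3. Otherwise, fail.
    raise ValueError('could not determine encoding')
```

Note that pure-ASCII input is reported as UTF-8 here, which is harmless since ASCII is a subset of UTF-8.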

#4


Can a file contain data from different codepages?

If yes, then you can't do the batch conversion at all. You would have to know the codepage of every single substring in your file.

If not, it's possible to batch-convert one file at a time, but only if you know which codepage that file uses. So we're more or less back in the same situation as above; we've just moved the abstraction from substring scope to file scope.

So, the question you need to ask yourself is: do you have information about which codepage the data belongs to? If not, it will still look fugly.

You can always do some analysis on your data and guess the codepage, and although this might make it a little less fugly, you are still guessing, and therefore it will still be fugly :)
