How do I read a file in Python that may be saved as either ANSI or Unicode?

Time: 2021-01-04 00:09:41

I have to write a script that supports reading a file which can be saved as either Unicode or ANSI (using MS Notepad).

Nothing in the file indicates its encoding, so how can I support both formats? (Ideally a generic way of reading files without knowing the format in advance.)

2 solutions

#1


11  

MS Notepad gives the user a choice of 4 encodings, expressed in clumsy confusing terminology:

"Unicode" is UTF-16, written little-endian. "Unicode big endian" is UTF-16, written big-endian. In both UTF-16 cases, this means that the appropriate BOM will be written. Use utf-16 to decode such a file.
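
As a rough illustration of why the single codec name utf-16 covers both Notepad options, here is a minimal Python 3 sketch. The sample text and variable names are made up for the example; the byte strings are built by hand to mimic what Notepad writes rather than taken from a real file.

```python
import codecs

text = "àáâ"

# Notepad's "Unicode" option: BOM FF FE followed by little-endian code units.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
# Notepad's "Unicode big endian" option: BOM FE FF followed by big-endian code units.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the byte order from it,
# and leaves the BOM out of the decoded text.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

print(decoded_little == decoded_big == text)
```

Decoding with utf-16-le or utf-16-be instead would work only for the matching byte order, and would leave the BOM in the text as a leading U+FEFF character.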

"UTF-8" is UTF-8; Notepad explicitly writes a "UTF-8 BOM". Use utf-8-sig to decode such a file.
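
To show what utf-8-sig buys you over plain utf-8, here is a small Python 3 sketch. The bytes are constructed by hand to imitate a Notepad "UTF-8" file, and the variable names are only illustrative.

```python
import codecs

# What Notepad's "UTF-8" option writes: the 3-byte BOM EF BB BF, then the text.
raw = codecs.BOM_UTF8 + "àáâ".encode("utf-8")

# Plain utf-8 decodes the BOM as a real character, U+FEFF, at the start.
with_plain_utf8 = raw.decode("utf-8")

# utf-8-sig strips the BOM if it is present...
with_sig = raw.decode("utf-8-sig")

# ...and is also happy when there is no BOM at all.
no_bom = "àáâ".encode("utf-8").decode("utf-8-sig")

print(repr(with_plain_utf8), repr(with_sig), repr(no_bom))
```

So utf-8-sig is a safe choice for reading UTF-8 text whether or not the writer added a BOM.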

"ANSI" is a shocker. This is MS terminology for "whatever the default legacy encoding is on this computer".

Here is a list of Windows encodings that I know of and the languages/scripts that they are used for:

cp874  Thai
cp932  Japanese 
cp936  Simplified Chinese (P.R. China, Singapore)
cp949  Korean 
cp950  Traditional Chinese (*, *, Macao(?))
cp1250 Central and Eastern Europe 
cp1251 Cyrillic (Belarusian, Bulgarian, Macedonian, Russian, Serbian, Ukrainian)
cp1252 Western European languages
cp1253 Greek 
cp1254 Turkish 
cp1255 Hebrew 
cp1256 Arabic script
cp1257 Baltic languages 
cp1258 Vietnamese
cp???? languages/scripts of India  

If the file has been created on the computer where it is being read, then you can obtain the "ANSI" encoding by locale.getpreferredencoding(). Otherwise if you know where it came from, you can specify what encoding to use if it's not UTF-16. Failing that, guess.
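
The "use the local default unless told otherwise" rule can be sketched in Python 3 roughly like this. The function name ansi_encoding and its override parameter are hypothetical, invented for this example. Note also that on some setups (for instance Python's UTF-8 mode) locale.getpreferredencoding() may not report the legacy code page, so treat the result as a best guess.

```python
import codecs
import locale

def ansi_encoding(override=None):
    """Return the codec name to use for an "ANSI" file.

    `override` is a hypothetical parameter: pass e.g. "cp1251" when you
    know the file came from a machine with a different legacy code page.
    """
    name = override or locale.getpreferredencoding(False)
    # codecs.lookup() raises LookupError for unknown names and
    # normalises the spelling of known ones.
    return codecs.lookup(name).name

print(ansi_encoding())          # whatever this machine reports
print(ansi_encoding("cp1251"))  # explicit choice for a file from elsewhere
```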

Be careful using codecs.open() to read files on Windows. The docs say: """Note Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.""" This means that your lines will end in \r\n and you will need/want to strip those off.
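
For readers on Python 3, the built-in open() accepts an encoding argument and does translate line endings by default, so the \r\n issue mostly goes away. The sketch below is an illustration under that assumption, not part of the original answer; it writes a throwaway temp file and reads it back both ways.

```python
import os
import tempfile

# Two lines with Windows line endings, encoded as UTF-16 with a BOM.
payload = "first\r\nsecond\r\n".encode("utf-16")

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(payload)

    # Default newline handling (newline=None) translates \r\n to \n on read.
    with open(path, "r", encoding="utf-16") as f:
        translated = f.readlines()

    # newline="" turns translation off, so the \r\n endings come through
    # untouched, much like the codecs.open() behaviour described above.
    with open(path, "r", encoding="utf-16", newline="") as f:
        untranslated = f.readlines()
finally:
    os.remove(path)

print(translated)
print(untranslated)
```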

Putting it all together:

Sample text file, saved with all 4 encoding choices, looks like this in Notepad:

The quick brown fox jumped over the lazy dogs.
àáâãäå

Here is some demo code:

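# Python 2 demo script (the output shown below was produced with Python 2.7).
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    # Only the first 3 bytes are needed to recognise any of Notepad's BOMs.
    with open(filepath, 'rb') as f:
        data = f.read(3)
    # FF FE = UTF-16 little-endian BOM, FE FF = UTF-16 big-endian BOM.
    if data[:2] in ('\xff\xfe', '\xfe\xff'):
        return 'utf-16'
    # u''.encode('utf-8-sig') is exactly the 3-byte UTF-8 BOM, EF BB BF.
    if data == u''.encode('utf-8-sig'):
        return 'utf-8-sig'
    # presumably "ANSI"
    return default_ansi_encoding or locale.getpreferredencoding()

if __name__ == "__main__":
    import sys, glob, codecs
    defenc = sys.argv[1]  # pass "" to fall back to the locale's encoding
    for fpath in glob.glob(sys.argv[2]):
        print
        print (fpath, defenc)
        with open(fpath, 'rb') as f:
            print "raw:", repr(f.read())
        enc = guess_notepad_encoding(fpath, defenc)
        print "guessed encoding:", enc
        # codecs.open() does no newline translation, so strip \r\n by hand.
        with codecs.open(fpath, 'r', enc) as f:
            for lino, line in enumerate(f, 1):
                print lino, repr(line)
                print lino, repr(line.rstrip('\r\n'))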

And here is the output when run in a Windows "Command Prompt" window using the command \python27\python read_notepad.py "" t1-*.txt:

('t1-ansi.txt', '')
raw: 'The quick brown fox jumped over the lazy dogs.\r\n\xe0\xe1\xe2\xe3\xe4\xe5
\r\n'
guessed encoding: cp1252
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-u8.txt', '')
raw: '\xef\xbb\xbfThe quick brown fox jumped over the lazy dogs.\r\n\xc3\xa0\xc3
\xa1\xc3\xa2\xc3\xa3\xc3\xa4\xc3\xa5\r\n'
guessed encoding: utf-8-sig
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-uc.txt', '')
raw: '\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w
\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\x00e
\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\x00.
\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n\x00'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'

('t1-ucb.txt', '')
raw: '\xfe\xff\x00T\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\
x00w\x00n\x00 \x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00e\x00d\x00 \x00o\x00v\
x00e\x00r\x00 \x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00s\
x00.\x00\r\x00\n\x00\xe0\x00\xe1\x00\xe2\x00\xe3\x00\xe4\x00\xe5\x00\r\x00\n'
guessed encoding: utf-16
1 u'The quick brown fox jumped over the lazy dogs.\r\n'
1 u'The quick brown fox jumped over the lazy dogs.'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5\r\n'
2 u'\xe0\xe1\xe2\xe3\xe4\xe5'
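
The demo above is Python 2. A possible Python 3 port of the same idea might look like the sketch below. It keeps the guess_notepad_encoding name from the demo; read_notepad_lines is a hypothetical helper added for illustration, and the built-in open() is swapped in for codecs.open() so that line endings are translated automatically.

```python
import codecs
import locale

def guess_notepad_encoding(filepath, default_ansi_encoding=None):
    """Guess the encoding of a Notepad-saved file from its first bytes."""
    with open(filepath, "rb") as f:
        head = f.read(3)
    # FF FE or FE FF: one of the two UTF-16 byte order marks.
    if head[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    # EF BB BF: the UTF-8 signature that Notepad writes.
    if head == codecs.BOM_UTF8:
        return "utf-8-sig"
    # No BOM: presumably "ANSI", so fall back to the caller's choice
    # or to this machine's preferred encoding.
    return default_ansi_encoding or locale.getpreferredencoding(False)

def read_notepad_lines(filepath, default_ansi_encoding=None):
    """Hypothetical helper: return the file's lines without line endings."""
    enc = guess_notepad_encoding(filepath, default_ansi_encoding)
    # Text mode translates \r\n to \n, so only "\n" needs stripping.
    with open(filepath, "r", encoding=enc) as f:
        return [line.rstrip("\n") for line in f]
```

A call such as read_notepad_lines("t1-ansi.txt", "cp1252") would then return plain str lines regardless of which of the four Notepad options was used, as long as the ANSI guess is right.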

Things to be aware of:

(1) "mbcs" is a file-system pseudo-encoding which has no relevance at all to decoding the contents of files. On a system where the default encoding is cp1252, it behaves like latin1 (aarrgghh!!); see below:

>>> all_bytes = "".join(map(chr, range(256)))
>>> u1 = all_bytes.decode('cp1252', 'replace')
>>> u2 = all_bytes.decode('mbcs', 'replace')
>>> u1 == u2
False
>>> [(i, u1[i], u2[i]) for i in xrange(256) if u1[i] != u2[i]]
[(129, u'\ufffd', u'\x81'), (141, u'\ufffd', u'\x8d'), (143, u'\ufffd', u'\x8f')
, (144, u'\ufffd', u'\x90'), (157, u'\ufffd', u'\x9d')]
>>>
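
The session above needs Python 2 on Windows, since mbcs only exists there. As a portable stand-in, this Python 3 sketch compares cp1252 against latin-1 directly, on the assumption (taken from the session's output) that mbcs treats those five bytes the way latin-1 does.

```python
all_bytes = bytes(range(256))

# With errors="replace", bytes that cp1252 does not define become U+FFFD.
as_cp1252 = all_bytes.decode("cp1252", "replace")
# latin-1 defines all 256 byte values, so it never fails or replaces.
as_latin1 = all_bytes.decode("latin-1")

# Positions that cp1252 leaves undefined.
undefined_in_cp1252 = [i for i, ch in enumerate(as_cp1252) if ch == "\ufffd"]

print([hex(i) for i in undefined_in_cp1252])
print([repr(as_latin1[i]) for i in undefined_in_cp1252])
```

The practical consequence is that a decoder behaving like latin-1 will never raise an error, so it cannot tell you that a byte was invalid for cp1252.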

(2) chardet is very good at detecting encodings based on non-Latin scripts (Chinese/Japanese/Korean, Cyrillic, Hebrew, Greek) but not much good at Latin-based encodings (Western/Central/Eastern Europe, Turkish, Vietnamese) and doesn't grok Arabic at all.

#2


3  

Notepad saves Unicode files with a byte order mark. This means that the first bytes of the file will be:

  • EF BB BF -- UTF-8
  • FF FE -- "Unicode" (actually UTF-16 little-endian, looks like)
  • FE FF -- "Unicode big-endian" (looks like UTF-16 big-endian)

Other text editors may or may not have the same behavior, but if you know for sure Notepad is being used, this will give you a decent heuristic for auto-selecting the encoding. All these sequences are valid in the ANSI encoding as well, however, so it is possible for this heuristic to make mistakes. It is not possible to guarantee that the correct encoding is used.
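
To make the "valid in ANSI as well" caveat concrete, here is a short Python 3 sketch that decodes each BOM as cp1252, taken here as one example of an ANSI code page. An ANSI file that genuinely starts with these characters would be misdetected by the BOM check.

```python
import codecs

# Each BOM is also a legal (if unlikely) run of cp1252 characters.
for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
    print(bom.hex(), "->", repr(bom.decode("cp1252")))

utf8_bom_as_ansi = codecs.BOM_UTF8.decode("cp1252")
utf16le_bom_as_ansi = codecs.BOM_UTF16_LE.decode("cp1252")
utf16be_bom_as_ansi = codecs.BOM_UTF16_BE.decode("cp1252")
```

Text beginning with "ÿþ" or "ï»¿" is rare in practice, which is why the heuristic is usually good enough, but it remains a heuristic.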
