In Python 2.7 I have this:
# -*- coding: utf-8 -*-
from nltk.corpus import abc
with open("abc.txt", "w") as f:
    f.write(" ".join(i.words()))
I then try to read in this document in Python 3:
with open("abc.txt", 'r', encoding='utf-8') as f:
    f.read()
only to get:
File "C:\Python32\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 633096: invalid continuation byte
What have I done wrong? Notepad++ seems to indicate that the document is encoded as UTF-8. Even if I convert the document to UTF-8 with Notepad++, I still get this error in Python 3, which is strange since I have read many other UTF-8 encoded documents without any problems.
2 solutions
#1
My guess is that your input is encoded as ISO-8859-2, which contains Ă as 0xC3. Check the encoding of your input file.
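To make the guess above concrete, here is a small Python 3 sketch of how a lone 0xC3 byte behaves under a few codecs. The sample bytes and the helper name `try_decode` are made up for illustration and are not taken from the actual file.

```python
# A made-up sample: 0xC3 followed by an ASCII space.
# In UTF-8, 0xC3 opens a two-byte sequence, so the next byte must be a
# continuation byte (0x80-0xBF). A space (0x20) is not one.
data = b"caf\xc3 "


def try_decode(raw, encoding):
    """Return the decoded text, or None if `raw` is not valid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(data, "utf-8")         # fails: invalid continuation byte
latin2_result = try_decode(data, "iso-8859-2")  # 0xC3 maps to U+0102 (A with breve)
latin1_result = try_decode(data, "latin-1")     # 0xC3 maps to U+00C3 (A with tilde)

print(repr(utf8_result))
print(repr(latin2_result))
print(repr(latin1_result))
```

Note that single-byte codecs such as ISO-8859-2 and Latin-1 accept almost any byte, so a successful decode does not prove the guess is right. You still have to look at the resulting text and judge whether it makes sense.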
#2
Based on the fact that your piece of Python 2.7 doesn't throw an exception, I would infer that i.words() returns a sequence of bytestrings. These are unlikely to be encoded in UTF-8 - I'd guess maybe Latin-1 or something like that. You then write them to the file. No encoding happens at this point.
You probably need to convert these to unicode strings, for which you'll need to know their existing encoding, and then you'll need to encode these as UTF-8 when writing the file.
For example:
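# -*- coding: utf-8 -*-
from nltk.corpus import abc
import codecs

# codecs.open encodes unicode strings to UTF-8 as they are written
with codecs.open("abc.txt", "w", "utf-8") as f:
    # decode each bytestring (assumed here to be Latin-1) to unicode before joining
    f.write(u" ".join(codecs.decode(word, "latin-1") for word in i.words()))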
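If regenerating the file from Python 2.7 is not convenient, the same decode-then-encode idea can be applied to the existing file from Python 3. This is only a sketch: the function name `reencode`, the file names, and the sample bytes are hypothetical, and Latin-1 is still just the guess made above, not a confirmed fact about the corpus.

```python
import os
import tempfile


def reencode(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite src_path as UTF-8, assuming it is currently in src_encoding."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Demonstration with a throwaway file standing in for abc.txt.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "abc_latin1.txt")
dst_path = os.path.join(workdir, "abc_utf8.txt")
with open(src_path, "wb") as f:
    f.write(b"caf\xe9 na\xefve")  # Latin-1 bytes for a short accented sample

reencode(src_path, dst_path)

# The converted file now reads cleanly as UTF-8.
with open(dst_path, "r", encoding="utf-8") as f:
    roundtrip = f.read()
print(roundtrip)
```

If the text that comes out contains obviously wrong characters, the assumed source encoding is probably wrong and another one should be tried.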
Some further notes, in case there's any confusion:
- The -*- coding: utf-8 -*- line refers to the encoding used to write the Python script itself. It has no effect on the input or output of that script.
- In Python 2.7, there are two kinds of strings: bytestrings, which are sequences of bytes with an unspecified encoding, and unicode strings, which are sequences of unicode code points. Bytestrings are most common and are what you get if you use the regular "abc" string literal syntax. Unicode strings are what you get when you use the u"abc" syntax.
- In Python 2.7, if you just use the open function to open a file and write bytestrings to it, no encoding will happen. The bytes of the bytestring are written straight into the file. If you try to write unicode strings to it, you'll get an exception if they contain characters that can't be encoded by the default (ASCII) codec.
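The distinction between text and bytes in the notes above can be seen directly in Python 3, where the two types are called str and bytes. The following is a rough illustration with a made-up sample word; it shows that the same text turns into different bytes depending on the codec, and that the ASCII codec rejects anything outside its range.

```python
text = "caf\u00e9"  # a text string: four code points, the last one is e-acute

as_utf8 = text.encode("utf-8")      # e-acute becomes two bytes: 0xC3 0xA9
as_latin1 = text.encode("latin-1")  # e-acute becomes one byte: 0xE9

# The ASCII codec cannot represent e-acute at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(text), len(as_utf8), len(as_latin1))
print(as_utf8, as_latin1, ascii_ok)
```

Bytes written with one codec and read back with another are how errors like the one in the question arise, which is why the encoding has to be stated on both the writing and the reading side.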