'utf-8' codec can't decode a byte in a file in Python3.4, but the same file reads fine in Python2.7

I was reading a file in Python2.7, and it was read perfectly. The problem is that when I run the same program in Python3.4, this error appears:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program on Windows (with Python3.4), the error doesn't appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

The code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

2 solutions

#1

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.
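For comparison, Python3 can reproduce the Python2 behaviour by opening the file in binary mode. This is only a minimal sketch: the temporary file and its contents are made up for illustration and are not the asker's data.

```python
import os
import tempfile

# Hypothetical sample: the second line contains the non-UTF-8 byte 0xf2.
sample = b'Codi;Codi_lloc_anonim;Nom\n1;A01;Hist\xf2ria\n'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# 'rb' yields bytes objects, so nothing is decoded and no decoding error can occur.
with open(path, 'rb') as f:
    lines = [line for line in f]

os.remove(path)
print(lines)
```

Binary mode avoids the error but leaves the decoding to you, so it is mainly useful for inspecting the raw contents.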

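In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects: bytes decoded according to some encoding. When no encoding is given, open() uses the locale's preferred encoding (locale.getpreferredencoding(False)), which is typically utf-8 on Linux but a code page such as cp1252 on Windows. That would explain why the error does not appear on Windows.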
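The encoding that open() falls back to in text mode can be checked directly. It is taken from the locale, so the two values printed by this small sketch may differ from one machine to another:

```python
import locale
import sys

# Encoding that open() uses in text mode when encoding= is omitted.
print(locale.getpreferredencoding(False))

# Default for str.encode() / bytes.decode(); this is 'utf-8' in Python3.
print(sys.getdefaultencoding())
```

Running this on both the Windows machine and the machine where the error occurs should show whether the two use different defaults.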

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
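The failure can be reproduced on a few bytes. In the sketch below the sample bytes are hypothetical (chosen only because they contain 0xf2). It also shows that two single-byte encodings can both decode the byte without error while producing different characters:

```python
import unicodedata

# Hypothetical sample bytes; 0xf2 followed by 'r' is not valid UTF-8.
raw = b'Hist\xf2ria'

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err)

# Both of these encodings accept 0xf2, but map it to different characters.
as_latin1 = raw.decode('latin-1')
as_cp1250 = raw.decode('cp1250')
print(unicodedata.name(as_latin1[4]))
print(unicodedata.name(as_cp1250[4]))
```

A decode that succeeds is therefore not proof that the encoding is right; the resulting text still has to be checked by eye.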

To fix the problem you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:

If you do not know the correct encoding, you could run this program to simply try all the encodings known to Python. If you are lucky there will be an encoding which turns the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
# use a separate name so the imported encodings module is not shadowed
candidates = all_encodings()
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass

#2

OK, I did what @unutbu told me. The result was a lot of encodings that appeared to work, one of which is cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my files.
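Putting the pieces together, the reading function from the question could take the encoding as a parameter. This is a sketch under stated assumptions rather than the asker's final code: the encoding argument, the with block, the setdefault call and the stripping of the trailing newline are additions, and the sample file and its latin-1 encoding are hypothetical.

```python
import os
import tempfile

def lectdict(filename, colkey, colvalue, encoding):
    """Map each value in column colkey to the list of colvalue entries seen for it."""
    D = {}
    # Passing encoding explicitly makes the result independent of the platform default.
    with open(filename, 'r', encoding=encoding) as f:
        for line in f:
            if line == '\n':
                continue
            # Stripping the newline is an addition; the original kept it in the last field.
            fields = line.rstrip('\n').split(';')
            D.setdefault(fields[colkey], []).append(fields[colvalue])
    return D

# Hypothetical sample file, written as latin-1 bytes.
sample = b'Codi;Codi_lloc_anonim;Nom\n1;A01;Hist\xf2ria\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

Traduccio = lectdict(path, 1, 2, encoding='latin-1')
os.remove(path)
print(sorted(Traduccio))
```

For the real file, pass whichever encoding turns out to be correct (cp1250 in the answer above) in place of the sample's latin-1.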
