UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

Time: 2021-06-05 20:19:51

I have an Excel spreadsheet that I'm reading in that contains some £ signs.

When I try to read it in using the xlrd module, I get the following error:

x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.

How can I fix this, and read the £ signs in correctly?

--- UPDATE ---

Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.

If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):

for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
 cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)

I get the same error even if I uncomment the Latin-1 line.

6 Solutions

#1


9  

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default codec, ascii, comes up and fails). In your text you then say "if I rewrite this to x.encode"... which seems to imply that you do know x is Unicode.

So what is it you're doing -- and what is it you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?

I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).

If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
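
A minimal sketch of that advice, assuming a made-up value and a temporary output file (neither comes from the question): the text stays Unicode inside the program and is only encoded when it is written out. It uses io.open so the same lines should also run on Python 3, where str plays the role of Python 2's unicode.

```python
import io
import os
import tempfile

# Hypothetical cell value, standing in for what xlrd returns: a Unicode string.
text = u'\xa3100'

# Encode only at the output boundary; io.open encodes on write.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='latin-1') as f:
    f.write(text)

# The file on disk holds Latin-1 bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Decode only at the input boundary to get Unicode back.
with io.open(path, 'r', encoding='latin-1') as f:
    round_tripped = f.read()
```

Here raw is the single-byte Latin-1 form of the pound sign followed by the digits, and round_tripped compares equal to the original text.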

#2


19  

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.

Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:

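import unicodecsv

# User is presumably the answerer's Django model; any iterable of rows works.
# Open in binary mode: unicodecsv writes encoded bytes (UTF-8 by default).
with open('users.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8')
    for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
        w.writerow(user)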

#3


10  

For what it's worth: I'm the author of xlrd.

Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)

You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
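
The garbling described above can be reproduced in a few lines. This is only an illustrative sketch with a one-character sample, not the asker's data: the pound sign is encoded as UTF-8 and then read back as if it were Latin-1.

```python
pound = u'\xa3'                          # POUND SIGN, U+00A3

utf8_bytes = pound.encode('utf-8')       # two bytes: 0xC2 0xA3
misread = utf8_bytes.decode('latin-1')   # each byte becomes its own character
correct = utf8_bytes.decode('utf-8')     # decoding with the matching codec

# misread is two characters long (A-circumflex followed by the pound sign),
# which is the typical "garbled pound" seen on screen.
```

Decoding with the same codec that was used for encoding gives the original character back; decoding with a different one silently produces the extra character.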

You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.

Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.

It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
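
One way to check which characters a given encoding cannot represent is to try each character in turn. The helper name unencodable and the sample string below are hypothetical, chosen only to illustrate the point about latin1 versus cp1252; this is a sketch, not part of xlrd.

```python
def unencodable(text, encoding):
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


# Made-up sample containing a POUND SIGN (U+00A3) and a BULLET (U+2022).
sample = u'\xa3 5 \u2022 item'

latin1_bad = unencodable(sample, 'latin-1')   # the bullet is reported
cp1252_bad = unencodable(sample, 'cp1252')    # nothing is reported
```

With this sample, latin-1 rejects only the bullet, while cp1252 can represent both characters.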

#4


5  

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

Look closely: You got a Unicode***Encode***Error calling the decode method.

The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.

In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as

x = x.encode('ascii').decode("ISO-8859-1")

which fails because x contains a non-ASCII character.
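
The failing half of that expression can be shown on its own. This sketch spells out the ASCII encode explicitly (the step Python 2 performs implicitly), so it should behave the same way on Python 3, where unicode objects no longer have a decode method at all.

```python
x = u'\xa3'   # the pound sign, as xlrd would return it

try:
    # The step that Python 2 inserts before the requested decode.
    x.encode('ascii')
    message = None
except UnicodeEncodeError as exc:
    # An *Encode* error, even though the original code only asked to decode.
    message = str(exc)
```

The captured message ends with "ordinal not in range(128)", matching the traceback in the question.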

Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.

for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:

  • Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
  • Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
  • Replace the bullets with a Latin-1 character like * or ·.
  • Use a character encoding that does contain all the characters you need.

These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
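
The four options listed above can be compared side by side. The row below is made up for illustration; each list comprehension mirrors the shape of the one in the question's loop.

```python
# Hypothetical row: one cell with a pound sign, one with a bullet.
row = [u'\xa3 5', u'\u2022 item']

# Option 1: silently drop anything Latin-1 cannot represent.
ignored = [x.encode('latin-1', 'ignore') for x in row]

# Option 2: substitute a question mark for unrepresentable characters.
replaced = [x.encode('latin-1', 'replace') for x in row]

# Option 3: swap the bullet for a Latin-1 character before encoding.
swapped = [x.replace(u'\u2022', u'*').encode('latin-1') for x in row]

# Option 4: use an encoding that covers every character, such as UTF-8.
utf8 = [x.encode('utf-8') for x in row]
```

The first three keep the output in Latin-1 at the cost of losing or altering the bullet; only the UTF-8 version preserves the data exactly, with the pound sign taking two bytes and the bullet three.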

#5


2  

xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.

When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.

>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£

# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'

# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>

#6


0  

Working with xlrd, I had a line ...xl_data.find(str(cell_value))... which gave the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". None of the suggestions in the forums helped with my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose the problem is the str() call: in Python 2 it implicitly encodes the unicode value from xlrd as ASCII, whereas passing the value through unchanged avoids that.
