UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

Time: 2020-12-08 20:21:25

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.

The problem is that the error is not always reproducible; it sometimes works with some pages, and sometimes, it barfs by throwing a UnicodeEncodeError. I have tried just about everything I can think of, and yet I have not found anything that works consistently without throwing some kind of Unicode-related error.

One of the sections of code that is causing problems is shown below:

agent_telno = agent.find('div', 'agent_contact_number')
agent_telno = '' if agent_telno is None else agent_telno.contents[0]
p.agent_info = str(agent_contact + ' ' + agent_telno).strip()

Here is a stack trace produced on SOME strings when the snippet above is run:

Traceback (most recent call last):
  File "foobar.py", line 792, in <module>
    p.agent_info = str(agent_contact + ' ' + agent_telno).strip()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I suspect that this is because some pages (or more specifically, pages from some of the sites) may be encoded, whilst others may be unencoded. All the sites are based in the UK and provide data meant for UK consumption - so there are no issues relating to internationalization or dealing with text written in anything other than English.

Does anyone have any ideas as to how to solve this so that I can CONSISTENTLY fix this problem?

17 solutions

#1


972  

You need to read the Python Unicode HOWTO. This error is the very first example.

Basically, stop using str to convert from unicode to encoded text / bytes.

Instead, properly use .encode() to encode the string:

p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()

or work entirely in unicode.
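
For instance, a minimal sketch of the "work entirely in unicode" approach (the sample values and file name below are assumptions, reusing the question's variable names):

# -*- coding: utf-8 -*-
# Keep the text as unicode end-to-end; encode once, at the output boundary.
agent_contact = u'John Smith'     # assumed sample values
agent_telno = u'0123\xa0456789'   # contains U+00A0, the character from the error

agent_info = u' '.join((agent_contact, agent_telno)).strip()   # still unicode, no error

with open('agents.txt', 'w') as f:
    f.write(agent_info.encode('utf-8') + '\n')                 # encode when writing out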

#2


348  

This is a classic python unicode pain point! Consider the following:

a = u'bats\u00E0'
print a
 => batsà

All good so far, but if we call str(a), let's see what happens:

str(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

Oh dip, that's not gonna do anyone any good! To fix the error, encode the bytes explicitly with .encode and tell python what codec to use:

a.encode('utf-8')
 => 'bats\xc3\xa0'
print a.encode('utf-8')
 => batsà

Voil\u00E0!

The issue is that when you call str(), python uses the default character encoding to try and encode the bytes you gave it, which in your case are sometimes representations of unicode characters. To fix the problem, you have to tell python how to deal with the string you give it by using .encode('whatever_unicode'). Most of the time, you should be fine using utf-8.
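
To see which codec str() falls back to, a quick Python 2 check (a small illustration added here, not part of the original answer):

import sys

print sys.getdefaultencoding()   # typically 'ascii' on Python 2

a = u'bats\u00e0'
try:
    str(a)                       # uses the default codec printed above
except UnicodeEncodeError as e:
    print e                      # same error as in the question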

For an excellent exposition on this topic, see Ned Batchelder's PyCon talk here: http://nedbatchelder.com/text/unipain.html

#3


145  

I found an elegant workaround that removes the offending symbols and keeps the result a string, as follows:

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')

It's important to notice that using the ignore option is dangerous because it silently drops any unicode (and internationalization) support from the code that uses it, as seen here:

>>> 'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
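
If you would rather keep a visible placeholder than drop characters silently, the standard 'replace' error handler (my addition, not part of the original answer) substitutes a '?' instead:

# unencodable characters become '?' rather than silently disappearing
print(u'City: Malm\xf6'.encode('ascii', 'replace').decode('ascii'))
# City: Malm?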

#4


88  

Well, I tried everything, but it did not help; after googling around I figured out the following, and it helped. Python 2.7 is in use.

# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
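
After those four lines run (Python 2.7 assumed), implicit conversions that previously failed go through UTF-8; reload(sys) is needed because site.py removes setdefaultencoding at interpreter startup. For example:

print str(u'voil\xe0')   # raises UnicodeEncodeError without the hack,
                         # prints voilà (UTF-8 bytes) once the default is 'utf8'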

#5


68  

A subtle problem that causes even print to fail is having your environment variables set wrong, e.g. here LC_ALL is set to "C". In Debian they discourage setting it: see the Debian wiki on Locale.

$ echo $LANG
en_US.utf8
$ echo $LC_ALL 
C
$ python -c "print (u'voil\u00e0')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
$ export LC_ALL='en_US.utf8'
$ python -c "print (u'voil\u00e0')"
voilà
$ unset LC_ALL
$ python -c "print (u'voil\u00e0')"
voilà

#6


23  

For me, what worked was:

BeautifulSoup(html_text, from_encoding="utf-8")
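
A slightly fuller sketch of the same idea (the URL and the bs4/urllib2 usage are illustrative assumptions); from_encoding tells BeautifulSoup how to decode the raw bytes instead of letting it guess:

from bs4 import BeautifulSoup
import urllib2

html_text = urllib2.urlopen('http://example.com').read()   # raw bytes from the page
soup = BeautifulSoup(html_text, from_encoding="utf-8")      # decode them as UTF-8
agent_telno = soup.find('div', 'agent_contact_number')      # now yields unicode text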

Hope this helps someone.

#7


22  

I've actually found that in most of my cases, just stripping out those characters is much simpler:

s = mystring.decode('ascii', 'ignore')
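
For example (Python 2; mystring here is an assumed byte string holding UTF-8 bytes):

mystring = 'Agent tel:\xc2\xa0012345'    # UTF-8 bytes containing a non-breaking space
s = mystring.decode('ascii', 'ignore')   # the non-ASCII bytes are simply dropped
print repr(s)                            # u'Agent tel:012345'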

#8


16  

Add the line below at the beginning of your script (or as the second line):

# -*- coding: utf-8 -*-

That's the declaration of the Python source-code encoding. More info in PEP 263.
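
A minimal file sketch using the declaration (an illustration, assuming Python 2):

# -*- coding: utf-8 -*-
# The declaration above tells the interpreter how to decode this source
# file's bytes, so the non-ASCII literal below is legal.
greeting = u'voilà'
print greeting.encode('utf-8')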

#9


13  

The problem is that you're trying to print a unicode character, but your terminal doesn't support it.

You can try installing the language-pack-en package to fix that:

sudo apt-get install language-pack-en

which provides English translation data updates for all supported packages (including Python). Install a different language package if necessary (depending on which characters you're trying to print).

On some Linux distributions it's required in order to make sure that the default English locales are set up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it than to configure it manually.

Then when writing the code, make sure you use the right encoding in your code.

For example:

open(foo, encoding='utf-8')

If you still have a problem, double-check your system configuration, such as:

  • Your locale file (/etc/default/locale), which should have e.g.

    LANG="en_US.UTF-8"
    LC_ALL="en_US.UTF-8"
    
  • Value of LANG/LC_CTYPE in shell.

  • Check which locale your shell supports by:

    locale -a | grep "UTF-8"
    

Demonstrating the problem and the solution in a fresh VM:

  1. Initialize and provision the VM (e.g. using vagrant):

    vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
    

    See: available Ubuntu boxes.

  2. Printing unicode characters (such as the trade mark sign ™):

    $ python -c 'print(u"\u2122");'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
    
  3. Now installing language-pack-en:

    $ sudo apt-get -y install language-pack-en
    The following extra packages will be installed:
      language-pack-en-base
    Generating locales...
      en_GB.UTF-8... /usr/sbin/locale-gen: done
    Generation complete.
    
  4. Now problem is solved:

    $ python -c 'print(u"\u2122");'
    ™
    

#10


8  

Here's a rehashing of some other so-called "cop out" answers. There are situations in which simply throwing away the troublesome characters/strings is a good solution, despite the protests voiced here.

def safeStr(obj):
    try: return str(obj)
    except UnicodeEncodeError:
        return obj.encode('ascii', 'ignore').decode('ascii')
    except: return ""

Testing it:

if __name__ == '__main__': 
    print safeStr( 1 ) 
    print safeStr( "test" ) 
    print u'98\xb0'
    print safeStr( u'98\xb0' )

Results:

1
test
98°
98

Suggestion: you might want to name this function toAscii instead? That's a matter of preference.

#11


6  

Simple helper functions found here.

def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
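
Example usage, assuming Python 2 and the two helpers above (expected outputs shown as comments):

print safe_str(u'98\xb0')           # -> 98\xb0   (escaped instead of raising)
print safe_unicode('caf\xc3\xa9')   # -> caf\xc3\xa9  (the bytes shown escaped)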

#12


3  

I just used the following:

import unicodedata
message = unicodedata.normalize("NFKD", message)

Check what documentation says about it:

unicodedata.normalize(form, unistr) Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn’t, they may not compare equal.

Solves it for me. Simple and easy.
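
In the question's case this works because the compatibility decomposition of U+00A0 (the no-break space) is a plain space, so after NFKD the text may become ASCII-encodable; accented letters keep their combining marks, though, so it's not a general fix. A small sketch with assumed sample data:

import unicodedata

s = u'Agent tel:\xa00123 456789'      # contains the problematic U+00A0
s = unicodedata.normalize("NFKD", s)  # NBSP decomposes to an ordinary space
print s.encode('ascii')               # now encodes without error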

#13


1  

I just had this problem, and Google led me here, so just to add to the general solutions here, this is what worked for me:

# 'value' contains the problematic data
unic = u''
unic += value
value = unic

I had this idea after reading Ned's presentation.

I don't claim to fully understand why this works, though. So if anyone can edit this answer or put in a comment to explain, I'll appreciate it.
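
One plausible explanation (my assumption, not the answer author's): concatenating onto a unicode literal forces the result to be unicode, so later operations stay in unicode instead of falling back to str's ASCII codec; it only works when value is already unicode or pure-ASCII bytes:

value = 'plain ASCII bytes'   # a Python 2 byte string (str)
unic = u''
unic += value                 # implicit ASCII decode; the result is unicode
value = unic
print type(value)             # <type 'unicode'>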

#14


1  

The solution below worked for me. I just added

u"String"

(representing the string as unicode) before my string.

result_html = result.to_html(col_space=1, index=False, justify={'right'})

text = u"""
<html>
<body>
<p>
Hello all, <br>
<br>
Here's weekly enterprise enrollment summary report.  Let me know if you have any questions. <br>
<br>
7 Day Summary <br>
<br>
<br>
{0}
</p>
<p>Thanks,</p>
<p>Lookout Data Team</p>
</body></html>
""".format(result_html)

#15


1  

Just add encode('utf-8') to the variable:

agent_contact.encode('utf-8')

#16


0  

I always put the code below in the first two lines of my Python files:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
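
With those two lines, bare string literals in the file become unicode objects; a quick illustration assuming Python 2:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'voilà'             # a unicode object now, not a byte string
print type(s)           # <type 'unicode'>
print s.encode('utf-8')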

#17


0  

If you have something like packet_data = "This is data" then do this on the next line, right after initializing packet_data:

unic = u''
unic += packet_data   # concatenating onto a unicode literal makes the result unicode
packet_data = unic    # packet_data is now a unicode string
