I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ASCII characters? Or, what is the best way to handle/escape non-ASCII stuff in Python?
Thanks! Full error:
E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python26\Test2.py", line 26, in test_untitled
    ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium
import unittest, time, re, csv, logging

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")

    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
4 Answers
#1
21
You're trying to pass a bytestring to something, but it's impossible (from the scarcity of info you provide) to tell what you're trying to pass it to. You start with a Unicode string that cannot be encoded as ASCII (the default codec), so you'll have to encode with some different codec (or transliterate it, as @R.Pate suggests) -- but it's impossible for us to say what codec you should use, because we don't know what you're passing the bytestring to, and therefore don't know what that unknown subsystem is going to be able to accept and process correctly in terms of codecs.
In such total darkness as you leave us in, utf-8 is a reasonable blind guess (since it's a codec that can represent any Unicode string exactly as a bytestring, and it's the standard codec for many purposes, such as XML) -- but it can't be any more than a blind guess, until and unless you're going to tell us more about what you're trying to pass that bytestring to, and for what purposes.
Passing thestring.encode('utf-8') rather than bare thestring will definitely avoid the particular error you're seeing right now, but it may result in peculiar displays (or whatever it is you're trying to do with that bytestring!) unless the recipient is ready, willing and able to accept utf-8 encoding (and how could WE know, having absolutely zero idea about what the recipient could possibly be?!-)
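Given the script in the question, the recipient appears to be a file (TestOut.csv), so one way to act on the utf-8 guess is to let the file object do the encoding. This is only a minimal sketch, assuming UTF-8 is acceptable to whatever reads the CSV later. The helper name `append_line` and the sample string are hypothetical, and `io.open` is used because it accepts an `encoding` argument on both Python 2.6+ and Python 3:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_line(path, text, encoding="utf-8"):
    """Append one line of Unicode text to path using an explicit codec."""
    # io.open encodes unicode on write, so the implicit ASCII conversion
    # that raised UnicodeEncodeError never takes place.
    # newline="" writes "\n" unchanged on every platform.
    with io.open(path, "a", encoding=encoding, newline="") as ofile:
        ofile.write(text + u"\n")


# Hypothetical stand-in for the text returned by sel.get_text(...)
scraped = u"Protection\u2122"

out_path = os.path.join(tempfile.mkdtemp(), "TestOut.csv")
append_line(out_path, scraped)

# Read the raw bytes back to see what actually landed on disk.
with io.open(out_path, "rb") as check:
    raw = check.read()
```

In the original loop this would replace the `open(...)`, `write(...)`, `close()` triple. Whether UTF-8 is the right choice still depends on the program that consumes the file.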
#2
31
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™'
print s.encode('ascii', 'ignore')             # drops the ™
print s.encode('ascii', 'replace')            # replaces the ™ with ?
print s.encode('ascii', 'xmlcharrefreplace')  # turns the ™ into an XML character reference
print s.encode('ascii', 'strict')             # raises UnicodeEncodeError (the default behaviour)
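To make the effect of each handler easier to compare, here is a small sketch that collects the encoded results instead of printing them. The function name `ascii_variants` is hypothetical, and the code is written to run on both Python 2.6+ and Python 3:

```python
# -*- coding: utf-8 -*-


def ascii_variants(text):
    """Encode text to ASCII once per non-strict error handler."""
    return {
        "ignore": text.encode("ascii", "ignore"),
        "replace": text.encode("ascii", "replace"),
        "xmlcharrefreplace": text.encode("ascii", "xmlcharrefreplace"),
    }


sample = u"Protection\u2122"  # u"\u2122" is the trademark sign
variants = ascii_variants(sample)

# 'strict' is the default handler, so a bare encode("ascii") raises.
try:
    sample.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Here `variants["xmlcharrefreplace"]` is the only lossless result, since `&#8482;` can be turned back into the original character by an XML or HTML parser.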
#3
1
The "best" way always depends on your requirements; so, what are yours? Is ignoring non-ASCII appropriate? Should you replace ™ with "(tm)"? (That looks fine for this example but quickly breaks down for other code points, though it may be just what you want.) Could the exception be exactly what you need, so that all that's left is to handle it in some way?
Only you can really answer this question.
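If the answer to those questions turns out to be "replace a few known symbols and catch the rest", one possible shape is sketched below. It is only an illustration under that assumption: the `REPLACEMENTS` table and the `to_ascii` name are hypothetical, and the table would need to grow with whatever characters the real pages contain.

```python
# -*- coding: utf-8 -*-

# Hypothetical substitution table; extend it for your own data.
REPLACEMENTS = {
    u"\u2122": u"(tm)",  # trademark sign
    u"\u00ae": u"(R)",   # registered sign
    u"\u00a9": u"(c)",   # copyright sign
}


def to_ascii(text, replacements=REPLACEMENTS):
    """Return text as ASCII bytes, substituting known symbols first."""
    try:
        # Fast path: plain ASCII text needs no special handling.
        return text.encode("ascii")
    except UnicodeEncodeError:
        # Handle the exception: apply the known substitutions, then fall
        # back to '?' for anything the table does not cover.
        for char, substitute in replacements.items():
            text = text.replace(char, substitute)
        return text.encode("ascii", "replace")


trademark = to_ascii(u"Protection\u2122")
unknown = to_ascii(u"caf\u00e9")
plain = to_ascii(u"plain text")
```

The fallback to `'replace'` is a design choice for this sketch; re-raising the exception instead would surface unexpected characters rather than silently turning them into `?`.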
#4
0
First of all, try installing translations for the English language (or any other, if needed):
sudo apt-get install language-pack-en
which provides translation data updates for all supported packages (including Python).
And make sure you use the right encoding in your code. For example:
io.open(foo, encoding='utf-8')  # needs "import io"; the built-in open() accepts encoding= only on Python 3
Then double check your system configuration, such as the value of LANG or the locale configuration (/etc/default/locale), and don't forget to log out and back in to your session.
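Before changing system settings, it may help to see which encodings Python itself reports. The sketch below uses only the standard `sys` and `locale` modules; the function name `encoding_report` is hypothetical. Note, as far as I know, that on Python 2 the implicit unicode-to-bytes conversion uses `sys.getdefaultencoding()`, which is normally `'ascii'` whatever LANG is set to, so the locale settings mostly affect the second value.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that the interpreter currently reports."""
    return {
        # Codec used for implicit str/unicode conversions
        # ('ascii' on a stock Python 2, 'utf-8' on Python 3).
        "default": sys.getdefaultencoding(),
        # Encoding derived from the user's locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```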