Python: Sanitizing a string for unicode? (duplicate)

Time: 2021-06-06 20:14:30

Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

Somewhat related to this question, although I was unable to solve my problem from it.

This also fails:

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

2 Answers

#1


37  

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings," they're byte arrays. So your string, where did it come from and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. I try to paste it into a Python interpreter, or type it on OS X with Option-[, and it doesn't come through.

Looking at your second example though, you have a byte with hex value 93. That can't be valid UTF-8 here: in UTF-8, any byte higher than 127 is part of a multibyte sequence, and 0x93 is a continuation byte that can only follow a lead byte, not a space. So I'm guessing it's supposed to be Latin-1. The problem is, 0x93 isn't a printable character in Latin-1. The range from 0x80 to 0x9f is reserved there for control codes rather than text. However, Microsoft saw that range and decided to put "curly quotes" in there. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
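
To see the difference between the three encodings concretely, here is a small sketch. It is written in Python 3 syntax, where `bytes` plays the role of the Python 2 `str` from the question; that choice and the variable names are assumptions for illustration, not something from the original post.

```python
# Python 3 sketch: the same byte 0x93 seen through three different codecs.
raw = b' foo \x93bar bar \x94 weasel'

# UTF-8: 0x93 is a continuation byte with no lead byte before it, so decoding fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1: decoding never fails, but 0x93 becomes the control character U+0093.
as_latin1 = raw.decode('latin-1')

# windows-1252: 0x93 and 0x94 become the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode('windows-1252')

print(utf8_ok)                # False
print(ascii(as_latin1[5]))    # '\x93'
print(ascii(as_cp1252[5]))    # '\u201c'
```

Index 5 is the same "position 5" that the tracebacks in the question complain about.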

So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:

>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>

That's correct, because the opening curly quote is Unicode U+201C. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite function, unicode.encode.

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'

Curly quotes take 3 bytes to encode in UTF-8. You could use UTF-16 and they'd only be two bytes. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
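
The byte counts are easy to check yourself. A minimal sketch, again in Python 3 syntax (an assumption on my part; in Python 2 the text would be a `u'...'` literal). `utf-16-le` is used so the byte-order mark isn't counted, and the `errors` argument shows what happens if you insist on ASCII anyway.

```python
quote = '\u201c'  # LEFT DOUBLE QUOTATION MARK

utf8_len = len(quote.encode('utf-8'))        # 3 bytes: e2 80 9c
utf16_len = len(quote.encode('utf-16-le'))   # 2 bytes, no byte-order mark

# Latin-1 has no curly quotes, so a strict encode raises.
try:
    quote.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# With an error handler the character is dropped or replaced instead.
text = 'foo \u201cbar bar\u201d weasel'
dropped = text.encode('ascii', 'ignore')     # b'foo bar bar weasel'
replaced = text.encode('ascii', 'replace')   # b'foo ?bar bar? weasel'

print(utf8_len, utf16_len, latin1_ok)
print(dropped, replaced)
```

Dropping characters with `'ignore'` does make the text "safe", but it throws information away, so decoding with the right codec first is usually the better fix.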

#2


4  

EDIT. Looks like your string is encoded in such a way that “ (LEFT DOUBLE QUOTATION MARK) becomes \x93 and ” (RIGHT DOUBLE QUOTATION MARK) becomes \x94. There are a number of codepages with such a mapping, and CP1250 is one of them, so you may use this:

s = s.decode('cp1250')

For all the codepages which map to \x93 see here (all of them also map to \x94, which can be verified here).
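
If you'd rather check this locally, a quick sketch along these lines should work. The list of candidate codec names is my own pick for illustration and is not exhaustive, and the code is in Python 3 syntax rather than the Python 2 used in the question.

```python
# Hypothetical shortlist of codecs to probe; extend it as needed.
candidates = ['cp1250', 'cp1251', 'cp1252', 'cp1253', 'latin-1', 'cp437', 'mac_roman']

matches = []
for name in candidates:
    try:
        decoded = b'\x93\x94'.decode(name)
    except UnicodeDecodeError:
        continue  # this codec leaves at least one of the bytes undefined
    # Keep only codecs that turn the two bytes into the two curly quotes.
    if decoded == '\u201c\u201d':
        matches.append(name)

print(matches)
```

Several codecs will match, which is why the bytes alone can't tell you the encoding; you still need to know where the data came from.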
