I have a unicode string like this:
\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7
and I know it is the string representation of bytes encoded with UTF-8.
Note that the string \xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7 is itself of <type 'unicode'>.
How do I decode it to the real string 山东 日照?
1 Answer
#1
7
If you printed the repr() output of your unicode string, then you appear to have a Mojibake: bytes decoded using the wrong encoding.
First encode back to bytes, then decode using the right codec. This may be as simple as encoding as Latin-1:
unicode_string.encode('latin1').decode('utf8')
This depends on how the incorrect decoding was applied, however. If a Windows codepage (like CP1252) was used, and UTF-8 bytes outside the CP1252 range were force-decoded anyway, you can end up with Unicode data that cannot actually be encoded back to CP1252.
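As a concrete illustration (a Python 3 sketch, since bytes and text are explicit types there): a CP1252 Mojibake of 山东 happens to round-trip cleanly, but a byte such as 0x81, which appears in the UTF-8 encoding of 'Á', has no CP1252 mapping at all, so a strict CP1252 decode fails and a forced one loses information.

```python
# Python 3 sketch: a repairable CP1252 Mojibake vs. an unrepairable one.
utf8_bytes = '山东'.encode('utf8')        # b'\xe5\xb1\xb1\xe4\xb8\x9c'
mojibake = utf8_bytes.decode('cp1252')    # all six bytes map to CP1252 chars
# Encoding back to CP1252 recovers the original UTF-8 bytes:
assert mojibake.encode('cp1252').decode('utf8') == '山东'

# But 0x81 (second byte of the UTF-8 encoding of 'Á') is undefined in CP1252,
# so the original mis-decoding must have forced it, discarding the real byte:
try:
    'Á'.encode('utf8').decode('cp1252')
except UnicodeDecodeError:
    print('0x81 has no CP1252 mapping')
```

This is why the simple encode/decode round trip is not guaranteed to work for CP1252 Mojibake, and why a dedicated tool like ftfy is safer.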
The best way to repair such mistakes is the ftfy library, which knows how to deal with force-decoded Mojibake text for a variety of codecs.
For your small sample, Latin-1 appears to work just fine:
>>> unicode_string = u'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> print unicode_string.encode('latin1').decode('utf8')
山东 日照
>>> import ftfy
>>> print ftfy.fix_text(unicode_string)
山东 日照
If you instead have the literal characters \ and x followed by two hex digits, you have another layer of encoding, where each byte was replaced by 4 characters. You'd have to 'decode' those to actual bytes first, by asking Python to interpret the escapes with the string_escape codec:
>>> unicode_string = ur'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
>>> unicode_string
u'\\xE5\\xB1\\xB1\\xE4\\xB8\\x9C \\xE6\\x97\\xA5\\xE7\\x85\\xA7'
>>> print unicode_string.decode('string_escape').decode('utf8')
山东 日照
'string_escape' is a Python 2-only codec that produces a bytestring, so it is safe to decode that as UTF-8 afterwards.
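On Python 3 that codec no longer exists, but the same repair can be sketched with the unicode_escape codec (assuming, as here, the escape text itself is pure ASCII; note unicode_escape also interprets other escapes such as \n):

```python
# Python 3 sketch: undo literal \xNN escapes without the Py2-only string_escape codec.
literal = r'\xE5\xB1\xB1\xE4\xB8\x9C \xE6\x97\xA5\xE7\x85\xA7'
fixed = (literal
         .encode('ascii')           # escapes are ASCII text -> bytes
         .decode('unicode_escape')  # \xE5 -> one char per escape (U+00E5, ...)
         .encode('latin1')          # Latin-1 maps U+00..U+FF 1:1 back to bytes
         .decode('utf8'))           # finally decode the recovered UTF-8 bytes
print(fixed)   # 山东 日照
```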