In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Decoding that back from UTF-8 gives the initial string with the bytes still in it, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regexing of the unicode string representation. Is there a cleaner way?
5 Answers
#1 (score: 21)
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208: u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell whether the \xd0 character is supposed to be a byte that is part of some UTF-8 encoded character, or whether it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 through as they were.
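The suggestion above can be sketched as follows. This is a Python 3 rendition (the question is about Python 2, but the idea carries over directly: runs of code points below 256 are reinterpreted as bytes and tried as UTF-8, everything else passes through). The function name fix_mixed is made up for illustration.

```python
def fix_mixed(s):
    """Reinterpret runs of code points < 256 as bytes and try UTF-8.

    Characters >= U+0100 are passed through unchanged; a run that is
    not valid UTF-8 is kept as-is (recovered via latin-1).
    """
    out = []
    buf = bytearray()

    def flush():
        if buf:
            try:
                out.append(buf.decode('utf-8'))
            except UnicodeDecodeError:
                out.append(buf.decode('latin-1'))  # leave the run untouched
            buf.clear()

    for ch in s:
        if ord(ch) < 256:
            buf.append(ord(ch))   # collect a candidate byte run
        else:
            flush()
            out.append(ch)        # real Unicode character, keep as-is
    flush()
    return ''.join(out)

print(fix_mixed('\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'))
# → Русский ек
```

One caveat of this all-or-nothing run handling: a single run that mixes valid UTF-8 pairs with stray latin-1 characters fails to decode as a whole and is left unchanged.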
#2 (score: 12)
(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:
import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(m):
    try:
        return m.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        return m.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
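For reference, the same approach ports almost verbatim to Python 3, where the latin1 round-trip works on str the same way (a sketch; the narrower exception clause is an embellishment, not part of the original answer):

```python
import re

def convert(m):
    try:
        # Reinterpret the matched code points as bytes, then try UTF-8.
        return m.group(0).encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        return m.group(0)  # not valid UTF-8: leave the run alone

a = '\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'
print(re.sub('[\x80-\xff]+', convert, a))  # → Рус utf:ек bytes:blää
```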
#3 (score: 11)
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of unicode strings with utf-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек
But, you say, bytes is utf-8 encoded:
>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So, what does this mean for you? It means you're (probably) solving the wrong problem. What you should be asking us, or trying to figure out, is why your strings are in this form to begin with, and how to avoid/fix it before they all get mixed up.
#4 (score: 5)
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True.
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
- r'[\000-\377]*' will match the unichrs u'[\u0000-\u00ff]*'
- u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
- You use utf8-encoded bytes as unicode code points (this is the PROBLEM)
- I solve the problem by pretending those mistaken unichars are the corresponding bytes
- I search for all these mistaken unichars, convert them to chars, then decode them.
If I'm wrong, please tell me.
#5 (score: 5)
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
- Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
- Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
- Decodes the sequence using UTF-8 back into Unicode.
- Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
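The same pattern-based approach carries over to Python 3 essentially unchanged (minus the u/ur prefixes and print statements). A minimal sketch, applied just to the question's string; the name unscramble is made up for illustration:

```python
import re

# Code point sequences < U+0100 that mimic valid UTF-8 byte sequences
# (per RFC 3629), most specific alternatives first.
utf8_like = re.compile(
    '\xf0[\x90-\xbf][\x80-\xbf]{2}'    # valid 4-byte sequences
    '|[\xf1-\xf3][\x80-\xbf]{3}'
    '|\xf4[\x80-\x8f][\x80-\xbf]{2}'
    '|\xe0[\xa0-\xbf][\x80-\xbf]'      # valid 3-byte sequences
    '|[\xe1-\xec][\x80-\xbf]{2}'
    '|\xed[\x80-\x9f][\x80-\xbf]'
    '|[\xee-\xef][\x80-\xbf]{2}'
    '|[\xc2-\xdf][\x80-\xbf]'          # valid 2-byte sequences
)

def unscramble(s):
    # Each match is a run of mis-decoded code points: turn it back into
    # bytes via latin-1, then decode those bytes as UTF-8.
    return utf8_like.sub(
        lambda m: m.group(0).encode('latin-1').decode('utf-8'), s)

a = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print(unscramble(a))  # → Русский ек
```

Stray high code points that do not form a plausible UTF-8 sequence (e.g. a lone '\x80') are simply not matched and stay untouched.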