Bytes in a unicode Python string

Date: 2021-09-12 18:09:56

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that back from UTF-8 then gives the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

5 Answers

#1


21  

In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.

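That idea can be sketched as follows (in Python 3 syntax, since Python 2 is end-of-life; the helper name `decode_mixed` and the run-based regex are my own choices, not from the answer itself):

```python
import re

def decode_mixed(s):
    """Decode runs of code points < U+0100 as UTF-8; pass the rest through."""
    def repl(m):
        run = m.group(0)
        try:
            # code points 0x80-0xFF map 1:1 to latin-1 bytes
            return run.encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            # not a valid UTF-8 byte run; keep the characters as they were
            return run
    return re.sub(r'[\x80-\xff]+', repl, s)

a = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print(decode_mixed(a))  # Русский ек
```

Matching maximal runs rather than single characters lets multi-byte UTF-8 sequences be decoded as a unit.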
#2


12  

(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

import re

def convert(m):
    try:
        # the pattern only matches code points 0x80-0xFF, so latin1 always encodes
        return m.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        # the run is not valid UTF-8; leave it as it was
        return m.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')

Result:

Рус utf:ек bytes:blää  

#3


11  

The problem is that your string is not actually encoded in a specific encoding. Your example string:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

is mixing Python's internal representation of unicode strings with utf-8 encoded text. If we just consider the 'special' characters:

>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек

But you say, bytes is utf-8 encoded:

>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек

Wrong! But what about:

>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек

Hurrah.

So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.

#4


5  

You should convert unichrs to chrs, then decode them.

u'\xd0' == u'\u00d0' is True

$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
  • r'[\000-\377]*' will match unichrs u'[\u0000-\u00ff]*'
  • u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
  • You are using utf8-encoded bytes as unicode code points (this is the PROBLEM)
  • I solve the problem by treating those mistaken unichars as the corresponding bytes
  • I search for all these mistaken unichars, convert them to chars, then decode them.

If I'm wrong, please tell me.

#5


5  

You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:

  1. Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
  2. Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
  3. Decodes the sequence using UTF-8 back into Unicode.
  4. Replaces the original UTF-8-like sequence with the matching Unicode character.

Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.

import re

# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')

# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')

# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')

print repr(a)

# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
    \xF0[\x90-\xBF][\x80-\xBF]{2} |  # Valid 4-byte sequences
        [\xF1-\xF3][\x80-\xBF]{3} |
    \xF4[\x80-\x8F][\x80-\xBF]{2} |

    \xE0[\xA0-\xBF][\x80-\xBF]    |  # Valid 3-byte sequences
        [\xE1-\xEC][\x80-\xBF]{2} |
    \xED[\x80-\x9F][\x80-\xBF]    |
        [\xEE-\xEF][\x80-\xBF]{2} |

    [\xC2-\xDF][\x80-\xBF]           # Valid 2-byte sequences
    ''')

def replace(m):
    return m.group(0).encode('latin1').decode('utf8')

print
print repr(p.sub(replace,a))

Output

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'

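The snippet above is Python 2 (`ur''` literals, `print` statements). A rough Python 3 port of the same validating pattern might look like this (a sketch; the names `UTF8_LIKE` and `unscramble` are assumed):

```python
import re

# The same RFC 3629 shape, applied to str code points < U+0100 that
# look like well-formed UTF-8 byte sequences.
UTF8_LIKE = re.compile(r'''(?x)
    \xF0[\x90-\xBF][\x80-\xBF]{2} |   # valid 4-byte sequences
        [\xF1-\xF3][\x80-\xBF]{3} |
    \xF4[\x80-\x8F][\x80-\xBF]{2} |

    \xE0[\xA0-\xBF][\x80-\xBF]    |   # valid 3-byte sequences
        [\xE1-\xEC][\x80-\xBF]{2} |
    \xED[\x80-\x9F][\x80-\xBF]    |
        [\xEE-\xEF][\x80-\xBF]{2} |

    [\xC2-\xDF][\x80-\xBF]            # valid 2-byte sequences
''')

def unscramble(s):
    # Re-encode each matched run via latin-1, then decode it as UTF-8.
    return UTF8_LIKE.sub(lambda m: m.group(0).encode('latin-1').decode('utf-8'), s)

a = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print(unscramble(a))  # Русский ек
```

Because only well-formed multi-byte sequences match, stray latin-1 characters such as 'ä' ('\xe4') pass through untouched.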