Bytes in a unicode Python string

Date: 2021-09-12 18:09:56

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that back from UTF-8 then gives the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

5 Answers

#1


21  

In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.

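That idea can be sketched as follows (in Python 3 syntax, since Python 2 is end-of-life; the helper name `decode_mixed` and the run-based regex are my own choices, not from the answer itself):

```python
import re

def decode_mixed(s):
    """Decode runs of code points < U+0100 as UTF-8; pass the rest through."""
    def repl(m):
        run = m.group(0)
        try:
            # code points 0x80-0xFF map 1:1 to latin-1 bytes
            return run.encode('latin-1').decode('utf-8')
        except UnicodeDecodeError:
            # not a valid UTF-8 byte run; keep the characters as they were
            return run
    return re.sub(r'[\x80-\xff]+', repl, s)

a = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print(decode_mixed(a))  # Русский ек
```

Matching maximal runs rather than single characters lets multi-byte UTF-8 sequences be decoded as a unit.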
#2


12  

(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

import re

def convert(m):
    try:
        # the pattern only matches code points 0x80-0xFF, so latin1 always encodes
        return m.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        # the run is not valid UTF-8; leave it as it was
        return m.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')

Result:

Рус utf:ек bytes:blää  

#3


11  

The problem is that your string is not actually encoded in a specific encoding. Your example string:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

is mixing Python's internal representation of unicode strings with utf-8 encoded text. If we just consider the 'special' characters:

>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек

But you say, bytes is utf-8 encoded:

>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек

Wrong! But what about:

>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек

Hurrah.

So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.

#4


5  

You should convert unichrs to chrs, then decode them.

u'\xd0' == u'\u00d0' is True

$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
  • r'[\000-\377]*' will match unichrs u'[\u0000-\u00ff]*'
  • u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
  • You are using utf8-encoded bytes as unicode code points (this is the PROBLEM)
  • I solve the problem by treating those mistaken unichars as the corresponding bytes
  • I search for all these mistaken unichars, convert them to chars, then decode them.

If I'm wrong, please tell me.

#5


5  

You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:

  1. Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
  2. Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
  3. Decodes the sequence using UTF-8 back into Unicode.
  4. Replaces the original UTF-8-like sequence with the matching Unicode character.

Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.

import re

# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')

# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')

# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')

print repr(a)

# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
    \xF0[\x90-\xBF][\x80-\xBF]{2} |  # Valid 4-byte sequences
        [\xF1-\xF3][\x80-\xBF]{3} |
    \xF4[\x80-\x8F][\x80-\xBF]{2} |

    \xE0[\xA0-\xBF][\x80-\xBF]    |  # Valid 3-byte sequences
        [\xE1-\xEC][\x80-\xBF]{2} |
    \xED[\x80-\x9F][\x80-\xBF]    |
        [\xEE-\xEF][\x80-\xBF]{2} |

    [\xC2-\xDF][\x80-\xBF]           # Valid 2-byte sequences
    ''')

def replace(m):
    return m.group(0).encode('latin1').decode('utf8')

print
print repr(p.sub(replace,a))

Output

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'

u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'

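The snippet above is Python 2 (`ur''` literals, `print` statements). A rough Python 3 port of the same validating pattern might look like this (a sketch; the names `UTF8_LIKE` and `unscramble` are assumed):

```python
import re

# The same RFC 3629 shape, applied to str code points < U+0100 that
# look like well-formed UTF-8 byte sequences.
UTF8_LIKE = re.compile(r'''(?x)
    \xF0[\x90-\xBF][\x80-\xBF]{2} |   # valid 4-byte sequences
        [\xF1-\xF3][\x80-\xBF]{3} |
    \xF4[\x80-\x8F][\x80-\xBF]{2} |

    \xE0[\xA0-\xBF][\x80-\xBF]    |   # valid 3-byte sequences
        [\xE1-\xEC][\x80-\xBF]{2} |
    \xED[\x80-\x9F][\x80-\xBF]    |
        [\xEE-\xEF][\x80-\xBF]{2} |

    [\xC2-\xDF][\x80-\xBF]            # valid 2-byte sequences
''')

def unscramble(s):
    # Re-encode each matched run via latin-1, then decode it as UTF-8.
    return UTF8_LIKE.sub(lambda m: m.group(0).encode('latin-1').decode('utf-8'), s)

a = '\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
print(unscramble(a))  # Русский ек
```

Because only well-formed multi-byte sequences match, stray latin-1 characters such as 'ä' ('\xe4') pass through untouched.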