Why is the en-dash written as "\xe2\x80\x93" in Python?

Specifically, what does each escape in \xe2\x80\x93 do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

1 Answer

#1

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint becomes those 3 bytes when encoded with that codec.
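
As a quick check of that claim, here is a minimal sketch that encodes the codepoint and decodes it back. The variable names are illustrative only and are not part of the original question.

```python
import unicodedata

en_dash = "\u2013"                 # the codepoint U+2013
encoded = en_dash.encode("utf-8")  # text -> bytes, using the UTF-8 codec

print(unicodedata.name(en_dash))   # EN DASH
print(encoded)                     # b'\xe2\x80\x93'
print(len(encoded))                # 3

# Decoding all three bytes together gives the original character back.
print(encoded.decode("utf-8") == en_dash)  # True
```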

Trying to decode just one such byte as UTF-8 doesn't work because, in the UTF-8 standard, that one byte does not carry meaning on its own. In the UTF-8 encoding scheme, a \xe2 byte is the first byte of every codepoint between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with 2 additional bytes); that's 4096 codepoints in all.
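
The range claim can be checked by brute force. The sketch below encodes every codepoint from U+2000 through U+2FFF and collects the first bytes. It then feeds the three bytes one at a time to an incremental decoder from the standard `codecs` module, which shows that nothing comes out until the sequence is complete. The variable names are illustrative.

```python
import codecs

block = range(0x2000, 0x3000)  # U+2000 through U+2FFF inclusive
encodings = [chr(cp).encode("utf-8") for cp in block]

lead_bytes = {enc[0] for enc in encodings}   # first byte of each encoding
lengths = {len(enc) for enc in encodings}    # encoded length of each codepoint

print(len(block))                    # 4096
print({hex(b) for b in lead_bytes})  # {'0xe2'}
print(lengths)                       # {3}

# An incremental decoder buffers incomplete sequences instead of raising.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(bytes([b])) for b in b"\xe2\x80\x93"]
print(pieces == ["", "", "\u2013"])  # True
```

The one-shot `bytes.decode` call in the question has no later input to wait for, which is why it reports "unexpected end of data" instead.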

Python represents the values in a bytes object in a way that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is represented by a \xhh hex escape (apart from a few shorthand escapes such as \n and \t). The two characters hh form the hexadecimal value of the byte, an integer between 0 and 255.
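
A small sketch of that representation follows, using a made-up sample value. `ast.literal_eval` stands in here for "copying it back into a Python script".

```python
import ast

data = b"a\xe2\x80\x93b"   # two printable ASCII bytes around the en-dash bytes

print(data)                # b'a\xe2\x80\x93b'
print(list(data))          # [97, 226, 128, 147, 98]

# The printed form is a valid bytes literal, so it round-trips.
round_tripped = ast.literal_eval(repr(data))
print(round_tripped == data)  # True
```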

Hexadecimal is a very helpful way to represent bytes because a byte splits into 2 groups of 4 bits, and you can represent each group with one character, a digit in the range 0 - F.
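
To make the two-groups-of-4-bits idea concrete, this sketch splits the first byte, E2, into its high and low halves. The names `high` and `low` are illustrative.

```python
byte = 0xE2

high = byte >> 4     # upper 4 bits
low = byte & 0x0F    # lower 4 bits

print(format(byte, "08b"))                     # 11100010
print(format(high, "04b"), format(high, "X"))  # 1110 E
print(format(low, "04b"), format(low, "X"))    # 0010 2
print(bytes([byte]).hex())                     # e2
```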

\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which is used for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
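
The bit extraction described above can be sketched by hand. `decode_three_byte_sequence` is a hypothetical helper written for this illustration, not a standard-library function. It handles only the 3-byte form and skips the extra checks a real decoder performs, such as rejecting overlong encodings and surrogates.

```python
def decode_three_byte_sequence(data: bytes) -> int:
    """Return the codepoint stored in one 3-byte UTF-8 sequence (illustrative only)."""
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    first, second, third = data
    # A 3-byte lead byte looks like 1110xxxx; continuation bytes look like 10xxxxxx.
    if first >> 4 != 0b1110:
        raise ValueError("first byte is not a 3-byte lead byte")
    if second >> 6 != 0b10 or third >> 6 != 0b10:
        raise ValueError("expected two continuation bytes")
    # Keep the last 4 bits of the first byte and the last 6 bits of the others.
    return ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)


codepoint = decode_three_byte_sequence(b"\xe2\x80\x93")
print(hex(codepoint))             # 0x2013
print(format(codepoint, "016b"))  # 0010000000010011
print(chr(codepoint) == b"\xe2\x80\x93".decode("utf-8"))  # True
```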

You probably want to read up about the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but is not the same thing. See:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Pragmatic Unicode by Ned Batchelder

The Python Unicode HOWTO
