Why is an en-dash written as '\xe2\x80\x93' in Python?

Time: 2022-03-09 09:09:08

Specifically, what does each escape in \xe2\x80\x93 do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.


>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

1 solution

#1


15  

What you have are UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint becomes those 3 bytes when encoded with that codec.
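As a quick check of that round trip, here is a minimal sketch; the variable names `en_dash` and `encoded` are just for illustration:

```python
# U+2013 EN DASH as a one-character str
en_dash = "\u2013"

# Encoding to UTF-8 yields three bytes
encoded = en_dash.encode("utf-8")
print(encoded)        # b'\xe2\x80\x93'
print(len(encoded))   # 3

# Decoding all three bytes together gives the character back
assert encoded.decode("utf-8") == en_dash
```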


Trying to decode just one such byte as UTF-8 doesn't work because in the UTF-8 standard that one byte does not, on its own, carry meaning. In the UTF-8 encoding scheme, a \xe2 byte is the first byte for every codepoint between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with 2 additional bytes); that's 4096 codepoints in all.
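If you want to verify which codepoints share the \xe2 first byte, a brute-force sketch like the following should do it. The name `lead_e2` is made up for this example; counting both endpoints of U+2000 through U+2FFF gives 4096.

```python
# Collect every code point whose UTF-8 encoding starts with the byte 0xE2.
# Surrogates (U+D800..U+DFFF) cannot be encoded to UTF-8, so skip them.
lead_e2 = [
    cp
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF
    and chr(cp).encode("utf-8")[0] == 0xE2
]

print(hex(lead_e2[0]), hex(lead_e2[-1]))  # 0x2000 0x2fff
print(len(lead_e2))                       # 4096

# Every one of them needs exactly two more bytes after 0xE2
assert all(len(chr(cp).encode("utf-8")) == 3 for cp in lead_e2)
```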


Python represents values in a bytes object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is represented by an escape, in most cases a \xhh hex escape. The two characters after \x form the hexadecimal value of the byte, an integer number between 0 and 255.
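A small sketch of that representation, using an illustrative variable called `data`:

```python
data = bytes([226, 128, 147])     # three integers in the range 0..255

# repr() shows non-ASCII bytes as \xhh escapes; printable ASCII stays literal
print(repr(data))                 # b'\xe2\x80\x93'
print(repr(b"a" + data + b"z"))   # b'a\xe2\x80\x93z'

# Pasting the repr back in as a literal reproduces the same object
assert data == b'\xe2\x80\x93'

# Iterating over a bytes object gives the integers back
assert list(data) == [226, 128, 147]
```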


Hexadecimal is a very helpful way to represent bytes because a byte splits into 2 groups of 4 bits each, and each group can be written as one character, a digit in the range 0 - F.
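To see how one byte maps onto two hex digits, you could split it by hand. This is only a sketch; `high_nibble` and `low_nibble` are names chosen for the example:

```python
byte = 0xE2

high_nibble = byte >> 4      # upper 4 bits
low_nibble = byte & 0x0F     # lower 4 bits

print(format(byte, "08b"))                                    # 11100010
print(format(high_nibble, "04b"), format(high_nibble, "X"))   # 1110 E
print(format(low_nibble, "04b"), format(low_nibble, "X"))     # 0010 2

# bytes.hex() prints two hex digits per byte
print(b"\xe2\x80\x93".hex())                                  # e28093
```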


\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which lets a decoder detect errors). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
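The bit arithmetic can be reproduced by hand. The following is a rough sketch for this one three-byte sequence only; it is not a full decoder and skips checks a real one performs (overlong forms, surrogates, other sequence lengths):

```python
b1, b2, b3 = b"\xe2\x80\x93"   # unpacking bytes gives the integers 226, 128, 147

# Marker bits: 1110xxxx for a three-byte lead, 10xxxxxx for continuation bytes
assert b1 >> 4 == 0b1110
assert b2 >> 6 == 0b10 and b3 >> 6 == 0b10

# Keep the payload bits (4 + 6 + 6) and join them into one integer
code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(hex(code_point))              # 0x2013
print(format(code_point, "016b"))   # 0010000000010011

# The result matches what the built-in decoder produces
assert chr(code_point) == b"\xe2\x80\x93".decode("utf-8")
```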


You probably want to read up about the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but is not the same thing. See:
