Are there any bytes that are not used by the UTF-8 encoding?

Time: 2023-01-06 08:14:00

As I understand it, UTF-8 is a superset of ASCII and therefore includes the control characters, which are not used to represent printable characters.

My question is: Are there any bytes (of the 256 possible values) that are not used by the UTF-8 encoding?

I wondered if you could convert/encode UTF-8 text into a more compact binary form.

Here is my thought process:

I have no idea how the UTF-8 text encoding works or how it can represent so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid UTF-8, so the control characters (bytes 0-31) are not used differently by the UTF-8 encoding, but at the same time they are not used for displaying characters, right??

So of the 256 different bytes only ~230 are used. For a Unicode text 1000 bytes long there are then only 230^1000 different texts, right?

If that is true, you could convert it to binary data which is smaller than 1000 bytes.

Wolfram Alpha: 1000 bytes of Unicode (assuming Unicode only uses 230 of the 256 different byte values) --> 496 bytes
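
A sanity check on that figure (assuming, as above, ~230 usable byte values): 1000 symbols drawn from a 230-symbol alphabet carry 1000 * log2(230) ≈ 7846 bits of information, so the best possible recoding into full 8-bit bytes is about 981 bytes, a saving of only ~2%, not the 496 bytes quoted above.

```python
# Sketch of the bound, assuming ~230 usable byte values as the question does:
import math

bits_per_symbol = math.log2(230)     # ≈ 7.85 bits of information per byte
total_bits = 1000 * bits_per_symbol  # ≈ 7846 bits for a 1000-symbol text
print(math.ceil(total_bits / 8))     # 981 bytes, only ~2% smaller than 1000
```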

3 Answers

#1


0xF8-0xFF are not valid anywhere in UTF-8, and some other bytes are not valid at certain positions.
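
A quick brute-force sketch in Python makes this concrete: encode every code point and collect the byte values that ever appear. Besides 0xF8-0xFF, the overlong lead bytes 0xC0-0xC1 and the too-large lead bytes 0xF5-0xF7 never show up either.

```python
# Brute-force sketch: which byte values never appear in valid UTF-8?
usable = set()
for cp in range(0x110000):        # every Unicode code point
    if 0xD800 <= cp <= 0xDFFF:    # surrogates are not encodable
        continue
    usable.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - usable)
print([f"{b:#04x}" for b in unused])
# ['0xc0', '0xc1', '0xf5', '0xf6', '0xf7', '0xf8', '0xf9', '0xfa',
#  '0xfb', '0xfc', '0xfd', '0xfe', '0xff']  -- 13 byte values in total
```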

The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high-order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with a more efficient encoding.
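
A small sketch of both properties (helper names invented here): back up past 10xxxxxx continuation bytes to find the start of a character, then read the sequence length off the lead byte.

```python
def char_start(data: bytes, i: int) -> int:
    # Continuation bytes match 0b10xxxxxx, so back up until we leave them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

def seq_len(lead: int) -> int:
    # The lead byte's high bits give the sequence length (valid UTF-8 assumed).
    if lead < 0x80: return 1      # 0xxxxxxx: ASCII
    if lead < 0xE0: return 2      # 110xxxxx
    if lead < 0xF0: return 3      # 1110xxxx
    return 4                      # 11110xxx

data = "héllo, 世界".encode("utf-8")
i = char_start(data, 9)           # offset 9 lands in the middle of '世'
print(i, data[i:i + seq_len(data[i])].decode("utf-8"))   # 8 世
```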

#2


You have to distinguish between characters, Unicode, and the UTF-8 encoding:

In encodings like ASCII, Latin-1, etc. there is a one-to-one relation between a character and a number from 0 to 255, so a character can be encoded by exactly one byte (e.g. "A" -> 65). To decode such a text you need to know which encoding was used (does 65 really mean "A"?).
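
For example, the same byte decodes to entirely different characters under different legacy encodings:

```python
raw = b"\xe4"                 # one byte, three different readings
print(raw.decode("latin-1"))  # ä
print(raw.decode("cp1251"))   # д  (Cyrillic)
print(raw.decode("cp437"))    # Σ  (old DOS code page)
```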

To overcome this situation, Unicode assigns every character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (the so-called Unicode code point). As this range does not fit into one byte, the question is how to encode it. There are several ways to do this; the simplest would be to always use 4 bytes per character. As this consumes a lot of space, a more efficient encoding is UTF-8: here every Unicode code point (= character) is encoded in one, two, three, or four bytes (this encoding does not use all byte values from 0 to 255, but that is only a technical detail).
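
The boundaries are U+0000-U+007F (one byte), U+0080-U+07FF (two), U+0800-U+FFFF (three) and U+10000-U+10FFFF (four); a quick illustration:

```python
# One character from each length class:
for ch in ("A", "é", "中", "🙂"):
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {len(b)} byte(s): {b.hex(' ')}")
# U+0041 A -> 1 byte(s): 41
# U+00E9 é -> 2 byte(s): c3 a9
# U+4E2D 中 -> 3 byte(s): e4 b8 ad
# U+1F642 🙂 -> 4 byte(s): f0 9f 99 82
```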

#3


Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.

For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and use only 0x80-0x9F to introduce extension sequences, somewhat like UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode above 0x7F. You would gain a significant advantage when the majority of your text does not use characters in the ranges 0x80-0x9F or 0x0100-0x1EFFFFFFFF, but correspondingly lose when this is not the case.
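
A toy sketch of such a scheme (all details invented here): ASCII and 0xA0-0xFF encode as themselves, and a lead byte in 0x80-0x9F carries the top bits of a three-byte sequence covering the rest of Unicode. Note that unlike UTF-8 it is not self-synchronizing.

```python
def encode_cp(cp: int) -> bytes:
    # Latin-1-friendly toy encoding (hypothetical):
    # ASCII and 0xA0-0xFF are themselves; everything else takes 3 bytes.
    if cp < 0x80 or 0xA0 <= cp <= 0xFF:
        return bytes([cp])
    # 5 bits in the lead byte + 16 continuation bits = 21 bits: fits U+10FFFF
    return bytes([0x80 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])

def decode(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if 0x80 <= b <= 0x9F:     # lead byte: start of a 3-byte sequence
            out.append(chr(((b & 0x1F) << 16) | (data[i + 1] << 8) | data[i + 2]))
            i += 3
        else:                     # single-byte character
            out.append(chr(b))
            i += 1
    return "".join(out)

s = "café ¤ 世界"
assert decode(b"".join(encode_cp(ord(c)) for c in s)) == s
```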

Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
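
A toy sketch of the stateful idea (not ISO-2022 itself, and limited to U+0000-U+FFFF): an escape byte selects the current 256-character "page", and every other byte is an index into it. The statefulness is the core disadvantage: a byte cannot be decoded without knowing the current page.

```python
ESC = 0x1B  # hypothetical "switch page" byte (toy detail, not ISO-2022's)

def encode(text: str) -> bytes:
    out, page = bytearray(), 0
    for cp in map(ord, text):
        assert cp <= 0xFFFF and cp & 0xFF != ESC  # toy limitations
        if cp >> 8 != page:                       # switch to the new page
            page = cp >> 8
            out += bytes([ESC, page])
        out.append(cp & 0xFF)
    return bytes(out)

def decode(data: bytes) -> str:
    out, page, i = [], 0, 0
    while i < len(data):
        if data[i] == ESC:                        # page switch
            page, i = data[i + 1], i + 2
        else:                                     # index into current page
            out.append(chr((page << 8) | data[i]))
            i += 1
    return "".join(out)

s = "héllo, 世界"
assert decode(encode(s)) == s
```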

The original UTF-8 draft, before Ken Thompson and Rob Pike famously intervened, was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for freedom from contextual ambiguity.

I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec can be grasped in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)

All in all, unless you are working on space travel or some similarly efficiency-intensive application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
