Chinese names and the Unicode Basic Multilingual Plane (BMP)

Posted: 2021-01-06 20:24:12

I am building an application using MySQL in which Chinese names need to be stored in the database. I'm trying to decide whether I can use the basic utf8 character set (which covers only the Basic Multilingual Plane and stores at most 3 bytes per character), or whether I need utf8mb4, which allows characters from the higher planes to be encoded and stored.

Is the Unicode Basic Multilingual Plane (BMP) sufficient to store all Chinese proper names?

2 answers

#1


1  

MySQL's CHARACTER SET utf8 only handles UTF-8 sequences of up to 3 bytes, which covers the BMP only. Instead, use CHARACTER SET utf8mb4, which also handles 4-byte sequences. Yes, that includes everything currently defined in Unicode for Chinese, emoji, etc.
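
To make the byte counts concrete, here is a minimal Python sketch (not from the original answer). It only inspects strings on the client side and does not talk to MySQL. The helper names utf8_length and fits_in_mysql_utf8 are hypothetical, and the two sample characters are my own choice: 王 (U+738B), a common surname in the BMP, and 𠮷 (U+20BB7), an ideograph from CJK Extension B that lies outside the BMP.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_mysql_utf8(text: str) -> bool:
    """True if every code point is in the BMP (<= U+FFFF),
    i.e. each character encodes to at most 3 UTF-8 bytes.
    Assumption: this mirrors the 3-byte limit described above."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "\u738b"           # 王, inside the BMP
supplementary = "\U00020BB7"  # 𠮷, CJK Extension B, outside the BMP

print(utf8_length(bmp_char))       # 3
print(utf8_length(supplementary))  # 4
print(fits_in_mysql_utf8(bmp_char + supplementary))  # False
```

A name containing even one character like the second sample would need utf8mb4.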

Use version 5.7, if practical.

#2


0  

TL;DR: it doesn't matter; stick with the utf8mb4 encoding, especially for new applications.

Long-form answer: the key difference between the two encodings is that utf8, long supported by MySQL, stores only UTF-8 sequences of up to three bytes per character. As of MySQL 5.5.3, as noted by @rick-james, the newer utf8mb4 encoding relaxes this restriction and otherwise has no disadvantages.
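
If an application has to run against servers of unknown age, the 5.5.3 threshold mentioned above can be checked up front. The following is only a sketch: it assumes the version string looks like what SELECT VERSION() typically returns (for example "5.7.33-log"), and the names parse_version, supports_utf8mb4 and UTF8MB4_MIN_VERSION are hypothetical.

```python
import re

# Threshold taken from the answer above: utf8mb4 exists as of MySQL 5.5.3.
UTF8MB4_MIN_VERSION = (5, 5, 3)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_utf8mb4(version: str) -> bool:
    """True if the server version is at least 5.5.3."""
    return parse_version(version) >= UTF8MB4_MIN_VERSION


print(supports_utf8mb4("5.1.73"))      # False
print(supports_utf8mb4("5.7.33-log"))  # True
```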

According to the MySQL documentation, the newer utf8mb4 encoding lifts this arbitrary three-byte restriction, and there are few, if any, disadvantages:

  • For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
  • For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it. Because utf8 cannot store the character at all, you have no supplementary characters in utf8 columns and need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
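
The two cases in the quoted list can be seen per character with a short Python sketch. This is an illustration under my own assumptions, not something from the MySQL documentation: the function name describe is hypothetical, and the sample name is made up from 王 (U+738B, BMP) and 𠮷 (U+20BB7, a supplementary character).

```python
def describe(name: str) -> list:
    """For each character, return (code point, Unicode plane, UTF-8 byte length).
    Plane 0 is the BMP; any higher plane means a supplementary character."""
    rows = []
    for ch in name:
        cp = ord(ch)
        rows.append((f"U+{cp:04X}", cp >> 16, len(ch.encode("utf-8"))))
    return rows


for row in describe("\u738b\U00020BB7"):
    print(row)
# ('U+738B', 0, 3)
# ('U+20BB7', 2, 4)
```

The plane-0 character takes three bytes in UTF-8 regardless of which character set the column uses, while the plane-2 character needs the fourth byte that only utf8mb4 allows.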

Thus, my original question was misconceived: the maximum number of bytes needed to encode each character of a Chinese name doesn't matter, so long as the encoding you use actually supports all Unicode code points.
