What is the difference between utf8_general_ci and utf8_unicode_ci? [duplicate]

Date: 2021-11-26 20:17:11

Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci

I've got two options for Unicode that look promising for a MySQL database.

utf8_general_ci unicode (multilingual), case-insensitive
utf8_unicode_ci unicode (multilingual), case-insensitive

Can you please explain the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?

2 Answers

#1


121  

utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. What it does is:

  • converts to Unicode normalization form D for canonical decomposition
  • removes any combining characters
  • converts to upper case
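
The three steps listed above can be sketched in Python with the standard `unicodedata` module. This only approximates the behavior as described here and is not MySQL's actual implementation; the function name `simple_fold_key` is made up for illustration. Note that Python's `str.upper()` is itself Unicode-aware, so the last step is smarter than an ASCII-minded upper-casing would be.

```python
import unicodedata


def simple_fold_key(text: str) -> str:
    """Hypothetical comparison key following the three steps described above."""
    # Step 1: canonical decomposition (normalization form D).
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every combining character (accents, diaereses, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 3: convert to upper case (Python applies full Unicode case mapping here).
    return stripped.upper()


# Accented letters collapse onto their base letters.
resume_key = simple_fold_key("résumé")
# "ø" has no canonical decomposition, so it never becomes a plain "O".
o_slash_key = simple_fold_key("\u00F8")
print(resume_key, o_slash_key)
```

Two strings would then compare as equal under this sketch whenever their keys are equal.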

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

  • The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
  • There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
  • Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that it won’t correctly sort.
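
The casing examples above can be checked from Python, whose built-in string methods follow the Unicode case mappings. This is only a quick way to see the three cases, not a statement about what MySQL does internally; the variable names are arbitrary.

```python
import unicodedata

capital_sharp_s = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00DF"    # ß LATIN SMALL LETTER SHARP S

# Lowercasing is one-to-one, but uppercasing expands one character into two.
lowered = capital_sharp_s.lower()
uppered = small_sharp_s.upper()
print(lowered == small_sharp_s, uppered, len(uppered))

# Both lowercase sigmas share a single uppercase form.
medial_sigma = "\u03C3"  # σ
final_sigma = "\u03C2"   # ς
same_upper = medial_sigma.upper() == final_sigma.upper()
print(same_upper)

# "é" has a canonical decomposition (e + combining acute); "ø" has none,
# so stripping combining marks never turns it into a plain "o".
e_acute_decomposition = unicodedata.decomposition("\u00E9")
o_slash_decomposition = unicodedata.decomposition("\u00F8")
print(repr(e_acute_decomposition), repr(o_slash_decomposition))
```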

There are many other subtleties.

  1. utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions or ligatures. It sorts all these letters as single characters, and sometimes in the wrong order.
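
To make the idea of an expansion concrete, here is a toy Python comparison of two sort keys. Everything in it is an assumption for illustration: the `EXPANSIONS` table is a three-entry stand-in rather than the real Unicode Collation Algorithm data, and `one_to_one_key` orders by lower-cased code points, which is not the weight table MySQL actually uses for utf8_general_ci.

```python
# Illustrative subset only; the real collation tables are far larger.
EXPANSIONS = {"ß": "ss", "Œ": "OE", "œ": "oe"}


def expansion_key(word: str) -> str:
    """Toy key: expand selected characters to several letters, then lowercase."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word).lower()


def one_to_one_key(word: str) -> tuple:
    """Toy key: exactly one weight per character, with no expansions."""
    return tuple(ch.lower() for ch in word)


words = ["Strato", "Strasse", "Straße"]
with_expansions = sorted(words, key=expansion_key)
without_expansions = sorted(words, key=one_to_one_key)
print(with_expansions)
print(without_expansions)
```

With the expansion-aware key, "Straße" lands next to "Strasse"; with the one-character-at-a-time key it is pushed behind "Strato", because "ß" is compared as a single character that sorts after "t".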

  2. utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.

Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

#2


19  

From Unicode Character Sets in the MySQL documentation:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
