What is the difference between utf8_general_ci and utf8_unicode_ci? (duplicate)

Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci

I've got two Unicode options that look promising for a MySQL database.

utf8_general_ci unicode (multilingual), case-insensitive
utf8_unicode_ci unicode (multilingual), case-insensitive

Can you please explain the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?

2 Answers

#1

121

utf8_general_ci is a very simple — and on Unicode, very broken — collation, one that gives incorrect results on general Unicode text. What it does is:

converts to Unicode normalization form D for canonical decomposition
removes any combining characters
converts to upper case
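
The three steps above can be approximated in a few lines of Python. This is only a sketch of the description in this answer, not MySQL's actual implementation; the function name `general_ci_like_key` is made up for illustration, and Python's `str.upper()` applies full Unicode case mapping, which is more than a simple one-to-one table would do.

```python
import unicodedata

def general_ci_like_key(text: str) -> str:
    """Approximate the three steps described above (not MySQL's real code)."""
    # Step 1: canonical decomposition (NFD), e.g. U+00E9 becomes "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop combining marks (nonzero canonical combining class).
    stripped = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Step 3: convert to upper case.
    return stripped.upper()

# Accented and unaccented spellings collapse to the same key.
print(general_ci_like_key("r\u00e9sum\u00e9") == general_ci_like_key("RESUME"))
# U+00F8 (o with stroke) has no decomposition, so it never collapses to "o".
print(general_ci_like_key("\u00f8") == general_ci_like_key("o"))
```

Under these assumptions the first comparison is true and the second is false, which matches the complaint below about letters that do not decompose.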

This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:

The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that they won’t sort correctly.
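
These casing and decomposition facts can be checked with Python's built-in Unicode support. The snippet is only an illustration of the Unicode data involved and says nothing about how MySQL itself is implemented; the variable names are arbitrary.

```python
import unicodedata

# U+1E9E (capital sharp s) lowercases to U+00DF, but U+00DF uppercases to "SS".
sharp_s_lower = "\u1e9e".lower()
sharp_s_upper = "\u00df".upper()
print(sharp_s_lower == "\u00df", sharp_s_upper)

# Both lowercase sigmas (U+03C3 medial, U+03C2 final) share one uppercase form.
sigma_upper = "\u03c3".upper()
final_sigma_upper = "\u03c2".upper()
print(sigma_upper == final_sigma_upper)

# U+00E9 has a canonical decomposition; U+00F8 has none at all.
e_acute_decomposition = unicodedata.decomposition("\u00e9")
o_stroke_decomposition = unicodedata.decomposition("\u00f8")
print(repr(e_acute_decomposition), repr(o_stroke_decomposition))
```

Case conversion is therefore not a reversible one-to-one mapping, and stripping diacritics after decomposition cannot reach every letter.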

There are many other subtleties.

utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions or ligatures. It sorts all these letters as single characters, and sometimes in the wrong order.

utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, by contrast, is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.

The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.

Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748

#2

From Unicode Character Sets in the MySQL documentation:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
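
The difference between one-to-one comparison and expansions can be sketched with a toy model in Python. The tables `EXPANSIONS` and `ONE_TO_ONE` and both key functions are hypothetical and exist only for this illustration; real collations use far larger weight tables, and the single-letter mapping for “ß” is an assumption about how a one-to-one collation might fold it.

```python
# Hypothetical toy tables, for illustration only.
EXPANSIONS = {"\u00df": "ss", "\u0153": "oe"}   # one character -> several
ONE_TO_ONE = {"\u00df": "s"}                    # one character -> one (assumed)

def expansion_key(text: str) -> str:
    """Comparison key that lets one character expand to several."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())

def one_to_one_key(text: str) -> str:
    """Comparison key limited to single-character replacements."""
    return "".join(ONE_TO_ONE.get(ch, ch) for ch in text.lower())

# With expansions, "Straße" and "STRASSE" compare as equal.
print(expansion_key("Stra\u00dfe") == expansion_key("STRASSE"))
# A one-to-one mapping cannot make a 6-letter and a 7-letter word equal.
print(one_to_one_key("Stra\u00dfe") == one_to_one_key("STRASSE"))
```

The first comparison is true and the second is false: a mapping that replaces each character with exactly one other character can never equate strings of different lengths, which is why expansions need the more expensive algorithm.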
