Possible Duplicate:
What's the difference between utf8_general_ci and utf8_unicode_ci?
I've got two options for Unicode that look promising for a MySQL database.
utf8_general_ci unicode (multilingual), case-insensitive
utf8_unicode_ci unicode (multilingual), case-insensitive
Can you please explain the difference between utf8_general_ci and utf8_unicode_ci? What are the effects of choosing one over the other when designing a database?
2 Answers
#1
121
utf8_general_ci is a very simple collation and, on Unicode, a very broken one: it gives incorrect results on general Unicode text. In effect, what it does is:
- converts to Unicode normalization form D for canonical decomposition
- removes any combining characters
- converts to upper case
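The three steps above can be approximated in a few lines of Python. This is only a sketch of the behavior as this answer describes it, not MySQL's actual implementation. The function name general_ci_like_key is made up for illustration, and Python's str.upper() applies full Unicode case mappings, which a strictly one-to-one collation would not.

```python
import unicodedata


def general_ci_like_key(text: str) -> str:
    """Toy comparison key following the three steps listed above.

    Hypothetical helper for illustration; not MySQL's real code.
    """
    # Step 1: canonical decomposition (normalization form D).
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop combining characters such as accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 3: convert to upper case.
    return stripped.upper()


# Accented and unaccented spellings collapse to the same key.
print(general_ci_like_key("résumé") == general_ci_like_key("RESUME"))  # True

# "ø" has no canonical decomposition, so it never becomes a plain "o".
print(general_ci_like_key("ø") == general_ci_like_key("o"))  # False
```

Two strings compare as equal under this toy scheme exactly when their keys are equal, which is why letters without a decomposition fall through the cracks.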
This does not work correctly on Unicode, because it does not understand Unicode casing. Unicode casing alone is much more complicated than an ASCII-minded approach can handle. For example:
- The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
- There are two lowercase Greek sigmas, but only one uppercase one; consider “Σίσυφος”.
- Letters like “ø” do not decompose to an “o” plus a diacritic, meaning that they won’t sort correctly.
There are many other subtleties.
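The casing examples above can be checked with Python's built-in Unicode support. This is just a quick demonstration of the Unicode facts themselves, assuming a standard CPython 3 interpreter; it says nothing about how MySQL is implemented, and the variable names are illustrative.

```python
import unicodedata

# Capital sharp s (U+1E9E) lowercases to ß (U+00DF),
# but ß uppercases to the two-letter string "SS".
sharp_s_lower = "\u1e9e".lower()
sharp_s_upper = "\u00df".upper()
print(sharp_s_lower, sharp_s_upper)

# Both lowercase sigmas, σ (U+03C3) and final ς (U+03C2),
# share a single uppercase form Σ (U+03A3).
sigma_upper = {"\u03c3".upper(), "\u03c2".upper()}
print(sigma_upper)

# ø (U+00F8) has no canonical decomposition, so NFD leaves it alone,
# whereas é (U+00E9) splits into "e" plus a combining acute accent.
o_slash_nfd = unicodedata.normalize("NFD", "\u00f8")
e_acute_nfd = unicodedata.normalize("NFD", "\u00e9")
print(len(o_slash_nfd), len(e_acute_nfd))
```

Because upper- and lowercasing are not inverses of each other here, a scheme that simply uppercases each character one at a time cannot represent these relationships.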
- utf8_unicode_ci uses the standard Unicode Collation Algorithm and supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE". utf8_general_ci does not support expansions or ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.
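To see why expansions matter for sort order, here is a toy Python comparison. The EXPANSIONS table and both key functions are hypothetical and hugely simplified: the real Unicode Collation Algorithm uses multi-level weights, and plain code-point order only stands in for "one character, one weight" here; it is not utf8_general_ci's actual weight table.

```python
# Hypothetical expansion table for illustration only.
EXPANSIONS = {"ß": "ss", "Œ": "OE", "œ": "oe"}


def expanding_key(word: str) -> str:
    """Expand ligature-like letters before comparing, then ignore case."""
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in word)
    return expanded.lower()


def one_to_one_key(word: str) -> str:
    """Compare character by character with no expansions (code-point order)."""
    return word.lower()


words = ["Strato", "Straße", "Strasse"]

# With expansions, "Straße" is keyed as "strasse" and lands next to "Strasse".
by_expansion = sorted(words, key=expanding_key)
print(by_expansion)  # ['Straße', 'Strasse', 'Strato']

# Without expansions, "ß" is one character that sorts after "t" here.
by_code_point = sorted(words, key=one_to_one_key)
print(by_code_point)  # ['Strasse', 'Strato', 'Straße']
```

The point is only the shape of the problem: a collation limited to single-character comparisons has no way to place ß "near ss".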
- utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, whereas utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci. But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong, or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to always use utf8_unicode_ci. Well, unless you want wrong answers.
Source: http://forums.mysql.com/read.php?103,187048,188748#msg-188748
#2
19
From Unicode Character Sets in the MySQL documentation:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.