I am currently working on a MySQL database that contains Japanese and English strings.
Current collation: utf8_general_ci.
I need to query for Japanese words inside a string using LIKE '%japaneseWordHere%'. Currently it works almost fine with utf8_general_ci, but sometimes it skips a record because, I assume, the preceding or following character is not stored correctly in utf8_general_ci.
I have found that utf8_general_ci is a little old and buggy and learned about:
- utf8_unicode_ci
- utf8mb4_unicode_ci
I was doing some reading and could not specifically find a good answer to this.
If anyone works with Japanese MySQL databases, or knows which collation is best, any response would be welcome.
Should I change from utf8_general_ci to utf8_unicode_ci or utf8mb4_unicode_ci?
1 Solution
#1
1. Between utf8_general_ci and utf8_unicode_ci
UTF-8 is an encoding for the Unicode character set, which supports pretty much every language in the world.
The difference lies in how strings are compared and sorted: different letters might come in a different order in other languages, and comparing a to ä might behave differently in another collation. Because LIKE also compares characters using the column's collation, this can affect which rows a query matches, not only the order of the results.
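To make the a versus ä point concrete, here is a minimal Python sketch of two comparison rules. It is only an analogy and not MySQL's actual collation algorithm: the helper names strip_accents, loose_equal and binary_equal are made up for illustration, and the "loose" rule simply drops combining marks and case before comparing.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks such as the diaeresis."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_equal(left, right):
    """Case- and accent-insensitive equality (rough stand-in for a *_ci collation)."""
    return strip_accents(left).casefold() == strip_accents(right).casefold()


def binary_equal(left, right):
    """Byte-for-byte equality (rough stand-in for a binary collation)."""
    return left.encode("utf-8") == right.encode("utf-8")


if __name__ == "__main__":
    # "\u00e4" is the character ä.
    print(loose_equal("a", "\u00e4"))
    print(binary_equal("a", "\u00e4"))
```

The same pair of strings is equal under one rule and different under the other, which is why a query can match a row under one collation and skip it under another.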
2. Between utf8mb4_unicode_ci and utf8_unicode_ci
For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8 cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8 cannot store the character at all, you do not have any supplementary characters in utf8 columns and you need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
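As a rough way to see the BMP versus supplementary distinction on your own data, the following Python sketch reports how many bytes a character needs in UTF-8 and lists the characters that fall outside the BMP (code points above U+FFFF), which are the ones a 3-byte utf8 column cannot hold. The function names are hypothetical helpers written for this example, not part of MySQL or any library.

```python
def utf8_length(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def supplementary_chars(text):
    """Characters outside the BMP, i.e. with code points above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def fits_in_three_byte_utf8(text):
    """True if every character is in the BMP (needs at most 3 bytes in UTF-8)."""
    return not supplementary_chars(text)


if __name__ == "__main__":
    samples = {
        "ascii": "a",
        "kanji U+65E5": "\u65e5",
        "kanji U+20BB7": "\U00020BB7",  # a supplementary-plane kanji
    }
    for label, ch in samples.items():
        print(label, utf8_length(ch), fits_in_three_byte_utf8(ch))
```

Most everyday Japanese text is in the BMP, but some rarer kanji and all emoji are supplementary characters, so running a check like this over exported strings can show whether utf8mb4 is actually needed.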