How to detect UTF-8 characters in a Latin1-encoded column - MySQL?

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

Specifically, I want to check whether I have UTF-8 characters in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name); Is this enough?

At the moment I have switched my MySQL client encoding to UTF-8.

4 solutions

#1

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" bytes (0x80 to 0xFF), as these are either LATIN1 accented characters or symbols, or the bytes of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1, 
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8 
FROM users 
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')

This is unusually complicated because the MySQL regexp engine seems to ignore escapes like \x80, which makes it necessary to build the character range with the UNHEX() function instead.

This produces results like this:

latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
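
The same side-by-side comparison can be tried outside the database. The following is a minimal Python sketch, not part of the original answer: the sample `rows` are invented for illustration, and Python's `latin-1` codec is only an approximation of MySQL's latin1 (which is based on Windows cp1252). It keeps only values that contain a byte in the 0x80-0xFF range and prints both readings.

```python
import re

# Any byte in 0x80-0xFF: a Latin-1 accented character/symbol,
# or part of a UTF-8 multi-byte sequence.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def both_readings(raw: bytes):
    """Return (latin1_text, utf8_text) for raw column bytes.

    utf8_text is None when the bytes are not valid UTF-8.
    """
    latin1_text = raw.decode("latin-1")  # never fails: every byte maps to a character
    try:
        utf8_text = raw.decode("utf-8")
    except UnicodeDecodeError:
        utf8_text = None
    return latin1_text, utf8_text


# Hypothetical sample values, as raw bytes read from the column.
rows = [b"Alice", "Björn".encode("utf-8"), "Zoë".encode("latin-1")]

for raw in rows:
    if HIGH_BYTE.search(raw):  # skip pure-ASCII values
        latin1_text, utf8_text = both_readings(raw)
        print(f"{latin1_text!r:<12} {utf8_text!r}")
```

A value whose UTF-8 reading looks right while its Latin-1 reading shows pairs such as Ã¶ was most likely stored as UTF-8; a value whose UTF-8 reading is `None` can only be Latin-1.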

#2

Since your question is not completely clear, let's assume some scenarios:

Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This will turn the incorrectly stored data back into correctly encoded UTF-8 in the dump file, through the same accidental round trip that has made things work so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case forget it, the data is gone. Any non-latin1 character will have been replaced by a ?.
Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
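
The difference between the first two scenarios can be reproduced in a few lines. This is a sketch in Python rather than MySQL, assuming that Python's `cp1252` codec approximates MySQL's latin1; the sample strings are invented for illustration.

```python
original = "Björn"

# Scenario 1: UTF-8 bytes were sent over a connection declared as latin1,
# so each byte was stored as a separate latin1 character.
stored = original.encode("utf-8").decode("cp1252")  # looks like 'BjÃ¶rn'

# Reading it back through the same kind of connection reverses the mistake,
# which is why a dump through a latin1 connection yields valid UTF-8.
recovered = stored.encode("cp1252").decode("utf-8")

# Scenario 2: correctly decoded text was written into a latin1 column.
# Characters with no latin1 equivalent are replaced by '?' and are lost.
lossy = "Björn 日本".encode("cp1252", errors="replace").decode("cp1252")

print(stored)
print(recovered)
print(lossy)
```

Note that Python's `cp1252` leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so this round trip can raise an error on data that MySQL itself would accept.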

#3

There is a script on GitHub to help with this sort of thing.

#4

I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.

Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace Ã¶ with Latin-1 ö, etc).
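
The three-way split described in this edit could be sketched as follows. This is an illustrative Python version (requires Python 3.7+ for `bytes.isascii`), not the grep approach itself; the function name `classify` and the sample `fields` are hypothetical.

```python
def classify(raw: bytes) -> str:
    """Sort a field's raw bytes into one of three buckets."""
    if raw.isascii():
        return "ascii"    # identical in Latin-1 and UTF-8, so safe
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "latin1"   # invalid UTF-8, so assume Latin-1
    return "inspect"      # valid UTF-8 with multi-byte sequences: review by hand


# Hypothetical field values as raw bytes taken from a dump.
fields = [b"Smith", b"Zo\xeb", b"Bj\xc3\xb6rn"]

for raw in fields:
    print(raw, classify(raw))
```

The "inspect" bucket still needs a human look, because a byte pair such as 0xC3 0xB6 is valid UTF-8 for ö but could, in principle, also be the two genuine Latin-1 characters Ã and ¶.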
