MySQL charsets: can I perform the conversion in Python?

Date: 2022-09-20 15:08:17

I have a MySQL database which contains some bad data.

I start with this Unicode string:

u'TECNOLOGÍA Y EDUCACIÓN'

Encoding to UTF-8 for the database yields:

'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'

When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):

'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

Double-encoding aside, the result I'd expect here is:

'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'

Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.

Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.

1 solution

#1



  1. the goal now is to figure out the exact process of corruption so it can be reversed

    As documented under ALTER TABLE Syntax:

    Warning

    The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

    ALTER TABLE t1 CHANGE c1 c1 BLOB;
    ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
    

    The reason this works is that there is no conversion when you convert to or from BLOB columns.

    In your case:

    1. change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:

      ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
      
    2. then drop the encoding information (by modifying the column so that it becomes a binary string):

      ALTER TABLE my_table MODIFY my_column BLOB;
      
    3. then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):

      ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
      

    Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
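The same three steps can be mirrored on a single value in Python, which may help when checking rows before altering the table. This is only a sketch under assumptions: it is written for Python 3 (the question uses Python 2 literals), `encode_mysql_latin1` and `repair` are hypothetical helper names, and the five special-cased bytes are those the MySQL manual lists as undefined in cp1252.

```python
# Bytes 0x81, 0x8d, 0x8f, 0x90 and 0x9d are unassigned in cp1252; MySQL's
# latin1 maps them to the C1 control characters with the same number.
MYSQL_LATIN1_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def encode_mysql_latin1(text):
    """Encode text the way MySQL's latin1 does (cp1252 plus five C1 controls)."""
    out = bytearray()
    for ch in text:
        if ord(ch) in MYSQL_LATIN1_GAPS:
            out.append(ord(ch))
        else:
            # Raises UnicodeEncodeError for characters outside cp1252.
            out += ch.encode("cp1252")
    return bytes(out)


def repair(stored):
    """Undo one round of 'UTF-8 bytes sent over a latin1 connection'."""
    text = stored.decode("utf-8")           # what the utf8 column holds now
    original = encode_mysql_latin1(text)    # step 1: the bytes originally received
    return original.decode("utf-8")         # steps 2-3: relabel those bytes as UTF-8


stored = b"TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN"
print(repair(stored))  # TECNOLOGÍA Y EDUCACIÓN
```

A value that was never corrupted would usually raise an exception in `repair`, but not always, so a mixed table still needs checking row by row.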

    If you cannot modify the database just yet, simply fetching the data whilst the connection character set is latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():

    SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
    FROM   my_table
    
  2. Is mysql combining latin1 and cp1252 when it handles conversions?

    As documented under West European Character Sets:

    MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
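To see why \x93 turns into \xe2\x80\x9c, the corruption can be reproduced in Python by decoding the sent bytes with the mapping quoted above. This is a sketch for Python 3, not MySQL's actual implementation; `decode_mysql_latin1` is a hypothetical helper that approximates MySQL's latin1 as Python's cp1252 codec plus the five pass-through bytes.

```python
# Bytes that cp1252 leaves undefined; MySQL's latin1 maps each one to the
# Unicode control character with the same number (0x8d -> U+008D, etc.).
MYSQL_LATIN1_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_mysql_latin1(data):
    """Decode bytes the way MySQL's latin1 does (cp1252 plus five C1 controls)."""
    chars = []
    for byte in data:
        if byte in MYSQL_LATIN1_GAPS:
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


# UTF-8 bytes sent by the application over a connection declared as latin1.
sent = "TECNOLOGÍA Y EDUCACIÓN".encode("utf-8")

# MySQL reads them as latin1 and re-encodes them for the utf8 column.
stored = decode_mysql_latin1(sent).encode("utf-8")
print(stored)
# b'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'

# Strict ISO 8859-1 gives the result the question expected instead.
expected_by_asker = sent.decode("latin-1").encode("utf-8")
print(expected_by_asker)
# b'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
```

The two outputs differ only at byte 0x93: cp1252 assigns it the left double quotation mark U+201C, while ISO 8859-1 treats it as the control character U+0093.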
