How do I remove bad characters that don't fit MySQL's utf8 encoding?

Time: 2023-01-06 14:46:56

I have dirty data that sometimes contains characters like this. I use this data to build queries such as:

WHERE a.address IN ('mydatahere')

For this character I get:

org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '


How can I filter out characters like this? I'm using Java.

Thanks.

5 Solutions

#1


8  

You can encode the string to UTF-8 and then decode it back:

String label = "look into my eyes 〠.〠";

Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();

System.out.println(label);

Output:

look into my eyes ?.?
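
A quick way to see what this round trip actually changes is sketched below. The class name RoundTripDemo and the sample strings are made up for illustration. Going by the java.nio.charset documentation, Charset.encode replaces only malformed or unmappable input, which for UTF-8 means unpaired surrogates. Well-formed characters, including ones that need four bytes in UTF-8, should come back unchanged, so this approach on its own may not remove them.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Encode to UTF-8 bytes and decode back, as in the snippet above.
    static String roundTrip(String s) {
        return StandardCharsets.UTF_8
                .decode(StandardCharsets.UTF_8.encode(s))
                .toString();
    }

    public static void main(String[] args) {
        String lone = "a\uD83Dz";         // unpaired high surrogate (malformed UTF-16)
        String pair = "a\uD83D\uDE15z";   // U+1F615, a well-formed surrogate pair

        System.out.println(roundTrip(lone).equals("a?z")); // true
        System.out.println(roundTrip(pair).equals(pair));  // true
    }
}
```

If the goal is to drop 4-byte characters before they reach a MySQL utf8 column, a filter on code points is probably a better fit than this round trip.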

#2


8  

When I had a problem like this, I used a Perl script to make sure the data was converted to valid UTF-8, with code like this:

use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}

This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, the Unicode replacement character).

If you run this script on good UTF-8 input, the output should be identical to the input.

If your data is in a database, it makes sense to use DBI to scan your table(s) and scrub all the data with this approach, so that everything is valid UTF-8.

Here is a Perl one-liner version of the same script:

perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt

EDIT: Added a Java-only solution.

Here is an example of how to do this in Java:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр�вет (the stray 0xD0 becomes U+FFFD; a console that cannot display it may show '?')
    }
}
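
For a reusable helper, the same replacement behaviour is also available through the String constructor that takes a Charset, which is documented to replace malformed input with the charset's default replacement string (U+FFFD for UTF-8). The sketch below is only an illustration: the class and method names (Utf8Scrub, decodeLenient, decodeAndStrip) are invented here and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Scrub {
    // Decode bytes as UTF-8; each malformed sequence becomes U+FFFD.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Same, but drop the replacement characters instead of keeping them.
    // Caveat: this also removes any genuine U+FFFD present in the input.
    static String decodeAndStrip(byte[] raw) {
        return decodeLenient(raw).replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        byte[] raw = {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // truncated sequence
            (byte) 0xD0, (byte) 0xB2  // 'в'
        };
        System.out.println(decodeLenient(raw));
        System.out.println(decodeAndStrip(raw));
    }
}
```

Whether to keep or strip the replacement character depends on the data: keeping it makes the damage visible later, stripping it gives cleaner values for lookups.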

#3


3  

You can filter out surrogate characters with this regex:

String str = "\uD840\uDC00"; // U+20000, represented by 2 chars in Java (a UTF-16 surrogate pair)
str = str.replaceAll("([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); // 0
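
MySQL's utf8 (utf8mb3) character set stores at most three bytes per character, which covers only the Basic Multilingual Plane, so the characters to drop are the ones above U+FFFF. Because java.util.regex matches by code point, a filter that works on code points directly avoids having to reason about how surrogate escapes inside a character class are interpreted. The following is a minimal sketch of that idea; BmpFilter and stripNonBmp are hypothetical names, not from the original answer.

```java
final class BmpFilter {
    // Keep only code points that fit in a 3-byte UTF-8 sequence:
    // everything in the BMP, minus unpaired surrogates.
    static String stripNonBmp(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> Character.isBmpCodePoint(cp)
                    && !Character.isSurrogate((char) cp))
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+20000 is removed; the BMP characters around it are kept.
        System.out.println(stripNonBmp("a\uD840\uDC00z")); // az
    }
}
```

BMP characters such as 〠 (U+3020) pass through unchanged, so only the values that MySQL's utf8 cannot hold are affected.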

#4


2  

Once you convert a byte array to a String in Java, you get a UTF-16 encoded string, because that is how Java represents strings internally. To get rid of non-UTF-8 characters, use the following code:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented out because of capital letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            , ""));
}

Or, if you want to check whether a string contains non-UTF-8 characters, use Pattern.matches like this:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
                    ".*(" +
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented out because of capital letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ").*"
            , values[i]));
}

To make a whole web app UTF-8 compatible, see:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
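
The patterns above operate on text that spells the bytes out as \xNN escapes. When the raw bytes themselves are available, a byte-level check can be sketched with a CharsetDecoder set to CodingErrorAction.REPORT, which throws instead of replacing. This is only an illustrative sketch; the names Utf8Check and isValidUtf8 are assumptions and not part of the answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x95};
        byte[] overlong = {(byte) 0xF0, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValidUtf8(emoji));    // true
        System.out.println(isValidUtf8(overlong)); // false
    }
}
```

Note that a valid 4-byte sequence such as F0 9F 98 95 passes this check even though MySQL's 3-byte utf8 cannot store it, so validity and "fits in the column" are separate questions.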

#5


-1  

Maybe this will help someone, as it helped me.

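public static String removeBadChars(String s) {
  if (s == null) return null;
  StringBuilder sb = new StringBuilder(s.length());
  for (int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // Skip both halves of a surrogate pair; dropping only the high
    // surrogate would leave an orphaned low surrogate behind.
    if (Character.isSurrogate(c)) continue;
    sb.append(c);
  }
  return sb.toString();
}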
