UTF-8 character encoding in Java

Date: 2021-04-24 09:37:49

I am having trouble converting some French text to UTF-8 so that it displays properly, whether in a console, a text file, or a GUI element.

The original string is

HANDICAP╔ES

which is supposed to be

HANDICAPÉES

Here is a code snippet that shows how I am using the Jackcess database driver to read the Access MDB file in an Eclipse/Linux environment.

Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
// Typed iterator, so next() returns a row map without a cast
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}

In the code you'll see where I tried to convert directly to UTF-8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding when using the Jackcess driver.

Thanks, Cam

4 answers

#1


9  

New analysis, based on new information.
It looks like your problem lies with the encoding of the text before it was stored in the Access DB. It seems the text had been encoded as ISO-8859-1 or windows-1252 but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
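
The suspected corruption can be reproduced in isolation. The following is a minimal sketch, assuming the JDK ships the IBM850 charset (the canonical Java name for cp850); the class and method names are hypothetical and not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another.
    // This mimics a writer and a reader that disagree about the encoding.
    static String misdecode(String text, Charset encodedAs, Charset decodedAs) {
        byte[] bytes = text.getBytes(encodedAs);
        return new String(bytes, decodedAs);
    }

    public static void main(String[] args) {
        Charset cp850 = Charset.forName("IBM850"); // assumption: available in this JDK
        // \u00C9 is É; ISO-8859-1 encodes it as the single byte 0xC9.
        String stored = misdecode("HANDICAP\u00C9ES", StandardCharsets.ISO_8859_1, cp850);
        // cp850 maps byte 0xC9 to ╔ (U+2554), which is what ended up in the DB.
        System.out.println(stored.equals("HANDICAP\u2554ES")); // prints true
    }
}
```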

Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. You're accomplishing that with this line:

String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");

getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that byte according to ISO-8859-1, resulting in the character É. The next line:

String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");

...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes those bytes with the same encoding. Delete that line and you should still get the same result.
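
To make the point concrete, here is a small sketch that separates the two steps. It assumes the IBM850 charset is available and uses an explicit UTF-8 charset in place of the platform default; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Step 1 from the question: re-encode as cp850, decode as ISO-8859-1.
    static String repair(String corrupted) {
        byte[] raw = corrupted.getBytes(Charset.forName("IBM850"));
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Step 2 from the question: encode and decode with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String repaired = repair("HANDICAP\u2554ES"); // \u2554 is ╔
        System.out.println(repaired.equals("HANDICAP\u00C9ES")); // prints true
        // Encoding to UTF-8 and decoding from UTF-8 returns an equal string.
        String again = roundTrip(repaired, StandardCharsets.UTF_8);
        System.out.println(again.equals(repaired)); // prints true
    }
}
```

Only the first step changes the string; the second leaves it as it was.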

More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings; they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.

And if my analysis is correct, your Access driver is decoding the text correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
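
One way the hack could fail, sketched under the assumption that the original text was actually windows-1252 rather than ISO-8859-1: the French ligature œ exists in windows-1252 (byte 0x9C) but not in ISO-8859-1, so decoding the recovered byte as ISO-8859-1 yields a control character instead. The class name and the sample word are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FragileRepairDemo {
    // Re-encode the corrupted text as cp850, then decode with a guessed charset.
    static String repair(String corrupted, Charset assumedOriginal) {
        byte[] raw = corrupted.getBytes(Charset.forName("IBM850"));
        return new String(raw, assumedOriginal);
    }

    public static void main(String[] args) {
        // "cœur" written as windows-1252 and read as cp850 becomes "c£ur":
        // byte 0x9C is œ in windows-1252 but £ (U+00A3) in cp850.
        String corrupted = "c\u00A3ur";
        String expected = "c\u0153ur"; // \u0153 is œ

        // Guessing ISO-8859-1 turns byte 0x9C into the control character U+009C.
        System.out.println(repair(corrupted, StandardCharsets.ISO_8859_1).equals(expected)); // prints false
        // Guessing windows-1252 recovers the ligature.
        System.out.println(repair(corrupted, Charset.forName("windows-1252")).equals(expected)); // prints true
    }
}
```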


Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:

System.out.println("HANDICAPÉES");

The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. The console then decodes those bytes using its own default encoding, which happens to be cp850. So the console displays the text incorrectly, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:

CHCP 1252
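
The JVM/console mismatch described above can be simulated without a Windows console. This is only a sketch: a ByteArrayOutputStream stands in for the console's input, the charset names are passed in explicitly rather than read from the platform, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ConsoleMismatchDemo {
    // Print text through a PrintStream using jvmEncoding, then decode the
    // resulting bytes the way a console using consoleEncoding would.
    static String asSeenOnConsole(String text, String jvmEncoding, String consoleEncoding) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(wire, true, jvmEncoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + jvmEncoding, e);
        }
        return new String(wire.toByteArray(), Charset.forName(consoleEncoding));
    }

    public static void main(String[] args) {
        // JVM writes windows-1252, console reads cp850: É shows up as ╔.
        String shown = asSeenOnConsole("HANDICAP\u00C9ES", "windows-1252", "IBM850");
        System.out.println(shown.equals("HANDICAP\u2554ES")); // prints true
        // After CHCP 1252 both sides agree and the text survives.
        String fixed = asSeenOnConsole("HANDICAP\u00C9ES", "windows-1252", "windows-1252");
        System.out.println(fixed.equals("HANDICAP\u00C9ES")); // prints true
    }
}
```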

To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, which shouldn't be a problem for French.

As for writing to a file, just specify the desired encoding when you create the Writer:

OutputStreamWriter osw = new OutputStreamWriter(
    new FileOutputStream("myFile.txt"), "UTF-8");
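
A fuller sketch of the same idea, closing the writer and reading the file back with the matching encoding. The temporary file and the class and method names are assumptions made for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Write text as UTF-8; try-with-resources flushes and closes the writer.
    static void write(Path file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // Read the whole file back, decoding with the same encoding.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Write to a temporary file, read it back, and clean up.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                write(tmp, text);
                return read(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "HANDICAP\u00C9ES";
        System.out.println(roundTrip(text).equals(text)); // prints true
    }
}
```

Using the StandardCharsets constant instead of the string "UTF-8" avoids the checked UnsupportedEncodingException.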

#2


8  

String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES

This prints the correct string value. It means the text was originally encoded with ISO-8859-1 and then incorrectly decoded as CP850. (As pointed out in a comment, the original encoding could just as well have been CP1252, a.k.a. Windows ANSI, since É has the same code point there as in ISO-8859-1.)

Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between encodings. You would risk losing information in the non-ASCII range that way.

Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.


Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:

  1. Align your environment and binary pipelines so that they all use one and the same character encoding.

  2. You cannot and should not convert between encodings. You would risk losing information in the non-ASCII range that way.

  3. Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.

To fix the problem you need to choose a character encoding X to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read and write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or anything else at any step. If the characters are already corrupted in some datastore (MS Access, files, etc.), then you need to fix that by manually replacing the characters right there in the datastore. Do not use Java for this.

If you're actually using the "command prompt" as the user interface, then you're out of luck. It doesn't support UTF-8. As suggested in the comments and in the article linked there, you need to create a Swing application instead of relying on the restricted command prompt environment.

#3


0  

You can specify the encoding when establishing the connection. This worked perfectly and solved my encoding problem:

    DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null, Database.DEFAULT_AUTO_SYNC, java.nio.charset.Charset.availableCharsets().get("windows-1251"), null, null);
    Table table = open.getTable("FolderInfo");

#4


-1  

Using "ISO-8859-1" helped me deal with the French characters.
