Converting a Java String to Code Page 1252

Time: 2022-07-02 22:16:34

I'm using JNI to interface between a Java program and a C++ function. The C++ function deals with multi-byte strings (CP 1252). I use this C++ code to convert the Java String to a char*:

char *arg=(char*) jEnv->GetStringUTFChars(jArg2,0);

This works fine unless I have some high-order characters. For example, if my input is:

Àlan (UTF: c2 6c 61 6e 20 4a 6f 6e 65 7e)

I can see that the resultant arg is:

c3 82 6c 61 6e

But, I would expect to see:

c0 6c 61 6e

Seeing that GetStringUTFChars() is supposed to return UTF strings, I tried obtaining the Unicode string with GetStringChars() and converting it via WideCharToMultiByte():

const jchar *str=jEnv->GetStringChars(jArg2,0);
WideCharToMultiByte(CP_UTF8,0,(LPCWSTR) str,jEnv->GetStringLength(jArg2),str,szStr,0,0);

(you can assume that I've allocated str and set szStr properly). In this situation, I see this in the resultant str:

c3 82 6c 61 6e

I've tried other CP_ values for the first parameter to WideCharToMultiByte, but none yield useful results: they either return the above or substitute a '?' for the 'À'.

I would expect that somehow I could get this resultant str:

c0 6c 61 6e

But so far, I've had no luck.

2 Solutions

#1


3  

Java uses a modified version of UTF-8. Here is a quote from Java's documentation:

Modified UTF-8 is not new to the Java platform, but it's something that application developers need to be more aware of when converting text that might contain supplementary characters to and from UTF-8. The main thing to remember is that some J2SE interfaces use an encoding that's similar to UTF-8 but incompatible with it. This encoding has in the past sometimes been called "Java modified UTF-8" or (incorrectly) just "UTF-8". For J2SE 5.0, the documentation is being updated to uniformly call it "modified UTF-8."

The incompatibility between modified UTF-8 and standard UTF-8 stems from two differences. First, modified UTF-8 represents the character U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses the single byte value 0x0. Second, modified UTF-8 represents supplementary characters by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes, for a total of six bytes. Standard UTF-8, on the other hand, uses a single four byte sequence for the complete character.

Modified UTF-8 is used by the Java Virtual Machine and the interfaces attached to it (such as the Java Native Interface, the various tool interfaces, or Java class files), in the java.io.DataInput and DataOutput interfaces and classes implementing or using them, and for serialization. The Java Native Interface provides routines that convert to and from modified UTF-8. Standard UTF-8, on the other hand, is supported by the String class, by the java.io.InputStreamReader and OutputStreamWriter classes, the java.nio.charset facilities, and many APIs layered on top of them.

Since modified UTF-8 is incompatible with standard UTF-8, it is critical not to use one where the other is needed. Modified UTF-8 can only be used with the Java interfaces described above. In all other cases, in particular for data streams that may come from or may be interpreted by software that's not based on the Java platform, standard UTF-8 must be used. The Java Native Interface routines that convert to and from modified UTF-8 cannot be used when standard UTF-8 is required.
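
The two differences described in the quoted documentation can be sketched in a few lines of portable C++. This is only an illustration of the encoding rules, not the JVM's actual implementation; the function name toModifiedUtf8 is made up for this example.

```cpp
#include <string>

// Hypothetical helper (not a JNI function): encode UTF-16 code units
// the way modified UTF-8 does. Each code unit is encoded on its own,
// so a surrogate pair becomes two 3-byte sequences (6 bytes total),
// and U+0000 becomes the two bytes 0xC0 0x80.
std::string toModifiedUtf8(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) {
        if (u >= 0x0001 && u <= 0x007F) {
            // Plain ASCII, except NUL: one byte.
            out += static_cast<char>(u);
        } else if (u <= 0x07FF) {
            // Two bytes; this branch also catches U+0000.
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            // Three bytes; surrogates land here, one at a time.
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

For text without U+0000 or supplementary characters (such as the string in the question), the output is byte-for-byte the same as standard UTF-8, which is why the two are so easy to confuse.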

The byte sequence c2 6c 61 6e 20 4a 6f 6e 65 7e is not valid under standard UTF-8. In cp1252, that same byte sequence would be the string Âlan Jone~ (notice Â instead of À).

Under standard UTF-8, the string Àlan Jone~ would be the byte sequence c3 80 6c 61 6e 20 4a 6f 6e 65 7e (notice c3 80 6c instead of c2 6c).
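
To check byte sequences like these by hand, a small standard UTF-8 encoder is enough. The sketch below is illustrative only (encodeUtf8 is a name invented here, and it assumes the caller passes a valid code point, not a lone surrogate). It shows that À (U+00C0) encodes as c3 80, while the c3 82 seen in the question is the encoding of Â (U+00C2), which would be consistent with the Java string already containing Â rather than À.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as standard UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Supplementary characters: a single four-byte sequence.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Here encodeUtf8(0x00C0) yields the bytes c3 80 and encodeUtf8(0x00C2) yields c3 82.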

All Java strings are natively UTF-16, so you don't need to retrieve the string as UTF-8. Use GetStringChars() to get the original UTF-16 encoded characters and pass them as-is to WideCharToMultiByte(), specifying 1252 as the codepage (note that in your example you are using str for both the UTF-16 input buffer and the cp1252 output buffer; don't get your variables confused!), e.g.:

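// Borrow the string's UTF-16 code units from the JVM (not null-terminated).
const jchar *str = jEnv->GetStringChars(jArg2, NULL);
const int wlen = jEnv->GetStringLength(jArg2);
char *cp1252 = NULL;
// First call: ask how many bytes the cp1252 output needs.
int len = WideCharToMultiByte(1252, 0, (LPCWSTR)str, wlen, NULL, 0, NULL, NULL);
if (len > 0)
{
    cp1252 = new char[len + 1];
    // Second call: do the conversion, then null-terminate the result.
    WideCharToMultiByte(1252, 0, (LPCWSTR)str, wlen, cp1252, len, NULL, NULL);
    cp1252[len] = 0;
}
// Hand the borrowed buffer back to the JVM; delete[] cp1252 when done with it.
if (str != NULL)
    jEnv->ReleaseStringChars(jArg2, str);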

#2


0  

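Codepage 1252, Windows ANSI Western, is a superset of ISO Latin 1 (apart from the control range 0x80-0x9F, which cp1252 reuses for the Euro sign and the other added characters), and Latin 1 is in turn a subset of Unicode: its byte values are the first 256 Unicode code points. Thus, if you can live without the Euro sign and some other added Microsoft characters, just discard any Unicode code point higher than 255, store each remaining code point as a single byte, and you have a valid cp1252 encoded string.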
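
A minimal sketch of that "discard what doesn't fit" approach, in portable C++ with no Windows API involved. The function name is invented for illustration. One assumption beyond the text above: code points 0x80-0x9F are also dropped, since those are control characters in Latin 1 but mean something else as cp1252 bytes.

```cpp
#include <string>

// Hypothetical helper: keep only the UTF-16 code units whose value is
// the same in Unicode, ISO Latin 1 and cp1252, and drop everything else.
std::string toCp1252Latin1Only(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) {
        const bool ascii = (u <= 0x7F);
        const bool latin1Upper = (u >= 0xA0 && u <= 0xFF);
        if (ascii || latin1Upper) {
            out += static_cast<char>(u);  // code point value == cp1252 byte
        }
        // Anything else (Euro sign, CJK, surrogates, C1 controls) is discarded.
    }
    return out;
}
```

With this, "Àlan" becomes the bytes c0 6c 61 6e, while a Euro sign simply disappears from the output.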

For using WideCharToMultiByte correctly (more general conversion, e.g. supporting Euro sign), read the documentation, and note e.g. the flag values.
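
If you are curious what the more general conversion has to do, here is a hand-rolled sketch that also covers the Euro sign and the other cp1252 additions. To be clear, this is not WideCharToMultiByte and does not reproduce its flags or best-fit behaviour; it is a plain lookup table, the names are made up, and substituting '?' for unmappable input is an assumption chosen for the example.

```cpp
#include <string>

// Unicode code points for cp1252 bytes 0x80..0x9F. A 0 entry marks the
// five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves undefined.
static const char16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

// Hypothetical helper: convert UTF-16 code units to cp1252 bytes,
// writing '?' for anything cp1252 cannot represent.
std::string toCp1252(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) {
        // ASCII and the upper Latin 1 range map to the same byte value.
        if (u <= 0x7F || (u >= 0xA0 && u <= 0xFF)) {
            out += static_cast<char>(u);
            continue;
        }
        // Otherwise look for the code point among the 0x80..0x9F additions.
        char mapped = '?';
        for (int i = 0; i < 32; ++i) {
            if (kCp1252High[i] != 0 && kCp1252High[i] == u) {
                mapped = static_cast<char>(0x80 + i);
                break;
            }
        }
        out += mapped;
    }
    return out;
}
```

Note that a supplementary character arrives as two surrogate code units, so it comes out as two '?' bytes in this sketch.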

Or as we used to say on Usenet about those who would like others to read the documentation for them and tell them what's significant and what is not, RTFM, please.
