How can I get a Unicode character's code?

Time: 2022-09-13 09:01:30

Let's say I have this:

char registered = '®';

or an umlaut, or any other Unicode character. How can I get its code?

6 solutions

#1


92  

Just convert it to int:

char registered = '®';
int code = (int) registered;

In fact there's an implicit conversion from char to int so you don't have to specify it explicitly as I've done above, but I would do so in this case to make it obvious what you're trying to do.

This will give the UTF-16 code unit - which is the same as the Unicode code point for any character defined in the Basic Multilingual Plane. (And only BMP characters can be represented as char values in Java.) As Andrzej Doyle's answer says, if you want the Unicode code point from an arbitrary string, use Character.codePointAt().

Once you've got the UTF-16 code unit or Unicode code point, both of which are integers, it's up to you what you do with them. If you want a string representation, you need to decide exactly what kind of representation you want. (For example, if you know the value will always be in the BMP, you might want a fixed 4-digit hex representation prefixed with U+, e.g. "U+0020" for space.) That's beyond the scope of this question though, as we don't know what the requirements are.
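
As a rough illustration of the points above, the sketch below reads the UTF-16 code unit of a char and formats it in the fixed four-digit U+ style. The class name CodeUnitDemo and the helper toUPlus are made up for this example, and the four-digit width assumes the value is in the BMP.

```java
class CodeUnitDemo {
    // Hypothetical helper: formats a BMP char as "U+" followed by four uppercase hex digits.
    static String toUPlus(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // The registered sign, written as an escape to avoid source-encoding issues.
        char registered = '\u00AE';

        int explicit = (int) registered; // explicit cast, makes the intent obvious
        int implicit = registered;       // implicit widening conversion, same value

        System.out.println(explicit);             // 174
        System.out.println(explicit == implicit); // true
        System.out.println(toUPlus(registered));  // U+00AE
        System.out.println(toUPlus(' '));         // U+0020
    }
}
```

For code points above the BMP a wider format would be needed, since those values have five or six hex digits.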

#2


32  

A more complete, albeit more verbose, way of doing this is to use the Character.codePointAt method. It handles characters outside the BMP, which Java stores as a surrogate pair (a high surrogate followed by a low surrogate) because their code points do not fit in the range a single char can represent.

In the example you've given this is not strictly necessary - if the (Unicode) character can fit inside a single (Java) char (such as the registered local variable) then it must fall within the \u0000 to \uffff range, and you won't need to worry about surrogate pairs. But if you're looking at potentially higher code points, from within a String/char array, then calling this method is wise in order to cover the edge cases.

For example, instead of

String input = ...;
char fifthChar = input.charAt(4);
int codePoint = (int)fifthChar;

use

String input = ...;
int codePoint = Character.codePointAt(input, 4);

Not only is this slightly less code in this instance, but it will handle detection of surrogate pairs for you.
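
To make the difference concrete, here is a small sketch using a string that contains a character outside the BMP. The class name SurrogateDemo, the helper codePointsOf, and the sample string are assumptions chosen for illustration; U+1F600 is stored in UTF-16 as a high surrogate followed by a low surrogate.

```java
class SurrogateDemo {
    // Hypothetical helper: returns every code point of the string, pairing surrogates.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which occupies two chars in a Java String.
        String input = "a\uD83D\uDE00";

        System.out.println(input.length());                          // 3 chars
        System.out.println(input.codePointCount(0, input.length())); // 2 code points

        // charAt returns only the high surrogate, not the full code point.
        int unit = input.charAt(1);
        System.out.println(Integer.toHexString(unit));      // d83d

        // codePointAt combines the high and low surrogates.
        int codePoint = Character.codePointAt(input, 1);
        System.out.println(Integer.toHexString(codePoint)); // 1f600

        // The index is a char index: pointing at the low surrogate yields just that unit.
        System.out.println(Integer.toHexString(Character.codePointAt(input, 2))); // de00
    }
}
```

Because the index passed to codePointAt counts chars rather than code points, walking a whole string is usually easier with codePoints(), as the helper does.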

#3


5  

In Java, char is technically a "16-bit integer", so you can simply cast it to int and you'll get its code. From Oracle:

The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

So you can simply cast it to int.

char registered = '®';
System.out.println(String.format("This is an int-code: %d", (int) registered));
System.out.println(String.format("And this is an hexa code: %x", (int) registered));

#4


0  

Jon Skeet's answer gives you the character's decimal code, but Unicode code points are conventionally written in hexadecimal, so it is better to present character codes in hex rather than in decimal.

There is an open-source tool at http://unicode.codeplex.com that provides complete information about a character or a sentence.

So it is better to create a helper that takes a char as a parameter and returns its hex code as a string:

public static String getHexCode(char character)
    {
        // %04X: uppercase hex, zero-padded to four digits (Java format syntax)
        return String.format("%04X", (int) character);
    }//end

Hope it helps.

#5


0  

For me, only "Integer.toHexString(registered)" worked the way I wanted:

char registered = '®';
System.out.println("Answer:"+Integer.toHexString(registered));

This gives you only the hex string representation that is usually shown in Unicode tables. Jon Skeet's answer explains more.

#6


0  

There is an open-source library, MgntUtils, that has a utility class StringUnicodeEncoderDecoder. That class provides static methods that convert any String into a Unicode sequence and vice versa. Very simple and useful. To convert a String you just do:

String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(myString);

For example a String "Hello World" will be converted into

"\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064"

It works with any language. Here is the link to the article that explains all the details about the library: MgntUtils. Look for the subtitle "String Unicode converter". The article links to Maven Central, where you can get the artifacts, and to GitHub, where you can get the project itself. The library comes with well-written Javadoc and source code.
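
If adding a dependency is not an option, a similar conversion can be sketched with the JDK alone. The snippet below is not the MgntUtils implementation; the class name UnicodeEscapeSketch and the method toUnicodeSequence are hypothetical. It emits one four-digit escape per UTF-16 code unit, so a character outside the BMP comes out as two escapes (its surrogate pair).

```java
class UnicodeEscapeSketch {
    // Hypothetical helper: one backslash-u escape with four lowercase hex digits per UTF-16 code unit.
    static String toUnicodeSequence(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            // The doubled backslash produces a literal backslash in the output.
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeSequence("Hello World"));
    }
}
```

For "Hello World" this sketch yields the same eleven escapes as the library output quoted above.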
