Unicode characters above U+FFFF in Java?

Date: 2022-03-01 10:36:21

How can I display a Unicode Character above U+FFFF using char in Java?

I need something like this (if it were valid):

char u = '\u+10FFFF';

4 answers

#1


18  

You can't do it with a single char (which holds a UTF-16 code unit), but you can use a String:

// This represents U+10FFFF
String x = "\udbff\udfff";

Alternatively:

String y = new StringBuilder().appendCodePoint(0x10ffff).toString();

That is a surrogate pair (two UTF-16 code units which combine to form a single Unicode code point beyond the Basic Multilingual Plane). Of course, you need whatever's going to display your data to cope with it too...
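To make the code unit versus code point distinction concrete, here is a minimal sketch using only standard `String` and `Character` methods. The class name `SurrogatePairDemo` and its helper methods are made up for illustration and are not part of the original answer.

```java
public class SurrogatePairDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String x = "\udbff\udfff"; // U+10FFFF written as a surrogate pair

        System.out.println(codeUnits(x));                          // 2
        System.out.println(codePoints(x));                         // 1
        System.out.println(Integer.toHexString(x.codePointAt(0))); // 10ffff
        // The two char values form a valid high/low surrogate pair.
        System.out.println(Character.isSurrogatePair(x.charAt(0), x.charAt(1))); // true
    }
}
```

So `length()` counts char values, not characters: a string holding one supplementary character reports a length of 2.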

#2


2  

Source

The char data type is based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The range of legal code points is now U+0000 to U+10FFFF, known as the Unicode scalar value.

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
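The mapping from a supplementary code point to its two surrogates is plain bit arithmetic defined by UTF-16. The sketch below works it out by hand and compares the result with `Character.highSurrogate` and `Character.lowSurrogate` (available since Java 7). The class `SurrogateMath` and the method `toSurrogates` are hypothetical names used only for this illustration.

```java
public class SurrogateMath {
    // Split a supplementary code point (U+10000..U+10FFFF) into its two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10FFFF);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // DBFF DFFF

        // The library helpers agree with the manual arithmetic.
        System.out.println(pair[0] == Character.highSurrogate(0x10FFFF)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x10FFFF));  // true
    }
}
```

In real code you would normally call `Character.toChars` or the two library helpers rather than doing the arithmetic yourself.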

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
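The two bullet points above can be checked directly. This is a small sketch that calls both overloads of `Character.isLetter`; the wrapper class `IsLetterDemo` and its method names are invented for the example.

```java
public class IsLetterDemo {
    // char overload: sees only one UTF-16 code unit.
    static boolean charIsLetter(char c) {
        return Character.isLetter(c);
    }

    // int overload: sees the full code point.
    static boolean codePointIsLetter(int codePoint) {
        return Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        // A lone high surrogate is not a letter on its own.
        System.out.println(charIsLetter('\uD840'));      // false

        // The same check on a full supplementary code point succeeds.
        System.out.println(codePointIsLetter(0x2F81A));  // true

        // Reading the code point out of a string first gives the int overload what it needs.
        String s = new String(Character.toChars(0x2F81A));
        System.out.println(codePointIsLetter(s.codePointAt(0))); // true
    }
}
```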

In the J2SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.

#3


2  

Instead of using StringBuilder, you can also use a static method of the Character class. The method is toChars(), and its documentation reads:

Converts the specified character (Unicode code point) to
its UTF-16 representation stored in a {@code char} array.

So you don't need to know exactly what the surrogate pair looks like; you can work directly with the code point. Example code looks as follows:

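// The code point is held in an int, not a char
int ch = 0x10FFFF;
// toChars returns one char for BMP code points and two (a surrogate pair) otherwise
String y = new String(Character.toChars(ch));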

Note the datatype for the code point is int and not char.
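Going the other way, from a String back to int code points, follows the same idea. The sketch below is one possible way to do it with `codePointAt` and `Character.charCount`; the class `CodePointIteration` and the method `codePointsOf` are hypothetical names, not something from the answer above.

```java
public class CodePointIteration {
    // Collect the code points of a string, stepping by one or two char values as needed.
    static int[] codePointsOf(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x10FFFF)) + "B";
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }
        // Prints:
        // U+0041
        // U+10FFFF
        // U+0042
    }
}
```

On Java 8 and later, `s.codePoints()` returns the same values as an `IntStream`, which is usually shorter to write.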

#4


1  

A Unicode character can take more than two bytes, so in general it cannot be held in a single char.
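The byte counts can be checked by encoding a single code point. This is a rough sketch; the class `EncodedSizeDemo` and its methods are made up for the example, and it uses UTF-16BE rather than UTF-16 so that no byte order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizeDemo {
    // Bytes needed to encode one code point as UTF-8.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to encode one code point as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(utf16Length(0x41));     // 2: fits in one char
        System.out.println(utf16Length(0x10FFFF)); // 4: needs two char values
        System.out.println(utf8Length(0x41));      // 1
        System.out.println(utf8Length(0x10FFFF));  // 4
    }
}
```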
