Wrong sorting with Collator using Locale.SIMPLIFIED_CHINESE

Date: 2022-04-15 07:43:01

I'm trying to sort a list of countries in Chinese using Locale.SIMPLIFIED_CHINESE, which appears to order by pinyin (the phonetic alphabet; that is, characters are ordered by their Latin transcription, from A to Z).

But I've found some cases where the order is wrong. For example:

  • the character '中' is zhong1
  • the character '梵' is fan4

The correct order should be 梵 < 中, but the collator orders them the other way round.

String[] characters = new String[] {"梵", "中"};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
collator.setStrength(Collator.PRIMARY);
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

The results of this snippet are:

Before sorting...
[梵, 中]
After sorting...
[中, 梵]

Digging deeper, I found the rules that Java applies for Locale.SIMPLIFIED_CHINESE. You can see them in this image: http://postimg.org/image/4t915a7gp/full/ (notice that 梵 comes after 中).

I noticed that before the <口<口<口<口<口 sequence, which I highlighted in red, all characters are ordered by their Latin transcription, from A to Z. After the <口<口<口<口<口 sequence, however, characters are ordered by their composition. For example, characters that share a component (usually the left part of the character) are grouped together instead of following the A to Z rule.

Also, all the characters after <口<口<口<口<口 are less common Chinese characters. 梵 is less common than 中, so it is placed after <口<口<口<口<口.
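For anyone who wants to check the rule positions without the screenshot, here is a minimal sketch. It assumes the JDK returns a java.text.RuleBasedCollator for this locale (hence the instanceof check); the class name CollationRulePosition and the helper ruleIndex are made up for illustration. It only reports where a character first appears in the rule string, which is a rough proxy for its collation position.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollationRulePosition {

    /**
     * Returns the index of the first occurrence of ch in the collator's rule
     * string, or -1 if it is absent or the collator is not rule based.
     */
    static int ruleIndex(Collator collator, String ch) {
        if (!(collator instanceof RuleBasedCollator)) {
            return -1; // assumption: no rule string is available to inspect
        }
        return ((RuleBasedCollator) collator).getRules().indexOf(ch);
    }

    public static void main(String[] args) {
        Collator zh = Collator.getInstance(Locale.SIMPLIFIED_CHINESE);
        // \u4E2D is the character zhong1, \u68B5 is the character fan4
        System.out.println("zhong1 rule index: " + ruleIndex(zh, "\u4E2D"));
        System.out.println("fan4 rule index:   " + ruleIndex(zh, "\u68B5"));
    }
}
```

A larger index for fan4 than for zhong1 would match what the image shows.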

I wonder whether this is intentional and, if so, why. Either way, it results in wrong orderings, and I don't know how to find a solution for this.

Thanks a lot for your time!

1 Answer

#1


2  

The sorting order provided by the collator in Java is based on the number of strokes needed to write each character.

The small snippet below demonstrates this. Stroke counts are taken from Wiktionary.

// the unicode character and the number of strokes
String[] characters = new String[]{
    "\u68B5 (11)", "\u4E2D (4)", 
    "\u5207 (4)", "\u5973 (3)", "\u898B (7)"
};
List<String> list = Arrays.asList(characters);
System.out.println("Before sorting...");
System.out.println(list.toString());

Collator collator = Collator.getInstance(Locale.TRADITIONAL_CHINESE);
collator.setStrength(Collator.PRIMARY);
System.out.println();
Collections.sort(list, collator);

System.out.println("After sorting...");
System.out.println(list.toString());

Output:

Before sorting...
[梵 (11), 中 (4), 切 (4), 女 (3), 見 (7)]

After sorting...
[女 (3), 中 (4), 切 (4), 見 (7), 梵 (11)]

There is an enhancement request, JDK-6415666, to implement sorting according to the Unicode collation order. However, judging by the information on the locales supported in Java 8, it has not been implemented in Java 8.

Edit: the sorting order using the collator from ICU4J is:

[梵 (11), 見 (7), 女 (3), 切 (4), 中 (4)]

ICU4J code snippet:

import java.util.Locale;

import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
...
// Locale with the PINYIN variant, used to produce the order shown above
Locale locale = new Locale("zh", "", "PINYIN");
Collator collator = (RuleBasedCollator) Collator.getInstance(locale);
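
If adding ICU4J is not an option, one possible workaround (not part of the original answer) is to sort by an explicit pinyin key. The sketch below assumes a hand-written lookup table; the class PinyinKeySort, the method sortByPinyin and the PINYIN map are all hypothetical, and the table covers only the five characters used above. A real table would have to come from a proper pinyin data source, and this approach ignores characters with multiple readings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PinyinKeySort {

    // Hypothetical lookup table: character -> pinyin with tone number.
    private static final Map<String, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put("\u68B5", "fan4");   // 梵
        PINYIN.put("\u4E2D", "zhong1"); // 中
        PINYIN.put("\u5207", "qie1");   // 切
        PINYIN.put("\u5973", "nv3");    // 女 ("v" stands in for u-umlaut)
        PINYIN.put("\u898B", "jian4");  // 見
    }

    /** Returns a new list sorted by pinyin key; unknown strings sort by themselves. */
    static List<String> sortByPinyin(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((String s) -> PINYIN.getOrDefault(s, s)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("\u4E2D", "\u68B5");
        System.out.println(sortByPinyin(list)); // fan4 sorts before zhong1
    }
}
```

For the five sample characters, this gives the same order as the ICU4J result listed above.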
