Java 7, regexes and supplementary Unicode characters

The string in question contains a supplementary Unicode character, "\ud84c\udfb4" (U+233B4). According to the Javadoc, regex matching should be done at the code point level, not the char level. However, the split code below treats the low surrogate (\udfb4) as a non-word character and splits on it.

Am I missing something? What other ways are there to split on non-word characters? (Java version "1.7.0_07")

Thanks in advance.

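// needs: import java.util.regex.Pattern;
Pattern non_word_regex = Pattern.compile("[\\W]", Pattern.UNICODE_CHARACTER_CLASS);
// a and b hold the same text; a spells it out with escapes, including the surrogate pair for U+233B4
String a = "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";
String b ="功能 絶𣎴顯示廣告";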
System.out.print("original "+a+"\norginal hex ");
for(char c : a.toCharArray()){
    System.out.print(Integer.toHexString((int)c));
    System.out.print(' ');
}
System.out.println();

String[] tokens = non_word_regex.split(a);

for(int i =0; i< tokens.length; i++){
   String token = tokens[i];
   System.out.print(i+" ");
   for(char c : token.toCharArray()){
       System.out.print(Integer.toHexString((int)c));
       System.out.print(' ');
   }
   System.out.println();
}

Output:
original 功能 絶𣎴顯示廣告
orginal hex 529f 80fd 20 7d76 d84c dfb4 986f 793a 5ee3 544a
0 529f 80fd
1 7d76 d84c
2 986f 793a 5ee3 544a

1 Answer

#1

This looks like a plain bug in the regex engine. If you use the \w expression instead, everything matches correctly: 𣎴 stays a single code point made up of two chars. You can easily verify this by running the following code:

// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("(?U)[\\w]");
String str = "功能 絶𣎴顯示廣告";

Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.toMatchResult().group());
}
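
To make the "one code point, two chars" point concrete, here is a minimal sketch that inspects the string from the question with the standard String and Character code point methods. The class name CodePointView and the describe helper are made up for illustration; only the JDK calls are real.

```java
// Sketch only: the class and method names are illustrative, not from the original post.
class CodePointView {
    // The string from the question; the pair d84c dfb4 encodes U+233B4.
    static final String SAMPLE =
            "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";

    // Describe every code point of s as "U+XXXX(chars=n,letter=bool)".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X(chars=%d,letter=%b) ",
                    cp, Character.charCount(cp), Character.isLetter(cp)));
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + SAMPLE.length());
        System.out.println("code points: " + SAMPLE.codePointCount(0, SAMPLE.length()));
        // Looked at on its own, the low surrogate is not a letter,
        // which is why char-level matching treats it as a non-word character.
        System.out.println("low surrogate is a letter? " + Character.isLetter('\udfb4'));
        System.out.println(describe(SAMPLE));
    }
}
```

The string has ten chars but nine code points, and the low surrogate is only classified as a letter when it is read together with its high surrogate.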

I've just done a thorough investigation, so I can tell you where the problem is. If you look at the compile() method in java.util.regex.Pattern (starting at line 1625), you will see code that scans the regex for supplementary characters and decides whether or not to support them during matching.

The problem with this approach is that it ignores the fact that a regex containing no supplementary characters may still need to match them in the input, which is exactly what happens in your case.

The workaround is to write a regex that contains a supplementary character without letting it affect the match. I suggest something innocuous like this:

Pattern nonWordRegex = Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");

The (?!\uDB80\uDC00) part does the trick. It is a negative lookahead for U+F0000, a code point in the supplementary private use area, so you will most likely never find it in your text. And voilà: the regex engine sees a supplementary character in the pattern and turns on support for them!
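
As a rough sketch of how this could be put together, the class below applies the workaround pattern to the string from the question. Since the question also asks for alternatives, it adds a regex-free tokenizer that walks the string by code point. The class and method names are hypothetical. The manual tokenizer is only an approximation of \W: it cuts on anything Character.isLetterOrDigit rejects (so underscores and combining marks also split), and it drops empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: class, field and method names are illustrative assumptions.
class SupplementarySplit {
    // The string from the question; the pair d84c dfb4 encodes U+233B4.
    static final String SAMPLE =
            "\u529f\u80fd\u0020\u7d76\ud84c\udfb4\u986f\u793a\u5ee3\u544a";

    // Workaround pattern from the answer: the lookahead names U+F0000
    // (surrogates DB80 DC00) purely so the pattern contains a supplementary character.
    static final Pattern NON_WORD =
            Pattern.compile("(?U)(?!\uDB80\uDC00)[\\W]");

    static List<String> splitWithWorkaround(String s) {
        return Arrays.asList(NON_WORD.split(s));
    }

    // Regex-free alternative: iterate by code point, cut on non letter/digit.
    static List<String> splitByCodePoint(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp); // never lands between two surrogates
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Same hex dump as in the question, one token per line.
    static void dump(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder line = new StringBuilder(i + " ");
            for (char c : tokens.get(i).toCharArray()) {
                line.append(Integer.toHexString(c)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }

    public static void main(String[] args) {
        dump(splitWithWorkaround(SAMPLE));
        dump(splitByCodePoint(SAMPLE));
    }
}
```

With either method the surrogate pair d84c dfb4 should stay inside the second token instead of being split apart.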
