How do I iterate over the Unicode code points of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

using String#charAt(int) to get the char at an index
testing whether the char is in the high-surrogates range
- if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
- if not, use the given char value as the codepoint, and increment the index by 1
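
For illustration, the steps above can be sketched roughly as follows. The class and method names are made up for this example, and the extra low-surrogate check is an assumption added so that an unpaired high surrogate only advances the index by 1.

```java
import java.util.ArrayList;
import java.util.List;

class ManualCodePoints {
    // Hypothetical helper implementing the charAt-based steps listed above.
    static List<Integer> manualCodePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // Assumption: only treat c as the start of a pair when a low
            // surrogate actually follows it.
            boolean paired = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (paired) {
                out.add(s.codePointAt(i)); // combines both chars into one code point
                i += 2;
            } else {
                out.add((int) c);          // BMP char (or unpaired surrogate) as-is
                i += 1;
            }
        }
        return out;
    }
}
```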

But my concerns are

I'm not sure whether code points that are naturally in the high-surrogates range will be stored as two char values or one
this seems like an awfully expensive way to iterate through characters
someone must have come up with something better.

4 Solutions

#1

120

Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) using the surrogacy scheme.
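
As a small illustration of the surrogate scheme (a sketch only; the class name and the sample character U+1D11E are picked for this example), a supplementary character occupies two char values but counts as one code point, while an unpaired surrogate is a single char that codePointAt returns unchanged:

```java
class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
    // so it is stored as the surrogate pair \uD834 \uDD1E.
    static final String CLEF = "\uD834\uDD1E";

    // An unpaired high surrogate: one char, returned as-is by codePointAt.
    static final String LONE = "\uD834";

    public static void main(String[] args) {
        System.out.println(CLEF.length());                             // 2
        System.out.println(CLEF.codePointCount(0, CLEF.length()));     // 1
        System.out.println(Integer.toHexString(CLEF.codePointAt(0)));  // 1d11e
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));  // true
        System.out.println(Integer.toHexString(LONE.codePointAt(0)));  // d834
    }
}
```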

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
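
To try the loop out, here is one way it could be wrapped into a runnable class. This is only a sketch: the class name, the describe helper and the sample string are assumptions for the example, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoop {
    // Hypothetical helper: formats each code point of s as "U+XXXX".
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(String.format("U+%04X", codepoint));
            offset += Character.charCount(codepoint); // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", U+1D11E (written as a surrogate pair), "b"
        System.out.println(describe("a\uD834\uDD1Eb")); // [U+0061, U+1D11E, U+0062]
    }
}
```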

#2

Java 8 added CharSequence#codePoints which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are quicker to read and write, and the performance difference will usually be insignificant.
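
As a concrete sketch of the stream approach (the class name, helper names and sample string are made up for illustration), the following counts letters per code point and lists the code points in hex. Because the stream yields whole code points, the supplementary letter U+1D400 is seen once rather than as two surrogate chars:

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointStreams {
    // Counts letters per code point, so a supplementary letter counts once.
    static long countLetters(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    // Collects the code points as lowercase hex strings.
    static List<String> toHex(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a", U+1D400 MATHEMATICAL BOLD CAPITAL A, a space, "1"
        String s = "a\uD835\uDC00 1";
        System.out.println(countLetters(s)); // 2
        System.out.println(toHex(s));        // [61, 1d400, 20, 31]
    }
}
```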

#3

Iterating over code points is filed as a feature request at Sun.

See Sun Bug Entry

There is also an example on how to iterate over String CodePoints there.

#4

I thought I'd add a workaround that works with foreach loops (ref); it is also easy to replace with Java 8's new String#codePoints method once you move to Java 8:

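import java.util.Iterator;
import java.util.NoSuchElementException;

// Wraps a String so its code points can be used in a for-each loop.
public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0;
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          // Honor the Iterator contract once the string is exhausted.
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);
          nextIndex += Character.charCount(result);
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}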

Then you can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }
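
For reference, on Java 8 or later the same foreach usage could probably be kept with a much shorter helper. This is a sketch under that assumption; the class name is hypothetical.

```java
class CodePointIterable {
    // Hypothetical Java 8+ replacement for an anonymous-class Iterable.
    static Iterable<Integer> codePoints(final String string) {
        // PrimitiveIterator.OfInt is an Iterator<Integer>, so the lambda
        // satisfies Iterable<Integer>; each for-each loop gets a fresh stream.
        return () -> string.codePoints().iterator();
    }

    public static void main(String[] args) {
        for (int codePoint : codePoints("a\uD834\uDD1Eb")) {
            System.out.println(Integer.toHexString(codePoint));
        }
    }
}
```

Note that every value is boxed to Integer on the way through the Iterable, so the plain offset loop is likely still cheaper for hot paths.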

Or, alternatively, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):

import java.util.ArrayList;
import java.util.List;

// Copies every code point of the string into a list.
public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);
    offset += Character.charCount(codepoint);
  }
  return out;
}
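
On Java 8 or later, the same conversion can be sketched with String#codePoints. The class and method names below are illustrative only; toArray gives a primitive int[], which avoids boxing each code point.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointCollections {
    // Primitive array of code points, no boxing involved.
    static int[] toCodePointArray(String in) {
        return in.codePoints().toArray();
    }

    // Boxed list of code points, one Integer per code point.
    static List<Integer> toCodePointList(String in) {
        return in.codePoints().boxed().collect(Collectors.toList());
    }
}
```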
