So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.
I'm thinking about trying something like:
- using String#charAt(int) to get the char at an index
- testing whether the char is in the high-surrogates range
  - if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
  - if not, use the given char value as the codepoint, and increment the index by 1
But my concerns are:
- I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
- this seems like an awfully expensive way to iterate through characters
- someone must have come up with something better.
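For reference, the manual approach described in the list above could be sketched roughly as follows. The class name ManualSurrogateWalk and the helper codePointsOf are hypothetical names made up for this illustration. The sketch also assumes that a high surrogate with no low surrogate after it should be treated as a single char, which is what String#codePointAt(int) itself returns in that case.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualSurrogateWalk {
    // Hypothetical helper: collects code points using the charAt-based steps.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // Only consume two chars when a high surrogate is followed by a low one.
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                out.add(s.codePointAt(i));
                i += 2;
            } else {
                // BMP char, or an unpaired surrogate: use the char value as-is.
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(codePointsOf(s)); // prints [97, 128512, 98]
    }
}
```

The extra check for a following low surrogate is a design choice of this sketch: blindly incrementing by 2 after any high surrogate would skip a char when the string contains an unpaired surrogate.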
4 solutions
#1
120
Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) using the surrogacy scheme.
If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
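To make the difference between char offsets and code points concrete, here is a minimal runnable sketch built around the same loop. The class name CodePointLoop and the helper countCodePoints are hypothetical, chosen only for this example; the test string contains one character outside the BMP.

```java
public class CodePointLoop {
    // Hypothetical helper: counts code points using the offset-based loop.
    static int countCodePoints(String s) {
        int count = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            count++;
            // Advances by 1 for BMP characters and by 2 for surrogate pairs.
            offset += Character.charCount(codepoint);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());         // prints 4 (char units)
        System.out.println(countCodePoints(s)); // prints 3 (code points)
    }
}
```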
#2
52
Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points. You can use the stream directly to iterate over them:
string.codePoints().forEach(c -> ...);
or with a for loop by collecting the stream into an array:
for (int c : string.codePoints().toArray()) {
    ...
}
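As a small self-contained illustration of the stream form, the following sketch formats each code point in U+XXXX notation. The class name CodePointStream and the helper describe are hypothetical names used only here.

```java
import java.util.stream.Collectors;

public class CodePointStream {
    // Hypothetical helper: renders each code point of s as U+XXXX.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as a surrogate pair), 'b'
        System.out.println(describe("a\uD83D\uDE00b")); // prints U+0061 U+1F600 U+0062
    }
}
```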
These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.
#3
5
Iterating over code points is filed as a feature request at Sun.
See Sun Bug Entry
There is also an example of how to iterate over String code points there.
#4
4
Thought I'd add a workaround method that works with foreach loops (ref); you can also convert it to Java 8's new String#codePoints method easily when you move to Java 8:
import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                int nextIndex = 0; // char offset of the next code point
                public boolean hasNext() {
                    return nextIndex < string.length();
                }
                public Integer next() {
                    // Honor the Iterator contract when the string is exhausted.
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int result = string.codePointAt(nextIndex);
                    nextIndex += Character.charCount(result);
                    return result;
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}
Then you can use it with foreach like this:
for (int codePoint : codePoints(myString)) {
    ....
}
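Once on Java 8, one possible way to get the same foreach-friendly Iterable from String#codePoints is sketched below. It relies on the fact that the iterator returned by IntStream#iterator() is also an Iterator<Integer>; the class name CodePointIterable is hypothetical, and this is only one of several ways the conversion could be written.

```java
public class CodePointIterable {
    // Java 8 variant: each call to iterator() opens a fresh code point stream.
    static Iterable<Integer> codePoints(final String string) {
        return () -> string.codePoints().iterator();
    }

    public static void main(String[] args) {
        // 'a', U+1F600, 'b'
        for (int codePoint : codePoints("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(codePoint)); // prints 61, 1f600, 62
        }
    }
}
```

Because the lambda creates a new stream per call, the returned Iterable can be looped over more than once, like the anonymous-class version.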
Or alternatively, if you just want to convert a string to a list of code points (which might use more RAM than the above approach):
import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
    if (in == null)
        throw new NullPointerException("got null");
    List<Integer> out = new ArrayList<Integer>();
    final int length = in.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = in.codePointAt(offset);
        out.add(codepoint);
        // Step over one char for BMP characters, two for surrogate pairs.
        offset += Character.charCount(codepoint);
    }
    return out;
}