I'm using this code to generate U+10FFFC
var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});
I know it's a private-use code point and such, but it does display as a single character, as I'd expect. The problems come when manipulating this Unicode character.
If I later do this:
foreach(var ch in s)
{
    Console.WriteLine(ch);
}
Instead of it printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my loop to add these characters back to an empty string like so:
string tmp="";
foreach(var ch in s)
{
    Console.WriteLine(ch);
    tmp += ch;
}
At the end of this, tmp will print just a single character.
What exactly is going on here? I thought that char contains a single Unicode character and that I never had to worry about how many bytes a character takes unless I'm converting to bytes. My real use case is that I need to detect when very large Unicode characters are used in a string. Currently I have something like this:
foreach(var ch in s)
{
    if(ch>=0x100000 && ch<=0x10FFFF)
    {
        Console.WriteLine("special character!");
    }
}
However, because of this splitting of very large characters, this doesn't work. How can I modify this to make it work?
4 solutions
#1
34
U+10FFFC is one Unicode code point, but string's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. That is a very low-level view of text. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available... I'll try not to rant much about how I don't like this design, and just say that no matter how unfortunate, it is a (sad) fact you have to live with.
First off, I suggest using char.ConvertFromUtf32 to get your initial string. Much simpler, much more readable:
var s = char.ConvertFromUtf32(0x10FFFC);
So, this string's Length is not 1, because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length is 2. All code points above U+FFFF require two UTF-16 code units for their representation.
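As a quick check of the boundary described above, here is a minimal sketch. The class name LengthDemo and the helper CodeUnitCount are made up for illustration; only char.ConvertFromUtf32 and string.Length come from the framework.

```csharp
using System;

static class LengthDemo
{
    // Hypothetical helper: number of UTF-16 code units (char values)
    // that the framework uses to represent one code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    static void Main()
    {
        Console.WriteLine(CodeUnitCount(0x41));     // 1 (U+0041)
        Console.WriteLine(CodeUnitCount(0xFFFF));   // 1 (last code point that fits in one unit)
        Console.WriteLine(CodeUnitCount(0x10000));  // 2 (first code point above U+FFFF)
        Console.WriteLine(CodeUnitCount(0x10FFFC)); // 2
    }
}
```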
You should note that ConvertFromUtf32 doesn't return a char: char is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char. Sometimes it needs to return two, and that's why it returns a string. Sometimes you will find APIs dealing in ints instead of chars, because an int can hold any code point (that's what ConvertFromUtf32 takes as its argument, and what ConvertToUtf32 produces as its result).
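To make the int/string round trip concrete, here is a small sketch. RoundTripDemo and RoundTrip are illustrative names, not framework members; the char.ConvertFromUtf32 and char.ConvertToUtf32 overloads are the real ones.

```csharp
using System;

static class RoundTripDemo
{
    // Hypothetical helper: code point (int) -> string -> code point (int).
    public static int RoundTrip(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint); // a string, since one char may not be enough
        return char.ConvertToUtf32(s, 0);            // decodes the code point starting at index 0
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip(0x10FFFC).ToString("X")); // 10FFFC

        // The (char, char) overload takes the two surrogates explicitly.
        string s = char.ConvertFromUtf32(0x10FFFC);
        Console.WriteLine(char.ConvertToUtf32(s[0], s[1]) == 0x10FFFC); // True
    }
}
```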
string implements IEnumerable<char>, which means that when you iterate over a string you get one UTF-16 code unit per iteration. That's why iterating your string and printing it out yields some broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. They are called "surrogates". The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.
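One way to see the two "things" for what they are is to print each code unit as a number and classify it. This is only a sketch; SurrogateInspection and Describe are names invented for the example.

```csharp
using System;
using System.Linq;

static class SurrogateInspection
{
    // Hypothetical helper: one description per UTF-16 code unit, e.g. "DBFF high".
    public static string[] Describe(string s)
    {
        return s.Select(ch =>
            ((int)ch).ToString("X4") + " " +
            (char.IsHighSurrogate(ch) ? "high" :
             char.IsLowSurrogate(ch) ? "low" : "not a surrogate")).ToArray();
    }

    static void Main()
    {
        foreach (string line in Describe(char.ConvertFromUtf32(0x10FFFC)))
            Console.WriteLine(line);
        // DBFF high
        // DFFC low
    }
}
```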
When you append those two surrogates to the string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as one gets you the right output.
And on the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. After the first iteration, tmp is a string holding a lone surrogate, and yet everything carries on as if nothing happened: the string type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.
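The point about malformed sequences can be observed directly. The sketch below builds a string holding a lone surrogate without any error, then shows that the default UTF-8 encoder substitutes U+FFFD while a strict UTF8Encoding rejects the input. LoneSurrogateDemo and IsWellFormedUtf16 are illustrative names, and using a strict encoder as a well-formedness test is just one possible approach.

```csharp
using System;
using System.Text;

static class LoneSurrogateDemo
{
    // Hypothetical helper: true when s can be encoded strictly,
    // i.e. it contains no lone surrogates.
    public static bool IsWellFormedUtf16(string s)
    {
        try
        {
            // The second argument (throwOnInvalidBytes) turns invalid input into an exception.
            new UTF8Encoding(false, true).GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        string pair = char.ConvertFromUtf32(0x10FFFC);
        string lone = pair.Substring(0, 1); // only the high surrogate; no exception here

        // The default encoder quietly replaces the lone surrogate with U+FFFD.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(lone))); // EF-BF-BD

        Console.WriteLine(IsWellFormedUtf16(pair)); // True
        Console.WriteLine(IsWellFormedUtf16(lone)); // False
    }
}
```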
The char structure provides static methods to deal with surrogates: IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, ConvertToUtf32, and ConvertFromUtf32. If you want, you can write an iterator that iterates over Unicode code points instead of UTF-16 code units:
static IEnumerable<int> AsCodePoints(this string s)
{
    for(int i = 0; i < s.Length; ++i)
    {
        yield return char.ConvertToUtf32(s, i);
        if(char.IsHighSurrogate(s, i))
            i++;
    }
}
Then you can iterate like:
foreach(int codePoint in s.AsCodePoints())
{
    // do stuff. codePoint will be an int with value 0x10FFFC in your example
}
If you prefer to get each code point as a string instead, change the return type to IEnumerable<string> and the yield line to:
yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));
With that version, the following works as-is:
foreach(string codePoint in s.AsCodePoints())
{
    Console.WriteLine(codePoint);
}
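Applied to the range check from the question, the same idea could look like the sketch below. It is self-contained rather than built on AsCodePoints, the names PlaneSixteenDetector and ContainsPlane16 are invented here, and skipping lone surrogates silently is an assumption about what the asker wants.

```csharp
using System;

static class PlaneSixteenDetector
{
    // Hypothetical helper: true if any code point of s lies in U+100000..U+10FFFF.
    public static bool ContainsPlane16(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // Only a surrogate pair can encode a code point above U+FFFF.
            if (char.IsSurrogatePair(s, i))
            {
                int codePoint = char.ConvertToUtf32(s, i);
                if (codePoint >= 0x100000 && codePoint <= 0x10FFFF)
                    return true;
                i++; // skip the low surrogate of the pair just decoded
            }
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(ContainsPlane16("abc" + char.ConvertFromUtf32(0x10FFFC))); // True
        Console.WriteLine(ContainsPlane16(char.ConvertFromUtf32(0x1F600)));          // False
        Console.WriteLine(ContainsPlane16("abc"));                                   // False
    }
}
```

Because the comparison runs on the decoded int rather than on each char, the 0x100000 to 0x10FFFF bounds from the question can be kept as they are.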
#2
0
As Martinho already posted, it is much easier to create the string with this private-use code point this way:
var s = char.ConvertFromUtf32(0x10FFFC);
But looping through the two char elements of that string is senseless:
foreach(var ch in s)
{
    Console.WriteLine(ch);
}
What for? You will just get the high and low surrogates that encode the code point. Remember that a char is a 16-bit type, so it can hold a maximum value of 0xFFFF. Your code point doesn't fit into a 16-bit type; indeed, the highest code point (0x10FFFF) needs 21 bits, so the next wider type is a 32-bit type. The two char elements are not characters, but a surrogate pair. The value 0x10FFFC is encoded into the two surrogates.
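For reference, the encoding into two surrogates is plain arithmetic as defined by UTF-16: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below does this by hand; SurrogateMath, Encode and Decode are illustrative names, and in real code char.ConvertFromUtf32 and char.ConvertToUtf32 already do the same work.

```csharp
using System;

static class SurrogateMath
{
    // Hypothetical helper: splits a code point above U+FFFF into its surrogate pair.
    public static char[] Encode(int codePoint)
    {
        int offset = codePoint - 0x10000;              // a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new[] { high, low };
    }

    // Hypothetical helper: recombines a surrogate pair into the code point.
    public static int Decode(char high, char low)
    {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        char[] pair = Encode(0x10FFFC);
        Console.WriteLine(((int)pair[0]).ToString("X4") + " " + ((int)pair[1]).ToString("X4")); // DBFF DFFC
        Console.WriteLine(Decode(pair[0], pair[1]).ToString("X")); // 10FFFC
    }
}
```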
#3
0
While @R. Martinho Fernandes's answer is correct, his AsCodePoints extension method has two issues:
- It will throw an ArgumentException on invalid code points (high surrogate without low surrogate or vice versa).
- You can't use char static methods that take (char) or (string, int) (such as char.IsNumber()) if you only have int code points.
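The first issue is easy to reproduce in isolation. A minimal sketch, with ThrowDemo and ConvertThrows as made-up names:

```csharp
using System;

static class ThrowDemo
{
    // Hypothetical helper: true when char.ConvertToUtf32(s, index) rejects the position.
    public static bool ConvertThrows(string s, int index)
    {
        try
        {
            char.ConvertToUtf32(s, index);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    static void Main()
    {
        string pair = char.ConvertFromUtf32(0x10FFFC);
        Console.WriteLine(ConvertThrows(pair, 0));                 // False: valid surrogate pair
        Console.WriteLine(ConvertThrows(pair.Substring(0, 1), 0)); // True: lone high surrogate
        Console.WriteLine(ConvertThrows(pair.Substring(1, 1), 0)); // True: lone low surrogate
    }
}
```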
I've split the code into two methods. One (CodePointInts) is similar to the original but returns the Unicode Replacement Character on invalid code points. The other (CodePointIndexes) returns an IEnumerable of a struct with more useful fields:
StringCodePointExtensions.cs:
using System.Collections.Generic;
using System.Linq;

public static class StringCodePointExtensions {
    const char ReplacementCharacter = '\ufffd';

    public static IEnumerable<CodePointIndex> CodePointIndexes(this string s) {
        for (int i = 0; i < s.Length; i++) {
            if (char.IsHighSurrogate(s, i)) {
                if (i + 1 < s.Length && char.IsLowSurrogate(s, i + 1)) {
                    yield return CodePointIndex.Create(i, true, true);
                    i++;
                    continue;
                } else {
                    // High surrogate without low surrogate
                    yield return CodePointIndex.Create(i, false, false);
                    continue;
                }
            } else if (char.IsLowSurrogate(s, i)) {
                // Low surrogate without high surrogate
                yield return CodePointIndex.Create(i, false, false);
                continue;
            }
            yield return CodePointIndex.Create(i, true, false);
        }
    }

    public static IEnumerable<int> CodePointInts(this string s) {
        return s
            .CodePointIndexes()
            .Select(
            cpi => {
                if (cpi.Valid) {
                    return char.ConvertToUtf32(s, cpi.Index);
                } else {
                    return (int)ReplacementCharacter;
                }
            });
    }
}
CodePointIndex.cs:
public struct CodePointIndex {
    public int Index;
    public bool Valid;
    public bool IsSurrogatePair;

    public static CodePointIndex Create(int index, bool valid, bool isSurrogatePair) {
        return new CodePointIndex {
            Index = index,
            Valid = valid,
            IsSurrogatePair = isSurrogatePair,
        };
    }
}
To the extent possible under law, the person who associated CC0 with this work has waived all copyright and related or neighboring rights to this work.
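For readers who only need the replacement behaviour and not the index struct, a condensed variant might look like the following. This is not the code from the answer above but a hypothetical single-method sketch of the same idea; TolerantDecoding and CodePointsOrReplacement are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TolerantDecoding
{
    // Hypothetical helper: yields one int per code point, substituting
    // U+FFFD for every lone surrogate instead of throwing.
    public static IEnumerable<int> CodePointsOrReplacement(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                yield return char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate of the pair
            }
            else if (char.IsSurrogate(s, i))
            {
                yield return 0xFFFD; // lone high or low surrogate
            }
            else
            {
                yield return s[i];
            }
        }
    }

    static void Main()
    {
        string s = "a\uDBFFb" + char.ConvertFromUtf32(0x10FFFC);
        Console.WriteLine(string.Join(" ",
            CodePointsOrReplacement(s).Select(cp => cp.ToString("X"))));
        // 61 FFFD 62 10FFFC
    }
}
```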
#4
0
Yet another way to enumerate the UTF-32 characters in a C# string is to use the System.Globalization.StringInfo.GetTextElementEnumerator method, as in the code below. Note that a text element can span more than one code point (for example, a base character followed by combining marks), in which case UTF32Code below reports only the first code point of the element.
using System.Globalization;

public static class StringExtensions
{
    public static System.Collections.Generic.IEnumerable<UTF32Char> GetUTF32Chars(this string s)
    {
        var tee = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (tee.MoveNext())
        {
            yield return new UTF32Char(s, tee.ElementIndex);
        }
    }
}

public struct UTF32Char
{
    private string s;
    private int index;

    public UTF32Char(string s, int index)
    {
        this.s = s;
        this.index = index;
    }

    public override string ToString()
    {
        return char.ConvertFromUtf32(this.UTF32Code);
    }

    public int UTF32Code { get { return char.ConvertToUtf32(s, index); } }
    public double NumericValue { get { return char.GetNumericValue(s, index); } }
    public UnicodeCategory UnicodeCategory { get { return char.GetUnicodeCategory(s, index); } }
    public bool IsControl { get { return char.IsControl(s, index); } }
    public bool IsDigit { get { return char.IsDigit(s, index); } }
    public bool IsLetter { get { return char.IsLetter(s, index); } }
    public bool IsLetterOrDigit { get { return char.IsLetterOrDigit(s, index); } }
    public bool IsLower { get { return char.IsLower(s, index); } }
    public bool IsNumber { get { return char.IsNumber(s, index); } }
    public bool IsPunctuation { get { return char.IsPunctuation(s, index); } }
    public bool IsSeparator { get { return char.IsSeparator(s, index); } }
    public bool IsSurrogatePair { get { return char.IsSurrogatePair(s, index); } }
    public bool IsSymbol { get { return char.IsSymbol(s, index); } }
    public bool IsUpper { get { return char.IsUpper(s, index); } }
    public bool IsWhiteSpace { get { return char.IsWhiteSpace(s, index); } }
}
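Text elements and code points are not always the same thing, which matters when choosing between this approach and the earlier ones. The sketch below compares the three counts for a combining sequence and for U+10FFFC. TextElementDemo and its two helpers are illustrative names, and CountCodePoints assumes well-formed input.

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: number of text elements as seen by StringInfo.
    public static int CountTextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    // Hypothetical helper: number of code points, assuming s is well-formed UTF-16.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++; // a pair counts once
            count++;
        }
        return count;
    }

    static void Main()
    {
        string accented = "e\u0301"; // 'e' followed by a combining acute accent
        string astral = char.ConvertFromUtf32(0x10FFFC);

        // Columns: UTF-16 code units, code points, text elements.
        Console.WriteLine(accented.Length + " " + CountCodePoints(accented) + " " + CountTextElements(accented)); // 2 2 1
        Console.WriteLine(astral.Length + " " + CountCodePoints(astral) + " " + CountTextElements(astral));       // 2 1 1
    }
}
```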