How does .NET determine the Unicode category of a character?

Time: 2021-01-17 20:20:55

I was looking through mscorlib.dll with .NET Reflector and stumbled upon the Char class. I had always wondered how methods like Char.IsLetter were implemented. I expected a huge list of tests, but after digging a little I found a really short piece of code that determines the Unicode category. However, this code uses some kind of tables and some bit-shifting voodoo. Can anyone explain how this works, or point me to some resources?

EDIT: Here's the code. It's in System.Globalization.CharUnicodeInfo.

internal static unsafe byte InternalGetCategoryValue(int ch, int offset)
{
    ushort num = s_pCategoryLevel1Index[ch >> 8];
    num = s_pCategoryLevel1Index[num + ((ch >> 4) & 15)];
    byte* numPtr = (byte*) (s_pCategoryLevel1Index + num);
    byte num2 = numPtr[ch & 15];
    return s_pCategoriesValue[(num2 * 2) + offset];
}

s_pCategoryLevel1Index is a ushort* and s_pCategoriesValue is a byte*.

Both are created in the CharUnicodeInfo static constructor:

static unsafe CharUnicodeInfo()
{
    s_pDataTable = GlobalizationAssembly.GetGlobalizationResourceBytePtr(typeof(CharUnicodeInfo).Assembly, "charinfo.nlp");
    UnicodeDataHeader* headerPtr = (UnicodeDataHeader*) s_pDataTable;
    s_pCategoryLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToCategoriesIndex);
    s_pCategoriesValue = s_pDataTable + ((byte*) headerPtr->OffsetToCategoriesValue);
    s_pNumericLevel1Index = (ushort*) (s_pDataTable + headerPtr->OffsetToNumbericIndex);
    s_pNumericValues = s_pDataTable + ((byte*) headerPtr->OffsetToNumbericValue);
    s_pDigitValues = (DigitValues*) (s_pDataTable + headerPtr->OffsetToDigitValue);
    nativeInitTable(s_pDataTable);
}

Here is the UnicodeDataHeader.

[StructLayout(LayoutKind.Explicit)]
internal struct UnicodeDataHeader
{
    // Fields
    [FieldOffset(40)]
    internal uint OffsetToCategoriesIndex;
    [FieldOffset(0x2c)]
    internal uint OffsetToCategoriesValue;
    [FieldOffset(0x34)]
    internal uint OffsetToDigitValue;
    [FieldOffset(0x30)]
    internal uint OffsetToNumbericIndex;
    [FieldOffset(0x38)]
    internal uint OffsetToNumbericValue;
    [FieldOffset(0)]
    internal char TableName;
    [FieldOffset(0x20)]
    internal ushort version;
}

Note: I hope this doesn't violate any license. If it does, I'll remove the code.

2 Answers

#1


2  

The underlying data is stored in charinfo.nlp, which is embedded in mscorlib.dll as a resource and loaded at runtime. The exact file format is probably known only to Microsoft, but it is most likely some form of lookup table.

EDIT

According to MSDN:

This enumeration is based on The Unicode Standard, version 5.0. For more information, see the "UCD File Format" and "General Category Values" subtopics at the Unicode Character Database.
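
For reference, the enumeration mentioned in that quote is what the public API returns. The snippet below is only a sketch of how a check like Char.IsLetter can be expressed on top of the category; CategoryDemo and IsLetterByCategory are hypothetical names made up for this example, and the five letter categories are the ones the Char.IsLetter documentation lists, not something read out of charinfo.nlp.

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // Hypothetical helper: true for the five letter categories (Lu, Ll, Lt, Lm, Lo).
    // This mirrors the documented behaviour of Char.IsLetter, not its actual source.
    public static bool IsLetterByCategory(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.TitlecaseLetter:
            case UnicodeCategory.ModifierLetter:
            case UnicodeCategory.OtherLetter:
                return true;
            default:
                return false;
        }
    }

    public static void Main()
    {
        // Print the category of a few sample characters.
        foreach (char c in new[] { 'A', 'a', '5', ' ', '$' })
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            Console.WriteLine($"U+{(int)c:X4} {category} letter={IsLetterByCategory(c)}");
        }
    }
}
```

So the short code the question found only has to produce one small number per character; the Is* methods can then be simple comparisons on that number.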

#2


1  

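That looks like a tree of sorts: a multi-level lookup table, closer to a trie indexed by slices of the character's bits than to a true B-tree.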

The advantage is that a bunch of regions can all point to the same "character unknown" block, instead of needing a unique element in the array for each possible Char value.
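
To make that concrete, here is a minimal sketch of the same 8/4/4-bit split in safe C#. It is not the actual charinfo.nlp layout: ThreeLevelTable, its three separate arrays, and the block de-duplication in the constructor are all assumptions made for illustration (the code in the question keeps everything in one blob and walks it with pointers). Only Lookup is meant to mirror the shape of InternalGetCategoryValue.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical illustration of a three-level lookup table with shared blocks.
public sealed class ThreeLevelTable
{
    private readonly ushort[] level1; // index: ch >> 8            -> start of a 16-entry block in level2
    private readonly ushort[] level2; // index: + ((ch >> 4) & 15) -> start of a 16-byte block in level3
    private readonly byte[] level3;   // index: + (ch & 15)        -> the stored value

    public int StoredEntries => level1.Length + level2.Length + level3.Length;

    // Builds the table from a flat array holding one value per code point.
    public ThreeLevelTable(byte[] flat)
    {
        if (flat.Length % 256 != 0 || flat.Length > 65536)
            throw new ArgumentException("Length must be a multiple of 256, at most 65536.");

        var l2 = new List<ushort>();
        var l3 = new List<byte>();
        var seenL2 = new Dictionary<string, ushort>();
        var seenL3 = new Dictionary<string, ushort>();
        level1 = new ushort[flat.Length >> 8];

        for (int hi = 0; hi < level1.Length; hi++)
        {
            var mid = new ushort[16];
            for (int m = 0; m < 16; m++)
            {
                var leaf = new byte[16];
                Array.Copy(flat, (hi << 8) | (m << 4), leaf, 0, 16);
                string key = Convert.ToBase64String(leaf);
                if (!seenL3.TryGetValue(key, out ushort start))
                {
                    // First time this 16-value pattern appears: store it once.
                    start = (ushort)l3.Count;
                    l3.AddRange(leaf);
                    seenL3[key] = start;
                }
                mid[m] = start; // identical leaves share one block
            }
            string midKey = string.Join(",", mid);
            if (!seenL2.TryGetValue(midKey, out ushort midStart))
            {
                midStart = (ushort)l2.Count;
                l2.AddRange(mid);
                seenL2[midKey] = midStart;
            }
            level1[hi] = midStart; // identical 256-character ranges share one block
        }
        level2 = l2.ToArray();
        level3 = l3.ToArray();
    }

    // Same three steps as the code in the question: top 8 bits, next 4 bits, low 4 bits.
    public byte Lookup(int ch)
    {
        ushort block = level1[ch >> 8];
        block = level2[block + ((ch >> 4) & 15)];
        return level3[block + (ch & 15)];
    }
}

public static class Demo
{
    public static void Main()
    {
        // Fill a flat table from the public API, then compress it.
        var flat = new byte[65536];
        for (int i = 0; i < flat.Length; i++)
            flat[i] = (byte)CharUnicodeInfo.GetUnicodeCategory((char)i);

        var table = new ThreeLevelTable(flat);
        Console.WriteLine((UnicodeCategory)table.Lookup('A'));
        Console.WriteLine($"{table.StoredEntries} stored entries instead of {flat.Length}");
    }
}
```

Because long runs of characters (unassigned ranges, CJK ideographs, private use) have the same category, most 16-value and 256-value blocks are identical and get stored only once, which is where the space saving comes from.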
