Why isn't đ flattened to d when removing accents / diacritics?

Time: 2021-07-11 20:58:38

I'm using this method to remove accents from my strings:

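// Requires: using System.Globalization; using System.Text;
static string RemoveAccents(string input)
{
    // Compatibility decomposition splits letters such as "é" into "e" + a combining mark.
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in normalized)
    {
        // Keep everything except the combining (non-spacing) marks.
        if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }
    return builder.ToString();
}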

But this method leaves đ as đ and doesn't change it to d, even though d is its base character. You can try it with this input string: "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ"

What's so special about the letter đ?

4 answers

#1


13  

The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).

"đ".Normalize(NormalizationForm.FormD) simply returns "đ" (FormKD, which the question uses, does the same), and the loop does not strip it because it is a letter, not a non-spacing mark.
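
To make the difference visible, here is a minimal sketch that lists what each character turns into after FormD normalization. The class and method names (DecompositionDemo, Describe) are made up for illustration, and it assumes a runtime where String.Normalize has full Unicode data available.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Returns one entry per UTF-16 code unit of the FormD-normalized input,
    // formatted as "U+XXXX <UnicodeCategory>".
    public static string[] Describe(string s) =>
        s.Normalize(NormalizationForm.FormD)
         .Select(c => $"U+{(int)c:X4} {CharUnicodeInfo.GetUnicodeCategory(c)}")
         .ToArray();

    public static void Main()
    {
        // "\u00E9" is é: it decomposes into a base letter plus a combining acute accent.
        Console.WriteLine(string.Join(", ", Describe("\u00E9")));
        // "\u0111" is đ: it has no decomposition, so it stays a single letter.
        Console.WriteLine(string.Join(", ", Describe("\u0111")));
    }
}
```

The first line lists two entries (U+0065 as a lowercase letter, then U+0301 as a non-spacing mark), while the second lists only U+0111 as a lowercase letter, so there is no mark for the loop to remove.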

A similar issue exists for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin scripts; you'll also run into problems if you want to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
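
If pulling in a library is not an option, one possible workaround is to combine the mark-stripping loop with a small hand-written table for the letters that have no decomposition. The sketch below is only an illustration: the AsciiFolding class, the Fold method and the contents of the table are my own assumptions, and the table is deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical, incomplete table of letters without a Unicode decomposition.
    // The replacements are a judgment call, not something Unicode defines.
    private static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        ['\u0111'] = "d",  ['\u0110'] = "D",   // đ Đ
        ['\u00F8'] = "o",  ['\u00D8'] = "O",   // ø Ø
        ['\u00E6'] = "ae", ['\u00C6'] = "AE",  // æ Æ
        ['\u0142'] = "l",  ['\u0141'] = "L",   // ł Ł
        ['\u00DF'] = "ss",                     // ß
    };

    public static string Fold(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // Drop the combining marks produced by decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Replace letters from the table; pass everything else through unchanged.
            if (Fallback.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Input is "đéøß" written with escapes to avoid source-encoding surprises.
        Console.WriteLine(Fold("\u0111\u00E9\u00F8\u00DF")); // prints "deoss"
    }
}
```

Anything not in the table (Cyrillic, Greek, and so on) is left as-is, which is exactly the gap a transliteration library is meant to fill.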

#2


3  

I have to admit that I'm not sure why this works, but it sure seems to:

var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); 

=> "aoaaaaalccceeeeiiddnnooooruuuuyt"

#3


3  

"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.

#4


-4  

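This should work for characters that have a canonical decomposition (it still leaves đ unchanged, because FormD does not decompose it):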

    private static String RemoveDiacritics(string text)
    {
        String normalized = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < normalized.Length; i++)
        {
            Char c = normalized[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        return sb.ToString();
    }
