How do I remove accents from a string? [duplicate]

Time: 2021-06-06 00:10:56

Possible Duplicate:
How do I remove diacritics (accents) from a string in .NET?

I have the following string

áéíóú

which I need to convert to

aeiou

How can I achieve this? (I don't need to compare strings; I need the new string so that I can save it.)


Not a duplicate of How do I remove diacritics (accents) from a string in .NET?. The accepted answer there doesn't explain anything and that's why I've "reopened" it.

2 answers

#1


21  

It depends on your requirements. For most uses, normalising to NFD and then filtering out all combining characters will do. In some cases, normalising to NFKD is more appropriate (if you also want to remove some further distinctions between characters).
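To make the difference between the two forms concrete, here is a small sketch that prints the UTF-16 code units of a normalised string. The class name `NormalizationDemo` and the helper `CodeUnits` are made up for this illustration; only `string.Normalize` and `NormalizationForm` come from the framework.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: renders each UTF-16 code unit of s as "U+XXXX",
    // so the effect of a normalisation form can be inspected.
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

With this, `NormalizationDemo.CodeUnits("\u00E9".Normalize(NormalizationForm.FormD))` gives `U+0065 U+0301` (e followed by a combining acute accent). The ligature U+FB01 (ﬁ) is left alone by FormD but becomes `U+0066 U+0069` (plain "fi") under FormKD, which is the kind of "further distinction" NFKD removes.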

Some other distinctions will not be caught by this, notably stroked Latin characters, which have no Unicode decomposition. For some of them there is also no clear locale-independent mapping (should ł be considered equivalent to l or w?), so you may need to customise beyond this.
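A quick way to see which characters normalisation can and cannot take apart is to compare a character with its normalised form. This is only a sketch; `DecompositionCheck` and `Decomposes` are names invented for the example.

```csharp
using System;
using System.Text;

public static class DecompositionCheck
{
    // Hypothetical helper: true when normalising the single character changes it,
    // i.e. Unicode defines a decomposition for it in the chosen form.
    // Note: a lone surrogate is not valid input for Normalize and would throw.
    public static bool Decomposes(char c, bool compatNorm)
    {
        string s = c.ToString();
        NormalizationForm form = compatNorm ? NormalizationForm.FormKD : NormalizationForm.FormD;
        return !string.Equals(s, s.Normalize(form), StringComparison.Ordinal);
    }
}
```

`Decomposes('\u00E9', false)` is true for é, whereas `Decomposes('\u0142', false)` and `Decomposes('\u0142', true)` are both false for ł: the stroke is part of the letter as far as Unicode is concerned, so neither form separates it. The same holds for ø (U+00F8).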

There are also some cases where NFD and NFKD don't work quite as expected, to allow for consistency between Unicode versions.

Hence:

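// Requires: using System; using System.Collections.Generic;
//           using System.Globalization; using System.Text;

// Lazily yields the characters of src with combining marks removed.
// compatNorm selects NFKD instead of NFD; customFolding is applied to every kept character.
public static IEnumerable<char> RemoveDiacriticsEnum(string src, bool compatNorm, Func<char, char> customFolding)
{
    foreach (char c in src.Normalize(compatNorm ? NormalizationForm.FormKD : NormalizationForm.FormD))
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                // Combining mark: drop it.
                break;
            default:
                yield return customFolding(c);
                break;
        }
    }
}

// Overload with no custom folding: kept characters pass through unchanged.
public static IEnumerable<char> RemoveDiacriticsEnum(string src, bool compatNorm)
{
    // Calls the enumerating overload (not RemoveDiacritics) so the result stays lazy.
    return RemoveDiacriticsEnum(src, compatNorm, c => c);
}

// Builds a string from the filtered characters.
public static string RemoveDiacritics(string src, bool compatNorm, Func<char, char> customFolding)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in RemoveDiacriticsEnum(src, compatNorm, customFolding))
    {
        sb.Append(c);
    }
    return sb.ToString();
}

// Overload with no custom folding.
public static string RemoveDiacritics(string src, bool compatNorm)
{
    return RemoveDiacritics(src, compatNorm, c => c);
}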

Here the overloads without a customFolding argument supply a default for the problem cases mentioned above, which simply ignores them. Building the string is also kept separate from generating the enumeration of characters, so nothing is wasted when the result does not need to be a string (say we are going to write the characters to output next, or do some further character-by-character manipulation).
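As an illustration of the "write the chars to output next" case, the sketch below sends the filtered characters straight to a `TextWriter` without building an intermediate string. `StreamingDemo`, `StripMarks` and `WriteStripped` are hypothetical names; `StripMarks` repeats the NFD filtering loop only so that the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

public static class StreamingDemo
{
    // Same filtering idea as the answer's enumerating method, fixed to NFD
    // and with no custom folding, repeated here for self-containment.
    public static IEnumerable<char> StripMarks(string src)
    {
        foreach (char c in src.Normalize(NormalizationForm.FormD))
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.NonSpacingMark:
                case UnicodeCategory.SpacingCombiningMark:
                case UnicodeCategory.EnclosingMark:
                    break; // combining mark: drop it
                default:
                    yield return c;
                    break;
            }
        }
    }

    // Hypothetical helper: writes each kept character directly to the writer.
    public static void WriteStripped(string src, TextWriter output)
    {
        foreach (char c in StripMarks(src))
        {
            output.Write(c);
        }
    }
}
```

For example, `StreamingDemo.WriteStripped("áéíóú", Console.Out)` writes `aeiou` to the console; any other `TextWriter`, such as a `StreamWriter` over a file, can be passed the same way.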

For example, if we also wanted to convert ł and Ł to l and L, but had no other specialised concerns, we could use:

private static char NormaliseLWithStroke(char c)
{
  switch(c)
  {
     case 'ł':
       return 'l';
     case 'Ł':
       return 'L';
     default:
       return c;
  }
}

Passing this as the customFolding argument of the methods above removes the stroke in this case, along with the decomposable diacritics.
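A possible call site is sketched below. To keep the block self-contained it uses a condensed copy of the string-building method (NFD only, no compatNorm flag) inside a made-up class `FoldingDemo`; the folding function is the one shown above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class FoldingDemo
{
    // Condensed variant of the answer's method: always NFD, applies customFolding
    // to every character that is not a combining mark.
    public static string RemoveDiacritics(string src, Func<char, char> customFolding)
    {
        var sb = new StringBuilder(src.Length);
        foreach (char c in src.Normalize(NormalizationForm.FormD))
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isMark = cat == UnicodeCategory.NonSpacingMark
                       || cat == UnicodeCategory.SpacingCombiningMark
                       || cat == UnicodeCategory.EnclosingMark;
            if (!isMark)
            {
                sb.Append(customFolding(c));
            }
        }
        return sb.ToString();
    }

    // Folds only the stroked L; everything else passes through unchanged.
    public static char NormaliseLWithStroke(char c)
    {
        switch (c)
        {
            case '\u0142': return 'l'; // ł
            case '\u0141': return 'L'; // Ł
            default: return c;
        }
    }
}
```

`FoldingDemo.RemoveDiacritics("Łódź", FoldingDemo.NormaliseLWithStroke)` returns `Lodz`, while passing the identity function `c => c` instead returns `Łodz`, because normalisation alone leaves the stroke in place.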

#2


15  

// Requires: using System.Globalization; using System.Text;
public string RemoveDiacritics(string input)
{
    // Decompose so that each accent becomes a separate combining character.
    string stFormD = input.Normalize(NormalizationForm.FormD);
    int len = stFormD.Length;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len; i++)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[i]);
        // Keep everything except non-spacing marks (the accents).
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[i]);
        }
    }
    // Recompose whatever is left into NFC.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}
