Removing diacritical marks (ǹṅṅņṇṋṉ̈ƞᶇᶇȵ) from Unicode characters

Date: 2022-06-16 09:51:35

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" characters.


For example:


ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.


  1. I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.


  2. Purpose: to allow easy searching for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone types Bjorn rather than Björn.


12 Answers

#1


69  

I have done this recently in Java:


public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

This will do as you specified:


stripDiacritics("Björn")  = Bjorn

but it will fail on, for example, Białystok, because the ł character is not a diacritic.


If you want to have a full-blown string simplifier, you will need a second cleanup round for some more special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it will give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections.


import java.text.Normalizer;
import java.util.regex.Pattern;

import com.google.common.collect.ImmutableMap;

public class StringSimplifier {
    public static final char DEFAULT_REPLACE_CHAR = '-';
    public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
    private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()

        //Remove strings with no semantics
        .put(".", "")
        .put("\"", "")
        .put("'", "")

        //Keep relevant characters as separation
        .put(" ", DEFAULT_REPLACE)
        .put("]", DEFAULT_REPLACE)
        .put("[", DEFAULT_REPLACE)
        .put(")", DEFAULT_REPLACE)
        .put("(", DEFAULT_REPLACE)
        .put("=", DEFAULT_REPLACE)
        .put("!", DEFAULT_REPLACE)
        .put("/", DEFAULT_REPLACE)
        .put("\\", DEFAULT_REPLACE)
        .put("&", DEFAULT_REPLACE)
        .put(",", DEFAULT_REPLACE)
        .put("?", DEFAULT_REPLACE)
        .put("°", DEFAULT_REPLACE) //Remove ?? is diacritic?
        .put("|", DEFAULT_REPLACE)
        .put("<", DEFAULT_REPLACE)
        .put(">", DEFAULT_REPLACE)
        .put(";", DEFAULT_REPLACE)
        .put(":", DEFAULT_REPLACE)
        .put("_", DEFAULT_REPLACE)
        .put("#", DEFAULT_REPLACE)
        .put("~", DEFAULT_REPLACE)
        .put("+", DEFAULT_REPLACE)
        .put("*", DEFAULT_REPLACE)

        //Replace non-diacritic special characters with their equivalents
        .put("\u0141", "l") // BiaLystock
        .put("\u0142", "l") // Bialystock
        .put("ß", "ss")
        .put("æ", "ae")
        .put("ø", "o")
        .put("©", "c")
        .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
        .put("\u00F0", "d")
        .put("\u0110", "d")
        .put("\u0111", "d")
        .put("\u0189", "d")
        .put("\u0256", "d")
        .put("\u00DE", "th") // thorn Þ
        .put("\u00FE", "th") // thorn þ
        .build();


    public static String simplifiedString(String orig) {
        String str = orig;
        if (str == null) {
            return null;
        }
        str = stripDiacritics(str);
        str = stripNonDiacritics(str);
        if (str.length() == 0) {
            // Ugly special case to work around non-existing empty strings
            // in Oracle. Store original crapstring as simplified.
            // It would return an empty string if Oracle could store it.
            return orig;
        }
        return str.toLowerCase();
    }

    private static String stripNonDiacritics(String orig) {
        StringBuffer ret = new StringBuffer();
        String lastchar = null;
        for (int i = 0; i < orig.length(); i++) {
            String source = orig.substring(i, i + 1);
            String replace = NONDIACRITICS.get(source);
            String toReplace = replace == null ? String.valueOf(source) : replace;
            if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                toReplace = "";
            } else {
                lastchar = toReplace;
            }
            ret.append(toReplace);
        }
        if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
            ret.deleteCharAt(ret.length() - 1);
        }
        return ret.toString();
    }

    /*
     * Special regular expression character ranges relevant for simplification -> see
     * http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
     * InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î, etc.
     * IsSk: Symbol, Modifier - see http://www.fileformat.info/info/unicode/category/Sk/list.htm
     * IsLm: Letter, Modifier - see http://www.fileformat.info/info/unicode/category/Lm/list.htm
     */
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");


    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}
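
As a quick illustration (not part of the original answer, and assuming Guava's ImmutableMap is on the classpath), the class above would fold names like this:

// Hypothetical usage sketch for the StringSimplifier above.
public class StringSimplifierDemo {
    public static void main(String[] args) {
        // "Björn Borg": ö loses its umlaut, the space becomes '-', the result is lowercased.
        System.out.println(StringSimplifier.simplifiedString("Björn Borg")); // bjorn-borg

        // "Białystok": ł is not a diacritic, so the NONDIACRITICS map handles it.
        System.out.println(StringSimplifier.simplifiedString("Białystok"));  // bialystok
    }
}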

#2


23  

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).


Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.


These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.


Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<CollationKey, String>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and can even change over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.

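If the keys do need to go into a database or an external index, a minimal sketch of the byte-array route could look like this (the class name is made up; how you store the byte[] is up to your schema):

import java.text.CollationKey;
import java.text.Collator;

public class CollationKeyBytesSketch {
    public static void main(String[] args) {
        Collator c = Collator.getInstance();
        c.setStrength(Collator.PRIMARY);

        CollationKey key = c.getCollationKey("Björn");

        // toByteArray() yields bytes whose unsigned lexicographic order matches
        // the CollationKey comparison, so they can be stored and indexed directly.
        byte[] storable = key.toByteArray();
        System.out.println(storable.length);
    }
}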

#3


13  

It's part of Apache Commons Lang as of ver. 3.1.


org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An


#4


11  

You could use the Normalizer class from java.text:


System.out.println(new String(Normalizer.normalize("ń ǹ ň ñ ṅ ņ ṇ ṋ", Normalizer.Form.NFKD).getBytes("ascii"), "ascii"));

But there is still some work to do, since Java does strange things with unconvertible Unicode characters (it neither ignores them nor throws an exception). But I think you could use that as a starting point.

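One way to take control of that behaviour (a sketch of my own, not from the original answer) is to run the decomposed string through a CharsetEncoder configured to silently drop anything that cannot be encoded as ASCII:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class AsciiFoldSketch {
    public static void main(String[] args) throws CharacterCodingException {
        String decomposed = Normalizer.normalize("ń ǹ ň ñ ṅ ņ ṇ ṋ", Normalizer.Form.NFKD);

        // IGNORE drops unmappable characters instead of replacing them with '?'.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);

        ByteBuffer bytes = encoder.encode(CharBuffer.wrap(decomposed));
        String ascii = new String(bytes.array(), 0, bytes.limit(), StandardCharsets.US_ASCII);
        System.out.println(ascii); // n n n n n n n n
    }
}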

#5


10  

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".


Here's a discussion and implementation of diacritic marker removal using Perl.


These existing SO questions are related:


#6


4  

Please note that not all of these marks are just "marks" on some "normal" character which you can remove without changing the meaning.


In Swedish, å, ä and ö are true, proper first-class characters, not some "variant" of another character. They sound different from all other characters, they sort differently, and they make words change meaning ("mätt" and "matt" are two different words).


#7


2  

Unicode has specific diacritic characters, and precomposed (composite) characters can be decomposed so that the base character and the diacritics are separated. Then you can just remove the diacritics from the string and you're basically done.


For more information on normalization, decompositions and equivalence, see The Unicode Standard at the Unicode home page.


However, how you can actually achieve this depends on the framework/OS/... you're working on. If you're using .NET, you can use the String.Normalize method accepting the System.Text.NormalizationForm enumeration.


#8


2  

The easiest way (to me) would be to maintain a sparse mapping array that simply changes your Unicode code points into displayable strings.


Such as:


start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}
start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}
start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}
start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}
: : :

The use of a sparse array will allow you to efficiently represent replacements even when they lie in widely spaced sections of the Unicode table. String replacements will allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).


This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).

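Since the question asks about Java, here is a minimal sketch of the block idea above in Java (not from the original answer; the class name is made up and the tables cover only the Latin-1 ranges listed):

public class SparseFoldTable {
    // Parallel arrays: block start code points and their replacement tables.
    private static final int[] STARTS = {0x00C0, 0x00D8, 0x00E0, 0x00F8};
    private static final String[][] MAPPINGS = {
        {"A","A","A","A","A","A","AE","C",
         "E","E","E","E","I","I","I","I",
         "D","N","O","O","O","O","O"},
        {"O","U","U","U","U","Y"},
        {"a","a","a","a","a","a","ae","c",
         "e","e","e","e","i","i","i","i",
         "d","n","o","o","o","o","o"},
        {"o","u","u","u","u","y"},
    };

    // Looks up a single code point; returns null if no block covers it.
    private static String lookup(int codePoint) {
        for (int b = 0; b < STARTS.length; b++) {
            int offset = codePoint - STARTS[b];
            if (offset >= 0 && offset < MAPPINGS[b].length) {
                return MAPPINGS[b][offset];
            }
        }
        return null;
    }

    // Folds a whole string; unmapped code points pass through unchanged.
    public static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            String mapped = lookup(cp);
            sb.append(mapped != null ? mapped : String.valueOf(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Björn"));      // Bjorn
        System.out.println(fold("Ærøskøbing")); // AEroskobing
    }
}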

#9


2  

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.


For instance, in German, when replacing the eszett ("ß"), some people might use "B", while others might use "ss". Or, when replacing an umlauted o, some use "o" and others "oe". Ideally, any solution you come up with should include both.

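A small sketch of what indexing both alternates could look like (the class and method names are made up, and the replacement lists are deliberately incomplete):

import java.util.LinkedHashSet;
import java.util.Set;

public class GermanVariantsSketch {
    // Returns the common ASCII spellings of a German word so all of them can be indexed.
    public static Set<String> variants(String word) {
        Set<String> out = new LinkedHashSet<String>();
        // Two-letter convention: ä -> ae, ö -> oe, ü -> ue, ß -> ss.
        out.add(word.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")
                    .replace("Ä", "Ae").replace("Ö", "Oe").replace("Ü", "Ue")
                    .replace("ß", "ss"));
        // Bare-letter convention: ä -> a, ö -> o, ü -> u, ß -> ss.
        out.add(word.replace("ä", "a").replace("ö", "o").replace("ü", "u")
                    .replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")
                    .replace("ß", "ss"));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("Björn")); // [Bjoern, Bjorn]
    }
}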

#10


2  

In Windows and .NET, I just convert using string encoding. That way I avoid manual mapping and coding.


Try to play with string encoding.


#11


2  

In the case of German, you don't want to simply remove the diacritics from umlauts (ä, ö, ü). Instead, they are replaced by two-letter combinations (ae, oe, ue). For instance, Björn should be written as Bjoern (not Bjorn) to preserve the correct pronunciation.


For that I would rather have a hardcoded mapping, where you can define the replacement rule individually for each special character group.

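A minimal sketch of such a mapping, applied before the generic diacritic stripping (the class name and map contents are just an example):

import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class GermanFolderSketch {
    private static final Map<String, String> GERMAN_RULES = new LinkedHashMap<String, String>();
    static {
        GERMAN_RULES.put("Ä", "Ae"); GERMAN_RULES.put("Ö", "Oe"); GERMAN_RULES.put("Ü", "Ue");
        GERMAN_RULES.put("ä", "ae"); GERMAN_RULES.put("ö", "oe"); GERMAN_RULES.put("ü", "ue");
        GERMAN_RULES.put("ß", "ss");
    }

    public static String fold(String s) {
        // 1. Apply the language-specific rules while the umlauts are still intact.
        for (Map.Entry<String, String> e : GERMAN_RULES.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // 2. Strip whatever combining marks remain with the generic NFD approach.
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Björn")); // Bjoern
    }
}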

#12


1  

For future reference, here is a C# extension method that removes accents.


using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    public static string RemoveDiacritics(this string str)
    {
        // Decompose (FormD), then drop every combining (non-spacing) mark.
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}

public static class Program
{
    static void Main()
    {
        var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
        var output = input.RemoveDiacritics();
        Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
    }
}
