Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode characters

I am looking for an algorithm that can map between characters with diacritics (tilde, circumflex, caret, umlaut, caron) and their "simple" character.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.

Purpose: to allow easy searching for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also keep Bjorn_Borg so I can find it if someone enters Bjorn and not Björn.

12 Answers

#1

I have done this recently in Java:

public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

This will do as you specified:

stripDiacritics("Björn")  = Bjorn

but it will fail on, for example, Białystok, because ł is a separate letter rather than a base letter with a diacritic, so normalization does not decompose it.

If you want a full-blown string simplifier, you will need a second cleanup round for some more special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it will give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections.

import java.text.Normalizer;
import java.util.regex.Pattern;

import com.google.common.collect.ImmutableMap;

public class StringSimplifier {
    public static final char DEFAULT_REPLACE_CHAR = '-';
    public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);

    private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
        // Remove characters with no semantics
        .put(".", "")
        .put("\"", "")
        .put("'", "")
        // Keep relevant characters as separators
        .put(" ", DEFAULT_REPLACE)
        .put("]", DEFAULT_REPLACE)
        .put("[", DEFAULT_REPLACE)
        .put(")", DEFAULT_REPLACE)
        .put("(", DEFAULT_REPLACE)
        .put("=", DEFAULT_REPLACE)
        .put("!", DEFAULT_REPLACE)
        .put("/", DEFAULT_REPLACE)
        .put("\\", DEFAULT_REPLACE)
        .put("&", DEFAULT_REPLACE)
        .put(",", DEFAULT_REPLACE)
        .put("?", DEFAULT_REPLACE)
        .put("°", DEFAULT_REPLACE) // Remove ?? is diacritic?
        .put("|", DEFAULT_REPLACE)
        .put("<", DEFAULT_REPLACE)
        .put(">", DEFAULT_REPLACE)
        .put(";", DEFAULT_REPLACE)
        .put(":", DEFAULT_REPLACE)
        .put("_", DEFAULT_REPLACE)
        .put("#", DEFAULT_REPLACE)
        .put("~", DEFAULT_REPLACE)
        .put("+", DEFAULT_REPLACE)
        .put("*", DEFAULT_REPLACE)
        // Replace non-diacritics with their equivalent characters
        .put("\u0141", "l") // BiaLystock
        .put("\u0142", "l") // Bialystock
        .put("ß", "ss")
        .put("æ", "ae")
        .put("ø", "o")
        .put("©", "c")
        .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
        .put("\u00F0", "d")
        .put("\u0110", "d")
        .put("\u0111", "d")
        .put("\u0189", "d")
        .put("\u0256", "d")
        .put("\u00DE", "th") // thorn Þ
        .put("\u00FE", "th") // thorn þ
        .build();

    public static String simplifiedString(String orig) {
        String str = orig;
        if (str == null) {
            return null;
        }
        str = stripDiacritics(str);
        str = stripNonDiacritics(str);
        if (str.length() == 0) {
            // Ugly special case to work around non-existing empty strings
            // in Oracle. Store the original string as the simplified one.
            // It would return an empty string if Oracle could store it.
            return orig;
        }
        return str.toLowerCase();
    }

    private static String stripNonDiacritics(String orig) {
        StringBuilder ret = new StringBuilder();
        String lastchar = null;
        for (int i = 0; i < orig.length(); i++) {
            String source = orig.substring(i, i + 1);
            String replace = NONDIACRITICS.get(source);
            String toReplace = replace == null ? source : replace;
            // Collapse runs of separators into a single one
            if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                toReplace = "";
            } else {
                lastchar = toReplace;
            }
            ret.append(toReplace);
        }
        // Drop a trailing separator
        if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
            ret.deleteCharAt(ret.length() - 1);
        }
        return ret.toString();
    }

    /*
    Special regular expression character ranges relevant for simplification
    -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
    InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î etc.
    IsSk: Symbol, Modifier see http://www.fileformat.info/info/unicode/category/Sk/list.htm
    IsLm: Letter, Modifier see http://www.fileformat.info/info/unicode/category/Lm/list.htm
     */
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}

#2

The core java.text package was designed to address this use case (matching strings without caring about diacritics, case, etc.).

Configure a Collator to sort on PRIMARY differences in characters. With that, create a CollationKey for each string. If all of your code is in Java, you can use the CollationKey directly. If you need to store the keys in a database or other sort of index, you can convert it to a byte array.

These classes use the Unicode standard case folding data to determine which characters are equivalent, and support various decomposition strategies.

Collator c = Collator.getInstance();
c.setStrength(Collator.PRIMARY);
Map<CollationKey, String> dictionary = new TreeMap<CollationKey, String>();
dictionary.put(c.getCollationKey("Björn"), "Björn");
...
CollationKey query = c.getCollationKey("bjorn");
System.out.println(dictionary.get(query)); // --> "Björn"

Note that collators are locale-specific. This is because "alphabetical order" differs between locales (and even over time, as has been the case with Spanish). The Collator class relieves you from having to track all of these rules and keep them up to date.
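
The "convert it to a byte array" step could be sketched roughly as follows. The class and method names (`CollationKeyBytes`, `searchKey`) are hypothetical. Pinning the collator to `Locale.ENGLISH` and turning on canonical decomposition are assumptions made here so the result does not depend on the machine's default locale. They are not part of the answer above.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeyBytes {
    // Builds a byte[] search key that ignores case and accent differences.
    // Assumption: English collation rules are acceptable for the data.
    static byte[] searchKey(String text) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.getCollationKey(text).toByteArray();
    }

    public static void main(String[] args) {
        byte[] stored = searchKey("Bj\u00F6rn"); // "Björn", as stored in the index
        byte[] query = searchKey("bjorn");       // what the user typed
        System.out.println(Arrays.equals(stored, query));
    }
}
```

The byte array can then be stored in a binary column next to the original name and compared for equality at query time, provided both sides were produced with the same collator configuration.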

#3

It's part of Apache Commons Lang as of ver. 3.1.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

#4

You could use the Normalizer class from java.text:

System.out.println(new String(Normalizer.normalize("ń ǹ ň ñ ṅ ņ ṇ ṋ", Normalizer.Form.NFKD).getBytes("ascii"), "ascii"));

But there is still some work to do, since Java does strange things with unconvertible Unicode characters (it neither ignores them nor throws an exception; in practice each one typically comes out as '?'). But I think you could use that as a starting point.
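
One way to finish that work is to replace the round trip through `getBytes` with a regular expression that drops everything outside ASCII after decomposition. This is a swapped-in technique, not the answer's own code, and `AsciiFold` and `toAscii` are hypothetical names.

```java
import java.text.Normalizer;

class AsciiFold {
    // Decomposes the input (NFKD), then deletes every non-ASCII character,
    // which removes the combining marks left behind by the decomposition.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "ń ǹ ň ñ" written with escapes to avoid source-encoding problems
        System.out.println(toAscii("\u0144 \u01F9 \u0148 \u00F1")); // n n n n
    }
}
```

Note that letters without a decomposition, such as ł, are deleted entirely by this approach instead of being mapped to a base letter.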

#5

There is a draft report on character folding on the Unicode website which has a lot of relevant material. See specifically Section 4.1, "Folding algorithm".

Here's a discussion and implementation of diacritic marker removal using Perl.

These existing SO questions are related:

How to convert UTF-8 to US ASCII
How to change diacritic characters to non-diacritic ones

#6

Please note that not all of these marks are just "marks" on some "normal" character, that you can remove without changing the meaning.

In Swedish, å, ä, and ö are true and proper first-class characters, not some "variant" of some other character. They sound different from all other characters, they sort differently, and they make words change meaning ("mätt" and "matt" are two different words).

#7

Unicode has specific diacritic characters (combining characters), and a string can be converted so that each base character and its diacritics are separated. Then you can just remove the diacritics from the string and you're basically done.
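
To make the separation step concrete, here is a small Java sketch that prints the code points of a string before and after canonical decomposition. `DecompositionDemo` and its methods are hypothetical names used only for illustration.

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Canonical decomposition: precomposed letters become base letter + combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Renders a string as a space-separated list of U+XXXX code points.
    static String codePointsOf(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String precomposed = "\u00F1"; // ñ as a single code point
        System.out.println(codePointsOf(precomposed));            // U+00F1
        System.out.println(codePointsOf(decompose(precomposed))); // U+006E U+0303
    }
}
```

After decomposition the tilde is a separate code point (U+0303, COMBINING TILDE), so it can be filtered out on its own.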

For more information on normalization, decompositions and equivalence, see The Unicode Standard at the Unicode home page.

However, how you can actually achieve this depends on the framework/OS/... you're working on. If you're using .NET, you can use the String.Normalize method accepting the System.Text.NormalizationForm enumeration.

#8

The easiest way (to me) would be to simply maintain a sparse mapping array which simply changes your Unicode code points into displayable strings.

Such as:

start    = 0x00C0
size     = 23
mappings = {
    "A","A","A","A","A","A","AE","C",
    "E","E","E","E","I","I","I", "I",
    "D","N","O","O","O","O","O"
}

start    = 0x00D8
size     = 6
mappings = {
    "O","U","U","U","U","Y"
}

start    = 0x00E0
size     = 23
mappings = {
    "a","a","a","a","a","a","ae","c",
    "e","e","e","e","i","i","i", "i",
    "d","n","o","o","o","o","o"
}

start    = 0x00F8
size     = 6
mappings = {
    "o","u","u","u","u","y"
}

: : :

The use of a sparse array will allow you to efficiently represent replacements even when they are in widely spaced sections of the Unicode table. String replacements will allow arbitrary sequences to replace your diacritics (such as the æ grapheme becoming ae).

This is a language-agnostic answer so, if you have a specific language in mind, there will be better ways (although they'll all likely come down to this at the lowest levels anyway).
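
Since the question asks about Java, the range tables above might be sketched there as follows. This is one possible rendering under stated assumptions: the `SparseFold` class, the `fold` method, and the choice of a `TreeMap` keyed by range start are all illustrative and not taken from the answer. Only the four Latin-1 ranges shown above are included.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseFold {
    // Key: first code point of a range. Value: replacements for consecutive code points.
    private static final TreeMap<Integer, String[]> RANGES = new TreeMap<>();
    static {
        RANGES.put(0x00C0, new String[] {
            "A", "A", "A", "A", "A", "A", "AE", "C",
            "E", "E", "E", "E", "I", "I", "I", "I",
            "D", "N", "O", "O", "O", "O", "O"});
        RANGES.put(0x00D8, new String[] {"O", "U", "U", "U", "U", "Y"});
        RANGES.put(0x00E0, new String[] {
            "a", "a", "a", "a", "a", "a", "ae", "c",
            "e", "e", "e", "e", "i", "i", "i", "i",
            "d", "n", "o", "o", "o", "o", "o"});
        RANGES.put(0x00F8, new String[] {"o", "u", "u", "u", "u", "y"});
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            // Find the closest range starting at or before this code point.
            Map.Entry<Integer, String[]> range = RANGES.floorEntry(cp);
            if (range != null && cp - range.getKey() < range.getValue().length) {
                out.append(range.getValue()[cp - range.getKey()]);
            } else {
                out.appendCodePoint(cp); // not covered by any range: keep as-is
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Bj\u00F6rn")); // Bjorn
    }
}
```

Code points that fall in the gaps between ranges (for example × at U+00D7) pass through unchanged.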

#9

Something to consider: if you go the route of trying to get a single "translation" of each word, you may miss out on some possible alternates.

For instance, in German, when replacing the eszett (ß), some people might use "B", while others might use "ss". Or an o with umlaut (ö) might be replaced with "o" or "oe". Ideally, I would think any solution you come up with should include both.
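
A rough sketch of "include both": generate every spelling a user might type and index all of them. The `SearchVariants` class, the `variants` method, and the small alternatives table are hypothetical. A real table would need to cover whatever characters appear in the data.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SearchVariants {
    // Each special character maps to all replacements people commonly type.
    private static final Map<Character, List<String>> ALTERNATIVES = new HashMap<>();
    static {
        ALTERNATIVES.put('\u00E4', Arrays.asList("a", "ae")); // ä
        ALTERNATIVES.put('\u00F6', Arrays.asList("o", "oe")); // ö
        ALTERNATIVES.put('\u00FC', Arrays.asList("u", "ue")); // ü
        ALTERNATIVES.put('\u00DF', Arrays.asList("ss", "B")); // ß
    }

    // Returns every combination of replacements for the given word.
    static Set<String> variants(String word) {
        Set<String> result = new LinkedHashSet<>();
        result.add("");
        for (char ch : word.toCharArray()) {
            List<String> options = ALTERNATIVES.getOrDefault(
                ch, Collections.singletonList(String.valueOf(ch)));
            Set<String> next = new LinkedHashSet<>();
            for (String prefix : result) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("Bj\u00F6rn")); // [Bjorn, Bjoern]
    }
}
```

The number of variants doubles with every special character in the word, so this suits short strings such as names better than long text.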

#10

In Windows and .NET, I just convert using string encoding. That way I avoid manual mapping and coding.

Try to play with string encoding.

#11

In the case of German, removing the diacritics from umlauts (ä, ö, ü) is not wanted. Instead, they are replaced by two-letter combinations (ae, oe, ue). For instance, Björn should be written as Bjoern (not Bjorn) to have the correct pronunciation.

For that I would rather have a hardcoded mapping, where you can define the replacement rule individually for each special character group.
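
Such a hardcoded mapping could look something like the sketch below. The class name `GermanTransliterator`, the rule table, and the choice to map uppercase umlauts to "Ae", "Oe", "Ue" are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GermanTransliterator {
    // One explicit rule per special character; extend per language as needed.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\u00E4", "ae"); // ä
        RULES.put("\u00F6", "oe"); // ö
        RULES.put("\u00FC", "ue"); // ü
        RULES.put("\u00C4", "Ae"); // Ä
        RULES.put("\u00D6", "Oe"); // Ö
        RULES.put("\u00DC", "Ue"); // Ü
        RULES.put("\u00DF", "ss"); // ß
    }

    static String transliterate(String input) {
        String result = input;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(transliterate("Bj\u00F6rn")); // Bjoern
    }
}
```

Running this before any generic accent stripping keeps the German spellings, while other accented letters can still be handled by normalization afterwards.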

#12

For future reference, here is a C# extension method that removes accents.

using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Text;

public static class StringExtensions
{
    public static string RemoveDiacritics(this string str)
    {
        // Decompose, then drop every non-spacing (combining) mark.
        return new string(
            str.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) !=
                            UnicodeCategory.NonSpacingMark)
                .ToArray());
    }
}

static void Main()
{
    var input = "ŃŅŇ ÀÁÂÃÄÅ ŢŤţť Ĥĥ àáâãäå ńņň";
    var output = input.RemoveDiacritics();
    Debug.Assert(output == "NNN AAAAAA TTtt Hh aaaaaa nnn");
}
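
For readers working in Java, as the question does, a rough equivalent of the extension method above might look like this. It is a port written for illustration, with the hypothetical names `RemoveDiacritics` and `removeDiacritics`. It filters on `Character.NON_SPACING_MARK`, the Java counterpart of `UnicodeCategory.NonSpacingMark`.

```java
import java.text.Normalizer;

class RemoveDiacritics {
    // Decomposes to NFD, then keeps every code point that is not a non-spacing mark.
    static String removeDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        decomposed.codePoints()
            .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // "ŃŅŇ" written with escapes to avoid source-encoding problems
        System.out.println(removeDiacritics("\u0143\u0145\u0147")); // NNN
    }
}
```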
