I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter. (E.g. convert é to e, so crème brûlée would become creme brulee)
What is the best method for achieving this?
17 answers
#1
421
I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
static string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
Note that this is a followup to his earlier post: Stripping diacritics....
The approach uses String.Normalize to decompose the input string into its constituent characters (basically separating the "base" characters from the combining diacritics), then scans the result and keeps only the base characters. It's a little complicated, but really you're looking at a complicated problem.
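To make the decomposition step concrete, here is a minimal sketch in Java (the language of the Java answers further down) that lists the code points of the decomposed form. The class and method names are made up for illustration; `java.text.Normalizer` plays the role that `String.Normalize` plays in .NET.

```java
import java.text.Normalizer;
import java.util.StringJoiner;

final class NfdDemo {
    // Returns the code points of the NFD (decomposed) form of the input,
    // formatted like "U+0065 U+0301".
    static String decomposedCodePoints(String input) {
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringJoiner joiner = new StringJoiner(" ");
        nfd.codePoints().forEach(cp -> joiner.add(String.format("U+%04X", cp)));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // e with acute accent splits into 'e' plus COMBINING ACUTE ACCENT
        System.out.println(decomposedCodePoints("\u00E9")); // U+0065 U+0301
        // c with cedilla splits into 'c' plus COMBINING CEDILLA
        System.out.println(decomposedCodePoints("\u00E7")); // U+0063 U+0327
    }
}
```

The second code point in each pair is the nonspacing mark that the loop above filters out.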
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
#2
118
This did the trick for me:
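string accentedStr = "crème brûlée"; // sample input; the original snippet left this unassigned
// ISO-8859-8 (Hebrew) contains no accented Latin letters, so this relies on the
// encoder falling back to the closest plain letter for each accented one.
byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);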
Quick and short!
#3
27
In case someone is interested, I was looking for something similar and ended up writing the following:
public static string NormalizeStringForUrl(string name)
{
    String normalizedString = name.Normalize(NormalizationForm.FormD);
    StringBuilder stringBuilder = new StringBuilder();
    foreach (char c in normalizedString)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.LowercaseLetter:
            case UnicodeCategory.UppercaseLetter:
            case UnicodeCategory.DecimalDigitNumber:
                stringBuilder.Append(c);
                break;
            case UnicodeCategory.SpaceSeparator:
            case UnicodeCategory.ConnectorPunctuation:
            case UnicodeCategory.DashPunctuation:
                stringBuilder.Append('_');
                break;
        }
    }
    string result = stringBuilder.ToString();
    // remove duplicate underscores
    return String.Join("_", result.Split(new char[] { '_' }, StringSplitOptions.RemoveEmptyEntries));
}
#4
13
In case anyone's interested, here is the Java equivalent:
import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i = 0; i < nrml.length(); ++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}
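One thing to be aware of with the normalization-based approach, in Java and C# alike: letters such as ø, Ł or ß have no canonical decomposition, so there is no separate mark to remove and they pass through unchanged. That is presumably why the table-based answers in this thread exist. A small sketch (the class and method names are only for illustration):

```java
import java.text.Normalizer;

final class DecompositionLimits {
    // NFD-decompose, then drop every nonspacing mark (general category Mn).
    static String stripMarks(String input) {
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // Accented French letters decompose, so the marks can be stripped.
        System.out.println(stripMarks("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee
        // The o and z lose their acute accents, but the stroke on L is not a combining mark.
        System.out.println(stripMarks("\u0141\u00F3d\u017A"));
        // o with stroke has no decomposition either, so this comes back unchanged.
        System.out.println(stripMarks("sm\u00F8rrebr\u00F8d"));
    }
}
```

If such letters matter for your input, a small replacement table applied after the stripping step is one way to cover them.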
#5
11
I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:
- Normalizing to form D splits characters like è into an e and a nonspacing ` (the combining grave accent)
- From this, the nonspacing characters are removed
- The result is normalized back to form C (I'm not sure if this is necessary)
Code:
using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (null == str) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;
        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
        return cleanStr;
    }

    // or, alternatively
    public static string RemoveDiacritics2(this string str)
    {
        if (null == str) return null;
        var chars = str
            .Normalize(NormalizationForm.FormD)
            .ToCharArray()
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(chars).Normalize(NormalizationForm.FormC);
    }
}
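On whether the final normalization back to form C is necessary: for the French examples in this thread it changes nothing, since only plain letters remain once the marks are gone. It can matter for other scripts, though. Form D also decomposes Hangul syllables into jamo, which are letters rather than nonspacing marks, so they survive the filter and stay decomposed unless the result is recomposed. A small Java sketch of this (hypothetical class and method names, not a port of the C# code above):

```java
import java.text.Normalizer;

final class RecomposeDemo {
    // Strips nonspacing marks from the NFD form; optionally recomposes to NFC afterwards.
    static String stripMarks(String input, boolean recompose) {
        String nfd = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder kept = new StringBuilder();
        nfd.codePoints()
           .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
           .forEach(kept::appendCodePoint);
        String result = kept.toString();
        return recompose ? Normalizer.normalize(result, Normalizer.Form.NFC) : result;
    }

    public static void main(String[] args) {
        String syllable = "\uD55C"; // a single precomposed Hangul syllable
        // Without recomposition the syllable is left as three separate jamo.
        System.out.println(stripMarks(syllable, false).length()); // 3
        // With recomposition it is a single character again.
        System.out.println(stripMarks(syllable, true).length());  // 1
    }
}
```

So the extra call is cheap insurance if the input is not guaranteed to be Latin-only.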
#6
10
I needed something that converts all major Unicode characters, and the top-voted answer left a few out, so I've created a C# version of CodeIgniter's convert_accented_characters($str) that is easily customisable:
using System;
using System.Text;
using System.Collections.Generic;
public static class Strings
{
static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
{
{ "äæǽ", "ae" },
{ "öœ", "oe" },
{ "ü", "ue" },
{ "Ä", "Ae" },
{ "Ü", "Ue" },
{ "Ö", "Oe" },
{ "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
{ "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
{ "Б", "B" },
{ "б", "b" },
{ "ÇĆĈĊČ", "C" },
{ "çćĉċč", "c" },
{ "Д", "D" },
{ "д", "d" },
{ "ÐĎĐΔ", "Dj" },
{ "ðďđδ", "dj" },
{ "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
{ "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
{ "Ф", "F" },
{ "ф", "f" },
{ "ĜĞĠĢΓГҐ", "G" },
{ "ĝğġģγгґ", "g" },
{ "ĤĦ", "H" },
{ "ĥħ", "h" },
{ "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
{ "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
{ "Ĵ", "J" },
{ "ĵ", "j" },
{ "ĶΚК", "K" },
{ "ķκк", "k" },
{ "ĹĻĽĿŁΛЛ", "L" },
{ "ĺļľŀłλл", "l" },
{ "М", "M" },
{ "м", "m" },
{ "ÑŃŅŇΝН", "N" },
{ "ñńņňʼnνн", "n" },
{ "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
{ "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
{ "П", "P" },
{ "п", "p" },
{ "ŔŖŘΡР", "R" },
{ "ŕŗřρр", "r" },
{ "ŚŜŞȘŠΣС", "S" },
{ "śŝşșšſσςс", "s" },
{ "ȚŢŤŦτТ", "T" },
{ "țţťŧт", "t" },
{ "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
{ "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
{ "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
{ "ýÿŷỳỹỷỵй", "y" },
{ "В", "V" },
{ "в", "v" },
{ "Ŵ", "W" },
{ "ŵ", "w" },
{ "ŹŻŽΖЗ", "Z" },
{ "źżžζз", "z" },
{ "ÆǼ", "AE" },
{ "ß", "ss" },
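{ "\u0132", "IJ" }, // IJ ligature (U+0132); a two-letter key "IJ" would also match plain I and J
{ "\u0133", "ij" }, // ij ligature (U+0133); a two-letter key "ij" would also match plain i and j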
{ "Œ", "OE" },
{ "ƒ", "f" },
{ "ξ", "ks" },
{ "π", "p" },
{ "β", "v" },
{ "μ", "m" },
{ "ψ", "ps" },
{ "Ё", "Yo" },
{ "ё", "yo" },
{ "Є", "Ye" },
{ "є", "ye" },
{ "Ї", "Yi" },
{ "Ж", "Zh" },
{ "ж", "zh" },
{ "Х", "Kh" },
{ "х", "kh" },
{ "Ц", "Ts" },
{ "ц", "ts" },
{ "Ч", "Ch" },
{ "ч", "ch" },
{ "Ш", "Sh" },
{ "ш", "sh" },
{ "Щ", "Shch" },
{ "щ", "shch" },
{ "ЪъЬь", "" },
{ "Ю", "Yu" },
{ "ю", "yu" },
{ "Я", "Ya" },
{ "я", "ya" },
};
// Returns the first character of the replacement, or c itself when no usable mapping exists.
public static char RemoveDiacritics(this char c)
{
    foreach (KeyValuePair<string, string> entry in foreign_characters)
    {
        if (entry.Key.IndexOf(c) != -1)
        {
            // Some entries map to an empty string (e.g. Ъ, Ь); keep the input in that case.
            return entry.Value.Length > 0 ? entry.Value[0] : c;
        }
    }
    return c;
}

public static string RemoveDiacritics(this string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        bool replaced = false;
        foreach (KeyValuePair<string, string> entry in foreign_characters)
        {
            if (entry.Key.IndexOf(c) != -1)
            {
                sb.Append(entry.Value); // may be empty, which drops the character
                replaced = true;
                break;
            }
        }
        if (!replaced)
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
}
Usage
// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee
// for chars
"Ã"[0].RemoveDiacritics (); // A
#7
5
The Greek (ISO) code page can do it.
Information about this code page is available from System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
Greek (ISO) has code page 28597 and the name iso-8859-7.
On to the code... \o/
string text = "Você está numa situação lamentável";
string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
//result: "Voce+esta+numa+situacao+lamentavel"
string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
//result: "Voce esta numa situacao lamentavel"
So, write this function...
public string RemoveAcentuation(string text)
{
    return
        System.Web.HttpUtility.UrlDecode(
            System.Web.HttpUtility.UrlEncode(
                text, Encoding.GetEncoding("iso-8859-7")));
}
Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597): the first takes the encoding's name, the second its code page.
#8
3
This works fine in Java.
It basically converts all accented characters into their unaccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;

public class DeAccent {
    // Compiled once; matches runs of characters in the Combining Diacritical Marks block.
    private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String deAccent(String str) {
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
    }
}
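A caveat on the pattern used here: `InCombiningDiacriticalMarks` names a single Unicode block (U+0300 to U+036F). That covers the accents used in French and most Latin-script languages, but marks that live in other blocks, such as Hebrew vowel points, are left in place. Matching on the general category `Mn` is closer to what the C# answers test with `UnicodeCategory.NonSpacingMark`. A sketch comparing the two (class and method names are illustrative only):

```java
import java.text.Normalizer;

final class MarkPatterns {
    // Removes only marks in the Combining Diacritical Marks block (U+0300..U+036F).
    static String stripByBlock(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Removes every nonspacing mark (category Mn), whatever block it belongs to.
    static String stripByCategory(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String french = "cr\u00E8me";
        String hebrew = "\u05E9\u05B8"; // letter shin followed by the vowel point qamats
        System.out.println(stripByBlock(french));             // creme
        System.out.println(stripByCategory(french));          // creme
        System.out.println(stripByBlock(hebrew).length());    // 2, the point is kept
        System.out.println(stripByCategory(hebrew).length()); // 1, the point is removed
    }
}
```

For French-only input the two behave the same, so the original pattern is fine for the question as asked.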
#9
3
This is the VB version (works with Greek):
Imports System.Text
Imports System.Globalization

Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function
#10
2
This is how I replace diacritic characters with non-diacritic ones in all my .NET programs.
C#:
// Replaces each accented letter with its unaccented equivalent from the 0-127 ASCII table, e.g. 'é' becomes 'e'.
// Note that the result is also lower-cased.
public string RemoveDiacritics(string s)
{
    string normalizedString = null;
    StringBuilder stringBuilder = new StringBuilder();
    normalizedString = s.Normalize(NormalizationForm.FormD);
    int i = 0;
    char c = '\0';
    for (i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    return stringBuilder.ToString().ToLower();
}
VB.NET:
' Replaces each accented letter with its unaccented equivalent from the 0-127 ASCII table, e.g. "é" becomes "e".
' Note that the result is also lower-cased.
Public Function RemoveDiacritics(ByVal s As String) As String
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString().ToLower()
End Function
#11
2
You can use the string extension from the MMLib.Extensions NuGet package:
using MMLib.RapidPrototyping.Generators;

public void ExtensionsExample()
{
    string target = "aácčeéií";
    Assert.AreEqual("aacceeii", target.RemoveDiacritics());
}
NuGet page: https://www.nuget.org/packages/MMLib.Extensions/
CodePlex project site: https://mmlib.codeplex.com/
#12
2
It's funny that such a question can get so many answers, and yet none fit my requirements :) There are so many languages around that a fully language-agnostic solution is AFAIK not really possible, and others have mentioned that FormC or FormD cause issues.
Since the original question was related to French, the simplest working answer is indeed
public static string ConvertWesternEuropeanToASCII(this string str)
{
    return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
}
1251 should be replaced by the code page of the input language.
This, however, only replaces one character with exactly one character. Since I am also working with German as input, I did a manual conversion:
public static string LatinizeGermanCharacters(this string str)
{
    StringBuilder sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        switch (c)
        {
            case 'ä':
                sb.Append("ae");
                break;
            case 'ö':
                sb.Append("oe");
                break;
            case 'ü':
                sb.Append("ue");
                break;
            case 'Ä':
                sb.Append("Ae");
                break;
            case 'Ö':
                sb.Append("Oe");
                break;
            case 'Ü':
                sb.Append("Ue");
                break;
            case 'ß':
                sb.Append("ss");
                break;
            default:
                sb.Append(c);
                break;
        }
    }
    return sb.ToString();
}
It might not deliver the best performance, but at least it is very easy to read and extend. I stayed away from Regex, which I found much slower than plain char/string operations.
I also have a very simple method to remove spaces:
public static string RemoveSpace(this string str)
{
    return str.Replace(" ", string.Empty);
}
In the end, I use a combination of the three extensions above:
public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
{
    str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
    return keepSpace ? str : str.RemoveSpace();
}
And a small (not exhaustive) unit test for it, which passes:
[TestMethod()]
public void LatinizeAndConvertToASCIITest()
{
    string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
    string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
    string actual = europeanStr.LatinizeAndConvertToASCII();
    Assert.AreEqual(expected, actual);
}
#13
1
Try the HelperSharp package.
It has a RemoveAccents method:
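public static string RemoveAccents(this string source)
{
    // 8-bit characters
    byte[] b = Encoding.GetEncoding(1251).GetBytes(source);
    // 7-bit characters
    string t = Encoding.ASCII.GetString(b);
    // Replace anything that is not a letter, digit, '=', '-', '_' or '/' with a space.
    // The snippet as posted had "[^a-zA-Z0-9]=-_/", which puts "=-_/" outside the
    // character class; moving it inside is presumably what was intended.
    Regex re = new Regex(@"[^a-zA-Z0-9=\-_/]");
    string c = re.Replace(t, " ");
    return c;
}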
#14
1
Someone said:
Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));
It actually splits the likes of å, which is one character (character code 00E5, not 0061 plus the modifier 030A, which would look the same), into a plus some kind of modifier, and then the ASCII conversion removes the modifier, leaving only the a.
#15
0
Imports System.Text
Imports System.Globalization

Public Function DECODE(ByVal x As String) As String
    Dim sb As New StringBuilder
    For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
        sb.Append(c)
    Next
    Return sb.ToString()
End Function
#16
0
I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:
public static class StringExtensions
{
    public static string RemoveDiacritics(this string text)
    {
        const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }
        return Encoding.ASCII.GetString(
            Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
    }
}
#17
0
Popping this library here in case you haven't already considered it. It looks like it comes with a full range of unit tests.
https://github.com/thomasgalliker/Diacritics.NET