This question already has an answer here:
- PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string 7 answers
What is the most efficient way to remove accents from a string, e.g. ÈâuÑ becomes Eaun?
Is there a simple, built-in way that I'm missing, or a regular expression?
5 Solutions
#1
51
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
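A slightly more defensive sketch of the same call follows. The helper name to_ascii_iconv and its fallback parameter are illustrative assumptions, not part of the answer, and the iconv extension is assumed to be loaded. What //TRANSLIT produces depends on the iconv implementation behind PHP (glibc, libiconv, musl) and, with glibc, on the LC_CTYPE locale, so no expected output is shown.

```php
<?php
// Hypothetical helper (not from the original answer): wraps iconv() so that a
// failed conversion does not silently hand `false` to the caller.
function to_ascii_iconv(string $utf8, string $fallback = ''): string
{
    // iconv() returns false when the conversion fails; the @ suppresses the
    // diagnostic it raises in that case.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    return $ascii === false ? $fallback : $ascii;
}

echo to_ascii_iconv('ÈâuÑ'), PHP_EOL;
```

If the result contains question marks instead of letters, checking the current locale with setlocale(LC_CTYPE, 0) is a reasonable first step.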
#2
40
I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.
EDIT2: I discovered that there are some characters that are not covered by the transliteration I posted originally. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)
I assume that by Latin they mean the Latin alphabet here, and not one of the Latin character sets. But anyway, in my opinion Latin-ASCII should then transliterate it to something ASCII ...
EDIT3: Sorry for another change here. I had to lower the start of the removed range from \u0100 to \u0080 to get only ASCII characters as output. The test above is updated.
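When many strings are converted, the rule chain from this answer can be compiled once and reused. This is a minimal sketch that assumes the intl extension is loaded; the function name to_ascii_translit is made up for illustration, and the rule string is the one quoted in the answer.

```php
<?php
// Hypothetical helper: builds the transliterator on first use and reuses it
// on later calls. Requires the intl extension.
function to_ascii_translit(string $utf8): string
{
    static $translit = null;

    if ($translit === null) {
        // Same rule chain as in the answer: convert to Latin script, then to
        // ASCII, then remove whatever is still in U+0080..U+7FFF.
        $translit = Transliterator::create(
            'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove'
        );
        if ($translit === null) {
            throw new RuntimeException(intl_get_error_message());
        }
    }

    $result = $translit->transliterate($utf8);
    if ($result === false) {
        throw new RuntimeException($translit->getErrorMessage());
    }

    return $result;
}

echo to_ascii_translit('A æ Übérmensch på høyeste nivå!'), PHP_EOL;
```

The static variable keeps the compiled rules alive for the lifetime of the request, which avoids re-parsing the rule string on every call.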
#3
21
Reposting this at the request of @palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
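function toASCII( $str )
{
    // utf8_decode() converts UTF-8 to single-byte ISO-8859-1 so that strtr() can
    // map byte for byte. Characters outside ISO-8859-1 (e.g. Š, Œ) become "?".
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    return strtr(utf8_decode($str),
        utf8_decode(
            'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}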
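A table-driven variant can also be sketched without utf8_decode() by handing strtr() an array, which replaces whole UTF-8 byte sequences and allows multi-letter replacements such as Æ to AE. This is not the author's method (the answer explicitly dislikes big arrays); the function name to_ascii_map and the deliberately tiny map are illustrative assumptions, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical, intentionally incomplete map: extend it for the characters
// you actually expect in your input.
function to_ascii_map(string $utf8): string
{
    static $map = [
        'È' => 'E', 'É' => 'E', 'â' => 'a', 'é' => 'e', 'Ñ' => 'N', 'ñ' => 'n',
        'Ü' => 'U', 'ü' => 'u', 'Š' => 'S', 'š' => 's',
        // Multi-letter replacements are possible with the array form.
        'Æ' => 'AE', 'æ' => 'ae', 'Œ' => 'OE', 'œ' => 'oe', 'ß' => 'ss',
    ];

    // With an array, strtr() matches whole keys (longest first), so each
    // multi-byte UTF-8 sequence is replaced as a unit.
    return strtr($utf8, $map);
}

echo to_ascii_map('ÈâuÑ'), PHP_EOL; // prints "EauN"
```

Characters missing from the map pass through unchanged, so the output is only ASCII if the table covers the input.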
#4
13
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
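Spelled out as a function, the Normalizer variant might look like the sketch below. It assumes the intl extension is loaded, and the name strip_marks is hypothetical. Letters that have no decomposition, such as ø or ß, pass through unchanged, so the result is not guaranteed to be ASCII.

```php
<?php
// Hypothetical helper: decompose the string, then delete the combining marks.
// Requires the intl extension for the Normalizer class.
function strip_marks(string $utf8): string
{
    // NFKD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_KD);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Normalization failed.');
    }

    // \p{Mn} matches non-spacing marks; the /u flag makes PCRE treat the
    // subject as UTF-8. Fall back to the decomposed string if PCRE errors out.
    return preg_replace('/\p{Mn}/u', '', $decomposed) ?? $decomposed;
}

echo strip_marks('ÈâuÑ'), PHP_EOL; // prints "EauN"
```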
#5
12
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.