What is the best way to remove accents from a Python Unicode string?

Date: 2022-12-22 21:41:08

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the Web an elegant way to do this in Java:

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

9 Answers

#1


253  

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
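
The example above is written for Python 2. On Python 3, assuming the Unidecode package is installed (for example via pip install Unidecode), a minimal equivalent sketch would be:

import unidecode

accented_string = 'Málaga'
unaccented_string = unidecode.unidecode(accented_string)
print(unaccented_string)  # prints 'Malaga'; the result is a plain str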

#2


219  

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>> 

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

#3


115  

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
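
To illustrate, a quick sketch (Python 3) of how the ASCII-encoding step loses Greek text entirely:

import unicodedata

nfkd_form = unicodedata.normalize('NFKD', 'Ωραίο')
print(nfkd_form.encode('ASCII', 'ignore'))  # b'' -- every Greek letter is dropped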

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns a non-zero (truthy) value if the character c can be combined with the preceding character, which is mainly the case when it is a diacritic.

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café"  # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)

#4


15  

Actually, I work on a project compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function that works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

#5


12  

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(unicode(char))
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
    return ud.lookup(desc)
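
A quick check of the idea (on Python 2, where unicode() exists; on Python 3 you would pass the character to ud.name() directly):

>>> rmdiacritics(u'ø')
u'o'
>>> rmdiacritics(u'é')
u'e'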

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed.

There are still special letters that are not handled by this, such as turned and inverted letters, since their unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping for achieving dictionary sort order.

#6


11  

In response to @MiniQuark's answer:

I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

#7


8  

import unicodedata
s = 'Émission'
search_string = ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))

For Python 3.X

print (search_string)

For Python 2.X

print search_string
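
Either way, search_string ends up as 'Emission' for the example above.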

#8


3  

gensim.utils.deaccent(text) from Gensim - topic modelling for humans:

deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek") 'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solutions based on unicodedata typically remove accents only from some characters (e.g. they turn 'ł' into '', rather than into 'l').
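
A quick sketch of that limitation, contrasting the unicodedata approach with unidecode (assuming the unidecode package is available):

import unicodedata
import unidecode

nfkd = unicodedata.normalize('NFKD', 'ł')
print(nfkd.encode('ascii', 'ignore'))  # b'' -- 'ł' has no decomposition, so it is dropped entirely
print(unidecode.unidecode('ł'))        # 'l'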

#9


1  

Some languages use combining diacritics as part of their letters, and separate accent diacritics to mark stress.

I think it is safer to specify explicitly which diacritics you want to strip:

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
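
A brief usage sketch (the cedilla in 'ç' is kept because COMBINING CEDILLA is not in the default accents tuple):

>>> strip_accents('café')
'cafe'
>>> strip_accents('français')
'français'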
