How do I get str.translate to work with Unicode strings?

I have the following code:

import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    return to_translate.translate(translate_table)

Which works great for non-unicode strings:

>>> translate_non_alphanumerics('<foo>!')
'_foo__'

But fails for unicode strings:

>>> translate_non_alphanumerics(u'<foo>!')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in translate_non_alphanumerics
TypeError: character mapping must return integer, None or unicode

I can't make any sense of the paragraph on "Unicode objects" in the Python 2.6.2 docs for the str.translate() method.

How do I make this work for Unicode strings?

6 solutions

#1

The Unicode version of translate requires a mapping from Unicode ordinals (which you can retrieve for a single character with ord) to Unicode ordinals. If you want to delete characters, you map to None.

I changed your function to build a dict mapping the ordinal of every character to the ordinal of what you want to translate to:

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = dict((ord(char), translate_to) for char in not_letters_or_digits)
    return to_translate.translate(translate_table)

>>> translate_non_alphanumerics(u'<foo>!')
u'_foo__'

edit: It turns out that the translation mapping must map from the Unicode ordinal (via ord) to either another Unicode ordinal, a Unicode string, or None (to delete). I have thus changed the default value for translate_to to be a Unicode literal. For example:

>>> translate_non_alphanumerics(u'<foo>!', u'bad')
u'badfoobadbad'
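
As a minimal sketch of the three kinds of mapping values described above, here is a table that maps one character to an ordinal, one to a longer string, and one to None. It is written for Python 3, where every str is Unicode; that is an assumption on my part, since the question targets Python 2.

```python
# Python 3 sketch: keys are ordinals; values may be an ordinal, a string, or None.
table = {
    ord('<'): ord('['),   # ordinal -> ordinal: replace with a single character
    ord('>'): 'END',      # ordinal -> string: replace with several characters
    ord('!'): None,       # ordinal -> None: delete the character
}

result = '<foo>!'.translate(table)
print(result)  # [fooEND
```

Characters whose ordinals are not in the table pass through unchanged.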

#2

In this version you can map each letter to another one, position by position:

def trans(to_translate):
    tabin = u'привет'
    tabout = u'тевирп'
    tabin = [ord(char) for char in tabin]
    translate_table = dict(zip(tabin, tabout))
    return to_translate.translate(translate_table)
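
To make the effect concrete, here is the same function in Python 3 syntax (an assumption; the answer itself is Python 2) with two sample calls. Each character of tabin is replaced by the character at the same position in tabout, and everything else is left alone.

```python
def trans(to_translate):
    tabin = 'привет'
    tabout = 'тевирп'
    # Keys must be ordinals; values can stay as one-character strings.
    translate_table = dict(zip((ord(char) for char in tabin), tabout))
    return to_translate.translate(translate_table)

print(trans('привет'))    # тевирп
print(trans('вет, abc'))  # ирп, abc
```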

#3

I came up with the following combination of my original function and Mike's version that works with Unicode and ASCII strings:

def translate_non_alphanumerics(to_translate, translate_to=u'_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    if isinstance(to_translate, unicode):
        translate_table = dict((ord(char), unicode(translate_to))
                               for char in not_letters_or_digits)
    else:
        assert isinstance(to_translate, str)
        translate_table = string.maketrans(not_letters_or_digits,
                                           translate_to
                                              *len(not_letters_or_digits))
    return to_translate.translate(translate_table)

Update: "coerced" translate_to to unicode for the unicode translate_table. Thanks Mike.
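
For readers on Python 3, the same split shows up between str and bytes rather than unicode and str. The following is only a sketch under that assumption, not part of the original answer: the constant name NOT_LETTERS_OR_DIGITS is mine, and the bytes branch uses bytes.maketrans, which needs a one-character ASCII translate_to.

```python
NOT_LETTERS_OR_DIGITS = '!"#%\'()*+,-./:;<=>?@[\\]^_`{|}~'

def translate_non_alphanumerics(to_translate, translate_to='_'):
    if isinstance(to_translate, str):
        # str.translate takes a dict keyed by ordinal.
        table = {ord(char): translate_to for char in NOT_LETTERS_OR_DIGITS}
    else:
        # bytes.translate takes a 256-byte table; both arguments to
        # bytes.maketrans must have the same length, so translate_to
        # has to be exactly one ASCII character here.
        source = NOT_LETTERS_OR_DIGITS.encode('ascii')
        table = bytes.maketrans(source, translate_to.encode('ascii') * len(source))
    return to_translate.translate(table)

print(translate_non_alphanumerics('<foo>!'))   # _foo__
print(translate_non_alphanumerics(b'<foo>!'))  # b'_foo__'
```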

#4

For a simple hack that will work on both str and unicode objects, convert the translation table to unicode before running translate():

import string
def translate_non_alphanumerics(to_translate, translate_to='_'):
    not_letters_or_digits = u'!"#%\'()*+,-./:;<=>?@[\]^_`{|}~'
    translate_table = string.maketrans(not_letters_or_digits,
                                       translate_to
                                         *len(not_letters_or_digits))
    translate_table = translate_table.decode("latin-1")
    return to_translate.translate(translate_table)

The catch here is that translate() will implicitly convert every str object to unicode, so the result is always unicode, and it throws an error if to_translate is a str containing non-ASCII characters.

#5

Instead of having to specify all the characters that need to be replaced, you could also view it the other way around and, instead, specify only the valid characters, like so:

import re

def replace_non_alphanumerics(source, replacement_character='_'):
    result = re.sub("[^_a-zA-Z0-9]", replacement_character, source)

    return result

This works with unicode as well as regular strings, and preserves the type (provided replacement_character and source are of the same type).
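
One thing worth seeing in action: the whitelist above is ASCII-only, so accented letters are replaced too. The sketch below assumes Python 3 and adds a hypothetical variant, replace_non_word_characters, that uses \W, which is Unicode-aware for str patterns there.

```python
import re

def replace_non_alphanumerics(source, replacement_character='_'):
    # ASCII-only whitelist, as in the answer above.
    return re.sub(r"[^_a-zA-Z0-9]", replacement_character, source)

def replace_non_word_characters(source, replacement_character='_'):
    # Hypothetical variant: \W matches anything that is not a Unicode
    # letter, digit, or underscore when the pattern is a str (Python 3).
    return re.sub(r"\W", replacement_character, source)

text = "na\u00efve caf\u00e9!"  # "naïve café!"
print(replace_non_alphanumerics(text))    # na_ve_caf__
print(replace_non_word_characters(text))  # naïve_café_
```

Which of the two is right depends on whether non-ASCII letters count as valid characters for your use case.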

#6

In Python 2.7, with type str, you would write

import string
table = string.maketrans("123", "abc")
print "135".translate(table)

whereas with type unicode you would say

table = {ord(s): unicode(d) for s, d in zip("123", "abc")}
print u"135".translate(table)

In Python 3.6 you would write

table = {ord(s): d for s, d in zip("123", "abc")}
print("135".translate(table))

Maybe this is helpful.
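
As a small addition for Python 3 only: str.maketrans can build the ordinal-keyed dict for you, and its optional third argument lists characters to delete. A minimal sketch (the variable names are mine):

```python
# Two equal-length strings: each character maps to the one at the same position.
table = str.maketrans("123", "abc")
print(table)                   # {49: 97, 50: 98, 51: 99}
print("135".translate(table))  # ac5

# Third argument: characters to delete (mapped to None in the table).
table_with_delete = str.maketrans("123", "abc", "5")
print("135".translate(table_with_delete))  # ac
```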
