Valid characters in a Python class name

I'm dynamically creating Python classes, and I know that not all characters are valid in a class name.

Is there a method somewhere in the class library that I can use to sanitize an arbitrary text string so that I can use it as a class name? Either that or a list of the allowed characters would be a great help.

Addition regarding clashes with identifier names: As @Ignacio pointed out in the answer below, any character that is valid in an identifier is valid in a class name. You can even use a reserved word as a class name without any trouble. But there's a catch: if you do use a reserved word, you won't be able to make the class accessible like other (non-dynamically-created) classes (e.g., by doing globals()[my_class.__name__] = my_class). The reserved word will always take precedence in that case.
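
To illustrate the catch with reserved words, here is a minimal sketch, assuming Python 3. The helper name is_usable_class_name is hypothetical and not part of the standard library; str.isidentifier() and keyword.iskeyword() are the standard-library calls it relies on.

```python
import keyword

# Hypothetical helper: True only if the name is a syntactically valid
# identifier AND not a reserved word.
def is_usable_class_name(name):
    return name.isidentifier() and not keyword.iskeyword(name)

# type() does not validate the name, so a reserved word is accepted...
Dynamic = type("class", (object,), {})
print(Dynamic.__name__)

# ...but that name can never be written in source code, so it is worth
# checking before registering the class in globals().
for candidate in ("MyClass", "class", "2fast"):
    print(candidate, is_usable_class_name(candidate))
```

Checking the name up front avoids ending up with a class that exists but can only be reached through globals() or getattr().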

4 solutions

#1

Python Language Reference, §2.3, "Identifiers and keywords"

Identifiers (also referred to as names) are described by the following lexical definitions:

identifier ::=  (letter|"_") (letter | digit | "_")*
letter     ::=  lowercase | uppercase
lowercase  ::=  "a"..."z"
uppercase  ::=  "A"..."Z"
digit      ::=  "0"..."9"
Identifiers are unlimited in length. Case is significant.

#2

As per Python Language Reference, §2.3, "Identifiers and keywords", a valid Python identifier is defined as:

(letter|"_") (letter | digit | "_")*

Or, in regex:

[a-zA-Z_][a-zA-Z0-9_]*
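
As a rough sketch of how that pattern might be applied (assuming Python 3.4 or later for re.fullmatch; the names IDENTIFIER_RE and is_ascii_identifier are made up for illustration):

```python
import re

# ASCII-only identifier pattern, taken from the grammar quoted above.
IDENTIFIER_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_ascii_identifier(text):
    # fullmatch() requires the whole string to match, not just a prefix.
    return IDENTIFIER_RE.fullmatch(text) is not None

for candidate in ("_foo1", "1foo", "foo bar", ""):
    print(repr(candidate), is_ascii_identifier(candidate))
```

Note that this grammar covers ASCII only. Python 3 also accepts many non-ASCII letters in identifiers, so the pattern is stricter than what the Python 3 interpreter allows.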

#3

The thing that makes this interesting is that the first character of an identifier is special. After the first character, numbers '0' through '9' are valid for identifiers, but they must not be the first character.

Here's a function that will return a valid identifier given any random string of characters. Here's how it works:

First, we use itr = iter(seq) to get an explicit iterator on the input. Then there is a first loop, which uses the iterator itr to look at characters until it finds a valid first character for an identifier. Then it breaks out of that loop and runs the second loop, using the same iterator (which we named itr) for the second loop. The iterator itr keeps our place for us; the characters the first loop pulled out of the iterator are still gone when the second loop runs.

def gen_valid_identifier(seq):
    # get an iterator
    itr = iter(seq)
    # pull characters until we get a legal one for first in identifier
    for ch in itr:
        if ch == '_' or ch.isalpha():
            yield ch
            break
    # pull remaining characters and yield legal ones for identifier
    for ch in itr:
        if ch == '_' or ch.isalpha() or ch.isdigit():
            yield ch

def sanitize_identifier(name):
    return ''.join(gen_valid_identifier(name))

This is a clean and Pythonic way to process a sequence in two different ways. For a problem this simple, we could instead use a Boolean variable that records whether we have seen the first character yet:

def gen_valid_identifier(seq):
    saw_first_char = False
    for ch in seq:
        if not saw_first_char and (ch == '_' or ch.isalpha()):
            saw_first_char = True
            yield ch
        elif saw_first_char and (ch == '_' or ch.isalpha() or ch.isdigit()):
            yield ch

I don't like this version nearly as much as the first version. The special handling for one character is now tangled up in the whole flow of control, and this will be slower than the first version as it has to keep checking the value of saw_first_char constantly. But this is the way you would have to handle the flow of control in most languages! Python's explicit iterator is a nifty feature, and I think it makes this code a lot better.

Looping on an explicit iterator is just as fast as letting Python implicitly get an iterator for you, and the explicit iterator lets us split up the loops that handle the different rules for different parts of the identifier. So the explicit iterator gives us cleaner code that also runs faster. Win/win.
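
For reference, here is a short usage sketch of the explicit-iterator version, repeated so that the block runs on its own. The wrapper safe_class_name and its default value are hypothetical additions: they cover the case where no usable character survives and the result is an empty string.

```python
def gen_valid_identifier(seq):
    itr = iter(seq)
    # skip characters until one is legal as the first character
    for ch in itr:
        if ch == '_' or ch.isalpha():
            yield ch
            break
    # keep the remaining characters that are legal anywhere else
    for ch in itr:
        if ch == '_' or ch.isalpha() or ch.isdigit():
            yield ch

def sanitize_identifier(name):
    return ''.join(gen_valid_identifier(name))

# Hypothetical wrapper: fall back to a default when nothing usable is left.
def safe_class_name(text, default="Unnamed"):
    cleaned = sanitize_identifier(text)
    return cleaned if cleaned else default

print(sanitize_identifier("123 my-class name!"))  # myclassname
print(sanitize_identifier("9lives"))              # lives
print(safe_class_name("1234"))                    # Unnamed
```

Reserved words pass through unchanged (sanitize_identifier("class") returns "class"), so a separate keyword check is still needed if that matters.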

#4

This is an old question by now, but I'd like to add an answer on how to do this in Python 3 as I've made an implementation.

The allowed characters are documented here: https://docs.python.org/3/reference/lexical_analysis.html#identifiers . They include far more than ASCII letters and digits: the underscore and other connector punctuation, combining marks, and letters and digits from a wide range of non-Latin scripts. Luckily the unicodedata module can help. Here's my implementation, which directly follows what the Python documentation says:

import unicodedata

def is_valid_name(name):
    # An empty string is not a valid identifier (and name[0] would raise IndexError).
    if not name or not _is_id_start(name[0]):
        return False
    for character in name[1:]:
        if not _is_id_continue(character):
            return False
    return True  # Every character is allowed.

_allowed_id_continue_categories = {"Ll", "Lm", "Lo", "Lt", "Lu", "Mc", "Mn", "Nd", "Nl", "Pc"}
_allowed_id_continue_characters = {"_", "\u00B7", "\u0387", "\u1369", "\u136A", "\u136B", "\u136C", "\u136D", "\u136E", "\u136F", "\u1370", "\u1371", "\u19DA", "\u2118", "\u212E", "\u309B", "\u309C"}
_allowed_id_start_categories = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nl"}
_allowed_id_start_characters = {"_", "\u2118", "\u212E", "\u309B", "\u309C"}

def _is_allowed(character, categories, characters):
    # Check the character itself first, then its NFKC-normalized form.
    if unicodedata.category(character) in categories or character in characters:
        return True
    normalized = unicodedata.normalize("NFKC", character)
    # NFKC can expand one character into several; unicodedata.category()
    # only accepts a single character, so reject those instead of raising.
    if len(normalized) != 1:
        return False
    return unicodedata.category(normalized) in categories or normalized in characters

def _is_id_start(character):
    return _is_allowed(character, _allowed_id_start_categories, _allowed_id_start_characters)

def _is_id_continue(character):
    return _is_allowed(character, _allowed_id_continue_categories, _allowed_id_continue_characters)

This code is adapted from here under CC0: https://github.com/Ghostkeeper/Luna/blob/d69624cd0dd5648aec2139054fae4d45b634da7e/plugins/data/enumerated/enumerated_type.py#L91 . It has been well tested.
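
To see why particular characters are accepted or rejected, a small sketch (assuming Python 3) can print each character's Unicode general category next to the verdict of the built-in str.isidentifier(), which is a convenient cross-check for any hand-written validator. The sample characters are arbitrary choices for illustration.

```python
import unicodedata

# Arbitrary sample characters: ASCII letter, underscore, digit,
# accented Latin letter (U+00E9), CJK ideograph (U+540D), hyphen, space.
samples = ["a", "_", "9", "\u00e9", "\u540d", "-", " "]

for ch in samples:
    category = unicodedata.category(ch)
    # Prefix with "x" so the character is tested in a non-initial position.
    allowed = ("x" + ch).isidentifier()
    print(repr(ch), category, allowed)
```

Characters in categories such as Ll, Lo, Nd, and Pc are accepted in a non-initial position, while the hyphen (Pd) and the space (Zs) are not.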
