Can I mix character classes in Python RegEx?

Time: 2022-06-27 18:17:19

Special sequences (character classes) in Python RegEx are escapes like \w or \d that match a set of characters.

In my case, I need to be able to match all alpha-numerical characters except numbers.

That is, \w minus \d.

I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".

One would think I could use this expression: [\w^\d] but it doesn't seem to match anything and I'm not sure why.

So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?

EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.

4 Solutions

#1


13  

You can use r"[^\W\d]", i.e. negate the union of non-word characters and digits.

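As a minimal sketch of that double negation (the sample string is made up for illustration), the set keeps letters, including non-ASCII ones, and drops digits. Note that the underscore still matches, since it is part of \w:

```python
import re

# Hypothetical sample text; any Unicode string would do.
text = "Æble 123, Øl (42) foo_bar"

# [^\W\d] = not (non-word character or digit) = word characters minus digits.
# On Python 3, str patterns are Unicode-aware by default, so Æ and Ø count as \w.
tokens = re.findall(r"[^\W\d]+", text)
print(tokens)  # ['Æble', 'Øl', 'foo_bar']
```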

#2


5  

You cannot subtract character classes, no.

Your best bet is to use the third-party regex module (installable from PyPI), which was written as an eventual replacement for the current re module in Python. It supports character classes based on Unicode properties:

\p{IsAlphabetic}

This will match any character that the Unicode specification states is an alphabetic character.

Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator (set operations need the module's version 1 behaviour, enabled with the V1 flag or an inline (?V1)):

[\w--\d]

matches everything in \w except anything that also matches \d.

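A hedged sketch of the set difference in use. It assumes the third-party regex package is installed; the fallback to the standard re module is my own addition so the snippet still runs without it, and the sample string is invented:

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # fall back to the standard library below

# Hypothetical sample text.
text = "Æble 123 Øl_42"

if regex is not None:
    # (?V1) switches on version 1 behaviour, where -- means set difference.
    words = regex.findall(r"(?V1)[\w--\d]+", text)
else:
    # Standard-library equivalent: not (non-word character or digit).
    words = re.findall(r"[^\W\d]+", text)

print(words)  # ['Æble', 'Øl_']
```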

#3


2  

You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]' to match a word character, excluding digits. For example:

>>> import re
>>> re.search(r'(?!\d)[\w]', '12bac')
<re.Match object; span=(2, 3), match='b'>
>>> _.group(0)
'b'

To exclude more than one group, you can use the usual [...] syntax in the lookahead assertion, for example r'(?![0-5])[\w]' would match any alphanumeric character except for digits 0-5.

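The [0-5] variant can be sketched like this; the input string is hypothetical and only there to show which characters survive the lookahead:

```python
import re

# Hypothetical input mixing letters (including non-ASCII) and digits.
text = "a0b5c6Æ9"

# (?![0-5]) rejects the position if the next character is a digit 0-5;
# \w then consumes one word character that passed the check.
kept = re.findall(r"(?![0-5])\w", text)
print(kept)  # ['a', 'b', 'c', '6', 'Æ', '9']
```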

As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:

>>> import re
>>> re.search(r'((?!\d)[\w])+', '12bac15')
<re.Match object; span=(2, 5), match='bac'>
>>> _.group(0)
'bac'

#4


1  

I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.

Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched." Anywhere else in the set, ^ is a literal caret, so your set [\w^\d] matches a single character that is a word character, a caret, or a digit. Since \d is already covered by \w, that is the same as [\w^].

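To make the position of the caret concrete, here is a small sketch comparing the three sets discussed in this thread on a made-up string:

```python
import re

# Hypothetical sample: a letter, a digit, a caret and some punctuation.
text = "a1^(,)"

# Leading ^ negates the whole set: anything that is not a word character.
negated = re.findall(r"[^\w]", text)

# A ^ elsewhere is a literal caret: word character, caret, or digit.
lower = re.findall(r"[\w^\d]", text)

# The uppercase \W variant: non-word character, caret, or digit.
upper = re.findall(r"[\W^\d]", text)

print(negated)  # ['^', '(', ',', ')']
print(lower)    # ['a', '1', '^']
print(upper)    # ['1', '^', '(', ',', ')']
```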

I would do it in two steps, effectively combining the regular expressions. First match by non-digits (inner regex), then match by alpha-numerical characters:

re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)

or variations on this theme.

Note that you would need to surround this with a try: except: block, as it will raise AttributeError: 'NoneType' object has no attribute 'group' if either of the two regexes fails to match. But you can, of course, split this single line up into a few more lines.

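One possible way to split the line up, as a sketch: the helper name first_nondigit_word is hypothetical, and it checks each match for None instead of catching the AttributeError:

```python
import re

def first_nondigit_word(s):
    """Hypothetical helper: the two-step search, returning None on no match."""
    outer = re.search(r"[^\d]+", s)  # first run of non-digit characters
    if outer is None:
        return None
    inner = re.search(r"\w+", outer.group(0))  # first word run inside that
    return inner.group(0) if inner else None

print(first_nondigit_word("12bac15"))  # bac
print(first_nondigit_word("12345"))    # None
```

Testing for None keeps unrelated AttributeErrors from being swallowed, which a bare try: except: around the one-liner would do.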
