Python regex - dashes matched as digits when using re.IGNORECASE

Time: 2020-12-15 19:29:38

I'm using a regular expression for stripping non-numeric characters from 13-digit ISBNs, and I came across some weird behavior that I'd like to understand. I tested this with Python 2.7.5 and 3.3.2:

import re
re.sub(r"\D", '', '978-1-936978-09-0')

This gives 9781936978090, correctly. But I noticed that if I do this...

re.sub(r"\D", '', '978-1-936978-09-0', re.IGNORECASE)

...with re.IGNORECASE, it leaves in the last two dashes, giving this: 9781936978-09-0.

Not that it should matter, but I double-checked that all four dashes are the exact same character (just a normal dash). I tried some variants (like [^\d] instead of \D, or [^0-9]) and got the same weird result.

This isn't urgent for me, since ignoring case doesn't matter for this, but I'd like to know what's going on. Any ideas?

3 answers

#1


5  

The fourth parameter of re.sub is not flags but count, the maximum number of replacements. Specify flags using the flags keyword argument.

re.sub(pattern, repl, string, count=0, flags=0)
#                             ^^^^^^^

>>> re.sub(r"\D", '', '978-1-936978-09-0', flags=re.IGNORECASE)
'9781936978090'
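
To see why exactly two dashes survive, it may help to look at the integer value behind the flag. The sketch below is illustrative only (the variable names are made up for this example): it assumes re.IGNORECASE converts to the integer 2, so passing it in the count position behaves like count=2.

```python
import re

isbn = "978-1-936978-09-0"

# Flag constants behave like integers; IGNORECASE has the value 2.
ignorecase_as_int = int(re.IGNORECASE)

# Passing re.IGNORECASE as the fourth positional argument is therefore
# like asking for count=2: only the first two non-digits are removed.
limited = re.sub(r"\D", "", isbn, count=ignorecase_as_int)

# Passing it by keyword applies it as a flag, so every non-digit is removed.
cleaned = re.sub(r"\D", "", isbn, flags=re.IGNORECASE)

print(ignorecase_as_int, limited, cleaned)
```

With four dashes in the input and a limit of two replacements, the last two dashes are left in place, which matches the output in the question.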

#2


2  

Your re.IGNORECASE is not being used as a flag in re.sub: the fourth positional parameter is count, so the value is taken as the count instead. Because re.IGNORECASE has the integer value 2, only the first two replacements are made.

To avoid this, pass the flag by keyword (flags=...) and it will work.
Example:

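import re

# Positional: re.IGNORECASE lands in the count parameter
re.sub(r"\D", '', '978-1-936978-09-0', re.IGNORECASE)
# '9781936978-09-0'

# Keyword: re.IGNORECASE is applied as a flag
re.sub(r"\D", '', '978-1-936978-09-0', flags=re.IGNORECASE)
# '9781936978090'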

Tested in Python 3.2.0 and 3.3.0.

#3


1  

You are passing the flag in the wrong position: the signature is re.sub(pattern, repl, string, count=0, flags=0), so the fourth argument is count. It is very easy to fall into this trap if you test your regular expression with re.search, which takes flags as its third positional argument, before using it for replacements.
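
A small sketch of that difference, with illustrative variable names: re.search accepts flags positionally, while one way to sidestep the argument-order question for substitutions is to attach the flags to a compiled pattern. Compiling is only one option here; passing flags= by keyword to re.sub works just as well.

```python
import re

isbn = "978-1-936978-09-0"

# re.search(pattern, string, flags=0): the third positional slot is flags,
# so this call does what it looks like it does.
match = re.search(r"\D", isbn, re.IGNORECASE)
first_non_digit = match.group()

# re.compile(pattern, flags=0): the flags live on the compiled pattern,
# and its sub() method takes only repl, string and an optional count.
non_digit = re.compile(r"\D", re.IGNORECASE)
digits_only = non_digit.sub("", isbn)

print(first_non_digit, digits_only)
```

Carrying the same positional habit over from re.search to re.sub is what puts the flag into the count slot.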
