Why does re.sub('.*?', '-', 'abc') return '-a-b-c-' instead of '-------'?

These are the results from Python 2.7.

>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'

I expected the result to be as follows.

>>> re.sub('.*?', '-', 'abc')
'-------'

But it's not. Why?

4 Solutions

#1

The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).

Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?

Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.

Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'

# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'

(?V0) or (?V1) sets the version flag in the regex. The Version 1 result for .*? is what you expected, and is supposedly what PCRE does. Python's re is somewhat nonstandard here, and is probably kept that way solely for backwards compatibility. I've found an example of something similar (with re.split).

#2

For your new, edited question:

The .*? can match any number of characters, including zero. So what it does is it matches zero characters at every position in the string: before the "a", between the "a" and "b", etc. It replaces each of those zero-width matches with a hyphen, giving the result you see.
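
The positions described above can be checked directly. This is a minimal sketch, not part of the original answer: it tries the pattern once at each index with Pattern.match(string, pos), and the names lazy, text, spans and rebuilt are only illustrative.

```python
import re

lazy = re.compile(r'.*?')
text = 'abc'

# Try the pattern once at every position, including the end of the string.
# Pattern.match(string, pos) anchors the attempt at index pos.
spans = [lazy.match(text, pos).span() for pos in range(len(text) + 1)]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]

# Putting a hyphen at each of those four positions and keeping the
# unmatched characters reproduces the result from the question.
rebuilt = ''.join('-' + ch for ch in text) + '-'
print(rebuilt)  # -a-b-c-
```

Every attempt succeeds with a zero-width span, which is why the hyphens end up between the characters rather than replacing them.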

The regex does not try to match each character one by one; it tries to match at each position in the string. Your regex allows it to match zero characters, so it matches zero characters at each position and moves on to the next. You seem to be thinking that in a string like "abc" there is one position before the "b", one position "inside" the "b", and one position after the "b", but there is no position "inside" an individual character. If it matches zero characters starting before "b", the next thing it tries is to match starting after "b". With the way re scans here, the regex cannot match seven times in a three-character string, because there are only four positions to start a match at.
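
The scanning rule described above can be sketched in plain Python. This is only an approximation of the Python 2.7 substitution loop as I understand it, not the actual implementation: legacy_sub is a hypothetical helper, it handles a literal replacement string only, and it assumes two rules (an empty match sitting exactly where the previous replacement ended is skipped, and after an empty match the search restarts one character further on). On Python 3.7 and later the built-in re.sub no longer behaves this way, so the loop is written out by hand.

```python
import re

def legacy_sub(pattern, repl, string):
    """Hypothetical helper: approximates the old re.sub loop (literal repl only)."""
    compiled = re.compile(pattern)
    pieces = []
    copied_upto = 0   # end of the last match that was replaced
    search_from = 0   # index where the next search starts
    count = 0         # number of replacements made so far
    while search_from <= len(string):
        m = compiled.search(string, search_from)
        if m is None:
            break
        start, end = m.span()
        # Copy the unmatched text between the previous match and this one.
        pieces.append(string[copied_upto:start])
        if not (start == end == copied_upto and count > 0):
            # Replace, unless this is an empty match glued to the previous one.
            pieces.append(repl)
            copied_upto = end
            count += 1
        # Assumed rule: after an empty match at the search start, step one
        # character forward; otherwise continue from the end of the match.
        search_from = end + 1 if end == search_from else end
    pieces.append(string[copied_upto:])
    return ''.join(pieces)

print(legacy_sub('.*?', '-', 'abc'))   # -a-b-c-
print(legacy_sub('.*', 'x', 'test'))   # x
```

With '.*?' every search succeeds immediately with an empty match, so the loop visits exactly the four positions in 'abc' and copies the skipped characters through unchanged.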

#3

Are you sure you interpreted re.sub's documentation correctly?

*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
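
The greedy versus non-greedy difference in that passage can be tried out with a short sketch. It uses the '<H1>title</H1>' string from the documentation; the variable names html, greedy and lazy are only illustrative.

```python
import re

html = '<H1>title</H1>'

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r'<.*>', html).group()

# Non-greedy: .*? grabs as little as it can, so the match stops at the first '>'.
lazy = re.match(r'<.*?>', html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```

Here the non-greedy form still has to consume 'H1' because a '>' must follow; with a bare .*? nothing follows, so the minimal match is empty.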

Adding a ? will turn the expression into a non-greedy one.

Greedy:

re.sub(".*", "-", "abc")

non-Greedy:

re.sub(".*?", "-", "abc")

Update: FWIW re.sub does exactly what it should:

>>> from re import sub
>>> sub(".*?", "-", "abc")
'-a-b-c-'
>>> sub(".*", "-", "abc")
'-'

See @BrenBarn's awesome answer on why you get -a-b-c- :)