Why does re.sub('.*?', '-', 'abc') return '-a-b-c-' instead of '-------'?

Time: 2022-10-13 17:01:41

These are the results from Python 2.7.

>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'

I expected the result to be as follows.

>>> re.sub('.*?', '-', 'abc')
'-------'

But it's not. Why?

4 Solutions

#1


The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).

Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?

Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.

Examples:

>>> import regex  # third-party package from PyPI, not the standard library

# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'

# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'

(?VX) sets the version flag in the regex. The second example is what you expect, and is supposedly what PCRE does. Python's re is somewhat nonstandard, and is kept as it is probably solely due to backwards compatibility concerns. I've found an example of something similar (with re.split).

#2


For your new, edited question:

The .*? can match any number of characters, including zero. So what it does is it matches zero characters at every position in the string: before the "a", between the "a" and "b", etc. It replaces each of those zero-width matches with a hyphen, giving the result you see.

The regex does not try to match each character one by one; it tries to match at each position in the string. Your regex allows it to match zero characters. So it matches zero at each position and moves on to the next. You seem to be thinking that in a string like "abc" there is one position before the "b", one position "inside" the "b", and one position after "b", but there isn't a position "inside" an individual character. If it matches zero characters starting before "b", the next thing it tries is to match starting after "b". There's no way you can get a regex to match seven times in a three-character string, because there are only four positions to match at.
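
The "four positions" argument can be checked directly. The following is a minimal sketch, not part of the original answer: it anchors a match attempt at each index with `match(text, pos)` instead of relying on how `re.sub` iterates, because since Python 3.7 `re.sub` also replaces empty matches that are adjacent to a previous non-empty match, so a current interpreter may not reproduce the Python 2.7 output from the question. The variable names are illustrative only.

```python
import re

pattern = re.compile(".*?")
text = "abc"

# Anchor one match attempt at every position: 0, 1, 2 and 3 (the end of the string).
# The lazy quantifier is satisfied with zero characters at each of them.
spans = [pattern.match(text, pos).span() for pos in range(len(text) + 1)]
print(spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]

# Replacing one zero-width match per position puts a hyphen in front of
# every character and one more at the very end of the string.
result = "".join("-" + ch for ch in text) + "-"
print(result)  # -a-b-c-
```

A three-character string has four such positions, which is where the four hyphens in '-a-b-c-' come from.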

#3


Are you sure you interpreted re.sub's documentation correctly?

*?, +?, ?? The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
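
The example in the quoted documentation can be reproduced with a short sketch. The variable names are illustrative, and `re.match` is used so that only the first match at the start of the string is involved:

```python
import re

html = "<H1>title</H1>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the first '>'.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```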

Adding a ? will turn the expression into a non-greedy one.

Greedy:

re.sub(".*", "-", "abc")

Non-greedy:

re.sub(".*?", "-", "abc")

Update: FWIW re.sub does exactly what it should:

>>> from re import sub
>>> sub(".*?", "-", "abc")
'-a-b-c-'
>>> sub(".*", "-", "abc")
'-'

See @BrenBarn's awesome answer on why you get -a-b-c- :)

Here's a visual representation of what's going on:

[Image: Debuggex diagram of the pattern .*? (Debuggex Demo)]

#4


To elaborate on Veedrac's answer, different implementations treat zero-width matches differently in FindAll (or ReplaceAll) operations. Two behaviors can be observed among implementations, and Python's re simply follows the first one.

1. Always bump along by one character on zero-width match

In Java and JavaScript, a zero-width match causes the index to bump along by one character, since staying at the same index would cause an infinite loop in FindAll or ReplaceAll operations.

As a result, the output of a FindAll operation in such an implementation can contain at most one match starting at a particular index.

The default Python re package probably also follows the same approach (and this also seems to be the case for Ruby).
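
The bump-along strategy can be sketched in a few lines. This is an illustrative emulation, not the actual source of any engine: `legacy_sub` is a hypothetical helper, it only supports a plain replacement string (no backreferences), and it additionally ignores an empty match that directly follows the previous match, an assumption added here so that the sketch reproduces the Python 2.7 outputs shown in this thread.

```python
import re

def legacy_sub(pattern, repl, text):
    """Hypothetical emulation of bump-along replacement (Python 2.7-style)."""
    regex = re.compile(pattern)
    pieces = []
    copied = 0  # end of the input that has already been copied to the output
    pos = 0     # index where the next search starts
    count = 0   # number of replacements made so far
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        start, end = m.span()
        if start == end == copied and count > 0:
            # Assumption: an empty match right after the previous match is ignored.
            pos = end + 1
            continue
        pieces.append(text[copied:start])  # literal text before the match
        pieces.append(repl)
        copied = end
        count += 1
        # Zero-width match: bump along by one character to avoid looping forever.
        pos = end + 1 if start == end else end
    pieces.append(text[copied:])
    return "".join(pieces)

print(legacy_sub(".*?", "-", "abc"))  # -a-b-c-
print(legacy_sub(".*", "-", "abc"))   # -
```

Because the index always moves past a zero-width match, each index contributes at most one match, which is why the lazy pattern never gets a chance to consume the characters themselves.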

2. Disallow zero-width match on next match at same index

In PHP, which provides a wrapper over the PCRE library, a zero-width match does not cause the index to bump along immediately. Instead, it sets a flag (PCRE_NOTEMPTY) requiring the next match (which starts at the same index) to be a non-zero-width match. If that match succeeds, the index bumps along by the length of the match (non-zero); otherwise, it bumps along by one character.

By the way, the PCRE library does not provide a built-in FindAll or ReplaceAll operation; it is actually provided by the PHP wrapper.

As a result, the output of a FindAll operation in such an implementation can contain up to two matches starting at the same index.

The Python regex package probably follows this line of implementation.

This line of implementation is more complex, since it requires the implementation of FindAll or ReplaceAll to keep extra state recording whether a zero-width match is disallowed. Developers also need to keep track of this extra flag when they use the low-level matching API.
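
The extra state described above can be sketched as follows. This is a rough emulation under stated assumptions, not PHP's or PCRE's actual code: `retry_sub` and `nonempty_match` are hypothetical helpers, the replacement is a plain string, and the standard re module has no not-empty flag, so the non-empty retry is approximated with `fullmatch` over candidate end positions (shortest first for a lazy pattern, longest first otherwise). That approximation is only adequate for simple patterns such as `.*` and `.*?`.

```python
import re

def nonempty_match(regex, text, pos, lazy):
    """Hypothetical stand-in for a not-empty flag: non-empty match anchored at pos."""
    ends = range(pos + 1, len(text) + 1)
    if not lazy:
        ends = reversed(ends)  # a greedy pattern prefers the longest candidate
    for end in ends:
        m = regex.fullmatch(text, pos, end)
        if m is not None:
            return m
    return None

def retry_sub(pattern, repl, text, lazy=False):
    """Hypothetical emulation of 'retry non-empty at the same index' replacement."""
    regex = re.compile(pattern)
    pieces = []
    copied = 0         # end of the input already copied to the output
    pos = 0            # index where the next match attempt starts
    not_empty = False  # extra state: must the next match be non-empty?
    while pos <= len(text):
        if not_empty:
            m = nonempty_match(regex, text, pos, lazy)
            if m is None:
                # The retry failed: only now bump along by one character.
                pos += 1
                not_empty = False
                continue
        else:
            m = regex.search(text, pos)
            if m is None:
                break
        start, end = m.span()
        pieces.append(text[copied:start])
        pieces.append(repl)
        copied = end
        pos = end
        not_empty = (start == end)  # after a zero-width match, retry at the same index
    pieces.append(text[copied:])
    return "".join(pieces)

print(retry_sub(".*?", "-", "abc", lazy=True))  # -------
print(retry_sub(".*", "x", "test"))             # xx
```

Here an index can yield two matches, one zero-width and one non-empty, which gives the seven hyphens the question expected and matches the version 1 outputs quoted in answer #1.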
