Why does re.findall('(ab)+', 'abab') return ['ab'], while re.findall('(ab)+?', 'abab') returns ['ab', 'ab']?

Date: 2021-02-03 22:34:07

My Python version is 2.7.6.

I know that +? is the non-greedy version of +,
so re.findall('(ab)+?', 'abab') will match as few ab as it can.
The result ['ab', 'ab'] thus makes sense.

But the greedy version, re.findall('(ab)+', 'abab'), confuses me.
I thought the greedy version should match as many ab as it can,
so I should get ['abab'] as the result.
But I got ['ab'] instead!

The help text for re.findall() says:

Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.

Empty matches are included in the result.

Here I have two groups: the default group 0 for the whole RE, and my specified (ab) as group 1.

So I investigated as follows:

In [21]: ng = re.search('(ab)+?', 'abab')

In [22]: g = re.search('(ab)+', 'abab')

In [23]: ng.group(0)
Out[23]: 'ab'

In [24]: ng.group(1)
Out[24]: 'ab'

In [25]: g.group(0)
Out[25]: 'abab'

In [26]: g.group(1)
Out[26]: 'ab'

It is clear that, for the greedy search, the re module matches 'abab' as group 0 and 'ab' as group 1.
But why do I get ['ab'] instead of ['abab', 'ab'] from findall()?
Is it because 'abab' contains ab, so the two overlap, and findall() returns only the last match in that situation?

With this question in mind, I ran the following test:

In [30]: g = re.findall('[A-z](ab)+', 'ababdab')

In [31]: g
Out[31]: ['ab', 'ab']

In [32]: dg = re.search('[A-z](ab)+', 'ababdab')

In [33]: dg.groups()
Out[33]: ('ab',)

In [34]: dg.group()
Out[34]: 'bab'

Now I'm completely lost.
How does findall work here?
Why?

3 Answers

#1


2  

There's a subtlety here - touched on in Jerry's answer, but not stated clearly.

You expected re.findall('(ab)+', 'abab') to tell you about both the implicit "group 0" for what the entire regex matched, and "group 1" for the parentheses. That's not how it works. If there are capturing parentheses, findall's list only contains the groups for the capturing parentheses. Observe:

>>> re.findall('(?:ab)+', 'abab') # no capture, reports group 0
['abab']
>>> re.findall('(ab)+', 'abab')   # one capture, reports _only_ group 1
['ab']
>>> re.findall('((ab)+)', 'abab') # two captures, reports both groups 1 and 2
[('abab', 'ab')]                  # (but still not group 0)

The documentation could stand to be clearer about this. It assumes you understand that "group 0" doesn't really count as a group. But this is how RE libraries have worked for decades.
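
If you want the whole match and the capture side by side, one option is re.finditer, which yields match objects instead of the reduced list that findall builds. The snippet below is a minimal sketch using the two patterns from the question; the variable names greedy and lazy are only illustrative.

```python
import re

# Pair each whole match (group 0) with the capture (group 1).
greedy = [(m.group(0), m.group(1)) for m in re.finditer('(ab)+', 'abab')]
lazy = [(m.group(0), m.group(1)) for m in re.finditer('(ab)+?', 'abab')]

print(greedy)  # [('abab', 'ab')]
print(lazy)    # [('ab', 'ab'), ('ab', 'ab')]
```

The greedy pattern produces one match spanning 'abab', while the lazy one produces two separate matches of 'ab'. findall shows only the second column of these pairs, which is why the two results differ in length.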

#2


1  

findall is working exactly as it should:

  1. If there are no capture groups, it returns a list of all the matches in the string.

  2. If there is exactly one capture group, it returns a list of that group's captures only.

  3. If there is more than one capture group, it returns a list of tuples, one tuple per match, each containing that match's capture groups.
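
The three rules above can be sketched as a small function built on re.finditer. Note that findall_like is a hypothetical helper written for illustration, not part of the re module, and it assumes that unmatched groups should be reported as empty strings, as findall does.

```python
import re

def findall_like(pattern, string):
    """Hypothetical helper that mimics re.findall using re.finditer."""
    compiled = re.compile(pattern)
    results = []
    for m in compiled.finditer(string):
        if compiled.groups == 0:
            # Rule 1: no capture groups, report the whole match (group 0).
            results.append(m.group(0))
        elif compiled.groups == 1:
            # Rule 2: one capture group, report only that group.
            results.append(m.groups('')[0])
        else:
            # Rule 3: several capture groups, report a tuple of them.
            results.append(m.groups(''))
    return results

for pattern in ('(?:ab)+', '(ab)+', '((ab)+)'):
    print('%s -> %r' % (pattern, findall_like(pattern, 'abab')))
```

In none of the three branches is group 0 reported alongside the capture groups, which is the behavior that surprised the asker.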

Next, when a group is repeated, the MatchObject reports only the last capture of that group. This is mentioned in the docs:

If a group matches multiple times, only the last match is accessible:

>>>
>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

So the combination of these two behaviors gives the result you are seeing.
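
To make the "last capture only" behavior visible on the pattern from the question, the sketch below prints the spans of group 0 and group 1. Searching again inside the whole match is just one possible way to list every repetition, and the name repetitions is illustrative.

```python
import re

m = re.search('(ab)+', 'abab')

print(m.span(0))  # (0, 4): the whole match covers both repetitions
print(m.span(1))  # (2, 4): group 1 holds only the last repetition

# One way to list every repetition: search again inside the whole match.
repetitions = re.findall('ab', m.group(0))
print(repetitions)  # ['ab', 'ab']
```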

#3


-1  

Take a look:

In [13]: re.findall('(ab)', 'ababab')
Out[13]: ['ab', 'ab', 'ab']

In [14]: re.findall('(ab)+?', 'ababab')
Out[14]: ['ab', 'ab', 'ab']

In [15]: re.findall('(ab)+', 'ababab')
Out[15]: ['ab']

In [13] and In [14] give the same result: both patterns match each ab separately. In [15], however, matches the whole run of contiguous ab repetitions as one match, regardless of their number, so findall returns a single item.

The [A-z](ab)+ pattern matches one character in the range [A-z] followed by one or more contiguous repetitions of ab. The first match in ababdab is bab: it starts with b, which is in [A-z], followed by one ab. That match ends just before d, and d starts the next match, dab.

In [20]: re.findall('[A-z](ab)+', 'XababXabXab')
Out[20]: ['ab', 'ab', 'ab']
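
To see the whole matches behind the findall output for the asker's string, the following sketch lists group 0, its span, and group 1 for each match. It also shows a side effect of the [A-z] range that is easy to miss: in ASCII it covers a few punctuation characters between Z and a, so [A-Za-z] is probably what was intended if only letters should match. The variable names are illustrative.

```python
import re

matches = [(m.group(0), m.span(0), m.group(1))
           for m in re.finditer('[A-z](ab)+', 'ababdab')]
print(matches)  # [('bab', (1, 4), 'ab'), ('dab', (4, 7), 'ab')]

# [A-z] spans ASCII 65..122, which includes '[', '\\', ']', '^', '_' and '`'.
wide = re.findall('[A-z]', 'a_Z^')
letters_only = re.findall('[A-Za-z]', 'a_Z^')
print(wide)          # ['a', '_', 'Z', '^']
print(letters_only)  # ['a', 'Z']
```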
