findall() does not work as expected

I have the code:

import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)

This returns ['aa']

If we have

import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)

we get ['aa','aa']

Why is there a difference and why (for the first) do we not get ['aa','bb','aa']?

Thanks!

4 solutions

#1

Let me explain what you are doing:

regex = re.compile("(aa|bb)+")

You are creating a regex that looks for aa or bb, then checks whether more aa or bb follow, and keeps going until it finds neither. All of that counts as one match. Because the capturing group holds only a single aa or bb, you get just the last repetition it captured.
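A minimal sketch of the point above, using the pattern and string from the question (the variable names are only illustrative): the whole match covers the full string, while the group keeps just its final repetition.

```python
import re

# Illustrative names; pattern and input are the ones from the question.
m = re.search(r"(aa|bb)+", "aabbaa")

whole = m.group(0)  # text consumed by the whole pattern
last = m.group(1)   # what the group captured on its final repetition
result = re.findall(r"(aa|bb)+", "aabbaa")

print(whole)   # aabbaa
print(last)    # aa
print(result)  # ['aa']
```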

However, if you have a string like aaxaabbxaa, you will get aa, bb, aa. The regex first finds aa, looks for more and finds only an x, so that is the first match. Then it finds another aa followed by bb, then an x, so it stops; the group's last capture for this second match is bb. Then it finds another aa. So the final result is aa, bb, aa.
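The walk-through for aaxaabbxaa can be checked with finditer, which exposes both the whole match and the group for each match. This is only a sketch; the names pattern, text and pairs are made up for the example.

```python
import re

pattern = re.compile(r"(aa|bb)+")
text = "aaxaabbxaa"

# One (whole match, last captured repetition) pair per match.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]

print(pairs)
# [('aa', 'aa'), ('aabb', 'bb'), ('aa', 'aa')]
```

findall would return only the second element of each pair, which is why the middle item is bb rather than aabb.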

I hope this explains what you are doing; the result is as expected. To get every aa or bb on its own, remove the +, which tells the regex to consume repeated groups before returning a match, and let the regex return each occurrence of aa or bb.

so your regex should be:

regex = re.compile("(aa|bb)")

cheers.

#2

The unwanted behaviour comes down to the way you formulated the regular expression:

rexp=re.compile("(aa|bb)+")

The parentheses in (aa|bb) form a capturing group.

And if we look at the docs of findall we will see this:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

Because you formed a group, it matched first aa, then bb, then aa again (because of the + quantifier), so the group holds aa at the end. findall returns that value in the list ['aa']: the whole expression matches only once (aabbaa), so the list contains a single element, the aa saved in the group.
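The same reasoning seems to account for the (aa|cc)+ case from the question. A small sketch (names are illustrative) that lists the whole matches for both patterns: with cc in place of bb, the bb in the middle cannot be consumed, so the string splits into two separate matches, each ending with the group holding aa.

```python
import re

text = "aabbaa"

# Whole matches (group 0), regardless of what the group captured.
runs_bb = [m.group(0) for m in re.finditer(r"(aa|bb)+", text)]
runs_cc = [m.group(0) for m in re.finditer(r"(aa|cc)+", text)]

print(runs_bb)  # ['aabbaa']   one match, so findall yields one item
print(runs_cc)  # ['aa', 'aa'] 'bb' breaks the run, so two matches
```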

From the code you gave, you seemed to want to do this:

>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']

(?: ...) doesn't create a capturing group, so findall returns the match of the whole expression.
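To see how the number of capturing groups changes what findall returns, here is a short side-by-side sketch on the same input. The third pattern, with an extra outer group, is not from the question; it is added only to show the tuple case the docs mention.

```python
import re

text = "aabbaa"

# No capturing group: findall returns the whole matches.
no_group = re.findall(r"(?:aa|bb)+", text)
# One capturing group: findall returns that group's final capture.
one_group = re.findall(r"(aa|bb)+", text)
# Two capturing groups (hypothetical variant): one tuple per match.
two_groups = re.findall(r"((aa|bb)+)", text)

print(no_group)    # ['aabbaa']
print(one_group)   # ['aa']
print(two_groups)  # [('aabbaa', 'aa')]
```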

At the end of your question you show the desired output. You get it by just looking for aa or bb; no quantifiers (+ or *) are needed. Do it the way Inbar Rose's answer does:

>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']

#3

Your pattern

rexp=re.compile("(aa|bb)+")

matches the whole string aabbaa. To clarify, just look at this:

>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(0)
'aabbaa'

There is no other substring left to match, and group 1 holds only the last repetition:

>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(1)
'aa'

So findall returns only that one substring:

>>> re.findall(re.compile("(aa|bb)+"),"aabbaa")
['aa']
>>>

#4

I do not understand why you use +. It means one or more occurrences (it is ? that means 0 or 1 and is used when a substring is optional), so here it merges the repeated pairs into a single match. Without it:

>>> re.findall(r'(aa|bb)', 'aabbaa')
['aa', 'bb', 'aa']

This works as expected.
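For reference, a quick sketch of what the quantifiers do on this input: + means one or more repetitions and ? means zero or one. The fullmatch checks and variable names are only an illustration, not part of the original answer.

```python
import re

text = "aabbaa"

plus = re.fullmatch(r"(?:aa|bb)+", text)      # one or more repetitions
optional = re.fullmatch(r"(?:aa|bb)?", text)  # zero or one repetition
single = re.findall(r"aa|bb", text)           # no quantifier: one match per pair

print(plus is not None)      # True
print(optional is not None)  # False: three pairs is more than one
print(single)                # ['aa', 'bb', 'aa']
```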
