Python: re.search is not starting from the beginning of the string?

I'm working on a Flask API, which takes the following regex as an endpoint:

([0-9]*)((OK)|(BACK)|(X))*

That means I'm expecting a series of digits, followed by the keywords OK, BACK, or X repeated any number of times in succession.

I want to split this regex and do different things depending on which capture groups were present.

My approach was the following:

endp = endp.encode('ASCII', 'ignore')
match = re.search(r"([0-9]*)", str(endp), re.I)
if match:
    n = match.groups()
    logging.info('nums: ' + str(n[0]))

match = re.search(r"((OK)|(BACK)|(X))*", str(endp), re.I)
if match:
    s1 = match.groups()
    for i in s1:
        logging.info('str: ' + str(i[0]))

Using the /12OK endpoint, getting the numbers works just fine, but for some reason capturing the keywords fails. I tried reducing the second pattern to only:

match = re.search(r"(OK)*", str(endp), re.I)

I consistently get the following in s1 (using the reduced regex):

(None,)

and with the original pattern (all of the keywords):

(None, None, None, None)

I suppose this means the regex pattern does not match anything in my endp string (why are there 4 Nones? One for each keyword, but what is the 4th there for?). I checked my endpoint, and the regex against the same string, with a regex validator, and it seems fine to me. I understand that re.match only matches at the beginning of the string, so I used re.search instead, since the documentation says it matches anywhere in the string.

What am I missing here? Please advise, I'm a beginner in the Python world.

4 Answers

#1

Indeed, it is a bit surprising that searching with * returns None:

>>> re.search("(OK|BACK|X)*", u'/12OK').groups()
(None,)

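But it's "correct": * means zero or more repetitions, so the pattern can match the empty string. re.search returns the first position where the pattern matches, and at index 0 (the "/") it already matches with zero repetitions. The match is therefore empty and the group never captures anything, which is why you see None. Searching with + partly solves it: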

>>> re.search("(OK|BACK|X)+", u'/12OK').groups()
('OK',)
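
To make the difference visible, it can help to look at where each match starts and ends. The snippet below is a small sketch (the variable names are only illustrative) comparing the span of the * match with the span of the + match on the same string:

```python
import re

path = "/12OK"

star = re.search(r"(OK|BACK|X)*", path)
plus = re.search(r"(OK|BACK|X)+", path)

# The * version succeeds immediately at index 0 with an empty match,
# because "/" is not a keyword and zero repetitions are acceptable.
star_span = star.span()      # (0, 0)
star_groups = star.groups()  # (None,)

# The + version cannot match at index 0, so search keeps scanning
# until it reaches "OK" at index 3.
plus_span = plus.span()      # (3, 5)
plus_groups = plus.groups()  # ('OK',)
```

An empty span such as (0, 0) is a quick sign that a pattern matched "nothing" rather than failing outright.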

But now, searching /12OKOK with this pattern still yields only one 'OK' in groups(): the overall match is 'OKOK', but a repeated capturing group keeps only its last repetition. To find all occurrences you need to use re.findall:

>>> re.findall("(OK|BACK|X)", u'/12OKOK')
['OK', 'OK']

With those findings, your code would look as follows. Note that you don't need to write i[0], since i is already a string, unless you want to log only its first character:

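import logging
import re

# Python 2 idiom from the question; on Python 3, skip the encode and use endp directly.
endp = endp.encode('ASCII', 'ignore')

# + instead of *, so an empty match at the start of the string is not accepted
match = re.search(r"([0-9]+)", str(endp))
if match:
    n = match.groups()
    logging.info('nums: ' + str(n))

# findall returns every keyword occurrence as a list of strings
match = re.findall(r"(OK|BACK|X)", str(endp), re.I)
for i in match:
    logging.info('str: ' + str(i))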
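
As a variation on the code above, the digits and keywords could also be pulled out in one pass. This is only a sketch under some assumptions: Python 3.4+ (for fullmatch), an endpoint string that may start with "/", and a hypothetical helper name parse_endpoint that is not part of the original code.

```python
import re

# Assumed format: optional leading "/", digits, then any run of keywords.
ENDPOINT_RE = re.compile(r"/?([0-9]*)((?:OK|BACK|X)*)", re.IGNORECASE)
KEYWORD_RE = re.compile(r"OK|BACK|X", re.IGNORECASE)


def parse_endpoint(endp):
    """Return (digits, [keywords]), or None if endp does not fit the format."""
    m = ENDPOINT_RE.fullmatch(endp)
    if m is None:
        return None
    digits, tail = m.group(1), m.group(2)
    # The repeated part is captured as one string, then split with findall.
    return digits, [kw.upper() for kw in KEYWORD_RE.findall(tail)]


parsed = parse_endpoint("/12okBACKx")
rejected = parse_endpoint("/12OKfoo")
```

Here fullmatch rejects anything with trailing junk, while findall on the captured tail recovers each keyword in order.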

#2

If you want to match at least ONE of the groups, use + instead of *.

>>> endp = '/12OK'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('OK', 'OK', None, None)
>>> endp = '/12X'
>>> match = re.search(r"((OK)|(BACK)|(X))+", str(endp), re.I)
>>> match.groups()
('X', None, None, 'X')

Notice that your expression has 4 capturing groups, one for each pair of parentheses, which is why groups() returns four values. The first value belongs to the outer parentheses and the second to the first nested group, (OK). In the second example, the outer group still captures, and the last value comes from the third nested group, (X).
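
If counting parentheses to work out which keyword matched feels fragile, named groups are one alternative. The sketch below is an illustration rather than part of the original answer: the group names and the helper keyword_kinds are made up, and it relies on the lastgroup attribute of match objects.

```python
import re

# One named group per keyword; exactly one of them participates per match.
KEYWORD_RE = re.compile(r"(?P<ok>OK)|(?P<back>BACK)|(?P<x>X)", re.IGNORECASE)


def keyword_kinds(text):
    """Return the name of the group that matched for each keyword, in order."""
    return [m.lastgroup for m in KEYWORD_RE.finditer(text)]


kinds = keyword_kinds("/12OKXBACK")  # ['ok', 'x', 'back']
```

Each name can then be used to dispatch to different handling code without inspecting None entries in groups().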

#3

"((OK)|(BACK)|(X))*" will search for OK or BACK or X, 0 or more times. Note that the * means 0 or more, not more than 0. The above expression should have a + at the end not * as + means 1 or more.

#4

I think you're having two different issues, and their intersection is causing more confusion than either of them would cause on their own.

The first issue is that you're using repeated groups. Python's re library is not able to capture multiple matches when a group is repeated. Matching with a pattern like (X)+ against 'XXXX' will only capture a single 'X' in the first group even though the whole string will be matched. The regex library (which is not part of the standard library) can do multiple captures, though I'm not sure of the exact commands required.
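
The repeated-group behaviour can be checked with the standard re module alone. In this short sketch, the whole match covers every keyword, the group holds only the last repetition, and a second findall over the matched text recovers each one:

```python
import re

m = re.search(r"(OK|BACK|X)+", "/12OKBACKX")

whole = m.group(0)  # the full run of keywords: 'OKBACKX'
last = m.group(1)   # only the final repetition is kept: 'X'

# Workaround with the standard library: split the matched run afterwards.
each = re.findall(r"OK|BACK|X", whole)  # ['OK', 'BACK', 'X']
```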

The second issue is using the * repetition operator in your pattern. The pattern you show at the top of the question will match an empty string. Obviously, none of the groups will capture anything in that situation (which may be why you're seeing a lot of None entries in your results). You probably need to modify your pattern so that it requires some minimal amount of valid text to count as a match. Using + instead of * might be one solution, but it's not clear to me exactly what you want to match against, so I can't suggest a specific pattern.
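
As an illustration of requiring some minimal text, here is one possible stricter pattern. It assumes that at least one digit and at least one keyword must be present, which may or may not be what the question intends:

```python
import re

path = "/12OK"

# Pattern from the question: every part is optional, so it matches "" at index 0.
loose = re.search(r"([0-9]*)((OK)|(BACK)|(X))*", path)

# Assumed stricter variant: one or more digits, then one or more keywords.
strict = re.search(r"([0-9]+)((?:OK|BACK|X)+)", path)

loose_text = loose.group(0)     # '' (empty match before the "/")
strict_groups = strict.groups() # ('12', 'OK')
```

Because the stricter pattern cannot match an empty string, re.search has to scan forward to the digits instead of stopping at the first character.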
