Python - Strange regex matching with + / * on groups

>>> import re
>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s*(\w+\.)+', src).groups()
('submod.',)

This regex seems to put everything that is not whitespace into the group, so I would expect nothing to be lost before the match ends.

Why does the group contain only the last repetition of the "+" here, and not ('pkg.subpkg.submod.',)?

Or at least ('pkg.',), stopping early at the first repetition, which would be "no loss of information" in another sense?

(To get the whole dotted prefix I had to add a non-capturing group (?:...) inside the capturing one, as in r'\s((?:\w+\.)+)'.)

Even stranger:

>>> src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '
>>> re.search(r'\s(\w+\.)*', src).groups()
(None,)

Edit: the "even stranger" case is actually less strange, as @Avinash Raj pointed out, because, contrary to what I intended, the match simply ends before the group is reached. So

>>> re.search(r'\s+(\w+\.)*', '  pkg.subpkg.submod.thing').groups()
('submod.',)

... then shows the same behavior I asked about for "+": only the last repetition is captured, and everything before it seems to be lost.

3 Solutions

#1

I'll explain the "even stranger" part.

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

re.search stops as soon as it finds the first match. So,

r'\s(\w+\.)*' matches just the first space character. The * repeats the preceding group zero or more times, and since the character after the first space is another space, (\w+\.)* matches zero times there. The group therefore never participates in the match, so groups() on the match object returns (None,), while group() returns the matched text, which is that first space.
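
To make this concrete, here is a small sketch that inspects the match object for both the \s and the \s+ variants. The variable names are only for illustration; src is the string from the question.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

# \s consumes exactly one space. The next character is another space,
# so (\w+\.)* succeeds with zero repetitions and the group stays unset.
m = re.search(r'\s(\w+\.)*', src)
whole_match = m.group(0)
span = m.span()
groups = m.groups()
print(repr(whole_match), span, groups)    # ' ' (0, 1) (None,)

# \s+ consumes both leading spaces, so the group gets a chance to repeat.
m2 = re.search(r'\s+(\w+\.)*', src)
whole_match2 = m2.group(0)
groups2 = m2.groups()
print(repr(whole_match2), groups2)        # '  pkg.subpkg.submod.' ('submod.',)
```

Note that in the second case group(0) still contains the whole dotted prefix; only the capturing group is limited to its last repetition.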

#2

I don't see why this is strange to you. What did you expect?

In the documentation you find the following:

re.search(pattern, string, flags=0) Scan through string looking for the first location where the regular expression pattern ...

re.search(r'\s*(\w+\.)+', src).groups()

Your pattern has only one capturing group: (\w+\.). Because + is greedy by default, the group is matched repeatedly and consumes pkg. and subpkg. before it reaches submod. as well. A repeated capturing group keeps only the text of its last repetition, so submod. is what groups() reports.

Your second attempt matches without the group participating at all: * allows zero repetitions, and right after the first space comes another space, so the match ends there and the group is None.

Is this what you are looking for?

re.search(r'\s*((\w+\.)+)', src).groups()[0]

Try out the following to understand it better:

re.search(r'\s*((\w+\.)*)(\w+\.)*', 'a.b.c.d.e.f.g.h.i').groups()
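
For reference, a short sketch comparing the variants discussed above side by side. The variable names are made up for the example; the patterns are the ones from the question and this answer.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

# A repeated capturing group is overwritten on every repetition,
# so only the text of the last repetition survives.
last_only = re.search(r'\s*(\w+\.)+', src).groups()

# An outer group around the repetition captures the whole run;
# the inner group still reports just its last repetition.
outer_and_inner = re.search(r'\s*((\w+\.)+)', src).groups()

# Making the inner group non-capturing leaves only the outer capture.
outer_only = re.search(r'\s*((?:\w+\.)+)', src).groups()

# The experiment above: group 1 takes the whole greedy run, group 2 its last
# repetition, and group 3 has nothing left to match, so it is None.
experiment = re.search(r'\s*((\w+\.)*)(\w+\.)*', 'a.b.c.d.e.f.g.h.i').groups()

print(last_only)        # ('submod.',)
print(outer_and_inner)  # ('pkg.subpkg.submod.', 'submod.')
print(outer_only)       # ('pkg.subpkg.submod.',)
print(experiment)       # ('a.b.c.d.e.f.g.h.', 'h.', None)
```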

#3

This should work fine to match the complete string ' pkg.subpkg.submod.thing pkg2.subpkg.submod.thing '

(\s*(\w+[.\s])+)+

If you want the output ' pkg.subpkg.submod.thing ', use this instead:

\s*(\w+[.\s])+
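
A quick check of both patterns against the string from the question (which contains double spaces) might look like the sketch below. It only inspects the overall match, group(0); the capturing groups themselves are again limited to their last repetition.

```python
import re

src = '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing  '

# The nested pattern runs across both dotted names. Every repetition has to
# end in '.' or whitespace, so the very last trailing space is left unmatched.
full = re.search(r'(\s*(\w+[.\s])+)+', src).group(0)

# Without the outer repetition the match stops after the first dotted name
# plus the single whitespace character that terminates 'thing'.
first = re.search(r'\s*(\w+[.\s])+', src).group(0)

print(repr(full))    # '  pkg.subpkg.submod.thing  pkg2.subpkg.submod.thing '
print(repr(first))   # '  pkg.subpkg.submod.thing '
```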
