Matching a Python regex multiple times

Time: 2021-12-31 23:37:01

I'm trying to match a pattern against strings that could have multiple instances of the pattern. I need every instance separately. re.findall() should do it but I don't know what I'm doing wrong.

pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')

I need 'http://url.com/123' and 'http://url.com/456', along with the two numbers 123 and 456, to be separate elements of the match list.

I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.

3 solutions

#1


12  

Use this. You need to place 'review' outside the capturing group to achieve the desired result.

pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)

This gives the following output:

>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
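
If match objects are more convenient than tuples, the same pattern also works with re.finditer. This is a minimal sketch and not part of the original answer; the names message and results are illustrative only.

```python
import re

# Same pattern as above; the optional non-capturing group lets "review: "
# be consumed without appearing in the results.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)

message = 'this is the message. review: http://url.com/123 http://url.com/456'

# finditer yields one match object per URL:
# group(1) is the full URL, group(2) is the number.
results = [(m.group(1), m.group(2)) for m in pattern.finditer(message)]
print(results)
```

Each match object also exposes m.start() and m.end(), which helps when the position of a URL in the message matters.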

#2


5  

You've got extra slashes in the regex. In Python the pattern is just a string, with no /.../ delimiters around it. For example, instead of this:

pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)

It should be:

pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)

Also, in Python you'd typically use a "raw" string, like this:

pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)

The r prefix on the string saves you from doubling backslashes: sequences such as \d are passed to the regex engine unchanged.
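
One caveat, offered as a sketch rather than as part of the original answer: with the slashes removed the pattern does match, but Python's re module keeps only the last repetition of a repeated capturing group, so findall still returns a single tuple for this input. The names message and last_only are illustrative.

```python
import re

pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
message = 'this is the message. review: http://url.com/123 http://url.com/456'

# The whole "review: ..." run is a single match, and each group holds only
# its final repetition, so the first URL is not reported.
last_only = pattern.findall(message)
print(last_only)
```

Getting every URL separately therefore needs a pattern that matches one URL per match, or a second pass over the captured text.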

#3


0  

Use a two-step approach: First get everything from "review:" to EOL, then tokenize that.

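import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# Step 1: capture everything after "review: " up to the end of the line.
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]

# Step 2: pull each URL and its number out of the captured text.
url_pattern = re.compile(r"(http://url.com/(\d+))")
url_pattern.findall(urls)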
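
The two steps above could also be wrapped in a helper that returns an empty list when the message has no "review:" section, since indexing findall(...)[0] raises IndexError in that case. This is only a sketch: the function name extract_review_urls and the constant names are hypothetical, re.search stands in for the first findall call, and the dot in the URL is escaped as an extra precaution.

```python
import re

# Step 1 pattern: everything after "review: " up to the end of the string.
REVIEW_RE = re.compile(r'review: (.*)$')
# Step 2 pattern: one URL plus its trailing number.
URL_RE = re.compile(r'(http://url\.com/(\d+))')


def extract_review_urls(msg):
    """Return (url, number) tuples found after "review: ", or [] if absent."""
    review = REVIEW_RE.search(msg)
    if review is None:
        return []
    return URL_RE.findall(review.group(1))


print(extract_review_urls(
    'this is the message. review: http://url.com/123 http://url.com/456'))
```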
