I'm trying to match a pattern against strings that may contain multiple instances of the pattern, and I need every instance separately. re.findall() should do this, but I don't know what I'm doing wrong.
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
I need 'http://url.com/123', 'http://url.com/456', and the two numbers '123' and '456' to be separate elements of the match list.
I have also tried '/review: ((http://url.com/(\d+)\s?)+)/' as the pattern, but no luck.
3 solutions
#1
12
Use this. You need to place 'review' outside the capturing group to achieve the desired result.
pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
This gives the output:
>>> match = pattern.findall('this is the message. review: http://url.com/123 http://url.com/456')
>>> match
[('http://url.com/123', '123'), ('http://url.com/456', '456')]
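Because findall returns one tuple per match when the pattern has more than one group, the URLs and the numbers can be pulled apart into separate lists. A minimal sketch, assuming the same message as above; the names `urls` and `ids` are illustrative and not part of the original answer.

```python
import re

pattern = re.compile(r'(?:review: )?(http://url.com/(\d+))\s?', re.IGNORECASE)
msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# finditer yields match objects, so each group can be read by its index.
urls = [m.group(1) for m in pattern.finditer(msg)]
ids = [m.group(2) for m in pattern.finditer(msg)]

print(urls)  # ['http://url.com/123', 'http://url.com/456']
print(ids)   # ['123', '456']
```

Note that the `review: ` prefix is optional in this pattern, so it also matches URLs in messages that contain no `review:` marker at all.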
#2
5
You've got extra /'s in the regex. In Python, the pattern should just be a string, with no delimiters around it. For example, instead of this:
pattern = re.compile('/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
It should be:
pattern = re.compile('review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
Also, in Python you'd typically use a "raw" string, like this:
pattern = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
The r prefix on the string means backslashes such as the one in \d reach the regex engine unchanged, which saves you from having to do lots of backslash escaping.
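Removing the slashes makes the pattern match, but it is worth checking what findall returns in that case: a capturing group repeated with + keeps only its last repetition. A short sketch of this behaviour, using the message from the question (the variable names are illustrative):

```python
import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'

# With the literal slashes nothing matches, because the message has no '/'
# directly before 'review'.
with_slashes = re.compile(r'/review: (http://url.com/(\d+)\s?)+/', re.IGNORECASE)
print(with_slashes.findall(msg))  # []

# Without them the pattern matches once, but each repeated group only
# retains the text of its final repetition.
without_slashes = re.compile(r'review: (http://url.com/(\d+)\s?)+', re.IGNORECASE)
print(without_slashes.findall(msg))  # [('http://url.com/456', '456')]
```

So this fix alone does not return every URL separately; the URL group has to be matched once per findall hit rather than repeated inside a single match.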
#3
0
Use a two-step approach: first get everything from "review:" to the end of the line, then tokenize that.
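import re

msg = 'this is the message. review: http://url.com/123 http://url.com/456'
# Step 1: capture everything after "review: " up to the end of the line.
review_pattern = re.compile(r'.*review: (.*)$')
urls = review_pattern.findall(msg)[0]
# Step 2: pull each URL and its number out of the captured text.
url_pattern = re.compile(r'(http://url.com/(\d+))')
url_pattern.findall(urls)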
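The two steps could also be wrapped in a helper that copes with messages that have no "review:" section, where indexing the findall result with [0] would raise an IndexError. A sketch under a few assumptions that are not in the original answer: the function name `extract_review_urls` is hypothetical, matching is made case-insensitive, and the dot in the host name is escaped so it only matches a literal dot.

```python
import re

REVIEW_RE = re.compile(r'review: (.*)$', re.IGNORECASE)
URL_RE = re.compile(r'(http://url\.com/(\d+))')

def extract_review_urls(msg):
    """Return (url, number) tuples found after 'review:', or [] if absent."""
    found = REVIEW_RE.search(msg)
    if found is None:
        # No "review:" marker, so there is nothing to tokenize.
        return []
    return URL_RE.findall(found.group(1))

print(extract_review_urls('this is the message. review: http://url.com/123 http://url.com/456'))
# [('http://url.com/123', '123'), ('http://url.com/456', '456')]
print(extract_review_urls('no marker here'))  # []
```

Unlike the single-pattern approach, this only returns URLs that follow the "review:" marker.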