Python regex: find all strings between two substrings

Date: 2022-11-27 18:50:27

I am looking to find all strings between two substrings while keeping the first substring and discarding the second. The substrings might be one of several values though. For example, if these are the possible substrings:

subs = ['MIKE','WILL','TOM','DAVID']

I am looking to get the string between any of these like this:

Input:

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'

Output:

[('MIKE', 'an entry for mike'),
 ('WILL', 'and here is wills text'),
 ('DAVID', 'and this belongs to david')]

Trailing spaces are not important. I have tried:

re.findall('(MIKE|WILL|TOM|DAVID)(.*?)(MIKE|WILL|TOM|DAVID)',text)

which only returns the first occurrence and retains the end substring. Not too sure of the best approach.
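
For reference, a minimal reproduction of that attempt, using the same text as above. It suggests why only one result comes back: the closing keyword is consumed as part of the match, so the next search starts after WILL and finds no further start/end pair.

```python
import re

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'

# The third group consumes the closing keyword, so WILL cannot
# start the next match; DAVID has no keyword after it and fails too.
attempt = re.findall('(MIKE|WILL|TOM|DAVID)(.*?)(MIKE|WILL|TOM|DAVID)', text)
print(attempt)
# => [('MIKE', ' an entry for mike ', 'WILL')]
```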

2 solutions

#1


2  

You may use

import re
text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
subs = ['MIKE','WILL','TOM','DAVID']
res = re.findall(r'({0})\s*(.*?)(?=\s*(?:{0}|$))'.format("|".join(subs)), text)
print(res)
# => [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]

See the Python demo.

The pattern that is built dynamically will look like (MIKE|WILL|TOM|DAVID)\s*(.*?)(?=\s*(?:MIKE|WILL|TOM|DAVID|$)) in this case.

Details

  • (MIKE|WILL|TOM|DAVID) - Group 1, matching one of the alternative substrings
  • \s* - 0+ whitespace characters
  • (.*?) - Group 2, capturing any 0+ chars other than line break chars (use the re.S flag to match any chars), as few as possible, up to the first...
  • (?=\s*(?:MIKE|WILL|TOM|DAVID|$)) - a positive lookahead requiring 0+ whitespace characters followed by one of the substrings or the end of the string ($). This text is not consumed, so the regex engine can still find subsequent matches.
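
If the substrings could contain regex metacharacters, or the text could span several lines, a slightly more defensive variant might look like the sketch below. The helper name split_by_markers is hypothetical, and the use of re.escape and re.S is an assumption about what your data needs, not part of the pattern above.

```python
import re

def split_by_markers(text, markers):
    """Return (marker, following_text) pairs; hypothetical helper."""
    # Escape each marker so characters like "+" or "." are matched literally.
    alternation = "|".join(re.escape(m) for m in markers)
    pattern = re.compile(
        r"({0})\s*(.*?)(?=\s*(?:{0}|$))".format(alternation),
        re.S,  # let "." also match line breaks
    )
    return pattern.findall(text)

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
subs = ['MIKE', 'WILL', 'TOM', 'DAVID']
pairs = split_by_markers(text, subs)
print(pairs)
# => [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]
```

Note that matching is case-sensitive, which is why the lowercase "mike", "wills" and "david" inside the text are not treated as markers.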

#2


0  

You can also use the following regex to achieve your goal:

(MIKE.*)(?= WILL)|(WILL.*)(?= DAVID)|(DAVID.*)

It uses positive lookahead to get the intermediate strings (see http://www.rexegg.com/regex-quickstart.html).

TESTED: https://regex101.com/r/ZSJJVG/1
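
A rough sketch of how this pattern could be applied from Python, assuming the keywords always appear in the order MIKE, WILL, DAVID (the order is hard-coded in the pattern, and TOM is not handled). re.finditer is used here because re.findall would return one tuple of three groups per match, two of them empty.

```python
import re

text = 'MIKE an entry for mike WILL and here is wills text DAVID and this belongs to david'
pattern = r'(MIKE.*)(?= WILL)|(WILL.*)(?= DAVID)|(DAVID.*)'

# group(0) is the whole match, whichever alternative produced it.
segments = [m.group(0) for m in re.finditer(pattern, text)]
print(segments)
# => ['MIKE an entry for mike', 'WILL and here is wills text', 'DAVID and this belongs to david']

# Split each segment into (keyword, text) at the first space.
pairs = [tuple(s.split(' ', 1)) for s in segments]
print(pairs)
# => [('MIKE', 'an entry for mike'), ('WILL', 'and here is wills text'), ('DAVID', 'and this belongs to david')]
```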
