I have a bunch of documents and want to find mentions of clinical trials. Trial names are always written in all caps (e.g. ASPIRE). I want to match any all-caps word longer than three letters, along with up to four words on either side for context.
Below is what I currently have. It kind of works, but fails the test below.
import re
pattern = r'((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)
4 Solutions
#1
2
You can do this in Python in two steps. First split the input on all-caps words of four or more letters, then find up to four words on either side of each match.
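import re

text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
re1 = r'\b([A-Z]{4,})\b'      # all-caps word of 4+ letters; captured so re.split keeps it
left = r'(?:\w+\s*){,4}$'     # up to 4 words at the END of the chunk before a match
right = r'(?:\s*\w+\b){,4}'   # up to 4 words at the START of the chunk after a match
arr = re.split(re1, text)     # odd indices hold the matched words
result = []
for i in range(1, len(arr), 2):
    before = re.search(left, arr[i-1]).group().strip()
    after = re.search(right, arr[i+1]).group().strip()
    result.append((before, arr[i], after))
print(result)
Output:
[('Lorem', 'IPSUM', 'is simply'), ('is simply', 'DUMMY', 'text of the printing'), ('the printing and typesetting', 'INDUSTRY', '')]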
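If the context is allowed to include neighbouring all-caps words, one alternative is to tokenize first and slice a window of words around each hit. This is only a minimal sketch and not part of the answer above: the function name caps_with_context and its parameters min_len and window are made up for illustration, and treating every run of \w characters as a word is an assumption.

```python
import re

def caps_with_context(text, min_len=4, window=4):
    """Return (left_words, match, right_words) for each all-caps word."""
    words = re.findall(r"\w+", text)             # crude tokenizer: runs of word characters
    caps = re.compile(r"[A-Z]{%d,}" % min_len)   # the whole token must be uppercase letters
    hits = []
    for i, word in enumerate(words):
        if caps.fullmatch(word):
            left = words[max(0, i - window):i]   # up to `window` words before the hit
            right = words[i + 1:i + 1 + window]  # up to `window` words after the hit
            hits.append((left, word, right))
    return hits

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
for left, word, right in caps_with_context(line):
    print(" ".join(left), "|", word, "|", " ".join(right))
```

Because the window is taken from a list of tokens, punctuation is dropped and every hit is reported, even when two all-caps words sit within four words of each other.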
#2
2
Would the following regex work for you?
(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}
Tested here: https://regex101.com/r/nTzLue/1/
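One thing to watch when using this pattern from Python: a repeated capturing group only keeps its last repetition, so group(1) and group(2) each hold a single word, while the full context is in group(0). Note also that {3,} accepts three-letter words. A rough sketch of the difference on the question's sample line follows; the name grouped is just for illustration.

```python
import re

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

# The pattern from this answer, unchanged.
pattern = re.compile(r"(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}")
m = pattern.search(line)
print(repr(m.group(0)))  # whole match: context plus the all-caps word
print(repr(m.group(1)))  # only the LAST word matched by the repeated group

# Variant (hypothetical name): wrap each repetition in an outer capturing
# group and capture the all-caps word as well.
grouped = re.compile(r"((?:\b\w+\b\W*){,4})([A-Z]{3,})\W*((?:\b\w+\b\W*){,4})")
print(grouped.search(line).groups())
```

On this line the first match starts at Lorem and treats IPSUM as context for DUMMY, because regex matches cannot overlap.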
#3
2
On the left side you could match one or more word characters with \w+, followed by one or more non-word characters with \W+. Combine the two in a non-capturing group and repeat it 4 times with {4}, as in (?:\w+\W+){4}
Then capture 3 or more uppercase characters in a group with ([A-Z]{3,}). Use {4,} instead if the word must be longer than three letters, as the question asks.
On the right side, reverse the order used on the left: match the non-word characters first and then the word characters, (?:\W+\w+){4}
(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}
The capturing group will contain your uppercase word. The non-capturing groups match the surrounding words, so the context is part of the full match (group 0) rather than a separate group.
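To make the pieces concrete, here is a small sketch that assembles the pattern from its three parts and runs it on the sample line from the question. The variable names left, caps and right are only for illustration.

```python
import re

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

left = r"(?:\w+\W+){4}"    # exactly four words, each followed by separators
caps = r"([A-Z]{3,})"      # the only capturing group
right = r"(?:\W+\w+){4}"   # exactly four words, each preceded by separators
pattern = re.compile(left + caps + right)

m = pattern.search(line)
print(m.group(1))  # the all-caps word
print(m.group(0))  # the word together with its surrounding context

# {4} demands exactly four words on both sides, so all-caps words near the
# start or end of the line are not reported.
print(pattern.findall(line))
```

Relaxing {4} to {0,4} lets shorter contexts match, although matches still cannot overlap one another.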
#4
1
This should do the job:
pattern = r'(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'