I want to extract a portion of a large string. Given a target word and an upper bound on the number of words before and after it, the extracted substring must contain the target word along with up to that many words on each side. The before and after parts can contain fewer words if the target word is close to the beginning or end of the text.
Example string:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
Target word: laboris
words_before: 5
words_after: 2
Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']
I thought of a couple of possible patterns, but none of them worked. I suppose it could also be done by simply traversing the string forwards and backwards from the target word, but a regex would definitely make things easier. Any help would be appreciated.
3 Answers
#1
If you still want regex....
import re

def find_context(word_, n_before, n_after, string_):
    b = r'\w+\W+' * n_before    # exactly n_before words before the target
    a = r'\W+\w+' * n_after     # exactly n_after words after the target
    pattern = '(' + b + word_ + a + ')'
    print(re.search(pattern, string_).group(1))

find_context('laboris', 5, 2, st)  # st holds the example string
veniam, quis nostrud exercitation ullamco laboris nisi ut
find_context('culpa', 2, 2, st)
sunt in culpa qui officia
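Note that the fixed-repetition pattern above requires exactly `n_before` and `n_after` neighbouring words, so `re.search` returns `None` when the target sits closer to the start or end of the text, which the question explicitly allows. A minimal sketch of a variant using bounded `{0,n}` quantifiers that degrades gracefully at the boundaries (the name `find_context_bounded` is mine, not from the answer):

```python
import re

def find_context_bounded(word, n_before, n_after, text):
    # {0,n} makes each neighbouring word optional, so the match still
    # succeeds when fewer than n words exist before or after the target.
    pattern = (r'((?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d})'
               % (n_before, re.escape(word), n_after))
    match = re.search(pattern, text)
    return match.group(1) if match else None

st = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
      "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
      "ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
      "aliquip ex ea commodo consequat.")

print(find_context_bounded('laboris', 5, 2, st))
# veniam, quis nostrud exercitation ullamco laboris nisi ut
print(find_context_bounded('Lorem', 5, 2, st))   # target at the very start
# Lorem ipsum dolor
```

Because the leftmost successful match wins, the greedy `{0,n}` prefix still picks up the full `n_before` words when they exist.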
#2
If you want to split into words, you can use the split() function and a slice() object. For example:
>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()
>>> n = text.index('laboris')
>>> s = slice(n - 5, n + 3)
>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
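One caveat with the slice approach: when the target word is within `before` words of the start, `n - before` goes negative, and Python's negative indexing wraps the slice around to the end of the list. A small sketch that clamps the lower bound (the helper name `words_around` is my own):

```python
def words_around(text, word, before, after):
    tokens = text.split()  # punctuation stays attached to each word
    # index() matches the exact token, so 'laboris' works here, but a word
    # followed by punctuation (e.g. 'veniam,') would need a looser test.
    i = tokens.index(word)
    # max() clamps the lower bound; a negative slice start would wrap
    # around and return the wrong window.
    return tokens[max(i - before, 0): i + after + 1]

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
        "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
        "ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
        "aliquip ex ea commodo consequat.")

print(words_around(text, 'laboris', 5, 2))
# ['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
print(words_around(text, 'Lorem', 5, 2))
# ['Lorem', 'ipsum', 'dolor']
```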
#3
You can also approach it with nltk and its "concordance" method, inspired by Calling NLTK's concordance - how to get text before/after a word that was used?:
A concordance view shows us every occurrence of a given word, together with some context.
import nltk

def get_neighbors(input_text, word, before, after):
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))
    concordance_index = nltk.ConcordanceIndex(text.tokens)
    # Token offset of the first occurrence of the word
    offset = next(offset for offset in concordance_index.offsets(word))
    # word_tokenize treats punctuation as separate tokens; the extra "- 1"
    # widens the left window by one token (here it picks up the comma)
    return text.tokens[offset - before - 1: offset] + text.tokens[offset: offset + after + 1]
text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
print(get_neighbors(text, 'laboris', 5, 2))
Prints 5 words/tokens before the target word and 2 after:
[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']