Python: finding an n-sized window around a phrase in a string

Time: 2020-12-28 19:22:07

I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase, even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.

This is my current attempt for a window size of 2. However, it fails when there are fewer than 2 words to the left or right of the phrase. I would also like to be able to use different window sizes.

import re
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)

left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
# left-2 goes negative when fewer than 2 words precede the phrase,
# so the slice start wraps around and the result comes back empty
print(sentence_words[left-2:right+3])

4 answers

#1


3  

You can use the partition method for a non-regex solution:

>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)

Then use a slice to get up to two words:

>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')

Your exact output:

>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'

Of course, you would want to check that p is in s, that is, that the partition actually succeeded.

As pointed out in the comments, the slice on lh should use a negative index so that it takes the last n words before the phrase (thanks Mathias Ettinger):

>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
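
The steps above, together with the check for a missing phrase, can be folded into one function. This is only a sketch: the name window_by_partition and the choice to return None when the phrase is absent are assumptions, not part of the original answer. Note that partition matches substrings, so a phrase could also match in the middle of a longer word.

```python
def window_by_partition(s, p, n):
    """Return up to n words on either side of the first occurrence of p in s.

    Hypothetical helper; returns None when p does not occur in s.
    """
    lh, sep, rh = s.partition(p)
    if not sep:
        # str.partition returns an empty separator when p is not found
        return None
    # lh.split()[-0:] would return the whole list, so guard n == 0
    left = lh.split()[-n:] if n > 0 else []
    right = rh.split()[:n]
    return ' '.join(left + [p] + right)


print(window_by_partition('i cant sleep what should i do', 'cant sleep', 2))
print(window_by_partition('w1 w2 w3 w4 w5 w6 w7 w8 w9', 'w4 w5', 2))
print(window_by_partition('i cant sleep', 'no match', 2))
```

The n == 0 guard is there because a slice of [-0:] is the same as [0:], which would pull in every word to the left of the phrase.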

#2


2  

If you define words as entities separated by whitespace, you can split the sentence and use regular Python slicing:

def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)

    for i,word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))

You can also turn it into a generator (named gen_window here) by changing return to yield, which lets you generate every window when the phrase occurs several times in the sentence:

>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
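
The answer does not show gen_window itself, so here is a minimal sketch of what it could look like, assuming it is get_window with return swapped for yield. The loop bounds and local variable names are my own choices.

```python
def gen_window(sentence, phrase, window_size):
    """Yield a window of words around every occurrence of phrase."""
    words = sentence.split()
    phrase_words = phrase.split()
    n = len(phrase_words)

    # Stop early enough that a full phrase still fits at position i
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            # Clamp the start at 0; slicing past the end is already safe
            start = max(0, i - window_size)
            yield ' '.join(words[start:i + n + window_size])


print(list(gen_window('I dont need it, I need to get rid of it', 'need', 2)))
```

Because the sentence is split on whitespace only, punctuation stays attached to its word, which is why 'it,' appears in the windows.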

#3


1  

import re

def contains_sublist(lst, sublst):
    n = len(sublst)

    # slide over every position where the sublist could still fit
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            # clamp the window to the bounds of the list
            a = max(0, i-2)
            b = min(i+n+2, len(lst))
            return ' '.join(lst[a:b])


sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)

print(contains_sublist(sentence_words, phrase_words))
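
The question also asks for different window sizes, while the function above hardcodes 2. A possible variant with the size as a parameter is sketched below; the name window_around, the window parameter and the None return for a missing phrase are assumptions added for illustration.

```python
import re


def window_around(sentence, phrase, window=2):
    """Return up to `window` words on each side of the first match, or None."""
    words = re.findall(r'\w+', sentence)
    target = re.findall(r'\w+', phrase)
    n = len(target)

    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            # Clamp both ends so short sentences still give a result
            a = max(0, i - window)
            b = min(i + n + window, len(words))
            return ' '.join(words[a:b])
    return None


print(window_around('i cant sleep what should i do', 'cant sleep', 1))
print(window_around('i cant sleep what should i do', 'cant sleep', 10))
```

Since the words come from r'\w+', punctuation is dropped from the result, unlike the split-based answers.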

#4


1  

You can split words using built-in string methods, so re shouldn't be necessary. If you want varying window sizes, wrap the logic in a function like so:

def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()

    for i,word in enumerate(w_lst):
        if word == p_lst[0] and \
           w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            # return the window around the first match
            return w_lst[left:right]

    # phrase not found
    return []

Then you can get the new phrase like so:

>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'
