I have a string, for example 'i cant sleep what should i do', as well as a phrase that is contained in the string, 'cant sleep'. What I am trying to accomplish is to get an n-sized window around the phrase, even if there aren't n words on either side. So in this case, with a window size of 2 (2 words on either side of the phrase), I would want 'i cant sleep what should'.
This is my current solution, which attempts to find a window of size 2. However, it fails when the number of words to the left or right of the phrase is less than 2. I would also like to be able to use different window sizes.
import re
sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
left = sentence_words.index(phrase_words[0])
right = sentence_words.index(phrase_words[-1])
# left-2 becomes negative when fewer than 2 words precede the phrase,
# so the slice wraps around and comes back empty
print(sentence_words[left-2:right+3])
4 Solutions
#1
3
You can use the partition method for a non-regex solution:
>>> s='i cant sleep what should i do'
>>> p='cant sleep'
>>> lh, _, rh = s.partition(p)
Then use a slice to get up to two words:
>>> n=2
>>> ' '.join(lh.split()[:n]), p, ' '.join(rh.split()[:n])
('i', 'cant sleep', 'what should')
Your exact output:
>>> ' '.join(lh.split()[:n]+[p]+rh.split()[:n])
'i cant sleep what should'
You would want to check whether p is in s, or whether the partition succeeds, of course.
As pointed out in the comments, the slice on lh should use a negative index to take the last n words (thanks Mathias Ettinger):
>>> s='w1 w2 w3 w4 w5 w6 w7 w8 w9'
>>> p='w4 w5'
>>> n=2
>>> lh, _, rh = s.partition(p)
>>> ' '.join(lh.split()[-n:]+[p]+rh.split()[:n])
'w2 w3 w4 w5 w6 w7'
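The partition steps above can be wrapped into a small helper. This is only a minimal sketch and not part of the original answer: the name window_by_partition and the choice to return None when the phrase is absent are assumptions made for illustration. It also guards n = 0, because a slice of [-0:] would return every word rather than none.

```python
def window_by_partition(s, p, n):
    """Return p with up to n words of context on each side, or None if p is not in s."""
    lh, sep, rh = s.partition(p)
    if not sep:
        # partition returns an empty separator when p does not occur in s
        return None
    left = lh.split()[-n:] if n > 0 else []  # [-0:] would return every word
    right = rh.split()[:n]
    return ' '.join(left + [p] + right)


print(window_by_partition('i cant sleep what should i do', 'cant sleep', 2))
# i cant sleep what should
print(window_by_partition('w1 w2 w3 w4 w5 w6 w7 w8 w9', 'w4 w5', 2))
# w2 w3 w4 w5 w6 w7
```

Keep in mind that partition matches substrings rather than whole words, so a phrase can match inside a longer word.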
#2
2
If you define words as entities separated by spaces, you can split your sentence and use regular Python slicing:
def get_window(sentence, phrase, window_size):
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(sentence[start:i+words+window_size])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
print(get_window(sentence, phrase, 2))
You can also change it to a generator by changing return to yield, which lets you generate all windows if phrase matches several times in sentence:
>>> list(gen_window('I dont need it, I need to get rid of it', 'need', 2))
['I dont need it, I', 'it, I need to get']
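The call above uses a gen_window function that the answer does not show. A minimal sketch, assuming it is simply get_window with return replaced by yield as described:

```python
def gen_window(sentence, phrase, window_size):
    """Yield one window for every occurrence of phrase in sentence."""
    sentence = sentence.split()
    phrase = phrase.split()
    words = len(phrase)
    for i, word in enumerate(sentence):
        if word == phrase[0] and sentence[i:i+words] == phrase:
            # clamp the left edge so it never goes below index 0
            start = max(0, i - window_size)
            yield ' '.join(sentence[start:i + words + window_size])


print(list(gen_window('I dont need it, I need to get rid of it', 'need', 2)))
# ['I dont need it, I', 'it, I need to get']
```

Because the generator keeps scanning after each match, overlapping windows are reported separately rather than merged.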
#3
1
import re

def contains_sublist(lst, sublst):
    n = len(sublst)
    for i in range(len(lst)-n+1):
        if sublst == lst[i:i+n]:
            a = max(0, i-2)           # clamp the left edge at the start of the list
            b = min(i+n+2, len(lst))  # clamp the right edge at the end of the list
            return ' '.join(lst[a:b])

sentence = 'i cant sleep what should i do'
phrase = 'cant sleep'
sentence_words = re.findall(r'\w+', sentence)
phrase_words = re.findall(r'\w+', phrase)
print(contains_sublist(sentence_words, phrase_words))
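This answer tokenizes with re.findall(r'\w+', ...), while the other answers use str.split(). The two can disagree when the text contains punctuation, which affects whether a phrase matches. A small illustration, using the sentence from answer #2 (the variable names here are only for the example):

```python
import re

text = 'I dont need it, I need to get rid of it'

# str.split() only splits on whitespace, so punctuation stays attached
split_words = text.split()
# r'\w+' keeps runs of word characters and drops the comma
regex_words = re.findall(r'\w+', text)

print(split_words[3])  # it,
print(regex_words[3])  # it
```

With split(), the phrase 'need it' would not match this sentence, because the fourth token is 'it,' rather than 'it'.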
#4
1
You can split words using built-in string methods, so re shouldn't be necessary. If you want to use varying window sizes, wrap it in a function like so:
def get_word_window(sentence, phrase, w_left=0, w_right=0):
    w_lst = sentence.split()
    p_lst = phrase.split()
    for i, word in enumerate(w_lst):
        if word == p_lst[0] and \
           w_lst[i:i+len(p_lst)] == p_lst:
            left = max(0, i-w_left)
            right = min(len(w_lst), i+w_right+len(p_lst))
            return w_lst[left:right]
Then you can get the new phrase like so:
>>> sentence='i cant sleep what should i do'
>>> phrase='cant sleep'
>>> ' '.join(get_word_window(sentence,phrase,2,2))
'i cant sleep what should'