
Time: 2021-08-13 12:51:38

I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.


Now I want it to select an entire sentence such as 'Put 1.5 grams of powder in': if powder is a key word, it should return the whole sentence and not just '5 grams of powder'.


I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:


def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))

However, now I no longer get any sentences, just pieces/phrases of words (including my key word). I am very confused about what I am doing wrong.


3 solutions



If you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (a custom definition of a sentence):


re.split(r'\.\s', text)

Note that the last element will either keep its trailing period or be an empty string (if the text ends with whitespace after the last period). To fix that:


re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
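
As a minimal sketch of how the split above could feed the keyword filter from the question: the helper names split_sentences and sentences_with_keywords and the sample text are made up for illustration, and a plain substring test stands in for whatever keyword matching the original script does.

```python
import re

def split_sentences(text):
    # Hypothetical helper: drop the final period (and any trailing whitespace)
    # so the last piece comes out clean.
    trimmed = re.sub(r'\.\s*$', '', text)
    # Split only where a period is followed by whitespace, so "1.5" stays intact.
    return re.split(r'\.\s', trimmed)

def sentences_with_keywords(text, keywords):
    # Hypothetical helper: keep sentences containing any keyword as a substring.
    return [s for s in split_sentences(text) if any(k in s for k in keywords)]

sample = "Mix well. Put 1.5 grams of powder in. Stir for 2 minutes. "
print(sentences_with_keywords(sample, ["powder"]))
```

The returned sentences no longer carry their final period, because the period is part of the delimiter that re.split consumes.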

Also have a look at the somewhat more general case in the answer to "Python - RegEx for splitting text into sentences (sentence-tokenizing)".

And for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize.





Here you get it as an iterator. It works with my test cases. It considers a sentence to be anything (non-greedy) up to a period that is followed by either a space or the end of the line.


import re
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
    # Lazily yield the full text of each matched sentence.
    return (match.group(0) for match in sentence.finditer(text))
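
A short usage sketch for the iterator approach, assuming the goal is still to keep only sentences that contain a key word. The names iter_sentences and iter_keyword_sentences and the sample text are hypothetical, and the pattern is repeated so the snippet runs on its own.

```python
import re

# Same idea as above: start at a word character, match lazily up to a period
# that is followed by a space or the end of a line.
SENTENCE = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iter_sentences(text):
    # Hypothetical name; yields each matched sentence including its period.
    return (m.group(0) for m in SENTENCE.finditer(text))

def iter_keyword_sentences(text, keywords):
    # Hypothetical helper: lazily filter sentences by substring match.
    for s in iter_sentences(text):
        if any(k in s for k in keywords):
            yield s

sample = "Mix well. Put 1.5 grams of powder in. Stir for 2 minutes."
print(list(iter_keyword_sentences(sample, ["powder"])))
```

Because "." does not match a newline here, a sentence that wraps across lines would not be captured whole; that is a limit of this sketch rather than something the answer claims to handle.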



If you are sure that . is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:


import re
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group(1) for m in matches]
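
One possible way to build that pattern from a keyword list instead of hard-coding it, sketched under the same assumption that a period only ever ends a sentence. The function name keyword_sentences and the sample strings are invented for this example; re.escape guards against keywords that contain regex metacharacters.

```python
import re

def keyword_sentences(text, keywords):
    # Hypothetical helper: join the escaped keywords into one alternation.
    alternation = "|".join(re.escape(k) for k in keywords)
    # Capture everything from the previous period up to (not including) the next one.
    pattern = re.compile(r"([^.]*?(?:%s).*?)\." % alternation)
    return [m.group(1).strip() for m in pattern.finditer(text)]

clean = "Mix well. Add the powder slowly. Stir for two minutes."
decimal = "Put 1.5 grams of powder in."
print(keyword_sentences(clean, ["powder"]))
print(keyword_sentences(decimal, ["powder"]))
```

The second call returns '5 grams of powder in', which shows why the stated assumption matters: a decimal point is treated as a sentence boundary, so this approach does not solve the 'Put 1.5 grams' case from the question.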


