Somewhat of a python/programming newbie here...
I am trying to come up with a regex that can handle extracting sentences from a line in a text file, and then appending them to a list. The code:
import re

txt_list = []
with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()
    for line in read_txt:
        if line == "\n":
            txt_list.append("\n")
        else:
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)

for line in txt_list:
    if line == "\n":
        print "newline"
    else:
        print line
Printed output from the last five lines of the code above:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
I am the {very last|last} sentence for this {instance|example}.
The contents of 'sample.txt':
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
I am the {very last|last} sentence for this {instance|example}.
I have been playing around with the regex for a couple of hours now and I cannot seem to crack it. As it stands, the regex does not match at the end of "for lunch?", so these two sentences:
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
are not separated, and separating them is what I want.
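A minimal repro on just that problem line:

import re

patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
line = ("What {will|shall|should} we {eat|have} for lunch? "
        "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
        "{that|is} what he said.")

# Both sentences come back as a single match instead of two
print(re.findall(patt, line))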
A few important details for the regex:
- Every sentence will always end in a period, exclamation mark or question mark
- Every sentence will always contain at least one pair of curly brackets '{}' with some words inside. Also, there will be no misleading '.' after the last bracket in any sentence, so abbreviations like 'Dr.' will always appear before the last pair of curly brackets in each sentence. This is why I have tried to base my regex around '}': it lets me avoid the exceptions approach of special-casing grammar such as 'Dr.', 'Jr.', 'approx.' and so on. For each file I run this code on, I personally make sure there is no misleading period after the last '}' in any sentence.
The output that I want is this:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
I am the {very last|last} sentence for this {instance|example}.
2 Solutions
#1
The most intuitive solution I've got is this. Essentially, you need to treat the 'Dr.' and 'Mr.' tokens as atoms in their own right.
patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'
Broken down, it says:
Find me the smallest number of 'Mr.'s, 'Dr.'s, or any other characters up to a punctuation mark, followed by zero or one spaces, followed by zero or one newlines.
When used on this sample.txt (I added a line):
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
But there are no {misters|doctors} here good sir! Help us if there is an emergency.
I am the {very last|last} sentence for this {instance|example}.
It gives:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.
newline
I am the {very last|last} sentence for this {instance|example}.
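For completeness, here's a minimal sketch of plugging that pattern into your existing loop (the lazy (?:...)*? stops at the first punctuation mark instead of the greedy .* running to the end of the line, and the alternation consumes 'Dr.'/'Mr.' whole so their periods never terminate a match):

import re

patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

txt_list = []
with open('sample.txt', 'r') as txt:
    for line in txt:
        if line == "\n":
            txt_list.append("\n")  # keep paragraph breaks
        else:
            # each match is one sentence; strip the trailing space/newline
            txt_list.extend(m.strip() for m in re.findall(patt, line))

for sentence in txt_list:
    print("newline" if sentence == "\n" else sentence)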
#2
If you don't mind adding a dependency, the NLTK library has a sent_tokenize function that should do what you need, though I'm not entirely sure whether the curly brackets will interfere.
The paper describing the method NLTK used is 40+ pages long. Detecting sentence boundaries isn't a trivial task.
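A quick sketch of what that could look like; it assumes the standard 'punkt' model download, and whether the {a|b} tokens survive tokenization intact is untested, as noted above:

import nltk
nltk.download('punkt')  # one-time fetch of the Punkt sentence-boundary models

from nltk.tokenize import sent_tokenize

with open('sample.txt', 'r') as txt:
    text = txt.read()

# Punkt generally treats common abbreviations like "Dr." as non-terminal
for sentence in sent_tokenize(text):
    print(sentence)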