How to match paragraphs using a regular expression

I have been struggling with Python regex for a while, trying to match paragraphs within a text, but I haven't been successful. I need to obtain the start and end positions of the paragraphs.

An example of a text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. 

Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

In this example case, I would want to separately match all the paragraphs starting with Lorem, Stet and Ipsum respectively (without the empty lines). Does anyone have any idea how to do this?

5 solutions

#1

You can split on double-newline like this:

paragraphs = re.split(r"\n\n", DATA)
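
As a small, self-contained sketch of the split approach: the sample text below is a stand-in, and the `\n\s*\n` separator is a variation on the answer's `\n\n` (an assumption on my part) so that blank lines containing stray spaces still count as separators. Note that splitting gives you the paragraph strings only, not their positions.

```python
import re

# Stand-in sample text; the second "blank" line contains stray spaces.
DATA = "Lorem ipsum\ndolor sit amet.\n\nStet clita\nkasd gubergren.\n  \nIpsum dolor."

# Split on a newline, optional whitespace, and another newline.
paragraphs = re.split(r"\n\s*\n", DATA)

for p in paragraphs:
    print(repr(p))
```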

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

# Prints:
# 0 214
# 215 298
# 299 589
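
A runnable sketch of the same idea on a short stand-in text may make the offsets easier to check by hand. The pattern here is a simplified form of the one above, and the trimming step is my own addition: the match can include the single newline that ends a paragraph's last line, so it is removed from the end offset.

```python
import re

# Stand-in text with three paragraphs separated by blank lines.
DATA = "Lorem ipsum\ndolor sit amet.\n\nStet clita.\n\nIpsum dolor\nsit amet."

spans = []
for match in re.finditer(r"(?:[^\n]\n?)+", DATA):
    start, end = match.start(), match.end()
    # Drop the newline that terminates the paragraph's last line, if matched.
    if DATA[end - 1] == "\n":
        end -= 1
    spans.append((start, end))

for start, end in spans:
    print(start, end, repr(DATA[start:end]))
```

Slicing `DATA[start:end]` with each pair gives back the paragraph text without the surrounding empty lines.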

#2

Using split is one way; you can also do it with a regular expression, like this:

paragraphs = re.findall(r'(.+?\n\n|.+?$)', TEXT, re.DOTALL)

The .+? is a lazy match: it matches the shortest substring that still lets the whole regex match. A greedy .+ would instead consume as much as possible, up to the whole string.

So basically we want to find a sequence of characters (.+?) that ends with a blank line (\n\n) or the end of the string ($). The re.DOTALL flag makes the dot match newlines as well, since a paragraph may span several lines with no blank line in between.
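
To see the lazy/greedy difference concretely, here is a minimal sketch. The sample text is a stand-in, and the pattern is a slight variation (an assumption, not the exact regex above) that keeps the blank-line separator outside the captured group.

```python
import re

# Stand-in text: three paragraphs, the first spanning two lines.
TEXT = "First line\nsecond line\n\nSecond paragraph\n\nLast paragraph"

# Lazy: stop at the first blank line (or the end of the string).
lazy = re.findall(r"(.+?)(?:\n\n|$)", TEXT, re.DOTALL)

# Greedy: .+ runs to the end of the string, so everything is one match.
greedy = re.findall(r"(.+)(?:\n\n|$)", TEXT, re.DOTALL)

print(len(lazy), len(greedy))
```

If positions are needed, `re.finditer` with the same lazy pattern exposes `m.start(1)` and `m.end(1)` for each paragraph.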

#3

What is the newline symbol? Suppose it is '\r\n'. If you want to match the paragraphs starting with Lorem, you can do it like this:

import re

pattern = re.compile(r'\r\nLorem.*\r\n')  # note: without re.DOTALL, .* stops at the end of the line
text = '...'    # your source text
matchlist = re.findall(pattern, text)

The matchlist will contain all the paragraphs that start with Lorem. The other two words work the same way.
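
For multi-line paragraphs, one possible variation (a sketch under my own assumptions, not the pattern above) is to anchor on the start of a line and stop lazily at the next blank line, accepting both '\n' and '\r\n' line endings. The sample text is a stand-in.

```python
import re

# Stand-in text using Windows line endings.
text = "Lorem ipsum\r\ndolor sit.\r\n\r\nStet clita.\r\n\r\nLorem again."

# (?m) lets ^ match at line starts; (?s) lets . cross line breaks.
# The lookahead stops the lazy match at a blank line or at the end of the text.
pattern = re.compile(r"(?ms)^Lorem.*?(?=\r?\n\r?\n|\Z)")

matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

for start, end, paragraph in matches:
    print(start, end, repr(paragraph))
```

Swapping Lorem for Stet or Ipsum in the pattern selects the other paragraphs.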

#4

Try

^(.+?)\n\s*\n

^(.+?)\r\n\s*\r\n

Just don't forget to append an extra newline at the end of the text.
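
A minimal sketch of how the first pattern might be used from Python, assuming the re.M and re.S flags (the answer does not state which flags it expects) and a stand-in sample text. Padding the end of the text is what lets the final paragraph match.

```python
import re

# Stand-in text; the second separator line contains a stray space.
text = "Lorem ipsum\ndolor sit.\n\nStet clita.\n \nIpsum dolor."

# Pad the end so the final paragraph is also followed by a blank line.
padded = text + "\n\n"

# re.M lets ^ match at each line start; re.S lets . span the lines of a paragraph.
pattern = re.compile(r"^(.+?)\n\s*\n", re.M | re.S)

# Group 1 is the paragraph itself, without the separator.
paragraphs = [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(padded)]

for start, end, paragraph in paragraphs:
    print(start, end, repr(paragraph))
```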

#5

I tried to use the recommended regex with the default Java regex engine. That gave me a *Exception several times, so in the end I rewrote the regex and optimized it a little more.

So this is working fine for me in Java:

(?s)(.*?[^\:\-\,])(?:$|\n{2,})

This also handles a document that ends without trailing newlines, and it tries to join a line that ends with ':', '-' or ',' to the next paragraph.

To keep trailing blanks (spaces or tabs) from breaking the feature described above, I strip them beforehand with the following regex:

(?m)[[:blank:]]+$
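
Since the question is about Python, here is a rough port of the two steps to Python's `re` module. This is a sketch under assumptions: the sample text is a stand-in, and `[ \t]` replaces `[[:blank:]]` because Python's `re` does not support POSIX bracket classes.

```python
import re

# Stand-in text; the first line has trailing spaces and ends with ':'.
text = "Items:  \n\n- first,\n\n- second.\n\nNext paragraph."

# Step 1: strip trailing spaces and tabs on every line.
cleaned = re.sub(r"(?m)[ \t]+$", "", text)

# Step 2: a paragraph ends at 2+ newlines or at the end of the text,
# unless its last character is ':', '-' or ','.
pattern = re.compile(r"(?s)(.*?[^:\-,])(?:$|\n{2,})")

paragraphs = [m.group(1) for m in pattern.finditer(cleaned)]

for p in paragraphs:
    print(repr(p))
```

Here the lines ending in ':' and ',' are kept together with what follows, so the text yields two paragraphs rather than four.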
