How would one write a regular expression in Python to split paragraphs?
A paragraph break is defined by two line breaks (\n). Any number of spaces or tabs may appear together with the line breaks, and it should still be considered a paragraph break.
I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of (?P...) constructs).
Examples:
the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']
the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']
the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']
The best I could come up with is r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.:
import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)
But that is ugly. Is there anything better?
EDIT:
Suggestions rejected:
r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than two \n characters.
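As a rough check of how `\s`-based patterns behave on the examples from the question, the sketch below runs the rejected lazy pattern on example 2 and a greedy variant on example 3. The greedy variant and the names `LAZY` and `GREEDY` are additions for comparison, not part of the original suggestion.

```python
import re

LAZY = re.compile(r'\s*?\n\s*?\n\s*?')  # the rejected suggestion, as written
GREEDY = re.compile(r'\s*\n\s*\n\s*')   # greedy variant, added here for comparison

example_2 = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
example_3 = 'p1\n\n\n\tp2'

lazy_2 = LAZY.split(example_2)
greedy_3 = GREEDY.split(example_3)

print(lazy_2)    # wanted: ['p1', 'p2\t\n\tstill p2', 'p3']
print(greedy_3)  # wanted: ['p1', '\n\tp2']
```

Neither result matches what the question asks for. The lazy pattern leaves the tab in front of `p3`, because its final `\s*?` is satisfied by the empty string. The greedy pattern swallows the third `\n` of example 3, because `\s` also matches newlines.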
4 Answers
#1
4
Unfortunately there's no nice way to write "space but not a newline".
I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable:
(?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
You could also try creating a subrule just for the character class and interpolating it three times.
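The subrule idea can be sketched as follows. The names `HSPACE`, `PARAGRAPH_BREAK` and `split_paragraphs` are hypothetical, chosen for this illustration. The class `[^\S\n]` is one common idiom for "whitespace except newline"; it is an assumption that this is acceptable here, since on Python 3 strings it covers more characters than the explicit `[ \t\r\f\v]` from the question (for instance other Unicode whitespace).

```python
import re

# Subrule: a run of whitespace that contains no newline.
# [^\S\n] is a double negative: "not (non-whitespace or newline)".
HSPACE = r'[^\S\n]*'

# Interpolate the subrule three times around exactly two newlines.
PARAGRAPH_BREAK = re.compile(HSPACE + r'\n' + HSPACE + r'\n' + HSPACE)

def split_paragraphs(text):
    """Split text on two newlines, each optionally padded by spaces/tabs."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('paragraph1\n\nparagraph2'))
print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'))
print(split_paragraphs('p1\n\n\n\tp2'))
```

This sketch uses greedy quantifiers on purpose. A lazy quantifier in the final position would match the empty string and leave the whitespace that follows the break attached to the next paragraph.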
#2
2
Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?
You might be able to simply use the Docutils parser rather than roll your own.
#3
1
Not a regexp, but really elegant:
from itertools import groupby

def paragraph(lines):
    # Group consecutive lines by whether they are whitespace-only,
    # then keep only the non-blank groups, each joined into one paragraph.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        if not group_separator:
            yield ''.join(line_iteration)

for p in paragraph('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'):
    print(repr(p))
'p1\n'
'p2\t\n\tstill p2\t \n'
'\tp3'
Of course, it's up to you to strip the output as you need it.
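One way the stripping could look, as a minimal sketch: the function name `stripped_paragraphs` is hypothetical, and `str.strip()` is just one possible choice of cleanup.

```python
from itertools import groupby

def stripped_paragraphs(text):
    """Yield paragraphs separated by blank lines, without surrounding whitespace."""
    lines = text.splitlines(True)  # True keeps the line endings
    for is_blank, group in groupby(lines, key=str.isspace):
        if not is_blank:
            # Join the lines of one paragraph, then trim both ends.
            yield ''.join(group).strip()

result = list(stripped_paragraphs('p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'))
print(result)
```

Note that this approach treats any run of blank lines as a single separator, so for the third example in the question it would give `['p1', 'p2']` rather than `['p1', '\n\tp2']`.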
Inspired by the famous "Python Cookbook" ;-)
#4
0
Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace sequence.
\s*?\n\s*?\n\s*?