I am a complete newbie to Python, and I'm stuck with a regex problem. I'm trying to remove the line break character at the end of each line in a text file, but only if it follows a lowercase letter, i.e. [a-z]. If a line ends in a lowercase letter, I want to replace the line break/newline character with a space.
This is what I've got so far:
import re
import sys
textout = open("output.txt","w")
textblock = open(sys.argv[1]).read()
textout.write(re.sub("[a-z]\z","[a-z] ", textblock, re.MULTILINE) )
textout.close()
3 solutions
#1
21
Try
re.sub(r"(?<=[a-z])\r?\n"," ", textblock)
\Z only matches at the end of the string, after the last linebreak, so it's definitely not what you need here. \z is not recognized by the Python regex engine.
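To make the difference concrete, here is a minimal sketch comparing where \Z and a multiline $ can match. The helper name match_positions and the sample text are made up for illustration. Note that flags are passed by keyword: the fourth positional parameter of re.sub is count, not flags.

```python
import re

def match_positions(pattern, text, flags=0):
    """Return the start offset of every match of pattern in text (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text, flags=flags)]

text = "abc\ndef\n"

# \Z can only match at the very end of the string.
end_only = match_positions(r"\Z", text)

# With re.MULTILINE, $ also matches just before each newline.
per_line = match_positions(r"$", text, flags=re.MULTILINE)

print(end_only)
print(per_line)
```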
(?<=[a-z]) is a positive lookbehind assertion that checks whether the character before the current position is a lowercase ASCII letter. Only then will the regex engine try to match a line break.
Also, always use raw strings with regexes; it makes backslashes easier to handle.
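As a quick check of the lookbehind approach, the following sketch applies the same pattern to a small in-memory sample. The function name join_lowercase_lines and the sample string are assumptions for illustration, not part of the original answer.

```python
import re

def join_lowercase_lines(text):
    """Replace a line break with a space when it follows a lowercase ASCII letter."""
    # The lookbehind checks the preceding character without consuming it,
    # so only the line break itself is replaced.
    return re.sub(r"(?<=[a-z])\r?\n", " ", text)

sample = "first line\nSecond LINE\nthird\r\nEnd.\n"
joined = join_lowercase_lines(sample)
print(repr(joined))
```

Only the breaks after "line" and "third" are replaced; the breaks after "LINE" and "End." are left alone because the preceding character is not a lowercase letter.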
#2
2
Just as an alternative answer, although it takes more lines, I think the following may be clearer since the regular expression is simpler:
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            if re.search('[a-z]$',line):
                ofp.write(line.rstrip("\n\r")+" ")
            else:
                ofp.write(line)
... and that avoids loading the whole file into a string. If you want to use fewer lines, but still avoid positive lookbehind, you could do:
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(re.sub('(?m)([a-z])[\r\n]+$','\\1 ',line))
The parts of that regular expression are:
- (?m) [turn on multiline matching]
- ([a-z]) [match a single lower case character as the first group]
- [\r\n]+ [match one or more of carriage returns or newlines, to cover \n, \r\n and \r]
- $ [match the end of the string]
... and if that matches the line, the lowercase letter and line ending are replaced by \\1 plus a space, which will be the lowercase letter followed by a space.
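The breakdown above can also be written directly into the pattern. This is a sketch using re.VERBOSE, with the multiline flag passed as a compile flag rather than inline; the names LINE_END and join_if_lowercase are made up for illustration.

```python
import re

# Same pattern as above, spelled out with comments via re.VERBOSE.
LINE_END = re.compile(
    r"""
    ([a-z])    # group 1: a single lowercase ASCII letter
    [\r\n]+    # one or more carriage returns or newlines
    $          # end of the string (or of a line, under MULTILINE)
    """,
    re.VERBOSE | re.MULTILINE,
)

def join_if_lowercase(line):
    """Replace a trailing line ending with a space if a lowercase letter precedes it."""
    # \1 puts the captured letter back, followed by a space.
    return LINE_END.sub(r"\1 ", line)

print(repr(join_if_lowercase("abc\n")))
print(repr(join_if_lowercase("ABC\n")))
```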
#3
1
my point was that avoiding using positive lookbehind might make the code more readable
OK. Though, personally, I don't find it's less readable. It's a matter of taste.
In your EDIT:
- First, (?m) is not necessary, since "for line in ifp:" yields one line at a time, so each line's string has only one newline, at its end.
- Secondly, $ as it is placed serves no purpose, because [\r\n]+ already consumes everything up to the end of the line's string, so $ will always match there.
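A small sketch to check both points on typical per-line input: the pattern with (?m) and $ and the pattern without them should give the same result. The sample lines and the names with_extras and without_extras are assumptions for illustration.

```python
import re

with_extras = re.compile('(?m)([a-z])[\r\n]+$')
without_extras = re.compile('([a-z])[\r\n]+')

# Strings shaped like what "for line in ifp:" yields: at most one line
# ending, and only at the end of the string.
lines = ["abc\n", "ABC\n", "abc\r\n", "no newline", "\n"]

results_with = [with_extras.sub('\\1 ', line) for line in lines]
results_without = [without_extras.sub('\\1 ', line) for line in lines]

print(results_with == results_without)
```

This equivalence only holds for single lines; on a string containing several lines the two patterns still agree, but a pattern without a line-ending requirement at all would not.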
Anyway, adopting your point of view, I found two ways to avoid the lookbehind assertion:
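import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            # group 1: the line without its trailing newline; group 2: the final lowercase letter, if any
            ante_newline,lower_last = re.match('(.*?([a-z])?$)',line).groups()
            ofp.write(ante_newline+' ' if lower_last else line)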
and
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            # rstrip removes only the trailing line ending
            ofp.write(line.rstrip('\r\n')+' ' if re.search('[a-z]$',line) else line)
The second one is better: only one line, a simple match to test, no need for groups(), and the logic is natural.
EDIT: oh I realize that this second code is simply your first code rewritten in one line, Longair
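To confirm that the one-line version and the if/else version behave the same, here is a sketch that runs both over an in-memory file. The function names and the sample text are made up for illustration, and io.StringIO stands in for the real input and output files.

```python
import io
import re

def join_lines(ifp, ofp):
    """if/else version: replace the line ending with a space after a lowercase letter."""
    for line in ifp:
        if re.search(r"[a-z]$", line):
            ofp.write(line.rstrip("\r\n") + " ")
        else:
            ofp.write(line)

def join_lines_one_liner(ifp, ofp):
    """Same logic written as a single conditional expression."""
    for line in ifp:
        ofp.write(line.rstrip("\r\n") + " " if re.search(r"[a-z]$", line) else line)

sample = "first line\nSecond LINE\nthird\nEnd.\n"
out_a, out_b = io.StringIO(), io.StringIO()
join_lines(io.StringIO(sample), out_a)
join_lines_one_liner(io.StringIO(sample), out_b)

print(out_a.getvalue() == out_b.getvalue())
```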