My input file is a large .txt file of concatenated texts that I got from an open text library. I am now trying to extract only the content of the books themselves and filter out other material such as disclaimers. There are around 100 documents in my large text file (around 50 MB).
I have identified the start and end markers of the contents themselves, and decided to use a Python regex to find everything between the start and end marker. To sum it up, the regex should look for the start marker, match everything after it, stop once the end marker is reached, and then repeat these steps until the end of the file is reached.
The following code works flawlessly when I feed a small, 100 KB file into it:
import codecs
import re
outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()
When I use this regex operation on my larger file, however, it does not do anything; the program just hangs. I let it run overnight to see if it was just slow, and even after around 8 hours the program was still stuck.
I am very sure that the source of the problem is the (.*?) part of the regex, in combination with re.DOTALL. When I use a similar regex over smaller distances, the script runs fine and fast. My question now is: why does this freeze everything? I know the texts between the delimiters are not small, but a 50 MB file shouldn't be too much to handle, right? Am I maybe missing a more efficient solution?
Thanks in advance.
1 Answer
#1 (score: 10)
You are correct in thinking that the sequence `.*`, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of `.*`, leading to a result known as catastrophic backtracking.
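As a minimal, self-contained illustration of catastrophic backtracking (the classic nested-quantifier example, not the pattern from the question):

import re

# (a+)+ can split a run of 'a's between the two quantifiers in
# exponentially many ways; the engine tries them all before giving up.
pattern = re.compile(r'(a+)+b')

print(pattern.match('a' * 25 + 'b'))  # matches instantly
# pattern.match('a' * 25)             # no 'b' to find: stalls for a very long time

The same effect can occur on a larger scale when two .*? quantifiers under re.DOTALL have to cooperate across megabytes of text and the trailing part of the pattern fails to match.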
The usual solution is to replace the `.` with a character class that is much more specific, usually the production that you are trying to terminate the first `.*` with. Something like:

`[^\n]*(.*)`

so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression may not be the best approach, and to use either a context-free parser (such as pyparsing), or to first break the input into smaller, easier-to-digest chunks (for example, with corpus.split('\n')).
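A minimal sketch of that chunking idea, assuming the same marker lines as in the question: stream the file line by line and toggle a flag, so no regex ever spans more than one line.

import codecs
import re

# Marker patterns modelled on the question's delimiters.
start_re = re.compile(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')
end_re = re.compile(r'END\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK')

with codecs.open("infile.txt", "r", "utf-8-sig") as inputfile, \
     codecs.open("outfile.txt", "w", "utf-8-sig") as outfile:
    inside = False
    for line in inputfile:
        if inside and end_re.search(line):
            inside = False           # end marker: stop copying
        elif inside:
            outfile.write(line)      # book content between the markers
        elif start_re.search(line):
            inside = True            # start marker: copy from the next line

Because each search only ever sees a single line, there is nothing for the engine to backtrack across, and the 50 MB file is streamed instead of being held in memory all at once.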