I have over a million text files compressed into 40 zip files. I also have a list of about 500 model names of phones. I want to find out the number of times a particular model was mentioned in the text files.
Is there a Python module that can run a regex match on the files without unzipping them? Is there a simple way to solve this problem without unzipping?
4 answers
#1
9
There's nothing that will automatically do what you want.
However, Python's zipfile module makes this easy to do. Here's how to iterate over the lines of each file in an archive.
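#!/usr/bin/env python3
import zipfile

# Open the archive and read each member without extracting it to disk.
with zipfile.ZipFile('myfile.zip') as f:
    for subfile in f.namelist():
        print(subfile)
        # read() returns bytes; decode before splitting (UTF-8 is assumed here).
        data = f.read(subfile).decode('utf-8', errors='replace')
        for line in data.split('\n'):
            print(line)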
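Building on the loop above, the original problem (counting mentions of about 500 model names across 40 archives) might be sketched as follows. The helper name `count_mentions`, the case-insensitive whole-word matching, and the UTF-8 decoding are assumptions made for illustration, not part of this answer.

```python
import re
import zipfile
from collections import Counter


def count_mentions(zip_paths, model_names, encoding="utf-8"):
    """Count how often each model name occurs in all members of the archives.

    Hypothetical helper: matching is case-insensitive and the text encoding
    is assumed to be UTF-8; adjust both to fit the real data.
    """
    # Longest names first, so "Galaxy S5 Mini" wins over "Galaxy S5".
    # A mention of the longer name is then NOT also counted for the shorter.
    ordered = sorted(model_names, key=len, reverse=True)
    # (?<!\w) and (?!\w) stop a name from matching inside a longer token.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Map the lowercased match text back to the spelling from the list.
    canonical = {n.lower(): n for n in model_names}
    counts = Counter()
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Decompress one member into memory; nothing is written to disk.
                text = archive.read(info).decode(encoding, errors="replace")
                for match in pattern.finditer(text):
                    key = match.group(0).lower()
                    counts[canonical.get(key, key)] += 1
    return counts
```

Compiling all names into one alternation means each file is scanned once rather than 500 times. A call would look like `count_mentions(["part01.zip", "part02.zip"], models)`, where the file names and `models` list are placeholders.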
#2
0
You could loop through the zip files, reading individual files using the zipfile module and running your regex on those, eliminating the need to unzip all the files at once.
I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.
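A small experiment illustrates the point. This is only a sketch: the archive is built in memory, and the member name and sample text are made up.

```python
import io
import re
import zipfile

# Build a small in-memory archive; the member name and text are made up.
text = "I bought a Galaxy S5 yesterday. " * 50
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("review.txt", text)

pattern = re.compile(rb"Galaxy S5")

# Searching the raw archive bytes is expected to find nothing: DEFLATE has
# replaced the text with Huffman codes and back-references.
raw_hits = len(pattern.findall(buffer.getvalue()))

# Searching the decompressed member finds every occurrence.
with zipfile.ZipFile(buffer) as archive:
    member_hits = len(pattern.findall(archive.read("review.txt")))

print(raw_hits, member_hits)
```

The regex only becomes useful once the member has been decompressed, which zipfile does in memory without writing anything to disk.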
#3
0
To access the contents of a zip file you have to unzip it, although the zipfile package makes this fairly easy, as you can unzip each file within an archive individually.
Python zipfile module
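As a rough sketch of reading each member individually, `ZipFile.open` returns a file-like object for one member, which can be wrapped for line-by-line text reading. The helper name `iter_matching_lines` and the UTF-8 encoding are assumptions for illustration.

```python
import io
import re
import zipfile


def iter_matching_lines(zip_path, pattern, encoding="utf-8"):
    """Yield (member_name, line) for each line that matches `pattern`.

    Hypothetical helper: members are decompressed as a stream, so neither
    the archive nor a whole member has to be extracted to disk.
    """
    regex = re.compile(pattern)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # open() returns a binary file-like object for a single member.
            with archive.open(info) as raw:
                reader = io.TextIOWrapper(raw, encoding=encoding, errors="replace")
                for line in reader:
                    if regex.search(line):
                        yield info.filename, line.rstrip("\n")
```

Streaming this way keeps memory use low for large members; for many small text files, reading each member whole with `read()` is simpler and likely just as fast.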
#4
0
Isn't it (at least theoretically) possible to read in the ZIP's Huffman coding and then translate the regexp into the Huffman code? Might this be more efficient than first decompressing the data and then running the regexp?
(Note: I know it wouldn't be quite that simple: you'd also have to deal with other aspects of the ZIP coding—file layout, block structures, back-references—but one imagines this could be fairly lightweight.)
EDIT: Also note that it's probably much more sensible to just use the zipfile solution.