Reading a large text file and memory

Time: 2022-08-04 17:00:06

I am going to read a text file of about 7 GB.

Whenever I try to read this file, it takes longer than I expected.

For example, my laptop takes about a minute or less to read a 350 MB text file. Since 7 GB is roughly 20 times that, reading it should ideally take about 20 minutes or less, shouldn't it? In my case it takes much longer than that, and I want to shorten the time spent reading and processing the data.

I am using the following code for reading:

import json

records = []
for line in open(filename, 'r'):
    try:
        records.append(json.loads(line))
    except ValueError:  # skip lines that are not valid JSON
        pass

After reading the file, I process it to filter out the unnecessary data by building another list and discarding the previous one. If you have any suggestions, please let me know.
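
For reference, the second pass described above might look roughly like the sketch below; keep_record is a hypothetical placeholder for whatever condition decides which entries are kept:

def keep_record(record):
    # Hypothetical filter; replace with the real condition for the data you need.
    return True

# Build a second list with only the wanted records, then drop the original
# so its memory can be reclaimed.
filtered = [r for r in records if keep_record(r)]
del records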

1 solution

#1


7 votes

The 7 GB file is likely taking significantly longer than 20 × the 350 MB file because you don't have enough memory to hold all the data at once. This means that, at some point, your operating system will start swapping out some of the data, writing it from memory onto disk, so that the memory can be re-used.

This is slow because your hard disk is significantly slower than RAM, and at 7GB there will be a lot of data being read from your hard disk, put into RAM, then moved back to your page file (the file on disk your operating system uses to store data that has been copied out of RAM).

My suggestion would be to re-work your program so that it only needs to store a small portion of the file in memory at a time. Depending on your problem, you can likely do this by moving some of the logic into the loop that reads the file. For example, if your program is trying to find and print all the lines which contain "ERROR", you could re-write it from:

import json

lines = []
for line in open("myfile"):
    lines.append(json.loads(line))  # parse every line up front, keeping all of it in memory
for line in lines:
    if "ERROR" in line:
        print(line)

To:

import json

for line_str in open("myfile"):
    line_obj = json.loads(line_str)  # only one parsed line is held in memory at a time
    if "ERROR" in line_obj:
        print(line_obj)
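
Combining the same idea with the error handling from the question, a streaming version of the original read-then-filter code might look like this sketch (keep_record is again a hypothetical stand-in for the real filtering condition; only records that pass it are ever kept in memory):

import json

def keep_record(record):
    # Hypothetical filter; replace with the actual condition for "necessary" data.
    return True

filtered = []
for line in open("myfile"):
    try:
        record = json.loads(line)
    except ValueError:          # skip lines that are not valid JSON
        continue
    if keep_record(record):     # filter while streaming instead of after reading everything
        filtered.append(record)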
