Why is my Python process using so much memory?

Time: 2022-09-30 00:21:00

I'm working on a project that involves using Python to read, process and write files that are sometimes as large as a few hundred megabytes. The program fails occasionally when I try to process some particularly large files. It does not say 'memory error', but I suspect that is the problem (in fact it gives no reason at all for failing).

I've been testing the code on smaller files and watching 'top' to see what the memory usage is like; it typically reaches 60%. top says that I have 4050352k of total memory, so about 3.8 GB.

Meanwhile I'm trying to track memory usage within python itself (see my question from yesterday) with the following little bit of code:

import sys

mem = 0
for variable in dir():
    variable_ = vars()[variable]
    try:
        if str(type(variable_))[7:12] == 'numpy':
            numpy_ = True
        else:
            numpy_ = False
    except:
        numpy_ = False
    if numpy_:
        mem_ = variable_.nbytes
    else:
        mem_ = sys.getsizeof(variable_)  # size of the value, not of the name string
    mem += mem_
    print variable+' type: '+str(type(variable_))+' size: '+str(mem_)
print 'Total: '+str(mem)
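
For comparison, a minimal way to ask the OS how much memory the whole process has claimed, rather than summing per-object sizes, is the standard-library resource module. This is only a sketch, and it assumes Linux, where ru_maxrss is reported in kilobytes (it is the peak resident size, so it won't shrink after memory is released):

import resource

# Peak resident set size of this process, as recorded by the kernel.
# On Linux ru_maxrss is in kilobytes; on Mac OS X it is in bytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print 'Peak RSS: '+str(peak_kb)+' kB'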

Before I run that block I set all the variables I don't need to None, close all files and figures, and so on. After that block I use subprocess.call() to run a Fortran program that is required for the next stage of processing. Looking at top while the Fortran program is running shows that it is using ~100% of the CPU and ~5% of the memory, while Python is using 0% of the CPU and 53% of the memory. However my little snippet of code tells me that all of the variables in Python add up to only 23 MB, which ought to be ~0.5%.

So what's happening? I wouldn't expect that little snippet to give me a spot-on figure for memory usage, but surely it ought to be accurate to within a few MB? Or is it just that top doesn't notice the memory has been relinquished, but that it is available to other programs that need it if necessary?

As requested, here's a simplified part of the code that is using up all the memory. (file_name.cub is an ISIS3 cube: a file that contains 5 layers (bands) of the same map. The first layer is spectral radiance; the next 4 have to do with latitude, longitude, and other details. It's an image of Mars that I'm trying to process. StartByte is a value I previously read from the .cub file's ASCII header, telling me the beginning byte of the data; Samples and Lines are the dimensions of the map, also read from the header.):

import struct
import numpy as np
import matplotlib.pyplot as plt

radiance_array = 'cheese'   # It'll make sense in a moment (one such placeholder per output array)
f_to = open('To_file.dat','w')

f_rad = open('file_name.cub', 'rb')
f_rad.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_rad.read(StartByte-1))
header = None    
#
f_lat = open('file_name.cub', 'rb')
f_lat.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lat.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines), f_lat.read(Samples*Lines*4))
pre = None
#
f_lon = open('file_name.cub', 'rb')
f_lon.seek(0)
header=struct.unpack('%dc' % (StartByte-1), f_lon.read(StartByte-1))
header = None 
pre=struct.unpack('%df' % (Samples*Lines*2), f_lon.read(Samples*Lines*2*4))
pre = None
# (And something similar for the other two bands)
# So header and pre are just to get to the right part of the file, and are 
# then set to None. I did try using seek(), but it didn't work for some
# reason, and I ended up with this technique.
for line in range(Lines):
    sample_rad = struct.unpack('%df' % (Samples), f_rad.read(Samples*4))
    sample_rad = np.array(sample_rad)
    sample_rad[sample_rad<-3.40282265e+38] = np.nan  
    # And Similar lines for all bands
    # Then some arithmetic operations on some of the arrays
    i = 0
    for value in sample_rad:
        nextline = str(sample_lat[i])+', '+str(sample_lon[i])+', '+str(value)+'\n' # And other stuff
        f_to.write(nextline)
        i += 1
    if isinstance(radiance_array, str):  # still the 'cheese' placeholder - I'd love to know a better way to do this!
        radiance_array = sample_rad.reshape(len(sample_rad),1)
    else:
        radiance_array = np.append(radiance_array, sample_rad.reshape(len(sample_rad),1), axis=1)
        # And again, similar operations on all arrays. I end up with 5 output arrays
        # with dimensions ~830*4000. For the large files they can reach ~830x20000
f_rad.close()
f_lat.close()
f_to.close()   # etc etc 
sample_lat = None  # etc etc
sample_rad = None  # etc etc

#
plt.figure()
plt.imshow(radiance_array)
# I plot all the arrays, for diagnostic reasons

plt.show()
plt.close()

radiance_array = None  # etc etc
# I set all arrays apart from one (which I need to identify the 
# locations of nan in future) to None

# LOCATION OF MEMORY USAGE MONITOR SNIPPET FROM ABOVE
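
For reference, a rough sketch of the seek()-based skipping mentioned in the comments above, which avoids unpacking the header and the earlier bands into throw-away tuples at all. It assumes the data are plain little-endian 32-bit floats stored band after band directly behind the ASCII header, which is true for my files but may not hold in general:

band_bytes = Samples*Lines*4              # one full band of 4-byte floats

f_rad = open('file_name.cub', 'rb')
f_rad.seek(StartByte-1)                   # start of band 1 (radiance)

f_lat = open('file_name.cub', 'rb')
f_lat.seek(StartByte-1 + band_bytes)      # start of band 2 (latitude)

f_lon = open('file_name.cub', 'rb')
f_lon.seek(StartByte-1 + 2*band_bytes)    # start of band 3 (longitude)

# Then one line (row) at a time, with no struct.unpack:
sample_rad = np.fromfile(f_rad, dtype='<f4', count=Samples)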

So I lied in the comments about opening several files; it's really many handles on the same file. I only keep one array that isn't set to None, and its size is ~830x4000, yet this somehow constitutes 50% of my available memory. I've also tried gc.collect(), but no change. I'd be very happy to hear any advice on how I could improve any of that code (related to this problem or otherwise).
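
On the 'cheese' placeholder: since Samples and Lines are known before the loop, one alternative (just a sketch, reusing the names from the code above) is to preallocate each output array and fill it one column per line, which also avoids np.append copying the whole array on every iteration:

# Preallocate instead of growing with np.append (np.append copies the
# entire array each time it is called).
radiance_array = np.empty((Samples, Lines), dtype=np.float32)
for line in range(Lines):
    sample_rad = np.fromfile(f_rad, dtype='<f4', count=Samples)
    sample_rad[sample_rad < -3.40282265e+38] = np.nan
    radiance_array[:, line] = sample_rad
    # ...same for the other bands, plus the arithmetic and the f_to.write() calls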

Perhaps I should mention: originally I was opening the files in full (i.e. not line by line as above); doing it line by line was an initial attempt to save memory.

1 Answer

#1


10  

Just because you've dereferenced your variables doesn't mean the Python process has given the allocated memory back to the system. See How can I explicitly free memory in Python?
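
A minimal way to see this on Linux with nothing beyond the standard library: read the process's current VmRSS from /proc before and after dropping a large object. The exact numbers will vary, but the resident size typically does not fall back to where it started, even after gc.collect():

import gc

def rss_kb():
    # Current resident set size of this process (Linux-only, via /proc).
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS'):
                return int(line.split()[1])

big = [float(i) for i in range(5 * 10**6)]   # roughly 150 MB of Python float objects
print 'with the list: ', rss_kb(), 'kB'
big = None
gc.collect()
print 'after release: ', rss_kb(), 'kB'      # usually still far above the baseline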

Update

If gc.collect() does not work for you, investigate forking and reading/writing your files in child processes using IPC. Those processes will end when they're finished and release the memory back to the system. Your main process will continue to run with low memory usage.
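
A minimal sketch of that idea with the standard multiprocessing module (process_one_file is a stand-in for the real processing code): the heavy reading and the numpy arrays live only in the child process, and that memory goes back to the OS when the child exits, so the parent stays small.

from multiprocessing import Process

def process_one_file(in_name, out_name):
    # Stand-in for the real work: all of the reading, the numpy arrays and
    # the writing would happen here, and everything is released to the OS
    # when this child process exits.
    pass

if __name__ == '__main__':
    p = Process(target=process_one_file, args=('file_name.cub', 'To_file.dat'))
    p.start()
    p.join()     # parent waits, and keeps a small footprint of its own
    # ...then run the fortran step with subprocess.call() as before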
