How do I manage memory when overwriting R objects?

Time: 2021-04-14 13:45:54

I'm handling some large datasets and am doing what I can to stay under R's memory limits. One question came up regarding the overwriting of R objects. I have a large data.table (or any R object), and it has to be copied to tmp multiple times. The question is: does it make any difference if I delete tmp before overwriting it? In code:

for (i in 1:lots_of_times) {
     v_l_d_t_tmp <- copy(very_large_data_table) # Necessary copy of 7GB data
                                                # table on 16GB machine. I can
                                                # afford 2 but not 3 copies.
     ### do stuff to v_l_d_t_tmp and output
     rm(v_l_d_t_tmp)  # The question is whether this rm keeps max memory
                      # usage lower, or if it is equivalent to what an
                      # overwrite will automatically do on the next iteration.
}

Assume the copy is necessary (if I reach a point where I need to read very_large_data_table from disk on each iteration, I'll do that, but the question stands: will it make any difference to max memory usage if I explicitly delete v_l_d_t_tmp before loading into it again?).

Or, to teach a man to fish: what could I have typed (within R; let's not get into ps) to answer this myself?

It's totally OK if the answer turns out to be: "Trust garbage collection."

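For the teach-a-man-to-fish part, here is a minimal sketch (my addition, not part of the original question) of the base-R tools one could reach for: gc() reports current and peak memory use, gc(reset = TRUE) resets the "max used" counters, and tracemem() (available when R is built with memory profiling, as the CRAN binaries are) prints a message whenever an object is duplicated:

 x <- rnorm(1e7)      # ~80 MB vector
 tracemem(x)          # ask R to report whenever x is duplicated
 y <- x               # no message: just a new binding, no copy yet
 y[1] <- 0            # message: the modification forces the actual copy

 gc(reset = TRUE)     # reset the "max used" counters
 z <- rnorm(1e7)      # allocate some more memory
 gc()                 # the "max used" column shows the peak since the reset
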
2 Answers

#1

Here's another idea... it doesn't directly answer your question; instead, it tries to get around it by eliminating the memory problem in another way. It might get you thinking:

What if you instead cache very_large_data_table, read it in just once, do what you need to do, and then exit R? Now write a loop outside of R, and the memory problem vanishes. Granted, this costs you more CPU because you have to read in 7GB multiple times... but the memory savings may be worth it. In fact, this halves your memory use, since you never have to copy the table.

In addition, as @konvas pointed out in the comments, I too found that rm(), even with gc(), never got me what I needed in a long loop; memory would just accumulate and eventually bog things down. Exiting R is the easy way out.

I had to do this so often that I wrote a package to help me cache objects like this: simpleCache

If you're interested in trying it, it would look something like this:

Do this outside of R:

for i in $(seq 1 $lots_of_times); do
    Rscript my_script.R
done

Then in R, do this... my_script.R:

library(simpleCache)
simpleCache("very_large_data_table", {
    # R code for how you make this table
}, assignTo = "v_l_d_t_tmp")

 ### do stuff to v_l_d_t_tmp and output
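
One practical detail, not from the original answer: each Rscript invocation needs to know which iteration it is handling. A hypothetical way to do that is to pass the index as a command-line argument (e.g. Rscript my_script.R 7) and read it inside my_script.R with base R's commandArgs():

 args <- commandArgs(trailingOnly = TRUE)  # arguments after the script name
 iter <- as.integer(args[[1]])             # iteration index from the outer loop
 ### use iter to select this run's inputs and outputs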

#2

This is more of a comment than an answer, but it was getting too long.

I guess that in this case a call to rm might be appropriate. I think that, starting from the second iteration, you may have 3 tables in memory if you don't call rm. While copying the large object, R cannot free the memory occupied by the old v_l_d_t_tmp before the copy finishes, since the expression on the right-hand side could fail with an error, in which case the old object must be preserved. Consider this example:

 x <- 1:10
 myfunc <- function(y) { Sys.sleep(3); 30 }  # sleeps 3 seconds, then returns 30

Here I defined an object and a function that takes some time to do something. If you try:

 x <- myfunc()

and interrupt the execution before it finishes "naturally", the object x still exists, with its 1:10 content. So I guess that in your case, even if you use the same symbol, R cannot free its content before or during the copy. It can if you remove the object before the next copy. Of course, the object will be removed after the copy completes, but you may run out of memory while it is in progress.
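
To check this empirically from within R, here is a rough experiment (my sketch, not from the original answer) built on gc(reset = TRUE), which resets the "max used" counters so that the next gc() reports the peak since the reset. With a smaller stand-in for the 7GB table:

 library(data.table)
 big <- data.table(a = rnorm(5e7))  # ~400 MB stand-in for the real table
 tmp <- copy(big)                   # first copy, as in the loop body

 gc(reset = TRUE)
 tmp <- copy(big)                   # overwrite: the old tmp stays reachable
 gc()                               # until copy() returns; peak ~ 3 tables

 rm(tmp)
 gc(reset = TRUE)
 tmp <- copy(big)                   # after rm(), peak ~ 2 tables
 gc()

If the "max used" figures differ between the two variants, the explicit rm() really does lower the peak.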

I'm by no means an expert on R internals, so don't take what I just said for granted.
