Out of memory when modifying a large R data frame

Posted: 2021-12-07 22:55:32

I have a big data frame taking about 900 MB of RAM. Then I tried to modify it like this:

dataframe[[17]][37544]=0 

That seems to make R use more than 3 GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb". (I am on a 32-bit machine.)

I found that this way is better:

dataframe[37544, 17]=0

but R's memory footprint still doubled, and the command takes quite some time to run.

Coming from a C/C++ background, I am really confused by this behavior. I thought something like dataframe[37544, 17]=0 should complete in a blink without costing any extra memory (only one cell should be modified). What is R doing for the commands I posted? What is the right way to modify some elements of a data frame without doubling the memory footprint?

Thanks so much for your help!

Tao

4 answers

#1


8  

Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.

A useful rule of thumb is that if your largest object is N MB/GB/... in size, you need around 3*N of RAM. Such is life with an interpreted system.

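If your build of R has memory profiling enabled (the standard CRAN binaries do), you can watch the copying happen with tracemem(); the small data frame below is made up for illustration:

d = data.frame(x = runif(1e6), y = runif(1e6))
tracemem(d)     # ask R to report every duplication of d
d[1, 2] = 0     # one or more "tracemem[...]" lines print as d gets copied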

Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) relatively little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguity being the other thing R likes).

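A minimal sketch of that approach (assuming the data is all numeric, since a big.matrix holds a single atomic type):

require(bigmemory)
bm = as.big.matrix(as.matrix(dataframe))  # data now lives outside R's heap
bm[37544, 17] = 0                         # modified in place, no copy of the whole object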

#2


12  

Following up on Joran's suggestion of data.table, here are some links. Your object, at 900 MB, is manageable in RAM even in 32-bit R, with no copies at all.

When should I use the := operator in data.table?

Why has data.table defined := rather than overloading <-?

Also, data.table v1.8.0 (not yet on CRAN, but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to a matrix (appropriate for use inside loops, for example). See the latest NEWS for more details and examples. Also see ?":=", which is linked from ?data.table.

And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".

For completeness:

require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :

DT[37544, Q:=0]                # using column name (often preferred)

DT[37544, 17:=0, with=FALSE]   # using column number

col = "Q"
DT[37544, col:=0, with=FALSE]  # variable holding name

col = 17
DT[37544, col:=0, with=FALSE]  # variable holding number

set(DT,37544L,17L,0)           # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)

But, please do see the linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.

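For instance, a sketch of that combination (the key column 'id' is invented for illustration):

setkey(DT, id)           # sort by id once; i joins then use binary search
DT[J("x042"), Q := 0]    # locate the row(s) with id == "x042" and assign by reference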

#3


7  

Data frames are the worst structure you can choose to modify. Due to the quite complex handling of all their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this - it has been discussed at length several times.

The practical rule is to never use data frames for large data, unless you treat them as read-only. You will be orders of magnitude more efficient if you work on vectors or matrices instead.

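For example, assuming all the columns are numeric, a sketch of the matrix route:

m = as.matrix(dataframe)  # one-time conversion cost
m[37544, 17] = 0          # straight C-level assignment, none of the data.frame bookkeeping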

#4


4  

There is a type of object called an ffdf in the ff package, which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.

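A sketch, assuming the columns are of types ff supports and that column 17 is named 'Q':

require(ff)
fdf = as.ffdf(dataframe)  # columns now live in files on disk
fdf$Q[37544] = 0          # only the touched chunk is read and written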

You can also try the RSQLite package.

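A sketch with RSQLite (the file and table names are made up; dbExecute() needs a recent DBI, older versions used dbGetQuery() for updates):

require(RSQLite)
con = dbConnect(SQLite(), "data.db")
dbWriteTable(con, "df", dataframe)  # one-time load into SQLite
dbExecute(con, "UPDATE df SET Q = 0 WHERE rowid = 37544")  # 'Q' is a made-up column name
dbDisconnect(con)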
