Work with large lists that become too big for RAM when operated on

Time: 2021-12-31 13:48:42

Short of working on a machine with more RAM, how can I work with large lists in R, for example by putting them on disk and then working on sections of them?

Here's some code to generate the type of lists I'm using

n <- 50; i <- 100
# WORD holds strings, so preallocate a character vector
WORD <- vector(mode = "character", length = n)
# loop over k rather than i, so the list length i is not overwritten
for (k in 1:n){
  WORD[k] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}
dat <- data.frame(WORD =  WORD,
                  COUNTS = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)

In my actual use case each data frame in the list is unique, unlike the quick example here. I'm aiming for n = 4000 and i = 100,000

This is one example of what I want to do with this list of dataframes:

FUNC <- function(x) {rep(x$WORD, times = x$COUNTS)}
la <- lapply(dat_list, FUNC)

With my actual use case this runs for a few hours, fills up the RAM and most of the swap and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
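
A rough back-of-the-envelope estimate shows why the full result cannot fit in RAM. This is only a sketch under stated assumptions: COUNTS is uniform on 1:50 as in the toy example (the real data may differ), R is a 64-bit build where each element of a character vector costs an 8-byte pointer, and the strings themselves and all list overhead are ignored.

```r
# Assumed sizes, taken from the target use case in the question
n_rows <- 4000      # rows per data frame (n)
n_dfs  <- 100000    # data frames in the list (i)

# Assumption: COUNTS is uniform on 1:50, as in the toy example
mean_count <- mean(1:50)

# rep(x$WORD, times = x$COUNTS) yields sum(COUNTS) elements per data frame
total_elements <- n_rows * mean_count * n_dfs

# Assumption: 8 bytes per element (pointer size on 64-bit R)
gib <- total_elements * 8 / 1024^3
cat(sprintf("about %.0f GiB for the expanded vectors alone\n", gib))
```

Under these assumptions the expanded output is on the order of tens of GiB, so the result has to be written out or reduced piece by piece rather than held in a single list.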

I see that bigmemory is limited to matrices and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method is possible here, how might I get started? I can't get enough out of the documentation to make any progress and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (i.e. undergrad computer labs).
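
To make the "put them on disk and work on sections" idea concrete, here is a minimal base-R sketch, not a tested solution for the full-size problem. It assumes each data frame can be stored in its own .rds file and that each result can also go straight back to disk. The names make_df, tmp_dir, in_files and out_files are hypothetical and only for illustration.

```r
set.seed(1)

# Hypothetical helper: build one small data frame like those in the question
make_df <- function(n) {
  words <- replicate(n, paste(sample(c(0:9, LETTERS, letters), 5, replace = TRUE),
                              collapse = ""))
  data.frame(WORD = words,
             COUNTS = sample(1:50, n, replace = TRUE),
             stringsAsFactors = FALSE)
}

# Store every list element in its own file instead of one big in-memory list
tmp_dir <- file.path(tempdir(), "dat_list")
dir.create(tmp_dir, showWarnings = FALSE)
n_items   <- 10
in_files  <- file.path(tmp_dir, sprintf("df_%05d.rds", seq_len(n_items)))
out_files <- file.path(tmp_dir, sprintf("out_%05d.rds", seq_len(n_items)))
for (f in in_files) saveRDS(make_df(50), f)

untable <- function(x) rep(x$WORD, times = x$COUNTS)

# Process one element at a time: read, transform, write, release
for (k in seq_along(in_files)) {
  x <- readRDS(in_files[k])
  saveRDS(untable(x), out_files[k])
  rm(x)
}
```

Only one data frame and one result are in memory at any moment, at the cost of extra disk I/O. Whether that trade-off is acceptable depends on what is done with the results afterwards.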

UPDATE: Following up on the helpful comments from SimonO101 and Ari, here's some benchmarking comparing data frames and data.tables, loops and lapply, with and without gc.

# self-contained speed test of untable
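n = 50; i = 100
# WORD holds strings, so preallocate a character vector
WORD <- vector(mode = "character", length = n)
# note: this loop reuses i, so i equals n (50) afterwards and the lists built from 1:i below have 50 elements
for (i in 1:n){
  WORD[i] <- paste(sample(c(rep(0:9,each=5),LETTERS,letters),5,replace=TRUE),collapse='')
}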
# as data table
library(data.table)
dat_dt <- data.table(WORD = WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)

# as data frame
dat_df <- data.frame(WORD =  WORD, COUNTS = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)

# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))
# untable
untable <- function(x) rep(x$WORD, times = x$COUNTS)


# preallocate objects for loop to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))
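# functions for lapply
# note: `x %% 10` is truthy whenever x is NOT a multiple of 10, so gc() runs on 9 of every 10 calls
# the untabled vector is returned explicitly; otherwise the function would return the value of the if()
df_untable_gc <- function(x) { res <- untable(df[[x]]); if (x%%10) invisible(gc()); res }
dt_untable_gc <- function(x) { res <- untable(dt[[x]]); if (x%%10) invisible(gc()); res }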
# speedtests
library(microbenchmark)
microbenchmark(
  for(i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) },
  for(i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) },
  df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
  dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
  for(i in 1:length(df)) { df3[[i]] <- untable(df[[i]])},
  for(i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]])},
  df4 <- lapply(1:length(df), function(i) untable(df[[i]]) ),
  dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]) ),

  times = 10)

And here are the results. Without explicit garbage collection, data.table is much faster and lapply slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed - a lot slower! I know that using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use case and see if it makes any difference. Of course I don't have any data on memory use for any of these functions, which is really my main concern. It seems that there is no function for memory benchmarking equivalent to the timing functions (for Windows, anyway).
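
For a crude look at memory use with base R only, one option is to reset the "max used" counters that gc() reports, run the code, and read them back. This is a sketch, not a proper memory benchmark: it only sees R's own heap, the toy data below are made up for illustration, and the numbers are approximate.

```r
untable <- function(x) rep(x$WORD, times = x$COUNTS)

# Small made-up data, only to have something to measure
dat <- data.frame(WORD = sprintf("w%04d", 1:50),
                  COUNTS = rep(1:50, length.out = 50),
                  stringsAsFactors = FALSE)
dat_list <- rep(list(dat), 100)

# Reset the "max used" statistics, run the code, then read them back
invisible(gc(reset = TRUE))
res <- lapply(dat_list, untable)
mem <- gc()

# The last column of the gc() matrix is "max used" in Mb (rows: Ncells, Vcells)
peak_mb <- sum(mem[, ncol(mem)])
cat("approximate peak R heap since reset (Mb):", peak_mb, "\n")

# Size of the result object itself
print(object.size(res), units = "Kb")
```

Comparing peak_mb across the loop and lapply variants gives at least a rough ranking, although the explicit gc() calls themselves add to the run time.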

Unit: milliseconds
                                                                                                 expr
 for (i in 1:length(df)) {     df1[[i]] <- untable(df[[i]])     if (i%%10)          invisible(gc()) }
 for (i in 1:length(dt)) {     dt1[[i]] <- untable(dt[[i]])     if (i%%10)          invisible(gc()) }
                                            df2 <- lapply(1:length(df), function(i) df_untable_gc(i))
                                            dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i))
                                         for (i in 1:length(df)) {     df3[[i]] <- untable(df[[i]]) }
                                         for (i in 1:length(dt)) {     dt3[[i]] <- untable(dt[[i]]) }
                                            df4 <- lapply(1:length(df), function(i) untable(df[[i]]))
                                            dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))
          min           lq       median           uq         max neval
 37436.433962 37955.714144 38663.120340 39142.350799 39651.88118    10
 37354.456809 38493.268121 38636.424561 38914.726388 39111.20439    10
 36959.630896 37924.878498 38314.428435 38636.894810 39537.31465    10
 36917.765453 37735.186358 38106.134494 38563.217919 38751.71627    10
    28.200943    29.221901    30.205502    31.616041    34.32218    10
    10.230519    10.418947    10.665668    12.194847    14.58611    10
    26.058039    27.103217    27.560739    28.189448    30.62751    10
     8.835168     8.904956     9.214692     9.485018    12.93788    10

1 answer

If you really are going to be using very large data, you can use the h5r package to write HDF5 files. You would be writing to and reading from your hard drive on the fly instead of using RAM. I have not used it myself, so I can be of little help with its general usage; I mention it because I think there is no tutorial for it. I got this idea by thinking about pytables. I'm not sure if this solution is appropriate for you.
