Using plyr, doMC, and summarize() with a very large dataset?

Posted: 2022-03-20 09:17:20

I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr like so (simplified a bit):

library(plyr)
library(doMC)

# Register the doMC parallel backend; with no cores argument, doMC picks
# a default number of workers on its own.
registerDoMC()

# Count rows per (cat1, cat2) group, dispatching groups to the workers.
df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

If I set the number of cores explicitly to two (using registerDoMC(cores = 2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
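
A minimal sketch of that capped-core setup (the parallel::detectCores() check is an illustrative addition, not something from the code above):

library(doMC)

# See how many cores the machine reports before picking a cap.
parallel::detectCores()

# Register only two workers, so that at most two group subsets are being
# processed (and held in worker memory) at any one time.
registerDoMC(cores = 2)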

My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)

# Attempted workaround: back the data with a big.matrix. (Note that a
# big.matrix holds a single atomic type, so mixed-type columns get coerced.)
bm <- as.big.matrix(df)
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestion.

UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing became a moot point in this particular instance. However, for big-data folding tasks I'll keep data.table in mind, since I was able to replicate my folding task with it in a straightforward way.
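
For reference, a minimal sketch of what that replication might look like in data.table, assuming the same df with cat1 and cat2 columns as in the code above:

library(data.table)

dt <- as.data.table(df)

# .N is data.table's built-in row count for the current group, so this is
# the equivalent of the ddply() call above.
counts <- dt[, list(count = .N), by = c("cat1", "cat2")]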

1 Solution

#1 (score: 6)

I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).
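
One rough way to sanity-check that scaling is to measure the chunk sizes directly (a sketch; note that split() itself materializes every chunk, so run it on a sample if memory is already tight):

# Size of each (cat1, cat2) chunk in bytes.
chunks <- split(df, list(df$cat1, df$cat2), drop = TRUE)
chunk_sizes <- sapply(chunks, function(x) as.numeric(object.size(x)))

# Rough worst case: the largest chunk held by 8 workers at once.
max(chunk_sizes) * 8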

I can think of a few tips you could try:

  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient than data.frames in terms of memory use and speed. I mean normal matrices here, not big.matrix (for the counting task in the question, see the sketch after this list).
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data.
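
For the counting task in the question specifically, here is a sketch of the array route using base table() rather than adply: a single table() call already produces the per-group counts as an array, with no per-group function calls at all.

# Cross-tabulate the two category columns; the result is an array of counts.
counts_array <- table(cat1 = df$cat1, cat2 = df$cat2)

# Convert back to long data.frame form, comparable to the ddply() output.
counts_df <- as.data.frame(counts_array, responseName = "count")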
