Using plyr, doMC, and summarize() with a very large dataset?

Posted: 2022-03-20 09:17:20

I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr like so (simplified a bit):

library(plyr)
library(doMC)

# Register the doMC parallel backend; with no cores argument, doMC picks
# a default number of workers on its own.
registerDoMC()

# Count rows per (cat1, cat2) group, dispatching groups to the workers.
df <- ddply(df, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

If I set the number of cores explicitly to two (using registerDoMC(cores = 2)), my 8 GB of RAM sees me through, and it shaves off a decent amount of time. However, if I let it use all 8 cores, I quickly run out of memory, because each of the forked processes appears to clone the entire dataset in memory.
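
A minimal sketch of that capped-core setup (the parallel::detectCores() check is an illustrative addition, not something from the code above):

library(doMC)

# See how many cores the machine reports before picking a cap.
parallel::detectCores()

# Register only two workers, so that at most two group subsets are being
# processed (and held in worker memory) at any one time.
registerDoMC(cores = 2)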

My question is whether it is possible to use plyr's parallel execution facilities in a more memory-thrifty way. I tried converting my data frame to a big.matrix, but this simply seemed to force the whole thing back to a single core:

library(plyr)
library(doMC)
registerDoMC()
library(bigmemory)

# Attempted workaround: back the data with a big.matrix. (Note that a
# big.matrix holds a single atomic type, so mixed-type columns get coerced.)
bm <- as.big.matrix(df)
df <- mdply(bm, c("cat1", "cat2"), summarize, count = length(cat2), .parallel = TRUE)

This is my first foray into multicore R computing, so if there is a better way of thinking about this, I'm open to suggestion.

UPDATE: As with many things in life, it turns out I was doing Other Stupid Things elsewhere in my code, and the whole issue of multi-processing became a moot point in this particular instance. However, for big-data folding tasks I'll keep data.table in mind, since I was able to replicate my folding task with it in a straightforward way.
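
For reference, a minimal sketch of what that replication might look like in data.table, assuming the same df with cat1 and cat2 columns as in the code above:

library(data.table)

dt <- as.data.table(df)

# .N is data.table's built-in row count for the current group, so this is
# the equivalent of the ddply() call above.
counts <- dt[, list(count = .N), by = c("cat1", "cat2")]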

1 Solution

#1 (score: 6)

I do not think that plyr makes copies of the entire dataset. However, when processing a chunk of data, that subset is copied to the worker. Therefore, when using more workers, more subsets are in memory simultaneously (i.e. 8 instead of 2).
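
One rough way to sanity-check that scaling is to measure the chunk sizes directly (a sketch; note that split() itself materializes every chunk, so run it on a sample if memory is already tight):

# Size of each (cat1, cat2) chunk in bytes.
chunks <- split(df, list(df$cat1, df$cat2), drop = TRUE)
chunk_sizes <- sapply(chunks, function(x) as.numeric(object.size(x)))

# Rough worst case: the largest chunk held by 8 workers at once.
max(chunk_sizes) * 8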

I can think of a few tips you could try:

  • Put your data into an array structure instead of a data.frame and use adply to do the summarizing. Arrays are much more efficient than data.frames in terms of memory use and speed. I mean normal matrices here, not big.matrix (for the counting task in the question, see the sketch after this list).
  • Give data.table a try; in some cases this can lead to a speed increase of several orders of magnitude. I'm not sure whether data.table supports parallel processing, but even without parallelization, data.table can be hundreds of times faster. See a blog post of mine comparing ave, ddply and data.table for processing chunks of data.
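
For the counting task in the question specifically, here is a sketch of the array route using base table() rather than adply: a single table() call already produces the per-group counts as an array, with no per-group function calls at all.

# Cross-tabulate the two category columns; the result is an array of counts.
counts_array <- table(cat1 = df$cat1, cat2 = df$cat2)

# Convert back to long data.frame form, comparable to the ddply() output.
counts_df <- as.data.frame(counts_array, responseName = "count")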
