使用不同列对数据框架进行rbind的有效方法

时间:2022-08-22 22:54:59

I have a list of data frames with different set of columns, I would like combine them by rows into one data frame. I use plyr::rbind.fill to do that. I am looking something that would do it more efficiently. Similar to the answer given here

我有一组具有不同列的数据帧,我想按行将它们组合成一个数据帧。我使用plyr :: rbind.fill来做到这一点。我正在寻找能够更有效地完成工作的东西。类似于这里给出的答案

require(plyr)

set.seed(45)
sample.fun <- function() {
   nam <- sample(LETTERS, sample(5:15))
   val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
   setNames(val, nam)  
}
ll <- replicate(1e4, sample.fun())
rbind.fill(ll)

3 个解决方案

#1


32  

UPDATE: See this updated answer instead.

更新:请参阅此更新的答案。

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

UPDATE(eddi):现在已经在1.8.11版本中实现了它作为rbind的填充参数。例如:

DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)

rbind(DT1, DT2, fill = TRUE)
#   a  b  c
#1: 1  1 NA
#2: 2  2 NA
#3: 3 NA  1
#4: 4 NA  2

FR #4790 added now - rbind.fill (from plyr) like functionality to merge list of data.frames/data.tables

Note 1:

This solution uses data.table's rbindlist function to "rbind" list of data.tables and for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.

此解决方案使用data.table的rbindlist函数来“rbind”data.tables列表,为此,请确保使用版本1.8.9,因为版本<1.8.9中存在此错误。

Note 2:

rbindlist when binding lists of data.frames/data.tables, as of now, will retain the data type of the first column. That is, if a column in first data.frame is character and the same column in the 2nd data.frame is "factor", then, rbindlist will result in this column being a character. So, if your data.frame consisted of all character columns, then, your solution with this method will be identical to the plyr method. If not, the values will still be the same, but some columns will be character instead of factor. You'll have to convert to "factor" yourself after. Hopefully this behaviour will change in the future.

绑定data.frames / data.tables列表时的rbindlist,截至目前,将保留第一列的数据类型。也就是说,如果第一个data.frame中的列是字符,而第二个data.frame中的相同列是“factor”,则rbindlist将导致此列为字符。因此,如果您的data.frame由所有字符列组成,那么,使用此方法的解决方案将与plyr方法相同。如果不是,则值仍然相同,但有些列将是字符而不是因子。你必须自己转换为“因子”。希望这种行为将来会改变。

And now here's using data.table (and benchmarking comparison with rbind.fill from plyr):

现在这里使用data.table(与plyr的rbind.fill进行基准比较):

require(data.table)
rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        invisible(alloc.col(tt))
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}

rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}

require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
#                      expr      min        lq    median        uq       max neval
#   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10


# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))

identical(t1, t2) # [1] TRUE

It should be noted that plyr's rbind.fill edges past this particular data.table solution until list size of about 500.

应该注意的是,plyr的rbind.fill会超过这个特定的data.table解决方案,直到列表大小约为500。

Benchmarking plot:

Here's the plot on runs with list length of data.frames with seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps on each of these different list lengths.

这是带有seq(1000,10000,by = 1000)的data.frames列表长度的运行图。我已经在这些不同的列表长度中使用了每个10个代表的microbenchmark。

使用不同列对数据框架进行rbind的有效方法

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.

这是基准测试的要点,以防有人想要复制结果。

#2


15  

Now that rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all, this answer of mine seems a bit too outdated.

现在,对于data.table的rbindlist(和rbind)已经改进了v1.9.3(开发版本)中最近的更改/提交的功能和速度,而dplyr有一个更快的版本的plyr的rbind.fill,名为rbind_all,我的这个答案似乎有点过时了。

Here's the relevant NEWS entry for rbindlist:

这是rbindlist的相关新闻条目:

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.

So, I've benchmarked the newer (and faster versions) on relatively bigger data below.

所以,我已经在下面相对较大的数据上对较新的(和更快的版本)进行了基准测试。


New Benchmark:

We'll create a total of 10,000 data.tables with columns ranging from 200-300 with the total number of columns after binding to be 500.

我们将创建总共10,000个data.tables,列数范围为200-300,绑定后的列总数为500。

Functions to create data:

require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
set.seed(1L)
names = paste0("V", 1:500)
foo <- function() {
    cols = sample(200:300, 1)
    data = setDT(lapply(1:cols, function(x) sample(10)))
    setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    .Call("Csetlistelt", ll, i, foo())
}

And here are the timings:

以下是时间安排:

## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.993   0.106   2.107 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.644   0.092   1.744 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.297   0.088   1.389 


## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
#   user  system elapsed  
#  9.525   0.121   9.761 

#   user  system elapsed  
#  9.194   0.112   9.370 

#   user  system elapsed  
#  8.665   0.081   8.780 

identical(ans1, setDT(ans2)) # [1] TRUE

#3


5  

There is still something to be gained if you parallelize both rbind.fill and rbindlist. The results are done with data.table version 1.8.8 as version 1.8.9 got bricked when I tried it with the parallelized function. So the results aren't identical between data.table and plyr, but they are identical within data.table or plyr solution. Meaning parallel plyr matches to unparallel plyr, and vice versa.

如果你并行化rbind.fill和rbindlist,仍然可以获得一些东西。结果是使用data.table版本1.8.8完成的,因为当我尝试使用并行化函数时,版本1.8.9变得很糟糕。因此data.table和plyr之间的结果并不相同,但它们在data.table或plyr解决方案中是相同的。意思是并行plyr匹配不平行的plyr,反之亦然。

Here's the benchmark/scripts. The parallel.rbind.fill.DT looks horrible, but that's the fastest one I could pull.

这是基准/脚本。 parallel.rbind.fill.DT看起来很糟糕,但那是我能拉的最快的。

require(plyr)
require(data.table)
require(ggplot2)
require(rbenchmark)
require(parallel) 

# data.table::rbindlist solutions
rbind.fill.DT <- function(ll) {
  all.names <- lapply(ll, names)
  unq.names <- unique(unlist(all.names))
  rbindlist(lapply(seq_along(ll), function(x) {
    tt <- ll[[x]]
    setattr(tt, 'class', c('data.table', 'data.frame'))
    data.table:::settruelength(tt, 0L)
    invisible(alloc.col(tt))
    tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
    setcolorder(tt, unq.names)
  }))
}

 parallel.rbind.fill.DT <- function(ll, cluster=NULL){
   all.names <- lapply(ll, names)
   unq.names <- unique(unlist(all.names)) 
   if(is.null(cluster)){
     ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
       tt <- ll[[x]]
       setattr(tt, 'class', c('data.table', 'data.frame'))
       data.table:::settruelength(tt, 0L)
       invisible(alloc.col(tt))
       tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
       setcolorder(tt, unq.names)
     }))
   }else{
     cores <- length(cluster)
     sequ <- as.integer(seq(1, length(ll), length.out = cores+1))
     Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ") 
     ll <- eval(parse(text=paste("list(", Call, ")")))
     rbindlist(clusterApply(cluster, ll, function(ll, unq.names){
        rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) {
          tt <- ll[[x]]
          setattr(tt, 'class', c('data.table', 'data.frame'))
          data.table:::settruelength(tt, 0L)
          invisible(alloc.col(tt))
          tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_]
          setcolorder(tt, unq.names)
        }, ll=ll, unq.names=unq.names))
      }, unq.names=unq.names))
    }
  }           


# plyr::rbind.fill solutions
rbind.fill.PLYR <- function(ll) {
  rbind.fill(ll)
}

parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){
  if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){
    rbind.fill(ll)
  }else{
    cores <- length(cluster)
    sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst)))
    Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ") 
    ll <- eval(parse(text=paste("list(", Call, ")")))
    rbind.fill(parLapply(cluster, ll, rbind.fill))
  }
} 

# Function to generate sample data of varying list length
set.seed(45)
sample.fun <- function() {
  nam <- sample(LETTERS, sample(5:15))
  val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
  setNames(val, nam)
}

ll <- replicate(10000, sample.fun())
cl <- makeCluster(4, type="SOCK")
clusterEvalQ(cl, library(data.table))
clusterEvalQ(cl, library(plyr))
benchmark(t1 <- rbind.fill.PLYR(ll),
  t2 <- rbind.fill.DT(ll),
  t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400),
  t4 <- parallel.rbind.fill.DT(ll, cluster=cl),
  replications=5)
stopCluster(cl)

# Results for rbinding 10000 dataframes
# done with 4 cores, i5 3570k and 16gb memory
# test                          reps elapsed relative 
# rbind.fill.PLYR                 5  321.80    16.682   
# rbind.fill.DT                   5   26.10    1.353    
# parallel.rbind.fill.PLYR        5   28.00    1.452     
# parallel.rbind.fill.DT          5   19.29    1.000    

# checking are results equal
t1 <- as.matrix(t1)
t2 <- as.matrix(t2)
t3 <- as.matrix(t3)
t4 <- as.matrix(t4)

t1 <- t1[order(t1[, 1], t1[, 2]), ]
t2 <- t2[order(t2[, 1], t2[, 2]), ]
t3 <- t3[order(t3[, 1], t3[, 2]), ]
t4 <- t4[order(t4[, 1], t4[, 2]), ]

identical(t2, t4) # TRUE
identical(t1, t3) # TRUE
identical(t1, t2) # FALSE, mismatch between plyr and data.table

As you can see parallesizing rbind.fill made it comparable to data.table, and you could get marginal increase of speed by parallesizing data.table even with this low of a dataframe count.

正如你所看到的那样,使用rbind.fill进行平移可以使它与data.table相当,而且即使数据帧数量很少,你也可以通过对data.table进行参数化来获得速度的微小增加。

#1


32  

UPDATE: See this updated answer instead.

更新:请参阅此更新的答案。

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

UPDATE(eddi):现在已经在1.8.11版本中实现了它作为rbind的填充参数。例如:

DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)

rbind(DT1, DT2, fill = TRUE)
#   a  b  c
#1: 1  1 NA
#2: 2  2 NA
#3: 3 NA  1
#4: 4 NA  2

FR #4790 added now - rbind.fill (from plyr) like functionality to merge list of data.frames/data.tables

Note 1:

This solution uses data.table's rbindlist function to "rbind" list of data.tables and for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.

此解决方案使用data.table的rbindlist函数来“rbind”data.tables列表,为此,请确保使用版本1.8.9,因为版本<1.8.9中存在此错误。

Note 2:

rbindlist when binding lists of data.frames/data.tables, as of now, will retain the data type of the first column. That is, if a column in first data.frame is character and the same column in the 2nd data.frame is "factor", then, rbindlist will result in this column being a character. So, if your data.frame consisted of all character columns, then, your solution with this method will be identical to the plyr method. If not, the values will still be the same, but some columns will be character instead of factor. You'll have to convert to "factor" yourself after. Hopefully this behaviour will change in the future.

绑定data.frames / data.tables列表时的rbindlist,截至目前,将保留第一列的数据类型。也就是说,如果第一个data.frame中的列是字符,而第二个data.frame中的相同列是“factor”,则rbindlist将导致此列为字符。因此,如果您的data.frame由所有字符列组成,那么,使用此方法的解决方案将与plyr方法相同。如果不是,则值仍然相同,但有些列将是字符而不是因子。你必须自己转换为“因子”。希望这种行为将来会改变。

And now here's using data.table (and benchmarking comparison with rbind.fill from plyr):

现在这里使用data.table(与plyr的rbind.fill进行基准比较):

require(data.table)
rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        invisible(alloc.col(tt))
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}

rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}

require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
#                      expr      min        lq    median        uq       max neval
#   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10


# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))

identical(t1, t2) # [1] TRUE

It should be noted that plyr's rbind.fill edges past this particular data.table solution until list size of about 500.

应该注意的是,plyr的rbind.fill会超过这个特定的data.table解决方案,直到列表大小约为500。

Benchmarking plot:

Here's the plot on runs with list length of data.frames with seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps on each of these different list lengths.

这是带有seq(1000,10000,by = 1000)的data.frames列表长度的运行图。我已经在这些不同的列表长度中使用了每个10个代表的microbenchmark。

使用不同列对数据框架进行rbind的有效方法

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.

这是基准测试的要点,以防有人想要复制结果。

#2


15  

Now that rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all, this answer of mine seems a bit too outdated.

现在,对于data.table的rbindlist(和rbind)已经改进了v1.9.3(开发版本)中最近的更改/提交的功能和速度,而dplyr有一个更快的版本的plyr的rbind.fill,名为rbind_all,我的这个答案似乎有点过时了。

Here's the relevant NEWS entry for rbindlist:

这是rbindlist的相关新闻条目:

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.

So, I've benchmarked the newer (and faster versions) on relatively bigger data below.

所以,我已经在下面相对较大的数据上对较新的(和更快的版本)进行了基准测试。


New Benchmark:

We'll create a total of 10,000 data.tables with columns ranging from 200-300 with the total number of columns after binding to be 500.

我们将创建总共10,000个data.tables,列数范围为200-300,绑定后的列总数为500。

Functions to create data:

require(data.table) ## 1.9.3 commit 1267
require(dplyr)      ## commit 1504 devel
set.seed(1L)
names = paste0("V", 1:500)
foo <- function() {
    cols = sample(200:300, 1)
    data = setDT(lapply(1:cols, function(x) sample(10)))
    setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    .Call("Csetlistelt", ll, i, foo())
}

And here are the timings:

以下是时间安排:

## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.993   0.106   2.107 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.644   0.092   1.744 
system.time(ans1 <- rbindlist(ll, fill=TRUE))
#   user  system elapsed 
#  1.297   0.088   1.389 


## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
#   user  system elapsed  
#  9.525   0.121   9.761 

#   user  system elapsed  
#  9.194   0.112   9.370 

#   user  system elapsed  
#  8.665   0.081   8.780 

identical(ans1, setDT(ans2)) # [1] TRUE

#3


5  

There is still something to be gained if you parallelize both rbind.fill and rbindlist. The results are done with data.table version 1.8.8 as version 1.8.9 got bricked when I tried it with the parallelized function. So the results aren't identical between data.table and plyr, but they are identical within data.table or plyr solution. Meaning parallel plyr matches to unparallel plyr, and vice versa.

如果你并行化rbind.fill和rbindlist,仍然可以获得一些东西。结果是使用data.table版本1.8.8完成的,因为当我尝试使用并行化函数时,版本1.8.9变得很糟糕。因此data.table和plyr之间的结果并不相同,但它们在data.table或plyr解决方案中是相同的。意思是并行plyr匹配不平行的plyr,反之亦然。

Here's the benchmark/scripts. The parallel.rbind.fill.DT looks horrible, but that's the fastest one I could pull.

这是基准/脚本。 parallel.rbind.fill.DT看起来很糟糕,但那是我能拉的最快的。

require(plyr)
require(data.table)
require(ggplot2)
require(rbenchmark)
require(parallel) 

# data.table::rbindlist solutions
rbind.fill.DT <- function(ll) {
  all.names <- lapply(ll, names)
  unq.names <- unique(unlist(all.names))
  rbindlist(lapply(seq_along(ll), function(x) {
    tt <- ll[[x]]
    setattr(tt, 'class', c('data.table', 'data.frame'))
    data.table:::settruelength(tt, 0L)
    invisible(alloc.col(tt))
    tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
    setcolorder(tt, unq.names)
  }))
}

 parallel.rbind.fill.DT <- function(ll, cluster=NULL){
   all.names <- lapply(ll, names)
   unq.names <- unique(unlist(all.names)) 
   if(is.null(cluster)){
     ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
       tt <- ll[[x]]
       setattr(tt, 'class', c('data.table', 'data.frame'))
       data.table:::settruelength(tt, 0L)
       invisible(alloc.col(tt))
       tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
       setcolorder(tt, unq.names)
     }))
   }else{
     cores <- length(cluster)
     sequ <- as.integer(seq(1, length(ll), length.out = cores+1))
     Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ") 
     ll <- eval(parse(text=paste("list(", Call, ")")))
     rbindlist(clusterApply(cluster, ll, function(ll, unq.names){
        rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) {
          tt <- ll[[x]]
          setattr(tt, 'class', c('data.table', 'data.frame'))
          data.table:::settruelength(tt, 0L)
          invisible(alloc.col(tt))
          tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_]
          setcolorder(tt, unq.names)
        }, ll=ll, unq.names=unq.names))
      }, unq.names=unq.names))
    }
  }           


# plyr::rbind.fill solutions
rbind.fill.PLYR <- function(ll) {
  rbind.fill(ll)
}

parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){
  if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){
    rbind.fill(ll)
  }else{
    cores <- length(cluster)
    sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst)))
    Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ") 
    ll <- eval(parse(text=paste("list(", Call, ")")))
    rbind.fill(parLapply(cluster, ll, rbind.fill))
  }
} 

# Function to generate sample data of varying list length
set.seed(45)
sample.fun <- function() {
  nam <- sample(LETTERS, sample(5:15))
  val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10))
  setNames(val, nam)
}

ll <- replicate(10000, sample.fun())
cl <- makeCluster(4, type="SOCK")
clusterEvalQ(cl, library(data.table))
clusterEvalQ(cl, library(plyr))
benchmark(t1 <- rbind.fill.PLYR(ll),
  t2 <- rbind.fill.DT(ll),
  t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400),
  t4 <- parallel.rbind.fill.DT(ll, cluster=cl),
  replications=5)
stopCluster(cl)

# Results for rbinding 10000 dataframes
# done with 4 cores, i5 3570k and 16gb memory
# test                          reps elapsed relative 
# rbind.fill.PLYR                 5  321.80    16.682   
# rbind.fill.DT                   5   26.10    1.353    
# parallel.rbind.fill.PLYR        5   28.00    1.452     
# parallel.rbind.fill.DT          5   19.29    1.000    

# checking are results equal
t1 <- as.matrix(t1)
t2 <- as.matrix(t2)
t3 <- as.matrix(t3)
t4 <- as.matrix(t4)

t1 <- t1[order(t1[, 1], t1[, 2]), ]
t2 <- t2[order(t2[, 1], t2[, 2]), ]
t3 <- t3[order(t3[, 1], t3[, 2]), ]
t4 <- t4[order(t4[, 1], t4[, 2]), ]

identical(t2, t4) # TRUE
identical(t1, t3) # TRUE
identical(t1, t2) # FALSE, mismatch between plyr and data.table

As you can see parallesizing rbind.fill made it comparable to data.table, and you could get marginal increase of speed by parallesizing data.table even with this low of a dataframe count.

正如你所看到的那样,使用rbind.fill进行平移可以使它与data.table相当,而且即使数据帧数量很少,你也可以通过对data.table进行参数化来获得速度的微小增加。