I have a list of data frames with different sets of columns, and I would like to combine them by rows into one data frame. I use plyr::rbind.fill
to do that. I am looking for something that would do it more efficiently, similar to the answer given here.
require(plyr)
set.seed(45)
sample.fun <- function() {
    nam <- sample(LETTERS, sample(5:15))
    val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
    setNames(val, nam)
}
ll <- replicate(1e4, sample.fun())
rbind.fill(ll)
3 Answers
#1
32
UPDATE: See this updated answer instead.
UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:
DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)
rbind(DT1, DT2, fill = TRUE)
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 NA 1
#4: 4 NA 2
FR #4790 added now - rbind.fill (from plyr) like functionality to merge list of data.frames/data.tables
Note 1:
This solution uses data.table's rbindlist function to "rbind" a list of data.tables; for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.
Note 2:
rbindlist, when binding lists of data.frames/data.tables, will as of now retain the data type of a column's first occurrence. That is, if a column in the first data.frame is character and the same column in the 2nd data.frame is factor, then rbindlist will result in this column being character. So, if your data.frames consist of all character columns, the solution with this method will be identical to the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
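A minimal sketch (not from the answer; df1, df2 and the column name x are placeholders used only for illustration) of the behaviour described in Note 2, plus the manual conversion afterwards:
require(data.table)
df1 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)   # character column
df2 <- data.frame(x = c("c", "d"), stringsAsFactors = TRUE)    # factor column
res <- rbindlist(list(df1, df2))
class(res$x)               # per Note 2, the type of the first occurrence is kept
res[, x := as.factor(x)]   # convert back to factor yourself afterwards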
And now here's using data.table (with a benchmarking comparison against rbind.fill from plyr):
require(data.table)
rbind.fill.DT <- function(ll) {
    # changed sapply to lapply to return a list always
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        invisible(alloc.col(tt))
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}
rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}
require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
#                      expr      min        lq    median        uq       max neval
#   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10
# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))
identical(t1, t2) # [1] TRUE
It should be noted that plyr's rbind.fill edges past this particular data.table solution up to a list size of about 500.
Benchmarking plot:
Here's the plot of runs on list lengths of data.frames from seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps on each of these different list lengths.
Benchmarking gist:
Here's the gist for benchmarking, in case anyone wants to replicate the results.
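A rough sketch (not the original gist) of how the data behind such a plot can be generated, reusing sample.fun(), rbind.fill.DT() and rbind.fill.PLYR() from above; list.lengths is a name introduced here only for illustration:
list.lengths <- seq(1000, 10000, by=1000)
timings <- lapply(list.lengths, function(n) {
    ll <- replicate(n, sample.fun())
    microbenchmark(PLYR = rbind.fill.PLYR(ll), DT = rbind.fill.DT(ll), times=10)
})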
#2
15
Now that rbindlist (and rbind) for data.table has improved functionality and speed with the recent changes/commits in v1.9.3 (development version), and dplyr has a faster version of plyr's rbind.fill, named rbind_all, this answer of mine seems a bit outdated.
Here's the relevant NEWS entry for rbindlist:
o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
-> use.names by default is FALSE for backwards compatibility (doesn't bind by
names by default)
-> rbind(...) now just calls rbindlist() internally, except that 'use.names'
is TRUE by default, for compatibility with base (and backwards compatibility).
-> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
-> At least one item of the input list has to have non-null column names.
-> Duplicate columns are bound in the order of occurrence, like base.
-> Attributes that might exist in individual items would be lost in the bound result.
-> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
-> And incredibly fast ;).
-> Documentation updated in much detail. Closes DR #5158.
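A small illustration (not part of the original answer; DT1, DT2 and DT3 are made up here) of the use.names and fill arguments described in the NEWS entry above:
require(data.table)
DT1 = data.table(a = 1:2, b = 3:4)
DT2 = data.table(b = 5:6, a = 7:8)
DT3 = data.table(a = 9:10, c = 11:12)
rbindlist(list(DT1, DT2), use.names = FALSE)              # bind by position
rbindlist(list(DT1, DT2), use.names = TRUE)               # bind by matching column names
rbindlist(list(DT1, DT3), use.names = TRUE, fill = TRUE)  # fill the missing columns with NA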
So, I've benchmarked the newer (and faster) versions on relatively bigger data below.
New Benchmark:
We'll create a total of 10,000 data.tables, each with between 200 and 300 columns, with a total of 500 columns after binding.
Functions to create data:
require(data.table) ## 1.9.3 commit 1267
require(dplyr) ## commit 1504 devel
set.seed(1L)
names = paste0("V", 1:500)
foo <- function() {
    cols = sample(200:300, 1)
    data = setDT(lapply(1:cols, function(x) sample(10)))
    setnames(data, sample(names)[1:cols])
}
n = 10e3L
ll = vector("list", n)
for (i in 1:n) {
    # assign foo()'s result into ll[[i]] in place via data.table's internal C routine
    .Call("Csetlistelt", ll, i, foo())
}
And here are the timings:
## Updated timings on data.table v1.9.5 - three consecutive runs:
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.993 0.106 2.107
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.644 0.092 1.744
system.time(ans1 <- rbindlist(ll, fill=TRUE))
# user system elapsed
# 1.297 0.088 1.389
## dplyr's rbind_all - Timings for three consecutive runs
system.time(ans2 <- rbind_all(ll))
# user system elapsed
# 9.525 0.121 9.761
# user system elapsed
# 9.194 0.112 9.370
# user system elapsed
# 8.665 0.081 8.780
identical(ans1, setDT(ans2)) # [1] TRUE
#3
5
There is still something to be gained if you parallelize both rbind.fill and rbindlist. The results are done with data.table version 1.8.8, as version 1.8.9 got bricked when I tried it with the parallelized function. So the results aren't identical between data.table and plyr, but they are identical within the data.table or plyr solutions, meaning parallel plyr matches unparallelized plyr, and vice versa.
Here's the benchmark/scripts. The parallel.rbind.fill.DT looks horrible, but that's the fastest one I could pull.
require(plyr)
require(data.table)
require(ggplot2)
require(rbenchmark)
require(parallel)
# data.table::rbindlist solutions
rbind.fill.DT <- function(ll) {
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    rbindlist(lapply(seq_along(ll), function(x) {
        tt <- ll[[x]]
        setattr(tt, 'class', c('data.table', 'data.frame'))
        data.table:::settruelength(tt, 0L)
        invisible(alloc.col(tt))
        tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
        setcolorder(tt, unq.names)
    }))
}
parallel.rbind.fill.DT <- function(ll, cluster=NULL){
    all.names <- lapply(ll, names)
    unq.names <- unique(unlist(all.names))
    if(is.null(cluster)){
        ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
            tt <- ll[[x]]
            setattr(tt, 'class', c('data.table', 'data.frame'))
            data.table:::settruelength(tt, 0L)
            invisible(alloc.col(tt))
            tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
            setcolorder(tt, unq.names)
        }))
    }else{
        # split ll into one roughly equal chunk per cluster node
        cores <- length(cluster)
        sequ <- as.integer(seq(1, length(ll), length.out = cores+1))
        Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ")
        ll <- eval(parse(text=paste("list(", Call, ")")))
        # bind each chunk on a worker, then bind the partial results together
        rbindlist(clusterApply(cluster, ll, function(ll, unq.names){
            rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) {
                tt <- ll[[x]]
                setattr(tt, 'class', c('data.table', 'data.frame'))
                data.table:::settruelength(tt, 0L)
                invisible(alloc.col(tt))
                tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_]
                setcolorder(tt, unq.names)
            }, ll=ll, unq.names=unq.names))
        }, unq.names=unq.names))
    }
}
# plyr::rbind.fill solutions
rbind.fill.PLYR <- function(ll) {
    rbind.fill(ll)
}
parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){
    if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){
        rbind.fill(ll)
    }else{
        # split ll into chunks (magicConst controls the chunk size) and rbind.fill each chunk in parallel
        cores <- length(cluster)
        sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst)))
        Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ")
        ll <- eval(parse(text=paste("list(", Call, ")")))
        rbind.fill(parLapply(cluster, ll, rbind.fill))
    }
}
# Function to generate sample data of varying list length
set.seed(45)
sample.fun <- function() {
    nam <- sample(LETTERS, sample(5:15))
    val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
    setNames(val, nam)
}
ll <- replicate(10000, sample.fun())
cl <- makeCluster(4, type="SOCK")
clusterEvalQ(cl, library(data.table))
clusterEvalQ(cl, library(plyr))
benchmark(t1 <- rbind.fill.PLYR(ll),
          t2 <- rbind.fill.DT(ll),
          t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400),
          t4 <- parallel.rbind.fill.DT(ll, cluster=cl),
          replications=5)
stopCluster(cl)
# Results for rbinding 10000 dataframes
# done with 4 cores, i5 3570k and 16gb memory
#                      test reps elapsed relative
#          rbind.fill.PLYR    5  321.80   16.682
#            rbind.fill.DT    5   26.10    1.353
# parallel.rbind.fill.PLYR    5   28.00    1.452
#   parallel.rbind.fill.DT    5   19.29    1.000
# checking are results equal
t1 <- as.matrix(t1)
t2 <- as.matrix(t2)
t3 <- as.matrix(t3)
t4 <- as.matrix(t4)
t1 <- t1[order(t1[, 1], t1[, 2]), ]
t2 <- t2[order(t2[, 1], t2[, 2]), ]
t3 <- t3[order(t3[, 1], t3[, 2]), ]
t4 <- t4[order(t4[, 1], t4[, 2]), ]
identical(t2, t4) # TRUE
identical(t1, t3) # TRUE
identical(t1, t2) # FALSE, mismatch between plyr and data.table
As you can see, parallelizing rbind.fill made it comparable to data.table, and you could get a marginal speed increase by parallelizing data.table even with this low a data.frame count.
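For comparison, with a data.table version where rbindlist itself supports use.names and fill (see the NEWS entry quoted in answer #2), the same chunk-then-bind idea can be sketched far more compactly; parallel.rbindlist.fill is a hypothetical name used here only for illustration:
parallel.rbindlist.fill <- function(ll, cluster) {
    # one roughly equal chunk of the list per cluster node
    chunks <- split(ll, cut(seq_along(ll), length(cluster), labels = FALSE))
    # bind each chunk on a worker, then bind the partial results
    parts <- parLapply(cluster, chunks, function(x)
        data.table::rbindlist(x, use.names = TRUE, fill = TRUE))
    rbindlist(parts, use.names = TRUE, fill = TRUE)
}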