For each level of factor I need to extract values aggregated over all subsets of data.frame except the current one. For example, there is a several subjects doing a reaction time task during several days, and I need to compute mean reaction time for all subjects and all days, but not including the subject for whom the mean is computed. Currently, I do it like this:
对于每个层次的因子,我需要提取聚集在data.frame的所有子集(当前子集除外)上的值。例如,有几个受试者在几天内做一个反应时间任务,我需要计算所有受试者的平均反应时间以及一整天,但不包括计算平均值的受试者。目前,我是这样做的:
library(lme4)
ddply(sleepstudy, .(Subject, Days), summarise,
avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject &
sleepstudy$Days == Days,"Reaction"]), .progress="text")
It works fine for small data sets, but for large ones it can be very slow. Is there a way to do it faster?
它适用于小数据集,但对于大数据集,它可能非常慢。有没有更快的方法?
2 个解决方案
#1
3
#create big dataset
n <- 1e4
set.seed(1)
sleepstudy <- data.frame(Reaction=rnorm(n),Subject=1:4,Days=sort(rep((1:(n/4)),4)))
library(plyr)
system.time(
res <- ddply(sleepstudy, .(Subject, Days), summarise,
avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject &
sleepstudy$Days == Days,"Reaction"]))
)
#User System elapsed
#6.532 0.013 6.556
#use data.table for big datasets
library(data.table)
dt<- as.data.table(sleepstudy)
system.time(
{dt[,avg_rt:=mean(Reaction),by=Days];
dt[,n:=.N,by=Days];
dt[,avg_rt:=(avg_rt*n-Reaction)/(n-1)]}
)
#User System elapsed
#0.005 0.001 0.005
#test if results are equal
dt2 <- as.data.table(res)
setkey(dt2,Subject,Days)
setkey(dt,Subject,Days)
all.equal(dt[,avg_rt],dt2[,avg_rt])
#[1] TRUE
For really large datasets the speed gain should be more pronounced. I just couldn't compare with larger datasets since ddply
is so slow.
对于非常大的数据集,速度增益应该更明显。因为ddply很慢,所以我无法与更大的数据集进行比较。
#2
0
Maybe it's faster with lapply
and aggregate
:
lapply和aggregate可能更快:
do.call("rbind", (lapply(unique(sleepstudy$Subject),
function(x)
cbind(Subject = x,
aggregate(Reaction ~ Days,
subset(sleepstudy, Subject != x),
mean)))))
Update:
更新:
I compared both commands with system.time
and it appears the original is slower.
我将这两个命令与系统进行了比较。时间和它似乎原来是慢的。
library(lme4)
library(plyr)
system.time(
ddply(sleepstudy, .(Subject, Days), summarise,
avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject &
sleepstudy$Days == Days,"Reaction"]))
)
# user system elapsed
# 0.17 0.00 0.22
system.time(
do.call("rbind", (lapply(unique(sleepstudy$Subject),
function(x)
cbind(Subject = x,
aggregate(Reaction ~ Days,
subset(sleepstudy, Subject != x),
mean)))))
)
# user system elapsed
# 0.12 0.00 0.12
#1
3
#create big dataset
n <- 1e4
set.seed(1)
sleepstudy <- data.frame(Reaction=rnorm(n),Subject=1:4,Days=sort(rep((1:(n/4)),4)))
library(plyr)
system.time(
res <- ddply(sleepstudy, .(Subject, Days), summarise,
avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject &
sleepstudy$Days == Days,"Reaction"]))
)
#User System elapsed
#6.532 0.013 6.556
#use data.table for big datasets
library(data.table)
dt<- as.data.table(sleepstudy)
system.time(
{dt[,avg_rt:=mean(Reaction),by=Days];
dt[,n:=.N,by=Days];
dt[,avg_rt:=(avg_rt*n-Reaction)/(n-1)]}
)
#User System elapsed
#0.005 0.001 0.005
#test if results are equal
dt2 <- as.data.table(res)
setkey(dt2,Subject,Days)
setkey(dt,Subject,Days)
all.equal(dt[,avg_rt],dt2[,avg_rt])
#[1] TRUE
For really large datasets the speed gain should be more pronounced. I just couldn't compare with larger datasets since ddply
is so slow.
对于非常大的数据集,速度增益应该更明显。因为ddply很慢,所以我无法与更大的数据集进行比较。
#2
0
Maybe it's faster with lapply
and aggregate
:
lapply和aggregate可能更快:
do.call("rbind", (lapply(unique(sleepstudy$Subject),
function(x)
cbind(Subject = x,
aggregate(Reaction ~ Days,
subset(sleepstudy, Subject != x),
mean)))))
Update:
更新:
I compared both commands with system.time
and it appears the original is slower.
我将这两个命令与系统进行了比较。时间和它似乎原来是慢的。
library(lme4)
library(plyr)
system.time(
ddply(sleepstudy, .(Subject, Days), summarise,
avg_rt = mean(sleepstudy[sleepstudy$Subject != Subject &
sleepstudy$Days == Days,"Reaction"]))
)
# user system elapsed
# 0.17 0.00 0.22
system.time(
do.call("rbind", (lapply(unique(sleepstudy$Subject),
function(x)
cbind(Subject = x,
aggregate(Reaction ~ Days,
subset(sleepstudy, Subject != x),
mean)))))
)
# user system elapsed
# 0.12 0.00 0.12