I have a data set spanning 25 years that looks similar to the following:
        date name        value tag
1 2014-12-01    f -0.338578654  12
2 2014-12-01    a  0.323379254   4
3 2014-12-01    f  0.004163806   9
4 2014-12-01    f  1.365219477   2
5 2014-12-01    l -1.225602543   7
6 2014-12-01    d -0.308544089   9
This is how to replicate it:
set.seed(9)
# 50 rows per month, from January 1990 through January 2015
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-01"), by = "months"), each = 50)
N <- length(date)
name <- sample(letters, N, replace = TRUE)
value <- rnorm(N)
tag <- sample(1:50, N, replace = TRUE)
mydata <- data.frame(date, name, value, tag)
head(mydata)
I would like to create a new matrix that stores values satisfying multiple criteria, for instance the sum of all values that have name j and tag i. I use two for loops and the which() function to select the matching values, like this:
S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
for (i in 1:nrow(S)) {
  for (j in 1:ncol(S)) {
    foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
    S[i,j] <- sum(mydata$value[foo])
  }
}
This is ok for small data sets, but too slow for larger ones. Is it possible to avoid the for-loops or somehow speed up the process?
1 answer
You can use dcast from the reshape2 package, passing sum as the aggregation function:
library(reshape2)
dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
Or simply use xtabs from base R:
xtabs(value~name+tag, mydata)
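Note that both calls above put names in rows and tags in columns, with sorted labels, whereas the loop in the question puts tags in rows and names in columns in order of first appearance. The following is a minimal sketch, assuming the example data from the question, of how one might rearrange the xtabs result into the layout of S and spot-check a few cells against the loop logic. The names tab, tags, names_seen, S_fast and check are illustrative and not part of the original post.

```r
# Rebuild the example data from the question.
set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-01"), by = "months"), each = 50)
N <- length(date)
name <- sample(letters, N, replace = TRUE)
value <- rnorm(N)
tag <- sample(1:50, N, replace = TRUE)
mydata <- data.frame(date, name, value, tag)

# Tags in rows, names in columns, as in the matrix S from the question.
tab <- xtabs(value ~ tag + name, mydata)

# Reorder rows and columns by first appearance, the order unique() returns.
tags <- unique(mydata$tag)
names_seen <- unique(mydata$name)
S_fast <- unclass(tab)[as.character(tags), as.character(names_seen)]

# Spot-check the first 3 x 3 cells against the which()-style filter.
check <- sapply(1:3, function(i) sapply(1:3, function(j) {
  rows <- mydata$tag == tags[i] & mydata$name == names_seen[j]
  isTRUE(all.equal(sum(mydata$value[rows]), as.numeric(S_fast[i, j])))
}))
all(check)
```

Indexing by character labels rather than by position is what keeps the row and column order aligned with the loop version.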
Some benchmarks:
funcPer = function(){
  S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
  for (i in 1:nrow(S)) {
    for (j in 1:ncol(S)) {
      foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
      S[i,j] <- sum(mydata$value[foo])
    }
  }
}
colonel1 = function() dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
colonel2 = function() xtabs(value~name+tag, mydata)
#> system.time(colonel1())
# user system elapsed
# 0.01 0.00 0.01
#> system.time(colonel2())
# user system elapsed
# 0.05 0.00 0.05
#> system.time(funcPer())
# user system elapsed
# 4.67 0.00 4.82
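Elapsed times of a few hundredths of a second are close to the resolution of system.time, so the relative ranking of the two fast variants should be read with some caution. One possible way to get a steadier figure is to repeat the call and average. This is a sketch only: n_rep and per_call are illustrative names, the repetition count is an arbitrary choice, and only the base R variant is timed here so that the snippet runs without reshape2.

```r
# Rebuild the example data from the question.
set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-01"), by = "months"), each = 50)
N <- length(date)
name <- sample(letters, N, replace = TRUE)
value <- rnorm(N)
tag <- sample(1:50, N, replace = TRUE)
mydata <- data.frame(date, name, value, tag)

colonel2 <- function() xtabs(value ~ name + tag, mydata)

# Run the call n_rep times and report the average elapsed time per call.
n_rep <- 20  # arbitrary repetition count, chosen for illustration
timing <- system.time(for (k in seq_len(n_rep)) colonel2())
per_call <- timing[["elapsed"]] / n_rep
per_call
```

The same wrapper can be applied to colonel1 once reshape2 is loaded.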