Let's say I want to create a column in a data.table, in which the value in each row is equal to the standard deviation of the values in three other cells in the same row. E.g., if I make
library(data.table)
DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
DT
a b c d
1: 1 5 13 25
2: 2 7 16 29
3: 3 9 19 33
4: 4 11 22 37
and I'd like to add a column that contains the standard deviation of a, b, and d for each row, like this:
a b c d abdSD
1: 1 5 13 25 12.86
2: 2 7 16 29 14.36
3: 3 9 19 33 15.87
4: 4 11 22 37 17.39
I could of course write a for-loop or use an apply function to calculate this. Unfortunately, what I actually want to do needs to be applied to millions of rows, isn't as simple a function as calculating a standard deviation, and needs to finish within a fraction of a second, so I really need a vectorized solution. I want to write something like
DT[, abdSD := sd(c(a, b, d))]
but unfortunately that doesn't give the right answer. Is there any data.table syntax that can create a vector out of different values within the same row, and make that vector accessible to a function populating a new cell within that row? Any help would be greatly appreciated. @Arun
5 Answers
#1
2
Depending on the size of your data, you might want to convert the data into a long format, then calculate the result as follows:
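# stand-in for the real per-row function; takes one vector of row values
complexFunc <- function(x) sd(x)
cols <- c("a", "b", "d")
# add a row number, melt to long format, then apply the function per row number
rowres <- melt(DT[, rn := .I], id.vars = "rn", variable.factor = FALSE)[,
    list(abdRes = complexFunc(value[variable %chin% cols])), by = .(rn)]
# join the per-row results back onto the original table
DT[rowres, on = .(rn)]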
Or, if your complex function takes 3 arguments (one value per column), you can do something like:
DT[, abdSD := mapply(complexFunc, a, b, d)]
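For the mapply() variant to run, the function has to accept three separate arguments; the one-argument complexFunc defined earlier would fail with an "unused arguments" error. A minimal sketch, assuming a hypothetical three-argument wrapper named complexFunc3 (not part of the original answer):

```r
library(data.table)

DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))

# Hypothetical three-argument function: receives one scalar per column on each call
complexFunc3 <- function(x, y, z) sd(c(x, y, z))

# mapply() walks a, b and d in parallel and calls complexFunc3 once per row
DT[, abdSD := mapply(complexFunc3, a, b, d)]
print(DT)
```

Note that mapply() is still an R-level loop over rows, so it is convenient rather than truly vectorized.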
#2
1
As @Frank mentioned, you can avoid adding a column by grouping with by=1:nrow(DT):
DT[, abdSD:=sd(c(a,b,d)),by=1:nrow(DT)]
Output:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
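To see why the grouping matters, here is a small sketch comparing the ungrouped and grouped calls on the sample data. The variable names ungrouped and grouped are only for illustration:

```r
library(data.table)

DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))

# Without by: c(a, b, d) is one length-12 vector, so sd() returns a single number
ungrouped <- DT[, sd(c(a, b, d))]
print(length(ungrouped))

# With by = 1:nrow(DT): every group is one row, so c(a, b, d) has length 3
grouped <- DT[, .(abdSD = sd(c(a, b, d))), by = 1:nrow(DT)]
print(grouped)
```

With := the single ungrouped value would be recycled into every row, which is why the original attempt gives the same wrong number everywhere.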
If you add a row_id column, it becomes very easy:
DT[, row_id := .I]
Grouping with by=row_id then gives you the result you want:
DT[, abdSD:=sd(c(a,b,d)),by=row_id]
The result would be:
a b c d row_id abdSD
1: 1 5 13 25 1 12.85820
2: 2 7 16 29 2 14.36431
3: 3 9 19 33 3 15.87451
4: 4 11 22 37 4 17.38774
If you want row_id removed, simply append [,row_id:=NULL]:
DT[, abdSD:=sd(c(a,b,d)),by=row_id][,row_id:=NULL]
This line gives you everything you want:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
You just have to do it by row.
Without by, data.table evaluates the expression once over the whole columns, so sd(c(a,b,d)) sees all 12 values at once and returns a single number; grouping by row makes it see one row at a time. It's a bit tricky.
Hope this helps.
#3
0
I think you should try the matrixStats package:
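library(data.table)
library(matrixStats)
# sample data
dt <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
# rowSds() computes the standard deviation of each row of the matrix built from .SDcols
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=TRUE)), .SDcols=c('a','b','d')]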
dt
Output is:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
#4
0
Not an answer, just trying to show the difference between using apply and the solution provided by Prem above:
I have blown up the sample data to 40,000 rows to show solid time differences:
library(data.table)
library(matrixStats)
# sample data
dt <- data.table(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
df <- data.frame(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
# time the data.table + matrixStats version
t0 = Sys.time()
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=TRUE)), .SDcols=c('a','b','d')]
print(paste("Time taken for data table operation = ",Sys.time() - t0))
# [1] "Time taken for data table operation = 0.117115020751953"
# time the row-wise apply() version on a data.frame
t0 = Sys.time()
df$abdSD <- apply(df[,c("a","b","d")],1, function(x){sd(x)})
print(paste("Time taken for apply operation = ",Sys.time() - t0))
# [1] "Time taken for apply operation = 2.93488311767578"
Using data.table and matrixStats clearly wins the race.
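As a possible alternative to subtracting Sys.time() values, base R's system.time() reports elapsed seconds directly. The sketch below is only illustrative: it assumes the same 40,000-row layout, the names n, t_vec and t_apply are made up here, and the timings will differ from machine to machine.

```r
library(data.table)
library(matrixStats)

n <- 40000
dt <- data.table(a = 1:n, b = rep(c(5, 7, 9, 11), n / 4), d = rep(c(25, 29, 33, 37), n / 4))
m <- as.matrix(dt[, .(a, b, d)])

# system.time() returns user, system and elapsed seconds for the expression
t_vec <- system.time(res_vec <- rowSds(m))
t_apply <- system.time(res_apply <- apply(m, 1, sd))

print(t_vec["elapsed"])
print(t_apply["elapsed"])

# Both approaches should agree up to floating-point error
stopifnot(isTRUE(all.equal(unname(res_vec), unname(res_apply))))
```

Converting to a matrix once, outside the timed expressions, keeps the comparison focused on the row-wise computation itself.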
#5
0
It's not hard to vectorize sd for this situation:
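# row-wise sample SD from column arithmetic: sqrt(n/(n-1) * (mean(x^2) - mean(x)^2))
vecSD = function(x) {
  n = ncol(x)  # number of values per row
  sqrt((n/(n-1)) * (Reduce(`+`, x*x)/n - (Reduce(`+`, x)/n)^2))
}
DT[, vecSD(.SD), .SDcols = c('a', 'b', 'd')]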
#[1] 12.85820 14.36431 15.87451 17.38774
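A quick way to check a hand-vectorized function like this is to compare it against a plain row-wise apply() on the same columns. The sketch below assumes the sample DT from the question; the names fast and slow are only for illustration:

```r
library(data.table)

DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))

vecSD <- function(x) {
  n <- ncol(x)
  sqrt((n / (n - 1)) * (Reduce(`+`, x * x) / n - (Reduce(`+`, x) / n)^2))
}

cols <- c("a", "b", "d")

# Vectorized version: whole-column arithmetic, no per-row function calls
fast <- DT[, vecSD(.SD), .SDcols = cols]

# Reference version: one sd() call per row
slow <- apply(as.matrix(DT[, .SD, .SDcols = cols]), 1, sd)

# The two should match up to floating-point error
stopifnot(isTRUE(all.equal(fast, slow)))
print(fast)
```

One caveat: the mean-of-squares minus squared-mean form can lose precision when the values are large relative to their spread, so a check like this on realistic data is worth running.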