I have a data table in R:
R中有一个数据表:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
x y v
[1,] 1 A 12
[2,] 1 B 62
[3,] 1 A 60
[4,] 1 B 61
[5,] 2 A 83
[6,] 2 B 97
[7,] 2 A 1
[8,] 2 B 22
[9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49
I can easily sum the variable v by the groups in the data.table:
我可以很容易地用数据中的组对变量v求和。
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
x y SUM
[1,] 1 A 72
[2,] 1 B 123
[3,] 2 A 84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B 96
However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape
:
但是,我希望组(y)作为列,而不是行。我可以通过重塑来实现这一点:
out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
x SUM.A SUM.B
[1,] 1 72 123
[2,] 2 84 119
[3,] 3 162 96
Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?
是否有一种更有效的方法在聚合数据后重塑数据?是否有办法将这些操作组合成一个步骤,使用这些数据。表操作吗?
4 个解决方案
#1
72
The data.table
package implements faster melt/dcast
functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.
数据。表包实现了更快的熔融/dcast功能(在C语言中)。它还允许熔融和浇铸多列。请参阅使用数据的新的高效整形。表在Github上。
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
融化/ dcast函数数据。从v1.9.0开始就有了表格,其特点包括:
-
There is no need to load
reshape2
package prior to casting. But if you want it loaded for other operations, please load it before loadingdata.table
.在浇铸之前不需要加载reshape2包。但是,如果您希望加载它用于其他操作,请在加载data.table之前加载它。
-
dcast
is also a S3 generic. No moredcast.data.table()
. Just usedcast()
.dcast也是S3通用的。不再dcast.data.table()。只使用dcast()。
-
melt
:融化:
-
is capable of melting on columns of type 'list'.
可以融化在“列表”的列上。
-
gains
variable.factor
andvalue.factor
which by default areTRUE
andFALSE
respectively for compatibility withreshape2
. This allows for directly controlling the output type ofvariable
andvalue
columns (as factors or not).收益变量。因素和价值。对于与reshape2的兼容性,默认为真和假的因子。这允许直接控制变量和值列的输出类型(作为因素或不作为因素)。
-
melt.data.table
'sna.rm = TRUE
parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.melt.data。表的na。rm = TRUE parameter在熔化过程中被内部优化以直接去除NAs,因此效率更高。
-
NEW:
melt
can accept a list formeasure.vars
and columns specified in each element of the list will be combined together. This is faciliated further through the use ofpatterns()
. See vignette or?melt
.新:熔体可以接受列表测量。在列表的每个元素中指定的vars和列将被合并在一起。通过使用patterns()进一步促进了这一点。看到小插图或?融化。
-
-
dcast
:dcast:
-
accepts multiple
fun.aggregate
and multiplevalue.var
. See vignette or?dcast
.接受多种乐趣。聚合和多个value.var。看到小插图或? dcast。
-
use
rowid()
function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.使用rowid()函数直接在公式中生成id-column,这有时需要惟一地标识行。看到了什么? dcast。
-
-
Old benchmarks:
古老的基准:
-
melt
: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds. - 融化:1000万行和5列,61.3秒减少到1.2秒。
-
dcast
: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds. - dcast: 100万行4列,192秒减少到3.6秒。
-
Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast
pull request to reshape2
?
科隆提醒(2013年12月)演示幻灯片32:为什么不提交一个dcast拉拽请求来重构2?
#2
31
This feature is now implemented into data.table (from version 1.8.11 on), as can be seen in Zach's answer above.
I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table
solution. Applied to this problem:
我刚从阿伦那里看到了一大块代码。我想有一个数据。表解决方案。应用到这个问题:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6),
y=c("A","B"),
v=sample(1:100,12))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
})
x A B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
This gives the same results as DWin's approach:
这与DWin的方法得到了相同的结果:
tapply(DT$v,list(DT$x, DT$y), FUN=sum)
A B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650
Also, it is fast:
此外,它是快:
system.time({
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
## user system elapsed
## 0.64 0.05 0.70
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user system elapsed
## 7.23 0.16 7.39
UPDATE
更新
So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:
因此,该解决方案也适用于非平衡数据集(即不存在某些组合),您必须首先将它们输入到数据表中:
library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)
intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
Summary
总结
Combining the comments with the above, here's the 1-line solution:
结合上面的评论,这里有一个一行的解决方案:
DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(V1), paste(y)), by = x]
It's also easy to modify this to have more than just the sum, e.g.:
也很容易修改它,使它不仅仅包含和,例如:
DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
# x A.sum B.sum A.mean B.mean
#1: 1 72 123 36.00000 61.5
#2: 2 84 119 42.00000 59.5
#3: 3 187 96 62.33333 48.0
#4: 4 NA 81 NA 81.0
#3
21
Data.table objects inherit from 'data.frame' so you can just use tapply:
数据。表对象继承自“data.frame”,因此您可以使用tapply:
> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
AA BB
a 72 123
b 84 119
c 162 96
#4
7
You can use dcast
from reshape2
library. Here is the code
您可以使用reshape2库中的dcast。这是代码
# DUMMY DATA
library(data.table)
mydf = data.table(
x = rep(1:3, each = 4),
y = rep(c('A', 'B'), times = 2),
v = rpois(12, 30)
)
# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun = sum, value_var = "v")
NOTE: The tapply
solution would be much faster.
注意:tapply解决方案要快得多。
#1
72
The data.table
package implements faster melt/dcast
functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.
数据。表包实现了更快的熔融/dcast功能(在C语言中)。它还允许熔融和浇铸多列。请参阅使用数据的新的高效整形。表在Github上。
melt/dcast functions for data.table have been available since v1.9.0 and the features include:
融化/ dcast函数数据。从v1.9.0开始就有了表格,其特点包括:
-
There is no need to load
reshape2
package prior to casting. But if you want it loaded for other operations, please load it before loadingdata.table
.在浇铸之前不需要加载reshape2包。但是,如果您希望加载它用于其他操作,请在加载data.table之前加载它。
-
dcast
is also a S3 generic. No moredcast.data.table()
. Just usedcast()
.dcast也是S3通用的。不再dcast.data.table()。只使用dcast()。
-
melt
:融化:
-
is capable of melting on columns of type 'list'.
可以融化在“列表”的列上。
-
gains
variable.factor
andvalue.factor
which by default areTRUE
andFALSE
respectively for compatibility withreshape2
. This allows for directly controlling the output type ofvariable
andvalue
columns (as factors or not).收益变量。因素和价值。对于与reshape2的兼容性,默认为真和假的因子。这允许直接控制变量和值列的输出类型(作为因素或不作为因素)。
-
melt.data.table
'sna.rm = TRUE
parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.melt.data。表的na。rm = TRUE parameter在熔化过程中被内部优化以直接去除NAs,因此效率更高。
-
NEW:
melt
can accept a list formeasure.vars
and columns specified in each element of the list will be combined together. This is faciliated further through the use ofpatterns()
. See vignette or?melt
.新:熔体可以接受列表测量。在列表的每个元素中指定的vars和列将被合并在一起。通过使用patterns()进一步促进了这一点。看到小插图或?融化。
-
-
dcast
:dcast:
-
accepts multiple
fun.aggregate
and multiplevalue.var
. See vignette or?dcast
.接受多种乐趣。聚合和多个value.var。看到小插图或? dcast。
-
use
rowid()
function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.使用rowid()函数直接在公式中生成id-column,这有时需要惟一地标识行。看到了什么? dcast。
-
-
Old benchmarks:
古老的基准:
-
melt
: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds. - 融化:1000万行和5列,61.3秒减少到1.2秒。
-
dcast
: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds. - dcast: 100万行4列,192秒减少到3.6秒。
-
Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast
pull request to reshape2
?
科隆提醒(2013年12月)演示幻灯片32:为什么不提交一个dcast拉拽请求来重构2?
#2
31
This feature is now implemented into data.table (from version 1.8.11 on), as can be seen in Zach's answer above.
I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table
solution. Applied to this problem:
我刚从阿伦那里看到了一大块代码。我想有一个数据。表解决方案。应用到这个问题:
library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6),
y=c("A","B"),
v=sample(1:100,12))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
})
x A B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650
This gives the same results as DWin's approach:
这与DWin的方法得到了相同的结果:
tapply(DT$v,list(DT$x, DT$y), FUN=sum)
A B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650
Also, it is fast:
此外,它是快:
system.time({
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
## user system elapsed
## 0.64 0.05 0.70
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user system elapsed
## 7.23 0.16 7.39
UPDATE
更新
So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:
因此,该解决方案也适用于非平衡数据集(即不存在某些组合),您必须首先将它们输入到数据表中:
library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))
out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)
intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
Summary
总结
Combining the comments with the above, here's the 1-line solution:
结合上面的评论,这里有一个一行的解决方案:
DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(V1), paste(y)), by = x]
It's also easy to modify this to have more than just the sum, e.g.:
也很容易修改它,使它不仅仅包含和,例如:
DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
# x A.sum B.sum A.mean B.mean
#1: 1 72 123 36.00000 61.5
#2: 2 84 119 42.00000 59.5
#3: 3 187 96 62.33333 48.0
#4: 4 NA 81 NA 81.0
#3
21
Data.table objects inherit from 'data.frame' so you can just use tapply:
数据。表对象继承自“data.frame”,因此您可以使用tapply:
> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
AA BB
a 72 123
b 84 119
c 162 96
#4
7
You can use dcast
from reshape2
library. Here is the code
您可以使用reshape2库中的dcast。这是代码
# DUMMY DATA
library(data.table)
mydf = data.table(
x = rep(1:3, each = 4),
y = rep(c('A', 'B'), times = 2),
v = rpois(12, 30)
)
# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun = sum, value_var = "v")
NOTE: The tapply
solution would be much faster.
注意:tapply解决方案要快得多。