正确的/最快的方式重塑数据表

时间:2022-09-16 11:39:08

I have a data table in R:

R中有一个数据表:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

我可以很容易地用数据中的组对变量v求和。

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

但是,我希望组(y)作为列,而不是行。我可以通过重塑来实现这一点:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

是否有一种更有效的方法在聚合数据后重塑数据?是否有办法将这些操作组合成一个步骤,使用这些数据。表操作吗?

4 个解决方案

#1


72  

The data.table package implements faster melt/dcast functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.

数据。表包实现了更快的熔融/dcast功能(在C语言中)。它还允许熔融和浇铸多列。请参阅使用数据的新的高效整形。表在Github上。

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

融化/ dcast函数数据。从v1.9.0开始就有了表格,其特点包括:

  • There is no need to load reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.

    在浇铸之前不需要加载reshape2包。但是,如果您希望加载它用于其他操作,请在加载data.table之前加载它。

  • dcast is also a S3 generic. No more dcast.data.table(). Just use dcast().

    dcast也是S3通用的。不再dcast.data.table()。只使用dcast()。

  • melt:

    融化:

    • is capable of melting on columns of type 'list'.

      可以融化在“列表”的列上。

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

      收益变量。因素和价值。对于与reshape2的兼容性,默认为真和假的因子。这允许直接控制变量和值列的输出类型(作为因素或不作为因素)。

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

      melt.data。表的na。rm = TRUE parameter在熔化过程中被内部优化以直接去除NAs,因此效率更高。

    • NEW: melt can accept a list for measure.vars and columns specified in each element of the list will be combined together. This is faciliated further through the use of patterns(). See vignette or ?melt.

      新:熔体可以接受列表测量。在列表的每个元素中指定的vars和列将被合并在一起。通过使用patterns()进一步促进了这一点。看到小插图或?融化。

  • dcast:

    dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

      接受多种乐趣。聚合和多个value.var。看到小插图或? dcast。

    • use rowid() function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.

      使用rowid()函数直接在公式中生成id-column,这有时需要惟一地标识行。看到了什么? dcast。

  • Old benchmarks:

    古老的基准:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • 融化:1000万行和5列,61.3秒减少到1.2秒。
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
    • dcast: 100万行4列,192秒减少到3.6秒。

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?

科隆提醒(2013年12月)演示幻灯片32:为什么不提交一个dcast拉拽请求来重构2?

#2


31  

This feature is now implemented into data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

我刚从阿伦那里看到了一大块代码。我想有一个数据。表解决方案。应用到这个问题:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6), 
                  y=c("A","B"), 
                  v=sample(1:100,12))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
})
   x        A        B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650

This gives the same results as DWin's approach:

这与DWin的方法得到了相同的结果:

tapply(DT$v,list(DT$x, DT$y), FUN=sum)
         A        B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650

Also, it is fast:

此外,它是快:

system.time({ 
   out <- DT[,list(SUM=sum(v)),by=list(x,y)]
   out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
##  user  system elapsed 
## 0.64    0.05    0.70 
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user  system elapsed 
## 7.23    0.16    7.39 

UPDATE

更新

So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:

因此,该解决方案也适用于非平衡数据集(即不存在某些组合),您必须首先将它们输入到数据表中:

library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)

intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]

out[, as.list(setattr(SUM, 'names', y)), by=list(x)]

Summary

总结

Combining the comments with the above, here's the 1-line solution:

结合上面的评论,这里有一个一行的解决方案:

DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(V1), paste(y)), by = x]

It's also easy to modify this to have more than just the sum, e.g.:

也很容易修改它,使它不仅仅包含和,例如:

DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
#   x A.sum B.sum   A.mean B.mean
#1: 1    72   123 36.00000   61.5
#2: 2    84   119 42.00000   59.5
#3: 3   187    96 62.33333   48.0
#4: 4    NA    81       NA   81.0

#3


21  

Data.table objects inherit from 'data.frame' so you can just use tapply:

数据。表对象继承自“data.frame”,因此您可以使用tapply:

> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
   AA  BB
a  72 123
b  84 119
c 162  96

#4


7  

You can use dcast from reshape2 library. Here is the code

您可以使用reshape2库中的dcast。这是代码

# DUMMY DATA
library(data.table)
mydf = data.table(
  x = rep(1:3, each = 4),
  y = rep(c('A', 'B'), times = 2),
  v = rpois(12, 30)
)

# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun = sum, value_var = "v")

NOTE: The tapply solution would be much faster.

注意:tapply解决方案要快得多。

#1


72  

The data.table package implements faster melt/dcast functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.

数据。表包实现了更快的熔融/dcast功能(在C语言中)。它还允许熔融和浇铸多列。请参阅使用数据的新的高效整形。表在Github上。

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

融化/ dcast函数数据。从v1.9.0开始就有了表格,其特点包括:

  • There is no need to load reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.

    在浇铸之前不需要加载reshape2包。但是,如果您希望加载它用于其他操作,请在加载data.table之前加载它。

  • dcast is also a S3 generic. No more dcast.data.table(). Just use dcast().

    dcast也是S3通用的。不再dcast.data.table()。只使用dcast()。

  • melt:

    融化:

    • is capable of melting on columns of type 'list'.

      可以融化在“列表”的列上。

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

      收益变量。因素和价值。对于与reshape2的兼容性,默认为真和假的因子。这允许直接控制变量和值列的输出类型(作为因素或不作为因素)。

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

      melt.data。表的na。rm = TRUE parameter在熔化过程中被内部优化以直接去除NAs,因此效率更高。

    • NEW: melt can accept a list for measure.vars and columns specified in each element of the list will be combined together. This is faciliated further through the use of patterns(). See vignette or ?melt.

      新:熔体可以接受列表测量。在列表的每个元素中指定的vars和列将被合并在一起。通过使用patterns()进一步促进了这一点。看到小插图或?融化。

  • dcast:

    dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

      接受多种乐趣。聚合和多个value.var。看到小插图或? dcast。

    • use rowid() function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.

      使用rowid()函数直接在公式中生成id-column,这有时需要惟一地标识行。看到了什么? dcast。

  • Old benchmarks:

    古老的基准:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • 融化:1000万行和5列,61.3秒减少到1.2秒。
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
    • dcast: 100万行4列,192秒减少到3.6秒。

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?

科隆提醒(2013年12月)演示幻灯片32:为什么不提交一个dcast拉拽请求来重构2?

#2


31  

This feature is now implemented into data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

我刚从阿伦那里看到了一大块代码。我想有一个数据。表解决方案。应用到这个问题:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=1e6), 
                  y=c("A","B"), 
                  v=sample(1:100,12))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
# edit (mnel) to avoid setNames which creates a copy
# when calling `names<-` inside the function
out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
})
   x        A        B
1: 1 26499966 28166677
2: 2 26499978 28166673
3: 3 26500056 28166650

This gives the same results as DWin's approach:

这与DWin的方法得到了相同的结果:

tapply(DT$v,list(DT$x, DT$y), FUN=sum)
         A        B
1 26499966 28166677
2 26499978 28166673
3 26500056 28166650

Also, it is fast:

此外,它是快:

system.time({ 
   out <- DT[,list(SUM=sum(v)),by=list(x,y)]
   out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
##  user  system elapsed 
## 0.64    0.05    0.70 
system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
## user  system elapsed 
## 7.23    0.16    7.39 

UPDATE

更新

So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:

因此,该解决方案也适用于非平衡数据集(即不存在某些组合),您必须首先将它们输入到数据表中:

library(data.table)
set.seed(1234)
DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
setkey(out, x, y)

intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
setnames(intDT, c("x", "y"))
out <- out[intDT]

out[, as.list(setattr(SUM, 'names', y)), by=list(x)]

Summary

总结

Combining the comments with the above, here's the 1-line solution:

结合上面的评论,这里有一个一行的解决方案:

DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(V1), paste(y)), by = x]

It's also easy to modify this to have more than just the sum, e.g.:

也很容易修改它,使它不仅仅包含和,例如:

DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
   setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
#   x A.sum B.sum   A.mean B.mean
#1: 1    72   123 36.00000   61.5
#2: 2    84   119 42.00000   59.5
#3: 3   187    96 62.33333   48.0
#4: 4    NA    81       NA   81.0

#3


21  

Data.table objects inherit from 'data.frame' so you can just use tapply:

数据。表对象继承自“data.frame”,因此您可以使用tapply:

> tapply(DT$v,list(DT$x, DT$y), FUN=sum)
   AA  BB
a  72 123
b  84 119
c 162  96

#4


7  

You can use dcast from reshape2 library. Here is the code

您可以使用reshape2库中的dcast。这是代码

# DUMMY DATA
library(data.table)
mydf = data.table(
  x = rep(1:3, each = 4),
  y = rep(c('A', 'B'), times = 2),
  v = rpois(12, 30)
)

# USE RESHAPE2
library(reshape2)
dcast(mydf, x ~ y, fun = sum, value_var = "v")

NOTE: The tapply solution would be much faster.

注意:tapply解决方案要快得多。