How to preserve variable combinations that do not appear in the input when grouping with data.table?

Using the data.table package, is it possible to summarise data while preserving combinations of variables that do not appear in the input?

With the plyr package I know how to do this with the .drop argument, for example:

library(plyr)
df <- data.frame(categories = c(rep("A",3), rep("B",3), rep("C",3)), groups = c(rep(c("X", "Y"),4), "Z"), values = rep(1, 9))

# .drop = FALSE keeps combinations that have no rows in df
df1 <- ddply(df, c("categories","groups"), .drop = FALSE, summarise, sum = sum(values))

output:

 categories groups sum
1          A      X   2
2          A      Y   1
3          A      Z   0
4          B      X   1
5          B      Y   2
6          B      Z   0
7          C      X   1
8          C      Y   1
9          C      Z   1

In this case I preserve all groups/categories combinations, even those whose sum is 0.

1 Solution

#1

Great question. Here are two ways. They both use by-without-by.

library(data.table)
DT = as.data.table(df)
setkey(DT,categories,groups)
# data.table >= 1.9.4 needs by=.EACHI to run j once per row of i
DT[CJ(unique(categories),unique(groups)), sum(values,na.rm=TRUE), by=.EACHI]

   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z  0
4:          B      X  1
5:          B      Y  2
6:          B      Z  0
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

Here CJ stands for Cross Join; see ?CJ. by-without-by just means that j is executed once for each group that a row of i joins to. Since data.table 1.9.4 this is no longer implicit and has to be requested with by=.EACHI.

Admittedly it looks tricky at first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting just the results you need. But in this case you'd like everything anyway, so there's not much advantage, other than being able to look up groups that don't exist in the data (which you can't do with by).
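
To illustrate the known-subset case, here is a minimal sketch that looks up just two combinations, one of which (A/Z) has no rows in the data. It assumes data.table 1.9.4 or later, where the per-row evaluation is requested with by = .EACHI. The lookup table and the sum column name are made up for this example.

```r
library(data.table)

# Same toy data as in the question
DT <- data.table(
  categories = rep(c("A", "B", "C"), each = 3),
  groups     = c(rep(c("X", "Y"), 4), "Z"),
  values     = rep(1, 9)
)
setkey(DT, categories, groups)

# Hypothetical subset of combinations we care about; A/Z is absent from DT
lookup <- data.table(categories = c("A", "B"), groups = c("Z", "X"))

# j runs once per row of lookup. Unmatched rows see values = NA,
# so sum(..., na.rm = TRUE) gives 0 for A/Z, while B/X sums to 1.
res <- DT[lookup, .(sum = sum(values, na.rm = TRUE)), by = .EACHI]
print(res)
```

Only the rows matching lookup are touched, which is where the speed advantage over grouping everything comes from.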

Another way is to group with by first as normal, then join the CJ() result to that:

DT[,sum(values),keyby='categories,groups'][CJ(unique(categories),unique(groups))]
   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z NA
4:          B      X  1
5:          B      Y  2
6:          B      Z NA
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

but then you get NA instead of the desired 0. Those could be replaced using set() if need be. The second way might be faster because the two unique calls are given much smaller input.
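
As a rough sketch of the set() replacement mentioned above: the names agg, combos and full, the sum column name, and the explicit on= join are choices made for this example rather than anything from the original code.

```r
library(data.table)

# Same toy data as in the question
DT <- data.table(
  categories = rep(c("A", "B", "C"), each = 3),
  groups     = c(rep(c("X", "Y"), 4), "Z"),
  values     = rep(1, 9)
)

# Group as normal first, then join every combination onto the result
agg    <- DT[, .(sum = sum(values)), keyby = .(categories, groups)]
combos <- CJ(categories = unique(agg$categories), groups = unique(agg$groups))
full   <- agg[combos, on = .(categories, groups)]

# Overwrite the NA sums with 0 by reference, without copying the table
set(full, i = which(is.na(full$sum)), j = "sum", value = 0)
print(full)
```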

Both methods can be wrapped up into small helper functions if you do this a lot.
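
One possible shape for such a helper, based on the second method. The function sum_all_combos and its arguments are hypothetical and not part of data.table. It assumes a numeric value column and fills missing combinations with 0.

```r
library(data.table)

# Hypothetical helper: sum value_col within by_cols, keeping every
# combination of the by_cols values, including those absent from dt
sum_all_combos <- function(dt, by_cols, value_col) {
  # Ordinary grouped sum
  agg <- dt[, .(sum = sum(get(value_col))), keyby = by_cols]
  # Cross join of the unique values of each grouping column
  uniq   <- setNames(lapply(by_cols, function(col) unique(agg[[col]])), by_cols)
  combos <- do.call(CJ, uniq)
  # Join all combinations on, then replace the resulting NAs with 0
  out <- agg[combos, on = by_cols]
  set(out, i = which(is.na(out$sum)), j = "sum", value = 0)
  out[]
}

# Usage with the toy data from the question
DT <- data.table(
  categories = rep(c("A", "B", "C"), each = 3),
  groups     = c(rep(c("X", "Y"), 4), "Z"),
  values     = rep(1, 9)
)
res <- sum_all_combos(DT, c("categories", "groups"), "values")
print(res)
```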
