Count combinations of length 2 per id

I have a largish data.table with two columns, id and var:

head(DT)
#    id var
# 1:  1   B
# 2:  1   C
# 3:  1   A
# 4:  1   C
# 5:  2   B
# 6:  2   C

I would like to create a kind of cross-table that shows how many times each length-2 combination of var occurred in the data.

Expected output for the sample data:

out
#    A  B C
# A  0  3 3
# B NA  1 3
# C NA NA 0

Explanation:

The diagonal of the resulting matrix/data.frame/data.table counts the ids for which all observed vars were the same (all A, all B, or all C). In the sample data, id 4 has only one entry, B, so B - B is 1 in the desired result.
The upper triangle counts the ids in which two specific vars were both present, e.g. the combination A - B is present in 3 ids, as are the combinations A - C and B - C.
Note that for any id, a single combination of two vars can only be 0 (not present) or 1 (present), i.e. I don't want to count it multiple times per id.
The lower triangle of the result can be left NA, or 0, or it could have the same values as the upper triangle, but that would be redundant.

(The result could also be given in long-format as long as the relevant information is present.)
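
For reference, the rules above can be sketched in base R with an id-by-var presence matrix. This is only an illustration of the specification, not a tuned solution for large data; the names dat, m, single and out are made up for the example, and the sample data is rebuilt as a plain data.frame.

```r
# Sample data from the question, as a plain data.frame
dat <- data.frame(
  id  = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4),
  var = c("B", "C", "A", "C", "B", "C", "C", "A", "B", "B", "C", "C", "C",
          "C", "B", "C", "B", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# 0/1 presence matrix: one row per id, one column per var
m <- (unclass(table(dat$id, dat$var)) > 0L) + 0L

# Off-diagonal cells: number of ids in which both vars are present
out <- crossprod(m)

# Diagonal: ids that contain exactly one distinct var, counted for that var
single <- rowSums(m) == 1
diag(out) <- colSums(m[single, , drop = FALSE])

# The lower triangle is redundant, so blank it out
out[lower.tri(out)] <- NA
out
```

Each id contributes at most 1 to any cell because the matrix records presence rather than counts.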

I'm sure there's a clever (efficient) way of computing this, but I can't currently wrap my head around it.

Sample data:

DT <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), var = c("B", "C", "A", 
"C", "B", "C", "C", "A", "B", "B", "C", "C", "C", "C", "B", "C", 
"B", "A", "C", "B")), .Names = c("id", "var"), row.names = c(NA, 
-20L), class = "data.frame")

library(data.table)
setDT(DT, key = "id")

1 Solution

#1

Since you're ok with long-form results:

DT[, if(all(var == var[1]))
       .(var[1], var[1])
     else
       as.data.table(t(combn(sort(unique(var)), 2))), by = id][
   , .N, by = .(V1, V2)]
#   V1 V2 N
#1:  A  B 3
#2:  A  C 3
#3:  B  C 3
#4:  B  B 1
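
To make the per-id step easier to follow, here is a small base R sketch of what happens inside each group. The helper name pairs_for_id is hypothetical and used only for illustration; it mirrors the if/else in the code above.

```r
# Hypothetical helper: the pairs a single id contributes
pairs_for_id <- function(v) {
  u <- sort(unique(v))             # distinct vars, in alphabetical order
  if (length(u) == 1L) {
    matrix(c(u, u), ncol = 2)      # only one distinct var: a "diagonal" pair
  } else {
    t(combn(u, 2))                 # one row per unordered pair of distinct vars
  }
}

pairs_for_id(c("B", "C", "A", "C"))   # id 1: rows (A, B), (A, C), (B, C)
pairs_for_id("B")                     # id 4: the single row (B, B)
```

Sorting before calling combn guarantees that the first element of each pair comes first alphabetically, so A - B and B - A are never counted as different pairs across ids.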

Or if we call the above output res:

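# complete the V1 x V2 grid, zero-fill the missing diagonal cells, then reshape to wide
dcast(res[CJ(V1 = c(V1, V2), V2 = c(V1, V2), unique = TRUE), on = c('V1', 'V2')][
          V1 == V2 & is.na(N), N := 0L], V1 ~ V2, value.var = "N")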
#   V1  A  B C
#1:  A  0  3 3
#2:  B NA  1 3
#3:  C NA NA 0

An alternative to combn is doing:

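# name the CJ columns explicitly; newer data.table versions would otherwise
# auto-name both columns "var", and the V1 < V2 filter would fail
DT[, if (all(var == var[1]))
       .(var[1], var[1])
     else
       CJ(V1 = var, V2 = var, unique = TRUE)[V1 < V2], by = id][
   , .N, by = .(V1, V2)]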
#    V1 V2 N
# 1:  A  B 3
# 2:  A  C 3
# 3:  B  C 3
# 4:  B  B 1

# or combn with list output (instead of matrix)

unique(DT, by=NULL)[ order(var), if(.N==1L)
       .(var, var)
     else
       transpose(combn(var, 2, simplify=FALSE)), by = id][
   , .N, by = .(V1, V2)]