Count the length-2 combinations per id

Time: 2023-01-28 18:28:56

I have a largish data.table with two columns, id and var:

head(DT)
#    id var
# 1:  1   B
# 2:  1   C
# 3:  1   A
# 4:  1   C
# 5:  2   B
# 6:  2   C

I would like to create a kind of cross-table that shows how many times each length-2 combination of var values occurred in the data.

Expected output for the sample data:

out
#    A  B C
# A  0  3 3
# B NA  1 3
# C NA NA 0

Explanation:

  • The diagonal of the resulting matrix/data.frame/data.table counts the ids for which every var that occurred was the same (all A, all B, or all C). In the sample data, id 4 has a single entry, B, so B - B is 1 in the desired result.
  • The upper triangle counts the ids in which two specific vars were both present, i.e. the combination A - B is present in 3 ids, as are the combinations A - C and B - C.
  • Note that for any id, a single combination of two vars can only be either 0 (not present) or 1 (present), i.e. I don't want to count it multiple times per id.
  • The lower triangle of the result can be left NA, or 0, or it could have the same values as the upper triangle, but that would be redundant.
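
The rules above can be sketched in base R with an id-by-var incidence matrix. This is only an illustration of the counting rules, not the method used in the answer below; the names inc, single and out are made up for this sketch, and the sample data is rebuilt as a plain data.frame so that no packages are needed.

```r
# Sample data from the question, as a plain data.frame
DT <- data.frame(
  id  = rep(1:4, times = c(4, 9, 6, 1)),
  var = c("B", "C", "A", "C", "B", "C", "C", "A", "B", "B",
          "C", "C", "C", "C", "B", "C", "B", "A", "C", "B"),
  stringsAsFactors = FALSE
)

# id x var incidence matrix: 1 if the var occurs at least once for the id,
# so repeated occurrences within an id are only counted once
inc <- unclass(table(DT$id, DT$var)) > 0
storage.mode(inc) <- "integer"

# crossprod(inc)[a, b] is the number of ids in which both a and b occur
out <- crossprod(inc)

# Diagonal rule: count the ids whose entries are all the same var
single <- rowSums(inc) == 1
diag(out) <- colSums(inc[single, , drop = FALSE])

# The lower triangle is redundant, so set it to NA
out[lower.tri(out)] <- NA
out
```

For the sample data this should reproduce the expected table shown above. A dense incidence matrix may not scale to many distinct var values, so treat it as a way to check the rules rather than as a tuned solution.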

(The result could also be given in long-format as long as the relevant information is present.)

I'm sure there's a clever (efficient) way of computing this, but I can't currently wrap my head around it.

Sample data:

DT <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L), var = c("B", "C", "A", 
"C", "B", "C", "C", "A", "B", "B", "C", "C", "C", "C", "B", "C", 
"B", "A", "C", "B")), .Names = c("id", "var"), row.names = c(NA, 
-20L), class = "data.frame")

library(data.table)
setDT(DT, key = "id")

1 solution

#1


Since you're ok with long-form results:

DT[, if(all(var == var[1]))
       .(var[1], var[1])
     else
       as.data.table(t(combn(sort(unique(var)), 2))), by = id][
   , .N, by = .(V1, V2)]
#   V1 V2 N
#1:  A  B 3
#2:  A  C 3
#3:  B  C 3
#4:  B  B 1
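
To see what the combn step contributes for a single id, here is a small base R sketch using the var values of id 1 from the sample data. The names v and pairs are only for illustration.

```r
# var values of id 1 in the sample data (note the repeated "C")
v <- c("B", "C", "A", "C")

# unique() drops the repeats so each pair is counted once per id,
# sort() makes every pair come out in alphabetical order (V1 < V2),
# combn(., 2) lists all pairs as columns, and t() turns them into rows
pairs <- t(combn(sort(unique(v)), 2))
pairs
```

Each row of pairs is one combination for that id; the outer `.N, by = .(V1, V2)` then counts in how many ids each row appears.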

Or if we call the above output res:

dcast(res[CJ(c(V1,V2), c(V1,V2), unique = TRUE), on = c('V1', 'V2')][
          V1 == V2 & is.na(N), N := 0], V1 ~ V2, value.var = "N")
#   V1  A  B C
#1:  A  0  3 3
#2:  B NA  1 3
#3:  C NA NA 0
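
The same long-to-wide step can also be sketched in base R by filling a matrix through character indexing. This is an alternative illustration, not the answer's code; res is retyped here by hand from the long-form output above, and lev and out are names invented for the sketch.

```r
# Long-form counts, retyped from the output above
res <- data.frame(
  V1 = c("A", "A", "B", "B"),
  V2 = c("B", "C", "C", "B"),
  N  = c(3L, 3L, 3L, 1L),
  stringsAsFactors = FALSE
)

# All var values that appear on either side of a pair
lev <- sort(unique(c(res$V1, res$V2)))

# Start from an all-NA square matrix, put 0 on the diagonal,
# then overwrite the cells for which a count exists
out <- matrix(NA_integer_, length(lev), length(lev),
              dimnames = list(lev, lev))
diag(out) <- 0L
out[cbind(res$V1, res$V2)] <- res$N
out
```

Indexing with a two-column character matrix looks each (V1, V2) pair up by row and column name, so the B - B count replaces the 0 that was first placed on the diagonal.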

An alternative to combn is a self cross join with CJ, keeping only the pairs with V1 < V2:

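# the CJ() columns are named explicitly so that V1 and V2 exist
# regardless of how CJ() auto-names its inputs
DT[, if (all(var == var[1]))
       .(var[1], var[1])
     else
       CJ(V1 = var, V2 = var, unique = TRUE)[V1 < V2], by = id][
   , .N, by = .(V1, V2)]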
#    V1 V2 N
# 1:  A  B 3
# 2:  A  C 3
# 3:  B  C 3
# 4:  B  B 1
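
The cross-join idea can be tried out for one id in base R, with expand.grid standing in for CJ. This is only an analogy to show why the V1 < V2 filter leaves each unordered pair exactly once; v and g are illustrative names.

```r
# Distinct var values of id 1 in the sample data
v <- unique(c("B", "C", "A", "C"))

# All ordered pairs (including A-A, B-A, ...), similar in spirit to a cross join
g <- expand.grid(V1 = v, V2 = v, stringsAsFactors = FALSE)

# Keeping V1 < V2 drops the self-pairs and one of each mirrored pair
g <- g[g$V1 < g$V2, ]
g
```

The cross join builds n^2 rows per id before filtering, whereas combn builds only n(n-1)/2, so which one is faster in practice is worth measuring on the real data.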

# or combn with list output (instead of matrix)

unique(DT, by=NULL)[ order(var), if(.N==1L)
       .(var, var)
     else
       transpose(combn(var, 2, simplify=FALSE)), by = id][
   , .N, by = .(V1, V2)]
