R-Software: Counting co-occurring combinations in one column based on a second column

I have a (seemingly) simple problem but have not yet been able to find a solution that is efficient in time and resources. This is a problem in R. My data has the following format:

INPUT
col1     col2
A         q
C         w
B         e
A         r
A         t
A         y
C         q
B         w
C         e
C         r
B         t
C         y

DESIRED OUTPUT
unit1     unit2     same_col2_freq
A          B          1
A          C          3
B          A          1
B          C          2
C          A          3
C          B          2

That is, in the input, A occurs in col1 together with q, r, t, y in col2. Of q, r, t, y, only t also occurs with B, so the A-B combination has count 1. B occurs in col1 together with e, w, t in col2. Of e, w, t, both e and w also occur with C, so the B-C combination has count 2. And so on for all combinations in col1.

I have done it using a for loop, but it is slow. I pick the unique elements of col1 and then iterate over all the data for each of them, combining the results with rbind. This is slow and resource-costly.

I am looking for an efficient method. Maybe a library or function exists that I am unaware of. I tried using a co-occurrence matrix, but the number of unique elements in col1 is on the order of ~10,000, and it does not serve my purpose.

Any help is greatly appreciated.

Thanks!

2 solutions

#1

Here is a similar approach (as shown by @cogitovita), but using data.table. Convert the "data.frame" to a "data.table" using setDT, then cross join (CJ) the unique elements of "col1", grouped by "col2". Keep only the rows where the two output columns differ (V1!=V2), get the count (.N) grouped by the new columns (.(V1, V2)), and finally order the rows (order(V1,V2)).

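library(data.table)
# df is defined under "data" below.
# Pair up the col1 values that share each col2 value, drop self-pairs,
# count rows per pair, then sort.
setDT(df)[, CJ(unique(col1), unique(col1)), col2][V1 != V2,
         .N, .(V1, V2)][order(V1, V2)]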
#   V1 V2 N
#1:  A  B 1
#2:  A  C 3
#3:  B  A 1
#4:  B  C 2
#5:  C  A 3
#6:  C  B 2

data

df <-  structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B", 
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q", 
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
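
For readers who find the chained call dense, here is a sketch of the same idea split into named steps. The names dt, pairs, res, unit1, unit2 and same_col2_freq are chosen here for illustration (the last three mirror the desired output in the question); the logic is intended to match the one-liner above, not to replace it.

```r
library(data.table)

# Same sample data as above, built directly as a data.table.
dt <- data.table(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y")
)

# Step 1: within each col2 value, build every ordered pair of col1 values.
pairs <- dt[, CJ(unit1 = unique(col1), unit2 = unique(col1)), by = col2]

# Step 2: drop self-pairs such as (A, A).
pairs <- pairs[unit1 != unit2]

# Step 3: count how many col2 values each pair shares, then sort.
res <- pairs[, .(same_col2_freq = .N), by = .(unit1, unit2)][order(unit1, unit2)]
print(res)
```

Naming the CJ arguments avoids the default V1/V2 column names, so the result already carries the headers asked for in the question.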

#2

Use merge to join the data frame with itself, and then use aggregate to count within groups. Demo:

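d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
# Self-join on col2: one row per pair of col1 values sharing a col2 value
dm = merge(d, d, by="col2")
# Drop rows that pair a col1 value with itself
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
# Count the remaining rows for each (col1.x, col1.y) pair
aggregate(col2 ~ col1.x + col1.y, data=dm, length)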
#       col1.x col1.y col2
# 1      B      A    1
# 2      C      A    3
# 3      A      B    1
# 4      C      B    2
# 5      A      C    3
# 6      B      C    2
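
As a cross-check on the counts, here is a small base-R sketch of the co-occurrence-matrix idea mentioned in the question, using table() and tcrossprod(). This is not either answerer's method, and the names inc, co, units and out are illustrative. It builds a dense units-by-units matrix, which for roughly 10,000 distinct col1 values means about 10^8 cells, so it is offered only as a check on small data; a sparse-matrix package might scale better, but that is an assumption not tested here.

```r
d <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y")
)

# Incidence matrix: rows are col1 values, columns are col2 values.
# unique() keeps a repeated (col1, col2) row from being counted twice.
inc <- unclass(table(unique(d)))

# inc %*% t(inc): cell [i, j] is the number of col2 values shared by i and j.
co <- tcrossprod(inc)

# Reshape to long form; expand.grid varies unit1 fastest, which matches
# the column-major order of as.vector(co).
units <- rownames(inc)
out <- expand.grid(unit1 = units, unit2 = units, stringsAsFactors = FALSE)
out$same_col2_freq <- as.vector(co)

# Drop self-pairs and pairs that share nothing, then sort.
out <- out[out$unit1 != out$unit2 & out$same_col2_freq > 0, ]
out <- out[order(out$unit1, out$unit2), ]
print(out)
```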
