I have a seemingly simple problem but have not yet been able to find a suitably fast, resource-efficient solution. The problem is in R. My data has the following format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is, in the input, A occurs in col1 with q, r, t, y in col2. Of these, only t also occurs with B, so the A-B combination has count 1. B occurs in col1 with e, w, t in col2. Of these, e and w also occur with C, so the B-C combination has count 2. And so on for all combinations of col1 values.
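To make the counting rule concrete, here is a minimal base R sketch that checks single pairs against the sample data. The helper name shared_count is hypothetical and only for illustration. It is not meant as an efficient solution, just a way to verify the definition.

```r
# Sample data from the question
df <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct col2 values shared by two col1 units
shared_count <- function(data, u1, u2) {
  length(intersect(data$col2[data$col1 == u1], data$col2[data$col1 == u2]))
}

shared_count(df, "A", "B")  # 1 (only "t" is shared)
shared_count(df, "B", "C")  # 2 ("e" and "w" are shared)
shared_count(df, "A", "C")  # 3 ("q", "r" and "y" are shared)
```

Calling this for every pair would repeat the slow loop, so it is only useful as a correctness check on small inputs.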
I have done it using a for loop, but it is slow. I pick the unique elements of col1 and then iterate over all the data for each element. Then I combine the results using rbind. This is slow and resource-intensive.
I am looking for an efficient method. Maybe a library or function exists that I am unaware of. I tried using a co-occurrence matrix, but the number of distinct elements in col1 is on the order of 10,000, so it does not serve my purpose.
Any help is greatly appreciated.
Thanks!
2 solutions
#1
0
Here is a similar approach (as shown by @cogitovita), but using data.table. Convert the "data.frame" to a "data.table" using setDT, then Cross Join (CJ) the unique elements of "col1", grouped by "col2". Subset the rows where the two output columns are not equal (V1!=V2), get the count (.N), grouped by the new columns (.(V1, V2)), and finally order the rows by those columns (order(V1,V2)).
library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
data
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
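The chained call above can be hard to read at first, so here is the same idea split into separate steps. This is only a sketch: the intermediate names (dt, pairs, res) and the output column names unit1, unit2 and same_col2_freq are my own choices to match the desired output in the question, and it assumes the data.table package is installed.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y")
)

# Step 1: for each col2 value, pair every unit with every unit (self-pairs included)
pairs <- dt[, CJ(unit1 = unique(col1), unit2 = unique(col1)), by = col2]

# Step 2: drop the self-pairs such as A-A
pairs <- pairs[unit1 != unit2]

# Step 3: count how many col2 values each ordered pair shares
res <- pairs[, .(same_col2_freq = .N), by = .(unit1, unit2)]

# Step 4: sort the rows by both units
res <- res[order(unit1, unit2)]
res
```

Note that step 1 creates k^2 rows for a col2 value that occurs with k different units, so memory use depends on how widely the col2 values are shared.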
#2
1
Use merge to join the dataframe with itself and then use aggregate to count within groups. Demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
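If the result should look exactly like the desired output in the question, the aggregate result can be renamed and sorted afterwards. A minimal sketch follows. The unique(d) call is an added assumption that a repeated (col1, col2) row should count only once, and res is just an illustrative name.

```r
d <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"),
  stringsAsFactors = FALSE
)

# Assumption: duplicated (col1, col2) rows should be counted once
d <- unique(d)

# Self-join on col2, then drop rows pairing a unit with itself
dm <- merge(d, d, by = "col2")
dm <- dm[dm$col1.x != dm$col1.y, ]

# Count shared col2 values per ordered pair of units
res <- aggregate(col2 ~ col1.x + col1.y, data = dm, FUN = length)

# Rename and sort to match the layout requested in the question
names(res) <- c("unit1", "unit2", "same_col2_freq")
res <- res[order(res$unit1, res$unit2), ]
rownames(res) <- NULL
res
```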