My data frame looks like this:
Index V1 V2 V3 V4 V5 V6
1 a b c d e f
2 b c d e
3 a b c f g
4 a c f d g
5 b c d g h i
. . . . . . .
. . . . . . .
I need to iterate through each row in the data frame, pick out the pairs that appear together, and count them. For example, a and b appear together in rows 1 and 3, so count = 2.
The data frame has 6 columns (excluding the index) and 554 rows. Each row holds 6 variables out of a possible 11.
The first step would be to count the pair a and b.
Then do all the combinations, e.g. a+c, a+d, a+e... b+c, b+d...
I've used table(apply(df,1,function(x) paste(sort(x), collapse='-'))) and count(df) from the plyr package, but the output was the freq of the full combinations: a+b, a+b+c, ... b+c, b+c+d.
I need the freq of all pairs. So the freq of a+b = (freq of a+b) + (freq of a+b+c) + (freq of a+b+c+d) and so on.
In Excel, I've tried COUNTIF, such as COUNTIF(column1,a,column2,b), but a and b aren't always in columns 1 and 2 respectively.
I also tried COUNTIF(df,a,df,b), but that gave me a huge number.
This can be done in either R or Excel, although I think it would be faster in R.
1 Solution
#1
Using some random example data, let's assume that the data frame is in C5:H558.
Define a name str as
=$C$5:$C$558&$D$5:$D$558&$E$5:$E$558&$F$5:$F$558&$G$5:$G$558&$H$5:$H$558
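For readers who prefer R, the same idea (one concatenated string per row, then a literal search for each symbol) can be sketched as follows. The toy data frame df and the names row_str and count_ab are made up for illustration, and the sketch assumes every symbol is a single character, as in the question; it is not a drop-in replacement for the worksheet.

```r
# Hypothetical toy data mirroring the first five rows of the question;
# missing cells are represented as empty strings.
df <- data.frame(
  V1 = c("a", "b", "a", "a", "b"),
  V2 = c("b", "c", "b", "c", "c"),
  V3 = c("c", "d", "c", "f", "d"),
  V4 = c("d", "e", "f", "d", "g"),
  V5 = c("e", "",  "g", "g", "h"),
  V6 = c("f", "",  "",  "",  "i"),
  stringsAsFactors = FALSE
)

# One string per row, playing the role of the named range str
row_str <- do.call(paste0, df)

# Number of rows whose string contains both "a" and "b"
# (fixed = TRUE gives a literal, case-sensitive match, similar to FIND)
count_ab <- sum(grepl("a", row_str, fixed = TRUE) &
                grepl("b", row_str, fixed = TRUE))
count_ab
```

On this toy data the pair a and b is found in rows 1 and 3, which matches the count the question expects.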
Enter the symbols in L5:V5 as well as in K6:K16.
Enter this counting formula
=IF(CODE($K6)>CODE(L$5),SUMPRODUCT(1-N(ISERROR(FIND($K6,str))+N(ISERROR(FIND(L$5,str)))>0)),"")
in L6 and copy it to fill the rest of the table L6:V16.
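Since the question also allows R, here is a minimal sketch of the same pair table built there. It swaps the string search for a presence matrix and crossprod, which is a different technique from the formula above and does not require single-character symbols. The toy data and the names symbols, present, pair_counts and pairs are assumptions for illustration only.

```r
# Hypothetical toy data mirroring the first five rows of the question;
# missing cells are represented as empty strings.
df <- data.frame(
  V1 = c("a", "b", "a", "a", "b"),
  V2 = c("b", "c", "b", "c", "c"),
  V3 = c("c", "d", "c", "f", "d"),
  V4 = c("d", "e", "f", "d", "g"),
  V5 = c("e", "",  "g", "g", "h"),
  V6 = c("f", "",  "",  "",  "i"),
  stringsAsFactors = FALSE
)

# All distinct symbols, ignoring the empty-cell placeholder
symbols <- sort(setdiff(unique(unlist(df)), ""))

# Presence matrix: one row per data row, one column per symbol
present <- t(apply(df, 1, function(x) symbols %in% x))
colnames(present) <- symbols

# Entry [i, j] is the number of rows containing both symbol i and symbol j;
# the diagonal holds the plain count of each symbol
pair_counts <- crossprod(present * 1)

# Long format: one line per unordered pair (upper triangle only)
idx <- which(upper.tri(pair_counts), arr.ind = TRUE)
pairs <- data.frame(
  first  = symbols[idx[, 1]],
  second = symbols[idx[, 2]],
  count  = pair_counts[idx]
)

pair_counts["a", "b"]
```

A row such as a+b+c+d contributes to a+b, a+c, a+d, b+c and so on, so the counts already include every larger combination that contains the pair, which is the total the question asks for.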