Calculate Jaccard distance between rows in R

Date: 2020-12-15 15:21:51

I need to calculate the Jaccard distance between every pair of rows in a data frame. The result needs to be a matrix or data frame holding the distances.

Like this:

   1     2   3 ..
1  0    0.2  1 
2  0.2  0    0.4
3  1    0.4  0
.
.

My data:

dput(items[1:10,])

structure(list(Drama = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), Comedy = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Crime = c(0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), SciFi = c(1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L), Kids = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L), Classic = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
0L), Foreign = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Thriller = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Action = c(0L, 0L, 0L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), Adventure = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), Animation = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), Adult = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), History = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Documentry = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), Nature = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), Horror = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
0L), Show = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Series = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), BlackWhite = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Drama", "Comedy", "Crime", 
"SciFi", "Kids", "Classic", "Foreign", "Thriller", "Action", 
"Adventure", "Animation", "Adult", "History", "Documentry", "Nature", 
"Horror", "Show", "Series", "BlackWhite"), row.names = c(NA, 
10L), class = "data.frame")

My code:

Jaccard_dist <- dist(items, items, method = "Jaccard")

write.csv(Jaccard_dist,'Jaccard_dist.csv')

Do you know of a way to do this without using two for loops?

1 answer

#1


Not sure why you need two for loops.

You can try the proxy package (dft here is the data frame from the question):

proxy::dist(dft, by_rows = TRUE, method = "Jaccard")

This returns:

#           1         2         3         4         5         6         7         8         9
#2  1.0000000                                                                                
#3  1.0000000 0.6666667                                                                      
#4  0.8000000 0.8000000 1.0000000                                                            
#5  1.0000000 0.8000000 0.6666667 0.8000000                                                  
#6  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667                                        
#7  1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000                              
#8  0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000                    
#9  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667          
#10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
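
If you would rather avoid the extra package, or want to check the numbers above, here is a minimal base R sketch. It assumes every column is 0/1, as in the question's data. The name jaccard_dist is a hypothetical helper, not part of any package, and treating two all-zero rows as distance 0 is my assumption (it may not match what proxy does in that case). The small genres matrix is rows 1 to 4 of the question's data, restricted to the columns that are non-zero for those rows.

```r
# Jaccard distance between all pairs of rows of a 0/1 matrix, without loops.
# jaccard_dist is a hypothetical helper name, not a function from any package.
jaccard_dist <- function(x) {
  x <- as.matrix(x)
  inter <- tcrossprod(x)                     # number of shared 1s for each row pair
  sizes <- rowSums(x)                        # number of 1s in each row
  union <- outer(sizes, sizes, "+") - inter  # |A| + |B| - |A and B|
  d <- 1 - inter / union
  d[union == 0] <- 0  # assumption: two all-zero rows count as identical
  d
}

# Rows 1 to 4 of the question's data, keeping only the columns that are
# non-zero for at least one of these rows.
genres <- matrix(
  c(0, 0, 1, 1, 1, 0, 0,
    1, 1, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 1,
    1, 0, 0, 0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(as.character(1:4),
                  c("Comedy", "Crime", "SciFi", "Kids",
                    "Classic", "Action", "Show"))
)

d <- jaccard_dist(genres)
print(round(d, 7))
```

The lower triangle of d should agree with the first three rows of the proxy output above (1; 1 and 0.6666667; 0.8, 0.8 and 1). Since the helper already returns a full square matrix, it can be passed straight to write.csv() as in the question. The proxy result is a dist object, so wrap it in as.matrix() first if you need the full square form.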
