Calculating the pairwise Jaccard index between character vectors in a list

I have character vectors in the following format

char1 <- c("Hello", "was", "this", "is", "that", "Boston", "San", "Francisco")
char2 <- c("John", "was", "they", "is", "Hello", "Boston", "San", "Diego")
char3 <- c("John", "very", "happens", "is", "Hello", "has", "San", "Diego")

list <- list(char1, char2, char3)

However, I have around 500 of these, each of length 100,000.

How can I calculate the pairwise Jaccard index (similarity measure) of all vectors in this list and output it as a data frame (NA for comparing the same character vector)? What would be the most efficient way of doing so?

Thanks!

1 solution

#1

You could try the following to obtain all the pairwise similarities with union() and intersect(). Both functions exist in base R, and dplyr provides versions of them as well:

dist <- unlist(lapply(combn(list, 2, simplify = FALSE), function(x) {
  length(intersect(x[[1]], x[[2]]))/length(union(x[[1]], x[[2]])) }))

dist
[1] 0.4545455 0.2307692 0.4545455

To see which pairs are associated with which values you could add:

cbind(t(combn(3,2)), dist)

              dist
[1,] 1 2 0.4545455
[2,] 1 3 0.2307692
[3,] 2 3 0.4545455
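
The question also asks for a data frame with NA where a vector is compared with itself. A minimal sketch of one way to get that shape, assuming the list is stored under the hypothetical name vecs (to avoid masking base::list) and that jaccard() and jaccard_df are illustrative names, not part of the answer above:

```r
# Example data from the question; `vecs` is a hypothetical name for the list
char1 <- c("Hello", "was", "this", "is", "that", "Boston", "San", "Francisco")
char2 <- c("John", "was", "they", "is", "Hello", "Boston", "San", "Diego")
char3 <- c("John", "very", "happens", "is", "Hello", "has", "San", "Diego")
vecs <- list(char1 = char1, char2 = char2, char3 = char3)

# Jaccard index of two vectors, treating each one as a set
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

n <- length(vecs)
pairs <- combn(n, 2)  # 2 x choose(n, 2) matrix, one index pair per column
sims <- apply(pairs, 2, function(p) jaccard(vecs[[p[1]]], vecs[[p[2]]]))

# Fill a symmetric matrix; the diagonal is never assigned and stays NA
m <- matrix(NA_real_, n, n, dimnames = list(names(vecs), names(vecs)))
idx <- t(pairs)
m[idx] <- sims                           # upper triangle
m[idx[, 2:1, drop = FALSE]] <- sims      # mirrored lower triangle
jaccard_df <- as.data.frame(m)
```

Printing jaccard_df then shows one row and one column per vector, with the pairwise values off the diagonal.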

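
On the efficiency part of the question: with around 500 vectors there are 124,750 pairs, so work repeated per pair adds up. One possible variation, sketched here as an assumption rather than a benchmarked recommendation, is to deduplicate each vector once up front and get the union size from |A| + |B| - |A ∩ B| instead of building the union. The names vecs, sets, sizes and sim are hypothetical:

```r
# Toy input; `vecs` is a hypothetical name for the list of character vectors
vecs <- list(
  c("Hello", "was", "this", "is", "that", "Boston", "San", "Francisco"),
  c("John", "was", "they", "is", "Hello", "Boston", "San", "Diego"),
  c("John", "very", "happens", "is", "Hello", "has", "San", "Diego")
)

sets  <- lapply(vecs, unique)  # deduplicate once per vector, not once per pair
sizes <- lengths(sets)         # set sizes |A|
n     <- length(sets)

sim <- matrix(NA_real_, n, n)  # diagonal stays NA
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    # Intersection size: elements of set i that also occur in set j
    inter <- sum(sets[[i]] %in% sets[[j]])
    # Union size by inclusion-exclusion, so no union vector is allocated
    sim[i, j] <- sim[j, i] <- inter / (sizes[i] + sizes[j] - inter)
  }
}
```

as.data.frame(sim) converts the result to a data frame. Whether this is actually faster on the real data would need to be measured, for example with system.time().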