I need to calculate the Jaccard distance between every pair of rows in a data frame. The result needs to be a matrix/data frame that holds the distances.
Like this:
    1    2    3    ..
1   0    0.2  1
2   0.2  0    0.4
3   1    0.4  0
.
.
My data:
dput(items[1:10,])
structure(list(Drama = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), Comedy = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), Crime = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), SciFi = c(1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), Kids = c(1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L), Classic = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L), Foreign = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Thriller = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Action = c(0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), Adventure = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Animation = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), Adult = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), History = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Documentry = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), Nature = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), Horror = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L), Show = c(0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), Series = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), BlackWhite = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Drama", "Comedy", "Crime",
"SciFi", "Kids", "Classic", "Foreign", "Thriller", "Action",
"Adventure", "Animation", "Adult", "History", "Documentry", "Nature",
"Horror", "Show", "Series", "BlackWhite"), row.names = c(NA,
10L), class = "data.frame")
My code:
Jaccard_dist <- dist(items, items, method = "Jaccard")
write.csv(Jaccard_dist,'Jaccard_dist.csv')
Do you know of a way to do this without using two for-loops?
1 Answer
#1
2
Not sure why you need two for loops.
You can try the proxy package and use:
proxy::dist(dft, by_rows = TRUE, method = "Jaccard")
This returns:
#           1         2         3         4         5         6         7         8         9
#2  1.0000000
#3  1.0000000 0.6666667
#4  0.8000000 0.8000000 1.0000000
#5  1.0000000 0.8000000 0.6666667 0.8000000
#6  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
#7  1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000
#8  0.5000000 1.0000000 1.0000000 0.5000000 0.8000000 0.6666667 0.7500000
#9  1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.0000000 0.5000000 0.6666667
#10 1.0000000 1.0000000 1.0000000 0.7500000 0.7500000 0.5000000 0.6666667 0.7500000 0.5000000
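The output above is only the lower triangle of a "dist" object, while the question asks for a full matrix that can be written to a CSV file. Below is a minimal base-R sketch of one way to get there without loops, assuming the data contains only 0/1 values. The toy matrix m and the names inter, sizes, union_size, jaccard and full are made up for illustration and are not part of the original post. It computes the Jaccard distance with matrix algebra and compares it against base R's dist(method = "binary"), which implements the same measure for binary data.

```r
# Toy 0/1 indicator matrix (hypothetical data, not the question's `items`)
m <- rbind(
  a = c(1, 1, 0, 0),
  b = c(1, 0, 0, 0),
  c = c(0, 0, 1, 1),
  d = c(1, 1, 0, 0)
)

# Vectorised Jaccard distance: 1 - |intersection| / |union| for every row pair
inter      <- m %*% t(m)                          # shared 1s for each pair of rows
sizes      <- rowSums(m)                          # number of 1s in each row
union_size <- outer(sizes, sizes, "+") - inter    # |A| + |B| - |A and B|
jaccard    <- 1 - inter / union_size              # NaN if both rows are all zero

# Base R alternative: the "binary" method, expanded from the lower
# triangle to a full symmetric matrix with as.matrix()
full <- as.matrix(dist(m, method = "binary"))

print(round(jaccard, 3))

# A full matrix can be written out directly, e.g.:
# write.csv(full, "Jaccard_dist.csv")
```

Since proxy::dist also returns a "dist" object here, wrapping its result in as.matrix() before write.csv() should give the full symmetric layout shown in the question. Rows with no 1s at all need care: the matrix-algebra version divides by zero for a pair of all-zero rows, so check how you want that case handled.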