How to find the best similarity between one row and the rest of a data frame in R?

How can I find the best resemblance between one particular row and the rest of the rows in a dataframe?

Let me explain what I mean. Take a look at this data frame:

df <- structure(list(person = 1:5, var1 = c(1L, 5L, 2L, 2L, 5L), var2 = c(4L, 
4L, 3L, 2L, 2L), var3 = c(5L, 4L, 4L, 3L, 1L)), .Names = c("person", 
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA, 
-5L))

How can I find the best resemblance between person 1 (row 1) and the rest of the rows (persons) in the data frame? The output should be something like this: person 1 stays in row 1, and the remaining rows follow in order of best resemblance. The similarity measure I want to use is cosine or Pearson. I tried to solve my problem with functions from the arules package, but they didn't fit my needs.

Any ideas?

2 solutions

#1

Another idea is to define the cosine function manually and apply it to your data frame, i.e.

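# Cosine similarity between two numeric vectors
f1 <- function(x, y) {
  crossprod(x, y) / sqrt(crossprod(x) * crossprod(y))
}

# Similarity of person 1 to every other row (person column dropped)
sims <- sapply(2:nrow(df), function(i) f1(unlist(df[1, -1]), unlist(df[i, -1])))

# Keep row 1 first, then the remaining rows from most to least similar
df[c(1, order(sims, decreasing = TRUE) + 1), ]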

which gives,

   person var1 var2 var3
1      1    1    4    5
3      3    2    3    4
4      4    2    2    3
2      2    5    4    4
5      5    5    2    1
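
The question also mentions Pearson. As a minimal sketch (not part of the original answer), the same ranking idea can be tried with base R's cor(), which computes Pearson correlation by default. The names m, pearson and pearson_ranked are illustrative only, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Example data from the question
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# Numeric matrix of the variables only (person ID dropped)
m <- as.matrix(df[-1])

# Pearson correlation of row 1 with each of the other rows
pearson <- apply(m[-1, , drop = FALSE], 1, function(r) cor(m[1, ], r))

# Row 1 stays first; remaining rows sorted from most to least correlated
pearson_ranked <- df[c(1, order(pearson, decreasing = TRUE) + 1), ]
pearson_ranked
```

On this data the Pearson ranking happens to put the persons in the same order as the cosine ranking (1, 3, 4, 2, 5). That need not hold in general, because Pearson correlation is the cosine of the mean-centered vectors and therefore ignores differences in each person's overall level.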

#2

You could try the cosine function from the lsa package:

library('lsa') 
cosine(t(df[-1]))
#          [,1]      [,2]      [,3]      [,4]      [,5]
#[1,] 1.0000000 0.8379571 0.9742160 0.9356015 0.5070926
#[2,] 0.8379571 1.0000000 0.9346460 0.9637388 0.8947540
#[3,] 0.9742160 0.9346460 1.0000000 0.9908302 0.6780635
#[4,] 0.9356015 0.9637388 0.9908302 1.0000000 0.7527727
#[5,] 0.5070926 0.8947540 0.6780635 0.7527727 1.0000000

cosine expects a matrix in which each column represents a person (that is why I transpose with t), and it computes the cosine similarity between every pair of columns.
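
To get from the similarity matrix to the ordering asked for in the question, the first row of the matrix can be used to sort the remaining persons. The sketch below is an illustration under assumptions, not part of the original answer: it builds the matrix in base R so that it runs without lsa installed, and it assumes the result agrees with cosine(t(df[-1])) up to floating-point error. The names m, norms, sim, to_first and cosine_ranked are illustrative.

```r
# Example data from the question
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# Numeric matrix of the variables only (person ID dropped)
m <- as.matrix(df[-1])

# Pairwise cosine similarities between rows, computed in base R.
# Assumption: equivalent to lsa's cosine(t(df[-1])).
norms <- sqrt(rowSums(m^2))
sim <- tcrossprod(m) / outer(norms, norms)

# Similarities of person 1 to persons 2..n (self-similarity dropped)
to_first <- sim[1, -1]

# Row 1 first, then the others from most to least similar
cosine_ranked <- df[c(1, order(to_first, decreasing = TRUE) + 1), ]
cosine_ranked
```

With lsa loaded, sim could instead be taken directly from cosine(t(df[-1])); the ordering step stays the same.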
