How can I find the best resemblance between one particular row and the rest of the rows in a dataframe?
Let me explain what I mean. Take a look at this data frame:
df <- structure(list(person = 1:5, var1 = c(1L, 5L, 2L, 2L, 5L), var2 = c(4L,
4L, 3L, 2L, 2L), var3 = c(5L, 4L, 4L, 3L, 1L)), .Names = c("person",
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA,
-5L))
How can I find the best resemblance between person 1 (row 1) and the rest of the rows (persons) in the data frame? The output should look something like this: person 1 stays in row 1, and the remaining rows follow in order of decreasing resemblance. The similarity measure I want to use is cosine or Pearson. I tried to solve this with functions from the arules package, but they did not fit my needs.
Does anyone have any ideas?
2 Solutions
#1
Another idea is to define the cosine function manually and apply it to your data frame, i.e.
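# Cosine similarity between two numeric vectors
f1 <- function(x, y) {
  crossprod(x, y) / sqrt(crossprod(x) * crossprod(y))
}

# Compare row 1 with rows 2..n (column 1 is the person id, so drop it),
# then keep row 1 first and sort the rest by decreasing similarity
df[c(1, order(sapply(2:nrow(df), function(i)
  f1(unlist(df[1, -1]), unlist(df[i, -1]))),
  decreasing = TRUE) + 1), ]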
which gives,
  person var1 var2 var3
1      1    1    4    5
3      3    2    3    4
4      4    2    2    3
2      2    5    4    4
5      5    5    2    1
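Since the question also mentions Pearson, here is a minimal sketch of the same reordering using base R's cor(), which computes Pearson correlation by default. The names vals, r, ord and pearson_sorted are only illustrative, and with just three variables per person the correlations are necessarily coarse.

```r
# Same data as in the question
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# Drop the id column and transpose so that each column holds one person
vals <- t(df[-1])

# Pearson correlation of person 1 with every other person
r <- cor(vals[, 1], vals[, -1])

# Row 1 first, then the remaining rows from highest to lowest correlation
ord <- order(r, decreasing = TRUE) + 1
pearson_sorted <- df[c(1, ord), ]
pearson_sorted
```

Note that cor() returns NA (with a warning) for any person whose variables are all equal, so such rows would need separate handling.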
#2
You could try cosine() from the lsa package:
library('lsa')
cosine(t(df[-1]))
#          [,1]      [,2]      [,3]      [,4]      [,5]
#[1,] 1.0000000 0.8379571 0.9742160 0.9356015 0.5070926
#[2,] 0.8379571 1.0000000 0.9346460 0.9637388 0.8947540
#[3,] 0.9742160 0.9346460 1.0000000 0.9908302 0.6780635
#[4,] 0.9356015 0.9637388 0.9908302 1.0000000 0.7527727
#[5,] 0.5070926 0.8947540 0.6780635 0.7527727 1.0000000
You pass cosine() a matrix in which each column represents a person (that's why I use t()), and it calculates the cosine similarity between every pair of them.
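To get from the full similarity matrix to the ordering asked for in the question, one option is to sort on its first row. The sketch below builds the matrix in base R so that it runs without lsa; using the result of cosine(t(df[-1])) as sim instead should work the same way. The names m, norms, sim and cosine_sorted are assumptions for illustration.

```r
# Same data as in the question
df <- data.frame(person = 1:5,
                 var1 = c(1L, 5L, 2L, 2L, 5L),
                 var2 = c(4L, 4L, 3L, 2L, 2L),
                 var3 = c(5L, 4L, 4L, 3L, 1L))

# One row per person, id column dropped
m <- as.matrix(df[-1])

# Pairwise cosine similarity: dot products divided by products of row norms
norms <- sqrt(rowSums(m^2))
sim <- tcrossprod(m) / tcrossprod(norms)

# Row 1 of sim holds person 1's similarity to everyone; skip the
# self-similarity in column 1 and sort the rest in decreasing order
ord <- order(sim[1, -1], decreasing = TRUE) + 1
cosine_sorted <- df[c(1, ord), ]
cosine_sorted
```

Computing the whole matrix does more work than comparing only against row 1, but it lets you reuse sim to rank the rows against any other person by changing the row index.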