How to store sets so that similar patterns can be found quickly?

Time: 2021-10-01 19:15:00

(This is not homework and not a work issue. It's just a personal interest of mine and completely fictional. But I am interested in a good algorithm or data structure.)

Let's assume that I run a dating site, and that my special feature is that singles are matched by movie taste. (Why not?)

In that case I would need a way to store the movie ratings for each user. (So far no problem.) And I would need a data structure to find the best-fitting user. The distance between two taste patterns would be the average absolute difference between their ratings, taken over all movies that both users rated.

Example

movies   A B C D E F G H I J K L M ...
user Xm  9 5   1   1   5
user Ym      4 6 1         8
user Zf  9   6 4           7

Distance(X,Z) = avg( abs(9-9), abs(1-4) ) = 1.5

Distance(Y,Z) = avg( abs(4-6), abs(6-4), abs(8-7) ) = 1.666

So Mr. X is a slightly better fit for Mrs. Z than Mr. Y is.

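The distance described above can be sketched in a few lines of Python. This is only an illustration: the `distance` function, the dictionary layout, and the column letters (read off the alignment of the example table) are assumptions, and returning `None` for users with no movie in common is one possible choice the question does not specify.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie id -> rating


def distance(a: Ratings, b: Ratings) -> Optional[float]:
    """Mean absolute rating difference over the movies both users rated.

    Returns None when the users share no movie, because the average
    is undefined in that case (an assumption, not part of the question).
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(abs(a[m] - b[m]) for m in shared) / len(shared)


# Ratings from the example table; the column letters are read off its alignment.
x = {"A": 9, "B": 5, "D": 1, "F": 1, "H": 5}
y = {"C": 4, "D": 6, "E": 1, "J": 8}
z = {"A": 9, "C": 6, "D": 4, "J": 7}

print(distance(x, z))  # 1.5
print(distance(y, z))  # 1.6666666666666667
```

Only the movies rated by both users contribute, so a pair that shares one or two movies can end up with a very small distance by chance.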

I like solutions that ...

  • ... don't need many operations on the database

  • ... don't need to handle a lot of data

  • ... run fast

  • ... deliver the best matching

  • OK, maybe I would consider good approximations too.

Try to keep in mind that this should also work with thousands of possible movies, users that rate only about 20-50 movies, and thousands of users.


(Because this is a mental puzzle and not a real problem, workarounds don't really help.)

What would be your search algorithm or data structure?


3 solutions

#1


Sounds a lot like the Netflix Prize challenge, more specifically the first half of the most popular approach. The possible implementations of what you are trying to do are numerous and varied. None of them are exceptionally efficient, and the L1 metric is not a particularly good option for reliable correlations.


#2


Looks like you are looking for the nearest neighbor in the movie space. And your distance function is the L1 metric. You can probably use a spatial index of some kind. Maybe you can use techniques from collaborative filtering.

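One simple way to run such a nearest-neighbour search over sparse ratings is an inverted index from each movie to the users who rated it. This is not a spatial index in the strict sense, just a sketch of the idea; `build_index` and `nearest` are hypothetical names, and the sample data is taken from the question's example table.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, int]]  # user -> {movie -> rating}


def build_index(ratings: Ratings) -> Dict[str, List[Tuple[str, int]]]:
    """Map each movie to the (user, rating) pairs recorded for it."""
    index = defaultdict(list)
    for user, rated in ratings.items():
        for movie, rate in rated.items():
            index[movie].append((user, rate))
    return index


def nearest(user: str, ratings: Ratings, index) -> List[Tuple[str, float]]:
    """Return (other_user, distance) pairs, closest first.

    Only users who share at least one movie with `user` are visited.
    """
    total = defaultdict(int)   # sum of absolute rating differences
    shared = defaultdict(int)  # number of movies rated by both users
    for movie, rate in ratings[user].items():
        for other, other_rate in index[movie]:
            if other != user:
                total[other] += abs(rate - other_rate)
                shared[other] += 1
    pairs = [(other, total[other] / shared[other]) for other in total]
    return sorted(pairs, key=lambda pair: pair[1])


ratings = {
    "X": {"A": 9, "B": 5, "D": 1, "F": 1, "H": 5},
    "Y": {"C": 4, "D": 6, "E": 1, "J": 8},
    "Z": {"A": 9, "C": 6, "D": 4, "J": 7},
}
index = build_index(ratings)
print(nearest("Z", ratings, index))  # X (1.5) comes before Y (about 1.667)
```

The work per query is proportional to the number of ratings stored for the 20-50 movies the user has rated, rather than to the total number of ratings.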

#3


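-- One row per (user, movie) rating.
CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER);

-- Average absolute rating difference between :user and every other user,
-- taken over the movies both have rated; closest match first.
SELECT  other.user, AVG(ABS(me.rate - other.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY
    other.user
ORDER BY
    distance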

Complexity will be O(n^1.5) rather than O(n^2), as there will be n comparisons over sqrt(n) movies each (the average number of movies rated by both users in a pair).

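To try the query on the question's example, it can be run against an in-memory SQLite database from Python. This is a sketch under assumptions: the numeric ids for users and movies, the `data_movie` index, and the `matches` helper are all made up for illustration, and the query is written with `me` and `other` as the two aliases of the self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (user INTEGER, movie INTEGER, rate INTEGER)")
# Hypothetical index so the self-join can look up raters by movie.
conn.execute("CREATE INDEX data_movie ON data (movie, user)")

# Users X, Y, Z stored as 1, 2, 3; movies A..M stored as 1..13 (assumed ids).
rows = [
    (1, 1, 9), (1, 2, 5), (1, 4, 1), (1, 6, 1), (1, 8, 5),
    (2, 3, 4), (2, 4, 6), (2, 5, 1), (2, 10, 8),
    (3, 1, 9), (3, 3, 6), (3, 4, 4), (3, 10, 7),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

QUERY = """
SELECT  other.user, AVG(ABS(me.rate - other.rate)) AS distance
FROM    data me, data other
WHERE   me.user = :user
    AND other.user <> me.user
    AND other.movie = me.movie
GROUP BY other.user
ORDER BY distance
"""


def matches(user_id):
    """Return (other_user, distance) rows for user_id, closest first."""
    return conn.execute(QUERY, {"user": user_id}).fetchall()


# For Z (id 3) this lists X (id 1, distance 1.5) before Y (id 2, about 1.667).
print(matches(3))
```

Users who share no movie with the queried user do not appear in the result at all, because the join produces no rows for them.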
