Which algorithm should I use to find similar entries in a database?

Time: 2022-08-13 07:07:57

Let's say I have 1000 users for my app. I ask them 100 questions with yes/no answers, and I record those answers in a separate table.

Now, I want to see people who have given the same answers to at least 20 questions.

What kind of algorithm should I follow in order to do this? What are the relevant keywords for googling?

P.S. I work in a WAMP environment.

1 solution

#1


4  

Join your answers table to itself, selecting answers which share the same question_id and answer but have a different user_id. Group the rows by both user_ids and use a HAVING clause to exclude pairs with fewer than 20 matching answers.

Example where you are looking for users similar to your user with user_id "1":

SELECT DISTINCT a2.user_id FROM answers a
INNER JOIN answers a2
        ON a.question_id = a2.question_id
       AND a.answer = a2.answer
       AND a.user_id != a2.user_id
WHERE a.user_id = 1
GROUP BY a.user_id, a2.user_id
HAVING COUNT(*) >= 20;
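
To try the query without a MySQL server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table and column names (answers, user_id, question_id, answer) come from the query above; storing answer as 1 for yes and 0 for no, the helper names similar_users and build_demo_db, and the three demo users are assumptions made only for illustration. SQLite stands in for MySQL here, so treat it as a way to check the logic rather than as the WAMP setup itself.

```python
import sqlite3


def similar_users(conn, user_id, min_matches=20):
    """Return ids of users who gave the same answer as user_id
    on at least min_matches questions (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT a2.user_id
        FROM answers a
        INNER JOIN answers a2
                ON a.question_id = a2.question_id
               AND a.answer = a2.answer
               AND a.user_id != a2.user_id
        WHERE a.user_id = ?
        GROUP BY a.user_id, a2.user_id
        HAVING COUNT(*) >= ?
        ORDER BY a2.user_id
        """,
        (user_id, min_matches),
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create an in-memory answers table with three made-up users."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE answers ("
        " user_id INTEGER, question_id INTEGER, answer INTEGER,"
        " PRIMARY KEY (user_id, question_id))"
    )
    rows = []
    for q in range(1, 101):
        rows.append((1, q, 1))                    # user 1 says yes to everything
        rows.append((2, q, 1 if q <= 25 else 0))  # user 2 agrees with user 1 on 25 questions
        rows.append((3, q, 1 if q <= 10 else 0))  # user 3 agrees with user 1 on 10 questions
    conn.executemany("INSERT INTO answers VALUES (?, ?, ?)", rows)
    return conn


conn = build_demo_db()
print(similar_users(conn, 1))  # prints [2]
```

Only user 2 clears the threshold of 20 for user 1. Note that a shared "no" counts as a match just like a shared "yes", which is why users 2 and 3 turn out to be similar to each other in this demo data.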

Technically you don't need to group by a.user_id in this case but I've left it there in case you want to modify the WHERE clause to return results for more than one a.user_id.
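
As a sketch of that modification, the WHERE clause can be dropped entirely so that every pair of users is compared. The version below is one possible way to do it, not the only one: it swaps != for < so that each pair is reported once instead of twice, and it also returns the match count. It runs on an in-memory SQLite database from Python as a stand-in for MySQL; the helper name similar_pairs, the 0/1 encoding of answers, and the demo rows are assumptions for illustration.

```python
import sqlite3


def similar_pairs(conn, min_matches=20):
    """Return (user_a, user_b, matches) for every pair of users who
    share at least min_matches identical answers (hypothetical helper)."""
    return conn.execute(
        """
        SELECT a.user_id, a2.user_id, COUNT(*) AS matches
        FROM answers a
        INNER JOIN answers a2
                ON a.question_id = a2.question_id
               AND a.answer = a2.answer
               AND a.user_id < a2.user_id   -- each pair once, never a user with itself
        GROUP BY a.user_id, a2.user_id
        HAVING COUNT(*) >= ?
        ORDER BY a.user_id, a2.user_id
        """,
        (min_matches,),
    ).fetchall()


# Made-up demo data: 3 users, 100 yes/no questions stored as 1/0.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE answers (user_id INTEGER, question_id INTEGER, answer INTEGER)"
)
demo_rows = []
for q in range(1, 101):
    demo_rows.append((1, q, 1))                    # user 1: all yes
    demo_rows.append((2, q, 1 if q <= 25 else 0))  # user 2: yes on questions 1-25
    demo_rows.append((3, q, 1 if q <= 10 else 0))  # user 3: yes on questions 1-10
demo.executemany("INSERT INTO answers VALUES (?, ?, ?)", demo_rows)

print(similar_pairs(demo))  # prints [(1, 2, 25), (2, 3, 85)]
```

Here grouping by both user ids is required, since each group now represents one pair. The amount of work grows roughly with the square of the number of users, so on a real table an index covering question_id and answer is likely to help.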
