This query suggests friendship based on how many words users have in common. in_common sets this threshold.
I was wondering if it is possible to make this query completely percentage-based.
What I want to do is have a user suggested to the current user if 30% of their words match.
current_user total words: 100
in_common threshold: 30%
some_other_user total words: 10
3 of those 10 words match current_user's list.
Since 3 is 30% of 10, this is a match for the current user.
Possible?
SELECT users.name_surname, users.avatar, t1.qty, GROUP_CONCAT(words_en.word) AS in_common, (users.id) AS friend_request_id
FROM (
SELECT c2.user_id, COUNT(*) AS qty
FROM `connections` c1
JOIN `connections` c2
ON c1.user_id <> c2.user_id
AND c1.word_id = c2.word_id
WHERE c1.user_id = :user_id
GROUP BY c2.user_id
HAVING count(*) >= :in_common) as t1
JOIN users
ON t1.user_id = users.id
JOIN connections
ON connections.user_id = t1.user_id
JOIN words_en
ON words_en.id = connections.word_id
WHERE EXISTS(SELECT *
FROM connections
WHERE connections.user_id = :user_id
AND connections.word_id = words_en.id)
GROUP BY users.id, users.name_surname, users.avatar, t1.qty
ORDER BY t1.qty DESC, users.name_surname ASC
SQL fiddle: http://www.sqlfiddle.com/#!2/c79a6/9
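Something along these lines is what I am after (untested sketch; :in_common_pct is a new parameter that would be 0.30 here):
SELECT c2.user_id, COUNT(*) AS qty
FROM `connections` c1
JOIN `connections` c2
  ON c1.user_id <> c2.user_id
  AND c1.word_id = c2.word_id
JOIN (SELECT user_id, COUNT(*) AS total_words   -- each user's total word count
      FROM `connections`
      GROUP BY user_id) t
  ON t.user_id = c2.user_id
WHERE c1.user_id = :user_id
GROUP BY c2.user_id, t.total_words
HAVING COUNT(*) / t.total_words >= :in_common_pct   -- at least 30% of the other user's words in common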
3 Answers
#1
1
May I suggest a different way to look at your problem?
You might look into a similarity metric, such as Cosine Similarity, which will give you a much better measure of similarity between your users based on words. To understand it for your case, consider the following example. You have a vector of words A = {house, car, burger, sun} for a user u1 and another vector B = {flat, car, pizza, burger, cloud} for user u2.
Given these individual vectors, you first construct another representation that puts them together, so you can map, for each user, whether he/she has each word in their vector or not. Like so:
| -- | house | car | burger | sun | flat | pizza | cloud |
----------------------------------------------------------
| A | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
----------------------------------------------------------
| B | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
----------------------------------------------------------
Now you have a vector for each user, where each position corresponds to the value of that word for that user. Here it represents a simple count, but you can improve it using different metrics based on word frequency if that applies to your case. Take a look at the most common one, called tf-idf.
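(For reference, the usual tf-idf weight of a term t in a document d is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often t appears in d, N is the number of documents, and df(t) is the number of documents containing t.)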
Having these two vectors, you can compute the cosine similarity between them as follows:
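cos(A, B) = (A · B) / (|A| × |B|); here A · B = 2, |A| = 2, |B| = √5, so cos(A, B) = 2 / (2 × √5) ≈ 0.45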
This basically computes the sum of the products of each position of the two vectors above, divided by the product of their magnitudes. In our example that is about 0.45, on a scale that varies between 0 and 1; the higher the value, the more similar the two vectors are.
If you choose to go this way, you don't need to do this calculation in the database. You compute the similarity in your code and just save the result in the database. There are several libraries that can do that for you. In Python, take a look at the numpy library. In Java, look at Weka and/or Apache Lucene.
#2
3
OK, so the issue is that "in common" as defined here is an asymmetric relation. To fix it, let's assume that the in_common percentage threshold is checked against the user with the fewer words.
Try this query (fiddle); it gives you the full list of users with at least 1 word in common and marks the friendship suggestions:
SELECT user1_id, user2_id, user1_wc, user2_wc,
count(*) AS common_wc, count(*) / least(user1_wc, user2_wc) AS common_wc_pct,
CASE WHEN count(*) / least(user1_wc, user2_wc) > 0.7 THEN 1 ELSE 0 END AS friendship_suggestion
FROM (
SELECT u1.user_id AS user1_id, u2.user_id AS user2_id,
u1.word_count AS user1_wc, u2.word_count AS user2_wc,
c1.word_id AS word1_id, c2.word_id AS word2_id
FROM connections c1
JOIN connections c2 ON (c1.user_id < c2.user_id AND c1.word_id = c2.word_id)
JOIN (SELECT user_id, count(*) AS word_count
FROM connections
GROUP BY user_id) u1 ON (c1.user_id = u1.user_id)
JOIN (SELECT user_id, count(*) AS word_count
FROM connections
GROUP BY user_id) u2 ON (c2.user_id = u2.user_id)
) AS shared_words
GROUP BY user1_id, user2_id, user1_wc, user2_wc;
friendship_suggestion is in the SELECT for clarity; you probably need to filter by it, so you may just move that condition into a HAVING clause.
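In other words, the filtered version would look something like this (same query, with the check moved into HAVING):
SELECT user1_id, user2_id, user1_wc, user2_wc,
count(*) AS common_wc, count(*) / least(user1_wc, user2_wc) AS common_wc_pct
FROM (
SELECT u1.user_id AS user1_id, u2.user_id AS user2_id,
u1.word_count AS user1_wc, u2.word_count AS user2_wc,
c1.word_id AS word1_id, c2.word_id AS word2_id
FROM connections c1
JOIN connections c2 ON (c1.user_id < c2.user_id AND c1.word_id = c2.word_id)
JOIN (SELECT user_id, count(*) AS word_count
FROM connections
GROUP BY user_id) u1 ON (c1.user_id = u1.user_id)
JOIN (SELECT user_id, count(*) AS word_count
FROM connections
GROUP BY user_id) u2 ON (c2.user_id = u2.user_id)
) AS shared_words
GROUP BY user1_id, user2_id, user1_wc, user2_wc
HAVING count(*) / least(user1_wc, user2_wc) > 0.7;  -- keep only pairs above the 70% threshold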
#3
2
I throw this option into your querying consideration... The first part of the FROM clause does nothing but get the one user you are using as the basis for finding all others with common words. The WHERE clause is for that one user (aliased as OnePerson).
Then it is added to the FROM clause (WITHOUT A JOIN). Since the OnePerson result will always be a single record, we want its total word count available; that said, I didn't actually see how your 100-to-30 threshold would work if another person only had 10 words and matched 3... I actually think it's bloat and unnecessary, as you'll see later in the WHERE of PreQuery.
So, the next table is the connections table (aliased c2), and that is a normal join to the words table for each of the "other" people being considered.
This c2 is then joined again to the connections table, aliased OnesWords, based on the common word id -- AND -- the OnesWords user ID being that of the primary user_id being compared against. This OnesWords alias is joined to the words table so that IF THERE IS a match to the primary person, we can grab that "common word" as part of the GROUP_CONCAT().
So, now we grab the original single person's total words (still not SURE you need it), a count of ALL the words for the other person, and a count (via SUM/CASE WHEN) of all the words that ARE IN COMMON with the original person, grouped by the "other" user ID. This gets them all, and the result is aliased "PreQuery".
Now, from that, we can join to the users table to get the name and avatar along with the respective counts and common words, and apply the WHERE clause based on each "other user's" total available words relative to the words "in common" with the first person (see... I didn't think you NEEDED the original query/count as the basis of the percentage consideration).
SELECT
u.name_surname,
u.avatar,
PreQuery.*
from
( SELECT
c2.user_id,
OnePerson.TotalWords,
COUNT(*) as OtherUserWords,
GROUP_CONCAT(words_en.word) AS InCommonWords,
SUM( case when OnesWords.word_id IS NULL then 0 else 1 end ) as InCommonWithOne
from
( SELECT c1.user_id,
COUNT(*) AS TotalWords
from
`connections` c1
where
c1.user_id = :PrimaryPersonBasis
group by
c1.user_id ) OnePerson
CROSS JOIN `connections` c2
LEFT JOIN `connections` OnesWords
ON c2.word_id = OnesWords.word_id
AND OnesWords.user_id = OnePerson.User_ID
LEFT JOIN words_en
ON OnesWords.word_id = words_en.id
where
c2.user_id <> OnePerson.User_ID
group by
c2.user_id ) PreQuery
JOIN users u
ON PreQuery.user_id = u.id
where
PreQuery.InCommonWithOne >= PreQuery.OtherUserWords * :nPercentToConsider
order by
PreQuery.InCommonWithOne DESC,
u.name_surname
Here's a revised version WITHOUT the need to prequery the total original words of the first person.
SELECT
u.name_surname,
u.avatar,
PreQuery.*
from
( SELECT
c2.user_id,
COUNT(*) as OtherUserWords,
GROUP_CONCAT(words_en.word) AS InCommonWords,
SUM( case when OnesWords.word_id IS NULL then 0 else 1 end ) as InCommonWithOne
from
`connections` c2
LEFT JOIN `connections` OnesWords
ON c2.word_id = OnesWords.word_id
AND OnesWords.user_id = :PrimaryPersonBasis
LEFT JOIN words_en
ON OnesWords.word_id = words_en.id
where
c2.user_id <> :PrimaryPersonBasis
group by
c2.user_id
having
SUM( case when OnesWords.word_id IS NULL then 0 else 1 end ) >=
COUNT(*) * :nPercentToConsider ) PreQuery
JOIN users u
ON PreQuery.user_id = u.id
order by
PreQuery.InCommonWithOne DESC,
u.name_surname
There might be some tweaking needed on the query, but your original query leads me to believe you can easily fix simple things like alias or field name typos.
Another option might be to prequery ALL users and how many words each has UP FRONT, then use the primary person's words to compare against everyone else explicitly ON those common words... This might be more efficient, as the multiple joins would be working against a smaller result set. What if you have 10,000 users, user A has 30 words, and only 500 other users have one or more of those words in common... why compare against all 10,000? Having a simple up-front summary of each user and how many words they have should be an almost instant query basis.
SELECT
u.name_surname,
u.avatar,
PreQuery.*
from
( SELECT
OtherUser.User_ID,
AllUsers.EachUserWords,
COUNT(*) as CommonWordsCount,
group_concat( words_en.word ) as InCommonWords
from
`connections` OneUser
JOIN words_en
ON OneUser.word_id = words_en.id
JOIN `connections` OtherUser
ON OneUser.word_id = OtherUser.word_id
AND OneUser.user_id <> OtherUser.user_id
JOIN ( SELECT
c1.user_id,
COUNT(*) as EachUserWords
from
`connections` c1
group by
c1.user_id ) AllUsers
ON OtherUser.user_id = AllUsers.User_ID
where
OneUser.user_id = :nPrimaryUserToConsider
group by
OtherUser.User_id,
AllUsers.EachUserWords ) as PreQuery
JOIN users u
ON PreQuery.user_id = u.id
where
PreQuery.CommonWordsCount >= PreQuery.EachUserWords * :nPercentToConsider
order by
PreQuery.CommonWordsCount DESC,
u.name_surname