I'd really appreciate it if someone could validate my SQL query.
For the following dataset:
MD5 UserPK CategoryPK
ADCDE 1 7
ADCDE 1 4
ADCDE 1 7
dffrf 1 7
dffrf 2 7
dffrf 2 6
dffrf 1 1
I'd like to select MD5 and CategoryPK where two or more rows exist with identical MD5 values, identical CategoryPK values, and two or more DIFFERENT UserPK values.
In other words, I'd like to know the MD5 and CategoryPK of all records where two or more different users (UserPK) have assigned the same category (CategoryPK) to the same file (MD5). I'm not interested in records where the same user has assigned the category multiple times (unless a different user has also assigned the same category to that file).
So from the above data, I would like only the following row to be returned:
md5 CategoryPK
dffrf 7
The query I've written is:
SELECT md5,
count(md5),
count(distinct categorypk) as cntcat,
count(distinct userpk) as cntpk
FROM Hash
group by md5 having count(md5) > 1
and cntpk > 1
and cntcat = 1;
It seems to work, but before I start using it in anger, I'd appreciate a second opinion in case I've missed something or if there is a better way of doing it.
Thanks
2 solutions
#1
11
I don't think your code will give you what you're after; what happens when a file has been assigned more than one category by multiple users, with some categories overlapping? Then cntcat != 1, so your HAVING clause will fail to match even though the file has indeed been categorised the same way by multiple users.
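The sample data in the question already shows this failure case. Here is a minimal sketch, assuming SQLite through Python's built-in sqlite3 module (the question does not say which database is in use) and repeating the aggregates in HAVING rather than relying on aliases. The variable names `per_file` and `counts` are only for illustration.

```python
import sqlite3

# Sample rows copied from the question: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)

# The question's per-file grouping, with the same three conditions.
per_file = conn.execute("""
    SELECT MD5
    FROM Hash
    GROUP BY MD5
    HAVING COUNT(MD5) > 1
       AND COUNT(DISTINCT UserPK) > 1
       AND COUNT(DISTINCT CategoryPK) = 1
""").fetchall()

# Distinct category and user counts per file, to show why nothing matches.
counts = conn.execute("""
    SELECT MD5, COUNT(DISTINCT CategoryPK), COUNT(DISTINCT UserPK)
    FROM Hash
    GROUP BY MD5
    ORDER BY MD5
""").fetchall()

print(per_file)  # []
print(counts)    # [('ADCDE', 2, 1), ('dffrf', 3, 2)]
```

File dffrf carries three distinct categories (7, 6 and 1), so the "exactly one category" condition rejects it even though users 1 and 2 both assigned category 7.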
I would instead use a self-join:
SELECT a.MD5, a.CategoryPK
FROM Hash a
JOIN Hash b
ON a.MD5 = b.MD5
AND a.UserPK <> b.UserPK
AND a.CategoryPK = b.CategoryPK
GROUP BY a.MD5, a.CategoryPK
HAVING COUNT(DISTINCT a.UserPK) >= 2 -- "two or more" different users
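As a quick check of the threshold, here is a minimal sketch that runs this self-join against the question's sample rows. It assumes SQLite through Python's built-in sqlite3 module, which may not be the database actually in use, and the `shared_categories` helper with its `min_users` parameter is a hypothetical name for illustration.

```python
import sqlite3

# Sample rows copied from the question: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)


def shared_categories(conn, min_users):
    """Run the self-join, keeping groups with at least min_users distinct users."""
    return conn.execute("""
        SELECT a.MD5, a.CategoryPK
        FROM Hash a
        JOIN Hash b
          ON a.MD5 = b.MD5
         AND a.UserPK <> b.UserPK
         AND a.CategoryPK = b.CategoryPK
        GROUP BY a.MD5, a.CategoryPK
        HAVING COUNT(DISTINCT a.UserPK) >= ?
    """, (min_users,)).fetchall()


print(shared_categories(conn, 2))  # [('dffrf', 7)]
print(shared_categories(conn, 3))  # []
```

Requiring at least two distinct users returns the row the question expects. Requiring more than two (that is, three or more) returns nothing for this data, because only users 1 and 2 share category 7 on dffrf.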
#2
1
I can't see any problems with what you have written, apart from the fact that the category is not in your select list even though it appears in your criteria. I think you could simplify it slightly and get the category out:
SELECT MD5, CategoryPK
FROM Hash
GROUP BY MD5, CategoryPK
HAVING MIN(UserPK) <> MAX(UserPK)
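MIN and MAX differ exactly when a group holds at least two distinct non-NULL UserPK values, so this is equivalent to counting distinct users. Here is a minimal sketch that checks both forms on the question's sample rows, assuming SQLite through Python's built-in sqlite3 module rather than whatever database is actually in use. The names `min_max_rows` and `count_rows` are only for illustration.

```python
import sqlite3

# Sample rows copied from the question: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)

# Grouped form from the answer: one group per (file, category) pair.
min_max_rows = conn.execute("""
    SELECT MD5, CategoryPK
    FROM Hash
    GROUP BY MD5, CategoryPK
    HAVING MIN(UserPK) <> MAX(UserPK)
""").fetchall()

# More explicit spelling of the same condition.
count_rows = conn.execute("""
    SELECT MD5, CategoryPK
    FROM Hash
    GROUP BY MD5, CategoryPK
    HAVING COUNT(DISTINCT UserPK) >= 2
""").fetchall()

print(min_max_rows)  # [('dffrf', 7)]
print(count_rows)    # [('dffrf', 7)]
```

The (ADCDE, 7) group has two rows but both belong to user 1, so it is correctly excluded.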
Alternatively, you could solve this with a join. You may need to run a few tests and use EXPLAIN, but joins sometimes perform better than GROUP BY. It is worth trying anyway to see whether there is any significant difference.
SELECT DISTINCT T1.MD5, T2.CategoryPK
FROM Hash T1
INNER JOIN Hash T2
ON T1.MD5 = T2.MD5
AND T1.CategoryPK = T2.CategoryPK
AND T1.UserPK < T2.UserPK
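To compare the two approaches, a sketch along these lines checks that the join and the GROUP BY version return the same rows on the question's sample data and prints the plan for each. It assumes SQLite through Python's built-in sqlite3 module, where the planner output comes from EXPLAIN QUERY PLAN. The plan text differs between SQLite versions and from MySQL's EXPLAIN, so treat it only as a starting point for testing on the real database.

```python
import sqlite3

# Sample rows copied from the question: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)

JOIN_SQL = """
    SELECT DISTINCT T1.MD5, T2.CategoryPK
    FROM Hash T1
    INNER JOIN Hash T2
       ON T1.MD5 = T2.MD5
      AND T1.CategoryPK = T2.CategoryPK
      AND T1.UserPK < T2.UserPK
"""

GROUP_SQL = """
    SELECT MD5, CategoryPK
    FROM Hash
    GROUP BY MD5, CategoryPK
    HAVING MIN(UserPK) <> MAX(UserPK)
"""

join_rows = conn.execute(JOIN_SQL).fetchall()
group_rows = conn.execute(GROUP_SQL).fetchall()
print(sorted(join_rows) == sorted(group_rows))  # True

# Print the planner's description of each query (last column is the detail text).
for label, sql in (("join", JOIN_SQL), ("group by", GROUP_SQL)):
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    print(label, [row[-1] for row in plan])
```

Using `T1.UserPK < T2.UserPK` instead of `<>` pairs each two users only once, which halves the joined rows before DISTINCT is applied.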