Complex SQL join with GROUP BY

Date: 2022-07-31 01:52:57

I'm trying to optimize a query that takes a long time to run. The goal of the query is to find the most similar F2 values (using a particular similarity measure). This is an example of what I have:

CREATE TABLE Test
(
   F1 varchar(124),
   F2 varchar(124),
   F3 varchar(124)
);
INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'C');
INSERT INTO Test (F1, F2, F3) VALUES ('D', 'B', 'E');
INSERT INTO Test (F1, F2, F3) VALUES ('F', 'I', 'G');
INSERT INTO Test (F1, F2, F3) VALUES ('F', 'I', 'G');
INSERT INTO Test (F1, F2, F3) VALUES ('D', 'B', 'C');
INSERT INTO Test (F1, F2, F3) VALUES ('F', 'B', 'G');
INSERT INTO Test (F1, F2, F3) VALUES ('D', 'I', 'C');
INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'C');
INSERT INTO Test (F1, F2, F3) VALUES ('A', 'B', 'K');
INSERT INTO Test (F1, F2, F3) VALUES ('A', 'K', 'K');

This is the query I am running:

SELECT B.F2, COUNT(*) AS CNT
FROM
(
    SELECT F1, F3 FROM Test
    WHERE F2 = 'B'
) AS A
INNER JOIN Test AS B
    ON A.F1 = B.F1 AND A.F3 = B.F3
GROUP BY B.F2
ORDER BY CNT DESC

The table has more than 1 million rows. What would be a better way to do this?

5 answers

#1


2  

A filtered search for all rows WHERE F2 = 'B' will incur a full table scan unless you create an index that has F2 as its first or only column. Further down, the join condition involves columns F1 and F3, which you mention are already part of an index that begins with F1.
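
As a rough sketch of the index described above, assuming SQL Server syntax: the index name `IX_Test_F2` is made up for illustration, and the `INCLUDE` columns are an optional addition (not something the question mentions) that lets the filtered subquery be answered from the index alone.

```sql
-- Hypothetical index name; assumes SQL Server.
-- F2 is the leading key column, so WHERE F2 = 'B' can use an index seek
-- instead of a full table scan. INCLUDE (F1, F3) is an assumption that
-- makes the index cover the subquery "SELECT F1, F3 ... WHERE F2 = 'B'".
CREATE NONCLUSTERED INDEX IX_Test_F2
    ON Test (F2)
    INCLUDE (F1, F3);
```

Whether the optimizer actually picks this index depends on how selective F2 = 'B' is, so it is worth checking the execution plan before and after.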

I also notice that the first part of your query doesn't eliminate duplicates from the set of (F1, F3) pairs where F2 = 'B', as one might expect when joining that set right back against the same table. You may have a reason for doing this, but we can't know for sure until you provide some details about the similarity measure you're trying to implement.
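
To see whether duplicates matter for your data, a small diagnostic query along these lines may help. It is only a sketch; the column alias `copies` is made up for illustration.

```sql
-- Lists every (F1, F3) pair that occurs more than once among the F2 = 'B' rows.
-- Each such pair has its matches counted "copies" times by the original join.
SELECT F1, F3, COUNT(*) AS copies
FROM Test
WHERE F2 = 'B'
GROUP BY F1, F3
HAVING COUNT(*) > 1;
-- On the sample data from the question this returns a single row: A, C, 2
```

If this returns no rows on the real table, the original query and a de-duplicated version give the same counts.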

Your ORDER BY clause is also affecting the query run time by incurring a potentially large, internal sort on the final result set.

#2


3  

You can also write your query in this form. Because it uses a single SELECT, the retrieval time may be reduced:

SELECT  Test_1.F2, COUNT(Test_1.F1) AS Cnt 
FROM    Test 
INNER JOIN Test AS Test_1 ON Test.F1 = Test_1.F1 AND Test.F3 = Test_1.F3 
WHERE   (Test.F2 = 'B') 
GROUP BY Test_1.F2

#3


3  

Here is another way to write your query. It is close to guido's answer, but runnable in MS SQL.

WITH Filtered AS (SELECT DISTINCT F1,F3 FROM Test WHERE F2='B')
SELECT B.f2,COUNT(*) AS CNT
  FROM Test B
       INNER JOIN Filtered
           ON B.F1 = Filtered.F1 AND B.F3 = Filtered.F3
 GROUP BY B.F2
 ORDER BY CNT DESC

I think your original query might have a bug, like Fred mentioned. In your example, the count for F2 = 'B' should be 6, not 8, is that right? The pair (A, C) appears twice among the F2 = 'B' rows, so without DISTINCT its two matching rows are counted twice. If 8 is intended, take out DISTINCT.

Another thing you might try is to make the Test table's clustered index (F2, F1, F3), and to create an additional non-clustered index on (F1, F3).
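
A possible way to set up those two indexes, assuming SQL Server and a table that has no clustered index yet. Both index names are made up for illustration.

```sql
-- Hypothetical index names; assumes Test currently has no clustered index.
-- The sample data contains duplicate rows, so the clustered index is
-- deliberately not declared UNIQUE.
CREATE CLUSTERED INDEX CIX_Test_F2_F1_F3
    ON Test (F2, F1, F3);

-- Supports the join on (F1, F3). In SQL Server a non-clustered index also
-- carries the clustering key, so F2 is available from this index as well.
CREATE NONCLUSTERED INDEX IX_Test_F1_F3
    ON Test (F1, F3);
```

Building a clustered index rewrites the whole table, so on 1m+ rows this is best tried on a copy first.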

Sample code is also available on SqlFiddle.

#4


1  

If your Test table has 1m+ rows, the joined temporary table on which you group would easily have hundreds of millions of rows.

This would work in MySQL but, as far as I know, not in SQL Server:

SELECT F2,COUNT(*)
FROM Test AS B 
WHERE (B.F1,B.F3) IN (
  SELECT F1,F3 FROM Test
  WHERE F2='B') 
GROUP BY F2
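
For SQL Server, which does not accept a multi-column `IN`, the same semi-join can be sketched with `EXISTS`. This is an equivalent rewrite offered as an illustration, not part of the original answer; the alias `A` is arbitrary.

```sql
-- Counts each row of Test once if at least one F2 = 'B' row shares its (F1, F3).
-- Like the IN version, this ignores duplicate (F1, F3) pairs among the 'B' rows.
SELECT B.F2, COUNT(*) AS CNT
FROM Test AS B
WHERE EXISTS (
    SELECT 1
    FROM Test AS A
    WHERE A.F2 = 'B'
      AND A.F1 = B.F1
      AND A.F3 = B.F3
)
GROUP BY B.F2
ORDER BY CNT DESC;
-- On the sample data from the question: B = 6, I = 3, K = 1
```

Note that these counts match the DISTINCT variant of the join (6 for B), not the original query (8 for B).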

#5


1  

I realize this has already been answered, but I think this approach might be much faster, particularly if F1 and F3 have many duplicate values:

SELECT B.f2, sum(A.cnt) AS CNT  
FROM (select F1, F3, count(*) as cnt
      from Test
      where F2='B'
      group by f1, f3
     ) A INNER JOIN
     Test B
     ON A.F1 = B.F1 AND A.F3 = B.F3
GROUP BY B.F2 
ORDER BY CNT DESC

If F1 and F3 don't have very many combinations, then the first subquery should reduce to a few hundred or a few thousand rows. (Your sample data uses single capital letters, so there would be 26 × 26 = 676 combinations if all letters are used.) SQL Server will probably do a merge or hash join on the result, which should perform well.
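
To check that premise on the real table, a quick count of the distinct pairs could look like this. It is a sketch only; the aliases `combos` and `d` are made up.

```sql
-- Number of rows the pre-aggregated subquery would produce:
-- one per distinct (F1, F3) pair among the F2 = 'B' rows.
SELECT COUNT(*) AS combos
FROM (
    SELECT DISTINCT F1, F3
    FROM Test
    WHERE F2 = 'B'
) AS d;
-- On the sample data from the question this returns 5
```

If the result is close to the number of F2 = 'B' rows, pre-aggregating saves little and the other rewrites may be just as good.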

You can also do this without the join, using window functions:

select t.f2, sum(nummatches) as cnt
from (select t.*,
             sum(isB) over (partition by f1, f3) as nummatches
      from (select t.*,
                   (case when F2 = 'B' then 1 else 0 end) as IsB
            from test t
           ) t
     ) t
group by t.f2
order by 2 desc

The window functions often perform better because they work on smaller chunks of the data.
