Join performance before vs. after a union

Time: 2021-05-28 00:04:59

Let's say we have a query that essentially uses a UNION to combine two record sets into one. Now I need to duplicate the records, typically by way of a join. I feel option 1 is the best bet for performance reasons, but I was wondering what the SQL query experts think.

Basically, I "know" the answer is "1". But, I am also wondering, could I be wrong - is there a side of this I might be missing?

(SQL Server) Here are my options.

pseudo-code

Original Query:

Select Name, Category from t1
Union
Select Name, Category from t2

Option 1)

Select Name, Category from t1
Inner Join (here)
Union
Select Name, Category from t2
Same inner Join (here)

Option 2)

Select * from (
Select Name, Category from t1
Union
Select Name, Category from t2
) t
(Inner Join Here)

4 answers

#1


5  

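-- Option 1: join each branch to t_right, then UNION the results
SELECT  t1.Name, t1.Category
FROM    t1
JOIN    t_right
ON      t_right.Category = t1.Category
UNION
SELECT  t2.Name, t2.Category
FROM    t2
JOIN    t_right
ON      t_right.Category = t2.Category

-- Option 2: UNION first, then join the merged set to t_right once
SELECT  *
FROM    (
        SELECT  Name, Category
        FROM    t1
        UNION
        SELECT  Name, Category
        FROM    t2
        ) t
JOIN    t_right
ON      t_right.Category = t.Category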

These queries are not identical: the second one can return duplicates if more than one record in the right table satisfies the join condition, like this:

t1

Name   Category
---    ---
Apple  1


t2

Name   Category
---    ---
Apple  1

t_right

Category
---
1
1

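The first query returns (Apple, 1) once; the second returns it twice. In the first query, UNION removes duplicates after the join. In the second, the join runs after the UNION has already deduplicated, so each matching row in t_right produces another copy.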
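
The sample data above can be checked with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made so the example is self-contained), relying on UNION and inner joins following standard SQL semantics in both engines.

```python
import sqlite3

# SQLite stands in for SQL Server here (an assumption); UNION and JOIN
# behave the same way for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Name TEXT, Category INTEGER);
    CREATE TABLE t2 (Name TEXT, Category INTEGER);
    CREATE TABLE t_right (Category INTEGER);
    INSERT INTO t1 VALUES ('Apple', 1);
    INSERT INTO t2 VALUES ('Apple', 1);
    INSERT INTO t_right VALUES (1), (1);
""")

# Option 1: join each branch, then UNION deduplicates the joined rows.
option1 = conn.execute("""
    SELECT t1.Name, t1.Category FROM t1
    JOIN t_right ON t_right.Category = t1.Category
    UNION
    SELECT t2.Name, t2.Category FROM t2
    JOIN t_right ON t_right.Category = t2.Category
""").fetchall()

# Option 2: UNION deduplicates first, then the join multiplies the row.
option2 = conn.execute("""
    SELECT t.Name, t.Category
    FROM (
        SELECT Name, Category FROM t1
        UNION
        SELECT Name, Category FROM t2
    ) t
    JOIN t_right ON t_right.Category = t.Category
""").fetchall()

print(option1)  # [('Apple', 1)]
print(option2)  # [('Apple', 1), ('Apple', 1)]
```

If t_right.Category were unique, both options would return the same rows, and the choice would be purely about performance.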

Performance-wise, it's hard to tell which query will be more efficient until we see your data:

  • The first option can gain efficiency by applying a different join algorithm to each of the two subqueries.

  • The second option can gain efficiency by reading the right table only once.

As a very rough rule of thumb, the first option will be more efficient if the join condition is selective on t1 and t2, while the second option will be more efficient if it is not.

However, in simple cases (a join on a sargable condition with few values of high cardinality) SQL Server's optimizer will push the concatenation out of the subquery so that it will be identical to the following query:

SELECT  t.Name, t.Category
FROM    t_right
CROSS APPLY
        (
        SELECT  t1.Name, t1.Category
        FROM    t1
        WHERE   t1.Category = t_right.Category
        UNION
        SELECT  t2.Name, t2.Category
        FROM    t2
        WHERE   t2.Category = t_right.Category
        ) t

#2


1  

There are several different factors that could affect performance in this case. For instance, putting the result of the union into a temporary table first might be worth the cost, compared with the two index scans you get from doing the join twice.
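
A minimal sketch of the temporary-table idea, again using Python's sqlite3 as a stand-in for SQL Server (an assumption). The table name `merged` and the sample rows are hypothetical; on SQL Server the materialized set would typically be a `#temp` table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Name TEXT, Category INTEGER);
    CREATE TABLE t2 (Name TEXT, Category INTEGER);
    CREATE TABLE t_right (Category INTEGER);
    INSERT INTO t1 VALUES ('Apple', 1), ('Pear', 2);
    INSERT INTO t2 VALUES ('Apple', 1), ('Plum', 3);
    INSERT INTO t_right VALUES (1), (3);

    -- Materialize the union once; 'merged' is a hypothetical name.
    CREATE TEMP TABLE merged AS
        SELECT Name, Category FROM t1
        UNION
        SELECT Name, Category FROM t2;
""")

# The join now touches t_right once, against the already merged set.
rows = conn.execute("""
    SELECT m.Name, m.Category
    FROM merged AS m
    JOIN t_right ON t_right.Category = m.Category
    ORDER BY m.Name
""").fetchall()

print(rows)  # [('Apple', 1), ('Plum', 3)]
```

Whether the extra write to the temporary table pays off depends on the data volumes, which is why it has to be measured.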

We could yap about it all day, but... Simple answer: test each one and see which has the most efficient query plan and/or best execution time. That's the only way to really tell.
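
"Test each one" can be sketched as a tiny timing harness. Everything here is an assumption for illustration: the synthetic data, the row counts, the helper names `build_db` and `time_query`, and the use of SQLite instead of SQL Server. The timings it prints say nothing about SQL Server's optimizer; they only show the method.

```python
import random
import sqlite3
import time

def build_db(rows=20000, categories=500, seed=0):
    """Create hypothetical t1, t2 and t_right tables with synthetic data."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t1 (Name TEXT, Category INTEGER);
        CREATE TABLE t2 (Name TEXT, Category INTEGER);
        CREATE TABLE t_right (Category INTEGER PRIMARY KEY);
    """)
    for table in ("t1", "t2"):
        data = [(f"name{rng.randrange(rows)}", rng.randrange(categories * 2))
                for _ in range(rows)]
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", data)
    conn.executemany("INSERT INTO t_right VALUES (?)",
                     [(c,) for c in range(categories)])
    conn.commit()
    return conn

OPTION_1 = """
    SELECT t1.Name, t1.Category FROM t1
    JOIN t_right ON t_right.Category = t1.Category
    UNION
    SELECT t2.Name, t2.Category FROM t2
    JOIN t_right ON t_right.Category = t2.Category
"""

OPTION_2 = """
    SELECT t.Name, t.Category
    FROM (
        SELECT Name, Category FROM t1
        UNION
        SELECT Name, Category FROM t2
    ) t
    JOIN t_right ON t_right.Category = t.Category
"""

def time_query(conn, sql, repeats=5):
    """Return (best wall-clock seconds, rows) over several runs."""
    best = float("inf")
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows

conn = build_db()
t_opt1, rows1 = time_query(conn, OPTION_1)
t_opt2, rows2 = time_query(conn, OPTION_2)
print(f"option 1: {t_opt1:.4f}s, option 2: {t_opt2:.4f}s")
# t_right.Category is unique here, so both options return the same rows.
print(sorted(rows1) == sorted(rows2))
```

On SQL Server itself, the equivalent check would be to compare the actual execution plans and the output of SET STATISTICS IO ON and SET STATISTICS TIME ON for each query against realistic data.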

#3


0  

As a baseline, I'd go with option 2, because--if all else is equal (there are always special cases and exceptions)--it should be quicker.

In option 1, you read t1 and join it to "here" (one read of "here"), then you read t2 and join it to "here" (a second read), then union the results together.

In option 2, you read t1, then read t2, union them together, and then join the merged set (distinct or not, depending on whether UNION or UNION ALL is used) to "here".

In other words, in option 1 you read table "here" twice, and in option 2 you read it once. It could be one row in a table in memory, but it's still a read.
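
The "distinct or not" remark about option 2 can be illustrated with a short sketch. It assumes SQLite (via Python's sqlite3) as a stand-in for SQL Server, a concrete table named `here` in place of the placeholder, and a hypothetical helper `option2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Name TEXT, Category INTEGER);
    CREATE TABLE t2 (Name TEXT, Category INTEGER);
    CREATE TABLE here (Category INTEGER);
    INSERT INTO t1 VALUES ('Apple', 1);
    INSERT INTO t2 VALUES ('Apple', 1);
    INSERT INTO here VALUES (1);
""")

def option2(union_keyword):
    """Run option 2 with either UNION or UNION ALL in the subquery."""
    sql = f"""
        SELECT t.Name, t.Category
        FROM (
            SELECT Name, Category FROM t1
            {union_keyword}
            SELECT Name, Category FROM t2
        ) t
        JOIN here ON here.Category = t.Category
    """
    return conn.execute(sql).fetchall()

distinct_rows = option2("UNION")      # merged set is deduplicated
all_rows = option2("UNION ALL")       # merged set keeps both copies

print(distinct_rows)  # [('Apple', 1)]
print(all_rows)       # [('Apple', 1), ('Apple', 1)]
```

UNION ALL skips the deduplication step, so it is usually cheaper, but it only gives the same result as UNION when the two inputs cannot produce the same row.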

#4


0  

In simple cases option 2 is better because the index seek on the "(Inner Join Here)" table will be done only once.
