Why is a union faster than a group by?

Well, maybe I am too old school and I would like to understand the following.

query 1.

select count(*), gender from customer
group by gender

query 2.

select count(*), 'M' from customer
where gender ='M'
union
select count(*), 'F' from customer
where gender ='F'

The first query is simpler, but for some reason, when I execute both at the same time, the profiler reports that query 2 uses 39% of the time and query 1 uses 61%.

I would like to understand the reason; maybe I have to rewrite all my queries.

4 Solutions

#1

Your query 2 is actually a nice trick. It works like this: You have an index on gender. The DBMS can seek into that index two times to get two ranges of rows (one for M and one for F). It doesn't need to read anything from these rows, just that they exist. It can count the number of rows that exist in the two ranges.

In the first query the DBMS needs to decode the rows to read the gender, then it needs to either sort the rows or build a hashtable to aggregate them. That is more expensive than just counting rows.
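
Whether such an index actually exists can be checked from the catalog views. This is only a minimal sketch, assuming SQL Server and the customer table from the question; an index whose leading key column is gender is the kind of index the seek described above would rely on.

```sql
-- List every index on customer together with its columns
-- (SQL Server catalog views; assumes the table name from the question).
-- key_ordinal = 1 marks the leading key column of an index;
-- key_ordinal = 0 marks an included (non-key) column.
SELECT i.name       AS index_name,
       i.type_desc  AS index_type,
       c.name       AS column_name,
       ic.key_ordinal
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id
 AND ic.index_id  = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('customer')
ORDER BY i.name, ic.key_ordinal;
```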

#2

Are you sure? Maybe the second query is just using cached resources from the first one.

Run them in two separate batches, and before each one run DBCC FREEPROCCACHE to clear the cache. Then compare the values in each execution plan.
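
A rough sketch of that comparison, assuming SQL Server Management Studio or sqlcmd (GO is their batch separator) and a test server. The CHECKPOINT and DBCC DROPCLEANBUFFERS lines are an addition not mentioned above: DBCC FREEPROCCACHE only discards cached plans, so the data cache has to be emptied separately if both queries are to start cold.

```sql
-- Test server only: these commands affect the whole instance.
DBCC FREEPROCCACHE;       -- discard cached execution plans
CHECKPOINT;               -- flush dirty pages so they can be dropped
DBCC DROPCLEANBUFFERS;    -- addition: empty the data cache as well
GO
SET STATISTICS TIME ON;   -- report CPU and elapsed time per statement
SET STATISTICS IO ON;     -- report logical and physical reads per table
GO
-- query 1
select count(*), gender from customer
group by gender;
GO
-- clear the caches again so query 2 starts from the same state
DBCC FREEPROCCACHE;
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
-- query 2
select count(*), 'M' from customer
where gender = 'M'
union
select count(*), 'F' from customer
where gender = 'F';
GO
```

The timing and read counts appear in the Messages output, which gives measured figures to set beside the plan percentages.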

#3

The optimization of a query depends on the database. What you are seeing is database specific.

The union, as written, would naively require two passes through the data, doing a filter and a count. Basically no other storage is necessary.
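
One detail worth keeping in mind: UNION also removes duplicate rows, which can add a sort or hash step on top of the two counts. Since the two branches carry different literals ('M' and 'F'), they can never produce identical rows, so UNION ALL should return the same result without that step. A sketch using the query from the question:

```sql
-- Same result as query 2, but UNION ALL skips duplicate removal.
-- Safe here because the 'M' and 'F' rows can never be identical.
select count(*), 'M' from customer
where gender = 'M'
union all
select count(*), 'F' from customer
where gender = 'F';
```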

The aggregation might sort the data and then do a count. Or, it might generate a hash table. Given the performance difference, I would guess a sort is being used. Clearly, this is overkill for this type of query.

If you have an index on gender, both methods would essentially scan the index, so the performance should be similar (the union version might scan it twice).

Does the database that you are using offer a way to calculate statistics on tables? If so, you should update the statistics and see if you still get the same results.
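
The command differs between databases. As an illustration only, assuming SQL Server (the engine the other answers refer to) and the customer table from the question:

```sql
-- SQL Server: refresh all statistics on customer by reading every row
-- instead of a sample.
UPDATE STATISTICS customer WITH FULLSCAN;
```

After that, rerun both queries and check whether the 39% / 61% split still holds.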

Also, can you post the results of "explain" or the execution plan? That would precisely explain why one is faster than the other.
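
How to get the plan also depends on the database. A minimal sketch, assuming SQL Server with Management Studio or sqlcmd: SET SHOWPLAN_TEXT has to be the only statement in its batch, and while it is ON the queries are not executed, only their estimated plans are returned as text.

```sql
SET SHOWPLAN_TEXT ON;
GO
-- query 1: returns the estimated plan instead of running
select count(*), gender from customer
group by gender;
GO
-- query 2: returns the estimated plan instead of running
select count(*), 'M' from customer
where gender = 'M'
union
select count(*), 'F' from customer
where gender = 'F';
GO
SET SHOWPLAN_TEXT OFF;
GO
```

The text output can be pasted straight into a post, which makes it easy to see whether each query uses an index seek, an index scan, a sort, or a hash aggregate.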

#4

I tried an equivalent query, but found the opposite result; the union took 65% and the 'group by' took 35%. (Using SQL Server 2008). I do not have an index on gender so my execution plan shows a clustered index scan. Unless you examine the execution plan in detail, it really isn't possible to explain this result.

Adding an index for this query is probably not a good idea, since you are probably not going to be running this query nearly as often as you are going to insert records in the customer table. In some other database engines with bitmap indexes (Oracle, PostgreSQL), the database engine can combine multiple indexes, so that can alter the utility of single column indexes. But in SQL Server, you need to design the indexes to 'cover' the commonly used queries.
