Should I use SUM() for conditional aggregation?

Time: 2021-10-10 23:01:38

I have seen professional programmers use SUM() (not COUNT()) for conditional aggregation. For example, take a look at this:

SELECT COUNT(field1), COUNT(field2), COUNT(field3),
       SUM(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)
FROM (SELECT t.*
      FROM mytable t
      WHERE ...
      ORDER BY id
      LIMIT 500000
     ) rq;

Is there any specific reason? I actually tested it, and COUNT() also seems to work fine when there is a condition. So what is the difference between:

SUM(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)

And

COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)

In other words, why does the first query above use COUNT() in some places and SUM() in others?

2 Answers

#1


3  

SUM(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)

And

COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)

These are two different things, and I don't know how you got the same answer (it might be because of the data set you are using). COUNT() counts every non-NULL value, and both 1 and 0 are non-NULL, so the ELSE 0 version counts every row regardless of the condition. In this setting it is better to use COUNT(), but you have to change the code a bit:

COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 END)

However, I don't think either of them will have a significant performance impact.
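To make the difference concrete, here is a minimal sketch that runs the three expressions side by side. It assumes an in-memory SQLite database driven from Python and a tiny four-row version of mytable; the sample data and the variable names are illustrative only and not taken from the question.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, "
    "field4 INTEGER, field5 INTEGER, field6 INTEGER)"
)
rows = [
    (1, 1, 1, 1),           # all three fields set: matches the condition
    (2, 1, None, 1),        # field5 is NULL: does not match
    (3, None, None, None),  # everything NULL: does not match
    (4, 2, 2, 2),           # all three fields set: matches the condition
]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

cond = "field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL"
sum_else_0, count_else_0, count_no_else = conn.execute(f"""
    SELECT SUM(CASE WHEN {cond} THEN 1 ELSE 0 END),
           COUNT(CASE WHEN {cond} THEN 1 ELSE 0 END),
           COUNT(CASE WHEN {cond} THEN 1 END)
    FROM mytable
""").fetchone()

print(sum_else_0, count_else_0, count_no_else)  # 2 4 2
```

COUNT() with ELSE 0 returns 4 because 0 is a non-NULL value and is counted like any other. The two results only agree when every row happens to satisfy the condition, which would explain the asker's test.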

#2


1  

@Malinga explained that there really isn't a difference between the SUM() form and the corrected COUNT() form, and I agree with him. My personal preference is SUM(), only because NULLs are treated as UNKNOWN in SQL and come with a lot of rules and nuances, so whenever I can eliminate them I prefer to do so.

A quick note: if you leave out the ELSE clause, NULL is implied when the conditions are not met. So

COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE NULL END)

can be written as

COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 END)
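The sketch below checks that the explicit and implicit ELSE NULL forms give the same count. It also shows one NULL nuance that separates the two approaches: over zero input rows, SUM() returns NULL while COUNT() returns 0. It assumes SQLite via Python and a one-column stand-in for mytable; the data is made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field4 INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(1,), (None,), (3,)])

# Explicit ELSE NULL versus no ELSE at all: both skip the non-matching row.
explicit, implicit = conn.execute("""
    SELECT COUNT(CASE WHEN field4 IS NOT NULL THEN 1 ELSE NULL END),
           COUNT(CASE WHEN field4 IS NOT NULL THEN 1 END)
    FROM mytable
""").fetchone()
print(explicit, implicit)  # 2 2

# With no input rows at all, SUM() yields NULL (None in Python); COUNT() yields 0.
empty_sum, empty_count = conn.execute("""
    SELECT SUM(CASE WHEN field4 IS NOT NULL THEN 1 ELSE 0 END),
           COUNT(CASE WHEN field4 IS NOT NULL THEN 1 END)
    FROM mytable
    WHERE 1 = 0
""").fetchone()
print(empty_sum, empty_count)  # None 0
```

If the WHERE clause can filter out every row, wrapping the SUM() in COALESCE(..., 0) is one way to get a 0 instead of NULL.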

However, concerning performance, I would be curious about two things.

1. The ORDER BY in your inner select only matters because it decides which 500,000 rows the LIMIT keeps. If you don't care which rows are included, it isn't necessary and just adds a sorting task for the SQL engine, so you might improve performance by removing it. Note: the query below shows both the SUM() and the COUNT() method; keep whichever you prefer.

SELECT COUNT(field1), COUNT(field2), COUNT(field3),
       SUM(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)
      ,COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 END)
FROM (SELECT t.*
      FROM mytable t
      WHERE ...
      LIMIT 500000
     ) rq;
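One caveat worth checking before dropping the ORDER BY: combined with LIMIT, it determines which rows reach the outer aggregates. The sketch below assumes SQLite via Python, a six-row stand-in for mytable, and LIMIT 3 in place of LIMIT 500000; the helper count_field1 is hypothetical and exists only for this illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, field1 INTEGER)")
# ids 1..3 have field1 set, ids 4..6 have field1 NULL
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, None), (5, None), (6, None)],
)

def count_field1(order_by):
    """Count non-NULL field1 values among the first 3 rows in the given order."""
    sql = f"""
        SELECT COUNT(field1)
        FROM (SELECT t.*
              FROM mytable t
              ORDER BY {order_by}
              LIMIT 3
             ) rq
    """
    return conn.execute(sql).fetchone()[0]

first_three = count_field1("id")      # aggregates ids 1, 2, 3
last_three = count_field1("id DESC")  # aggregates ids 4, 5, 6
print(first_three, last_three)  # 3 0
```

Without any ORDER BY, the rows kept by LIMIT are whatever the engine returns first, so the aggregates may change between runs or engines.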

2. Next, and I know this will seem weird, the LIMIT could actually decrease your performance too, depending on how many records are in the table, the indexes, and a few other things. So if you really want the results for the entire table, you could just write your query like this:

SELECT COUNT(field1), COUNT(field2), COUNT(field3),
       SUM(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 ELSE 0 END)
      ,COUNT(CASE WHEN field4 IS NOT NULL AND field5 IS NOT NULL AND field6 IS NOT NULL THEN 1 END)
FROM mytable t
WHERE ...
;
