My lecturer stated:
All column names in SELECT list must appear in GROUP BY clause unless name is used only in an aggregate function
I just want some confirmation of this, as I cannot think of a logical explanation for why it should be true...
7 solutions
#1
29
Imagine the following:
A    B   C
Cat  10  False
Dog  25  True
Dog  20  False
Cat  5   False
If you select A and B but group by only A, what would your output be? You'd have only two rows (or tuples) because A has two distinct values, but which value of B should each row display?
If you group by A, B, you get four rows, no problem there. If you group by A and apply an aggregate function to B, such as SUM(B), you get two rows again:
Cat 15
Dog 45
But if you select A, B and group by only A, the database doesn't know which B to show. I believe some databases will pick an arbitrary value for B in that case, and others will give you an error message.
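The example above is easy to try out. Here is a minimal sketch using Python's built-in sqlite3 module; the table name `pets` and the choice of SQLite are my assumptions for illustration, not part of the original answer.

```python
import sqlite3

# In-memory database holding the four example rows.
# The table name "pets" is made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (a TEXT, b INTEGER, c TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [("Cat", 10, "False"), ("Dog", 25, "True"),
     ("Dog", 20, "False"), ("Cat", 5, "False")],
)

# Group by A alone and aggregate B: one row per distinct value of A.
sum_by_a = conn.execute(
    "SELECT a, SUM(b) FROM pets GROUP BY a ORDER BY a"
).fetchall()

# Group by A and B: every (A, B) pair is its own group, so four rows.
by_a_and_b = conn.execute(
    "SELECT a, b FROM pets GROUP BY a, b ORDER BY a, b"
).fetchall()

print(sum_by_a)    # [('Cat', 15), ('Dog', 45)]
print(by_a_and_b)  # [('Cat', 5), ('Cat', 10), ('Dog', 20), ('Dog', 25)]
```

The ORDER BY clauses are only there to make the output order predictable.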
#2
13
That's historically true. Omitting unaggregated columns from the GROUP BY clause leads to indeterminate behavior, and SQL aims at fully determinate behavior.
But SQL standards have recently changed to let you omit from the GROUP BY clause columns that are functionally dependent on columns that are in the GROUP BY. PostgreSQL follows the more recent SQL standards. (It's not the only one.) Behavior is still fully determinate.
create table a (
  a_id integer primary key,
  xfr_date date not null
);

create table b (
  a_id integer not null references a (a_id),
  recd_date date not null,
  units_recd integer not null
    check (units_recd >= 0),
  primary key (a_id, recd_date)
);

select a.a_id, a.xfr_date, sum(b.units_recd)
from a
inner join b on a.a_id = b.a_id
group by a.a_id;  -- The column a.xfr_date is functionally dependent
                  -- on a.a_id; it doesn't have to appear in the
                  -- GROUP BY clause.
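To see why functional dependence keeps the result determinate, here is a rough sketch that runs the query above with and without a.xfr_date in the GROUP BY and compares the results. It uses Python's sqlite3 module, and the sample rows are invented for illustration. Note the assumption: SQLite does not check functional dependence at all (it accepts the shorter query either way), so this only shows that the two forms agree when a_id is the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, xfr_date TEXT NOT NULL);
    CREATE TABLE b (
        a_id INTEGER NOT NULL REFERENCES a (a_id),
        recd_date TEXT NOT NULL,
        units_recd INTEGER NOT NULL CHECK (units_recd >= 0),
        PRIMARY KEY (a_id, recd_date)
    );
    -- Invented sample rows, not from the original answer.
    INSERT INTO a VALUES (1, '2012-01-01'), (2, '2012-02-01');
    INSERT INTO b VALUES (1, '2012-01-05', 10),
                         (1, '2012-01-09', 5),
                         (2, '2012-02-03', 7);
""")

base = """
    SELECT a.a_id, a.xfr_date, SUM(b.units_recd)
    FROM a
    INNER JOIN b ON a.a_id = b.a_id
    GROUP BY {cols}
    ORDER BY a.a_id
"""

# Each a_id has exactly one xfr_date, so both groupings form the same groups.
grouped_by_key = conn.execute(base.format(cols="a.a_id")).fetchall()
grouped_by_both = conn.execute(base.format(cols="a.a_id, a.xfr_date")).fetchall()

print(grouped_by_key)  # [(1, '2012-01-01', 15), (2, '2012-02-01', 7)]
print(grouped_by_key == grouped_by_both)  # True
```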
The notable deviation from SQL standards is MySQL. It lets you omit just about everything from the GROUP BY, but that design choice makes its behavior indeterminate when you omit columns that are in the SELECT list. (Since MySQL 5.7.5 the ONLY_FULL_GROUP_BY SQL mode is enabled by default, so such queries are rejected unless the omitted columns are functionally dependent on the GROUP BY columns.)
#3
6
Actually, in MySQL you don't have to group by all columns; you can group by whatever columns you want. The problem is that it will pull an arbitrary value (from the rows available in the group) for the fields that aren't in the GROUP BY. If you know that you are grouping by a unique key, there's no point in grouping by the rest of the fields, as they will all have the same value within each group anyway. Not grouping by every field when it is completely unnecessary can actually speed the query up.
#4
1
If you are grouping on something, you cannot see the individual values of non-grouped columns, because there may be more than one value within each group. All you can do is report on aggregate functions (sum, count, min, etc.); these are able to combine the multiple values into a single cell in the result.
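As a small illustration of aggregates collapsing several values into one cell, here is a sketch that reuses the Cat/Dog rows from the first answer. It runs through Python's sqlite3 module, and the table name `pets` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (a TEXT, b INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO pets VALUES (?, ?)",
    [("Cat", 10), ("Dog", 25), ("Dog", 20), ("Cat", 5)],
)

# Each aggregate reduces the group's B values to a single value per group.
summary = conn.execute(
    "SELECT a, COUNT(*), MIN(b), MAX(b), SUM(b) "
    "FROM pets GROUP BY a ORDER BY a"
).fetchall()

print(summary)  # [('Cat', 2, 5, 10, 15), ('Dog', 2, 20, 25, 45)]
```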
#5
1
There are exceptions as noted by Sam Saffron but generally what your lecturer said is true.
If I select 3 columns and group by 2 what should the RDBMS do with the 3rd column?
The developers of the RDBMS may decide how to handle the extra column (as it appears MySQL's developers have), but is it the decision I would have made, or the one I want when writing the select? Will the decision always be valid? I certainly prefer the Oracle-like approach of forcing me to state explicitly what should happen.
If I select 3 columns and group by 2, should the RDBMS group by all 3, or pick a value from the 3rd: a random one, the biggest, the smallest, the most common?
#6
1
So the simple answer is: it depends. MySQL allows it, Vertica doesn't.
There is actually a valid use case for omitting a column from the GROUP BY: when you are already selecting a specific row with, say, MIN().
Here is an actual example from event tracking. Imagine you have credit and purchase events.
For simplicity, say a = credit; b, c, and d are some kind of purchase event; and time is tracked with a running number. Now you want to find the date of the first purchase after each credit. We also happen to have only one customer, 0:
create table events (user_id int, created_at int, event varchar(255));
insert into events values (0,0, 'a');
insert into events values (0,1, 'b');
insert into events values (0,2, 'c');
insert into events values (0,3, 'a');
insert into events values (0,4, 'c');
insert into events values (0,5, 'b');
insert into events values (0,6, 'a');
insert into events values (0,7, 'a');
insert into events values (0,8, 'd');
mysql> SELECT user_id, MAX(purchased) AS purchased, spent, event
    -> FROM (SELECT e1.user_id AS user_id,
    ->              e1.created_at AS purchased,
    ->              MIN(e2.created_at) AS spent,
    ->              e2.event AS event
    ->       FROM events e1, events e2
    ->       WHERE e1.user_id = e2.user_id
    ->         AND e1.created_at <= e2.created_at
    ->         AND e1.event = 'a'
    ->         AND e2.event != 'a'
    ->       GROUP BY e1.user_id, e1.created_at) e3
    -> GROUP BY user_id, spent;
+---------+-----------+-------+-------+
| user_id | purchased | spent | event |
+---------+-----------+-------+-------+
|       0 |         0 |     1 | b     |
|       0 |         3 |     4 | c     |
|       0 |         7 |     8 | d     |
+---------+-----------+-------+-------+
3 rows in set (0.00 sec)
This looks good in MySQL but does not work in Vertica:
ERROR 2640: Column "e2.event" must appear in the GROUP BY clause or be used in an aggregate function
If I omit the event column, it works in both, but I actually do want to know which value event had in the row that MIN selected.
So my answer ends with a request for comment :) Any ideas?
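One possible way to get the event from the same row as the minimum, without relying on an ungrouped column, is a window function: number the candidate purchases for each credit by time and keep only the first. As far as I know, MySQL documents the value of a nonaggregated column outside the GROUP BY as nondeterministic, so the result above is not guaranteed to come from the MIN row. The sketch below is an alternative technique, not the original author's query. It is run through Python's sqlite3 module and assumes a SQLite build with window functions (3.25 or later); the alias names `ranked` and `rn` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INT, created_at INT, event VARCHAR(255))"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(0, 0, "a"), (0, 1, "b"), (0, 2, "c"), (0, 3, "a"), (0, 4, "c"),
     (0, 5, "b"), (0, 6, "a"), (0, 7, "a"), (0, 8, "d")],
)

# For every credit (e1), rank the later purchases (e2) by time; rn = 1 is the
# first purchase, and its event column comes from that same row.
query = """
    SELECT user_id, MAX(purchased) AS purchased, spent, event
    FROM (
        SELECT e1.user_id,
               e1.created_at AS purchased,
               e2.created_at AS spent,
               e2.event,
               ROW_NUMBER() OVER (
                   PARTITION BY e1.user_id, e1.created_at
                   ORDER BY e2.created_at
               ) AS rn
        FROM events e1
        JOIN events e2
          ON e1.user_id = e2.user_id
         AND e1.created_at <= e2.created_at
        WHERE e1.event = 'a' AND e2.event != 'a'
    ) ranked
    WHERE rn = 1
    GROUP BY user_id, spent, event
    ORDER BY spent
"""
first_purchases = conn.execute(query).fetchall()
print(first_purchases)  # [(0, 0, 1, 'b'), (0, 3, 4, 'c'), (0, 7, 8, 'd')]
```

Every selected column is either in the GROUP BY or inside an aggregate, so a strict engine should have nothing to complain about, though I have not run this on Vertica.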
#7
0
This is an answer to Michael Will's example/question.
SELECT
    e3.user_id,
    MAX(e3.purchased) AS purchased,
    e3.spent,
    e.event
FROM
    events e
INNER JOIN
    (SELECT
        e1.user_id AS user_id,
        MIN(e1.created_at) AS spent,
        e2.created_at AS purchased
    FROM
        events e1
    INNER JOIN
        (SELECT e.user_id, e.created_at FROM events e WHERE e.event = 'a') e2
    ON e1.user_id = e2.user_id
        AND e1.created_at >= e2.created_at
        AND e1.event != 'a'
    GROUP BY e1.user_id, e2.created_at
    ) e3
ON e.user_id = e3.user_id AND e.created_at = e3.spent
GROUP BY e3.user_id, e3.spent, e.event;
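As a quick sanity check, the query above can be run against the sample rows from the question. This sketch uses SQLite through Python's sqlite3 module rather than Vertica (an assumption, chosen so it runs anywhere), and it adds an ORDER BY only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INT, created_at INT, event VARCHAR(255))"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(0, 0, "a"), (0, 1, "b"), (0, 2, "c"), (0, 3, "a"), (0, 4, "c"),
     (0, 5, "b"), (0, 6, "a"), (0, 7, "a"), (0, 8, "d")],
)

# e3 finds the time of the first purchase after each credit; joining back to
# events on that time recovers the event value from the matching row.
query = """
    SELECT e3.user_id, MAX(e3.purchased) AS purchased, e3.spent, e.event
    FROM events e
    INNER JOIN (
        SELECT e1.user_id AS user_id,
               MIN(e1.created_at) AS spent,
               e2.created_at AS purchased
        FROM events e1
        INNER JOIN (
            SELECT e.user_id, e.created_at FROM events e WHERE e.event = 'a'
        ) e2
          ON e1.user_id = e2.user_id
         AND e1.created_at >= e2.created_at
         AND e1.event != 'a'
        GROUP BY e1.user_id, e2.created_at
    ) e3
      ON e.user_id = e3.user_id AND e.created_at = e3.spent
    GROUP BY e3.user_id, e3.spent, e.event
    ORDER BY e3.spent
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(0, 0, 1, 'b'), (0, 3, 4, 'c'), (0, 7, 8, 'd')]
```

The rows match the MySQL output shown in the question, and every selected column is either grouped or aggregated.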