I am somewhat confused about how the GROUP BY clause works in MySQL.
Suppose I have a table:
mysql> select recordID, IPAddress, date, httpMethod from Log_Analysis_Records_dalhousieShort;
+----------+-----------------+---------------------+-------------------------------------------------+
| recordID | IPAddress | date | httpMethod |
+----------+-----------------+---------------------+-------------------------------------------------+
| 1 | 64.68.88.22 | 2003-07-09 00:00:21 | GET /news/science/cancer.shtml HTTP/1.0 |
| 2 | 64.68.88.166 | 2003-07-09 00:00:55 | GET /news/internet/xml.shtml HTTP/1.0 |
| 3 | 129.173.177.214 | 2003-07-09 00:01:23 | GET / HTTP/1.1 |
| 4 | 129.173.177.214 | 2003-07-09 00:01:23 | GET /include/fcs_style.css HTTP/1.1 |
| 5 | 129.173.177.214 | 2003-07-09 00:01:23 | GET /include/main_page.css HTTP/1.1 |
| 6 | 129.173.177.214 | 2003-07-09 00:01:23 | GET /images/bigportaltopbanner.gif HTTP/1.1 |
| 7 | 129.173.177.214 | 2003-07-09 00:01:23 | GET /images/right_1.jpg HTTP/1.1 |
| 8 | 64.68.88.165 | 2003-07-09 00:02:43 | GET /studentservices/responsible.shtml HTTP/1.0 |
| 9 | 64.68.88.165 | 2003-07-09 00:02:44 | GET /news/sports/basketball.shtml HTTP/1.0 |
| 10 | 64.68.88.34 | 2003-07-09 00:02:46 | GET /news/science/space.shtml HTTP/1.0 |
| 11 | 129.173.159.98 | 2003-07-09 00:03:46 | GET / HTTP/1.1 |
| 12 | 129.173.159.98 | 2003-07-09 00:03:46 | GET /include/fcs_style.css HTTP/1.1 |
| 13 | 129.173.159.98 | 2003-07-09 00:03:46 | GET /include/main_page.css HTTP/1.1 |
| 14 | 129.173.159.98 | 2003-07-09 00:03:48 | GET /images/bigportaltopbanner.gif HTTP/1.1 |
| 15 | 129.173.159.98 | 2003-07-09 00:03:48 | GET /images/left_1g.jpg HTTP/1.1 |
| 16 | 129.173.159.98 | 2003-07-09 00:03:48 | GET /images/webcam.gif HTTP/1.1 |
+----------+-----------------+---------------------+-------------------------------------------------+
When I execute this statement, how does it choose which recordID to include, since there is a range of recordIDs that would be correct? Does it just choose the first one that matches?
mysql> select recordID, IPAddress, date, httpMethod from Log_Analysis_Records_dalhousieShort GROUP BY IPADDRESS;
+----------+-----------------+---------------------+-------------------------------------------------+
| recordID | IPAddress | date | httpMethod |
+----------+-----------------+---------------------+-------------------------------------------------+
| 11 | 129.173.159.98 | 2003-07-09 00:03:46 | GET / HTTP/1.1 |
| 3 | 129.173.177.214 | 2003-07-09 00:01:23 | GET / HTTP/1.1 |
| 8 | 64.68.88.165 | 2003-07-09 00:02:43 | GET /studentservices/responsible.shtml HTTP/1.0 |
| 2 | 64.68.88.166 | 2003-07-09 00:00:55 | GET /news/internet/xml.shtml HTTP/1.0 |
| 1 | 64.68.88.22 | 2003-07-09 00:00:21 | GET /news/science/cancer.shtml HTTP/1.0 |
| 10 | 64.68.88.34 | 2003-07-09 00:02:46 | GET /news/science/space.shtml HTTP/1.0 |
+----------+-----------------+---------------------+-------------------------------------------------+
6 rows in set (0.00 sec)
In the result below, the max(date) and min(date) values seem logical to me, but I am confused about how the recordID and httpMethod were chosen.
Is it safe to use two aggregate functions in one statement?
mysql> select recordID, IPAddress, min(date), max(date), httpMethod from Log_Analysis_Records_dalhousieShort GROUP BY IPADDRESS;
+----------+-----------------+---------------------+---------------------+-------------------------------------------------+
| recordID | IPAddress | min(date) | max(date) | httpMethod |
+----------+-----------------+---------------------+---------------------+-------------------------------------------------+
| 11 | 129.173.159.98 | 2003-07-09 00:03:46 | 2003-07-09 00:03:48 | GET / HTTP/1.1 |
| 3 | 129.173.177.214 | 2003-07-09 00:01:23 | 2003-07-09 00:01:23 | GET / HTTP/1.1 |
| 8 | 64.68.88.165 | 2003-07-09 00:02:43 | 2003-07-09 00:02:44 | GET /studentservices/responsible.shtml HTTP/1.0 |
| 2 | 64.68.88.166 | 2003-07-09 00:00:55 | 2003-07-09 00:00:55 | GET /news/internet/xml.shtml HTTP/1.0 |
| 1 | 64.68.88.22 | 2003-07-09 00:00:21 | 2003-07-09 00:00:21 | GET /news/science/cancer.shtml HTTP/1.0 |
| 10 | 64.68.88.34 | 2003-07-09 00:02:46 | 2003-07-09 00:02:46 | GET /news/science/space.shtml HTTP/1.0 |
+----------+-----------------+---------------------+---------------------+-------------------------------------------------+
6 rows in set (0.00 sec)
4 answers
#1
12
Normally, using GROUP BY while selecting a column that is neither listed in the GROUP BY clause nor wrapped in an aggregate function is invalid SQL and should raise an error.
MySQL, however, allows this and simply picks an arbitrary value from each group. Try to avoid it, because it is confusing.
To disallow this, you can run the following at runtime:
SET sql_mode := CONCAT('ONLY_FULL_GROUP_BY,',@@sql_mode);
or set the sql-mode configuration value and/or command-line option.
Yes, listing two aggregate functions is completely valid.
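As a minimal sketch of the valid form, the query below selects only the grouped column plus aggregates, so it does not depend on which row MySQL happens to pick. It reuses the table and column names from the question; the aliases firstRecordID, firstSeen, lastSeen and requestCount are made up for illustration. Since MySQL 5.7.5, ONLY_FULL_GROUP_BY is part of the default sql_mode, so on newer servers the question's original query is rejected unless that mode has been turned off.

```sql
-- Every selected column is either in the GROUP BY clause or aggregated,
-- so the result is well defined and accepted under ONLY_FULL_GROUP_BY.
SELECT IPAddress,
       MIN(recordID) AS firstRecordID,  -- lowest recordID per address
       MIN(date)     AS firstSeen,      -- earliest request per address
       MAX(date)     AS lastSeen,       -- latest request per address
       COUNT(*)      AS requestCount    -- number of rows in the group
FROM Log_Analysis_Records_dalhousieShort
GROUP BY IPAddress;
```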
#2
4
Because I'm new, apparently I can't post helpful images, so I'll try to do this with text...
I just tested this, and it appears that fields that are NOT in the GROUP BY take their values from the FIRST row that matches the group. This would also explain the perceived "randomness" that others have experienced when selecting columns that aren't in the GROUP BY clause.
Example:
Create a table called "test" with 2 columns called "col1" and "col2" with data that looks like this:
Col1 Col2
1 2
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Then run the following query:
select col1,col2
from test
order by col2 desc
You will get this result:
1 3
2 3
3 3
1 2
1 2
2 2
3 2
2 1
3 1
Now consider the following query:
select groupTable.col1, groupTable.col2
from (
    select col1, col2
    from test
    order by col2 desc
) groupTable
group by groupTable.col1
order by groupTable.col1 desc
You will get this result:
3 3
2 3
1 3
Change the subquery to asc:
select col1,col2
from test
order by col2 asc
Result:
2 1
3 1
1 2
1 2
2 2
3 2
1 3
2 3
3 3
Again use that as the basis for your subquery:
select groupTable.col1, groupTable.col2
from (
    select col1, col2
    from test
    order by col2 asc
) groupTable
group by groupTable.col1
order by groupTable.col1 desc
Result:
3 1
2 1
1 2
Now you should be able to see how the order of the subquery affects which values are chosen for fields that are selected but not in the GROUP BY clause. This would explain the perceived "randomness" that others have mentioned: if the subquery (or the lack thereof) has no ORDER BY clause, MySQL grabs rows as they come in, but by defining a sort order in a subquery you are able to control this behavior and get predictable results.
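If the goal is simply the largest or smallest col2 per col1, a sketch that does not depend on row order is to ask for it with an aggregate. This is offered as an alternative, not as what the test above does: the MySQL manual describes the value picked for a non-grouped column as nondeterministic, and, as far as I know, newer versions may ignore an ORDER BY inside a derived table, so the subquery approach may not behave the same everywhere.

```sql
-- Largest col2 per col1; with the sample data this returns 3 3, 2 3, 1 3.
SELECT col1, MAX(col2) AS col2
FROM test
GROUP BY col1
ORDER BY col1 DESC;

-- Smallest col2 per col1; with the sample data this returns 3 1, 2 1, 1 2.
SELECT col1, MIN(col2) AS col2
FROM test
GROUP BY col1
ORDER BY col1 DESC;
```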
#3
0
I thought it took the first row according to the PRIMARY KEY or any INDEX, because it looks like it works that way, but I've tried a GROUP BY query on various tables and couldn't identify any pattern.
Therefore I will avoid using values from non-grouped columns.
#4
0
GROUP BY picks up the first record based on the index. Let us say the Log_Analysis_Records_dalhousieShort table has recordID as its index. Hence, GROUP BY picked recordID 11 for IPAddress 129.173.159.98 from among recordIDs 11 to 16. min and max, however, are in a way operations applied before the group is collapsed, so their values are computed logically for you.
mysql> select recordID, IPAddress, date, httpMethod from Log_Analysis_Records_dalhousieShort GROUP BY IPADDRESS;
+----------+-----------------+---------------------+-------------------------------------------------+
| recordID | IPAddress | date | httpMethod |
+----------+-----------------+---------------------+-------------------------------------------------+
| 11 | 129.173.159.98 | 2003-07-09 00:03:46 | GET / HTTP/1.1 |
| 3 | 129.173.177.214 | 2003-07-09 00:01:23 | GET / HTTP/1.1 |
| 8 | 64.68.88.165 | 2003-07-09 00:02:43 | GET /studentservices/responsible.shtml HTTP/1.0 |
| 2 | 64.68.88.166 | 2003-07-09 00:00:55 | GET /news/internet/xml.shtml HTTP/1.0 |
| 1 | 64.68.88.22 | 2003-07-09 00:00:21 | GET /news/science/cancer.shtml HTTP/1.0 |
| 10 | 64.68.88.34 | 2003-07-09 00:02:46 | GET /news/science/space.shtml HTTP/1.0 |
+----------+-----------------+---------------------+-------------------------------------------------+
6 rows in set (0.00 sec)
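To get the whole first record per IPAddress without relying on how GROUP BY happens to pick it, one option is to compute the lowest recordID per group and join back to the table. This is a sketch, assuming recordID is unique and that "first" means the lowest recordID; the aliases r and f and the name firstID are made up for illustration.

```sql
SELECT r.recordID, r.IPAddress, r.date, r.httpMethod
FROM Log_Analysis_Records_dalhousieShort AS r
JOIN (
    -- One row per address: the lowest recordID in that group.
    SELECT IPAddress, MIN(recordID) AS firstID
    FROM Log_Analysis_Records_dalhousieShort
    GROUP BY IPAddress
) AS f ON f.firstID = r.recordID
ORDER BY r.IPAddress;
```

With the sample data from the question this should return the same six records as the output above (recordIDs 11, 3, 8, 2, 1 and 10), but here the choice of row is stated explicitly in the query.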