There is a table messages that contains data as shown below:
Id Name Other_Columns
-------------------------
1 A A_data_1
2 A A_data_2
3 A A_data_3
4 B B_data_1
5 B B_data_2
6 C C_data_1
If I run a query select * from messages group by name, I will get the result as:
1 A A_data_1
4 B B_data_1
6 C C_data_1
What query will return the following result?
3 A A_data_3
5 B B_data_2
6 C C_data_1
That is, the last record in each group should be returned.
At present, this is the query that I use:
SELECT
*
FROM (SELECT
*
FROM messages
ORDER BY id DESC) AS x
GROUP BY name
But this looks highly inefficient. Any other ways to achieve the same result?
21 answers
#1
692
MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:
WITH ranked_messages AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
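As a quick sanity check, the window-function query can be run against the sample data from the question. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module instead of MySQL (SQLite supports ROW_NUMBER() from version 3.25 on), and the in-memory database and lowercase column names are made up for the demo.

```python
import sqlite3

# In-memory stand-in for the question's messages table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Number the rows of each name from the highest id down, then keep row 1.
rows = conn.execute(
    """
    WITH ranked_messages AS (
        SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
        FROM messages AS m
    )
    SELECT id, name, other_columns
    FROM ranked_messages
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

The explicit ORDER BY id is only there to make the output order predictable; the technique itself does not need it.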
Below is the original answer I wrote for this question in 2009:
I write the solution this way:
SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
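To see why the LEFT JOIN form works, it can help to look at the join before the IS NULL filter is applied. The sketch below does that on the question's sample data; it assumes Python's built-in sqlite3 module and an in-memory table with lowercase column names, neither of which comes from the original answer.

```python
import sqlite3

# In-memory stand-in for the question's messages table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Every row is paired with each later row of the same name.
# A row with no later row gets a NULL (None) partner instead.
pairs = conn.execute(
    """
    SELECT m1.id, m2.id
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    ORDER BY m1.id, m2.id
    """
).fetchall()
print(pairs)

# Keeping only the rows without a partner leaves the last row per name.
last_rows = conn.execute(
    """
    SELECT m1.*
    FROM messages m1 LEFT JOIN messages m2
      ON (m1.name = m2.name AND m1.id < m2.id)
    WHERE m2.id IS NULL
    ORDER BY m1.id
    """
).fetchall()
print(last_rows)
```

Rows 3, 5 and 6 are the only ones whose partner column is NULL, which is exactly what the WHERE clause selects.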
Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.
For example, I have a copy of the * August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my MacBook Pro 2.40GHz.
I'll write a query to find the most recent post for a given user ID (mine).
First using the technique shown by @Eric with the GROUP BY in a subquery:
SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
FROM Posts pi GROUP BY pi.owneruserid) p2
ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;
1 row in set (1 min 17.89 sec)
Even the EXPLAIN analysis takes over 16 seconds:
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 76756 | |
| 1 | PRIMARY | p1 | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY | 8 | p2.maxpostid | 1 | Using where |
| 2 | DERIVED | pi | index | NULL | OwnerUserId | 8 | NULL | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)
Now produce the same query result using my technique with LEFT JOIN:
SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;
1 row in set (0.28 sec)
The EXPLAIN analysis shows that both tables are able to use their indexes:
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| 1 | SIMPLE | p1 | ref | OwnerUserId | OwnerUserId | 8 | const | 1384 | Using index |
| 1 | SIMPLE | p2 | ref | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8 | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
Here's the DDL for my Posts table:
CREATE TABLE `posts` (
`PostId` bigint(20) unsigned NOT NULL auto_increment,
`PostTypeId` bigint(20) unsigned NOT NULL,
`AcceptedAnswerId` bigint(20) unsigned default NULL,
`ParentId` bigint(20) unsigned default NULL,
`CreationDate` datetime NOT NULL,
`Score` int(11) NOT NULL default '0',
`ViewCount` int(11) NOT NULL default '0',
`Body` text NOT NULL,
`OwnerUserId` bigint(20) unsigned NOT NULL,
`OwnerDisplayName` varchar(40) default NULL,
`LastEditorUserId` bigint(20) unsigned default NULL,
`LastEditDate` datetime default NULL,
`LastActivityDate` datetime default NULL,
`Title` varchar(250) NOT NULL default '',
`Tags` varchar(150) NOT NULL default '',
`AnswerCount` int(11) NOT NULL default '0',
`CommentCount` int(11) NOT NULL default '0',
`FavoriteCount` int(11) NOT NULL default '0',
`ClosedDate` datetime default NULL,
PRIMARY KEY (`PostId`),
UNIQUE KEY `PostId` (`PostId`),
KEY `PostTypeId` (`PostTypeId`),
KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
KEY `OwnerUserId` (`OwnerUserId`),
KEY `LastEditorUserId` (`LastEditorUserId`),
KEY `ParentId` (`ParentId`),
CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
#2
116
UPD (2017-03-31): MySQL 5.7.5 enabled the ONLY_FULL_GROUP_BY switch by default, so non-deterministic GROUP BY queries are now rejected. Moreover, the GROUP BY implementation was updated, and the solution might no longer work as expected even with the switch disabled. One needs to check.
Bill Karwin's solution above works fine when the item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2 of only IS NULL comparisons.
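To get a feel for how fast n*n/2 + n/2 grows, here is a rough back-of-the-envelope calculation for a table of this size. It is only an illustration: the helper name is made up, and it assumes all groups are equally large, which real data will not be (uneven groups make the total larger).

```python
def pair_comparisons(n: int) -> int:
    """The answer's estimate n*n/2 + n/2, i.e. n * (n + 1) / 2, for a group of n rows."""
    return n * (n + 1) // 2


total_rows = 18_684_446  # table size quoted in the answer
groups = 1_182           # number of groups quoted in the answer

# Assumption: every group has the same size (integer average).
avg_group = total_rows // groups
estimate = groups * pair_comparisons(avg_group)

print("average group size:", avg_group)
print("estimated comparisons:", estimate)
print("comparisons per table row:", estimate // total_rows)
```

Under that assumption the estimate comes out on the order of 10^11 comparisons, several thousand per table row, which is consistent with the query not finishing in a reasonable time.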
I made my tests on an InnoDB table of 18684446 rows with 1182 groups. The table contains test results for functional tests and has (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.
Bill's solution has already been running for several hours on my Dell E4310 and I do not know when it is going to finish, even though it operates on a covering index (hence Using index in EXPLAIN).
I have a couple of other solutions that are based on the same ideas:
- if the underlying index is a BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, that is, the first for each group_id if we walk through the index in descending order;
- if we read the values which are covered by an index, the values are read in the order of the index;
- each index implicitly contains the primary key columns appended to it (that is, the primary key is in the covering index). In the solutions below I operate directly on the primary key; in your case, you will just need to add the primary key columns to the result;
- in many cases it is much cheaper to collect the required row ids in the required order in a subquery and join the result of the subquery on the id. Since for each row in the subquery result MySQL will need a single fetch based on the primary key, the subquery will be put first in the join and the rows will be output in the order of the ids in the subquery (if we omit an explicit ORDER BY for the join).
"3 ways MySQL uses indexes" is a great article for understanding some of the details.
Solution 1
This one is incredibly fast; it takes about 0.8 secs on my 18M+ rows:
SELECT test_id, MAX(request_id), request_id
FROM testresults
GROUP BY test_id DESC;
If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns:
SELECT test_id, request_id
FROM (
SELECT test_id, MAX(request_id), request_id
FROM testresults
GROUP BY test_id DESC) as ids
ORDER BY test_id;
This one takes about 1.2 secs on my data.
Solution 2
Here is another solution that takes about 19 seconds for my table:
SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC
It returns tests in descending order as well. It is much slower since it does a full index scan but it is here to give you an idea how to output N max rows for each group.
The disadvantage of the query is that its result cannot be cached by the query cache.
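For the "N max rows for each group" idea, an engine with window functions can express the same thing without user variables. Note this is a swapped-in technique, not the query from Solution 2, and the sketch assumes Python's built-in sqlite3 module (SQLite 3.25 or later) with a tiny made-up testresults table.

```python
import sqlite3

# Tiny stand-in for the testresults table; the data is invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE testresults ("
    " test_id INTEGER, request_id INTEGER,"
    " PRIMARY KEY (test_id, request_id))"
)
conn.executemany(
    "INSERT INTO testresults VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (3, 30)],
)

N = 2  # how many of the largest request_id values to keep per test_id

top_n = conn.execute(
    """
    SELECT test_id, request_id
    FROM (
        SELECT test_id, request_id,
               ROW_NUMBER() OVER (
                   PARTITION BY test_id ORDER BY request_id DESC
               ) AS rn
        FROM testresults
    ) AS ranked
    WHERE rn <= ?
    ORDER BY test_id DESC, request_id DESC
    """,
    (N,),
).fetchall()
print(top_n)
```

With N = 1 this degenerates to the "last row per group" case discussed in the rest of the thread.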
#3
79
Use your subquery to return the correct grouping, because you're halfway there.
Try this:
select
a.*
from
messages a
inner join
(select name, max(id) as maxid from messages group by name) as b on
a.id = b.maxid
If it's not id you want the max of:
select
a.*
from
messages a
inner join
(select name, max(other_col) as other_col
from messages group by name) as b on
a.name = b.name
and a.other_col = b.other_col
This way, you avoid correlated subqueries and/or ordering in your subqueries, which tend to be very slow/inefficient.
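Both variants of the derived-table join can be tried on the question's sample data. The sketch below assumes Python's built-in sqlite3 module and an in-memory table; using other_columns as the "other column" is just a stand-in for the demo.

```python
import sqlite3

# In-memory stand-in for the question's messages table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Variant 1: the subquery finds the max id per name, the join fetches the rows.
by_id = conn.execute(
    """
    SELECT a.*
    FROM messages a
    INNER JOIN (SELECT name, MAX(id) AS maxid FROM messages GROUP BY name) AS b
      ON a.id = b.maxid
    ORDER BY a.id
    """
).fetchall()

# Variant 2: the max of another column, so the join needs the name as well.
by_other = conn.execute(
    """
    SELECT a.*
    FROM messages a
    INNER JOIN (SELECT name, MAX(other_columns) AS other_col
                FROM messages GROUP BY name) AS b
      ON a.name = b.name AND a.other_columns = b.other_col
    ORDER BY a.id
    """
).fetchall()

print(by_id)
print(by_other)
```

One thing to keep in mind with the second variant: if the other column is not unique within a name, the join returns every row that ties for the maximum.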
#4
32
I arrived at a different solution, which is to get the IDs for the last post within each group, then select from the messages table using the result from the first query as the argument for a WHERE x IN construct:
SELECT id, name, other_columns
FROM messages
WHERE id IN (
SELECT MAX(id)
FROM messages
GROUP BY name
);
I don't know how this performs compared to some of the other solutions, but it worked spectacularly for my table with 3+ million rows. (4 second execution with 1200+ results)
This should work both on MySQL and SQL Server.
#5
22
Solution by subquery (fiddle link):
select * from messages where id in
(select max(id) from messages group by Name)
Solution by join condition (fiddle link):
select m1.* from messages m1
left outer join messages m2
on ( m1.id<m2.id and m1.name=m2.name )
where m2.id is null
The reason for this post is only to give the fiddle links. The same SQL is already provided in other answers.
#6
7
I've not yet tested with large DB but I think this could be faster than joining tables:
SELECT *, Max(Id) FROM messages GROUP BY Name
#7
4
Here are two suggestions. First, if MySQL supports ROW_NUMBER(), it's very simple:
WITH Ranked AS (
SELECT Id, Name, OtherColumns,
ROW_NUMBER() OVER (
PARTITION BY Name
ORDER BY Id DESC
) AS rk
FROM messages
)
SELECT Id, Name, OtherColumns
FROM Ranked
WHERE rk = 1;
I'm assuming by "last" you mean last in Id order. If not, change the ORDER BY clause of the ROW_NUMBER() window accordingly.
Second, if ROW_NUMBER() isn't available, this is often a good way to proceed:
SELECT
Id, Name, OtherColumns
FROM messages
WHERE NOT EXISTS (
SELECT * FROM messages as M2
WHERE M2.Name = messages.Name
AND M2.Id > messages.Id
)
In other words, select messages where there is no later-Id message with the same Name.
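The NOT EXISTS form can be checked the same way on the question's sample data. This is a sketch that assumes Python's built-in sqlite3 module and an in-memory table; the column is called other_columns here to match the question rather than OtherColumns.

```python
import sqlite3

# In-memory stand-in for the question's messages table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Keep a message only if no message with the same name has a higher id.
rows = conn.execute(
    """
    SELECT id, name, other_columns
    FROM messages
    WHERE NOT EXISTS (
        SELECT * FROM messages AS m2
        WHERE m2.name = messages.name
          AND m2.id > messages.id
    )
    ORDER BY id
    """
).fetchall()
print(rows)
```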
#8
4
Here is my solution:
SELECT
DISTINCT NAME,
MAX(MESSAGES) OVER(PARTITION BY NAME) MESSAGES
FROM MESSAGE;
#9
3
Here is another way to get the last related record, using GROUP_CONCAT with ORDER BY and SUBSTRING_INDEX to pick one of the records from the list:
SELECT
`Id`,
`Name`,
SUBSTRING_INDEX(
GROUP_CONCAT(
`Other_Columns`
ORDER BY `Id` DESC
SEPARATOR '||'
),
'||',
1
) Other_Columns
FROM
messages
GROUP BY `Name`
The query above groups all the Other_Columns values that belong to the same Name group. ORDER BY Id DESC joins the Other_Columns values of each group in descending order with the provided separator (in my case I have used ||). Applying SUBSTRING_INDEX to this list then picks the first one.
Fiddle Demo
#10
3
SELECT
column1,
column2
FROM
table_name
WHERE id IN
(SELECT
MAX(id)
FROM
table_name
GROUP BY column1)
ORDER BY column1 ;
#11
2
Try this:
SELECT jos_categories.title AS name,
joined.catid,
joined.title,
joined.introtext
FROM jos_categories
INNER JOIN (SELECT *
FROM (SELECT `title`,
catid,
`created`,
introtext
FROM `jos_content`
WHERE `sectionid` = 6
ORDER BY `id` DESC) AS yes
GROUP BY `yes`.`catid` DESC
ORDER BY `yes`.`created` DESC) AS joined
ON( joined.catid = jos_categories.id )
#12
2
You can view it here as well:
http://sqlfiddle.com/#!9/ef42b/9
FIRST SOLUTION
SELECT d1.ID,Name,City FROM Demo_User d1
INNER JOIN
(SELECT MAX(ID) AS ID FROM Demo_User GROUP By NAME) AS P ON (d1.ID=P.ID);
SECOND SOLUTION
SELECT * FROM (SELECT * FROM Demo_User ORDER BY ID DESC) AS T GROUP BY NAME ;
#13
1
Is there any way we could use this method to delete duplicates in a table? The result set is basically a collection of unique records, so if we could delete all records not in the result set, we would effectively have no duplicates. I tried this, but MySQL gave a 1093 error.
DELETE FROM messages WHERE id NOT IN
(SELECT m1.id
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL)
Is there a way to maybe save the output to a temp variable and then delete from NOT IN (temp variable)? @Bill thanks for a very useful solution.
EDIT: I think I found the solution:
DROP TABLE IF EXISTS UniqueIDs;
CREATE Temporary table UniqueIDs (id Int(11));
INSERT INTO UniqueIDs
(SELECT T1.ID FROM Table T1 LEFT JOIN Table T2 ON
(T1.Field1 = T2.Field1 AND T1.Field2 = T2.Field2 #Comparison Fields
AND T1.ID < T2.ID)
WHERE T2.ID IS NULL);
DELETE FROM Table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
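The temp-table steps can be walked through on a small example. This is only a sketch: it uses Python's built-in sqlite3 module, and the items table, its columns and its data are invented for the demo. SQLite does not have the restriction behind MySQL's error 1093, so the run only shows that the steps leave the expected rows, not that they are required there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with duplicates on (field1, field2).
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "x", "y"), (2, "x", "y"), (3, "x", "z"),
        (4, "x", "y"), (5, "q", "r"), (6, "q", "r"),
    ],
)

# Step 1: collect the id of the last row of every (field1, field2) group.
conn.execute("CREATE TEMPORARY TABLE unique_ids (id INTEGER)")
conn.execute(
    """
    INSERT INTO unique_ids
    SELECT t1.id
    FROM items t1 LEFT JOIN items t2
      ON (t1.field1 = t2.field1 AND t1.field2 = t2.field2 AND t1.id < t2.id)
    WHERE t2.id IS NULL
    """
)

# Step 2: delete every row whose id was not collected.
conn.execute("DELETE FROM items WHERE id NOT IN (SELECT id FROM unique_ids)")

remaining = conn.execute(
    "SELECT id, field1, field2 FROM items ORDER BY id"
).fetchall()
print(remaining)
```

Only the highest id of each (field1, field2) combination is left afterwards.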
#14
1
The below query will work fine as per your question.
SELECT M1.*
FROM MESSAGES M1,
(
SELECT SUBSTR(Others_data,1,2),MAX(Others_data) AS Max_Others_data
FROM MESSAGES
GROUP BY 1
) M2
WHERE M1.Others_data = M2.Max_Others_data
ORDER BY Others_data;
#15
1
Hi @Vijay Dev, if your table messages contains an Id column that is an auto-increment primary key, then to fetch the latest record based on the primary key your query should read as below:
SELECT m1.* FROM messages m1 INNER JOIN (SELECT max(Id) as lastmsgId FROM messages GROUP BY Name) m2 ON m1.Id=m2.lastmsgId
#16
1
If you want the last row for each Name, then you can give a row number to each row, grouped by Name and ordered by Id in descending order.
QUERY
SELECT t1.Id,
t1.Name,
t1.Other_Columns
FROM
(
SELECT Id,
Name,
Other_Columns,
(
CASE Name WHEN @curA
THEN @curRow := @curRow + 1
ELSE @curRow := 1 AND @curA := Name END
) + 1 AS rn
FROM messages t,
(SELECT @curRow := 0, @curA := '') r
ORDER BY Name,Id DESC
)t1
WHERE t1.rn = 1
ORDER BY t1.Id;
SQL Fiddle
#17
1
An approach with considerable speed is as follows.
SELECT *
FROM messages a
WHERE Id = (SELECT MAX(Id) FROM messages WHERE a.Name = Name)
Result
Id Name Other_Columns
3 A A_data_3
5 B B_data_2
6 C C_data_1
#18
0
How about this:
SELECT DISTINCT ON (name) *
FROM messages
ORDER BY name, id DESC;
I had a similar issue (on PostgreSQL, though) on a table with 1M records. This solution takes 1.7s vs. 44s for the one with LEFT JOIN. In my case I had to filter the counterpart of your name field against NULL values, which improved performance by a further 0.2 secs.
#19
0
Clearly there are lots of different ways of getting the same results; your question seems to be what is an efficient way of getting the last result in each group in MySQL. If you are working with huge amounts of data, and assuming you are using InnoDB with even the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc), then there might not be an efficient way of doing this.
We sometimes need to do this with tables with even more than 60 million rows.
For these examples I will use data with only about 1.5 million rows, where the queries need to find results for all groups in the data. In our actual cases we would often need to return data from about 2,000 groups (which hypothetically would not require examining very much of the data).
I will use the following tables:
CREATE TABLE temperature(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
groupID INT UNSIGNED NOT NULL,
recordedTimestamp TIMESTAMP NOT NULL,
recordedValue INT NOT NULL,
INDEX groupIndex(groupID, recordedTimestamp),
PRIMARY KEY (id)
);
CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id));
The temperature table is populated with about 1.5 million random records, and with 100 different groups. The selected_group is populated with those 100 groups (in our cases this would normally be less than 20% for all of the groups).
As this data is random it means that multiple rows can have the same recordedTimestamps. What we want is to get a list of all of the selected groups in order of groupID with the last recordedTimestamp for each group, and if the same group has more than one matching row like that then the last matching id of those rows.
If hypothetically MySQL had a last() function which returned values from the last row in a special ORDER BY clause then we could simply do:
SELECT
last(t1.id) AS id,
t1.groupID,
last(t1.recordedTimestamp) AS recordedTimestamp,
last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;
which would only need to examine a few hundred rows in this case, as it doesn't use any of the normal GROUP BY functions. This would execute in 0 seconds and hence be highly efficient. Note that normally in MySQL an ORDER BY clause follows the GROUP BY clause; here, however, the ORDER BY clause determines the order for the last() function. If it came after the GROUP BY, it would be ordering the groups. If no GROUP BY clause is present, then the last values will be the same in all of the returned rows.
However MySQL does not have this so let's look at different ideas of what it does have and prove that none of these are efficient.
Example 1
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT t2.id
FROM temperature t2
WHERE t2.groupID = g.id
ORDER BY t2.recordedTimestamp DESC, t2.id DESC
LIMIT 1
);
This examined 3,009,254 rows and took ~0.859 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 2
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
INNER JOIN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
) t5 ON t5.id = t1.id;
This examined 1,505,331 rows and took ~1.25 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 3
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
WHERE t1.id IN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
)
ORDER BY t1.groupID;
This examined 3,009,685 rows and took ~1.95 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 4
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT max(t2.id)
FROM temperature t2
WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
SELECT max(t3.recordedTimestamp)
FROM temperature t3
WHERE t3.groupID = g.id
)
);
This examined 6,137,810 rows and took ~2.2 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 5
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
t2.id,
t2.groupID,
t2.recordedTimestamp,
t2.recordedValue,
row_number() OVER (
PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
) AS rowNumber
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;
This examined 6,017,808 rows and took ~4.2 seconds on 8.0.4-rc
Example 6
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
last_value(t2.id) OVER w AS id,
t2.groupID,
last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp,
last_value(t2.recordedValue) OVER w AS recordedValue
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
WINDOW w AS (
PARTITION BY t2.groupID
ORDER BY t2.recordedTimestamp, t2.id
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) t1
GROUP BY t1.groupID;
This examined 6,017,908 rows and took ~17.5 seconds on 8.0.4-rc
Example 7
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2
ON t2.groupID = g.id
AND (
t2.recordedTimestamp > t1.recordedTimestamp
OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
)
WHERE t2.id IS NULL
ORDER BY t1.groupID;
This one was taking forever so I had to kill it.
#20
0
If performance is really your concern you can introduce a new column on the table called IsLastInGroup of type BIT.
Set it to true on the rows which are last in their group and maintain it with every row insert/update/delete. Writes will be slower, but you'll benefit on reads. It depends on your use case and I recommend it only if you're read-focused.
So your query will look like:
SELECT * FROM Messages WHERE IsLastInGroup = 1
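Maintaining the flag on insert could look roughly like the sketch below. Everything here is an assumption made for illustration: it uses Python's built-in sqlite3 module, an INTEGER 0/1 column named is_last_in_group instead of BIT, and a hypothetical insert_message helper. It only covers inserts; updates and deletes would need similar care.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        other_columns TEXT,
        is_last_in_group INTEGER NOT NULL DEFAULT 0  -- stand-in for BIT
    )
    """
)


def insert_message(conn: sqlite3.Connection, name: str, data: str) -> None:
    """Hypothetical helper: insert a row and keep the flag consistent."""
    with conn:  # one transaction: both statements commit or roll back together
        # The previous last row of this group loses its flag...
        conn.execute(
            "UPDATE messages SET is_last_in_group = 0"
            " WHERE name = ? AND is_last_in_group = 1",
            (name,),
        )
        # ...and the new row becomes the last one.
        conn.execute(
            "INSERT INTO messages (name, other_columns, is_last_in_group)"
            " VALUES (?, ?, 1)",
            (name, data),
        )


for name, data in [
    ("A", "A_data_1"), ("A", "A_data_2"), ("A", "A_data_3"),
    ("B", "B_data_1"), ("B", "B_data_2"), ("C", "C_data_1"),
]:
    insert_message(conn, name, data)

last_rows = conn.execute(
    "SELECT id, name, other_columns FROM messages"
    " WHERE is_last_in_group = 1 ORDER BY id"
).fetchall()
print(last_rows)
```

The read side is then a plain filter on the flag, at the price of one extra UPDATE per insert.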
#21
-2
select * from messages group by name desc
#1
692
MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:
MySQL 8.0现在支持窗口函数,就像所有流行的SQL实现一样。使用这个标准语法,我们可以编写每个组最大的查询:
WITH ranked_messages AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
Below is the original answer I wrote for this question in 2009:
以下是我在2009年为这个问题写下的最初答案:
I write the solution this way:
我这样写解:
SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.
关于性能,根据数据的性质,一个解决方案或另一个解决方案可以更好。因此,您应该测试这两个查询,并使用性能更好的数据库。
For example, I have a copy of the * August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts
table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.
例如,我有一个* August数据转储的副本。我将把它用于基准测试。Posts表中有114357行。这是在我的Macbook Pro 2.40GHz上运行的MySQL 5.0.75。
I'll write a query to find the most recent post for a given user ID (mine).
我将编写一个查询来查找给定用户ID(我的)的最新文章。
First using the technique shown by @Eric with the GROUP BY
in a subquery:
首先使用@Eric在子查询中对GROUP by显示的技术:
SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
FROM Posts pi GROUP BY pi.owneruserid) p2
ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;
1 row in set (1 min 17.89 sec)
Even the EXPLAIN
analysis takes over 16 seconds:
甚至解释分析也要花费16秒:
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 76756 | |
| 1 | PRIMARY | p1 | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY | 8 | p2.maxpostid | 1 | Using where |
| 2 | DERIVED | pi | index | NULL | OwnerUserId | 8 | NULL | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)
Now produce the same query result using my technique with LEFT JOIN
:
现在使用我的技术与左连接产生相同的查询结果:
SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;
1 row in set (0.28 sec)
The EXPLAIN
analysis shows that both tables are able to use their indexes:
解释分析表明,两个表都可以使用它们的索引:
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| 1 | SIMPLE | p1 | ref | OwnerUserId | OwnerUserId | 8 | const | 1384 | Using index |
| 1 | SIMPLE | p2 | ref | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8 | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
Here's the DDL for my Posts
table:
下面是我的Posts表的DDL:
CREATE TABLE `posts` (
`PostId` bigint(20) unsigned NOT NULL auto_increment,
`PostTypeId` bigint(20) unsigned NOT NULL,
`AcceptedAnswerId` bigint(20) unsigned default NULL,
`ParentId` bigint(20) unsigned default NULL,
`CreationDate` datetime NOT NULL,
`Score` int(11) NOT NULL default '0',
`ViewCount` int(11) NOT NULL default '0',
`Body` text NOT NULL,
`OwnerUserId` bigint(20) unsigned NOT NULL,
`OwnerDisplayName` varchar(40) default NULL,
`LastEditorUserId` bigint(20) unsigned default NULL,
`LastEditDate` datetime default NULL,
`LastActivityDate` datetime default NULL,
`Title` varchar(250) NOT NULL default '',
`Tags` varchar(150) NOT NULL default '',
`AnswerCount` int(11) NOT NULL default '0',
`CommentCount` int(11) NOT NULL default '0',
`FavoriteCount` int(11) NOT NULL default '0',
`ClosedDate` datetime default NULL,
PRIMARY KEY (`PostId`),
UNIQUE KEY `PostId` (`PostId`),
KEY `PostTypeId` (`PostTypeId`),
KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
KEY `OwnerUserId` (`OwnerUserId`),
KEY `LastEditorUserId` (`LastEditorUserId`),
KEY `ParentId` (`ParentId`),
CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;
#2
116
UPD: 2017-03-31, the version 5.7.5 of MySQL made the ONLY_FULL_GROUP_BY switch enabled by default (hence, non-deterministic GROUP BY queries became disabled). Moreover, they updated the GROUP BY implementation and the solution might not work as expected anymore even with the disabled switch. One needs to check.
UPD: 2017-03-31, MySQL版本5.7.5使得ONLY_FULL_GROUP_BY开关默认启用(因此,查询不确定组被禁用)。此外,他们根据实现更新了组,即使禁用了开关,解决方案也可能不再像预期的那样工作。需要检查。
Bill Karwin's solution above works fine when item count within groups is rather small, but the performance of the query becomes bad when the groups are rather large, since the solution requires about n*n/2 + n/2
of only IS NULL
comparisons.
当组中的项计数很小时,Bill Karwin的上述解决方案可以正常工作,但是当组比较大时,查询的性能就会变差,因为该解决方案只需要n*n/2 + n/2的值为NULL比较。
I made my tests on a InnoDB table of 18684446
rows with 1182
groups. The table contains testresults for functional tests and has the (test_id, request_id)
as the primary key. Thus, test_id
is a group and I was searching for the last request_id
for each test_id
.
我对InnoDB表进行了测试,这个表有18684446行,有1182个组。该表包含用于功能测试的testresults,并以(test_id、request_id)作为主键。因此,test_id是一个组,我正在为每个test_id搜索最后一个request_id。
Bill's solution has already been running for several hours on my dell e4310 and I do not know when it is going to finish even though it operates on a coverage index (hence using index
in EXPLAIN).
Bill的解决方案已经在我的dell e4310上运行了好几个小时了,我不知道它什么时候结束,即使它在覆盖率指数上运行(因此在EXPLAIN中使用索引)。
I have a couple of other solutions that are based on the same ideas:
我有一些其他的解决方案基于相同的想法:
- if the underlying index is BTREE index (which is usually the case), the largest
(group_id, item_value)
pair is the last value within eachgroup_id
, that is the first for eachgroup_id
if we walk through the index in descending order; - 如果底层索引是BTREE索引(通常是这种情况),那么最大的(group_id、item_value)对是每个group_id中的最后一个值,这是按降序遍历索引的第一个group_id;
- if we read the values which are covered by an index, the values are read in the order of the index;
- 如果我们读取索引所覆盖的值,则按索引的顺序读取这些值;
- each index implicitly contains primary key columns appended to that (that is the primary key is in the coverage index). In solutions below I operate directly on the primary key, in you case, you will just need to add primary key columns in the result.
- 每个索引隐式地包含附加到该索引的主键列(即覆盖率索引中的主键)。在下面的解决方案中,我直接操作主键,在您的情况下,您只需要在结果中添加主键列。
- in many cases it is much cheaper to collect the required row ids in the required order in a subquery and join the result of the subquery on the id. Since for each row in the subquery result MySQL will need a single fetch based on primary key, the subquery will be put first in the join and the rows will be output in the order of the ids in the subquery (if we omit explicit ORDER BY for the join)
"3 ways MySQL uses indexes" is a great article for understanding some of the details.
Solution 1
This one is incredibly fast; it takes about 0.8 seconds on my 18M+ rows:
-- Note: GROUP BY ... DESC is MySQL-specific syntax and was removed in MySQL 8.0.
SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;
If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns:
SELECT test_id, request_id
FROM (
SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC) AS ids
ORDER BY test_id;
This one takes about 1.2 seconds on my data.
Solution 2
Here is another solution that takes about 19 seconds for my table:
SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC
It returns tests in descending order as well. It is much slower since it does a full index scan but it is here to give you an idea how to output N max rows for each group.
The disadvantage of the query is that its result cannot be cached by the query cache.
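For the "N max rows for each group" idea mentioned above, here is a minimal sketch that swaps in a different technique: a ROW_NUMBER() window function instead of user variables. It is not the query from this answer. SQLite (via Python's sqlite3 module, needing SQLite 3.25 or newer) stands in for MySQL 8.0, and the sample rows and the value N = 2 are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL 8.0; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE testresults (
        test_id INTEGER NOT NULL,
        request_id INTEGER NOT NULL,
        PRIMARY KEY (test_id, request_id)
    )
""")
conn.executemany(
    "INSERT INTO testresults VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21)],
)

N = 2  # hypothetical: how many of the largest request_ids to keep per test_id

# Number the rows of each group from the largest request_id downwards,
# then keep only the first N rows of every group.
top_n = conn.execute("""
    SELECT test_id, request_id
    FROM (
        SELECT test_id, request_id,
               ROW_NUMBER() OVER (
                   PARTITION BY test_id ORDER BY request_id DESC
               ) AS rn
        FROM testresults
    ) AS ranked
    WHERE rn <= ?
    ORDER BY test_id DESC, request_id DESC
""", (N,)).fetchall()

print(top_n)
```

With N = 1 this reduces to the "last row per group" problem from the question.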
#3
79
Use your subquery to return the correct grouping, because you're halfway there.
Try this:
select
a.*
from
messages a
inner join
(select name, max(id) as maxid from messages group by name) as b on
a.id = b.maxid
If it's not id you want the max of:
select
a.*
from
messages a
inner join
(select name, max(other_col) as other_col
from messages group by name) as b on
a.name = b.name
and a.other_col = b.other_col
This way, you avoid correlated subqueries and/or ordering in your subqueries, which tend to be very slow/inefficient.
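To check the join against the sample data from the question, here is a small sketch using Python's sqlite3 module. SQLite is only a stand-in for MySQL here (an assumption), and the lowercase column names are mine; the join itself is the first query from this answer with an ORDER BY added so the output order is predictable.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; data copied from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# The derived table b holds one row per name with its largest id;
# joining on that id pulls in the remaining columns of the last row.
rows = conn.execute("""
    SELECT a.*
    FROM messages AS a
    INNER JOIN (
        SELECT name, MAX(id) AS maxid FROM messages GROUP BY name
    ) AS b ON a.id = b.maxid
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```

This prints the three rows the question asks for (ids 3, 5 and 6).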
#4
32
I arrived at a different solution, which is to get the IDs of the last post within each group, then select from the messages table using the result of the first query as the argument for a WHERE x IN construct:
SELECT id, name, other_columns
FROM messages
WHERE id IN (
SELECT MAX(id)
FROM messages
GROUP BY name
);
I don't know how this performs compared to some of the other solutions, but it worked spectacularly for my table with 3+ million rows (4-second execution with 1200+ results).
This should work both on MySQL and SQL Server.
#5
22
Solution by subquery (fiddle link):
select * from messages where id in
(select max(id) from messages group by Name)
Solution by join condition (fiddle link):
select m1.* from messages m1
left outer join messages m2
on ( m1.id<m2.id and m1.name=m2.name )
where m2.id is null
The reason for this post is only to provide the fiddle links. The same SQL is already given in other answers.
#6
7
I've not yet tested with large DB but I think this could be faster than joining tables:
SELECT *, Max(Id) FROM messages GROUP BY Name
-- Caveat: the columns selected by * are not guaranteed to come from the row that holds Max(Id).
#7
4
Here are two suggestions. First, if MySQL supports ROW_NUMBER(), it's very simple:
WITH Ranked AS (
SELECT Id, Name, OtherColumns,
ROW_NUMBER() OVER (
PARTITION BY Name
ORDER BY Id DESC
) AS rk
FROM messages
)
SELECT Id, Name, OtherColumns
FROM Ranked
WHERE rk = 1;
I'm assuming that by "last" you mean last in Id order. If not, change the ORDER BY clause of the ROW_NUMBER() window accordingly.
Second, if ROW_NUMBER() isn't available, this is often a good way to proceed:
SELECT
Id, Name, OtherColumns
FROM messages
WHERE NOT EXISTS (
SELECT * FROM messages as M2
WHERE M2.Name = messages.Name
AND M2.Id > messages.Id
)
In other words, select messages where there is no later-Id message with the same Name.
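As a quick sanity check that the two suggestions agree, here is a sketch that runs both against the sample data from the question. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption; the window function needs SQLite 3.25 or newer), and the column name other_columns is taken from the question rather than from this answer.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; data copied from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Variant 1: number the rows of each name from the largest id downwards
# and keep row number 1.
via_row_number = conn.execute("""
    WITH ranked AS (
        SELECT id, name, other_columns,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rk
        FROM messages
    )
    SELECT id, name, other_columns FROM ranked WHERE rk = 1 ORDER BY id
""").fetchall()

# Variant 2: keep a row only if no row with the same name has a larger id.
via_not_exists = conn.execute("""
    SELECT id, name, other_columns
    FROM messages
    WHERE NOT EXISTS (
        SELECT 1 FROM messages AS m2
        WHERE m2.name = messages.name AND m2.id > messages.id
    )
    ORDER BY id
""").fetchall()

print(via_row_number == via_not_exists)
print(via_not_exists)
```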
#8
4
Here is my solution:
SELECT
DISTINCT NAME,
MAX(MESSAGES) OVER(PARTITION BY NAME) MESSAGES
FROM MESSAGE;
#9
3
Here is another way to get the last related record, using GROUP_CONCAT with ORDER BY and SUBSTRING_INDEX to pick one record from the list:
SELECT
`Id`,
`Name`,
SUBSTRING_INDEX(
GROUP_CONCAT(
`Other_Columns`
ORDER BY `Id` DESC
SEPARATOR '||'
),
'||',
1
) Other_Columns
FROM
messages
GROUP BY `Name`
The above query groups all the Other_Columns values that are in the same Name group. ORDER BY Id DESC joins the Other_Columns values of each group in descending order with the provided separator (in my case, ||). Applying SUBSTRING_INDEX to this list picks the first one.
Fiddle Demo
#10
3
SELECT
column1,
column2
FROM
table_name
WHERE id IN
(SELECT
MAX(id)
FROM
table_name
GROUP BY column1)
ORDER BY column1 ;
#11
2
Try this:
SELECT jos_categories.title AS name,
joined.catid,
joined.title,
joined.introtext
FROM jos_categories
INNER JOIN (SELECT *
FROM (SELECT `title`,
catid,
`created`,
introtext
FROM `jos_content`
WHERE `sectionid` = 6
ORDER BY `id` DESC) AS yes
GROUP BY `yes`.`catid` DESC
ORDER BY `yes`.`created` DESC) AS joined
ON( joined.catid = jos_categories.id )
#12
2
You can also view it here:
http://sqlfiddle.com/#!9/ef42b/9
FIRST SOLUTION
SELECT d1.ID,Name,City FROM Demo_User d1
INNER JOIN
(SELECT MAX(ID) AS ID FROM Demo_User GROUP By NAME) AS P ON (d1.ID=P.ID);
SECOND SOLUTION
SELECT * FROM (SELECT * FROM Demo_User ORDER BY ID DESC) AS T GROUP BY NAME ;
#13
1
Is there any way we could use this method to delete duplicates in a table? The result set is basically a collection of unique records, so if we could delete all records not in the result set, we would effectively have no duplicates. I tried this, but MySQL gave a 1093 error.
DELETE FROM messages WHERE id NOT IN
(SELECT m1.id
FROM messages m1 LEFT JOIN messages m2
ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL)
Is there a way to save the output to a temp variable and then delete with NOT IN (temp variable)? @Bill, thanks for a very useful solution.
EDIT: I think I found the solution:
DROP TABLE IF EXISTS UniqueIDs;
CREATE Temporary table UniqueIDs (id Int(11));
INSERT INTO UniqueIDs
(SELECT T1.ID FROM Table T1 LEFT JOIN Table T2 ON
(T1.Field1 = T2.Field1 AND T1.Field2 = T2.Field2 #Comparison Fields
AND T1.ID < T2.ID)
WHERE T2.ID IS NULL);
DELETE FROM Table WHERE id NOT IN (SELECT ID FROM UniqueIDs);
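Here is a runnable sketch of the same temporary-table idea, applied to the sample messages data from the question. MySQL raises error 1093 when a DELETE's subquery reads the table being deleted from, which is why the ids to keep are staged in a separate table first. SQLite (via Python's sqlite3 module) stands in for MySQL here, so this only illustrates the shape of the approach; the table name unique_ids and grouping on name alone are my assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; data copied from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, name TEXT, other_columns TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, "A", "A_data_1"), (2, "A", "A_data_2"), (3, "A", "A_data_3"),
        (4, "B", "B_data_1"), (5, "B", "B_data_2"), (6, "C", "C_data_1"),
    ],
)

# Step 1: stage the id of the last row of every name in a temporary table.
conn.execute("CREATE TEMP TABLE unique_ids (id INTEGER PRIMARY KEY)")
conn.execute("""
    INSERT INTO unique_ids
    SELECT m1.id
    FROM messages AS m1
    LEFT JOIN messages AS m2
        ON m1.name = m2.name AND m1.id < m2.id
    WHERE m2.id IS NULL
""")

# Step 2: delete every row whose id was not staged.
deleted = conn.execute(
    "DELETE FROM messages WHERE id NOT IN (SELECT id FROM unique_ids)"
).rowcount

remaining = conn.execute("SELECT id, name FROM messages ORDER BY id").fetchall()
print(deleted, remaining)
```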
#14
1
The below query will work fine as per your question.
SELECT M1.*
FROM MESSAGES M1,
(
SELECT SUBSTR(Others_data,1,2),MAX(Others_data) AS Max_Others_data
FROM MESSAGES
GROUP BY 1
) M2
WHERE M1.Others_data = M2.Max_Others_data
ORDER BY Others_data;
#15
1
Hi @Vijay Dev, if your messages table contains Id as an auto-increment primary key, then to fetch the latest record of each group based on the primary key your query should read as below:
SELECT m1.* FROM messages m1 INNER JOIN (SELECT max(Id) as lastmsgId FROM messages GROUP BY Name) m2 ON m1.Id=m2.lastmsgId
#16
1
If you want the last row for each Name, you can assign a row number to each row, grouped by Name and ordered by Id in descending order.
QUERY
SELECT t1.Id,
t1.Name,
t1.Other_Columns
FROM
(
SELECT Id,
Name,
Other_Columns,
(
CASE Name WHEN @curA
THEN @curRow := @curRow + 1
ELSE @curRow := 1 AND @curA := Name END
) + 1 AS rn
FROM messages t,
(SELECT @curRow := 0, @curA := '') r
ORDER BY Name,Id DESC
)t1
WHERE t1.rn = 1
ORDER BY t1.Id;
SQL Fiddle
#17
1
An approach with considerable speed is as follows.
SELECT *
FROM messages a
WHERE Id = (SELECT MAX(Id) FROM messages WHERE a.Name = Name)
Result
Id Name Other_Columns
3 A A_data_3
5 B B_data_2
6 C C_data_1
#18
0
How about this:
SELECT DISTINCT ON (name) *
FROM messages
ORDER BY name, id DESC;
I had a similar issue (on PostgreSQL, though) with a table of 1M records. This solution takes 1.7s versus 44s for the one with LEFT JOIN. In my case I had to filter the counterpart of your name field against NULL values, which improved performance by a further 0.2 seconds.
#19
0
Clearly there are lots of different ways of getting the same results; your question seems to be what an efficient way of getting the last result in each group in MySQL is. If you are working with huge amounts of data and are using InnoDB, even with the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc), there might not be an efficient way of doing this.
We sometimes need to do this with tables with even more than 60 million rows.
For these examples I will use data with only about 1.5 million rows where the queries would need to find results for all groups in the data. In our actual cases we would often need to return back data from about 2,000 groups (which hypothetically would not require examining very much of the data).
I will use the following tables:
CREATE TABLE temperature(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
groupID INT UNSIGNED NOT NULL,
recordedTimestamp TIMESTAMP NOT NULL,
recordedValue INT NOT NULL,
INDEX groupIndex(groupID, recordedTimestamp),
PRIMARY KEY (id)
);
CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id));
The temperature table is populated with about 1.5 million random records and with 100 different groups. The selected_group table is populated with those 100 groups (in our cases this would normally be less than 20% of all the groups).
As this data is random it means that multiple rows can have the same recordedTimestamps. What we want is to get a list of all of the selected groups in order of groupID with the last recordedTimestamp for each group, and if the same group has more than one matching row like that then the last matching id of those rows.
If hypothetically MySQL had a last() function which returned values from the last row in a special ORDER BY clause then we could simply do:
SELECT
last(t1.id) AS id,
t1.groupID,
last(t1.recordedTimestamp) AS recordedTimestamp,
last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;
which would only need to examine a few hundred rows in this case, as it doesn't use any of the normal GROUP BY functions. This would execute in 0 seconds and hence be highly efficient. Note that normally in MySQL we would see an ORDER BY clause following the GROUP BY clause; however, this ORDER BY clause is used to determine the order for the last() function. If it were after the GROUP BY, it would be ordering the groups. If no GROUP BY clause is present, the last values will be the same in all of the returned rows.
However MySQL does not have this so let's look at different ideas of what it does have and prove that none of these are efficient.
Example 1
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT t2.id
FROM temperature t2
WHERE t2.groupID = g.id
ORDER BY t2.recordedTimestamp DESC, t2.id DESC
LIMIT 1
);
This examined 3,009,254 rows and took ~0.859 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 2
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
INNER JOIN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
) t5 ON t5.id = t1.id;
This examined 1,505,331 rows and took ~1.25 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 3
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
WHERE t1.id IN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
)
ORDER BY t1.groupID;
This examined 3,009,685 rows and took ~1.95 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 4
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT max(t2.id)
FROM temperature t2
WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
SELECT max(t3.recordedTimestamp)
FROM temperature t3
WHERE t3.groupID = g.id
)
);
This examined 6,137,810 rows and took ~2.2 seconds on 5.7.21 and slightly longer on 8.0.4-rc
Example 5
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
t2.id,
t2.groupID,
t2.recordedTimestamp,
t2.recordedValue,
row_number() OVER (
PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
) AS rowNumber
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;
This examined 6,017,808 rows and took ~4.2 seconds on 8.0.4-rc
Example 6
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
last_value(t2.id) OVER w AS id,
t2.groupID,
last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp,
last_value(t2.recordedValue) OVER w AS recordedValue
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
WINDOW w AS (
PARTITION BY t2.groupID
ORDER BY t2.recordedTimestamp, t2.id
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) t1
GROUP BY t1.groupID;
This examined 6,017,908 rows and took ~17.5 seconds on 8.0.4-rc
Example 7
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2
ON t2.groupID = g.id
AND (
t2.recordedTimestamp > t1.recordedTimestamp
OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
)
WHERE t2.id IS NULL
ORDER BY t1.groupID;
This one was taking forever so I had to kill it.
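To make the tie-breaking requirement from this answer concrete (latest recordedTimestamp per group, and the largest id among rows sharing that timestamp), here is a tiny sketch that runs the Example 1 query on five hand-made rows. SQLite (via Python's sqlite3 module) stands in for MySQL, the rows are invented, and timestamps are stored as ISO text; it says nothing about performance, only about which rows should come back.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE temperature (
        id INTEGER PRIMARY KEY,
        groupID INTEGER NOT NULL,
        recordedTimestamp TEXT NOT NULL,
        recordedValue INTEGER NOT NULL
    )
""")
conn.execute("CREATE TABLE selected_group (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO temperature VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2018-01-01 00:00:00", 10),
        (2, 1, "2018-01-02 00:00:00", 11),
        (3, 1, "2018-01-02 00:00:00", 12),  # same timestamp as id 2: larger id wins
        (4, 2, "2018-01-03 00:00:00", 20),
        (5, 2, "2018-01-01 00:00:00", 21),  # larger id, but older timestamp
    ],
)
conn.executemany("INSERT INTO selected_group VALUES (?)", [(1,), (2,)])

# Example 1 from this answer, plus an ORDER BY for a predictable output order.
last_rows = conn.execute("""
    SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
    FROM selected_group AS g
    INNER JOIN temperature AS t1 ON t1.id = (
        SELECT t2.id
        FROM temperature AS t2
        WHERE t2.groupID = g.id
        ORDER BY t2.recordedTimestamp DESC, t2.id DESC
        LIMIT 1
    )
    ORDER BY t1.groupID
""").fetchall()

# For contrast: MAX(id) alone picks id 5 for group 2, which is not the latest reading.
naive = conn.execute(
    "SELECT groupID, MAX(id) FROM temperature GROUP BY groupID ORDER BY groupID"
).fetchall()

print(last_rows)
print(naive)
```

Group 2 is the reason the simpler MAX(id) queries from other answers are not enough here: ids and timestamps are not guaranteed to increase together.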
#20
0
If performance is really your concern, you can introduce a new column on the table called IsLastInGroup, of type BIT.
Set it to true on the rows which are last in their group and maintain it with every row insert/update/delete. Writes will be slower, but reads will benefit. It depends on your use case, and I recommend it only if your workload is read-focused.
So your query will look like:
SELECT * FROM Messages WHERE IsLastInGroup = 1
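One possible way to maintain such a flag on insert is sketched below, under stated assumptions: ids only ever increase, so a newly inserted row is always the last of its group, and updates and deletes (not shown) would need analogous handling. SQLite (via Python's sqlite3 module) stands in for MySQL; the helper insert_message and the column name is_last_in_group are hypothetical names of mine.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; names below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        other_columns TEXT,
        is_last_in_group INTEGER NOT NULL DEFAULT 0
    )
""")


def insert_message(conn, name, other_columns):
    """Insert a row and move the 'last in group' flag onto it."""
    with conn:  # one transaction: both statements commit or roll back together
        # Clear the flag on the row that was last in this group until now.
        conn.execute(
            "UPDATE messages SET is_last_in_group = 0 "
            "WHERE name = ? AND is_last_in_group = 1",
            (name,),
        )
        # The new row gets the largest id, so it becomes the last of its group.
        conn.execute(
            "INSERT INTO messages (name, other_columns, is_last_in_group) "
            "VALUES (?, ?, 1)",
            (name, other_columns),
        )


for name, data in [
    ("A", "A_data_1"), ("A", "A_data_2"), ("A", "A_data_3"),
    ("B", "B_data_1"), ("B", "B_data_2"), ("C", "C_data_1"),
]:
    insert_message(conn, name, data)

last_rows = conn.execute(
    "SELECT id, name, other_columns FROM messages "
    "WHERE is_last_in_group = 1 ORDER BY id"
).fetchall()
print(last_rows)
```

The read side then becomes a plain filter on the flag, which is the trade-off this answer describes: an extra write per insert in exchange for a trivial read.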
#21
-2
select * from messages group by name desc