I have a table from which I want to get the latest entry for each group. Here's the table:
DocumentStatusLogs table
|ID| DocumentID | Status | DateCreated |
| 2| 1 | S1 | 7/29/2011 |
| 3| 1 | S2 | 7/30/2011 |
| 6| 1 | S1 | 8/02/2011 |
| 1| 2 | S1 | 7/28/2011 |
| 4| 2 | S2 | 7/30/2011 |
| 5| 2 | S3 | 8/01/2011 |
| 6| 3 | S1 | 8/02/2011 |
The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.
My preferred output:
| DocumentID | Status | DateCreated |
| 1 | S1 | 8/02/2011 |
| 2 | S3 | 8/01/2011 |
| 3 | S1 | 8/02/2011 |
- Is there any aggregate function to get only the top from each group? See the pseudo-code GetOnlyTheTop below:
SELECT DocumentID, GetOnlyTheTop(Status), GetOnlyTheTop(DateCreated) FROM DocumentStatusLogs GROUP BY DocumentID ORDER BY DateCreated DESC
- If such a function doesn't exist, is there any way I can achieve the output I want?
- Or, in the first place, could this be caused by an unnormalized database? I'm thinking, since what I'm looking for is just one row, should that status also be located in the parent table?
Please see the parent table for more information:
Current Documents table
| DocumentID | Title | Content | DateCreated |
| 1 | TitleA | ... | ... |
| 2 | TitleB | ... | ... |
| 3 | TitleC | ... | ... |
Should the parent table be like this so that I can easily access its status?
| DocumentID | Title | Content | DateCreated | CurrentStatus |
| 1 | TitleA | ... | ... | s1 |
| 2 | TitleB | ... | ... | s3 |
| 3 | TitleC | ... | ... | s1 |
UPDATE: I just learned how to use "apply", which makes it easier to address such problems.
15 solutions
#1
518
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1
If you expect 2 entries per day for a document (that is, ties on DateCreated), then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead.
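To see the difference between the two ranking functions on tied dates, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption made only so the example runs anywhere; it needs SQLite 3.25 or later for window functions). The row with ID 8 is a hypothetical second entry on the same day, and `latest_rows` is a helper name invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (8, 1, "S4", "2011-08-02"),  # hypothetical second entry on the same day
        (5, 2, "S3", "2011-08-01"),
    ],
)


def latest_rows(ranking):
    """Return the rn = 1 rows, ranked with ROW_NUMBER or DENSE_RANK."""
    if ranking not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError("unsupported ranking function")
    sql = f"""
        WITH cte AS (
            SELECT DocumentID, Status, DateCreated,
                   {ranking}() OVER (PARTITION BY DocumentID
                                     ORDER BY DateCreated DESC) AS rn
            FROM DocumentStatusLogs
        )
        SELECT DocumentID, Status, DateCreated
        FROM cte
        WHERE rn = 1
        ORDER BY DocumentID, Status
    """
    return conn.execute(sql).fetchall()


# One row per document; which of the tied rows wins for document 1 is not defined.
one_per_document = latest_rows("ROW_NUMBER")
# Both tied rows for document 1, plus the single latest row for document 2.
with_ties = latest_rows("DENSE_RANK")
```

If a deterministic single row is needed, adding a tie-breaker such as ID DESC to the ORDER BY inside OVER is one option.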
As for normalised or not, it depends on whether you want to:
- maintain status in 2 places
- preserve status history
- ...
As it stands, you preserve status history. If you want the latest status in the parent table too (which is denormalisation), you'd need a trigger to maintain "status" in the parent, or else drop this status history table.
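A rough sketch of the trigger idea, again using SQLite through Python's sqlite3 rather than SQL Server (T-SQL trigger syntax differs, so treat this only as an illustration of the design). The trigger name `trg_copy_latest_status` is made up, the CurrentStatus column follows the question's proposed parent table, and the sketch assumes log rows are always inserted in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Documents (
        DocumentID    INTEGER PRIMARY KEY,
        Title         TEXT,
        CurrentStatus TEXT   -- denormalised copy of the latest status
    );
    CREATE TABLE DocumentStatusLogs (
        ID          INTEGER PRIMARY KEY AUTOINCREMENT,
        DocumentID  INTEGER REFERENCES Documents(DocumentID),
        Status      TEXT,
        DateCreated TEXT
    );
    -- trg_copy_latest_status is a made-up name for this sketch.
    CREATE TRIGGER trg_copy_latest_status
    AFTER INSERT ON DocumentStatusLogs
    BEGIN
        UPDATE Documents
        SET CurrentStatus = NEW.Status
        WHERE DocumentID = NEW.DocumentID;
    END;
    """
)
conn.execute("INSERT INTO Documents (DocumentID, Title) VALUES (2, 'TitleB')")
for status, day in [("S1", "2011-07-28"), ("S2", "2011-07-30"), ("S3", "2011-08-01")]:
    conn.execute(
        "INSERT INTO DocumentStatusLogs (DocumentID, Status, DateCreated) VALUES (2, ?, ?)",
        (status, day),
    )

# The parent row now carries the latest status, and the history is still intact.
current_status = conn.execute(
    "SELECT CurrentStatus FROM Documents WHERE DocumentID = 2"
).fetchone()[0]
history_rows = conn.execute("SELECT COUNT(*) FROM DocumentStatusLogs").fetchone()[0]
```

The trade-off is the one listed above: reads of the current status become a plain column lookup, at the cost of keeping the status in two places.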
#2
106
I just learned how to use cross apply. Here's how to use it in this scenario:
select d.DocumentID, ds.Status, ds.DateCreated
from Documents as d
cross apply
(select top 1 Status, DateCreated
from DocumentStatusLogs
where DocumentID = d.DocumentId
order by DateCreated desc) as ds
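SQLite has no APPLY operator, so the following is not CROSS APPLY itself but a sketch of the same shape in portable SQL: drive from the parent Documents table and look up one log row per document with a correlated subquery. It runs through Python's sqlite3. Document 4 (with no log rows) and ID 7 for the last log row are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE DocumentStatusLogs (
        ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT
    );
    INSERT INTO Documents VALUES
        (1, 'TitleA'), (2, 'TitleB'), (3, 'TitleC'),
        (4, 'TitleD');  -- hypothetical document with no log rows
    INSERT INTO DocumentStatusLogs VALUES
        (2, 1, 'S1', '2011-07-29'), (3, 1, 'S2', '2011-07-30'), (6, 1, 'S1', '2011-08-02'),
        (1, 2, 'S1', '2011-07-28'), (4, 2, 'S2', '2011-07-30'), (5, 2, 'S3', '2011-08-01'),
        (7, 3, 'S1', '2011-08-02');  -- ID 7 is an assumption to keep IDs unique
    """
)

QUERY = """
    SELECT d.DocumentID, ds.Status, ds.DateCreated
    FROM Documents AS d
    {join} DocumentStatusLogs AS ds
        ON ds.ID = (SELECT x.ID
                    FROM DocumentStatusLogs AS x
                    WHERE x.DocumentID = d.DocumentID
                    ORDER BY x.DateCreated DESC
                    LIMIT 1)
    ORDER BY d.DocumentID
"""

# An inner JOIN drops documents without log rows, as CROSS APPLY would.
like_cross_apply = conn.execute(QUERY.format(join="JOIN")).fetchall()
# A LEFT JOIN keeps them with NULLs, as OUTER APPLY would.
like_outer_apply = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
```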
#3
38
I've done some timings of the various recommendations here, and the results really depend on the size of the table involved, but the most consistent solution is CROSS APPLY. These tests were run against SQL Server 2008 R2, using a table with 6,500 records, and another (identical schema) with 137 million records. The columns being queried are part of the primary key on the table, and the table width is very small (about 30 bytes). The times are reported by SQL Server from the actual execution plan.
| Query | Time for 6,500 rows (ms) | Time for 137M rows (ms) |
| CROSS APPLY | 17.9 | 17.9 |
| SELECT WHERE col = (SELECT MAX(COL)…) | 6.6 | 854.4 |
| DENSE_RANK() OVER PARTITION | 6.6 | 907.1 |
I think the really amazing thing was how consistent the time was for the CROSS APPLY regardless of the number of rows involved.
#4
23
SELECT * FROM
DocumentStatusLogs JOIN (
SELECT DocumentID, MAX(DateCreated) DateCreated
FROM DocumentStatusLogs
GROUP BY DocumentID
) max_date USING (DocumentID, DateCreated)
What database server? This code doesn't work on all of them.
Regarding the second half of your question, it seems reasonable to me to include the status as a column. You can leave DocumentStatusLogs as a log, but still store the latest info in the main table.
BTW, if you already have the DateCreated column in the Documents table you can just join DocumentStatusLogs using that (as long as DateCreated is unique in DocumentStatusLogs).
Edit: MS SQL Server does not support USING, so change it to:
ON DocumentStatusLogs.DocumentID = max_date.DocumentID AND DocumentStatusLogs.DateCreated = max_date.DateCreated
#5
19
If you're worried about performance, you can also do this with MAX():
SELECT *
FROM DocumentStatusLogs D
WHERE DateCreated = (SELECT MAX(DateCreated) FROM DocumentStatusLogs WHERE DocumentID = D.DocumentID)
ROW_NUMBER() requires a sort of all the rows in your SELECT statement, whereas MAX does not. This should drastically speed up your query.
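A small sketch of the correlated MAX() approach, run on SQLite through Python's sqlite3 as a stand-in for SQL Server. It also shows why the subquery has to correlate on the grouping column DocumentID: correlating on the row's own ID makes every row its own maximum, so nothing is filtered out. The helper `latest_by_max`, the ISO date strings, and ID 7 for the last row are assumptions of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)


def latest_by_max(correlate_on):
    """Keep rows whose DateCreated equals the MAX over rows sharing `correlate_on`."""
    if correlate_on not in ("DocumentID", "ID"):
        raise ValueError("unsupported column")
    sql = f"""
        SELECT ID, DocumentID, Status, DateCreated
        FROM DocumentStatusLogs AS D
        WHERE DateCreated = (SELECT MAX(X.DateCreated)
                             FROM DocumentStatusLogs AS X
                             WHERE X.{correlate_on} = D.{correlate_on})
        ORDER BY DocumentID, ID
    """
    return conn.execute(sql).fetchall()


per_document = latest_by_max("DocumentID")  # one latest row per document
per_row = latest_by_max("ID")               # every row: each ID is its own group
```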
#6
9
This is quite an old thread, but I thought I'd throw my two cents in all the same, as the accepted answer didn't work particularly well for me. I tried gbn's solution on a large dataset and found it to be terribly slow (>45 seconds on 5 million plus records in SQL Server 2012). Looking at the execution plan, it's obvious that the issue is that it requires a SORT operation, which slows things down significantly.
Here's an alternative that I lifted from Entity Framework that needs no SORT operation and does a non-clustered index seek. This reduces the execution time to < 2 seconds on the aforementioned record set.
SELECT
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM [dbo].[DocumentStatusLogs] AS [Extent1]) AS [Distinct1]
OUTER APPLY (SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM [dbo].[DocumentStatusLogs] AS [Extent2]
WHERE ([Distinct1].[DocumentID] = [Extent2].[DocumentID])
) AS [Project2]
ORDER BY [Project2].[ID] DESC) AS [Limit1]
Now, I'm assuming something that isn't entirely specified in the original question: if your table design is such that the ID column is an auto-increment ID, and DateCreated is set to the current date with each insert, then even without using my query above you could get a sizable performance boost to gbn's solution (about half the execution time) just by ordering on ID instead of DateCreated, as this provides an identical sort order and is a faster sort.
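The ordering-by-ID shortcut only holds while IDs and dates rise together. Here is a minimal sketch of that assumption, and of how a backdated row breaks it, using SQLite via Python's sqlite3 as a stand-in for SQL Server (SQLite 3.25 or later for ROW_NUMBER). The helper `latest_rows`, ID 7 for the last sample row, and the backdated row with ID 8 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)


def latest_rows(order_column):
    """Top row per document, ranked by the given column in descending order."""
    if order_column not in ("DateCreated", "ID"):
        raise ValueError("unsupported column")
    sql = f"""
        WITH cte AS (
            SELECT ID, DocumentID, Status,
                   ROW_NUMBER() OVER (PARTITION BY DocumentID
                                      ORDER BY {order_column} DESC) AS rn
            FROM DocumentStatusLogs
        )
        SELECT ID, DocumentID, Status FROM cte WHERE rn = 1 ORDER BY DocumentID
    """
    return conn.execute(sql).fetchall()


# While IDs and dates increase together, both orderings pick the same rows.
same_result = latest_rows("ID") == latest_rows("DateCreated")

# A backdated insert (hypothetical) gets the highest ID but an old date.
conn.execute("INSERT INTO DocumentStatusLogs VALUES (8, 1, 'S0', '2011-07-01')")
by_id = latest_rows("ID")             # document 1 now resolves to the backdated row
by_date = latest_rows("DateCreated")  # document 1 still resolves to ID 6
```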
#7
5
My code to select top 1 from each group
select a.* from #DocumentStatusLogs a where datecreated in( select top 1 datecreated from #DocumentStatusLogs b where a.documentid = b.documentid order by datecreated desc )
#8
2
Verifying Clint's awesome and correct answer from above:
The relative performance of the two queries below is interesting: 52% for the first and 48% for the second, a 4% improvement from using DISTINCT instead of ORDER BY. ORDER BY, however, has the advantage that it can sort by multiple columns.
IF (OBJECT_ID('tempdb..#DocumentStatusLogs') IS NOT NULL) BEGIN DROP TABLE #DocumentStatusLogs END
CREATE TABLE #DocumentStatusLogs (
[ID] int NOT NULL,
[DocumentID] int NOT NULL,
[Status] varchar(20),
[DateCreated] datetime
)
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (2, 1, 'S1', '7/29/2011 1:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (3, 1, 'S2', '7/30/2011 2:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 1, 'S1', '8/02/2011 3:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (1, 2, 'S1', '7/28/2011 4:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (4, 2, 'S2', '7/30/2011 5:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (5, 2, 'S3', '8/01/2011 6:00:00')
INSERT INTO #DocumentStatusLogs([ID], [DocumentID], [Status], [DateCreated]) VALUES (6, 3, 'S1', '8/02/2011 7:00:00')
Option 1:
SELECT
[Extent1].[ID],
[Extent1].[DocumentID],
[Extent1].[Status],
[Extent1].[DateCreated]
FROM #DocumentStatusLogs AS [Extent1]
OUTER APPLY (
SELECT TOP 1
[Extent2].[ID],
[Extent2].[DocumentID],
[Extent2].[Status],
[Extent2].[DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Extent1].[DocumentID] = [Extent2].[DocumentID]
ORDER BY [Extent2].[DateCreated] DESC, [Extent2].[ID] DESC
) AS [Project2]
WHERE ([Project2].[ID] IS NULL OR [Project2].[ID] = [Extent1].[ID])
Option 2:
SELECT
[Limit1].[ID] AS [ID],
[Limit1].[DocumentID] AS [DocumentID],
[Limit1].[Status] AS [Status],
[Limit1].[DateCreated] AS [DateCreated]
FROM (
SELECT DISTINCT [Extent1].[DocumentID] AS [DocumentID] FROM #DocumentStatusLogs AS [Extent1]
) AS [Distinct1]
OUTER APPLY (
SELECT TOP (1) [Project2].[ID] AS [ID], [Project2].[DocumentID] AS [DocumentID], [Project2].[Status] AS [Status], [Project2].[DateCreated] AS [DateCreated]
FROM (
SELECT
[Extent2].[ID] AS [ID],
[Extent2].[DocumentID] AS [DocumentID],
[Extent2].[Status] AS [Status],
[Extent2].[DateCreated] AS [DateCreated]
FROM #DocumentStatusLogs AS [Extent2]
WHERE [Distinct1].[DocumentID] = [Extent2].[DocumentID]
) AS [Project2]
ORDER BY [Project2].[ID] DESC
) AS [Limit1]
SQL Server Management Studio: after highlighting and running the first block, highlight both Option 1 and Option 2, then right-click -> [Display Estimated Execution Plan]. Then run the entire thing to see the results.
Option 1 Results:
| ID | DocumentID | Status | DateCreated |
| 6 | 1 | S1 | 8/2/11 3:00 |
| 5 | 2 | S3 | 8/1/11 6:00 |
| 6 | 3 | S1 | 8/2/11 7:00 |
Option 2 Results:
| ID | DocumentID | Status | DateCreated |
| 6 | 1 | S1 | 8/2/11 3:00 |
| 5 | 2 | S3 | 8/1/11 6:00 |
| 6 | 3 | S1 | 8/2/11 7:00 |
Note:
- I tend to use APPLY when I want a join to be 1-to-(1 of many).
- I use a JOIN if I want the join to be 1-to-many, or many-to-many.
- I avoid CTE with ROW_NUMBER() unless I need to do something advanced and am OK with the windowing performance penalty.
- I also avoid EXISTS / IN subqueries in the WHERE or ON clause, as I have experienced this causing some terrible execution plans. But mileage varies. Review the execution plan and profile performance where and when needed!
#9
2
I know this is an old thread, but the TOP 1 WITH TIES solution is quite nice and might be helpful to some reading through the solutions.
select top 1 with ties
DocumentID
,Status
,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)
More about the TOP clause can be found here.
#10
1
This is one of the most easily found questions on the topic, so I wanted to give a modern answer to it (both for my reference and to help others out). By using OVER and FIRST_VALUE you can make short work of the above query:
select distinct DocumentID
, first_value(status) over (partition by DocumentID order by DateCreated Desc) as Status
, first_value(DateCreated) over (partition by DocumentID order by DateCreated Desc) as DateCreated
From DocumentStatusLogs
This should work in SQL Server 2012 and up (FIRST_VALUE was introduced in that version, so it is not available in 2008). FIRST_VALUE can be thought of as a way to accomplish SELECT TOP 1 when using an OVER clause. OVER allows grouping in the select list, so instead of writing nested subqueries (like many of the existing answers do), this does it in a more readable fashion. Hope this helps.
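For anyone who wants to try the FIRST_VALUE pattern without a SQL Server instance, here is a sketch on SQLite (3.25 or later) through Python's sqlite3. The query text is essentially the one above; ISO date strings and ID 7 for the last row are assumptions of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)

# Every row of a document gets that document's newest Status and DateCreated;
# DISTINCT then collapses the identical rows to one per document.
latest = conn.execute(
    """
    SELECT DISTINCT DocumentID,
           FIRST_VALUE(Status) OVER (PARTITION BY DocumentID
                                     ORDER BY DateCreated DESC) AS Status,
           FIRST_VALUE(DateCreated) OVER (PARTITION BY DocumentID
                                          ORDER BY DateCreated DESC) AS DateCreated
    FROM DocumentStatusLogs
    ORDER BY DocumentID
    """
).fetchall()
```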
#11
0
In scenarios where you want to avoid using ROW_NUMBER(), you can also use a left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
left join DocumentStatusLogs filter
ON ds.DocumentID = filter.DocumentID
-- Match any row that has another row that was created after it.
AND ds.DateCreated < filter.DateCreated
-- then filter out any rows that matched
where filter.DocumentID is null
For the example schema, you could also use a "not in subquery", which generally compiles to the same output as the left join:
select ds.DocumentID, ds.Status, ds.DateCreated
from DocumentStatusLogs ds
WHERE ds.ID NOT IN (
-- IDs of rows that have a later row for the same document
SELECT older.ID
FROM DocumentStatusLogs older
JOIN DocumentStatusLogs filter
ON older.DocumentID = filter.DocumentID
AND older.DateCreated < filter.DateCreated)
Note, the subquery pattern wouldn't work if the table didn't have at least one single-column unique key/constraint/index, in this case the primary key "Id".
Both of these queries tend to be more "expensive" than the ROW_NUMBER() query (as measured by Query Analyzer). However, you might encounter scenarios where they return results faster or enable other optimizations.
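One thing worth knowing about the left-join (anti-join) pattern is how it treats ties: a row is kept when no strictly later row exists, so two rows sharing the latest date both come back. A small sketch on SQLite via Python's sqlite3, with the alias `later` standing in for `filter`; the tied row with ID 8 is hypothetical, as is ID 7 for the last sample row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER PRIMARY KEY, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (2, 1, "S1", "2011-07-29"),
        (3, 1, "S2", "2011-07-30"),
        (6, 1, "S1", "2011-08-02"),
        (1, 2, "S1", "2011-07-28"),
        (4, 2, "S2", "2011-07-30"),
        (5, 2, "S3", "2011-08-01"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)

ANTI_JOIN = """
    SELECT ds.DocumentID, ds.Status, ds.DateCreated
    FROM DocumentStatusLogs AS ds
    LEFT JOIN DocumentStatusLogs AS later
        ON later.DocumentID = ds.DocumentID
       AND later.DateCreated > ds.DateCreated
    WHERE later.ID IS NULL          -- keep rows with no later row
    ORDER BY ds.DocumentID, ds.Status
"""

latest = conn.execute(ANTI_JOIN).fetchall()

# Hypothetical second entry for document 3 on the same day as its latest row.
conn.execute("INSERT INTO DocumentStatusLogs VALUES (8, 3, 'S2', '2011-08-02')")
latest_with_tie = conn.execute(ANTI_JOIN).fetchall()
```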
#12
0
Try this:
SELECT [DocumentID],
[tmpRez].value('/x[2]','varchar(20)') as [Status],
[tmpRez].value('/x[3]','datetime') as [DateCreated]
FROM (
SELECT [DocumentID],
cast('<x>'+max(cast([ID] as varchar(10))+'</x><x>'+[Status]+'</x><x>'
+cast([DateCreated] as varchar(20)))+'</x>' as XML) as [tmpRez]
FROM DocumentStatusLogs
GROUP by DocumentID) as [tmpQry]
#13
0
SELECT o.*
FROM `DocumentStatusLogs` o
LEFT JOIN `DocumentStatusLogs` b
ON o.DocumentID = b.DocumentID AND o.DateCreated < b.DateCreated
WHERE b.DocumentID is NULL ;
This returns only the most recent row (by DateCreated) for each DocumentID.
#14
-1
This is the most vanilla TSQL I can come up with
SELECT * FROM DocumentStatusLogs D1 JOIN
(
SELECT
DocumentID,MAX(DateCreated) AS MaxDate
FROM
DocumentStatusLogs
GROUP BY
DocumentID
) D2
ON
D2.DocumentID=D1.DocumentID
AND
D2.MaxDate=D1.DateCreated
#15
-2
I checked in SQLite that you can use the following simple query with GROUP BY:
SELECT MAX(DateCreated), *
FROM DocumentStatusLogs
GROUP BY DocumentID
Here MAX helps to get the maximum DateCreated from each group.
But it seems that MySQL doesn't associate the * columns with the row that holds the max DateCreated :(
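The SQLite behaviour here is documented: when a query uses a single min() or max() aggregate, the bare (non-aggregated) columns take their values from the row that holds that minimum or maximum. A minimal sketch with Python's sqlite3; the rows are inserted newest-first on purpose so that the latest row is not simply the last one scanned, and ID 7 for the last row is an assumption to keep IDs unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentStatusLogs ("
    "ID INTEGER, DocumentID INTEGER, Status TEXT, DateCreated TEXT)"
)
conn.executemany(
    "INSERT INTO DocumentStatusLogs VALUES (?, ?, ?, ?)",
    [
        (6, 1, "S1", "2011-08-02"),
        (3, 1, "S2", "2011-07-30"),
        (2, 1, "S1", "2011-07-29"),
        (5, 2, "S3", "2011-08-01"),
        (4, 2, "S2", "2011-07-30"),
        (1, 2, "S1", "2011-07-28"),
        (7, 3, "S1", "2011-08-02"),  # ID 7 is an assumption to keep IDs unique
    ],
)

# ID and Status are bare columns; SQLite fills them from the MAX(DateCreated) row.
latest = conn.execute(
    """
    SELECT ID, DocumentID, Status, MAX(DateCreated)
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ORDER BY DocumentID
    """
).fetchall()
```

This is specific to SQLite. Other engines either reject such a query or may fill the bare columns from an arbitrary row in the group, and even in SQLite the choice is arbitrary when several rows share the maximum.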