In my table, each row has some data columns and a Priority column (for example, a timestamp or just an integer). I want to group my data by ID and then, in each group, take the latest non-null value of each column. For example, I have the following table:
id A B C Priority
1 NULL 3 4 1
1 5 6 NULL 2
1 8 NULL NULL 3
2 634 346 359 1
2 34 NULL 734 2
The desired result is:
id A B C
1 8 6 4
2 34 346 734
In this example the table is small and has only 5 columns, but the real table will be much larger. I really want this script to run fast. I tried to do it myself, but my script works only on SQL Server 2012+, so I deleted it as not applicable.
Numbers: the table could have 150k rows, 20 columns, and 20-80k unique ids, and the average of SELECT COUNT(id) FROM T GROUP BY ID is 2..5.
Now I have working code (thanks to @ypercubeᵀᴹ), but it runs very slowly on big tables; in my case the script can take a minute or even more (with indexes and so on).
How can it be sped up?
SELECT
    d.id,
    d1.A,
    d2.B,
    d3.C
FROM
    ( SELECT id
      FROM T
      GROUP BY id
    ) AS d
  OUTER APPLY
    ( SELECT TOP (1) A
      FROM T
      WHERE id = d.id
        AND A IS NOT NULL
      ORDER BY priority DESC
    ) AS d1
  OUTER APPLY
    ( SELECT TOP (1) B
      FROM T
      WHERE id = d.id
        AND B IS NOT NULL
      ORDER BY priority DESC
    ) AS d2
  OUTER APPLY
    ( SELECT TOP (1) C
      FROM T
      WHERE id = d.id
        AND C IS NOT NULL
      ORDER BY priority DESC
    ) AS d3 ;
On my test database, with a realistic amount of data, I get the following execution plan:
4 solutions
#1
4
This should do the trick: anything raised to the power 0 returns 1, except NULL, which stays NULL:
DECLARE @t table(id int,A int,B int,C int,Priority int)
INSERT @t
VALUES (1,NULL,3 ,4 ,1),
(1,5 ,6 ,NULL,2),(1,8 ,NULL,NULL,3),
(2,634 ,346 ,359 ,1),(2,34 ,NULL,734 ,2)
;WITH CTE as
(
SELECT id,
CASE WHEN row_number() over
(partition by id order by Priority*power(A,0) desc) = 1 THEN A END A,
CASE WHEN row_number() over
(partition by id order by Priority*power(B,0) desc) = 1 THEN B END B,
CASE WHEN row_number() over
(partition by id order by Priority*power(C,0) desc) = 1 THEN C END C
FROM @t
)
SELECT id, max(a) a, max(b) b, max(c) c
FROM CTE
GROUP BY id
Result:
id a b c
1 8 6 4
2 34 346 734
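The trick rests on how POWER and NULL interact, which can be checked in isolation. The following is a minimal sketch, assuming SQL Server 2008 or later (for the VALUES row constructor); the derived table v and its column x are made up for illustration and are not part of the answer.

```sql
-- Illustrative values only: a positive number, zero, a negative number and NULL.
SELECT v.x,
       POWER(v.x, 0)     AS x_pow_0,   -- 1 for every non-NULL x, NULL when x is NULL
       3 * POWER(v.x, 0) AS sort_key   -- a priority of 3 survives only when x is non-NULL
FROM (VALUES (5), (0), (-7), (CAST(NULL AS int))) AS v(x);
```

SQL Server sorts NULL as the lowest value, so ORDER BY ... DESC places rows with a NULL sort key last, and ROW_NUMBER() = 1 lands on the highest-priority row in which the column is non-null. POWER needs a numeric argument, so for non-numeric columns an expression such as CASE WHEN col IS NOT NULL THEN Priority END would presumably have to take its place.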
#2
2
One alternative that might be faster is a multiple-join approach. Get, for each column, the highest priority at which that column is non-null, and then join back to the original table. For the first part:
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id;
Then join this result back to the original table:
with pabc as (
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id
)
select pabc.id, ta.a, tb.b, tc.c
from pabc left join
t ta
on pabc.id = ta.id and pabc.pa = ta.priority left join
t tb
on pabc.id = tb.id and pabc.pb = tb.priority left join
t tc
on pabc.id = tc.id and pabc.pc = tc.priority ;
This can also take advantage of an index on t(id, priority).
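As a rough sketch of such an index, assuming the table is named t with columns id, priority, a, b, c as in the example. The index name is hypothetical, and both the INCLUDE list and the UNIQUE keyword are assumptions: UNIQUE only applies if no id ever repeats a priority value, which the join-back also relies on to avoid returning duplicate rows.

```sql
-- Hypothetical index name; adjust the INCLUDE list to the real data columns.
-- Keyed on (id, priority) so each join-back can seek to the matching row;
-- INCLUDE makes the index covering, so no lookup into the base table is needed.
CREATE UNIQUE NONCLUSTERED INDEX IX_t_id_priority
    ON t (id, priority)
    INCLUDE (a, b, c);
```

With 20 data columns, the covering version of this index is close to a second copy of the table. If the table is already clustered on (id, priority), a separate index is probably unnecessary.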
#3
0
The previous code will work with the following syntax:
with pabc as (
select id,
max(case when a is not null then priority end) as pa,
max(case when b is not null then priority end) as pb,
max(case when c is not null then priority end) as pc
from t
group by id
)
select pabc.Id,ta.a, tb.b, tc.c
from pabc
left join t ta on pabc.id = ta.id and pabc.pa = ta.priority
left join t tb on pabc.id = tb.id and pabc.pb = tb.priority
left join t tc on pabc.id = tc.id and pabc.pc = tc.priority ;
#4
-1
This looks rather strange. You have a log table for all column changes, but no associated table with the current data. Now you are looking for a query to collect the current values from the log table, which is naturally a laborious task.
The solution is simple: keep an additional table with the current data. You can even link the tables with a trigger (so either every time a record is inserted into your log table you update the current table, or every time a change is written to the current table you write a log entry).
Then just query your current table:
select id, a, b, c from currenttable order by id;
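A minimal sketch of the first trigger variant (the log table drives the current table). It assumes that the log table is T from the question, that currenttable has columns id, a, b, c with id as its key, that each INSERT statement carries at most one row per id, and that a newly inserted row always has the highest priority for its id. The trigger name is hypothetical.

```sql
-- Hypothetical trigger; relies on the assumptions listed above.
CREATE TRIGGER trg_T_sync_currenttable
ON T
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- ids already present: a non-NULL new value overwrites, a NULL keeps the old value
    UPDATE cur
    SET    a = COALESCE(i.a, cur.a),
           b = COALESCE(i.b, cur.b),
           c = COALESCE(i.c, cur.c)
    FROM   currenttable AS cur
    JOIN   inserted     AS i ON i.id = cur.id;

    -- ids seen for the first time: copy the new row as-is
    INSERT INTO currenttable (id, a, b, c)
    SELECT i.id, i.a, i.b, i.c
    FROM   inserted AS i
    WHERE  NOT EXISTS (SELECT 1 FROM currenttable AS cur WHERE cur.id = i.id);
END;
```

If rows can arrive out of priority order, or one statement can insert several rows for the same id, the trigger would need extra logic, for example storing the priority of each current value alongside it.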