Given each row that represents a task, with start time and end time, how can I calculate the number of running tasks (i.e. that started and not ended) at the time each task starts (including itself) using a window function with COUNT OVER
? Is a window function even the right approach?
给定表示任务的每一行,以及开始时间和结束时间,我如何计算每个任务开始(包括自己)时运行的任务(即开始和未结束的任务)的数量,使用一个带COUNT OVER的窗口函数?窗口函数是正确的方法吗?
Example, given table tasks
:
例如,给定表任务:
task_id start_time end_time
a 1 10
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
Calculate running_tasks
:
计算running_tasks:
task_id start_time end_time running_tasks
a 1 10 1 # a
b 2 5 2 # a,b
c 5 15 2 # a,c (b has ended)
d 8 13 3 # a,c,d
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
3 个解决方案
#1
2
select task_id,start_time,end_time,running_tasks
from (select task_id,tm,op,start_time,end_time
,sum(op) over
(
order by tm,op
rows unbounded preceding
) as running_tasks
from (select task_id,start_time as tm,1 as op,start_time,end_time
from tasks
union all
select task_id,end_time as tm,-1 as op,start_time,end_time
from tasks
) t
)t
where op = 1
;
#2
2
You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:
您可以使用关联子查询,在本例中它是一个自连接;不需要分析函数。在启用标准SQL(在UI的“Show Options”下取消“使用遗留SQL”选项)之后,您可以运行这个示例:
WITH tasks AS (
SELECT
task_id,
start_time,
end_time
FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
('a', 1, 10),
('b', 2, 5),
('c', 5, 15),
('d', 8, 13),
('e', 12, 20),
('f', 21, 30)
])
)
SELECT
*,
(SELECT COUNT(*) FROM tasks t2
WHERE t.start_time >= t2.start_time AND
t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
#3
2
As Elliott has mentioned - "it's generally more difficult to explain analytic functions to new users" and even established users do not always 100% good at it (while being very very close to it)!
So, while Dudu Markovitz answer is great - unfortunatelly, it is still incorrect (at least as per how I understood question). The case when it is not correct is when you have multiple tasks started at the same start_time - so those tasks have wrong "running tasks" result
正如埃利奥特所提到的——“向新用户解释分析函数通常更困难”,即使是老用户也不总是100%地擅长于此(尽管非常接近它)!所以,尽管Dudu Markovitz的回答很好——不幸的是,它仍然是不正确的(至少就我理解问题的方式而言)。不正确的情况是在同一个start_time启动多个任务时—所以这些任务有错误的“运行任务”结果
As an example - consider below example:
举个例子——考虑下面的例子:
task_id start_time end_time
a 1 10
aa 1 2
aaa 1 8
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
I think, you would expect below result:
我认为,你会期待以下结果:
task_id start_time end_time running_tasks
a 1 10 3 # a,aa,aaa
aa 1 2 3 # a,aa,aaa
aaa 1 8 3 # a,aa,aaa
b 2 5 3 # a,aaa,b (aa has ended)
c 5 15 3 # a,aaa,c (b has ended)
d 8 13 3 # a,c,d (aaa has ended)
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
If you will try it with Dudu's code - you will get below instead
如果您尝试使用Dudu的代码,您将得到下面的代码。
task_id start_time end_time running_tasks
a 1 10 1
aa 1 2 2
aaa 1 8 3
b 2 5 3
c 5 15 3
d 8 13 3
e 12 20 3
f 21 30 1
As you can see result for tasks a and aa wrong.
The reason is because of use of ROWS UNBOUNDED PRECEDING
instead of RANGE UNBOUNDED PRECEDING
- small but very important nuance!
你可以看到任务a和aa的结果是错误的。原因是由于使用的行*前,而不是范围*前——小但非常重要的细微差别!
So below query will give you correct result
下面的查询将给出正确的结果
SELECT task_id,start_time,end_time,running_tasks
FROM (
SELECT
task_id, tm, op, start_time, end_time,
SUM(op) OVER (ORDER BY tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks
FROM (
SELECT
task_id, start_time AS tm, 1 AS op, start_time, end_time
FROM tasks UNION ALL
SELECT
task_id, end_time AS tm, -1 AS op, start_time, end_time
FROM tasks
) t
)t
WHERE op = 1
ORDER BY start_time
quick summary:
ROWS UNBOUNDED PRECEDING - sets the window frame based on rows' position
whereas
RANGE UNBOUNDED PRECEDING - sets the window frame based on rows values
快速总结:根据行的位置,根据行的位置设置窗口框架,而不受限制的行,根据行值设置窗口框架。
Again - as Elliott has mentioned - this is much more complex to fully get into it than JOIN concept - but it worth it (as it is much more efficient than joins) - see more about Window Frame Clause and ROWS vs RANGE use
同样——正如埃利奥特所提到的——完全理解它比加入概念要复杂得多——但是值得一试(因为它比加入更有效)——更多地了解窗口框架子句和行vs范围使用
#1
2
select task_id,start_time,end_time,running_tasks
from (select task_id,tm,op,start_time,end_time
,sum(op) over
(
order by tm,op
rows unbounded preceding
) as running_tasks
from (select task_id,start_time as tm,1 as op,start_time,end_time
from tasks
union all
select task_id,end_time as tm,-1 as op,start_time,end_time
from tasks
) t
)t
where op = 1
;
#2
2
You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:
您可以使用关联子查询,在本例中它是一个自连接;不需要分析函数。在启用标准SQL(在UI的“Show Options”下取消“使用遗留SQL”选项)之后,您可以运行这个示例:
WITH tasks AS (
SELECT
task_id,
start_time,
end_time
FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
('a', 1, 10),
('b', 2, 5),
('c', 5, 15),
('d', 8, 13),
('e', 12, 20),
('f', 21, 30)
])
)
SELECT
*,
(SELECT COUNT(*) FROM tasks t2
WHERE t.start_time >= t2.start_time AND
t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
#3
2
As Elliott has mentioned - "it's generally more difficult to explain analytic functions to new users" and even established users do not always 100% good at it (while being very very close to it)!
So, while Dudu Markovitz answer is great - unfortunatelly, it is still incorrect (at least as per how I understood question). The case when it is not correct is when you have multiple tasks started at the same start_time - so those tasks have wrong "running tasks" result
正如埃利奥特所提到的——“向新用户解释分析函数通常更困难”,即使是老用户也不总是100%地擅长于此(尽管非常接近它)!所以,尽管Dudu Markovitz的回答很好——不幸的是,它仍然是不正确的(至少就我理解问题的方式而言)。不正确的情况是在同一个start_time启动多个任务时—所以这些任务有错误的“运行任务”结果
As an example - consider below example:
举个例子——考虑下面的例子:
task_id start_time end_time
a 1 10
aa 1 2
aaa 1 8
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
I think, you would expect below result:
我认为,你会期待以下结果:
task_id start_time end_time running_tasks
a 1 10 3 # a,aa,aaa
aa 1 2 3 # a,aa,aaa
aaa 1 8 3 # a,aa,aaa
b 2 5 3 # a,aaa,b (aa has ended)
c 5 15 3 # a,aaa,c (b has ended)
d 8 13 3 # a,c,d (aaa has ended)
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
If you will try it with Dudu's code - you will get below instead
如果您尝试使用Dudu的代码,您将得到下面的代码。
task_id start_time end_time running_tasks
a 1 10 1
aa 1 2 2
aaa 1 8 3
b 2 5 3
c 5 15 3
d 8 13 3
e 12 20 3
f 21 30 1
As you can see result for tasks a and aa wrong.
The reason is because of use of ROWS UNBOUNDED PRECEDING
instead of RANGE UNBOUNDED PRECEDING
- small but very important nuance!
你可以看到任务a和aa的结果是错误的。原因是由于使用的行*前,而不是范围*前——小但非常重要的细微差别!
So below query will give you correct result
下面的查询将给出正确的结果
SELECT task_id,start_time,end_time,running_tasks
FROM (
SELECT
task_id, tm, op, start_time, end_time,
SUM(op) OVER (ORDER BY tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks
FROM (
SELECT
task_id, start_time AS tm, 1 AS op, start_time, end_time
FROM tasks UNION ALL
SELECT
task_id, end_time AS tm, -1 AS op, start_time, end_time
FROM tasks
) t
)t
WHERE op = 1
ORDER BY start_time
quick summary:
ROWS UNBOUNDED PRECEDING - sets the window frame based on rows' position
whereas
RANGE UNBOUNDED PRECEDING - sets the window frame based on rows values
快速总结:根据行的位置,根据行的位置设置窗口框架,而不受限制的行,根据行值设置窗口框架。
Again - as Elliott has mentioned - this is much more complex to fully get into it than JOIN concept - but it worth it (as it is much more efficient than joins) - see more about Window Frame Clause and ROWS vs RANGE use
同样——正如埃利奥特所提到的——完全理解它比加入概念要复杂得多——但是值得一试(因为它比加入更有效)——更多地了解窗口框架子句和行vs范围使用