COUNT()对当前行进行了限制

Given each row that represents a task, with start time and end time, how can I calculate the number of running tasks (i.e. that started and not ended) at the time each task starts (including itself) using a window function with COUNT OVER? Is a window function even the right approach?

给定表示任务的每一行，以及开始时间和结束时间，我如何计算每个任务开始(包括自己)时运行的任务(即开始和未结束的任务)的数量，使用一个带COUNT OVER的窗口函数?窗口函数是正确的方法吗?

Example, given table tasks:

例如,给定表任务:

task_id  start_time  end_time
   a         1          10
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

Calculate running_tasks:

计算running_tasks:

task_id  start_time  end_time  running_tasks
   a         1          10           1         # a
   b         2           5           2         # a,b
   c         5          15           2         # a,c (b has ended)
   d         8          13           3         # a,c,d
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)

3 个解决方案

#1

select      task_id,start_time,end_time,running_tasks 

from       (select      task_id,tm,op,start_time,end_time

                       ,sum(op) over 
                        (
                            order by    tm,op 
                            rows        unbounded preceding
                        ) as running_tasks 

            from       (select      task_id,start_time as tm,1 as op,start_time,end_time 
                        from        tasks 

                        union   all 

                        select      task_id,end_time as tm,-1 as op,start_time,end_time 
                        from        tasks 
                        ) t 
            )t 

where       op = 1
;

#2

You can use a correlated subquery, which in this case is a self-join; no analytic functions are needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI) you can run this example:

您可以使用关联子查询，在本例中它是一个自连接;不需要分析函数。在启用标准SQL(在UI的“Show Options”下取消“使用遗留SQL”选项)之后，您可以运行这个示例:

WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
   t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;

#3

As Elliott has mentioned - "it's generally more difficult to explain analytic functions to new users" and even established users do not always 100% good at it (while being very very close to it)!
So, while Dudu Markovitz answer is great - unfortunatelly, it is still incorrect (at least as per how I understood question). The case when it is not correct is when you have multiple tasks started at the same start_time - so those tasks have wrong "running tasks" result

正如埃利奥特所提到的——“向新用户解释分析函数通常更困难”，即使是老用户也不总是100%地擅长于此(尽管非常接近它)!所以，尽管Dudu Markovitz的回答很好——不幸的是，它仍然是不正确的(至少就我理解问题的方式而言)。不正确的情况是在同一个start_time启动多个任务时—所以这些任务有错误的“运行任务”结果

As an example - consider below example:

举个例子——考虑下面的例子:

task_id  start_time  end_time
   a         1          10
   aa        1           2
   aaa       1           8
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

I think, you would expect below result:

我认为，你会期待以下结果:

task_id  start_time  end_time  running_tasks
   a         1          10           3         # a,aa,aaa
   aa        1           2           3         # a,aa,aaa
   aaa       1           8           3         # a,aa,aaa
   b         2           5           3         # a,aaa,b (aa has ended)
   c         5          15           3         # a,aaa,c (b has ended)
   d         8          13           3         # a,c,d (aaa has ended)
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)

If you will try it with Dudu's code - you will get below instead

如果您尝试使用Dudu的代码，您将得到下面的代码。

task_id  start_time  end_time  running_tasks
   a         1          10           1        
   aa        1           2           2        
   aaa       1           8           3        
   b         2           5           3        
   c         5          15           3        
   d         8          13           3        
   e        12          20           3        
   f        21          30           1

As you can see result for tasks a and aa wrong.
The reason is because of use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING - small but very important nuance!

你可以看到任务a和aa的结果是错误的。原因是由于使用的行*前，而不是范围*前——小但非常重要的细微差别!

So below query will give you correct result

下面的查询将给出正确的结果

SELECT  task_id,start_time,end_time,running_tasks 
FROM  (
  SELECT  
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY  tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks 
  FROM  (
    SELECT  
      task_id, start_time AS tm, 1 AS op, start_time, end_time 
    FROM  tasks UNION  ALL 
    SELECT  
      task_id, end_time AS tm, -1 AS op, start_time, end_time 
    FROM  tasks 
  ) t 
)t 
WHERE  op = 1
ORDER BY start_time

quick summary:
ROWS UNBOUNDED PRECEDING - sets the window frame based on rows' position
whereas
RANGE UNBOUNDED PRECEDING - sets the window frame based on rows values

快速总结:根据行的位置，根据行的位置设置窗口框架，而不受限制的行，根据行值设置窗口框架。

Again - as Elliott has mentioned - this is much more complex to fully get into it than JOIN concept - but it worth it (as it is much more efficient than joins) - see more about Window Frame Clause and ROWS vs RANGE use

同样——正如埃利奥特所提到的——完全理解它比加入概念要复杂得多——但是值得一试(因为它比加入更有效)——更多地了解窗口框架子句和行vs范围使用

#1

select      task_id,start_time,end_time,running_tasks 

from       (select      task_id,tm,op,start_time,end_time

                       ,sum(op) over 
                        (
                            order by    tm,op 
                            rows        unbounded preceding
                        ) as running_tasks 

            from       (select      task_id,start_time as tm,1 as op,start_time,end_time 
                        from        tasks 

                        union   all 

                        select      task_id,end_time as tm,-1 as op,start_time,end_time 
                        from        tasks 
                        ) t 
            )t 

where       op = 1
;

#2

WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
   t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;

#3