聚合时间序列的高效子查询

时间:2022-10-17 22:08:26

I want to build a time series daily from a certain date and calculate a few statistics for each day. However this query is very slow... Any way to speed it up? (for example, select the table once in the subquery and compute various stats on that table for each day).

我想每天从特定日期建立一个时间序列,并计算每天的一些统计数据。但是这个查询非常慢......有什么方法可以加快速度吗? (例如,在子查询中选择一次表,并在每个日计算该表上的各种统计信息)。

In code this would look like

在代码中,这看起来像

for i, day in series:
    previous_days = series[0...i]
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)

Here is an example for a time series looking for users with <= 5 messages up to that date:

以下是查找截至该日期<= 5条消息的用户的时间序列示例:

with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09', 
  now(), '1 day'::interval) date
),

msgs as 
(
select days.date, 
    (select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
    (select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)

select * from msgs;

Query breakdown:

查询细分:

CTE Scan on msgs  (cost=815579.03..815583.03 rows=200 width=24)
  Output: msgs.date, msgs.lt_5, msgs.eq_1
  CTE days
    ->  Function Scan on pg_catalog.generate_series date  (cost=0.01..10.01 rows=1000 width=8)
          Output: date.date
          Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
  CTE msgs
    ->  Group  (cost=6192.62..815569.02 rows=200 width=8)
          Output: days.date, (SubPlan 2), (SubPlan 3)
          Group Key: days.date
          ->  Merge Join  (cost=6192.62..11239.60 rows=287970 width=8)
                Output: days.date
                Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
                ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
                      Output: days.date
                      Sort Key: days.date
                      ->  CTE Scan on days  (cost=0.00..20.00 rows=1000 width=8)
                            Output: days.date
                ->  Sort  (cost=6122.79..6266.78 rows=57594 width=8)
                      Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
                      Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
                      ->  Seq Scan on public.daily_messages daily_messages_2  (cost=0.00..1568.94 rows=57594 width=8)
                            Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
          SubPlan 2
            ->  Aggregate  (cost=2016.89..2016.90 rows=1 width=32)
                  Output: count(daily_messages.customer_id)
                  ->  Seq Scan on public.daily_messages  (cost=0.00..2000.89 rows=6399 width=32)
                        Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
                        Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
          SubPlan 3
            ->  Aggregate  (cost=2001.13..2001.14 rows=1 width=32)
                  Output: count(daily_messages_1.customer_id)
                  ->  Seq Scan on public.daily_messages daily_messages_1  (cost=0.00..2000.89 rows=96 width=32)
                        Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
                        Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))

1 个解决方案

#1


1  

In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:

除了非常低效,我怀疑查询也是不正确的。假设目前的Postgres 9.6,我的猜测是:

SELECT created_at::date
     , sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
     , sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM   daily_messages m
WHERE  created_at >= timestamptz '2015-07-09'  -- sargable!
AND    created_at <  now()                     -- probably redundant
GROUP  BY 1
WINDOW w AS (ORDER BY created_at::date);

All those correlated subqueries are probably not needed. I replaced it with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:

可能不需要所有相关的子查询。我用窗口函数和聚合FILTER子句替换它。您可以在聚合函数上使用窗口函数。相关答案以及更多解释:

The CTEs don't help either (unnecessary overhead). You only would need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of implications, though. You query depends on the time zone setting of the session. Details:

CTE也没有帮助(不必要的开销)。您只需要一个子查询 - 或者甚至不需要,只需要set-returns函数generate_series()的结果。 generate_series()可以直接传递timestamptz。但请注意其含义。您的查询取决于会话的时区设置。细节:

On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway. No need for generate_series() at all. Would make sense with LEFT JOIN. Not enough information in the question.

第二个想法,我完全删除了generate_series()。只要你有一个INNER JOIN到daily_messages,那么结果中只剩下实际行的天数。根本不需要generate_series()。 LEFT JOIN会有意义。问题中的信息不足。

Related answer explaining "sargable":

相关答案解释“sargable”:

You might replace count(customer_id) with count(*). Not enough information in the question.

您可以将count(customer_id)替换为count(*)。问题中的信息不足。

Might be optimized further, but there is not enough information to be more specific in the answer.

可能会进一步优化,但没有足够的信息在答案中更具体。

Include days without new entries in result

SELECT day
     , sum(lt_5_day) OVER w AS lt_5
     , sum(eq_1_day) OVER w AS eq_1
FROM  (
   SELECT day::date
   FROM   generate_series(date '2015-07-09', current_date, interval '1 day') day
   ) d
LEFT   JOIN (
   SELECT created_at::date AS day
        , count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
        , count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
   FROM   daily_messages m
   WHERE  created_at >= timestamptz '2015-07-09'
   GROUP  BY 1
   ) m USING (day)
WINDOW w AS (ORDER BY day);
  1. Aggregate daily sums in subquery m.
  2. 在子查询m中汇总每日总和。
  3. Generate series of all days in time range in subquery d.
  4. 在子查询d中生成时间范围内的所有日期系列。
  5. Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.
  6. 使用LEFT [OUTER] JOIN可以保留结果中的所有日期,即使当天没有新行也是如此。

#1


1  

In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:

除了非常低效,我怀疑查询也是不正确的。假设目前的Postgres 9.6,我的猜测是:

SELECT created_at::date
     , sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
     , sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM   daily_messages m
WHERE  created_at >= timestamptz '2015-07-09'  -- sargable!
AND    created_at <  now()                     -- probably redundant
GROUP  BY 1
WINDOW w AS (ORDER BY created_at::date);

All those correlated subqueries are probably not needed. I replaced it with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:

可能不需要所有相关的子查询。我用窗口函数和聚合FILTER子句替换它。您可以在聚合函数上使用窗口函数。相关答案以及更多解释:

The CTEs don't help either (unnecessary overhead). You only would need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of implications, though. You query depends on the time zone setting of the session. Details:

CTE也没有帮助(不必要的开销)。您只需要一个子查询 - 或者甚至不需要,只需要set-returns函数generate_series()的结果。 generate_series()可以直接传递timestamptz。但请注意其含义。您的查询取决于会话的时区设置。细节:

On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway. No need for generate_series() at all. Would make sense with LEFT JOIN. Not enough information in the question.

第二个想法,我完全删除了generate_series()。只要你有一个INNER JOIN到daily_messages,那么结果中只剩下实际行的天数。根本不需要generate_series()。 LEFT JOIN会有意义。问题中的信息不足。

Related answer explaining "sargable":

相关答案解释“sargable”:

You might replace count(customer_id) with count(*). Not enough information in the question.

您可以将count(customer_id)替换为count(*)。问题中的信息不足。

Might be optimized further, but there is not enough information to be more specific in the answer.

可能会进一步优化,但没有足够的信息在答案中更具体。

Include days without new entries in result

SELECT day
     , sum(lt_5_day) OVER w AS lt_5
     , sum(eq_1_day) OVER w AS eq_1
FROM  (
   SELECT day::date
   FROM   generate_series(date '2015-07-09', current_date, interval '1 day') day
   ) d
LEFT   JOIN (
   SELECT created_at::date AS day
        , count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
        , count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
   FROM   daily_messages m
   WHERE  created_at >= timestamptz '2015-07-09'
   GROUP  BY 1
   ) m USING (day)
WINDOW w AS (ORDER BY day);
  1. Aggregate daily sums in subquery m.
  2. 在子查询m中汇总每日总和。
  3. Generate series of all days in time range in subquery d.
  4. 在子查询d中生成时间范围内的所有日期系列。
  5. Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.
  6. 使用LEFT [OUTER] JOIN可以保留结果中的所有日期,即使当天没有新行也是如此。