Aggregating the most recent joined records per week

I have an updates table in Postgres 9.4.5 like this:

goal_id    | created_at | status
1          | 2016-01-01 | green
1          | 2016-01-02 | red
2          | 2016-01-02 | amber

And a goals table like this:

id | company_id
1  | 1
2  | 2

I want to create a chart for each company that shows the state of all of their goals, per week.

I imagine this would require generating a series of the past 8 weeks, finding the most recent update for each goal that came before each week, then counting the different statuses of the updates found.

What I have so far:

SELECT EXTRACT(year from generate_series) AS year, 
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
  SELECT DISTINCT ON(year, week)
         goals.company_id,
         updates.status, 
         EXTRACT(week from updates.created_at) week,
         EXTRACT(year from updates.created_at) AS year,
         updates.created_at 
  FROM updates
  JOIN goals ON goals.id = updates.goal_id
  ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3

But this has two problems. First, the join on u isn't working as I thought it would: it seems to join on every row (?) returned from the inner query. Second, it only selects the most recent update that happened within that week, whereas it should grab the most recent update from before that week if it needs to.

This is some pretty complicated SQL and I'd love some input on how to pull it off.

Table structures and info

The goals table has around 1000 goals at the moment and is growing by about 100 a week:

                                           Table "goals"
     Column      |            Type             |                         Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
 id              | integer                     | not null default nextval('goals_id_seq'::regclass)
 company_id      | integer                     | not null
 name            | text                        | not null
 created_at      | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at      | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goals_pkey" PRIMARY KEY, btree (id)
    "entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
    "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

The updates table has around 1000 rows and is growing by about 100 a week:

                                         Table "updates"
   Column   |            Type             |                            Modifiers
------------+-----------------------------+------------------------------------------------------------------
 id         | integer                     | not null default nextval('updates_id_seq'::regclass)
 status     | entity.goalstatus           | not null
 goal_id    | integer                     | not null
 created_at | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goal_updates_pkey" PRIMARY KEY, btree (id)
    "entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
    "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

 Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
 entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
        |                   |               |      | amber   +|                   |
        |                   |               |      | red      |                   |

3 Answers

#1

You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.

And simplify date handling with date_trunc().

SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', NOW() - interval '2 months')
                     , date_trunc('week', NOW())
                     , interval '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;

To make this fast you need a multicolumn index:

CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

Descending order for created_at is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)

Index columns in that order. Why?

Multicolumn index and performance

And the third column status is only appended to allow fast index-only scans on updates. Related case:

Slow index scans in large table

1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.
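
One way to check whether the index is actually used is to run the inner lookup on its own under EXPLAIN. This is only a sketch: it assumes the multicolumn index above has been created, and the goal id 1 and the cutoff expression are placeholder values.

```sql
-- Inspect the plan for a single "latest status before a cutoff" lookup.
-- goal_id = 1 and the cutoff are placeholders, not values from the answer.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status
FROM   updates
WHERE  goal_id = 1
AND    created_at < date_trunc('week', now())
ORDER  BY created_at DESC
LIMIT  1;
```

With the index in place and a recently vacuumed table, the plan would ideally show an Index Only Scan. On a table this small the planner may still prefer a sequential scan, which is consistent with the point that performance only becomes a concern as the tables grow.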

w_start represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details that represent your week), if you insist:

   EXTRACT(isoyear from w_start) AS year
 , EXTRACT(week    from w_start) AS week

Best with ISOYEAR, like @Paul explained.

SQL Fiddle.

What is the difference between LATERAL and a subquery in PostgreSQL?

Optimize GROUP BY query to retrieve latest record per user

Select first row in each GROUP BY group?

PostgreSQL: running count of rows for a query 'by minute'

#2

This seems like a good use for LATERAL joins:

SELECT  EXTRACT(ISOYEAR FROM s.w) AS year,
        EXTRACT(WEEK FROM s.w) AS week,
        u.company_id,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
        COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
  SELECT  DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
  FROM    updates u2
  INNER JOIN goals g
  ON      g.id = u2.goal_id
  WHERE   u2.created_at <= s.w
  ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u 
ON true
WHERE   u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;

Btw I am extracting ISOYEAR not YEAR to ensure I get sensible results around the beginning of January. For instance EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08') is 2016 but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08') is 53!
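
To see the mismatch directly, the following sketch lists calendar year, ISO year and ISO week for the days around New Year 2016. The date range is chosen here purely for illustration.

```sql
-- Compare calendar year, ISO year and ISO week around the turn of the year.
SELECT d::date                 AS day
     , EXTRACT(year    FROM d) AS year
     , EXTRACT(isoyear FROM d) AS isoyear
     , EXTRACT(week    FROM d) AS week
FROM   generate_series(timestamp '2015-12-28'
                     , timestamp '2016-01-04'
                     , interval '1 day') d;
```

The rows for 2016-01-01 through 2016-01-03 report year 2016 but isoyear 2015 and week 53, so pairing YEAR with WEEK would file them under a week 53 of 2016 that does not exist.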

EDIT: You should test on your real data, but I feel like this ought to be faster:

SELECT  year,
        week,
        company_id,
        COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
        COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM    (
  SELECT  EXTRACT(ISOYEAR FROM s.t) AS year,
          EXTRACT(WEEK FROM s.t) AS week,
          u.company_id,
          u.goal_id,
          (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
  FROM    generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
  LEFT OUTER JOIN ( 
    SELECT  g.company_id, u2.goal_id, u2.created_at, u2.status
    FROM    updates u2
    INNER JOIN goals g 
    ON      g.id = u2.goal_id
  ) u 
  ON      s.t >= u.created_at
  WHERE   u.company_id IS NOT NULL
  GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;

Still no window functions though. :-) Also you can speed it up a bit more by replacing (array_agg(...))[1] with a real first function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.
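
For reference, a first aggregate along the lines of the commonly cited Postgres wiki version can be sketched as follows. The names first_agg and first are taken from memory of that page and should be treated as assumptions, and the sample rows are the ones from the question.

```sql
-- State transition function: keeps the first non-null value it sees.
-- STRICT means null inputs are skipped and the first non-null input
-- becomes the initial state.
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1; $$;

-- Aggregate wrapper around the transition function.
CREATE AGGREGATE public.first (anyelement) (
    SFUNC = public.first_agg,
    STYPE = anyelement
);

-- Usage: latest status per goal, replacing (array_agg(...))[1].
SELECT goal_id
     , first(status ORDER BY created_at DESC) AS last_status
FROM  (VALUES (1, 'green', date '2016-01-01')
            , (1, 'red',   date '2016-01-02')
            , (2, 'amber', date '2016-01-02')) u(goal_id, status, created_at)
GROUP  BY goal_id;
```

In the query above, the same substitution would read first(u.status ORDER BY u.created_at DESC) AS last_status. Whether it is measurably faster should be checked on real data.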

#3

I use PostgreSQL 9.3. Your question interested me, so I examined your data structure and then created the following tables.

I inserted the following records:

Company

Goals

Updates
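
The table definitions and rows from this answer are not reproduced here. A minimal setup that is consistent with the queries below might look like the following sketch. The column lists are trimmed to what those queries use, the rows in goals and updates are the sample rows from the question, and the company names are hypothetical.

```sql
-- Hypothetical setup, not the answerer's original tables.
CREATE TABLE company (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE goals (
    id         integer PRIMARY KEY,
    company_id integer NOT NULL REFERENCES company (id)
);

CREATE TABLE updates (
    id         serial PRIMARY KEY,
    goal_id    integer NOT NULL REFERENCES goals (id) ON DELETE CASCADE,
    status     text NOT NULL,  -- the question uses an enum type here
    created_at timestamp NOT NULL DEFAULT now()
);

-- Company names are placeholders.
INSERT INTO company (id, name) VALUES (1, 'Company A'), (2, 'Company B');
INSERT INTO goals (id, company_id) VALUES (1, 1), (2, 2);
INSERT INTO updates (goal_id, status, created_at) VALUES
    (1, 'green', '2016-01-01'),
    (1, 'red',   '2016-01-02'),
    (2, 'amber', '2016-01-02');
```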

After that, I wrote the following query as a check:

SELECT c.id company_id, c.name company_name, u.status goal_status, 
         EXTRACT(week from u.created_at) goal_status_week,
         EXTRACT(year from u.created_at) AS goal_status_year 
FROM company c
INNER JOIN goals g ON g.company_id = c.id 
INNER JOIN updates u ON u.goal_id = g.id
ORDER BY goal_status_year DESC, goal_status_week DESC;

I get the following results:

Finally, I merged this query with the week series:

SELECT gs.company_id,
       gs.company_name,
       gs.goal_status,
       EXTRACT(year from w) AS year,
       EXTRACT(week from w) AS week,
       COUNT(gs.*) cnt
FROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') w
LEFT JOIN (
  SELECT c.id company_id, c.name company_name, u.status goal_status,
         EXTRACT(week from u.created_at) goal_status_week,
         EXTRACT(year from u.created_at) AS goal_status_year
  FROM company c
  INNER JOIN goals g ON g.company_id = c.id
  INNER JOIN updates u ON u.goal_id = g.id
) gs
ON gs.goal_status_week = EXTRACT(week from w) AND gs.goal_status_year = EXTRACT(year from w)
GROUP BY company_id, company_name, goal_status, year, week
ORDER BY year DESC, week DESC;

I get this result

Have a good day.
