I have an updates table in Postgres 9.4.5 like this:
goal_id | created_at | status
1 | 2016-01-01 | green
1 | 2016-01-02 | red
2 | 2016-01-02 | amber
And a goals table like this:
id | company_id
1 | 1
2 | 2
I want to create a chart for each company that shows the state of all of their goals, per week.
I imagine this would require generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the updates found.
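For example, just generating the series of week starts might look something like this (only a rough sketch, using date_trunc to snap to week boundaries):

SELECT generate_series(date_trunc('week', now()) - interval '7 weeks',
                       date_trunc('week', now()),
                       interval '1 week') AS week_start;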
What I have so far:
SELECT EXTRACT(year from generate_series) AS year,
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
    SELECT DISTINCT ON (year, week)
           goals.company_id,
           updates.status,
           EXTRACT(week from updates.created_at) AS week,
           EXTRACT(year from updates.created_at) AS year,
           updates.created_at
    FROM updates
    JOIN goals ON goals.id = updates.goal_id
    ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3
But this has two problems. First, the join on u isn't working as I thought it would: it seems to be joining on every row (?) returned from the inner query. Second, it only selects the most recent update that happened within that week; it should grab the most recent update from before that week if it needs to.
This is some pretty complicated SQL and I'd love some input on how to pull it off.
Table structures and info
The goals table has around ~1000 goals at the moment and is growing by about ~100 a week:
Table "goals"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('goals_id_seq'::regclass)
company_id | integer | not null
name | text | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goals_pkey" PRIMARY KEY, btree (id)
"entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
"goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT
The updates table has around ~1000 rows and is growing by around ~100 a week:
Table "updates"
Column | Type | Modifiers
------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default nextval('updates_id_seq'::regclass)
status | entity.goalstatus | not null
goal_id | integer | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goal_updates_pkey" PRIMARY KEY, btree (id)
"entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
"updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE
Schema | Name | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
entity | entity.goalstatus | goalstatus | 4 | green +| |
| | | | amber +| |
| | | | red | |
3 Answers
#1
6
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.
And simplify date handling with date_trunc().
SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red') AS red_count
FROM   generate_series(date_trunc('week', NOW() - interval '2 months')
                     , date_trunc('week', NOW())
                     , interval '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;
To make this fast you need a multicolumn index:
CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);
Descending order for created_at is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)
Index columns in that order. Why?
And the third column status is only appended to allow fast index-only scans on updates. Related case:
- Slow index scans in large table
1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.
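As a quick sanity check (illustrative only; goal_id = 1 is a placeholder value, and now() stands in for one week boundary), an EXPLAIN of the per-goal lookup should show the new index being used:

-- Illustrative: verify the lateral lookup can use updates_special_idx.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status
FROM   updates
WHERE  goal_id = 1            -- placeholder goal id
AND    created_at < now()     -- stands in for w_start
ORDER  BY created_at DESC
LIMIT  1;

Ideally this shows an Index Only Scan on updates_special_idx (after VACUUM has set the visibility map).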
w_start represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details to represent your week), if you insist:
EXTRACT(isoyear from w_start) AS year
, EXTRACT(week from w_start) AS week
Best with ISOYEAR, like @Paul explained.
Related:
- What is the difference between LATERAL and a subquery in PostgreSQL?
- Optimize GROUP BY query to retrieve latest record per user
- Select first row in each GROUP BY group?
- PostgreSQL: running count of rows for a query 'by minute'
#2
3
This seems like a good use for LATERAL joins:
SELECT EXTRACT(ISOYEAR FROM s.w) AS year,
       EXTRACT(WEEK FROM s.w) AS week,
       u.company_id,
       COUNT(u.goal_id) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(u.goal_id) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(u.goal_id) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(w)
LEFT OUTER JOIN LATERAL (
    SELECT DISTINCT ON (g.company_id, u2.goal_id) g.company_id, u2.goal_id, u2.status
    FROM updates u2
    INNER JOIN goals g
            ON g.id = u2.goal_id
    WHERE u2.created_at <= s.w
    ORDER BY g.company_id, u2.goal_id, u2.created_at DESC
) u
ON true
WHERE u.company_id IS NOT NULL
GROUP BY year, week, u.company_id
ORDER BY u.company_id, year, week
;
Btw I am extracting ISOYEAR, not YEAR, to ensure I get sensible results around the beginning of January. For instance EXTRACT(YEAR FROM '2016-01-01 08:49:56.734556-08') is 2016 but EXTRACT(WEEK FROM '2016-01-01 08:49:56.734556-08') is 53!
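To see the mismatch in isolation, here is a self-contained example with a plain date (mirroring the timestamps above):

-- Around New Year, YEAR and WEEK disagree; ISOYEAR matches the ISO week.
SELECT EXTRACT(YEAR    FROM date '2016-01-01') AS year,     -- 2016
       EXTRACT(ISOYEAR FROM date '2016-01-01') AS isoyear,  -- 2015
       EXTRACT(WEEK    FROM date '2016-01-01') AS week;     -- 53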
EDIT: You should test on your real data, but I feel like this ought to be faster:
SELECT year,
       week,
       company_id,
       COUNT(goal_id) FILTER (WHERE last_status = 'green') AS green_count,
       COUNT(goal_id) FILTER (WHERE last_status = 'amber') AS amber_count,
       COUNT(goal_id) FILTER (WHERE last_status = 'red') AS red_count
FROM (
    SELECT EXTRACT(ISOYEAR FROM s.t) AS year,
           EXTRACT(WEEK FROM s.t) AS week,
           u.company_id,
           u.goal_id,
           (array_agg(u.status ORDER BY u.created_at DESC))[1] AS last_status
    FROM generate_series(NOW() - INTERVAL '2 months', NOW(), '1 week') s(t)
    LEFT OUTER JOIN (
        SELECT g.company_id, u2.goal_id, u2.created_at, u2.status
        FROM updates u2
        INNER JOIN goals g
                ON g.id = u2.goal_id
    ) u
    ON s.t >= u.created_at
    WHERE u.company_id IS NOT NULL
    GROUP BY year, week, u.company_id, u.goal_id
) x
GROUP BY year, week, company_id
ORDER BY company_id, year, week
;
Still no window functions though. :-) Also you can speed it up a bit more by replacing (array_agg(...))[1] with a real first function. You'll have to define that yourself, but there are implementations on the Postgres wiki that are easy to Google for.
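For reference, a minimal sketch of such a first aggregate, following the usual pattern from the Postgres wiki (the name first_agg and this exact definition are just one variant):

-- Sketch of a "first" aggregate: keeps the first input value it sees.
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT
AS $$ SELECT $1 $$;

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);

With that in place, (array_agg(u.status ORDER BY u.created_at DESC))[1] becomes first(u.status ORDER BY u.created_at DESC).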
#3
0
I use PostgreSQL 9.3. I'm interested in your question. I examined your data structure, then I created the following tables.
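Roughly equivalent DDL for the test setup (only a sketch inferred from the question's table definitions; the company table's columns are assumed from the queries below, and schema/sequence details are simplified):

-- Assumed test schema (sketch): enum and column names follow the question,
-- the company table is inferred from the queries below.
CREATE TYPE goalstatus AS ENUM ('green', 'amber', 'red');

CREATE TABLE company (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE goals (
    id         serial PRIMARY KEY,
    company_id integer NOT NULL REFERENCES company(id),
    name       text NOT NULL,
    created_at timestamp NOT NULL DEFAULT timezone('utc', now())
);

CREATE TABLE updates (
    id         serial PRIMARY KEY,
    status     goalstatus NOT NULL,
    goal_id    integer NOT NULL REFERENCES goals(id) ON DELETE CASCADE,
    created_at timestamp NOT NULL DEFAULT timezone('utc', now())
);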
I inserted the following records:
Company
Goals
Updates
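For example (a sketch only; the company names are made up, and the rows mirror the question's sample data):

-- Assumed sample rows; 'Company A' / 'Company B' are placeholder names.
INSERT INTO company (id, name) VALUES (1, 'Company A'), (2, 'Company B');
INSERT INTO goals (id, company_id, name) VALUES (1, 1, 'Goal 1'), (2, 2, 'Goal 2');
INSERT INTO updates (goal_id, created_at, status) VALUES
    (1, '2016-01-01', 'green'),
    (1, '2016-01-02', 'red'),
    (2, '2016-01-02', 'amber');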
After that I wrote the following query, for verification:
SELECT c.id company_id, c.name company_name, u.status goal_status,
       EXTRACT(week from u.created_at) goal_status_week,
       EXTRACT(year from u.created_at) AS goal_status_year
FROM company c
INNER JOIN goals g ON g.company_id = c.id
INNER JOIN updates u ON u.goal_id = g.id
ORDER BY goal_status_year DESC, goal_status_week DESC;
I got the following result:
At last I merged this query with the week series:
SELECT gs.company_id,
       gs.company_name,
       gs.goal_status,
       EXTRACT(year from w) AS year,
       EXTRACT(week from w) AS week,
       COUNT(gs.*) cnt
FROM generate_series(NOW() - INTERVAL '3 MONTHS', NOW(), '1 week') w
LEFT JOIN (
    SELECT c.id company_id, c.name company_name, u.status goal_status,
           EXTRACT(week from u.created_at) goal_status_week,
           EXTRACT(year from u.created_at) AS goal_status_year
    FROM company c
    INNER JOIN goals g ON g.company_id = c.id
    INNER JOIN updates u ON u.goal_id = g.id
) gs
ON gs.goal_status_week = EXTRACT(week from w) AND gs.goal_status_year = EXTRACT(year from w)
GROUP BY company_id, company_name, goal_status, year, week
ORDER BY year DESC, week DESC;
I get this result
Have a good day.