I have the following table (simplified form) in Postgres 9.2
我在Postgres 9.2中有以下表格(简化形式)
CREATE TABLE user_msg_log (
aggr_date DATE,
user_id INTEGER,
running_total INTEGER
);
It contains up to one record per user and per day. There will be approximately 500K records per day for 300 days. running_total is always increasing for each user.
它每个用户每天最多包含一条记录。每天约有500K记录,持续300天。running_total对于每个用户都是递增的。
I want to efficiently retrieve the latest record for each user before a specific date. My query is:
我希望在特定日期之前有效地检索每个用户的最新记录。我查询的方法是:
SELECT user_id, max(aggr_date), max(running_total)
FROM user_msg_log
WHERE aggr_date <= :mydate
GROUP BY user_id
which is extremely slow. I have also tried:
这是极其缓慢的。我也试过:
SELECT DISTINCT ON(user_id), aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC;
which has the same plan and is equally slow.
它有相同的计划,而且同样缓慢。
So far I have a single index on user_msg_log(aggr_date), but doesn't help much. Is there any other index I should use to speed this up, or any other way to achieve what I want?
到目前为止,我在user_msg_log(aggr_date)上只有一个索引,但是没有多大帮助。我是否应该用其他的指数来加速这一进程,或者用其他的方法来实现我想要的?
3 个解决方案
#1
73
For best read performance you need a multicolumn index:
为了获得最佳的阅读性能,您需要一个多色索引:
CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST)
To make index only scans possible, add the otherwise not needed column running_total
:
要使索引仅扫描成为可能,添加不需要的列running_total:
CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total)
Why DESC NULLS LAST
?
为什么DESC null去年?
- Unused index in range of dates query
- 日期查询范围内未使用的索引
For few rows per user_id
or small tables a simple DISTINCT ON
is among the fastest and simplest solutions:
对于每个user_id或小表的几行,一个简单的区别是最快和最简单的解决方案:
- Select first row in each GROUP BY group?
- 在每个组中按组选择第一行?
For many rows per user_id
a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:
对于每个user_id的许多行,松散的索引扫描将(非常)高效。这在Postgres中没有实现(至少在Postgres 10中没有实现),但是有一些方法可以模仿它:
1. No separate table with unique users
The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users
table, solutions in 2. below are typically simpler and faster.
下面的解决方案超出了Postgres Wiki的范围。有一个单独的用户表,解决方案在2中。下面是典型的简单和快速。
1a. Recursive CTE with LATERAL
join
Common Table Expressions require Postgres 8.4+.LATERAL
requires Postgres 9.3+.
公共表表达式需要Postgres 8.4+。侧需要Postgres 9.3 +。
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT user_id, aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT u.user_id, u.aggr_date, u.running_total
FROM cte c
, LATERAL (
SELECT user_id, aggr_date, running_total
FROM user_msg_log
WHERE user_id > c.user_id -- lateral reference
AND aggr_date <= :mydate -- repeat condition
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
) u
)
SELECT user_id, aggr_date, running_total
FROM cte
ORDER BY user_id;
This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.
在当前版本的Postgres中,这是更可取的,并且很容易检索任意列。更多的解释在第2a章。在下面。
1b. Recursive CTE with correlated subqueries
Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
方便检索单个列或整个行。该示例使用了整个表的行类型。其他变体是可能的。
WITH RECURSIVE cte AS (
(
SELECT u -- whole row
FROM user_msg_log u
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT (SELECT u1 -- again, whole row
FROM user_msg_log u1
WHERE user_id > (c.u).user_id -- parentheses to access row type
AND aggr_date <= :mydate -- repeat predicate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1)
FROM cte c
WHERE (c.u).user_id IS NOT NULL -- any NOT NULL column of the row
)
SELECT (u).* -- finally decompose row
FROM cte
WHERE (u).user_id IS NOT NULL -- any column defined NOT NULL
ORDER BY (u).user_id;
It could be misleading to test the row value with c.u IS NOT NULL
. This only returns true
if every single column of the tested row is NOT NULL
and would fail if a single NULL
value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL
(like the primary key). More:
用c测试行值可能具有误导性。你不是零。只有当被测试行的每一列都不是NULL时才返回true,并且如果包含一个空值就会失败。(一段时间以来,我一直在回答这个问题。)相反,要断言在前一次迭代中找到了一行,请测试定义为NOT NULL的行的单个列(如主键)。更多:
- NOT NULL constraint over a set of columns
- 对一组列的非空约束
- IS NOT NULL test for a record does not return TRUE when variable is set
- 当设置变量时,记录的非空测试是否返回TRUE
More explanation for this query in chapter 2b. below.
Related answers:
关于这个查询的更多解释见第2b章。在下面。相关的答案:
- Query last N related rows per row
- 每一行查询最后N个相关行
- GROUP BY one column, while sorting by another in PostgreSQL
- 在PostgreSQL中按一列分组,而按另一列进行排序
2. With separate users
table
Table layout hardly matters as long as we have exactly one row per relevant user_id
. Example:
只要每个相关的user_id只有一行,表布局就不重要了。例子:
CREATE TABLE users (
user_id serial PRIMARY KEY
, username text NOT NULL
);
Ideally, the table is physically sorted. See:
理想情况下,表是物理排序的。看到的:
- Optimize Postgres timestamp query range
- 优化Postgres时间戳查询范围
Or it's small enough (low cardinality) that it hardly matters.
Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.
或者它足够小(低基数),这几乎不重要。另外,在查询中对行进行排序可以帮助进一步优化性能。看到梁帮派的加法。
2a. LATERAL
join
SELECT u.user_id, l.aggr_date, l.running_total
FROM users u
CROSS JOIN LATERAL (
SELECT aggr_date, running_total
FROM user_msg_log
WHERE user_id = u.user_id -- lateral reference
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1
) l;
JOIN LATERAL
allows to reference preceding FROM
items on the same query level. You get one index (-only) look-up per user.
JOIN LATERAL允许从相同查询级别的项引用前面的内容。每个用户只有一个索引查询。
- What is the difference between LATERAL and a subquery in PostgreSQL?
- PostgreSQL中的横向查询和子查询有什么区别?
Consider the possible improvement by sorting the users
table suggested by Gang Liang in another answer. If the physical sort order of the users
table happens to match the index on user_msg_log
, you don't need this.
通过对梁刚在另一个答案中建议的用户表进行排序来考虑可能的改进。如果users表的物理排序顺序恰好与user_msg_log上的索引匹配,则不需要这个。
You don't get results for users missing in the users
table, even if you have entries in user_msg_log
. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.
对于用户表中丢失的用户,即使在user_msg_log中有条目,也不会得到结果。通常,您会有一个外键约束强制引用完整性以排除这种情况。
You also don't get a row for any user that has no matching entry in user_msg_log
. That conforms to your original question. If you need to include those rows in the result use LEFT JOIN LATERAL ... ON true
instead of CROSS JOIN LATERAL
:
对于user_msg_log中没有匹配条目的任何用户,也不会获得行。那符合你原来的问题。如果需要在结果中包含这些行,请使用左侧连接……正确而非横向交叉连接:
- Call a set-returning function with an array argument multiple times
- 多次调用带有数组参数的返回集函数
This form is also best to retrieve more than one rows (but not all) per user. Just use LIMIT n
instead of LIMIT 1
.
这个表单也最好为每个用户检索多个行(但不是全部)。用极限n代替极限1。
Effectively, all of these would do the same:
实际上,所有这些都将起到同样的作用:
JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...
The latter has a lower priority, though. Explicit JOIN
binds before comma.
不过,后者的优先级较低。显式连接在逗号之前绑定。
2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
从单个行检索单个列的好选择。代码示例:
- Optimize groupwise maximum query
- 优化groupwise最大查询
The same is possible for multiple columns, but you need more smarts:
对于多个列也是如此,但是您需要更多的智慧:
CREATE TEMP TABLE combo (aggr_date date, running_total int);
SELECT user_id, (my_combo).* -- note the parentheses
FROM (
SELECT u.user_id
, (SELECT (aggr_date, running_total)::combo
FROM user_msg_log
WHERE user_id = u.user_id
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1) AS my_combo
FROM users u
) sub;
-
Like
LEFT JOIN LATERAL
above, this variant includes all users, even without entries inuser_msg_log
. You getNULL
formy_combo
, which you can easily filter with aWHERE
clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL - same result. You would have to include aNOT NULL
column in the subquery to be sure.与上面的左连接侧边一样,此变体包含所有用户,即使user_msg_log中没有条目。my_combo可以得到空值,如果需要的话,可以使用外部查询中的WHERE子句轻松地进行筛选。Nitpick:在外部查询中,您无法区分子查询是否没有找到一行,或者返回的所有值恰好为空——结果相同。您必须在子查询中包含一个非空列才能确定。
-
A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
关联子查询只能返回一个值。可以将多个列封装到复合类型中。但是要以后分解它,Postgres需要一个众所周知的复合类型。只有提供列定义列表才能分解匿名记录。
-
Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with
CREATE TYPE
, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type:(aggr_date, running_total)::combo
使用已注册的类型,如现有表的行类型,或创建类型。使用CREATE类型显式地(和永久地)注册一个复合类型,或者创建一个临时表(在会话结束时自动删除),以临时提供一个行类型。转换为该类型:(aggr_date, running_total)::组合。
-
Finally, we do not want to decompose
combo
on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6 - improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.最后,我们不希望在相同的查询级别上分解组合。由于查询计划器的弱点,这将对每个列计算一次子查询(直到Postgres 9.6 -计划对Postgres 10进行改进)。相反,将它设置为子查询并在外部查询中分解。
Related:
相关:
- Get values from first and last row per group
- 获取每个组的第一行和最后一行的值
SQL Fiddle demonstrating all four queries.
With a big test for 1k users and 100k log entries.
SQL小提琴演示了所有四个查询。对1k用户和100k日志条目进行了大型测试。
#2
4
Perhaps a different index on the table would help. Try this one: user_msg_log(user_id, aggr_date)
. I am not positive that Postgres will make optimal use with distinct on
.
也许桌上的一个不同的指数会有所帮助。试试这个:user_msg_log(user_id, aggr_date)。我不认为Postgres会以独特的方式发挥最佳作用。
So, I would stick with that index and try this version:
所以,我还是用那个索引试试这个版本:
select *
from user_msg_log uml
where not exists (select 1
from user_msg_log uml2
where uml2.user_id = u.user_id and
uml2.aggr_date <= :mydate and
uml2.aggr_date > uml.aggr_date
);
This should replace the sorting/grouping with index look ups. It might be faster.
这将用索引查找取代排序/分组。可能会更快。
#3
2
This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users
table to exploit the locality of the index on user_msg_log
.
这不是一个独立的答案,而是对@Erwin的答案的评论。对于2a(横向连接示例),可以通过对users表进行排序以利用user_msg_log上索引的位置来改进查询。
SELECT u.user_id, l.aggr_date, l.running_total
FROM (SELECT user_id FROM users ORDER BY user_id) u,
LATERAL (SELECT aggr_date, running_total
FROM user_msg_log
WHERE user_id = u.user_id -- lateral reference
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1) l;
The rationale is that index lookup is expensive if user_id
values are random. By sorting out user_id
first, the subsequent lateral join would be like a simple scan on the index of user_msg_log
. Even though both query plans look alike, the running time would differ much especially for large tables.
其基本原理是,如果user_id值是随机的,那么索引查找的开销就很大。通过首先对user_id进行排序,后续的横向连接将类似于对user_msg_log索引的简单扫描。尽管这两个查询计划看起来很相似,但是运行时间会有很大的不同,特别是对于大型表。
The cost of the sorting is minimal especially if there is an index on the user_id
field.
排序的代价是最小的,特别是如果user_id字段上有一个索引。
#1
73
For best read performance you need a multicolumn index:
为了获得最佳的阅读性能,您需要一个多色索引:
CREATE INDEX user_msg_log_combo_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST)
To make index only scans possible, add the otherwise not needed column running_total
:
要使索引仅扫描成为可能,添加不需要的列running_total:
CREATE INDEX user_msg_log_combo_covering_idx
ON user_msg_log (user_id, aggr_date DESC NULLS LAST, running_total)
Why DESC NULLS LAST
?
为什么DESC null去年?
- Unused index in range of dates query
- 日期查询范围内未使用的索引
For few rows per user_id
or small tables a simple DISTINCT ON
is among the fastest and simplest solutions:
对于每个user_id或小表的几行,一个简单的区别是最快和最简单的解决方案:
- Select first row in each GROUP BY group?
- 在每个组中按组选择第一行?
For many rows per user_id
a loose index scan would be (much) more efficient. That's not implemented in Postgres (at least up to Postgres 10), but there are ways to emulate it:
对于每个user_id的许多行,松散的索引扫描将(非常)高效。这在Postgres中没有实现(至少在Postgres 10中没有实现),但是有一些方法可以模仿它:
1. No separate table with unique users
The following solutions go beyond what's covered in the Postgres Wiki.
With a separate users
table, solutions in 2. below are typically simpler and faster.
下面的解决方案超出了Postgres Wiki的范围。有一个单独的用户表,解决方案在2中。下面是典型的简单和快速。
1a. Recursive CTE with LATERAL
join
Common Table Expressions require Postgres 8.4+.LATERAL
requires Postgres 9.3+.
公共表表达式需要Postgres 8.4+。侧需要Postgres 9.3 +。
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT user_id, aggr_date, running_total
FROM user_msg_log
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT u.user_id, u.aggr_date, u.running_total
FROM cte c
, LATERAL (
SELECT user_id, aggr_date, running_total
FROM user_msg_log
WHERE user_id > c.user_id -- lateral reference
AND aggr_date <= :mydate -- repeat condition
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
) u
)
SELECT user_id, aggr_date, running_total
FROM cte
ORDER BY user_id;
This is preferable in current versions of Postgres and it's simple to retrieve arbitrary columns. More explanation in chapter 2a. below.
在当前版本的Postgres中,这是更可取的,并且很容易检索任意列。更多的解释在第2a章。在下面。
1b. Recursive CTE with correlated subqueries
Convenient to retrieve either a single column or the whole row. The example uses the whole row type of the table. Other variants are possible.
方便检索单个列或整个行。该示例使用了整个表的行类型。其他变体是可能的。
WITH RECURSIVE cte AS (
(
SELECT u -- whole row
FROM user_msg_log u
WHERE aggr_date <= :mydate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT (SELECT u1 -- again, whole row
FROM user_msg_log u1
WHERE user_id > (c.u).user_id -- parentheses to access row type
AND aggr_date <= :mydate -- repeat predicate
ORDER BY user_id, aggr_date DESC NULLS LAST
LIMIT 1)
FROM cte c
WHERE (c.u).user_id IS NOT NULL -- any NOT NULL column of the row
)
SELECT (u).* -- finally decompose row
FROM cte
WHERE (u).user_id IS NOT NULL -- any column defined NOT NULL
ORDER BY (u).user_id;
It could be misleading to test the row value with c.u IS NOT NULL
. This only returns true
if every single column of the tested row is NOT NULL
and would fail if a single NULL
value is contained. (I had this bug in my answer for some time.) Instead, to assert a row was found in the previous iteration, test a single column of the row that is defined NOT NULL
(like the primary key). More:
用c测试行值可能具有误导性。你不是零。只有当被测试行的每一列都不是NULL时才返回true,并且如果包含一个空值就会失败。(一段时间以来,我一直在回答这个问题。)相反,要断言在前一次迭代中找到了一行,请测试定义为NOT NULL的行的单个列(如主键)。更多:
- NOT NULL constraint over a set of columns
- 对一组列的非空约束
- IS NOT NULL test for a record does not return TRUE when variable is set
- 当设置变量时,记录的非空测试是否返回TRUE
More explanation for this query in chapter 2b. below.
Related answers:
关于这个查询的更多解释见第2b章。在下面。相关的答案:
- Query last N related rows per row
- 每一行查询最后N个相关行
- GROUP BY one column, while sorting by another in PostgreSQL
- 在PostgreSQL中按一列分组,而按另一列进行排序
2. With separate users
table
Table layout hardly matters as long as we have exactly one row per relevant user_id
. Example:
只要每个相关的user_id只有一行,表布局就不重要了。例子:
CREATE TABLE users (
user_id serial PRIMARY KEY
, username text NOT NULL
);
Ideally, the table is physically sorted. See:
理想情况下,表是物理排序的。看到的:
- Optimize Postgres timestamp query range
- 优化Postgres时间戳查询范围
Or it's small enough (low cardinality) that it hardly matters.
Else, sorting rows in the query can help to further optimize performance. See Gang Liang's addition.
或者它足够小(低基数),这几乎不重要。另外,在查询中对行进行排序可以帮助进一步优化性能。看到梁帮派的加法。
2a. LATERAL
join
SELECT u.user_id, l.aggr_date, l.running_total
FROM users u
CROSS JOIN LATERAL (
SELECT aggr_date, running_total
FROM user_msg_log
WHERE user_id = u.user_id -- lateral reference
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1
) l;
JOIN LATERAL
allows to reference preceding FROM
items on the same query level. You get one index (-only) look-up per user.
JOIN LATERAL允许从相同查询级别的项引用前面的内容。每个用户只有一个索引查询。
- What is the difference between LATERAL and a subquery in PostgreSQL?
- PostgreSQL中的横向查询和子查询有什么区别?
Consider the possible improvement by sorting the users
table suggested by Gang Liang in another answer. If the physical sort order of the users
table happens to match the index on user_msg_log
, you don't need this.
通过对梁刚在另一个答案中建议的用户表进行排序来考虑可能的改进。如果users表的物理排序顺序恰好与user_msg_log上的索引匹配,则不需要这个。
You don't get results for users missing in the users
table, even if you have entries in user_msg_log
. Typically, you would have a foreign key constraint enforcing referential integrity to rule that out.
对于用户表中丢失的用户,即使在user_msg_log中有条目,也不会得到结果。通常,您会有一个外键约束强制引用完整性以排除这种情况。
You also don't get a row for any user that has no matching entry in user_msg_log
. That conforms to your original question. If you need to include those rows in the result use LEFT JOIN LATERAL ... ON true
instead of CROSS JOIN LATERAL
:
对于user_msg_log中没有匹配条目的任何用户,也不会获得行。那符合你原来的问题。如果需要在结果中包含这些行,请使用左侧连接……正确而非横向交叉连接:
- Call a set-returning function with an array argument multiple times
- 多次调用带有数组参数的返回集函数
This form is also best to retrieve more than one rows (but not all) per user. Just use LIMIT n
instead of LIMIT 1
.
这个表单也最好为每个用户检索多个行(但不是全部)。用极限n代替极限1。
Effectively, all of these would do the same:
实际上,所有这些都将起到同样的作用:
JOIN LATERAL ... ON true
CROSS JOIN LATERAL ...
, LATERAL ...
The latter has a lower priority, though. Explicit JOIN
binds before comma.
不过,后者的优先级较低。显式连接在逗号之前绑定。
2b. Correlated subquery
Good choice to retrieve a single column from a single row. Code example:
从单个行检索单个列的好选择。代码示例:
- Optimize groupwise maximum query
- 优化groupwise最大查询
The same is possible for multiple columns, but you need more smarts:
对于多个列也是如此,但是您需要更多的智慧:
CREATE TEMP TABLE combo (aggr_date date, running_total int);
SELECT user_id, (my_combo).* -- note the parentheses
FROM (
SELECT u.user_id
, (SELECT (aggr_date, running_total)::combo
FROM user_msg_log
WHERE user_id = u.user_id
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1) AS my_combo
FROM users u
) sub;
-
Like
LEFT JOIN LATERAL
above, this variant includes all users, even without entries inuser_msg_log
. You getNULL
formy_combo
, which you can easily filter with aWHERE
clause in the outer query if need be.
Nitpick: in the outer query you can't distinguish whether the subquery didn't find a row or all values returned happen to be NULL - same result. You would have to include aNOT NULL
column in the subquery to be sure.与上面的左连接侧边一样,此变体包含所有用户,即使user_msg_log中没有条目。my_combo可以得到空值,如果需要的话,可以使用外部查询中的WHERE子句轻松地进行筛选。Nitpick:在外部查询中,您无法区分子查询是否没有找到一行,或者返回的所有值恰好为空——结果相同。您必须在子查询中包含一个非空列才能确定。
-
A correlated subquery can only return a single value. You can wrap multiple columns into a composite type. But to decompose it later, Postgres demands a well-known composite type. Anonymous records can only be decomposed providing a column definition list.
关联子查询只能返回一个值。可以将多个列封装到复合类型中。但是要以后分解它,Postgres需要一个众所周知的复合类型。只有提供列定义列表才能分解匿名记录。
-
Use a registered type like the row type of an existing table, or create a type. Register a composite type explicitly (and permanently) with
CREATE TYPE
, or create a temporary table (dropped automatically at end of session) to provide a row type temporarily. Cast to that type:(aggr_date, running_total)::combo
使用已注册的类型,如现有表的行类型,或创建类型。使用CREATE类型显式地(和永久地)注册一个复合类型,或者创建一个临时表(在会话结束时自动删除),以临时提供一个行类型。转换为该类型:(aggr_date, running_total)::组合。
-
Finally, we do not want to decompose
combo
on the same query level. Due to a weakness in the query planner this would evaluate the subquery once for each column (up to Postgres 9.6 - improvements are planned for Postgres 10). Instead, make it a subquery and decompose in the outer query.最后,我们不希望在相同的查询级别上分解组合。由于查询计划器的弱点,这将对每个列计算一次子查询(直到Postgres 9.6 -计划对Postgres 10进行改进)。相反,将它设置为子查询并在外部查询中分解。
Related:
相关:
- Get values from first and last row per group
- 获取每个组的第一行和最后一行的值
SQL Fiddle demonstrating all four queries.
With a big test for 1k users and 100k log entries.
SQL小提琴演示了所有四个查询。对1k用户和100k日志条目进行了大型测试。
#2
4
Perhaps a different index on the table would help. Try this one: user_msg_log(user_id, aggr_date)
. I am not positive that Postgres will make optimal use with distinct on
.
也许桌上的一个不同的指数会有所帮助。试试这个:user_msg_log(user_id, aggr_date)。我不认为Postgres会以独特的方式发挥最佳作用。
So, I would stick with that index and try this version:
所以,我还是用那个索引试试这个版本:
select *
from user_msg_log uml
where not exists (select 1
from user_msg_log uml2
where uml2.user_id = u.user_id and
uml2.aggr_date <= :mydate and
uml2.aggr_date > uml.aggr_date
);
This should replace the sorting/grouping with index look ups. It might be faster.
这将用索引查找取代排序/分组。可能会更快。
#3
2
This is not a standalone answer but rather a comment to @Erwin's answer. For 2a, the lateral join example, the query can be improved by sorting the users
table to exploit the locality of the index on user_msg_log
.
这不是一个独立的答案,而是对@Erwin的答案的评论。对于2a(横向连接示例),可以通过对users表进行排序以利用user_msg_log上索引的位置来改进查询。
SELECT u.user_id, l.aggr_date, l.running_total
FROM (SELECT user_id FROM users ORDER BY user_id) u,
LATERAL (SELECT aggr_date, running_total
FROM user_msg_log
WHERE user_id = u.user_id -- lateral reference
AND aggr_date <= :mydate
ORDER BY aggr_date DESC NULLS LAST
LIMIT 1) l;
The rationale is that index lookup is expensive if user_id
values are random. By sorting out user_id
first, the subsequent lateral join would be like a simple scan on the index of user_msg_log
. Even though both query plans look alike, the running time would differ much especially for large tables.
其基本原理是,如果user_id值是随机的,那么索引查找的开销就很大。通过首先对user_id进行排序,后续的横向连接将类似于对user_msg_log索引的简单扫描。尽管这两个查询计划看起来很相似,但是运行时间会有很大的不同,特别是对于大型表。
The cost of the sorting is minimal especially if there is an index on the user_id
field.
排序的代价是最小的,特别是如果user_id字段上有一个索引。