I am writing a Node.js application to enable search over a PostgreSQL database. To enable Twitter-style typeahead in the search box, I have to crunch a set of keywords from the database to initialize Bloodhound before the page loads. The query looks something like this:
SELECT distinct handlerid from lotintro where char_length(lotid)=7;
So for a large table (lotintro) this is costly; it is also wasteful, as the query result most likely stays the same for different web visitors over a period of time.
What is the proper way to handle this? I am thinking of a few options:
1) Put the query in a stored procedure and call it from node.js:
SELECT * from getallhandlerid()
Does this mean the query will be compiled, and that the database will automatically return the same result set without actually running the query, knowing the result wouldn't have changed?
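For illustration, a minimal sketch of what such a function could look like (the name getallhandlerid matches the call above; everything else is an assumption):

-- Hypothetical definition of getallhandlerid(), wrapping the query as a plain SQL function,
-- marked STABLE since it only reads data:
CREATE OR REPLACE FUNCTION getallhandlerid()
  RETURNS SETOF text
  LANGUAGE sql STABLE AS
$$
  SELECT DISTINCT handlerid FROM lotintro WHERE char_length(lotid) = 7;
$$;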
2) Or, create a separate table to store the distinct handlerid values, and refresh it with a job that runs once a day? (I know that ideally it should be updated by a trigger on every insert/update to the table, but that costs too much.)
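A minimal sketch of that approach (the table name is illustrative; the refresh statements would be run from a scheduled job, e.g. nightly):

-- Illustrative cache table holding the distinct handlerid values:
CREATE TABLE handlerid_cache (handlerid text PRIMARY KEY);

-- Refresh, to be run on a schedule:
BEGIN;
TRUNCATE handlerid_cache;
INSERT INTO handlerid_cache (handlerid)
SELECT DISTINCT handlerid FROM lotintro WHERE char_length(lotid) = 7;
COMMIT;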
3) Create a partial index, as suggested. Here is what I gathered:
Query
SELECT distinct handlerid from lotintro where length(lotid) = 7;
Index
CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE length(lotid) = 7;
With the index, the query takes around 250 ms. Try running:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
" -> Bitmap Heap Scan on lotintro (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Rows Removed by Index Recheck: 55285"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"
Without the index, the query takes around 210 ms. Try running:
explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"
What am I doing wrong here?
4) Using alexius' suggested index and query:
create index on lotintro using btree(char_length(lotid), handlerid);
But it's not an optimal solution. Because there are only a few distinct values, you may use a trick called a loose index scan, which should work much faster in your case:
explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1) -- parentheses required
UNION ALL
SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;
"CTE Scan on t (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 1"
" Buffers: shared hit=608"
" CTE t"
" -> Recursive Union (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
" Buffers: shared hit=608"
" -> Limit (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
" Buffers: shared hit=4"
" -> Index Scan using lotid_btree on lotintro lotintro_1 (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
" Index Cond: (char_length(lotid) = 7)"
" Buffers: shared hit=4"
" -> WorkTable Scan on t t_1 (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
" Filter: (handlerid IS NOT NULL)"
" Rows Removed by Filter: 0"
" Buffers: shared hit=604"
" SubPlan 1"
" -> Limit (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
" Buffers: shared hit=604"
" -> Index Scan using lotid_btree on lotintro (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
" Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
" Buffers: shared hit=604"
"Planning time: 1.574 ms"
**"Execution time: 25.476 ms"**
========= more info on db ============================
dataloggerDB=# \d lotintro
Table "public.lotintro"
Column | Type | Modifiers
--------------+-----------------------------+--------------
lotstartdt | timestamp without time zone | not null
lotid | text | not null
ftc | text | not null
deviceid | text | not null
packageid | text | not null
testprogname | text | not null
testprogdir | text | not null
testgrade | text | not null
testgroup | text | not null
temperature | smallint | not null
testerid | text | not null
handlerid | text | not null
numofsite | text | not null
masknum | text |
soaktime | text |
xamsqty | smallint |
scd | text |
speedgrade | text |
loginid | text |
operatorid | text | not null
loadboardid | text | not null
checksum | text |
lotenddt | timestamp without time zone | not null
totaltest | integer | default (-1)
totalpass | integer | default (-1)
earnhour | real | default 0
avetesttime | real | default 0
Indexes:
"pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
"lotid7_idx" btree (handlerid) WHERE length(lotid) = 7
your version of Postgres: PostgreSQL 9.2
cardinalities (how many rows?): 411K rows for table lotintro
percentage for length(lotid) = 7: 298350/411000 = 73%
============= after porting over everything to PG 9.4 =====================
With index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=14242"
" -> Bitmap Heap Scan on lotintro (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
" Recheck Cond: (length(lotid) = 7)"
" Heap Blocks: exact=13313"
" Buffers: shared hit=14242"
" -> Bitmap Index Scan on lotid7_idx (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
" Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"
Without index:
explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7
"HashAggregate (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
" Group Key: handlerid"
" Buffers: shared hit=13316"
" -> Seq Scan on lotintro (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
" Filter: (length(lotid) = 7)"
" Rows Removed by Filter: 112915"
" Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"
3 Answers
#1
1)
No, a function does not preserve snapshots of the result in any way. There is some potential for performance optimization if you define the function STABLE (which would be correct). Per the documentation:
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement.
IMMUTABLE would be wrong here and potentially cause errors.
So this can hugely benefit multiple calls within the same statement - but that doesn't fit your use case ...
And plpgsql functions work like prepared statements giving you a similar bonus inside the same session:
- Difference between language sql and language plpgsql in PostgreSQL functions
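For illustration, a hypothetical plpgsql variant of the wrapper function from the question (the name is an assumption); the plans of the queries inside it are cached per session, much like a prepared statement:

-- Hypothetical plpgsql wrapper; plans for the embedded query are cached for the session.
CREATE OR REPLACE FUNCTION getallhandlerid_plpgsql()
  RETURNS SETOF text
  LANGUAGE plpgsql STABLE AS
$$
BEGIN
  RETURN QUERY
  SELECT DISTINCT handlerid FROM lotintro WHERE char_length(lotid) = 7;
END;
$$;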
2)
Try a MATERIALIZED VIEW. With or without MV (or some other caching technique), a partial index would be most efficient for your special case:
CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE length(lotid) = 7;
Remember to include the index condition in queries that are supposed to use the index, even if that seems redundant:
- PostgreSQL does not use a partial index
However, as you supplied:
percentage for length(lotid) = 7. [298350/411000= 73%]
That index is only going to help if you can get an index-only scan out of it because the condition is hardly selective. Since the table has very wide rows, index-only scans can be substantially faster.
Loose index scan
Also, rows=298350 are folded to rows=151, so a loose index scan will pay, as I explained here:
- Optimize GROUP BY query to retrieve latest record per user
Or in the Postgres Wiki - which is actually based on this post.
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro
WHERE length(lotid) = 7
ORDER BY 1 LIMIT 1)
UNION ALL
SELECT (SELECT handlerid FROM lotintro
WHERE length(lotid) = 7
AND handlerid > t.handlerid
ORDER BY 1 LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t
WHERE handlerid IS NOT NULL;
This is going to be faster yet in combination with the partial index I suggested. Since the partial index is only about half the size and updated less often (depending on access patterns), it's cheaper overall.
Faster still if you keep the table vacuumed to allow index-only scans. You can set more aggressive storage parameters for just this table if you have lots of writes to it:
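For example (the values are illustrative, not a recommendation for this table), per-table autovacuum settings can be tightened so the visibility map stays current enough for index-only scans:

-- Illustrative per-table autovacuum tuning for lotintro; pick values to match the actual write load.
ALTER TABLE lotintro SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.02
);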
- PostgreSQL Initial Database Size
Finally, you can make this faster still with a materialized view based on this query.
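A minimal sketch of such a materialized view (the view name is an assumption; the plain DISTINCT query is used here for brevity, but the recursive query above could serve as the view definition instead). Refresh it on whatever schedule matches how stale the typeahead list is allowed to get:

-- Illustrative materialized view caching the distinct handlerid values:
CREATE MATERIALIZED VIEW handlerid_mv AS
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;

-- Re-run periodically (e.g. nightly) to pick up new values:
REFRESH MATERIALIZED VIEW handlerid_mv;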
#2
You need to index the exact expression that's used in your WHERE clause: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html
CREATE INDEX char_length_lotid_idx ON lotintro (char_length(lotid));
You can also create a STABLE or IMMUTABLE function to wrap this query, as you suggested: http://www.postgresql.org/docs/9.4/static/sql-createfunction.html
Your last suggestion is also viable; what you are looking for is a MATERIALIZED VIEW: http://www.postgresql.org/docs/9.4/static/sql-creatematerializedview.html This saves you from writing a custom trigger to update the denormalized table.
#3
Since 3/4 of the rows satisfy your condition (length(lotid) = 7), the index by itself won't help much. You might get slightly better performance with this index because of index-only scans:
create index on lotintro using btree(char_length(lotid), handlerid);
But it's not an optimal solution. Because there are only a few distinct values, you may use a trick called a loose index scan, which should work much faster in your case:
WITH RECURSIVE t AS (
(SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1) -- parentheses required
UNION ALL
SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
FROM t
WHERE t.handlerid IS NOT NULL
)
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;
For this query you also need to create the index I mentioned above.