I have a ~2TB fully vacuumed Redshift table with a distkey phash (high cardinality, hundreds of millions of values) and compound sortkeys (phash, last_seen).
When I do a query like:
SELECT
DISTINCT ret_field
FROM
table
WHERE
phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae'
)
AND
last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
It returns very quickly. However, when I increase the number of hashes beyond 10, Redshift converts the IN condition from a bunch of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
The problem is that when I have a couple dozen phash values, the "optimized" query goes from less than a second response time to over half an hour. In other words, it stops using the sortkey and does a full table scan.
Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?
Here is the EXPLAIN difference between <10 hashes and >10 hashes:
Less than 10 (0.4 seconds):
XN Unique (cost=0.00..157253450.20 rows=43 width=27)
-> XN Seq Scan on table (cost=0.00..157253393.92 rows=22510 width=27)
Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))
More than 10 (45-60 minutes):
XN Unique (cost=0.00..181985241.25 rows=1717530 width=27)
-> XN Seq Scan on table (cost=0.00..179718164.48 rows=906830708 width=27)
Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))
5 Answers
#1
2
It's worth trying sortkeys (last_seen, phash), putting last_seen first.
The slowness might be because the leading column of the sort key is phash, which looks like a random string. As the AWS Redshift developer docs say, a timestamp column should be the leading column of the sort key if you use it in WHERE conditions:
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift
With this sort key order, all rows will be sorted by last_seen, then phash. (What does it mean to have multiple sortkey columns?)
Note that you have to recreate your table to change the sort key. This will help you to do that.
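Recreating the table with a different sort key is commonly done as a deep copy. The following is only a minimal sketch: my_table and my_table_new are placeholder names, the table is assumed to have just the three columns seen in the question, and the column types are guesses that must be matched to the real definition.

```sql
-- Placeholder names and guessed column types; adapt to the real table definition.
CREATE TABLE my_table_new (
    phash     CHAR(32),
    last_seen TIMESTAMP,
    ret_field VARCHAR(256)
)
DISTKEY (phash)
COMPOUND SORTKEY (last_seen, phash);  -- last_seen is now the leading sort column

-- Copy the data; rows are sorted according to the new sort key as they are loaded.
INSERT INTO my_table_new
SELECT phash, last_seen, ret_field
FROM my_table;

-- Swap the tables once the copy has been verified.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;
```

On a ~2TB table the copy needs roughly the same amount of free space again, so it may be worth testing on a subset of the data first.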
#2
3
You can try joining against a temporary table or subquery:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (
SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
UNION ALL
SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
UNION ALL
SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
-- UNION ALL
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
Alternatively, search in chunks (if the query optimizer merges them into one, use an auxiliary table to store intermediate results):
SELECT ret_field
FROM table
WHERE phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae')
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( /* more hashes */ )
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
-- ...
If the query optimizer merges the chunks into one, you can try using a temp table for intermediate results.
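The temp-table variant of the chunked search could look roughly like this. It is a sketch under assumptions: chunk_results is a hypothetical table name, my_table stands in for the real table, and VARCHAR(256) is a guess at the type of ret_field.

```sql
-- Hypothetical temp table for intermediate results; the column type is a guess.
CREATE TEMP TABLE chunk_results (ret_field VARCHAR(256));

-- Each chunk keeps the IN list short (the question saw the fast plan with fewer than 10 values).
INSERT INTO chunk_results
SELECT ret_field
FROM my_table
WHERE phash IN (
    '5c8615fa967576019f846b55f11b6e41',
    '8719c8caa9740bec10f914fc2434ccfd',
    '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

-- Repeat the INSERT for each remaining chunk of hashes, then deduplicate once at the end.
SELECT DISTINCT ret_field
FROM chunk_results;
```

Because every chunk is a separate statement, the optimizer cannot merge them back into one large IN list.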
EDIT:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
FROM ...
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
#3
2
Do you really need DISTINCT? This operator could be expensive.
I'd try to use LATERAL JOIN. In the query below, the table Hashes has a column phash; this is your big batch of hashes. It could be a temp table, a (sub)query, anything.
SELECT DISTINCT T.ret_field
FROM
Hashes
INNER JOIN LATERAL
(
SELECT table.ret_field
FROM table
WHERE
table.phash = Hashes.phash
AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
) AS T ON true
It is quite likely that the optimizer implements LATERAL JOIN as a nested loop. It would loop through all rows in Hashes and, for each row, run the SELECT FROM table. The inner SELECT should use the index that you have on (phash, last_seen). To play it safe, include ret_field in the index as well to make it a covering index: (phash, last_seen, ret_field). (Redshift has no conventional indexes; there, the compound sort key on (phash, last_seen) is the closest equivalent.)
There is a very valid point in the answer by @Diego: instead of putting constant phash values into the query, put them in a temporary or permanent table.
I'd like to extend the answer by @Diego and add that it is important that this table with hashes has an index, specifically a unique index.
So, create a table Hashes with one column phash that has exactly the same type as phash in your main table. It is important that the types match. Make that column a primary key with a unique clustered index. Dump your dozens of phash values into the Hashes table.
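A minimal sketch of such a Hashes table in Redshift syntax follows. The CHAR(32) type is an assumption based on the sample hash values, and the DISTKEY/SORTKEY choices are one plausible way to approximate a "clustered" layout, not something the answer specifies.

```sql
-- Assumption: the main table's phash is CHAR(32); use whatever type it really has.
-- Redshift treats PRIMARY KEY as informational only and does not enforce it,
-- so the loaded values must be kept unique by the caller.
CREATE TEMP TABLE Hashes (
    phash CHAR(32) NOT NULL PRIMARY KEY
)
DISTKEY (phash)   -- same distribution column as the main table
SORTKEY (phash);  -- keeps the hashes stored in sorted order

INSERT INTO Hashes (phash) VALUES
    ('5c8615fa967576019f846b55f11b6e41'),
    ('8719c8caa9740bec10f914fc2434ccfd'),
    ('9b657c9f6bf7c5bbd04b5baf94e61dae');
```

Distributing both tables on phash is meant to let matching rows be joined on the same slice without redistributing the large table.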
Then the query becomes a simple INNER JOIN, not lateral:
SELECT DISTINCT table.ret_field
FROM
Hashes
INNER JOIN table ON table.phash = Hashes.phash
WHERE
table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
It is still important that table has an index on (phash, last_seen, ret_field).
The optimizer should be able to take advantage of the fact that both joined tables are sorted by the phash column and that it is unique in the Hashes table.
#4
1
You can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.
Here's an example (I'm using a CTE because the tool I'm using makes it hard to capture the plan when you have more than one SQL statement, but go with a temp table if you can):
select *
from <my_table>
where checksum in
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
VERSUS
with foo as
(
select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select *
from <my_table> r
join foo f on r.checksum = F.my_key
And here's the plan. As you can see, it looks more complex, but that's because of the CTE; it wouldn't look that way with a temp table:
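To check the plan with a temp table instead of the CTE, something along these lines could be used. It is a sketch only: foo_keys is a hypothetical temp table name, my_table stands in for <my_table>, and CHAR(40) is a guess based on the length of the sample checksums.

```sql
-- Hypothetical temp table replacing the CTE; the column type is a guess.
CREATE TEMP TABLE foo_keys (my_key CHAR(40));

INSERT INTO foo_keys (my_key) VALUES
    ('d7360f1b600ae9e895e8b38262cee47936fb6ced'),
    ('d1606f795152c73558513909cd59a8bc3ad865a8'),
    ('bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84'),
    ('b2cad5a92575ed3868ac6e405647c2213eea74a5');

-- EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT *
FROM my_table r
JOIN foo_keys f ON r.checksum = f.my_key;
```

Running the same EXPLAIN against the IN-list version gives a direct comparison of the two plans.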
#5
1
Did you try using UNION for all phash values?
Like this:
SELECT ret_field
FROM table
WHERE phash = '5c8615fa967576019f846b55f11b6e41' -- 1st phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash = '8719c8caa9740bec10f914fc2434ccfd' -- 2nd phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash = '9b657c9f6bf7c5bbd04b5baf94e61dae' -- 3rd phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
-- and so on...
UNION
SELECT ret_field
FROM table
WHERE phash = 'nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn' -- Nth phash value
and last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'