Redshift UPDATE is very slow when it uses a Seq Scan

Time: 2021-08-16 23:08:21

I have to update about 300 rows in a large table (600m rows) and I'm trying to make it faster.

The query I am using is a bit tricky:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) 
THEN 'deleted' ELSE name END
WHERE uid IN ('id_1', 'id_2')

When I run EXPLAIN on this query, I get:

XN Seq Scan on my_table  (cost=0.00..103935.76 rows=4326 width=9838)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I have an interleaved sort key, and uid is one of the columns included in it. The query looks like this because, in the real context, the number of columns in SET (in addition to name) may vary, but it probably won't be more than 10. The basic idea is that I don't want a cross join: the update rules are specific to each column, and I don't want to mix them together. For example, in the future there will be a query like:

UPDATE my_table
SET name = CASE WHEN (event_name in ('event_1', 'event_2', 'event_3')) THEN 'deleted' ELSE name END,
address = CASE WHEN (event_name in ('event_1', 'event_4')) THEN 'deleted' ELSE address END
WHERE uid IN ('id_1', 'id_2')

Anyway, back to the first query, it runs for a very long time (about 45 minutes) and takes 100% CPU.

I also tried an even simpler query:

explain UPDATE my_table SET name = 'deleted' WHERE uid IN ('id_1', 'id_2')
XN Seq Scan on my_table  (cost=0.00..103816.80 rows=4326 width=9821)
   Filter: (((uid)::text = 'id_1'::text) OR ((uid)::text = 'id_2'::text))

I don't know what else I can add to the question to make it clearer; I would be happy to hear any advice.

2 answers

#1


1  

Have you tried removing the interleaved sort key and replacing it with a simple sort key on uid or a compound sort key with uid as the first column?
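
A minimal sketch of that sort key change, assuming a deep copy with CREATE TABLE AS is acceptable for a table of this size. The names my_table_by_uid and my_table_old are hypothetical, and the real table's distribution settings are not known from the question, so they would have to be added by hand.

```sql
-- Deep copy into a new table whose compound sort key leads with uid.
-- my_table_by_uid is a hypothetical name. Add the original table's
-- DISTSTYLE/DISTKEY here: CREATE TABLE AS does not inherit them.
CREATE TABLE my_table_by_uid
COMPOUND SORTKEY (uid)
AS SELECT * FROM my_table;

-- Swap the tables once the copy has been verified.
-- my_table_old is a hypothetical name for the retired copy.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_by_uid RENAME TO my_table;

-- Refresh planner statistics on the new table.
ANALYZE my_table;
```

Note that EXPLAIN will most likely still report XN Seq Scan afterwards, because Redshift has no indexes. The expected gain comes from zone maps: with uid leading the sort key, blocks whose min/max range cannot contain the requested ids can be skipped during the scan.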

Also, the name uid makes me think that you may be using a GUID/UUID as the value. I would suggest that this is an anti-pattern for an id value in Redshift, and especially for a sort key.

Problems with GUID/UUID id:

  • Do not occur in a predictable sequence
    • Often triggers a full sequential scan
    • New rows always disrupt the sort
  • Compress very poorly
    • Requires more disk space for storage
    • Requires more data to be read when queried
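
To check whether the points above apply to a given table, one option is to look at the column's type, compression encoding, and sort key position. This is only a sketch: it assumes my_table lives in a schema on the current search_path, since PG_TABLE_DEF only lists tables from those schemas.

```sql
-- Show type, compression encoding and sort key position for every column.
-- For an interleaved sort key the sortkey values alternate in sign, and
-- the absolute value gives the column's position in the key.
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'my_table';

-- Ask Redshift to suggest an encoding for uid based on a sample of rows.
ANALYZE COMPRESSION my_table (uid);
```

ANALYZE COMPRESSION takes an exclusive lock on the table while it runs, so it is better run outside busy periods.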

#2


0  

An UPDATE in Redshift is a delete followed by an insert. By design, Redshift only marks the old rows as deleted and does not remove them physically (ghost rows). An explicit VACUUM DELETE ONLY <table_name> is required to reclaim the space.

The Seq Scan is affected by these ghost rows. I would suggest running the command above and then checking the performance of the query again.
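
Spelled out for the table in the question, the suggestion above could look like the sketch below. VACUUM REINDEX is not part of this answer; it is included as an assumption because the question mentions an interleaved sort key.

```sql
-- Reclaim space held by rows that earlier UPDATE/DELETE statements
-- marked as deleted.
VACUUM DELETE ONLY my_table;

-- Optional, for interleaved sort keys only: re-analyze the sort key
-- distribution and then run a full vacuum. This can take a long time
-- on a large table.
VACUUM REINDEX my_table;

-- Compare before and after: unsorted is the percentage of unsorted rows,
-- stats_off indicates how stale the planner statistics are.
SELECT "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'my_table';
```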
