VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as below (simplified example, we have over 60 fields):

CREATE TABLE "fact_table" (
  "pk_a" bigint                 NOT NULL ENCODE lzo,
  "pk_b" bigint                 NOT NULL ENCODE delta,
  "d_1"  bigint                 NOT NULL ENCODE runlength,
  "d_2"  bigint                 NOT NULL ENCODE lzo,
  "d_3"  character varying(255) NOT NULL ENCODE lzo,
  "f_1"  bigint                 NOT NULL ENCODE bytedict,
  "f_2"  bigint                     NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );

The table is distributed by a high-cardinality dimension.

The table is sorted by a pair of fields that increment in time order.

The table contains over 2 billion rows, and uses ~350GB of disk space, both "per node".

Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.

Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes over 6 hours

We can see from SELECT * FROM svv_vacuum_progress; that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.

Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all the records from (1) or (2) up to the end of the table

We have tried DELETE and INSERT rather than UPDATE, and that DML step is now significantly quicker. But the VACUUM still merges all 2 billion rows.

DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds

INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds

VACUUM fact_table;
-- 23645 seconds

In fact, the VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.

The Question

Does anyone have any advice on how to avoid this immense VACUUM overhead, and only MERGE on the last 0.1% of the table?

2 Solutions

#1

How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during VACUUM, and we've never experienced any performance problems with doing that. Basically, it doesn't matter how long it takes because we just keep running BAU.

I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance sensitive but we find the query times to be within normal variations until the table is more than, say, 90% unsorted.
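
As a minimal sketch of how to check this, the svv_table_info system view reports an unsorted percentage per table. The table name comes from the question; the 90% figure above is only my rough rule of thumb, so treat any threshold you pick as an assumption to test against your own query times.

```sql
-- Row count and percentage of unsorted rows for the fact table.
SELECT "table", tbl_rows, unsorted
FROM svv_table_info
WHERE "table" = 'fact_table';
```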

If you find that there's a meaningful performance difference, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
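
A rough sketch of that layout follows. The names fact_table_history, fact_table_recent and fact_table_all are hypothetical, and it assumes the existing big table has been renamed to fact_table_history and that each row lives in exactly one of the two tables, which is why the view uses UNION ALL.

```sql
-- Hypothetical names: fact_table_history, fact_table_recent, fact_table_all.
-- LIKE copies the column encodings, DISTKEY and SORTKEY of the big table.
CREATE TABLE fact_table_recent (LIKE fact_table_history);

-- Assumes no row appears in both tables, so UNION ALL is safe.
CREATE VIEW fact_table_all AS
SELECT * FROM fact_table_history
UNION ALL
SELECT * FROM fact_table_recent;

-- The hourly DELETE and INSERT touch only the small table,
-- so only the small table needs a frequent VACUUM.
VACUUM fact_table_recent;
```

Note that VACUUM cannot run inside a transaction block, so issue it as its own statement after the load commits.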

#2

I couldn't fit this in the comments section, so I'm posting it as an answer.

I think that, if the sort keys are the same across the time series tables, you already have a UNION ALL view as the time series view, and performance is still bad, then you may want a time series view structure with explicit filters, such as:

create or replace view schemaname.table_name as 
select * from table_20140901 where sort_key_date = '2014-09-01' union all 
select * from table_20140902 where sort_key_date = '2014-09-02' union all .......
select * from table_20140925 where sort_key_date = '2014-09-25';

Also make sure to collect stats on the sort keys of all these tables after every load, then try running queries against the view. It should be able to push any filter values you use down into the view. At the end of the day, after the load, just run a VACUUM SORT ONLY or a full VACUUM on the current day's table, which should be much faster.
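
For the current day's table, that housekeeping could look roughly like this. The table and column names are taken from the example view above; whether SORT ONLY is enough, or a full VACUUM is needed to reclaim space from deleted rows, depends on your load pattern.

```sql
-- After each load: refresh statistics on the sort key column.
ANALYZE table_20140925 (sort_key_date);

-- End of day: re-sort only the current day's table.
VACUUM SORT ONLY table_20140925;
```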

Let me know if you are still facing any issues after the above test.
