How to do a large, non-blocking UPDATE in PostgreSQL?

Time: 2023-01-26 22:57:28

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.


For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:


UPDATE orders SET status = null;

To avoid being diverted to an off-topic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.


The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like


UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);

might take 1 minute. Across 35 million rows, doing the above and breaking the job into 35 chunks of 1 million rows each would take only 35 minutes, saving me 4 hours and 25 minutes.


I could break it down even further with a script (using pseudocode here):


for (i = 0 to 3500) {
  db_operation("UPDATE orders SET status = null" +
               " WHERE order_id >= " + (i * 10000) +
               " AND order_id < " + ((i + 1) * 10000));
}

This operation might complete in only a few minutes, rather than 35.

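A sketch of how this chunked loop could be driven entirely from the psql console with \gexec, which runs each generated statement as its own autocommitted command (assuming the same 10,000-row chunks as above):

SELECT format('UPDATE orders SET status = null WHERE order_id >= %s AND order_id < %s',
              i * 10000, (i + 1) * 10000)
FROM   generate_series(0, 3500) AS i \gexec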

So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?


8 solutions

#1


Column / Row

... I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update.


Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.


Index

To avoid being diverted to an off-topic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.


When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.


Performance

For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:


UPDATE orders SET status = null;

I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:


ALTER TABLE orders DROP column status
                 , ADD  column status text;

Per documentation:

When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change...


And:

The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated. (These statements do not apply when dropping the system oid column; that is done with an immediate rewrite.)


Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. Requires an exclusive lock on the table which may be a problem for heavy concurrent load. Since it only takes a few milliseconds, you should still be fine.


If you have a column default you want to keep, add it back in a separate command. Doing it in the same command would apply it to all rows immediately, voiding the effect. You could then update the existing columns in batches. Follow the documentation link and read the Notes in the manual.

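A minimal sketch of that sequence, assuming a hypothetical default value of 'new':

ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;  -- existing rows now read NULL
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';       -- applies to future rows only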

General solution

dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.


This allows running a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.


If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.

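For instance, a sketch of SAVEPOINT guarding one slice inside a single transaction:

BEGIN;
UPDATE orders SET status = null WHERE order_id < 1000000;
SAVEPOINT slice_1;
UPDATE orders SET status = null WHERE order_id >= 1000000 AND order_id < 2000000;
-- on error: ROLLBACK TO SAVEPOINT slice_1;  -- keeps the first slice's work
COMMIT;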

Disclaimer

First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.


If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.


Step-by-step instructions

The additional module dblink needs to be installed first:

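CREATE EXTENSION IF NOT EXISTS dblink;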

Setting up the connection with dblink very much depends on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer covers more on how to connect with dblink.


Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.

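A sketch of that setup; the server name, connection options, and credentials are placeholders to adapt:

CREATE SERVER myserver FOREIGN DATA WRAPPER dblink_fdw
    OPTIONS (host 'localhost', dbname 'mydb');
CREATE USER MAPPING FOR current_user SERVER myserver
    OPTIONS (user 'myuser', password 'mypassword');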

CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;   -- size of step
   _cur  int;   -- current ID (starting with minimum)
   _max  int;   -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                        -- 100 slices (steps) hard coded
   _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                        -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
       $$UPDATE public.orders
         SET    status = 'foo'
         WHERE  order_id >= $$ || _cur || $$
         AND    order_id <  $$ || _cur + _step || $$
         AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

      _cur := _cur + _step;

      EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$  LANGUAGE plpgsql;

Call:

SELECT f_update_in_steps();

You can parameterize any part according to your needs: the table name, column name, value, ... Just be sure to sanitize identifiers to avoid SQL injection.

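For example, a sketch with _tbl, _col, and _val as hypothetical parameters: identifiers go through format()'s %I, values through %L:

-- inside the plpgsql function body
PERFORM dblink_exec(format(
   'UPDATE %I SET %I = %L WHERE order_id >= %s AND order_id < %s',
   _tbl, _col, _val, _cur, _cur + _step));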

About avoiding empty UPDATEs: that is what the predicate status IS DISTINCT FROM 'foo' in the function above is for; rows that already hold the target value are skipped, so no new row version is written for them.


#2


You should delegate this column to another table like this:


create table order_status (
  order_id int not null references orders(order_id) primary key,
  status int not null
);

Then your operation of setting status=NULL will be instant:


truncate order_status;
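
Queries that still need a status per order can join it back in; orders without a matching row simply read as NULL:

SELECT o.*, s.status
FROM   orders o
LEFT   JOIN order_status s USING (order_id);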

#3


Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.


So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).

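One way to check for blockers while the statement runs (a sketch; the wait_event columns exist since PostgreSQL 9.6):

SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM   pg_stat_activity
WHERE  datname = current_database();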

#4


I would use CTAS (CREATE TABLE AS SELECT):


begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
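
Note that indexes, constraints, defaults, and privileges on orders do not carry over to the new table and must be recreated, e.g. (assuming a primary key on order_id):

ALTER TABLE orders ADD PRIMARY KEY (order_id);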

#5


First of all - are you sure that you need to update all rows?


Perhaps some of the rows already have status NULL?


If so, then:


UPDATE orders SET status = null WHERE status is not null;

As for partitioning the change: that's not possible in pure SQL, since all the updates would run in a single transaction.


One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.


Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually. Writing a full script is too much; you can usually do it in a simple one-liner:


perl -e '
    for (my $i = 0; $i <= 35000000; $i += 1000) {
        printf "UPDATE orders SET status = null WHERE status is not null
                and order_id between %u and %u;\n",
        $i, $i+999
    }
'

I wrapped the lines here for readability; normally it's a single line. The output of the above command can be fed to psql directly:


perl -e '...' | psql -U ... -d ...

Or write it to a file first and then feed that file to psql (in case you need it later on):


perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql

#6


I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.


A simple WHERE status IS NOT NULL might speed things up quite a bit (provided you have an index on status). Not knowing the actual use case, I'm assuming that if this is run frequently, a great part of the 35 million rows might already have a null status.


However, you can loop within a PL/pgSQL function via the LOOP statement. I'll just cook up a small example:


CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
    i INTEGER := 0;
BEGIN
    FOR i IN 0..(count/1000 + 1) LOOP
        UPDATE orders SET status = null WHERE order_id >= i*1000 AND order_id < (i+1)*1000;  -- >= and < so boundary ids are not skipped
        RAISE NOTICE 'Count: % and i: %', count,i;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;

It can then be run by doing something akin to:


SELECT nullstatus(35000000);

You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.

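One common shortcut is to read the planner's estimate instead of an exact count (a sketch):

SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'orders';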

Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.


#7


Are you sure this is because of locking? I don't think so; there are many other possible reasons. To find out, you can always try to do just the locking. Try this:

BEGIN;
SELECT now();
SELECT * FROM orders FOR UPDATE;
SELECT now();
ROLLBACK;


To understand what's really happening, you should first run an EXPLAIN (EXPLAIN UPDATE orders SET status ...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.

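One safe way to time a single slice (a sketch; note that EXPLAIN ANALYZE really executes the statement):

BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = null WHERE order_id < 10000;
ROLLBACK;  -- EXPLAIN ANALYZE executed the UPDATE, so undo it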

Also, tail the PostgreSQL log to see whether any performance-related problems occur.


#8


Some options that haven't been mentioned:


Use the new-table trick. Probably what you'd have to do in your case is write some triggers to handle it, so that changes to the original table also get propagated to your table copy, something like that... (Percona's tooling is an example of something that does it the trigger way). Another option might be the "create a new column, then replace the old one with it" trick, to avoid locks (unclear whether it helps with speed).


Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single multi-statement query, like:

update X set Y = NULL where ID >= 0     and ID < 10000;
update X set Y = NULL where ID >= 10000 and ID < 20000;
...

Then it might not do as much locking, and it would still be all SQL, though you do have extra logic up front to do it :(

