Column copy and update vs. column create and insert

I have a table with 32 million rows and 31 columns in PostgreSQL 9.2.10. I am altering the table by adding columns with updated values.

For example, if the initial table is:

id     initial_color
--     -------------
1      blue
2      red
3      yellow

I am modifying the table so that the result is:

id     initial_color     modified_color
--     -------------     --------------
1      blue              blue_green
2      red               red_orange
3      yellow            yellow_brown

I have code that will read the initial_color column and update the value.

Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this? My present choices are:

1. Copy the column and update the rows in the new column
2. Create an empty column and insert new values

I could do either option with one column at a time or with all five at once. The column types are either character varying or character.

3 Answers

#1

The column types are either character varying or character.

Don't use character, that's a misunderstanding. varchar is ok, but I would suggest just text for arbitrary character data.

Any downsides of using data type "text" for storing strings?

Given that my table has 32 million rows and that I have to apply this procedure on five of the 31 columns, what is the most efficient way to do this?

If you don't have objects (views, foreign keys, functions) depending on the existing table, the most efficient way is to create a new table. Something like this (details depend on your installation):

BEGIN;
LOCK TABLE tbl_org IN SHARE MODE;  -- to prevent concurrent writes

CREATE TABLE tbl_new (LIKE tbl_org INCLUDING STORAGE INCLUDING COMMENTS);

ALTER TABLE tbl_new ADD COLUMN modified_color text
            , ADD COLUMN modified_something text;
            -- , etc
INSERT INTO tbl_new (<all columns in order here>)
SELECT <all columns in order here>
    ,  myfunction(initial_color) AS modified_color  -- etc
FROM   tbl_org;
-- ORDER  BY tbl_id;  -- optionally order rows while being at it.

-- Add constraints and indexes like in the original table here

DROP TABLE tbl_org;
ALTER TABLE tbl_new RENAME TO tbl_org;
COMMIT;

If you have depending objects, you need to do more.

Either way, be sure to add all five at once. If you update each column in a separate query, you write another row version each time due to the MVCC model of Postgres.

Related cases with more details, links and explanation:

Updating database rows without locking the table in PostgreSQL 9.2

Best way to populate a new column in a large table?

Optimizing bulk update performance in PostgreSQL

While creating a new table you might also order columns in an optimized fashion:

Calculating and saving space in PostgreSQL

#2

Maybe I'm misreading the question, but as far as I know, you have 2 possibilities for creating a table with the extra columns:

1. CREATE TABLE
   This creates a new table, which can be filled either
   - with CREATE TABLE ... AS SELECT ..., filling it at creation, or
   - with a separate INSERT ... SELECT ... later on.
   Neither variant seems to be what you want, as you described a solution without listing all the fields. Both would also require all the data (plus the new fields) to be copied.
2. ALTER TABLE ... ADD ...
   This creates the new columns. As I'm not aware of any way to reference existing column values in that statement, you will need an additional UPDATE ... SET ... to fill in the values.

So I don't see any way to implement a procedure that follows your choice 1.

Nevertheless, copying the (column) data only to overwrite it in a second step would be suboptimal in any case. Adding new columns with ALTER TABLE does minimal I/O. So even if there were a way to execute your choice 1, choice 2 promises performance that is better by some factor.

Thus, two statements will achieve what you want: one ALTER TABLE adding all your new columns in one go, and then one UPDATE providing the new values for these columns.
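
As a rough sketch of that two-statement approach: the table name tbl_org, the second column pair initial_shape / modified_shape, and the function myfunction() are placeholders assumed for illustration, not objects from the question.

```sql
-- Placeholder names (assumptions): tbl_org, initial_shape, modified_shape, myfunction().
BEGIN;

-- Add every new column in a single statement.
ALTER TABLE tbl_org
    ADD COLUMN modified_color text,
    ADD COLUMN modified_shape text;  -- list the remaining new columns here

-- Fill all new columns in one UPDATE, so each row is rewritten once
-- rather than once per column.
UPDATE tbl_org
SET    modified_color = myfunction(initial_color),
       modified_shape = myfunction(initial_shape);

COMMIT;
```

Setting all the columns in one UPDATE matters on a large table, because every separate UPDATE of a row writes a new row version.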

#3

Create a new column (modified_color). It will be NULL on all records.

Then run an update statement, assuming your table name is 'Table':

update "Table"  -- quoted, because TABLE is a reserved word
set modified_color = 'blue_green'
where initial_color = 'blue';

If I am correct, this can also be written as one statement per value:

update "Table" set modified_color = 'blue_green' where initial_color = 'blue';
update "Table" set modified_color = 'red_orange' where initial_color = 'red';
update "Table" set modified_color = 'yellow_brown' where initial_color = 'yellow';
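
One possible variation, not part of the original answer: the per-value statements could be folded into a single UPDATE with a CASE expression, so the table is scanned once instead of once per value. The table name "Table" follows the assumption above and is quoted because TABLE is a reserved word.

```sql
-- Sketch only: one pass over the table instead of three.
UPDATE "Table"
SET    modified_color = CASE initial_color
                            WHEN 'blue'   THEN 'blue_green'
                            WHEN 'red'    THEN 'red_orange'
                            WHEN 'yellow' THEN 'yellow_brown'
                        END
WHERE  initial_color IN ('blue', 'red', 'yellow');  -- leave other rows untouched
```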

Once you have done this, you can do another update (assuming you have another column that I will call modified_color1):

update "Table" set modified_color1 = modified_color;
