How to improve insert performance in PostgreSQL

Date: 2021-05-01 14:24:17

I am testing Postgres insertion performance. I have a table with one column with number as its data type. There is an index on it as well. I filled the database up using this query:

insert into aNumber (id) values (564),(43536),(34560) ...

Using the query above I inserted 4 million rows very quickly, 10,000 at a time. After the database reached 6 million rows, performance drastically declined to 1 million rows every 15 minutes. Is there any trick to increase insertion performance? I need optimal insertion performance on this project.

Using Windows 7 Pro on a machine with 5 GB RAM.

5 Answers

#1 (score: 373)

See populate a database in the PostgreSQL manual, depesz's excellent-as-usual article on the topic, and this SO question.

(Note that this answer is about bulk-loading data into an existing DB or creating a new one. If you're interested in DB restore performance with pg_restore or psql execution of pg_dump output, much of this doesn't apply since pg_dump and pg_restore already do things like creating triggers and indexes after they finish a schema+data restore.)

There's lots to be done. The ideal solution would be to import into an UNLOGGED table without indexes, then change it to logged and add the indexes. Unfortunately in PostgreSQL 9.4 there's no support for changing tables from UNLOGGED to logged. 9.5 adds ALTER TABLE ... SET LOGGED to permit you to do this.

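On 9.5 or later, a minimal sketch of that workflow might look like this (the bigint column type and the unnamed index are assumptions based on the question's aNumber table):

CREATE UNLOGGED TABLE aNumber (id bigint);
-- ... bulk load here, e.g. with COPY or multi-valued INSERTs ...
ALTER TABLE aNumber SET LOGGED;   -- requires PostgreSQL 9.5+
CREATE INDEX ON aNumber (id);     -- build the index once, after the data is in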

If you can take your database offline for the bulk import, use pg_bulkload.

Otherwise:

  • Disable any triggers on the table

  • Drop indexes before starting the import, re-create them afterwards. (It takes much less time to build an index in one pass than it does to add the same data to it progressively, and the resulting index is much more compact).

  • If doing the import within a single transaction, it's safe to drop foreign key constraints, do the import, and re-create the constraints before committing. Do not do this if the import is split across multiple transactions as you might introduce invalid data.

  • If possible, use COPY instead of INSERTs (a sketch combining several of these steps appears after this list).

  • If you can't use COPY, consider using multi-valued INSERTs if practical. You seem to be doing this already. Don't try to list too many values in a single VALUES list though; those values have to fit in memory a couple of times over, so keep it to a few hundred per statement.

  • Batch your inserts into explicit transactions, doing hundreds of thousands or millions of inserts per transaction. There's no practical limit AFAIK, but batching will let you recover from an error by marking the start of each batch in your input data. Again, you seem to be doing this already.

  • Use synchronous_commit=off and a huge commit_delay to reduce fsync() costs. This won't help much if you've batched your work into big transactions, though.

  • INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage.

  • Set a high checkpoint_segments value and enable log_checkpoints. Look at the PostgreSQL logs and make sure it's not complaining about checkpoints occurring too frequently.

  • If and only if you don't mind losing your entire PostgreSQL cluster (your database and any others on the same cluster) to catastrophic corruption if the system crashes during the import, you can stop Pg, set fsync=off, start Pg, do your import, then (vitally) stop Pg and set fsync=on again. See WAL configuration. Do not do this if there is already any data you care about in any database on your PostgreSQL install. If you set fsync=off you can also set full_page_writes=off; again, just remember to turn it back on after your import to prevent database corruption and data loss. See non-durable settings in the Pg manual.

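Putting several of the bullets above together, a rough sketch of one load batch could look like the following (the index name and file path are placeholders; use psql's \copy instead of server-side COPY if the file lives on the client; the synchronous_commit setting mainly helps when you are not already batching into one big transaction):

SET synchronous_commit = off;              -- session level; a crash may lose the last commits but won't corrupt data
BEGIN;
ALTER TABLE aNumber DISABLE TRIGGER USER;  -- skip user-defined triggers during the load
DROP INDEX IF EXISTS aNumber_id_idx;
COPY aNumber (id) FROM '/path/to/data.csv' WITH (FORMAT csv);
CREATE INDEX aNumber_id_idx ON aNumber (id);
ALTER TABLE aNumber ENABLE TRIGGER USER;
COMMIT;
-- checkpoint_segments (max_wal_size on 9.5+) and log_checkpoints are postgresql.conf settings, not per-session SQL.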

You should also look at tuning your system:

  • Use good quality SSDs for storage as much as possible. Good SSDs with reliable, power-protected write-back caches make commit rates incredibly faster. They're less beneficial when you follow the advice above - which reduces disk flushes / number of fsync()s - but can still be a big help. Do not use cheap SSDs without proper power-failure protection unless you don't care about keeping your data.

  • If you're using RAID 5 or RAID 6 for direct attached storage, stop now. Back your data up, restructure your RAID array to RAID 10, and try again. RAID 5/6 are hopeless for bulk write performance - though a good RAID controller with a big cache can help.

  • If you have the option of using a hardware RAID controller with a big battery-backed write-back cache this can really improve write performance for workloads with lots of commits. It doesn't help as much if you're using async commit with a commit_delay or if you're doing fewer big transactions during bulk loading.

  • If possible, store WAL (pg_xlog) on a separate disk / disk array. There's little point in using a separate filesystem on the same disk. People often choose to use a RAID1 pair for WAL. Again, this has more effect on systems with high commit rates, and it has little effect if you're using an unlogged table as the data load target.

You may also be interested in Optimise PostgreSQL for fast testing.

#2 (score: 10)

Use COPY table FROM ... WITH BINARY, which according to the documentation is "somewhat faster than the text and CSV formats." Only do this if you have millions of rows to insert, and if you are comfortable with binary data.

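For example, a server-side binary load might look like this (the path is a placeholder, and the file must already be in PostgreSQL's binary COPY format, e.g. produced earlier by COPY ... TO ... WITH (FORMAT binary)):

COPY aNumber (id) FROM '/path/to/data.pgbin' WITH (FORMAT binary);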

Here is an example recipe in Python, using psycopg2 with binary input.

#3 (score: 7)

In addition to Craig Ringer's excellent post and depesz's blog post, if you would like to speed up your inserts through the ODBC (psqlodbc) interface by using prepared-statement inserts inside a transaction, there are a few extra things you need to do to make it work fast:

  1. Set the level-of-rollback-on-errors to "Transaction" by specifying Protocol=-1 in the connection string. By default psqlodbc uses "Statement" level, which creates a SAVEPOINT for each statement rather than an entire transaction, making inserts slower.
  2. Use server-side prepared statements by specifying UseServerSidePrepare=1 in the connection string. Without this option the client sends the entire insert statement along with each row being inserted (an example connection string combining items 1 and 2 appears after this list).
  3. Disable auto-commit on each statement using SQLSetConnectAttr(conn, SQL_ATTR_AUTOCOMMIT, reinterpret_cast<SQLPOINTER>(SQL_AUTOCOMMIT_OFF), 0);
  4. Once all rows have been inserted, commit the transaction using SQLEndTran(SQL_HANDLE_DBC, conn, SQL_COMMIT);. There is no need to explicitly open a transaction.
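
For reference, a psqlodbc connection string combining items 1 and 2 might look roughly like this (the driver name, host, database, and credentials are placeholders for your own environment):

Driver={PostgreSQL Unicode};Server=localhost;Port=5432;Database=mydb;Uid=myuser;Pwd=mypassword;Protocol=-1;UseServerSidePrepare=1;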

Unfortunately, psqlodbc "implements" SQLBulkOperations by issuing a series of unprepared insert statements, so to achieve the fastest inserts you need to code up the above steps manually.

#4 (score: 0)

For optimal insertion performance, disable the index if that's an option for you. Other than that, better hardware (disk, memory) also helps.

#5 (score: 0)

I encountered this insertion performance problem as well. My solution was to spawn some goroutines to do the insertion work. In the meantime, SetMaxOpenConns should be set to an appropriate number, otherwise errors about too many open connections will occur.

// Assumes these imports: database/sql, fmt, log, sync, plus a Postgres driver
// registered via a blank import (e.g. _ "github.com/lib/pq").
db, err := sql.Open("postgres", dsn) // dsn and maxOpenConns come from your own configuration
if err != nil {
    log.Fatal(err)
}
db.SetMaxOpenConns(maxOpenConns) // keep this below the server's connection limit
var wg sync.WaitGroup
for _, query := range queries {
    wg.Add(1)
    go func(msg string) {
        defer wg.Done()
        // Each goroutine executes one (ideally multi-row) INSERT statement.
        if _, err := db.Exec(msg); err != nil {
            fmt.Println(err)
        }
    }(query)
}
wg.Wait()

The loading speed is much faster for my project. This code snippet just gives an idea of how it works; readers should be able to modify it easily.
