node-postgres with a massive amount of queries

Time: 2022-06-06 01:03:14

I just started playing around with node.js with postgres, using node-postgres. One of the things I tried to do is to write a short JS script to populate my database, using a file with about 200,000 entries.

I noticed that after some time (less than 10 seconds), I start to get "Error: Connection terminated". I am not sure whether this is a problem with how I use node-postgres, or if it's because I was spamming postgres.

Anyway, here is a simple piece of code that shows this behaviour:

var pg = require('pg');
var connectionString = "postgres://xxxx:xxxx@localhost/xxxx";

pg.connect(connectionString, function(err,client,done){
  if(err) {
    return console.error('could not connect to postgres', err);
  }

  client.query("DROP TABLE IF EXISTS testDB");
  client.query("CREATE TABLE IF NOT EXISTS testDB (id int, first int, second int)");
  done();

  for (i = 0; i < 1000000; i++){
    client.query("INSERT INTO testDB VALUES (" + i.toString() + "," + (1000000-i).toString() + "," + (-i).toString() + ")",   function(err,result){
      if (err) {
         return console.error('Error inserting query', err);
      }
      done();
    });
  }
});

It fails after about 18,000-20,000 queries. Is this the wrong way to use client.query? I tried changing the default client number, but it didn't seem to help.


client.connect() doesn't seem to help either, but that was because I had too many clients, so I definitely think client pooling is the way to go.


Thanks for any help!


2 Answers

#1 (13 votes)

UPDATE


This answer has since been superseded by this article: Data Imports, which represents the most up-to-date approach.


In order to replicate your scenario I used the pg-promise library, and I can confirm that trying it head-on will never work: no matter which library you use, it is the approach that matters.

Below is a modified approach where we partition inserts into chunks and then execute each chunk within a transaction, which is load balancing (aka throttling):


function insertRecords(N) {
    return db.tx(function (ctx) {
        var queries = [];
        for (var i = 1; i <= N; i++) {
            queries.push(ctx.none('insert into test(name) values($1)', 'name-' + i));
        }
        return promise.all(queries);
    });
}
function insertAll(idx) {
    if (!idx) {
        idx = 0;
    }
    return insertRecords(100000)
        .then(function () {
            if (idx >= 9) {
                return promise.resolve('SUCCESS');
            } else {
                return insertAll(++idx);
            }
        }, function (reason) {
            return promise.reject(reason);
        });
}
insertAll()
    .then(function (data) {
        console.log(data);
    }, function (reason) {
        console.log(reason);
    })
    .done(function () {
        pgp.end();
    });
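
For context, the snippet above assumes that db, promise and pgp have already been initialized; the original answer does not show that setup. A minimal sketch of what it might look like, with a placeholder connection string, is:

// Setup assumed by the snippet above; not part of the original answer.
// The connection string is a placeholder.
var promise = require('bluebird'); // any Promises/A+ library can be used

var pgp = require('pg-promise')({
    promiseLib: promise // tell pg-promise to use the same promise library
});

var db = pgp('postgres://xxxx:xxxx@localhost/xxxx');

The promiseLib option simply makes pg-promise resolve its results with the same library that the code above uses for promise.all and promise.resolve.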

This produced 1,000,000 records in about 4 minutes, dramatically slowing after the first 3 transactions. I was using Node JS 0.10.38 (64-bit), which consumed about 340MB of memory. This way we inserted 100,000 records, 10 times in a row.


If we do the same, only this time insert 10,000 records within 100 transactions, the same 1,000,000 records are added in just 1m25s, no slowing down, with Node JS consuming around 100MB of memory, which tells us that partitioning data like this is a very good idea.


It doesn't matter which library you use, the approach should be the same:


  1. Partition/throttle your inserts into multiple transactions;
  2. Keep the list of inserts in a single transaction at around 10,000 records;
  3. Execute all your transactions in a synchronous chain.
  4. Release connection back to the pool after each transaction's COMMIT.

If you break any of those rules, you're guaranteed trouble. For example, if you break rule 3, your Node JS process is likely to run out of memory real quick and throw an error. Rule 4 in my example was provided by the library.


And if you follow this pattern, you don't need to trouble yourself with the connection pool settings.

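To make the "it doesn't matter which library you use" point concrete, here is a minimal sketch of the same four rules with plain node-postgres, assuming a recent version of pg that ships the promise-based Pool API. The table layout and chunk size come from the question and the numbers above; everything else is illustrative and not taken from either answer:

// A sketch of the chunked-transaction pattern with plain node-postgres.
// Assumes pg >= 7 (promise-based Pool API); the connection string is a placeholder.
const { Pool } = require('pg');

const pool = new Pool({ connectionString: 'postgres://xxxx:xxxx@localhost/xxxx' });

// Rules 1 and 2: insert one chunk (~10,000 rows) inside a single transaction.
async function insertChunk(start, size) {
    const client = await pool.connect();
    try {
        await client.query('BEGIN');
        for (let i = start; i < start + size; i++) {
            await client.query('INSERT INTO testDB VALUES ($1, $2, $3)',
                [i, 1000000 - i, -i]);
        }
        await client.query('COMMIT');
    } catch (err) {
        await client.query('ROLLBACK');
        throw err;
    } finally {
        client.release(); // Rule 4: return the connection to the pool after COMMIT/ROLLBACK
    }
}

// Rule 3: run the transactions one after another, never all at once.
async function insertAll(total, chunkSize) {
    for (let start = 0; start < total; start += chunkSize) {
        await insertChunk(start, chunkSize);
    }
}

insertAll(1000000, 10000)
    .then(function () { return pool.end(); })
    .catch(function (err) {
        console.error(err);
        return pool.end();
    });

The only knob worth tuning here is the chunk size; around 10,000 rows per transaction (rule 2) is the sweet spot reported above.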

UPDATE 1


Later versions of pg-promise support such scenarios perfectly, as shown below:


function factory(index) {
    if (index < 1000000) {
        return this.query('insert into test(name) values($1)', 'name-' + index);
    }
}

db.tx(function () {
    return this.batch([
        this.none('drop table if exists test'),
        this.none('create table test(id serial, name text)'),
        this.sequence(factory), // key method
        this.one('select count(*) from test')
    ]);
})
    .then(function (data) {
        console.log("COUNT:", data[3].count);
    })
    .catch(function (error) {
        console.log("ERROR:", error);
    });

and if you do not want to include anything extra, like table creation, then it looks even simpler:


function factory(index) {
    if (index < 1000000) {
        return this.query('insert into test(name) values($1)', 'name-' + index);
    }
}

db.tx(function () {
    return this.sequence(factory);
})
    .then(function (data) {
        // success;
    })
    .catch(function (error) {
        // error;
    });

See Synchronous Transactions for details.


Using Bluebird as the promise library, for example, it takes 1m43s on my production machine to insert 1,000,000 records (without long stack traces enabled).


You would just have your factory method return requests according to the index, till you have none left, simple as that.


And the best part: this isn't just fast, it also creates little load on your NodeJS process. The test process stays under 60MB of memory during the entire test, consuming only 7-8% of the CPU time.

UPDATE 2


Starting with version 1.7.2, pg-promise supports super-massive transactions with ease. See chapter Synchronous Transactions.


For example, I could insert 10,000,000 records in a single transaction in just 15 minutes on my home PC, with Windows 8.1 64-bit.


For the test I set my PC to production mode, and used Bluebird as the promise library. During the test, memory consumption didn't go over 75MB for the entire NodeJS 0.12.5 process (64-bit), while my i7-4770 CPU showed consistent 15% load.


Inserting 100m records the same way would require just more patience, but not more computer resources.


In the meantime, the previous test for 1m inserts dropped from 1m43s to 1m31s.


UPDATE 3


The following considerations can make a huge difference: Performance Boost.


UPDATE 4


Related question, with a better implementation example: Massive inserts with pg-promise.


UPDATE 5


A better and newer example can be found here: nodeJS inserting Data into PostgreSQL error


#2 (2 votes)

I'm guessing that you are reaching the max pool size. Since client.query is asynchronous, probably all the available connections are used before they are returned.

Default Pool size is 10. Check here: https://github.com/brianc/node-postgres/blob/master/lib/defaults.js#L27


You can increase default pool size by setting pg.defaults.poolSize:


pg.defaults.poolSize = 20;

Update: Execute another query after freeing a connection.


var pg = require('pg');
var connectionString = "postgres://xxxx:xxxx@localhost/xxxx";
var MAX_POOL_SIZE = 25;

pg.defaults.poolSize = MAX_POOL_SIZE;
pg.connect(connectionString, function(err,client,done){
  if(err) {
    return console.error('could not connect to postgres', err);
  }

  var release = function() {
    done();
    i++;
    if(i < 1000000)
      insertQ();
  };

  var insertQ = function() {
    client.query("INSERT INTO testDB VALUES (" + i.toString() + "," + (1000000-i).toString() + "," + (-i).toString() + ")",        function(err,result){
      if (err) {
         return console.error('Error inserting query', err);
      }
      release();
    });
  };

  client.query("DROP TABLE IF EXISTS testDB");
  client.query("CREATE TABLE IF NOT EXISTS testDB (id int, first int,    second int)");
  done();

  for (i = 0; i < MAX_POOL_SIZE; i++){
    insertQ();
  }
});

The basic idea is that since you are enqueuing a large number of queries with a relatively small connection pool size, you hit the maximum pool size. Here we issue a new query only after an existing connection is freed.
