I have a huge collection of documents in my DB and I'm wondering how can I run through all the documents and update them, each document with a different value.
我的数据库中有大量的文档,我想知道如何运行所有的文档并更新它们,每个文档都有不同的值。
5 个解决方案
#1
82
The answer depends on the driver you're using. All MongoDB drivers I know have cursor.forEach()
implemented one way or another.
答案取决于你使用的驱动程序。我认识的所有MongoDB驱动都有cursor.forEach()实现了一种或另一种方式。
Here are some examples:
下面是一些例子:
node-mongodb-native
collection.find(query).forEach(function(doc) {
// handle
}, function(err) {
// done or error
});
mongojs
db.collection.find(query).forEach(function(err, doc) {
// handle
});
monk
collection.find(query, { stream: true })
.each(function(doc){
// handle doc
})
.error(function(err){
// handle error
})
.success(function(){
// final callback
});
mongoose
collection.find(query).stream()
.on('data', function(doc){
// handle doc
})
.on('error', function(err){
// handle error
})
.on('end', function(){
// final callback
});
Updating documents inside of .forEach
callback
The only problem with updating documents inside of .forEach
callback is that you have no idea when all documents are updated.
在. foreach回调中更新文档的唯一问题是,您不知道何时更新所有文档。
To solve this problem you should use some asynchronous control flow solution. Here are some options:
要解决这个问题,您应该使用一些异步控制流解决方案。这里有一些选项:
Here is an example of using async
, using its queue
feature:
这里有一个使用异步的例子,使用它的队列特性:
var q = async.queue(function (doc, callback) {
// code for your update
collection.update({
_id: doc._id
}, {
$set: {hi: 'there'}
}, {
w: 1
}, callback);
}, Infinity);
var cursor = collection.find(query);
cursor.each(function(err, doc) {
if (err) throw err;
if (doc) q.push(doc); // dispatching doc to async.queue
});
q.drain = function() {
if (cursor.isClosed()) {
console.log('all items have been processed');
db.close();
}
}
#2
4
Leonid's answer is great, but I want to reinforce the importance of using async/promises and to give a different solution with a promises example.
Leonid的回答很好,但是我想强调使用异步/承诺的重要性,并给出一个不同的解决方案和一个承诺示例。
The simplest solution to this problem is to loop forEach document and call an update. Usually, you don't need close the db connection after each request, but if you do need to close the connection, be careful. You must just close it if you are sure that all updates have finished executing.
这个问题最简单的解决方案是循环每个文档并调用更新。通常,您不需要在每个请求之后关闭db连接,但是如果需要关闭连接,请小心。如果您确信所有更新都已完成,您必须关闭它。
A common mistake here is to call db.close()
after all updates are dispatched without knowing if they have completed. If you do that, you'll get errors.
这里的一个常见错误是调用db.close(),在不知道它们是否已经完成的情况下发送所有更新。如果你那样做,你会得到错误。
Wrong implementation:
collection.find(query).each(function(err, doc) {
if (err) throw err;
if (doc) {
collection.update(query, update, function(err, updated) {
// handle
});
}
else {
db.close(); // if there is any pending update, it will throw an error there
}
});
However, as db.close()
is also an async operation (its signature have a callback option) you may be lucky and this code can finish without errors. It may work only when you need to update just a few docs in a small collection (so, don't try).
但是,由于db.close()也是一个异步操作(它的签名有一个回调选项),您可能很幸运,并且该代码可以在没有错误的情况下完成。只有当您只需要在一个小集合中更新几个文档时(所以,不要尝试),它才能工作。
Correct solution:
As a solution with async was already proposed by Leonid, below follows a solution using Q promises.
Leonid已经提出了异步的解决方案,下面是使用Q承诺的解决方案。
var Q = require('q');
var client = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';
client.connect(url, function(err, db) {
if (err) throw err;
var promises = [];
var query = {}; // select all docs
var collection = db.collection('demo');
var cursor = collection.find(query);
// read all docs
cursor.each(function(err, doc) {
if (err) throw err;
if (doc) {
// create a promise to update the doc
var query = doc;
var update = { $set: {hi: 'there'} };
var promise =
Q.npost(collection, 'update', [query, update])
.then(function(updated){
console.log('Updated: ' + updated);
});
promises.push(promise);
} else {
// close the connection after executing all promises
Q.all(promises)
.then(function() {
if (cursor.isClosed()) {
console.log('all items have been processed');
db.close();
}
})
.fail(console.error);
}
});
});
#3
4
And here is an example of using a Mongoose cursor async with promises:
这里有一个使用Mongoose游标异步的例子:
new Promise(function (resolve, reject) {
collection.find(query).cursor()
.on('data', function(doc) {
// ...
})
.on('error', reject)
.on('end', resolve);
})
.then(function () {
// ...
});
Reference:
参考:
- Mongoose cursors
- 猫鼬游标
- Streams and promises
- 流和承诺
#4
4
The node-mongodb-native
now supports a endCallback
parameter to cursor.forEach
as for one to handle the event AFTER the whole iteration, refer to the official document for details http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach.
node-mongodb-native现在支持指针的endCallback参数。对于每个在整个迭代之后处理事件的对象,请参阅官方文档以获得详细的http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach。
Also note that .each is deprecated in the nodejs native driver now.
还请注意,现在nodejs本地驱动程序中都不赞成使用.each。
#5
3
var MongoClient = require('mongodb').MongoClient,
assert = require('assert');
MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
assert.equal(err, null);
console.log("Successfully connected to MongoDB.");
var query = {
"category_code": "biotech"
};
db.collection('companies').find(query).toArray(function(err, docs) {
assert.equal(err, null);
assert.notEqual(docs.length, 0);
docs.forEach(function(doc) {
console.log(doc.name + " is a " + doc.category_code + " company.");
});
db.close();
});
});
Notice that the call .toArray
is making the application to fetch the entire dataset.
注意,调用. toarray使应用程序获取整个数据集。
var MongoClient = require('mongodb').MongoClient,
assert = require('assert');
MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
assert.equal(err, null);
console.log("Successfully connected to MongoDB.");
var query = {"category_code": "biotech"};
var cursor = db.collection('companies').find(query);
function(doc) {
cursor.forEach(
console.log( doc.name + " is a " + doc.category_code + " company." );
},
function(err) {
assert.equal(err, null);
return db.close();
}
);
});
Notice that the cursor returned by the find()
is assigned to var cursor
. With this approach, instead of fetching all data in memort and consuming data at once, we're streaming the data to our application. find()
can create a cursor immediately because it doesn't actually make a request to the database until we try to use some of the documents it will provide. The point of cursor
is to describe our query. The 2nd parameter to cursor.forEach
shows what to do when the driver gets exhausted or an error occurs.
注意,find()返回的游标被分配给var游标。使用这种方法,我们将数据流到应用程序中,而不是一次获取内存中的所有数据并消费数据。find()可以立即创建游标,因为在我们尝试使用它将提供的一些文档之前,它实际上并不向数据库发出请求。光标的作用是描述我们的查询。光标的第二个参数。forEach显示了当驱动程序耗尽或发生错误时该做什么。
In the initial version of the above code, it was toArray()
which forced the database call. It meant we needed ALL the documents and wanted them to be in an array
.
在上述代码的初始版本中,强制执行数据库调用的是toArray()。这意味着我们需要所有的文档,并希望它们在数组中。
Also, MongoDB
returns data in batch format. The image below shows, requests from cursors (from application) to MongoDB
另外,MongoDB以批处理格式返回数据。下图显示了游标(来自应用程序)对MongoDB的请求
forEach
is better than toArray
because we can process documents as they come in until we reach the end. Contrast it with toArray
- where we wait for ALL the documents to be retrieved and the entire array is built. This means we're not getting any advantage from the fact that the driver and the database system are working together to batch results to your application. Batching is meant to provide efficiency in terms of memory overhead and the execution time. Take advantage of it, if you can in your application.
forEach比toArray要好,因为我们可以在文档输入时进行处理,直到最后。将它与toArray进行对比——在toArray中,我们等待所有的文档被检索并构建整个数组。这意味着我们不能从驱动程序和数据库系统一起为应用程序批处理结果这一事实中得到任何好处。批处理的目的是在内存开销和执行时间方面提供效率。如果可以的话,在你的应用中好好利用它。
#1
82
The answer depends on the driver you're using. All MongoDB drivers I know have cursor.forEach()
implemented one way or another.
答案取决于你使用的驱动程序。我认识的所有MongoDB驱动都有cursor.forEach()实现了一种或另一种方式。
Here are some examples:
下面是一些例子:
node-mongodb-native
collection.find(query).forEach(function(doc) {
// handle
}, function(err) {
// done or error
});
mongojs
db.collection.find(query).forEach(function(err, doc) {
// handle
});
monk
collection.find(query, { stream: true })
.each(function(doc){
// handle doc
})
.error(function(err){
// handle error
})
.success(function(){
// final callback
});
mongoose
collection.find(query).stream()
.on('data', function(doc){
// handle doc
})
.on('error', function(err){
// handle error
})
.on('end', function(){
// final callback
});
Updating documents inside of .forEach
callback
The only problem with updating documents inside of .forEach
callback is that you have no idea when all documents are updated.
在. foreach回调中更新文档的唯一问题是,您不知道何时更新所有文档。
To solve this problem you should use some asynchronous control flow solution. Here are some options:
要解决这个问题,您应该使用一些异步控制流解决方案。这里有一些选项:
Here is an example of using async
, using its queue
feature:
这里有一个使用异步的例子,使用它的队列特性:
var q = async.queue(function (doc, callback) {
// code for your update
collection.update({
_id: doc._id
}, {
$set: {hi: 'there'}
}, {
w: 1
}, callback);
}, Infinity);
var cursor = collection.find(query);
cursor.each(function(err, doc) {
if (err) throw err;
if (doc) q.push(doc); // dispatching doc to async.queue
});
q.drain = function() {
if (cursor.isClosed()) {
console.log('all items have been processed');
db.close();
}
}
#2
4
Leonid's answer is great, but I want to reinforce the importance of using async/promises and to give a different solution with a promises example.
Leonid的回答很好,但是我想强调使用异步/承诺的重要性,并给出一个不同的解决方案和一个承诺示例。
The simplest solution to this problem is to loop forEach document and call an update. Usually, you don't need close the db connection after each request, but if you do need to close the connection, be careful. You must just close it if you are sure that all updates have finished executing.
这个问题最简单的解决方案是循环每个文档并调用更新。通常,您不需要在每个请求之后关闭db连接,但是如果需要关闭连接,请小心。如果您确信所有更新都已完成,您必须关闭它。
A common mistake here is to call db.close()
after all updates are dispatched without knowing if they have completed. If you do that, you'll get errors.
这里的一个常见错误是调用db.close(),在不知道它们是否已经完成的情况下发送所有更新。如果你那样做,你会得到错误。
Wrong implementation:
collection.find(query).each(function(err, doc) {
if (err) throw err;
if (doc) {
collection.update(query, update, function(err, updated) {
// handle
});
}
else {
db.close(); // if there is any pending update, it will throw an error there
}
});
However, as db.close()
is also an async operation (its signature have a callback option) you may be lucky and this code can finish without errors. It may work only when you need to update just a few docs in a small collection (so, don't try).
但是,由于db.close()也是一个异步操作(它的签名有一个回调选项),您可能很幸运,并且该代码可以在没有错误的情况下完成。只有当您只需要在一个小集合中更新几个文档时(所以,不要尝试),它才能工作。
Correct solution:
As a solution with async was already proposed by Leonid, below follows a solution using Q promises.
Leonid已经提出了异步的解决方案,下面是使用Q承诺的解决方案。
var Q = require('q');
var client = require('mongodb').MongoClient;
var url = 'mongodb://localhost:27017/test';
client.connect(url, function(err, db) {
if (err) throw err;
var promises = [];
var query = {}; // select all docs
var collection = db.collection('demo');
var cursor = collection.find(query);
// read all docs
cursor.each(function(err, doc) {
if (err) throw err;
if (doc) {
// create a promise to update the doc
var query = doc;
var update = { $set: {hi: 'there'} };
var promise =
Q.npost(collection, 'update', [query, update])
.then(function(updated){
console.log('Updated: ' + updated);
});
promises.push(promise);
} else {
// close the connection after executing all promises
Q.all(promises)
.then(function() {
if (cursor.isClosed()) {
console.log('all items have been processed');
db.close();
}
})
.fail(console.error);
}
});
});
#3
4
And here is an example of using a Mongoose cursor async with promises:
这里有一个使用Mongoose游标异步的例子:
new Promise(function (resolve, reject) {
collection.find(query).cursor()
.on('data', function(doc) {
// ...
})
.on('error', reject)
.on('end', resolve);
})
.then(function () {
// ...
});
Reference:
参考:
- Mongoose cursors
- 猫鼬游标
- Streams and promises
- 流和承诺
#4
4
The node-mongodb-native
now supports a endCallback
parameter to cursor.forEach
as for one to handle the event AFTER the whole iteration, refer to the official document for details http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach.
node-mongodb-native现在支持指针的endCallback参数。对于每个在整个迭代之后处理事件的对象,请参阅官方文档以获得详细的http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html#forEach。
Also note that .each is deprecated in the nodejs native driver now.
还请注意,现在nodejs本地驱动程序中都不赞成使用.each。
#5
3
var MongoClient = require('mongodb').MongoClient,
assert = require('assert');
MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
assert.equal(err, null);
console.log("Successfully connected to MongoDB.");
var query = {
"category_code": "biotech"
};
db.collection('companies').find(query).toArray(function(err, docs) {
assert.equal(err, null);
assert.notEqual(docs.length, 0);
docs.forEach(function(doc) {
console.log(doc.name + " is a " + doc.category_code + " company.");
});
db.close();
});
});
Notice that the call .toArray
is making the application to fetch the entire dataset.
注意,调用. toarray使应用程序获取整个数据集。
var MongoClient = require('mongodb').MongoClient,
assert = require('assert');
MongoClient.connect('mongodb://localhost:27017/crunchbase', function(err, db) {
assert.equal(err, null);
console.log("Successfully connected to MongoDB.");
var query = {"category_code": "biotech"};
var cursor = db.collection('companies').find(query);
function(doc) {
cursor.forEach(
console.log( doc.name + " is a " + doc.category_code + " company." );
},
function(err) {
assert.equal(err, null);
return db.close();
}
);
});
Notice that the cursor returned by the find()
is assigned to var cursor
. With this approach, instead of fetching all data in memort and consuming data at once, we're streaming the data to our application. find()
can create a cursor immediately because it doesn't actually make a request to the database until we try to use some of the documents it will provide. The point of cursor
is to describe our query. The 2nd parameter to cursor.forEach
shows what to do when the driver gets exhausted or an error occurs.
注意,find()返回的游标被分配给var游标。使用这种方法,我们将数据流到应用程序中,而不是一次获取内存中的所有数据并消费数据。find()可以立即创建游标,因为在我们尝试使用它将提供的一些文档之前,它实际上并不向数据库发出请求。光标的作用是描述我们的查询。光标的第二个参数。forEach显示了当驱动程序耗尽或发生错误时该做什么。
In the initial version of the above code, it was toArray()
which forced the database call. It meant we needed ALL the documents and wanted them to be in an array
.
在上述代码的初始版本中,强制执行数据库调用的是toArray()。这意味着我们需要所有的文档,并希望它们在数组中。
Also, MongoDB
returns data in batch format. The image below shows, requests from cursors (from application) to MongoDB
另外,MongoDB以批处理格式返回数据。下图显示了游标(来自应用程序)对MongoDB的请求
forEach
is better than toArray
because we can process documents as they come in until we reach the end. Contrast it with toArray
- where we wait for ALL the documents to be retrieved and the entire array is built. This means we're not getting any advantage from the fact that the driver and the database system are working together to batch results to your application. Batching is meant to provide efficiency in terms of memory overhead and the execution time. Take advantage of it, if you can in your application.
forEach比toArray要好,因为我们可以在文档输入时进行处理,直到最后。将它与toArray进行对比——在toArray中,我们等待所有的文档被检索并构建整个数组。这意味着我们不能从驱动程序和数据库系统一起为应用程序批处理结果这一事实中得到任何好处。批处理的目的是在内存开销和执行时间方面提供效率。如果可以的话,在你的应用中好好利用它。