I have a function that looks like this:
def insert_multiple_cakes(cake_list)
  ensure_indexes
  insert_list = cake_list.map { |cake| mongofy_values(cake.to_hash) }
  inserted = db[CAKE_COLLECTION].insert(insert_list, w: 0)
  return inserted.length
end
The goal of the function is to insert all cakes from cake_list into the Mongo database. Any cake that already exists in the database should be ignored. The function should return the number of cakes inserted, so if cake_list contains 5 cakes and 2 of those cakes already exist in the database, the function should return 3.
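For context, ensure_indexes is assumed to create a unique index on the cakes' "name" field (the field the workaround further down queries), which is what makes a duplicate insert fail. Roughly:

db[CAKE_COLLECTION].ensure_index({"name" => 1}, unique: true)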
My problem is that after an hour of experimenting, I have concluded the following:
- If the write concern (the :w option) is 0, then the insert call silently ignores all duplicate inserts, and the return value contains all the input documents, even those that weren't inserted. It doesn't matter what I set :continue_on_error or :collect_on_error to; the return value always contains all the documents, and the list of collected errors is always empty.
- If the write concern is 1, then the insert call fails with a Mongo::OperationFailure if there are any duplicates among the input documents. It doesn't matter what I set :continue_on_error or :collect_on_error to; the insert always fails when there are duplicates. (Roughly what I tried is sketched below.)
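Both attempts, using the Mongo::Collection#insert options named above (insert_list built as in the function at the top; exact values illustrative):

collection = db[CAKE_COLLECTION]

# With w: 0 the return value lists every input document, duplicates included,
# and the collected-errors list stays empty:
collection.insert(insert_list, w: 0, continue_on_error: true, collect_on_error: true)

# With w: 1 the same call raises Mongo::OperationFailure on the first duplicate,
# regardless of continue_on_error:
collection.insert(insert_list, w: 1, continue_on_error: true)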
So it seems to me that the only way to achieve this is to iterate over the input list, perform a search for EVERY document and filter away those that already exist. My application is going to deal with (at least) thousands of inserts at a time, so I like this plan about as much as I'd like to jump off a bridge.
Have I misunderstood something, or is the Ruby client perhaps bugged?
To demonstrate, this function does exactly what I want and works:
def insert_multiple_cakes(cake_list)
  ensure_indexes
  collection = db[CAKE_COLLECTION]
  # Filter away any cakes that already exist in the database.
  filtered_list = cake_list.reject { |cake|
    collection.count(query: {"name" => cake.name}) == 1
  }
  insert_list = filtered_list.map { |cake| mongofy_values(cake.to_hash) }
  inserted = collection.insert(insert_list)
  return inserted.length
end
The problem is that it performs about a gazillion searches where it really should only have to do one insert.
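One way to get the lookups down to a single query instead of one per cake (a sketch, not a built-in dedup facility of the driver; it assumes "name" is the field the unique index covers, as in the workaround above, and it is still racy if another writer inserts between the find and the insert):

require 'set'

collection = db[CAKE_COLLECTION]

# One round trip: fetch the names that already exist among the candidates.
existing_names = collection.find(
  { "name" => { "$in" => cake_list.map(&:name) } },
  fields: { "name" => 1, "_id" => 0 }
).map { |doc| doc["name"] }.to_set

# Insert only the cakes whose name was not found.
insert_list = cake_list.
  reject { |cake| existing_names.include?(cake.name) }.
  map { |cake| mongofy_values(cake.to_hash) }
collection.insert(insert_list)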
Documentation for Mongo::Collection#insert
2 Answers
#1
5
You can do something like this (source):
coll = MongoClient.new().db('test').collection('cakes')
bulk = coll.initialize_unordered_bulk_op
bulk.insert({'_id' => "strawberry"})
bulk.insert({'_id' => "strawberry"}) # duplicate key
bulk.insert({'_id' => "chocolate"})
bulk.insert({'_id' => "chocolate"}) # duplicate key
begin
  bulk.execute({:w => 1}) # this is the default but don't change it to 0 or you won't get the errors
rescue => ex
  p ex
  p ex.result
end
ex.result contains nInserted and the reason each one failed:
{"ok"=>1,
"n"=>2,
"code"=>65,
"errmsg"=>"batch item errors occurred",
"nInserted"=>2,
"writeErrors"=>
[{"index"=>1,
"code"=>11000,
"errmsg"=>
"insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_ dup key: { : \"strawberry\" }"},
{"index"=>3,
"code"=>11000,
"errmsg"=>
"insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.cakes.$_id_ dup key: { : \"chocolate\" }"}]}
#2
1
Bulk operations were the way to go. I'm accepting ranman's answer, but I thought I should share my final code:
def insert_documents(collection_name, documents)
  collection = db[collection_name]
  bulk = collection.initialize_unordered_bulk_op
  inserts = 0
  documents.each { |doc|
    bulk.insert doc
    inserts += 1
  }
  begin
    bulk.execute
  rescue Mongo::BulkWriteError => e
    # Duplicate-key errors end up here; the result says how many
    # documents were actually inserted.
    inserts = e.result["nInserted"]
  end
  return inserts
end
def insert_cakes(cakes)
  ensure_cake_indexes
  doc_list = cakes.map { |cake|
    mongofy_values(cake.to_hash)
  }
  return insert_documents(CAKE_COLLECTION, doc_list)
end
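A hypothetical call (the Cake class, its name accessor and to_hash are stand-ins for whatever the real model provides):

cakes = [Cake.new("strawberry"), Cake.new("chocolate"), Cake.new("strawberry")]
inserted = insert_cakes(cakes)
puts "#{inserted} of #{cakes.length} cakes inserted" # duplicates simply lower the count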