Mongo DB - 使用map reduce或聚合

I have series of documents in MongoDB collection that looks like this:

我在MongoDB集合中有一系列文档,如下所示:

{ 'time' : '2016-03-28 12:12:00', 'value' : 90 },
{ 'time' : '2016-03-28 12:13:00', 'value' : 82 },
{ 'time' : '2016-03-28 12:14:00', 'value' : 75 },
{ 'time' : '2016-03-28 12:15:00', 'value' : 72 },
{ 'time' : '2016-03-28 12:16:00', 'value' : 81 },
{ 'time' : '2016-03-28 12:17:00', 'value' : 90 },
etc....

The tasks is - with trash hold value of 80 find all times where value is entering below 80 and exiting above 80

任务是 - 垃圾保持值为80,查找值进入80以下并退出80以上的所有时间

{ 'time' : '2016-03-28 12:14:00', 'result' : 'enter' },
{ 'time' : '2016-03-28 12:16:00', 'result' : 'exit' },

Is it way to have map reduce or aggregation query that would produce such result ? I was trying to loop thru sorted results, but it is very processing and memory expensive - I need to do series of such checks.

是否可以使用map reduce或聚合查询来产生这样的结果?我试图通过排序结果循环,但它是非常昂贵的处理和内存 - 我需要做一系列这样的检查。

PS. I am using Django and mongoengine to execute call.

PS。我正在使用Django和mongoengine来执行调用。

2 个解决方案

#1

I'm not sure this is possible with the MongoDB aggregation framework alone since, as mentioned by @BlakesSeven, there is no link/connection between the subsequent documents. And you need this connection to check if the new value went below or above the desired threshold comparing to what the value was right before it, in a previous document.

我不确定单独使用MongoDB聚合框架是否可行,因为正如@BlakesSeven所述,后续文档之间没有链接/连接。并且您需要此连接以检查新值是否低于或高于所需阈值,而不是前一个文档中的值。

Here is a naive pure-python (since it is tagged with Django and MongoEngine) solution that loops over the sorted results maintaining the threshold-track variable and catching when it goes lower or higher 80 (col is your collection reference):

这是一个天真的纯python(因为它用Django和MongoEngine标记)解决方案循环遍历排序结果,保持threshold-track变量并在它变得越来越高时捕获80(col是你的集合引用):

THRESHOLD = 80
cursor = col.find().sort("time")

first_value = next(cursor)
more_than = first_value["value"] >= THRESHOLD

for document in cursor:
    if document["value"] < THRESHOLD:
        if more_than:
            print({"time": document["time"], "result": "enter"})
        more_than = False
    else:
        if not more_than:
            print({"time": document["time"], "result": "exit"})
        more_than = True

For the provided sample data, it prints:

对于提供的样本数据,它打印:

{'time': '2016-03-28 12:14:00', 'result': 'enter'}
{'time': '2016-03-28 12:16:00', 'result': 'exit'}

As a side note and an alternative solution..if you have control over the how these records are inserted, when you insert a document into this collection, you may check what is the latest value, compare it to the threshold and set the result as a separate field. Then, querying the entering and exiting the threshold points would become as easy as:

作为旁注和替代解决方案..如果您可以控制如何插入这些记录,当您将文档插入此集合时,您可以检查最新值是什么,将其与阈值进行比较并将结果设置为一个单独的领域。然后,查询进入和退出阈值点将变得如此简单:

col.find({"result" : {$exists : true}})

You can name this approach as "marking the threshold values beforehand". This probably makes sense only from querying/searching performance perspective and if you are going to do this often.

您可以将此方法命名为“事先标记阈值”。这可能只有从查询/搜索性能角度来看才有意义,如果你打算经常这样做的话。

#2

You can achieve transformation of documents easily with help of aggregation framework and cursor iteration.

借助聚合框架和游标迭代,您可以轻松实现文档转换。

Example:

db.collection.aggregate([
  {$project:
    {
      value:1,
      "threshold":{$let:
        {
          vars: {threshold: 80 }, 
          in:   "$$threshold"
        }}
     }
  },
  {$match:{value:{$ne: "$threshold"}}},
  {$group:
     {
       _id:"$null", 
       low:{
         $max:{
             $cond:[{$lt:["$value","$threshold"]},"$value",-1]
          }
       },

       high:{
         $min:{
             // 10000000000 is a superficial value. 
             // need something greater than values in documents
             $cond:[{$gt:["$value","$threshold"]},"$value",10000000000] 
          }
       },

       threshold:{$first:"$threshold"}
     }
   }  
])

Aggregation framework will return a document with two values.

聚合框架将返回具有两个值的文档。

{ 
    "_id" : null, 
    "low" : NumberInt(75), 
    "high" : NumberInt(81), 
    "threshold" : NumberInt(80)
}

We can easily find documents matching return criteria. e.g. in NodeJS we can easily do this. assuming variable result holds result from aggregation query.

我们可以轻松找到符合退货条件的文件。例如在NodeJS中,我们可以轻松地做到这一点。假设变量结果保存聚合查询的结果。

result.forEach(function(r){

   var documents = [];

   db.collection.find({$or:[{"value": r.low},{"value": r.high}]}).forEach(function(doc){

        var _doc = {};
        _doc.time = doc.time;
        _doc.result = doc.value < r.threshold ? "enter" : "exit";
        documents.push(_doc);
   });
   printjson(documents);
});

As you mention, if your input documents are (sample)

如您所述,如果您的输入文件是(样本)

{ 'time' : '2016-03-28 12:12:00', 'value' : 90 },
{ 'time' : '2016-03-28 12:13:00', 'value' : 82 },
{ 'time' : '2016-03-28 12:14:00', 'value' : 75 },
{ 'time' : '2016-03-28 12:15:00', 'value' : 72 },
{ 'time' : '2016-03-28 12:16:00', 'value' : 81 },
{ 'time' : '2016-03-28 12:17:00', 'value' : 90 },
etc....

Query above in solution will emit:

上面的解决方案查询将发出:

{
    "time" : "2016-03-28 12:14:00", 
    "result" : "enter"
}, 
{
    "time" : "2016-03-28 12:16:00", 
    "result" : "exit"
}

#1