
Time: 2022-01-02 04:02:44
    meta_map = {}
    results = db.meta.find({'corpus_id':id, 'method':method}) #this Mongo query only takes 3ms
    print results.explain()
    #result is mongo queryset of 2000 documents

    count = 0
    for r in results:
        count += 1
        print count
        word = r.get('word')
        data = r.get('data',{})
        if not meta_map.has_key(word):
            meta_map[word] = data
    return meta_map

This is super, super slow for some reason.


There are a total of 2000 results. Below is an example of a result document (from Mongo). All other results are similar in length.


{ "word" : "articl", "data" : { "help" : 0.42454812322341984, "show" : 0.24099054286865948, "lack" : 0.2368313038407821, "steve" : 0.20491936823259457, "gb" : 0.18757527934987422, "feedback" : 0.2855335862138559, "categori" : 0.28210549642632016, "itun" : 0.23615623082085788, "articl" : 0.21378509220044106, "black" : 0.22720575131038662, "hidden" : 0.26172127252557625, "holiday" : 0.27662433827306804, "applic" : 0.1802411089325281, "digit" : 0.20491936823259457, "sourc" : 0.21909218369809863, "march" : 0.2632736571995878, "ceo" : 0.2153108869289692, "donat" : 1, "volum" : 0.2572042432755638, "octob" : 0.2802470156773559, "toolbox" : 0.2153108869289692, "discuss" : 0.26973295489368615, "list" : 0.3698592948408095, "upload" : 0.1802411089325281, "random" : 1, "default" : 0.33044754314072383, "februari" : 0.2899936154686609, "januari" : 0.25228424754983525, "septemb" : 0.1802411089325281, "page" : 0.24675067183234803, "view" : 0.20019523259334138, "pleas" : 0.2839965947961194, "mdi" : 0.2731217555354, "unsourc" : 0.2709524603813144, "direct" : 0.18757527934987422, "dead" : 0.22720575131038662, "smartphon" : 0.2839965947961194, "jump" : 0.3004203939398161, "see" : 0.33044754314072383, "design" : 0.2839965947961194, "download" : 0.19574598998663462, "home" : 0.3004203939398161, "event" : 0.651573574681647, "wikipedia" : 0.21909218369809863, "content" : 0.2471475889083912, "version" : 0.42454812322341984, "gener" : 0.3004203939398161, "refer" : 0.2188507485718582, "navig" : 0.27662433827306804, "june" : 0.2153108869289692, "screen" : 0.27662433827306804, "free" : 0.22720575131038662, "job" : 0.19574598998663462, "key" : 0.3004203939398161, "addit" : 0.22484486630589545, "search" : 0.2878804276884952, "current" : 0.5071530767683105, "worldwid" : 0.20491936823259457, "iphon" : 0.2230524329516571, "action" : 0.24099054286865948, "chang" : 0.18757527934987422, "summari" : 0.33044754314072383, "origin" : 0.2572042432755638, "softwar" : 0.651573574681647, "point" : 
0.27662433827306804, "extern" : 0.22190187748860113, "mobil" : 0.2514880028687207, "cloud" : 0.18757527934987422, "use" : 0.2731217555354, "log" : 0.27662433827306804, "commun" : 0.33044754314072383, "interact" : 0.5071530767683105, "devic" : 0.3004203939398161, "long" : 0.2839965947961194, "avail" : 0.19574598998663462, "appl" : 0.24099054286865948, "disambigu" : 0.3195885490528538, "statement" : 0.2737499468972353, "namespac" : 0.3004203939398161, "season" : 0.3004203939398161, "juli" : 0.27243508666247285, "relat" : 0.19574598998663462, "phone" : 0.26973295489368615, "link" : 0.2178125232318433, "line" : 0.42454812322341984, "pilot" : 0.27243508666247285, "account" : 0.2572042432755638, "main" : 0.34870313981256423, "provid" : 0.2153108869289692, "histori" : 0.2714135089366041, "vagu" : 0.24875213214603717, "featur" : 0.24099054286865948, "creat" : 0.26645207330844684, "ipod" : 0.2230524329516571, "player" : 0.20491936823259457, "io" : 0.2447908314834019, "need" : 0.2580912994161046, "develop" : 0.27662433827306804, "began" : 0.24099054286865948, "client" : 0.19574598998663462, "also" : 0.42454812322341984, "cleanup" : 0.24875213214603717, "split" : 0.26973295489368615, "tool" : 0.2878804276884952, "product" : 0.42454812322341984, "may" : 0.2676701118192027, "assist" : 0.1802411089325281, "variant" : 0.2514880028687207, "portal" : 0.3004203939398161, "user" : 0.20491936823259457, "consid" : 0.27662433827306804, "date" : 0.2731217555354, "recent" : 0.24099054286865948, "read" : 0.2572042432755638, "reliabl" : 0.2388872270166464, "sale" : 0.22720575131038662, "ambigu" : 0.23482106920048526, "person" : 0.260801274024785, "contact" : 0.24099054286865948, "encyclopedia" : 0.2153108869289692, "time" : 0.2368313038407821, "model" : 0.24099054286865948, "audio" : 0.19574598998663462 }}

The whole process takes about 15 seconds... what the hell? How can I speed it up? :)


Edit: I noticed that when I print the count to the console, it goes from 0 to 101 very fast, then freezes for 10 seconds, and then continues from 102 to 2000.


Could this be a MongoDB problem?


Edit 2: I printed the Mongo EXPLAIN() of the query below:


{u'allPlans': [{u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
                u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                                 u'method': [[u'advanced', u'advanced']],
                                 u'word': [[{u'$minElement': 1},
                                            {u'$maxElement': 1}]]}}],
 u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
 u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                  u'method': [[u'advanced', u'advanced']],
                  u'word': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 3,
 u'n': 2443,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 2443,
 u'nscannedObjects': 2443,
 u'oldPlan': {u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
              u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                               u'method': [[u'advanced', u'advanced']],
                               u'word': [[{u'$minElement': 1},
                                          {u'$maxElement': 1}]]}}}

These are the stats for the mongo collection:


> db.meta.stats();
{
    "ns" : "inception.meta",
    "count" : 2450,
    "size" : 3001068,
    "avgObjSize" : 1224.9257142857143,
    "storageSize" : 18520320,
    "numExtents" : 6,
    "nindexes" : 2,
    "lastExtentSize" : 13893632,
    "paddingFactor" : 1.009999999999931,
    "flags" : 1,
    "totalIndexSize" : 368640,
    "indexSizes" : {
        "_id_" : 114688,
        "corpus_id_1_method_1_word_1" : 253952
    },
    "ok" : 1
}

> db.meta.getIndexes();
[
    {
        "name" : "_id_",
        "ns" : "inception.meta",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    },
    {
        "ns" : "inception.meta",
        "name" : "corpus_id_1_method_1_word_1",
        "key" : {
            "corpus_id" : 1,
            "method" : 1,
            "word" : 1
        },
        "v" : 0
    }
]

3 Answers



Your query returns almost all the documents in your collection (2443 of 2450, going by the explain() and stats() output). That may or may not be what you want here; good database advice is always to transmit as few documents/rows as possible from the server to your application. Your collection is also about 3 megabytes in size, so it's possible that the delay you are seeing is simply due to network transmission time.
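
If transfer time is the suspect, one thing to try is a projection, so that only the fields the loop actually reads are sent. A minimal sketch, assuming a pymongo-style collection whose find() takes a filter and a projection as its first two arguments. fetch_meta_map and FakeCollection are hypothetical names; the stub exists only so the snippet runs without a server:

```python
def fetch_meta_map(collection, corpus_id, method):
    """Build {word: data}, asking the server for only the two fields used."""
    query = {'corpus_id': corpus_id, 'method': method}
    # Projection: include word and data, drop _id (assumes pymongo-style find()).
    projection = {'word': 1, 'data': 1, '_id': 0}
    meta_map = {}
    for doc in collection.find(query, projection):
        word = doc['word']
        if word not in meta_map:
            meta_map[word] = doc.get('data', {})
    return meta_map


class FakeCollection(object):
    """Hypothetical in-memory stand-in for a collection, for illustration only."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query, projection):
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in query.items()):
                # Keep only the fields marked 1 in the projection.
                yield dict((k, doc[k]) for k, on in projection.items()
                           if on and k in doc)


docs = [
    {'_id': 1, 'corpus_id': 'iphone', 'method': 'advanced',
     'word': 'articl', 'data': {'help': 0.42}},
    {'_id': 2, 'corpus_id': 'other', 'method': 'advanced',
     'word': 'appl', 'data': {'ipod': 0.22}},
]
result = fetch_meta_map(FakeCollection(docs), 'iphone', 'advanced')
```

Since data makes up most of each document in the sample shown, the saving from a projection is probably small here; it matters more when documents carry fields you never read.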




Instead of


if not meta_map.has_key(word):

you should use


if word not in meta_map:

There is no point in doing data = r.get('data', {}) on every iteration if you only use it when word is not already in meta_map.


It's not obvious why you are doing word = r.get('word') ... if 'word' always exists in r, you should just use word = r['word']; otherwise you should test whether word is None after the get.


The same applies to the data lookup.


Try this:


for r in results:
    word = r['word']
    if word not in meta_map:
        meta_map[word] = r['data']

In any case the time you quoted is enormous ... there must be something else going on there. I would be very interested to see your code for doing the timing and counting the number of entries in results.
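
For the timing, one option is to iterate without building the dictionary at all and record where the longest pause happens. A minimal sketch; time_iteration and slow_source are hypothetical names, and slow_source only simulates a cursor that stalls, so the snippet runs without a database:

```python
import time


def time_iteration(results):
    """Iterate over results; return (count, total_seconds, worst_gap, worst_index)."""
    count = 0
    worst_gap = 0.0
    worst_index = None
    start = last = time.time()
    for _ in results:
        now = time.time()
        # Track the longest wait for a single item and how many came before it.
        if now - last > worst_gap:
            worst_gap = now - last
            worst_index = count
        last = now
        count += 1
    return count, time.time() - start, worst_gap, worst_index


def slow_source(n, stall_at, stall_seconds):
    """Hypothetical stand-in for a cursor that stalls before one item."""
    for i in range(n):
        if i == stall_at:
            time.sleep(stall_seconds)
        yield {'word': 'w%d' % i, 'data': {}}


count, total, worst_gap, worst_index = time_iteration(slow_source(200, 101, 0.2))
```

With the real cursor, pass results in place of slow_source(...). If the longest gap lands right after the first hundred or so documents, as the question's edit describes, that would suggest the time is going into fetching the next batch from the server rather than into the dictionary code.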




If your problem really is the dictionary, maybe using setdefault() instead of first looking the key up and then setting it can help.
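
A minimal sketch of that idea; build_meta_map_setdefault and the sample documents are made up for illustration:

```python
def build_meta_map_setdefault(results):
    """Keep the first data seen for each word, using a single dict call."""
    meta_map = {}
    for r in results:
        # setdefault stores the value only if the key is not already present.
        meta_map.setdefault(r['word'], r.get('data', {}))
    return meta_map


sample = [
    {'word': 'articl', 'data': {'help': 0.42}},
    {'word': 'appl', 'data': {'ipod': 0.22}},
    {'word': 'articl', 'data': {'show': 0.24}},  # duplicate word, ignored
]
meta_map = build_meta_map_setdefault(sample)
```

With only a couple of thousand entries, though, the dictionary work is unlikely to account for seconds either way.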




