Context

Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
Running locally

I am implementing the basic word count example as a local MapReduce job. My mapper emits a 1 for each word in each line of a book from a .txt file, using a simple regex. The reducer counts the number of occurrences of each word, i.e. the number of 1's grouped under each word.

from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")

class WordCounter(MRJob):
  def mapper(self, _, line):
    words = WORD_REGEXP.findall(line)
    for word in words:
      yield word.lower(), 1

  def reducer(self, word, times_seen):
    yield word, sum(times_seen)

if __name__ == '__main__':
  WordCounter.run()

Problem

The counts in the output file are correct, but the key-value pairs are not globally sorted. The result seems to be sorted alphabetically only within chunks of data.
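
A quick way to check where the ordering breaks, as a minimal sketch: it assumes each output line is a JSON-encoded key, a tab, and a JSON-encoded value (mrjob's default output format). The helper names `parse_keys` and `first_out_of_order` are hypothetical, and the sample lines are taken from the output shown below.

```python
import json


def parse_keys(lines):
    """Extract the JSON-decoded key from each 'key<TAB>value' output line."""
    keys = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key_json, _value_json = line.split("\t", 1)
        keys.append(json.loads(key_json))
    return keys


def first_out_of_order(keys):
    """Return the index of the first key smaller than its predecessor, or None."""
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i
    return None


# Assumption: tab-separated lines, as mrjob writes them by default.
sample = [
    '"executive"\t2',
    '"executives"\t2',
    '"theft"\t1',
    '"their"\t122',
    '"anecdotes"\t1',
]
keys = parse_keys(sample)
break_index = first_out_of_order(keys)
print(break_index, keys[break_index])
```

For a real run, the same functions could be applied to the lines of the output file instead of `sample`.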

"customers'"    1
"customizing"   1
"cut"   2
"cycle" 1
"cycles"    1
"d" 10
"dad"   1
"dada"  1
"daily" 3
"damage"    1
"deductible"    6
...
"exchange"  10
"excited"   4
"excitement"    1
"exciting"  4
"executive" 2
"executives"    2
"theft" 1
"their" 122
"them"  166
"theme" 2
"themselves"    16
"then"  59
"there" 144
"they've"   2
...
"anecdotes" 1
"angel" 1
"angie's"   1
"angry" 1
"announce"  2
"announced" 1
"announcement"  3
"announcements" 3
"announcing"    2
...
"patents"   3
"path"  19
"paths" 1
"patterns"  1
"pay"   45
"exercise"  1
"exercises" 1
"exist" 6
"expansion" 1
"expect"    11
"expectation"   3
"expectations"  5
"expected"  4
...
"customer"  41
"customers" 122
"yours" 15
"yourself"  78
"youth" 1
"zealand"   1
"zero"  7
"zoho"  1
"zone"  2

Question

Is there some initial configuration to be done in order to obtain globally sorted output from an MRJob?

1 Solution

#1

You are missing the combiner step. In this guide, it's the first example of a single-step job: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html

I'll copy the code here so that this answer is complete:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
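
To illustrate what the combiner step does to the data flow, here is a plain-Python sketch. It is not how mrjob executes a job internally; `sum_by_key`, `simulate`, and the two-split input are hypothetical and only mimic the mapper, combiner, and reducer roles. Counts are summed once per simulated map task (the combiner) and then summed again across tasks (the reducer), so fewer pairs reach the reducer while the totals stay the same.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word in the line, as in the job above."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_by_key(pairs):
    """Group (word, count) pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def simulate(splits):
    """Run mapper -> combiner -> reducer over hypothetical input splits.

    splits is a list of lists of lines, one inner list per simulated map task.
    Returns the final totals and the number of pairs passed to the reducer.
    """
    combined = []
    for lines in splits:
        mapped = [pair for line in lines for pair in mapper(line)]
        # Combiner role: pre-aggregate within a single map task.
        combined.extend(sum_by_key(mapped).items())
    # Reducer role: aggregate the partial sums from all tasks.
    return sum_by_key(combined), len(combined)


splits = [["the cat saw the dog"], ["The dog ran"]]
totals, shuffled_pairs = simulate(splits)
print(totals)
print(shuffled_pairs)
```

In this toy input the mappers emit 8 pairs in total, but only 7 are handed to the reducer, because the repeated "the" in the first split is pre-summed.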
