Context
Python 3.6.3 :: Anaconda custom (64-bit)
mrjob==0.6.2 with no custom configuration
Running locally
I am implementing the basic word count example as a local MapReduce job. My mapper uses a simple regex to emit a 1 for each word in each line of a book stored in a .txt file.
The reducer counts the number of occurrences of each word, i.e. the number of 1's grouped under each word.
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[\w']+")


class WordCounter(MRJob):
    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, word, times_seen):
        yield word, sum(times_seen)


if __name__ == '__main__':
    WordCounter.run()
Problem
The output file is correct, but the key-value pairs are not globally sorted. The result appears to be sorted alphabetically only within chunks of data.
"customers'" 1
"customizing" 1
"cut" 2
"cycle" 1
"cycles" 1
"d" 10
"dad" 1
"dada" 1
"daily" 3
"damage" 1
"deductible" 6
...
"exchange" 10
"excited" 4
"excitement" 1
"exciting" 4
"executive" 2
"executives" 2
"theft" 1
"their" 122
"them" 166
"theme" 2
"themselves" 16
"then" 59
"there" 144
"they've" 2
...
"anecdotes" 1
"angel" 1
"angie's" 1
"angry" 1
"announce" 2
"announced" 1
"announcement" 3
"announcements" 3
"announcing" 2
...
"patents" 3
"path" 19
"paths" 1
"patterns" 1
"pay" 45
"exercise" 1
"exercises" 1
"exist" 6
"expansion" 1
"expect" 11
"expectation" 3
"expectations" 5
"expected" 4
...
"customer" 41
"customers" 122
"yours" 15
"yourself" 78
"youth" 1
"zealand" 1
"zero" 7
"zoho" 1
"zone" 2
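One plausible reading of the listing above is that each reduce task sorts only the keys it receives, and the final file is a concatenation of those per-task outputs. The following is a minimal sketch of that idea in plain Python, not mrjob itself. The partitioner `partition_for`, the helper `run_reduce_tasks`, the choice of two tasks, and the sample words are all assumptions made for illustration; how the local runner actually splits keys is not shown in this post.

```python
import heapq
from collections import defaultdict


def partition_for(word, num_tasks):
    # Hypothetical partitioner, chosen only to be deterministic for this demo.
    return len(word) % num_tasks


def run_reduce_tasks(pairs, num_tasks):
    """Group (word, 1) pairs by partition, then sum and sort within each."""
    partitions = [defaultdict(int) for _ in range(num_tasks)]
    for word, count in pairs:
        partitions[partition_for(word, num_tasks)][word] += count
    # Each simulated reduce task emits its own keys in sorted order.
    return [sorted(p.items()) for p in partitions]


# Sample mapper output: one (word, 1) pair per word occurrence.
pairs = [(w, 1) for w in "them zero cut angel dad them cut their d cycle".split()]
parts = run_reduce_tasks(pairs, num_tasks=2)

# Concatenating the per-task outputs gives sorted chunks, not a sorted whole.
concatenated = [kv for part in parts for kv in part]

# Merging the already-sorted chunks restores a single global order.
globally_sorted = list(heapq.merge(*parts))

for word, count in globally_sorted:
    print(word, count)
```

Under these assumptions, `concatenated` reproduces the "sorted in chunks" pattern, while merging or re-sorting the chunks after the job yields one alphabetical sequence.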
Question
Is there some initial configuration needed to obtain globally sorted output from an MRJob?
1 Answer
You are missing the combiner step. In this guide, it's part of the first example of a single-step job: https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html
I'll copy the code here so this answer is complete:
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
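To make the combiner's role concrete, here is a minimal sketch in plain Python (not mrjob) of what a combiner does: it pre-aggregates each mapper's pairs before they reach the reducer. The helpers `map_line`, `combine`, and `reduce_all`, and the two input splits, are hypothetical names and data used only for illustration. The sketch only checks the totals and the number of pairs passed along; it does not model output ordering.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    # Same logic as the mapper above: one (word, 1) pair per word.
    return [(word.lower(), 1) for word in WORD_RE.findall(line)]


def combine(pairs):
    # Pre-aggregate the pairs produced by a single simulated mapper.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_all(pairs):
    # Sum the counts for each word across all mappers.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical input, split across two simulated mappers.
splits = [["The cat saw the dog"], ["the dog ran"]]
mapped = [[kv for line in split for kv in map_line(line)] for split in splits]

without_combiner = reduce_all(kv for pairs in mapped for kv in pairs)
with_combiner = reduce_all(kv for pairs in mapped for kv in combine(pairs))

# The totals match; only the number of intermediate pairs differs.
print(without_combiner == with_combiner)
print(sum(len(p) for p in mapped), sum(len(combine(p)) for p in mapped))
```

In this sketch the combiner reduces the number of intermediate pairs while leaving the final counts unchanged.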