A Walkthrough of the WordCount Process in Hadoop MapReduce

Date: 2021-05-23 04:08:58

Splitting the input files

File 1:                                 Split result:

hello world                             <0, "hello world">

this is wordcount                       <12, "this is wordcount">

File 2:

hello china                             <0, "hello china">

hello IT                                <12, "hello IT">

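The test files are small, so each file typically forms a single split.

The MapReduce framework performs this splitting itself. With the default TextInputFormat, every line becomes one <key, value> pair: the key is the byte offset at which the line starts in the file, and the value is the text of the line. "hello world" plus its newline occupies 12 bytes, which is why the second line has key 12.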
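
The offsets used as keys can be reproduced with a short plain-Python sketch. This is not Hadoop code: `split_into_records` is a hypothetical helper, and it assumes every line ends in a single `\n` byte.

```python
def split_into_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value drops the line ending.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


file1 = b"hello world\nthis is wordcount\n"
file2 = b"hello china\nhello IT\n"

records1 = list(split_into_records(file1))
records2 = list(split_into_records(file2))
print(records1)  # [(0, 'hello world'), (12, 'this is wordcount')]
print(records2)  # [(0, 'hello china'), (12, 'hello IT')]
```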

Next, each <key, value> pair is passed to the user-defined map method, which produces new <key, value> pairs:

<0, "hello world">                        map()                <hello,1> <world,1>

<12,"this is wordcount">             map()                 <this,1> <is,1> <wordcount,1>

<0,"hello china">                         map()                 <hello,1> <china,1>

<12,"hello IT">                            map()                  <hello,1><IT,1>
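
A minimal sketch of what such a map method does, written as plain Python and not against the Hadoop API. The name `map_record` is hypothetical, and splitting on whitespace is an assumption about how words are tokenized.

```python
def map_record(offset, line):
    """Emit a (word, 1) pair for every whitespace-separated token in the line."""
    # The offset key is ignored; only the line text matters for counting.
    return [(word, 1) for word in line.split()]


records = [(0, "hello world"), (12, "this is wordcount")]
map_output = [pair for offset, line in records for pair in map_record(offset, line)]
print(map_output)
# [('hello', 1), ('world', 1), ('this', 1), ('is', 1), ('wordcount', 1)]
```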

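Between map() and reduce() there is a shuffle phase, which sorts each map task's output by key. Text keys compare byte by byte, so the uppercase IT sorts ahead of the lowercase words:

<hello,1> <world,1>                     shuffle()           <hello,1>
<this,1> <is,1> <wordcount,1>                               <is,1>
                                                            <this,1>
                                                            <wordcount,1>
                                                            <world,1>

<hello,1> <china,1>                     shuffle()           <IT,1>
<hello,1> <IT,1>                                            <china,1>
                                                            <hello,1>
                                                            <hello,1>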
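
The sort step can be sketched in plain Python as follows. `sort_by_key` is a hypothetical helper, and comparing keys by their UTF-8 bytes is an assumption chosen to mirror how Hadoop's Text keys are ordered, which places the uppercase IT ahead of the lowercase words.

```python
def sort_by_key(pairs):
    """Sort (key, value) pairs by the UTF-8 bytes of the key."""
    return sorted(pairs, key=lambda kv: kv[0].encode("utf-8"))


map_output_1 = [("hello", 1), ("world", 1), ("this", 1), ("is", 1), ("wordcount", 1)]
map_output_2 = [("hello", 1), ("china", 1), ("hello", 1), ("IT", 1)]

sorted_1 = sort_by_key(map_output_1)
sorted_2 = sort_by_key(map_output_2)
print(sorted_1)
# [('hello', 1), ('is', 1), ('this', 1), ('wordcount', 1), ('world', 1)]
print(sorted_2)
# [('IT', 1), ('china', 1), ('hello', 1), ('hello', 1)]
```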

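Grouping: pairs that share a key are merged. First this happens within each map task's output. The two <hello,1> pairs from file 2 collapse into a single count of 2, which is the job of a combiner when one is configured:

<hello,1>                        <hello,list(1)>
<is,1>                           <is,list(1)>
<this,1>                         <this,list(1)>
<wordcount,1>                    <wordcount,list(1)>
<world,1>                        <world,list(1)>

<IT,1>                           <IT,list(1)>
<china,1>                        <china,list(1)>
<hello,1> <hello,1>              <hello,list(2)>

The outputs of all map tasks are then merged by key, giving the input to reduce():

<IT,list(1)>
<china,list(1)>
<hello,list(1,2)>
<is,list(1)>
<this,list(1)>
<wordcount,list(1)>
<world,list(1)>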
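
The two grouping stages can be sketched in plain Python. The helper names `combine` and `group_for_reduce` are hypothetical, and the local summing assumes a combiner is in use; without one, the reducer would receive three separate 1s for hello.

```python
from itertools import groupby


def combine(sorted_pairs):
    """Sum the counts of equal keys within one map task's sorted output."""
    return [(key, sum(v for _, v in group))
            for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])]


def group_for_reduce(*task_outputs):
    """Merge per-task outputs into (key, [values]) pairs sorted by key."""
    # sorted() is stable, so values keep the order of the tasks they came from.
    merged = sorted((kv for out in task_outputs for kv in out),
                    key=lambda kv: kv[0].encode("utf-8"))
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=lambda kv: kv[0])]


task1 = combine([("hello", 1), ("is", 1), ("this", 1), ("wordcount", 1), ("world", 1)])
task2 = combine([("IT", 1), ("china", 1), ("hello", 1), ("hello", 1)])
grouped = group_for_reduce(task1, task2)
print(task2)    # [('IT', 1), ('china', 1), ('hello', 2)]
print(grouped)
```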

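These grouped <key, value> pairs are then handed to the user's reduce() method, which produces the final <key, value> pairs; these are written out as the result of WordCount: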

<IT,1>

<china,1>

<hello,3>

<is,1>

<this,1>

<wordcount,1>

<world,1>
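
The reduce step amounts to summing each key's list of values. A minimal plain-Python sketch, where `reduce_counts` is a hypothetical name and the input is assumed to be the grouped pairs for the two sample files:

```python
def reduce_counts(key, values):
    """Return the key together with the sum of its grouped counts."""
    return key, sum(values)


grouped = [
    ("IT", [1]), ("china", [1]), ("hello", [1, 2]), ("is", [1]),
    ("this", [1]), ("wordcount", [1]), ("world", [1]),
]

result = [reduce_counts(key, values) for key, values in grouped]
for word, count in result:
    print(f"{word}\t{count}")
```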