MapReduce Serialization and Partitioning: Java Code Examples

Posted: 2023-03-08 16:01:15

Overview

  Serialization is the process of turning a structured object into a byte stream.

  Deserialization is the reverse: turning a byte stream back into a structured object.

  When an object has to be passed between processes or persisted, it must be serialized into a byte stream; conversely, when a byte stream received over the network or read from disk needs to become an object again, it must be deserialized.
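
  As a minimal sketch of the idea using only plain JDK streams (not Hadoop-specific, illustrative names only): write two fields to a byte stream, then read them back in the same order.

import java.io.*;

public class SerializationSketch {
    public static void main(String[] args) throws IOException {
        // Serialize: structured fields -> byte stream
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeLong(2481L);   // e.g. upstream traffic
        out.writeLong(24681L);  // e.g. downstream traffic
        out.flush();

        // Deserialize: byte stream -> structured fields, read in the same order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        long upFlow = in.readLong();
        long downFlow = in.readLong();
        System.out.println(upFlow + " / " + downFlow);
    }
}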

  Java's built-in serialization (Serializable) is a heavyweight framework: a serialized object carries a lot of extra information (validation data, headers, the whole inheritance hierarchy, ...), which makes it inefficient to transmit over the network. Hadoop therefore provides its own serialization mechanism, Writable, which is compact and efficient: it does not drag along multiple levels of parent/child structure the way Java object serialization does, and only the field values that are actually needed are transmitted, greatly reducing network overhead.
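
  A rough illustration of that overhead (a sketch, not a benchmark): serialize a single long with ObjectOutputStream versus writing it directly as 8 bytes, the way Writable does. The exact byte counts depend on the JDK, so the sketch just prints them.

import java.io.*;

public class OverheadSketch {
    public static void main(String[] args) throws IOException {
        // Java built-in serialization: object header, class descriptor, etc.
        ByteArrayOutputStream javaSer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaSer)) {
            oos.writeObject(Long.valueOf(24681L));
        }

        // Writable-style serialization: just the 8 bytes of the long
        ByteArrayOutputStream rawSer = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(rawSer)) {
            dos.writeLong(24681L);
        }

        System.out.println("Serializable: " + javaSer.size() + " bytes");
        System.out.println("DataOutput  : " + rawSer.size() + " bytes"); // 8 bytes
    }
}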

  Writable is Hadoop's serialization format. Hadoop defines a Writable interface, and a class only needs to implement this interface to be serializable.

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

  If a custom bean is to be transmitted as the key, it must also be comparable, because the shuffle phase of the MapReduce framework always sorts the keys. In that case the custom bean should implement WritableComparable.
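
  For reference, WritableComparable in org.apache.hadoop.io is simply the combination of the two interfaces:

public interface WritableComparable<T> extends Writable, Comparable<T> {
}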

Code Examples

  1. Requirement

    For each user (phone number), compute the total upstream traffic, total downstream traffic, and total traffic. On top of that result, add one more requirement: sort the statistics by total traffic in descending order.

  Input data

1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200

  2. Analysis

    Implement a custom bean to encapsulate the traffic information, and transmit it in the map output (as the value in the summing job, and as the key in the sorting job).

    While processing data, an MR program sorts the map output key/value pairs before they reach the reducer, and that sort is based on the map output key. So, to impose our own ordering, we can put the sort field(s) into the key, have the key class implement WritableComparable, and override its compareTo method.

  3. Implementation without sorting

    Custom JavaBean

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow, long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public void set(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    /**
     * Serialization: write the fields to the output stream.
     */
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    /**
     * Deserialization: read the fields back in exactly the order they were written.
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    /**
     * Ordering used by the shuffle when this bean is the map output key:
     * a negative result sorts this bean before o, a positive result after it.
     * Here a larger sumFlow sorts first, i.e. descending by total traffic.
     */
    @Override
    public int compareTo(FlowBean o) {
        return this.getSumFlow() > o.getSumFlow() ? -1 : 1; // descending by total traffic
        // return this.getSumFlow() > o.getSumFlow() ? 1 : -1; // ascending by total traffic
    }
}
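
    A quick local sanity check (a sketch with a hypothetical class name, not part of the MR job): round-trip a FlowBean through its own write()/readFields() to confirm the fields are read back in the order they were written.

import java.io.*;

public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowBean original = new FlowBean(2481, 24681);

        // Serialize the bean into an in-memory byte stream
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize into a fresh bean
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(original); // 2481	24681	27162
        System.out.println(copy);     // should print the same line
    }
}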

    Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    Text k = new Text();
    FlowBean v = new FlowBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\t");
        String phoNum = fields[1];                                 // extract the phone number
        long upFlow = Long.parseLong(fields[fields.length - 3]);   // extract the upstream traffic
        long downFlow = Long.parseLong(fields[fields.length - 2]); // extract the downstream traffic
        k.set(phoNum);
        v.set(upFlow, downFlow);
        context.write(k, v);
    }
}

    Reducer

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowSumReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    FlowBean v = new FlowBean();

    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        long sumUpFlow = 0;
        long sumDownFlow = 0;
        for (FlowBean value : values) {
            sumUpFlow += value.getUpFlow();     // accumulate the upstream traffic of each record
            sumDownFlow += value.getDownFlow(); // accumulate the downstream traffic of each record
        }
        v.set(sumUpFlow, sumDownFlow);
        context.write(key, v);
    }
}

    Driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowSumRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the MR job in local mode (simulated environment), typically for local testing
        conf.set("mapreduce.framework.name", "local");
        // Obtain a job instance
        Job job = Job.getInstance(conf);

        // Main class of this MR job
        job.setJarByClass(FlowSumRunner.class);
        // Mapper and Reducer used by this job
        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReducer.class);
        // Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        // Reduce (i.e. final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        // Input directory and output directory
        FileInputFormat.addInputPath(job, new Path("D:\\flowsum\\input"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\flowsum\\output"));

        // Submit the job; exit with status 0 on success, 1 otherwise
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
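
    With the default TextOutputFormat, each output line is the key, a tab, then FlowBean.toString(). For example, assuming the input file is tab-separated as the mapper expects and the six 13726230503 records above are the only ones for that number (6 × 2481 upstream, 6 × 24681 downstream), its line should look like:

13726230503	14886	148086	162972

    The sort job below parses exactly this layout: fields[0] is the phone number, fields[1] and fields[2] are the summed upstream and downstream traffic.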

    

   4. Implementation with sorting

      Use the output of the previous job as the input for this job.

    Mapper

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowSumSortMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
    FlowBean k = new FlowBean();
    Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\t");
        String phoNum = fields[0];                    // phone number
        long sumUpFlow = Long.parseLong(fields[1]);   // summed upstream traffic
        long sumDownFlow = Long.parseLong(fields[2]); // summed downstream traffic
        v.set(phoNum);
        k.set(sumUpFlow, sumDownFlow);                // FlowBean is the key, so the shuffle sorts by it
        context.write(k, v);
    }
}

    Reducer

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowSumSortReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Normally each FlowBean key carries a single phone number; looping also
        // stays correct if several numbers happen to share identical totals.
        for (Text phoNum : values) {
            context.write(phoNum, key);
        }
    }
}

    Driver

// On top of the previous job's result, add one more requirement:
// sort the statistics by total traffic in descending order.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowSumSortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the MR job in local mode, typically for local testing
        conf.set("mapreduce.framework.name", "local");
        Job job = Job.getInstance(conf);

        // Main class of this MR job
        job.setJarByClass(FlowSumSortDriver.class);
        // Mapper and Reducer used by this job
        job.setMapperClass(FlowSumSortMapper.class);
        job.setReducerClass(FlowSumSortReducer.class);
        // Map output key/value types (FlowBean is the key, so the shuffle sorts by it)
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);
        // Reduce (i.e. final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        // Input is the previous job's output directory; results go to a new directory
        FileInputFormat.addInputPath(job, new Path("D:\\flowsum\\output"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\flowsum\\outsortput"));

        // Submit the job; exit with status 0 on success, 1 otherwise
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

Partitioning in MapReduce: Partitioner


1. Requirement

    Write the traffic summary results to different output files according to the province that each phone number belongs to.

2. Analysis

    MapReduce groups the map output key/value pairs by key and then distributes them to the different reduce tasks.

    The default distribution rule: a pair is sent to the reduce task given by key.hashCode() % (number of reduce tasks).
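
    For reference, this is essentially what Hadoop's stock HashPartitioner (org.apache.hadoop.mapreduce.lib.partition.HashPartitioner) does; the sketch below uses a hypothetical class name:

import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the partition index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}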

    So, to distribute (group) the data according to our own rule, we need to replace the distribution component, the Partitioner: define a CustomPartitioner that extends the abstract class Partitioner, then register it on the job with job.setPartitionerClass(CustomPartitioner.class).

3. Implementation

    Custom Partitioner class

import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    // Map a phone-number prefix to a partition (reduce task) number
    public static HashMap<String, Integer> provinceMap = new HashMap<String, Integer>();
    static {
        provinceMap.put("134", 0);
        provinceMap.put("135", 1);
        provinceMap.put("136", 2);
        provinceMap.put("137", 3);
        provinceMap.put("138", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean value, int numPartitions) {
        Integer code = provinceMap.get(key.toString().substring(0, 3));
        if (code != null) {
            return code;
        }
        return 5; // every other prefix goes to the last partition
    }
}

    Mapper, Reducer, and driver

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowSumProvince {

    public static class FlowSumProvinceMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
        Text k = new Text();
        FlowBean v = new FlowBean();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Read one line of text
            String line = value.toString();
            // Split on the tab separator
            String[] fields = line.split("\t");
            // Phone number and traffic fields
            String phoneNum = fields[1];
            long upFlow = Long.parseLong(fields[fields.length - 3]);
            long downFlow = Long.parseLong(fields[fields.length - 2]);
            k.set(phoneNum);
            v.set(upFlow, downFlow);
            context.write(k, v);
        }
    }

    public static class FlowSumProvinceReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
        FlowBean v = new FlowBean();

        @Override
        protected void reduce(Text key, Iterable<FlowBean> flowBeans, Context context)
                throws IOException, InterruptedException {
            long upFlowCount = 0;
            long downFlowCount = 0;
            for (FlowBean flowBean : flowBeans) {
                upFlowCount += flowBean.getUpFlow();
                downFlowCount += flowBean.getDownFlow();
            }
            v.set(upFlowCount, downFlowCount);
            context.write(key, v);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Where this job's jar lives
        job.setJarByClass(FlowSumProvince.class);
        // Mapper and Reducer used by this job
        job.setMapperClass(FlowSumProvinceMapper.class);
        job.setReducerClass(FlowSumProvinceReducer.class);
        // Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        // Reduce (i.e. final) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // One reduce task per partition returned by ProvincePartitioner (0-5)
        job.setNumReduceTasks(6);
        // Use the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);

        FileInputFormat.setInputPaths(job, new Path("D:\\flowsum\\input"));
        // Result directory
        FileOutputFormat.setOutputPath(job, new Path("D:\\flowsum\\outputProvince"));

        boolean res = job.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
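
    With setNumReduceTasks(6) and the partitioner above, the output directory should contain one part file per partition, roughly:

D:\flowsum\outputProvince\part-r-00000   prefix 134
D:\flowsum\outputProvince\part-r-00001   prefix 135
D:\flowsum\outputProvince\part-r-00002   prefix 136
D:\flowsum\outputProvince\part-r-00003   prefix 137
D:\flowsum\outputProvince\part-r-00004   prefix 138
D:\flowsum\outputProvince\part-r-00005   all other prefixes

    Note that the reduce task count has to cover every partition index the Partitioner can return: if it is smaller, records assigned to an out-of-range partition make the job fail at runtime; if it is larger, the extra part files are simply empty; and with a single reduce task the partitioner is effectively bypassed and everything ends up in one file.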