I. Hadoop Streaming and Python
Compared with the Java-based MapReduce programming framework introduced earlier, Hadoop Streaming is an alternative form of MapReduce programming. It lets map tasks and reduce tasks read and write their data through standard input and output, one line at a time. Any program that can read and write via stdin/stdout can be used with Hadoop Streaming, which means you can write your jobs in dynamic scripting languages such as Python or Ruby. Compared with Java, the advantage is that you can try out ideas quickly; the drawbacks are runtime performance and the lack of static type checking. So in the early analysis and modeling phase we can use Streaming + Python for development speed, and switch to Java in the production system to guarantee performance. Below is a Streaming + Python implementation of WordCount.
1. mapper.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""a python script for hadoop streaming map """
import sys

def map(input):
    for line in input:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)

def main():
    map(sys.stdin)

if __name__ == "__main__":
    main()
2. reducer.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""a python script for hadoop streaming reduce """
import sys

def reduce(input):
    current_word = None
    current_count = 0
    word = None
    for line in input:
        line = line.strip()
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print '%s\t%s' % (current_word, current_count)
            current_count = count
            current_word = word
    if current_word == word:
        print '%s\t%s' % (current_word, current_count)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()
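As a quick sanity check of the grouping logic, you can import the script and feed it a few already-sorted lines by hand. This is a minimal sketch, assuming reducer.py sits in the current directory and is run with the same Python used for the job:
# quick_check.py -- a tiny sanity check for reducer.py's consecutive-key grouping
# (hypothetical helper, not part of the job itself)
import reducer

# the input must already be sorted by key, exactly as it is after Hadoop's shuffle or `sort`
reducer.reduce(iter(["hello\t1", "hello\t1", "world\t1"]))
# expected output:
# hello   2
# world   1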
3. exec_streaming.sh
This script consists of three commands.
#!/bin/sh
hadoop dfs -rmr streaming_out
hadoop jar /home/hadoop/cloud/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input test2.txt -output streaming_out
hadoop dfs -cat streaming_out/part-00000
The first command deletes the streaming_out directory, so the job does not fail because the output directory already exists.
The second command submits the job. The path /home/hadoop/cloud/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar needs to be adjusted to your installation directory. -file mapper.py tells Hadoop to distribute mapper.py, and -mapper mapper.py tells Hadoop to use mapper.py as the map program; -file reducer.py distributes reducer.py, and -reducer reducer.py uses reducer.py as the reduce program; -input test2.txt specifies the input file (test2.txt was uploaded to HDFS earlier), and -output streaming_out specifies streaming_out as the output directory.
The third command shows the result.
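Before submitting a job to the cluster, it can help to simulate the whole map, sort, reduce pipeline on a small sample. The following is a minimal sketch, not part of the original workflow: it assumes mapper.py and reducer.py are executable in the current directory and that sample.txt is a small local test file; the in-memory sort only mimics Hadoop's shuffle.
#!/usr/bin/python
# -*- coding: utf-8 -*-
# local_pipeline.py -- simulate the Streaming pipeline (map -> sort -> reduce) locally.
# This is only a debugging aid; it does not reproduce Hadoop's partitioning or scale.
import subprocess

def run_pipeline(input_path, mapper="./mapper.py", reducer="./reducer.py"):
    with open(input_path) as f:
        # run the mapper over the raw input, exactly as Hadoop Streaming would
        map_out = subprocess.Popen([mapper], stdin=f, stdout=subprocess.PIPE,
                                   universal_newlines=True).communicate()[0]
    # Hadoop sorts the map output by key before the reduce phase; sorted() stands in for that
    shuffled = "".join(sorted(map_out.splitlines(True)))
    # feed the sorted records to the reducer
    proc = subprocess.Popen([reducer], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            universal_newlines=True)
    return proc.communicate(shuffled)[0]

if __name__ == "__main__":
    print(run_pipeline("sample.txt"))  # sample.txt is a hypothetical small test file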
II. Programming Example: Patent Citation Data Analysis
This is the example from Chapter 4 of Hadoop in Action. I first implement map and reduce in Python, and then again in Java.
1. Format of cite75_99.txt
The fields are comma-separated and the first row is a header. There are two columns: the first is the citing patent number and the second is the cited patent number. The first data row, 3858241,956203, means that patent 3858241 cites patent 956203. The whole file has more than 16 million rows. (A short sketch for peeking at the file locally follows the sample below.)
"CITING","CITED"
3858241,956203
3858241,1324234
3858241,3398406
3858241,3557384
3858241,3634889
3858242,1515701
3858242,3319261
3858242,3668705
3858242,3707004
3858243,2949611
3858243,3146465
3858243,3156927
3858243,3221341
3858243,3574238
3858243,3681785
3858243,3684611
3858244,14040
3858244,17445
3858244,2211676
3858244,2635670
3858244,2838924
3858244,2912700
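A few lines of Python are enough to peek at the file and confirm the two-column layout before writing the mapper. A small sketch, assuming cite75_99.txt has been downloaded into the working directory:
# peek.py -- print the first few lines of cite75_99.txt to confirm the format
with open("cite75_99.txt") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line.strip())   # header "CITING","CITED", then rows like 3858241,956203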
2. Inverting the citation data
The output format is shown below: the first column is the cited patent, and the second column lists the patents that cite it.
100000	5031388
1000006	4714284
1000007	4766693
1000011	5033339
1000017	3908629
1000026	4043055
1000033	4190903,4975983
1000043	4091523
1000044	4055371,4082383
10000	4539112
1000045	4290571
1000046	5525001,5918892
1000049	5996916
1000051	4541310
1000054	4946631
1000065	4748968
1000067	4944640,5071294,5312208
1000070	4928425,5009029
1000073	4107819,5474494
1000076	4867716,5845593
1000083	5322091,5566726
1000084	4182197,4683770
1000086	4178246,4217220,4686189,4839046
1000089	5277853,5395228,5503546,5505607,5505610,5505611,5540869,5544405,5571464,5807591
1000094	4897975,4920718,5713167
1000102	5120183,5791855
2.1 mapper.py
You can write it by following the WordCount template; the statement print '%s\t%s' % (words[1], words[0]) performs the inversion by swapping the citing and cited columns.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def map(input):
    for line in input:
        line = line.strip()
        words = line.split(',')
        if len(words) == 2:
            print '%s\t%s' % (words[1], words[0])

def main():
    map(sys.stdin)

if __name__ == "__main__":
    main()
2.2 reducer.py
Take the reducer.py from WordCount above and modify it as needed. Here current_key is the cited patent number and current_value is the list of patents citing it, joined with commas; the key and the value are separated by \t.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def reduce(input):
    current_key = None
    current_value = None
    key = None
    for line in input:
        line = line.strip()
        key, value = line.split('\t')
        if current_key == key:
            current_value += (',' + value)
        else:
            if current_key:
                print '%s\t%s' % (current_key, current_value)
            current_value = value
            current_key = key
    if current_key == key:
        print '%s\t%s' % (current_key, current_value)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()
2.3 Local verification
Even with 16 million rows, the speed is acceptable. Paging through the whole of output.txt with more is too slow; looking at the first rows is enough. Note that you must sort between the mapper and the reducer with the sort command, otherwise the result is wrong (see the sketch after the transcript below).
$ wc -l cite75_99.txt
16522439 cite75_99.txt
$ cat cite75_99.txt | ./mapper.py | sort | ./reducer.py > output.txt
$ wc -l output.txt
3276728 output.txt
$ more output.txt
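The reason the sort step is mandatory is that the streaming reducer only merges identical keys that arrive next to each other. The toy sketch below (values taken from the sample data above) shows what happens with and without sorting:
# why sort matters: the reducer pattern only merges *adjacent* identical keys
def group(lines):
    # the same consecutive-key grouping used by reducer.py
    current, acc = None, []
    for line in lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                print("%s\t%s" % (current, ",".join(acc)))
            current, acc = key, []
        acc.append(value)
    if current is not None:
        print("%s\t%s" % (current, ",".join(acc)))

pairs = ["956203\t3858241", "1324234\t3858241", "956203\t3858242"]
group(pairs)          # unsorted: 956203 is printed twice, which is wrong
group(sorted(pairs))  # sorted: the two citers of 956203 are merged into one line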
2.4 Writing the script and running it on Hadoop
exec_streaming.sh differs from the WordCount script because I set up a new pseudo-distributed environment on a single Ubuntu virtual machine, so that I no longer have to start three virtual machines during development.
#!/bin/sh
hadoop dfs -rmr /examples/patent/streaming_out
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /examples/patent/cite75_99.txt -output /examples/patent/streaming_out
hadoop dfs -cat /examples/patent/streaming_out/part-00000
3. Counting and histogram
The previous step produced, for each patent, the list of patents citing it. A basic statistic is a count, and a small modification of reducer.py implements it quickly. The program reducer_count.py is as follows.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def reduce(input):
    current_key = None
    current_count = 0
    key = None
    for line in input:
        line = line.strip()
        key, value = line.split('\t')
        count = len(value.split(','))
        if current_key == key:
            current_count += count
        else:
            if current_key:
                print '%s\t%s' % (current_key, current_count)
            current_count = count
            current_key = key
    if current_key == key:
        print '%s\t%s' % (current_key, current_count)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()
The counting output file output_count.txt looks like this:
100000 1
1000006 1
1000007 1
1000011 1
1000017 1
1000026 1
1000033 2
1000043 1
1000044 2
10000 1
1000045 1
1000046 2
1000049 1
1000051 1
1000054 1
1000065 1
1000067 3
1000070 2
1000073 2
1000076 2
1000083 2
1000084 2
1000086 4
1000089 10
Using output_count.txt as input, we can go one step further and compute a histogram. mapper_histogram.py is as follows.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def map(input):
    for line in input:
        line = line.strip()
        words = line.split('\t')
        if len(words) == 2:
            print '%s\t%s' % (words[1], 1)

def main():
    map(sys.stdin)

if __name__ == "__main__":
    main()
reducer_histogram.py is as follows.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

def reduce(input):
    current_key = None
    current_count = 0
    key = None
    for line in input:
        line = line.strip()
        words = line.split('\t')
        key = words[0]
        count = 0
        if len(words) == 2:
            count = 1
        if current_key == key:
            current_count += count
        else:
            if current_key:
                print '%s\t%s' % (current_key, current_count)
            current_count = count
            current_key = key
    if current_key == key:
        print '%s\t%s' % (current_key, current_count)

def main():
    reduce(sys.stdin)

if __name__ == "__main__":
    main()
Run the pipeline with: cat output_count.txt | ./mapper_histogram.py | sort -k1 -n | ./reducer_histogram.py > output_histogram.txt
This yields the histogram below. The first column is the number of times a patent was cited and the second column is the number of patents; the first row means that 942232 patents were cited exactly once. (A local cross-check in Python follows the listing.)
1	942232
2	551843
3	379462
4	277848
5	210438
6	162891
7	127743
8	102050
9	82048
10	66578
11	53835
12	44966
13	37055
14	31178
15	26208
16	22024
17	18896
18	16123
19	13697
20	11856
21	10348
22	9028
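Since output_count.txt is only about 3.3 million lines, the histogram can also be cross-checked on a single machine. A sketch, assuming output_count.txt has been copied to the local disk:
# histogram_check.py -- recompute the histogram locally from output_count.txt
from collections import Counter

histogram = Counter()
with open("output_count.txt") as f:
    for line in f:
        patent, count = line.split()       # key and citation count, whitespace-separated
        histogram[int(count)] += 1          # how many patents were cited this many times

for times_cited in sorted(histogram):
    print("%d\t%d" % (times_cited, histogram[times_cited]))
# the first line should agree with the MapReduce result above, i.e. 1  942232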
4. Java implementation
4.1 Inverting the patent citations
PatentInvert.java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PatentInvert
{
    public static class PatentMapper extends Mapper<Object, Text, Text, Text>
    {
        private Text key2 = new Text();
        private Text value2 = new Text();

        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split(",");
            if (words != null && words.length == 2)
            {
                key2.set(words[1]);
                value2.set(words[0]);
                context.write(key2, value2);
            }
        }
    }

    public static class PatentReducer extends Reducer<Text, Text, Text, Text>
    {
        public void reduce(Text key2, Iterable<Text> values2, Context context) throws IOException, InterruptedException
        {
            Text key3 = key2;
            String value3 = "";
            for (Text val : values2)
            {
                if (value3.length() > 0)
                    value3 += ",";
                value3 += val.toString();
            }
            context.write(key3, new Text(value3));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent invert");
        job.setJarByClass(PatentInvert.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
build.sh
I added an argument to the build script to make it more generic; it also prints a message if the file to compile does not exist.
#!/bin/sh
HADOOP_LIB_DIR=/usr/local/hadoop/share/hadoop
FILE_NAME=PatentInvert
if [ $# -eq 1 ]; then
FILE_NAME=$1
fi
rm -f ./*.class
rm -f ./${FILE_NAME}.jar
if [ -f ./${FILE_NAME}.java ]; then
javac -classpath $HADOOP_LIB_DIR/common/hadoop-common-2.6.0.jar:$HADOOP_LIB_DIR/common/lib/commons-cli-1.2.jar:$HADOOP_LIB_DIR/common/lib/hadoop-annotations-2.6.0.jar:$HADOOP_LIB_DIR/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d . ./${FILE_NAME}.java
#package
jar -cvf ${FILE_NAME}.jar ./*.class
else
echo "${FILE_NAME}.java does not exist !"
fi
Compile and package: build.sh
Run: hadoop jar PatentInvert.jar PatentInvert /examples/patent/cite75_99.txt /examples/patent/out_invert
View the result: hadoop dfs -cat /examples/patent/out_invert/part-r-00000
Viewing the result takes patience; I only looked at the beginning and then interrupted it.
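To convince yourself that the Java job produces the same inversion as the Streaming job, you can fetch both outputs with hadoop dfs -get and compare them locally. A sketch only; the local file names streaming_invert.txt and java_invert.txt are placeholders, and the citers on each line are compared as sets because their order within a line is not guaranteed:
# compare_invert.py -- check that the Streaming and Java invert outputs agree
def load(path):
    result = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            result[key] = set(value.split(","))   # order of citers per line may differ
    return result

a = load("streaming_invert.txt")   # placeholder name for the fetched Streaming output
b = load("java_invert.txt")        # placeholder name for the fetched Java output
print("identical" if a == b else "different")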
4.2 Counting the citations
PatentCount.java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PatentCount
{
    public static class PatentMapper extends Mapper<Object, Text, IntWritable, IntWritable>
    {
        private IntWritable key2 = new IntWritable(0);
        private IntWritable value2 = new IntWritable(1);

        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split(",");
            if (words != null && words.length == 2)
            {
                try
                {
                    key2.set(Integer.parseInt(words[1].trim()));
                    context.write(key2, value2);
                } catch (Exception e) {
                    context.write(new IntWritable(0), value2);
                }
            }
        }
    }

    public static class PatentReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
    {
        public void reduce(IntWritable key2, Iterable<IntWritable> values2, Context context) throws IOException, InterruptedException
        {
            IntWritable key3 = key2;
            IntWritable value3 = new IntWritable(0);
            int total = 0;
            for (IntWritable val : values2)
            {
                total++;
            }
            value3.set(total);
            context.write(key3, value3);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent count");
        job.setJarByClass(PatentCount.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile and package: build.sh PatentCount
Run: hadoop jar PatentCount.jar PatentCount /examples/patent/cite75_99.txt /examples/patent/out_count
View the result: hadoop dfs -cat /examples/patent/out_count/part-r-00000
4.3 Citation histogram
PatentHistogram.java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PatentHistogram
{
    public static class PatentMapper extends Mapper<Object, Text, IntWritable, IntWritable>
    {
        private IntWritable key2 = new IntWritable(0);
        private IntWritable value2 = new IntWritable(1);

        public void map(Object key1, Text value1, Context context) throws IOException, InterruptedException
        {
            String[] words = value1.toString().split("\t");
            if (words != null && words.length == 2)
            {
                try
                {
                    key2.set(Integer.parseInt(words[1].trim()));
                    context.write(key2, value2);
                } catch (Exception e) {
                    context.write(new IntWritable(0), value2);
                }
            }
        }
    }

    public static class PatentReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
    {
        public void reduce(IntWritable key2, Iterable<IntWritable> values2, Context context) throws IOException, InterruptedException
        {
            IntWritable key3 = key2;
            IntWritable value3 = new IntWritable(0);
            int total = 0;
            for (IntWritable val : values2)
            {
                total++;
            }
            value3.set(total);
            context.write(key3, value3);
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Patent histogram");
        job.setJarByClass(PatentHistogram.class);
        job.setMapperClass(PatentMapper.class);
        job.setReducerClass(PatentReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compile and package: build.sh PatentHistogram
Rename the count output so it can serve as the input: hadoop dfs -mv /examples/patent/out_count/part-r-00000 /examples/patent/cite75_99_count.txt
Run: hadoop jar PatentHistogram.jar PatentHistogram /examples/patent/cite75_99_count.txt /examples/patent/out_histogram
View the result: hadoop dfs -cat /examples/patent/out_histogram/part-r-00000
III. Takeaways
Using Streaming + Python as a development and debugging tool is indeed convenient. In this example, even though the raw data has more than 16 million records, processing it directly with the Python scripts was still acceptably fast; apparently the data volume is too small for Hadoop's advantage to show. As long as you understand the Mapper and Reducer classes, and understand and remember WordCount, the most basic example, getting started with MapReduce programming by following that pattern is not hard. What you need to override are the map method in Mapper and the reduce method in Reducer:
public class Mapper<K1, V1, K2, V2>
{
    void map(K1 key, V1 value, Mapper.Context context) throws IOException, InterruptedException
    { ... }
}

public class Reducer<K2, V2, K3, V3>
{
    void reduce(K2 key, Iterable<V2> values, Reducer.Context context) throws IOException, InterruptedException
    { ... }
}
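To make the template concrete outside of Hadoop, the whole programming model fits in a few lines of Python: a map function emits (key, value) pairs, the framework groups the pairs by key, and a reduce function folds each group. This is a sketch of the model only, not of Hadoop's implementation:
# mapreduce_model.py -- the MapReduce programming model in miniature
from itertools import groupby
from operator import itemgetter

def map_func(line):                  # plays the role of Mapper.map: (K1, V1) -> [(K2, V2)]
    for word in line.split():
        yield word, 1

def reduce_func(key, values):        # plays the role of Reducer.reduce: (K2, [V2]) -> (K3, V3)
    return key, sum(values)

def run(lines):
    pairs = [kv for line in lines for kv in map_func(line)]
    pairs.sort(key=itemgetter(0))    # the "shuffle": bring identical keys together
    for key, group in groupby(pairs, key=itemgetter(0)):
        print("%s\t%d" % reduce_func(key, (v for _, v in group)))

run(["hello world", "hello hadoop"])
# hadoop  1
# hello   2
# world   1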