[Todo] Finding common friends & Spark & Hadoop interview questions

Time: 2022-01-09 09:08:18

I found this article and skimmed through its interview questions: "Some Spark and Hadoop interview questions (preparation)"

http://blog.csdn.net/qiezikuaichuan/article/details/51578743

One of the problems in it is quite good; see:

http://www.aboutyun.com/thread-18826-1-1.html

http://www.cnblogs.com/lucius/p/3483494.html

I think it's worth actually coding it up on Hadoop.

I also think the following passage from the first article sums things up well:

Briefly describe the data mining algorithms you know and their use cases

(I) Cases based on classification models

(1) Spam filtering — usually handled with Naive Bayes

(2) Tumor diagnosis in medicine — identified with a classification model

(II) Cases based on prediction models

(1) Judging red wine quality — classification and regression tree (CART) models to predict and judge the quality of the wine

(2) Search engine query volume and stock price movements

(III) Cases based on association analysis: Walmart's beer and diapers

(IV) Cases based on cluster analysis: retail customer segmentation

(V) Cases based on outlier analysis: transaction fraud detection in payments

(VI) Cases based on collaborative filtering: e-commerce "guess what you like" and recommendation engines

(VII) Cases based on social network analysis: seed customers in telecom

(VIII) Cases based on text analysis

(1) Character recognition: scanning apps such as 扫描王

(2) Literature and statistics: the authorship of Dream of the Red Chamber

For the common-friends problem above, I wrote a program and tried it out.

It lives in the IntelliJ project HadoopProj. It's a Maven project; the dependencies are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.hadoop.my</groupId>
    <artifactId>hadoop-proj</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
</project>

The code is as follows:

package com.hadoop.my;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by baidu on 16/12/3.
 */
public class HadoopProj {

    public static class CommonFriendsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like "A:B,C,D,F,E,O" (person : friend list)
            String line = value.toString();
            String[] split = line.split(":");
            String person = split[0];
            String[] friends = split[1].split(",");
            // Invert the relation: emit (friend, person) for every friend
            for (String f : friends) {
                context.write(new Text(f), new Text(person));
            }
        }
    }

    public static class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
        // Input:  <B -> A>, <B -> E>, <B -> F>, ...
        // Output: B    A,E,F,J
        @Override
        protected void reduce(Text friend, Iterable<Text> persons, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text person : persons) {
                sb.append(person).append(",");
            }
            context.write(friend, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Read and parse all xxx-site.xml config files on the classpath
        Configuration conf = new Configuration();
        Job friendJob = Job.getInstance(conf);
        // Locate the jar containing this job's code via the main class's classloader
        friendJob.setJarByClass(HadoopProj.class);
        // Mapper class for this job
        friendJob.setMapperClass(CommonFriendsMapper.class);
        // Reducer class for this job
        friendJob.setReducerClass(CommonFriendsReducer.class);
        // Key/value types emitted by the reducer
        friendJob.setOutputKeyClass(Text.class);
        friendJob.setOutputValueClass(Text.class);
        // Input path of the files this job should process
        FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
        // Output path for the job's result files
        FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));
        // Submit the job to the Hadoop cluster and wait for completion
        boolean res = friendJob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
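A quick way to sanity-check the parsing logic outside of Hadoop (a standalone sketch; the class name SplitDemo is made up for illustration) is to run the same tokenization on one input record:

public class SplitDemo {
    public static void main(String[] args) {
        // Same tokenization as CommonFriendsMapper.map()
        String line = "A:B,C,D,F,E,O";
        String[] split = line.split(":");
        String person = split[0];                      // "A"
        for (String f : split[1].split(",")) {
            // The real mapper writes (friend, person) to the MapReduce context
            System.out.println(f + "\t" + person);     // B  A, C  A, D  A, F  A, E  A, O  A
        }
    }
}

In other words, this first job simply inverts the friend lists: for each person, it collects everyone who lists that person as a friend.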

After packaging it into a jar, I copied it to the Hadoop machine m42n05.
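(Packaging is presumably just the standard Maven lifecycle, something like

$ mvn clean package

which would leave target/hadoop-proj-1.0-SNAPSHOT.jar under the project directory; it was evidently renamed to hadoop-proj.jar before copying.)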

On that machine I also created the input file, with the following contents:

A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J

Commands:

$ hadoop fs -mkdir /input/frienddata

$ hadoop fs -put text.txt /input/frienddata

$ hadoop fs -ls /input/frienddata
Found 1 items
-rw-r--r-- 3 work supergroup 142 2016-12-03 17:12 /input/frienddata/text.txt

Copied hadoop-proj.jar to /home/work/data/installed/hadoop-2.7.3/myjars on m42n05.

Ran the command:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/friddata

It errored out:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
16/12/03 17:19:52 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master.Hadoop:8390/input/frienddata already exists
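(The FileAlreadyExistsException itself is standard MapReduce behavior: the framework refuses to write into an output directory that already exists, so a genuinely stale output directory would have to be removed first, e.g. with hadoop fs -rm -r /output/frienddata. Here, though, the path it complains about is the input directory, which hints at an argument problem rather than a leftover output directory.)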

It looks like the indexing of the trailing command-line arguments was off. Note how the code is written:

// Input path of the files this job should process
FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
// Output path for the job's result files
FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));

In Java, unlike C/C++, the arguments really do start at index 0: the program name itself does not occupy a slot. And since the jar's manifest evidently already names the main class (the second attempt below works without it), passing com.hadoop.my.HadoopProj on the command line just shifts everything by one: args[0] became the class name and args[1] became /input/frienddata, which is exactly the path the "output directory already exists" error complains about.
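A tiny standalone check of that point (ArgsDemo is just an illustrative name):

public class ArgsDemo {
    public static void main(String[] args) {
        // Running "java ArgsDemo /input /output" prints args[0] = /input and args[1] = /output;
        // the program name itself never shows up in args.
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
    }
}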

So the class name probably shouldn't be passed at all. Re-running the command:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar /input/frienddata /output/frienddata

Got the following output:

16/12/03 17:24:33 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
16/12/03 17:24:33 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/03 17:24:34 INFO input.FileInputFormat: Total input paths to process : 1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: number of splits:1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478254572601_0002
16/12/03 17:24:34 INFO impl.YarnClientImpl: Submitted application application_1478254572601_0002
16/12/03 17:24:34 INFO mapreduce.Job: The url to track the job: http://master.Hadoop:8320/proxy/application_1478254572601_0002/
16/12/03 17:24:34 INFO mapreduce.Job: Running job: job_1478254572601_0002
16/12/03 17:24:40 INFO mapreduce.Job: Job job_1478254572601_0002 running in uber mode : false
16/12/03 17:24:40 INFO mapreduce.Job: map 0% reduce 0%
16/12/03 17:24:45 INFO mapreduce.Job: map 100% reduce 0%
16/12/03 17:24:49 INFO mapreduce.Job: map 100% reduce 100%
16/12/03 17:24:50 INFO mapreduce.Job: Job job_1478254572601_0002 completed successfully
16/12/03 17:24:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=348
FILE: Number of bytes written=238531
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=258
HDFS: Number of bytes written=156
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2651
Total time spent by all reduces in occupied slots (ms)=2446
Total time spent by all map tasks (ms)=2651
Total time spent by all reduce tasks (ms)=2446
Total vcore-milliseconds taken by all map tasks=2651
Total vcore-milliseconds taken by all reduce tasks=2446
Total megabyte-milliseconds taken by all map tasks=2714624
Total megabyte-milliseconds taken by all reduce tasks=2504704
Map-Reduce Framework
Map input records=14
Map output records=57
Map output bytes=228
Map output materialized bytes=348
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=14
Reduce shuffle bytes=348
Reduce input records=57
Reduce output records=14
Spilled Records=114
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=111
CPU time spent (ms)=1850
Physical memory (bytes) snapshot=455831552
Virtual memory (bytes) snapshot=4239388672
Total committed heap usage (bytes)=342360064
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=142
File Output Format Counters
Bytes Written=156

Check where the output files ended up:

$ hadoop fs -ls /output/frienddata
Found 2 items
-rw-r--r-- 3 work supergroup 0 2016-12-03 17:24 /output/frienddata/_SUCCESS
-rw-r--r-- 3 work supergroup 156 2016-12-03 17:24 /output/frienddata/part-r-00000

$ hadoop fs -cat /output/frienddata/part-r-00000
A I,K,C,B,G,F,H,O,D,
B A,F,J,E,
C A,E,B,H,F,G,K,
D G,C,K,A,L,F,E,H,
E G,M,L,H,A,F,B,D,
F L,M,D,C,G,A,
G M,
H O,
I O,C,
J O,
K B,
L D,E,
M E,F,
O A,H,I,J,F,

Of course, the output can also be merged down into a local file:

$ hdfs dfs -getmerge hdfs://master.Hadoop:8390/output/frienddata /home/work/frienddatatmp

$ cat frienddatatmp
A I,K,C,B,G,F,H,O,D,
B A,F,J,E,
C A,E,B,H,F,G,K,
D G,C,K,A,L,F,E,H,
E G,M,L,H,A,F,B,D,
F L,M,D,C,G,A,
G M,
H O,
I O,C,
J O,
K B,
L D,E,
M E,F,
O A,H,I,J,F,

That wraps up this problem.
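For completeness: the linked write-ups take this one step further with a second job that reads the output above and, for every pair of people appearing in the same line, emits the shared friend, so that its reducer ends up with the common friends of each pair. Roughly, and only as a sketch (not run here; PairMapper/PairReducer are names I made up, they would sit next to the classes above, need an extra import of java.util.Arrays, and assume the tab-separated key/value lines that the first job's default TextOutputFormat produces):

    public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // One line of the first job's output, e.g. "B\tA,F,J,E,"
            String[] split = value.toString().split("\t");
            String friend = split[0];
            String[] persons = split[1].split(",");
            Arrays.sort(persons);  // sort so that A-B and B-A become the same key
            for (int i = 0; i < persons.length - 1; i++) {
                for (int j = i + 1; j < persons.length; j++) {
                    context.write(new Text(persons[i] + "-" + persons[j]), new Text(friend));
                }
            }
        }
    }

    public static class PairReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pair, Iterable<Text> friends, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text f : friends) {
                sb.append(f).append(",");
            }
            // e.g. "A-B" -> "C,E," (C and E are friends of both A and B)
            context.write(pair, new Text(sb.toString()));
        }
    }

Wired into a second Job configured the same way as the first, with the first job's output directory as its input path, this would produce the pair-wise common friend lists that the original interview question asks for.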