[Todo] Finding common friends & Spark & Hadoop interview questions

Time: 2022-01-09 09:08:18

I found this article and skimmed through its interview questions: "Some Spark and Hadoop interview questions (preparation)"

http://blog.csdn.net/qiezikuaichuan/article/details/51578743

One of the problems in it is quite good; see:

http://www.aboutyun.com/thread-18826-1-1.html

http://www.cnblogs.com/lucius/p/3483494.html

I think it's worth actually coding it up on Hadoop.

I also think the following passage from the first article sums things up well:

Briefly describe the data mining algorithms you know and their use cases

(I) Cases based on classification models

(1) Spam filtering — usually handled with Naive Bayes

(2) Tumor diagnosis in medicine — identified with a classification model

(II) Cases based on prediction models

(1) Judging red wine quality — classification and regression tree (CART) models to predict and judge the quality of the wine

(2) Search engine query volume and stock price movements

(III) Cases based on association analysis: Walmart's beer and diapers

(IV) Cases based on cluster analysis: retail customer segmentation

(V) Cases based on outlier analysis: transaction fraud detection in payments

(VI) Cases based on collaborative filtering: e-commerce "guess what you like" and recommendation engines

(VII) Cases based on social network analysis: seed customers in telecom

(VIII) Cases based on text analysis

(1) Character recognition: scanning apps such as 扫描王

(2) Literature and statistics: the authorship of Dream of the Red Chamber

For the common-friends problem above, I wrote a program and tried it out.

It lives in the IntelliJ project HadoopProj. It's a Maven project; the dependencies are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.hadoop.my</groupId>
    <artifactId>hadoop-proj</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
</project>

The code is as follows:

package com.hadoop.my;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Created by baidu on 16/12/3.
 */
public class HadoopProj {

    public static class CommonFriendsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like "A:B,C,D,F,E,O" (person : friend list)
            String line = value.toString();
            String[] split = line.split(":");
            String person = split[0];
            String[] friends = split[1].split(",");
            // Invert the relation: emit (friend, person) for every friend
            for (String f : friends) {
                context.write(new Text(f), new Text(person));
            }
        }
    }

    public static class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
        // Input:  <B -> A>, <B -> E>, <B -> F>, ...
        // Output: B    A,E,F,J
        @Override
        protected void reduce(Text friend, Iterable<Text> persons, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text person : persons) {
                sb.append(person).append(",");
            }
            context.write(friend, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Read and parse all xxx-site.xml config files on the classpath
        Configuration conf = new Configuration();
        Job friendJob = Job.getInstance(conf);
        // Locate the jar containing this job's code via the main class's classloader
        friendJob.setJarByClass(HadoopProj.class);
        // Mapper class for this job
        friendJob.setMapperClass(CommonFriendsMapper.class);
        // Reducer class for this job
        friendJob.setReducerClass(CommonFriendsReducer.class);
        // Key/value types emitted by the reducer
        friendJob.setOutputKeyClass(Text.class);
        friendJob.setOutputValueClass(Text.class);
        // Input path of the files this job should process
        FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
        // Output path for the job's result files
        FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));
        // Submit the job to the Hadoop cluster and wait for completion
        boolean res = friendJob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
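A quick way to sanity-check the parsing logic outside of Hadoop (a standalone sketch; the class name SplitDemo is made up for illustration) is to run the same tokenization on one input record:

public class SplitDemo {
    public static void main(String[] args) {
        // Same tokenization as CommonFriendsMapper.map()
        String line = "A:B,C,D,F,E,O";
        String[] split = line.split(":");
        String person = split[0];                      // "A"
        for (String f : split[1].split(",")) {
            // The real mapper writes (friend, person) to the MapReduce context
            System.out.println(f + "\t" + person);     // B  A, C  A, D  A, F  A, E  A, O  A
        }
    }
}

In other words, this first job simply inverts the friend lists: for each person, it collects everyone who lists that person as a friend.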

After packaging it into a jar, I copied it to the Hadoop machine m42n05.
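(Packaging is presumably just the standard Maven lifecycle, something like

$ mvn clean package

which would leave target/hadoop-proj-1.0-SNAPSHOT.jar under the project directory; it was evidently renamed to hadoop-proj.jar before copying.)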

On that machine I also created the input file, with the following contents:

A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J

Commands:

$ hadoop fs -mkdir /input/frienddata

$ hadoop fs -put text.txt /input/frienddata

$ hadoop fs -ls /input/frienddata
Found 1 items
-rw-r--r-- 3 work supergroup 142 2016-12-03 17:12 /input/frienddata/text.txt

Copied hadoop-proj.jar to /home/work/data/installed/hadoop-2.7.3/myjars on m42n05.

Ran the command:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/friddata

It errored out:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
16/12/03 17:19:52 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master.Hadoop:8390/input/frienddata already exists
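(The FileAlreadyExistsException itself is standard MapReduce behavior: the framework refuses to write into an output directory that already exists, so a genuinely stale output directory would have to be removed first, e.g. with hadoop fs -rm -r /output/frienddata. Here, though, the path it complains about is the input directory, which hints at an argument problem rather than a leftover output directory.)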

It looks like the indexing of the trailing command-line arguments was off. Note how the code is written:

// Input path of the files this job should process
FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
// Output path for the job's result files
FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));

In Java, unlike C/C++, the arguments really do start at index 0: the program name itself does not occupy a slot. And since the jar's manifest evidently already names the main class (the second attempt below works without it), passing com.hadoop.my.HadoopProj on the command line just shifts everything by one: args[0] became the class name and args[1] became /input/frienddata, which is exactly the path the "output directory already exists" error complains about.
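A tiny standalone check of that point (ArgsDemo is just an illustrative name):

public class ArgsDemo {
    public static void main(String[] args) {
        // Running "java ArgsDemo /input /output" prints args[0] = /input and args[1] = /output;
        // the program name itself never shows up in args.
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
    }
}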

So the class name probably shouldn't be passed at all. Re-running the command:

$ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar /input/frienddata /output/frienddata

Got the following output:

16/12/03 17:24:33 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
16/12/03 17:24:33 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/12/03 17:24:34 INFO input.FileInputFormat: Total input paths to process : 1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: number of splits:1
16/12/03 17:24:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478254572601_0002
16/12/03 17:24:34 INFO impl.YarnClientImpl: Submitted application application_1478254572601_0002
16/12/03 17:24:34 INFO mapreduce.Job: The url to track the job: http://master.Hadoop:8320/proxy/application_1478254572601_0002/
16/12/03 17:24:34 INFO mapreduce.Job: Running job: job_1478254572601_0002
16/12/03 17:24:40 INFO mapreduce.Job: Job job_1478254572601_0002 running in uber mode : false
16/12/03 17:24:40 INFO mapreduce.Job: map 0% reduce 0%
16/12/03 17:24:45 INFO mapreduce.Job: map 100% reduce 0%
16/12/03 17:24:49 INFO mapreduce.Job: map 100% reduce 100%
16/12/03 17:24:50 INFO mapreduce.Job: Job job_1478254572601_0002 completed successfully
16/12/03 17:24:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=348
FILE: Number of bytes written=238531
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=258
HDFS: Number of bytes written=156
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2651
Total time spent by all reduces in occupied slots (ms)=2446
Total time spent by all map tasks (ms)=2651
Total time spent by all reduce tasks (ms)=2446
Total vcore-milliseconds taken by all map tasks=2651
Total vcore-milliseconds taken by all reduce tasks=2446
Total megabyte-milliseconds taken by all map tasks=2714624
Total megabyte-milliseconds taken by all reduce tasks=2504704
Map-Reduce Framework
Map input records=14
Map output records=57
Map output bytes=228
Map output materialized bytes=348
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=14
Reduce shuffle bytes=348
Reduce input records=57
Reduce output records=14
Spilled Records=114
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=111
CPU time spent (ms)=1850
Physical memory (bytes) snapshot=455831552
Virtual memory (bytes) snapshot=4239388672
Total committed heap usage (bytes)=342360064
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=142
File Output Format Counters
Bytes Written=156

Check where the output files ended up:

$ hadoop fs -ls /output/frienddata
Found 2 items
-rw-r--r-- 3 work supergroup 0 2016-12-03 17:24 /output/frienddata/_SUCCESS
-rw-r--r-- 3 work supergroup 156 2016-12-03 17:24 /output/frienddata/part-r-00000

$ hadoop fs -cat /output/frienddata/part-r-00000
A I,K,C,B,G,F,H,O,D,
B A,F,J,E,
C A,E,B,H,F,G,K,
D G,C,K,A,L,F,E,H,
E G,M,L,H,A,F,B,D,
F L,M,D,C,G,A,
G M,
H O,
I O,C,
J O,
K B,
L D,E,
M E,F,
O A,H,I,J,F,

Of course, the output can also be merged down into a local file:

$ hdfs dfs -getmerge hdfs://master.Hadoop:8390/output/frienddata /home/work/frienddatatmp

$ cat frienddatatmp
A I,K,C,B,G,F,H,O,D,
B A,F,J,E,
C A,E,B,H,F,G,K,
D G,C,K,A,L,F,E,H,
E G,M,L,H,A,F,B,D,
F L,M,D,C,G,A,
G M,
H O,
I O,C,
J O,
K B,
L D,E,
M E,F,
O A,H,I,J,F,

That wraps up this problem.
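For completeness: the linked write-ups take this one step further with a second job that reads the output above and, for every pair of people appearing in the same line, emits the shared friend, so that its reducer ends up with the common friends of each pair. Roughly, and only as a sketch (not run here; PairMapper/PairReducer are names I made up, they would sit next to the classes above, need an extra import of java.util.Arrays, and assume the tab-separated key/value lines that the first job's default TextOutputFormat produces):

    public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // One line of the first job's output, e.g. "B\tA,F,J,E,"
            String[] split = value.toString().split("\t");
            String friend = split[0];
            String[] persons = split[1].split(",");
            Arrays.sort(persons);  // sort so that A-B and B-A become the same key
            for (int i = 0; i < persons.length - 1; i++) {
                for (int j = i + 1; j < persons.length; j++) {
                    context.write(new Text(persons[i] + "-" + persons[j]), new Text(friend));
                }
            }
        }
    }

    public static class PairReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pair, Iterable<Text> friends, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text f : friends) {
                sb.append(f).append(",");
            }
            // e.g. "A-B" -> "C,E," (C and E are friends of both A and B)
            context.write(pair, new Text(sb.toString()));
        }
    }

Wired into a second Job configured the same way as the first, with the first job's output directory as its input path, this would produce the pair-wise common friend lists that the original interview question asks for.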