Big Data Engineer Interview Questions (Part 6)

3.14
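1. A Hadoop environment integrates HBase and Hive. Is it necessary to configure separate compression policies for HDFS and HBase? Give your recommendations for the compression policy.
HDFS does not compress data on its own when storing it. If compression is wanted, the data can be compressed as it is uploaded to HDFS.
1) Use a compression stream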

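import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// Compress a file
public static void compress(String codecClassName) throws Exception{
    // Load the codec class by name, e.g. org.apache.hadoop.io.compress.GzipCodec
    Class<?> codecClass = Class.forName(codecClassName);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    CompressionCodec codec = (CompressionCodec)ReflectionUtils.newInstance(codecClass, conf);
    // Path of the compressed output file
    FSDataOutputStream outputStream = fs.create(new Path("/user/hadoop/"));
    // Path of the file to be compressed
    FSDataInputStream in = fs.open(new Path("/user/hadoop/"));
    // Wrap the raw output stream in a compressing stream
    CompressionOutputStream out = codec.createOutputStream(outputStream);
    IOUtils.copyBytes(in, out, conf);
    IOUtils.closeStream(in);
    IOUtils.closeStream(out);
}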
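
The stream-wrapping pattern used by compress can be tried without a cluster. The following is a minimal sketch under that assumption: it swaps Hadoop's CompressionCodec for the JDK's java.util.zip GZIP streams and works on in-memory byte arrays instead of HDFS paths. The class and method names (GzipStreamSketch, copy, compress, decompress) are illustrative and not part of any Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipStreamSketch {

    // Copy every byte from in to out through a fixed-size buffer.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    // Wrap the raw sink in a compressing stream, then copy the input into it.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new ByteArrayInputStream(raw);
             OutputStream out = new GZIPOutputStream(sink)) {
            copy(in, out);
        } // closing the GZIP stream flushes remaining data and writes the trailer
        return sink.toByteArray();
    }

    // Reverse direction: wrap the compressed input in a decompressing stream.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            copy(in, sink);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = new byte[4000];
        Arrays.fill(raw, (byte) 'a'); // highly repetitive sample data
        byte[] packed = compress(raw);
        System.out.println("compressed is smaller: " + (packed.length < raw.length));
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed)));
    }
}
```

As in the Hadoop version, the compressing stream must be closed before the output is complete, because the codec buffers data and writes its trailer on close.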

2) Use a SequenceFile

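import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public void testSeqWrite() throws Exception {
    Configuration conf = new Configuration();// create the configuration
    // The two property names are missing in the source. They are probably
    // fs.default.name (fs.defaultFS on newer Hadoop) and hadoop.job.ugi.
    conf.set("", "hdfs://master:9000");// default HDFS path
    conf.set("", "hadoop,hadoop");// user and group
    String uriin = "hdfs://master:9000/ceshi2/";// file path
    FileSystem fs = FileSystem.get(URI.create(uriin), conf);// create the FileSystem
    Path path = new Path("hdfs://master:9000/ceshi3/");// file name
    IntWritable k = new IntWritable();// key, equivalent to int
    Text v = new Text();// value, equivalent to String
    SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, path, k.getClass(), v.getClass());// create the writer
    for (int i = 1; i < 100; i++) {// append records in a loop
        k.set(i);
        v.set("abcd");
        w.append(k, v);
    }
    IOUtils.closeStream(w);// closing the writer flushes buffered records
    fs.close();
}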

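HBase is a column-family store with its own compression mechanism, which is configured per column family (for example GZ or Snappy), so no separate compression design is needed for HBase data at the HDFS level.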

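3. Briefly describe your approach to HBase performance tuning.

1) When designing tables, take the characteristics of the rowkey and the column families into account as far as possible.
2) Tune the HBase cluster: see the HBase tuning notes.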

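4. Briefly describe how HBase filters work. Drawing on real project experience, give a few scenarios where you used a filter.

An HBase filter is set on a Scan, so it filters the results that the scan would otherwise return. The filter is evaluated on the region servers, so non-matching rows are dropped before they reach the client.
1) When developing the order system, we used a rowkey filter to select all orders of a given user.
2) When developing the cloud notes application, we used a rowkey filter to restore Redis data.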

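5. How do you implement suffix matching on a ROWKEY? For example, if the ROWKEY has the form yyyyMMDD-UserID and the data must be queried by UserID, how is that done?
Use a rowkey filter: a RowFilter with a RegexStringComparator whose pattern anchors the UserID at the end of the key. Because the key starts with the date, the scan still reads every row unless a date range narrows it.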
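
To make the suffix match concrete, here is a minimal sketch of the kind of regular expression such a rowkey filter could apply, run against an in-memory list of keys rather than a real HBase table. The class name, the helper methods, and the sample keys are all hypothetical; only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RowKeySuffixSketch {

    // Matches "<anything>-<userId>" with the user ID anchored at the end of the key.
    // Pattern.quote keeps regex metacharacters in the ID from being interpreted.
    static Pattern suffixPattern(String userId) {
        return Pattern.compile(".*-" + Pattern.quote(userId) + "$");
    }

    // Stand-in for a filtered scan: keep only the keys whose suffix is the user ID.
    static List<String> scanBySuffix(List<String> rowKeys, String userId) {
        Pattern pattern = suffixPattern(userId);
        List<String> hits = new ArrayList<>();
        for (String key : rowKeys) {
            if (pattern.matcher(key).matches()) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical keys in yyyyMMDD-UserID form.
        List<String> rowKeys = Arrays.asList(
                "20140301-u1001",
                "20140301-u1002",
                "20140302-u1001",
                "20140302-u11001");
        System.out.println(scanBySuffix(rowKeys, "u1001"));
        // prints [20140301-u1001, 20140302-u1001]
    }
}
```

Including the "-" separator in the pattern matters: without it, a key ending in u11001 could be confused with other IDs that share trailing characters.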
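6. Briefly describe what virtual columns in Hive are for, and the caveats when using them.
Hive provides three virtual columns:
INPUT__FILE__NAME
BLOCK__OFFSET__INSIDE__FILE
ROW__OFFSET__INSIDE__BLOCK
ROW__OFFSET__INSIDE__BLOCK is not available by default; hive.exec.rowoffset must be set to true to enable it. The virtual columns can be used to track down problematic input data.
INPUT__FILE__NAME: the name of the input file of the mapper task.
BLOCK__OFFSET__INSIDE__FILE: the current global file offset. For block-compressed files, it is the file offset of the current block, that is, the offset of the first byte of the current block within the file.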
hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, line
> FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
har://file/user/hive/warehouse/hive_text/folder=docs/
/user/hive/warehouse/hive_text/folder=docs/ 2243
har://file/user/hive/warehouse/hive_text/folder=docs/
/user/hive/warehouse/hive_text/folder=docs/ 3646
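7. If you had to store a huge number of small files (each a few hundred KB to a few MB), briefly describe your design.
1) Pack the small files into HAR (Hadoop Archive) files.
2) Serialize the small files into HDFS as a SequenceFile, for example with the file name as the key and the file content as the value.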
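
The idea behind option 2 is to replace many small files with one large container of (name, content) records. The sketch below illustrates only that idea with JDK streams and an in-memory byte array; it is not the SequenceFile format, and the class name, method names, and record layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class SmallFilePackSketch {

    // Write all files into one container: a record count, then for each file
    // its name, its length, and its raw bytes.
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(files.size());
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());          // key: file name
                out.writeInt(entry.getValue().length); // length prefix
                out.write(entry.getValue());           // value: file content
            }
        }
        return sink.toByteArray();
    }

    // Read the container back into a name -> content map, preserving order.
    static Map<String, byte[]> unpack(byte[] container) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "first small file".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second small file".getBytes(StandardCharsets.UTF_8));
        Map<String, byte[]> restored = unpack(pack(files));
        System.out.println(restored.keySet()); // prints [a.txt, b.txt]
        System.out.println(new String(restored.get("b.txt"), StandardCharsets.UTF_8)); // prints second small file
    }
}
```

On HDFS the benefit comes from the NameNode tracking one large file instead of one metadata entry per small file.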
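8. There are two text files whose data is stored line by line. Write a MapReduce program that finds the lines that differ between the two files.
Write a chain of three MapReduce jobs linked by dependencies: the first processes the first file, the second processes the second file, and the third processes the output of the first two.
The first job de-duplicates the first file, the second de-duplicates the second file, and the third runs a word count over both outputs. Lines with a count of 1 are the ones that appear in only one file.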
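
The three-job chain can be checked on small inputs before writing any MapReduce code. The following sketch simulates it in memory with plain Java collections, assuming each file fits in a list; the class and method names are hypothetical and no Hadoop API is involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DistinctLinesSketch {

    static Set<String> linesInOnlyOneFile(List<String> fileA, List<String> fileB) {
        // Jobs 1 and 2: de-duplicate each file on its own, so a line repeated
        // inside one file is not mistaken for a line shared by both files.
        Set<String> uniqueA = new LinkedHashSet<>(fileA);
        Set<String> uniqueB = new LinkedHashSet<>(fileB);

        // Job 3: word count over the two de-duplicated outputs.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : uniqueA) {
            counts.merge(line, 1, Integer::sum);
        }
        for (String line : uniqueB) {
            counts.merge(line, 1, Integer::sum);
        }

        // A count of 1 means the line came from exactly one of the two files.
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fileA = Arrays.asList("x", "y", "y", "z");
        List<String> fileB = Arrays.asList("y", "z", "w");
        System.out.println(linesInOnlyOneFile(fileA, fileB)); // prints [w, x]
    }
}
```

The de-duplication step is what makes the count meaningful: without it, the repeated line y in the first file would reach a count of 3 rather than 2, and a line repeated only within one file would be wrongly excluded.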

4. Common friends

Find common friends with MapReduce. The data format is as follows:
usr:friend,friend,friend...
---------------
A:B,C,D,E,F
B:A,C,D,E
C:A,B,E
D:A,B,E
E:A,B,C,D
F:A
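The first letter is the person and the rest are that person's friends. Find the pairs of people who have common friends, and who those common friends are.
Approach: take A, whose friends are B, C, D, E and F. A is then a common friend of B and C, so emit BC as the key and A as the value on the map side. Loop over the remaining pairs of friends in the same way.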

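import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;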
 
public class FindFriend {   
 
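    public static class ChangeMapper extends Mapper<Object, Text, Text,Text>{
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split "A:B,C,D" on ':' and ',' so the tokens are A, B, C, D
            StringTokenizer itr = new StringTokenizer(value.toString(), ":,");
            Text owner = new Text();
            // TreeSet keeps the friends sorted, so each pair always yields the same key (BC, never CB)
            Set<String> set = new TreeSet<String>();
            owner.set(itr.nextToken());
            while (itr.hasMoreTokens()) {
                set.add(itr.nextToken());
            }
            String[] friends = new String[set.size()];
            friends = set.toArray(friends);
            // Emit every pair of friends as the key, with the owner as their common friend
            for(int i=0;i<friends.length;i++){
                for(int j=i+1;j<friends.length;j++){
                    String outputkey = friends[i]+friends[j];
                    context.write(new Text(outputkey),owner);
                }
            }
        }
    }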
 
    public static class FindReducer extends Reducer<Text,Text,Text,Text>{
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,InterruptedException {
            // Join all common friends of this pair with ':'
            String  commonfriends ="";
            for (Text val : values) {
                if(commonfriends.isEmpty()){
                    commonfriends = val.toString();
                }else{
                    commonfriends = commonfriends+":"+val.toString();
                }
            }
            context.write(key, new Text(commonfriends));
        }
    }
 
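    public static void main(String[] args) throws IOException,InterruptedException, ClassNotFoundException{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("args error");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(FindFriend.class);
        job.setMapperClass(ChangeMapper.class);
        job.setCombinerClass(FindReducer.class);
        job.setReducerClass(FindReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // All arguments except the last are input paths; the last is the output path
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job,new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }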
}

Result:

AB E:C:D
AC E:B
AD B:E
AE C:B:D
BC A:E
BD A:E
BE C:D:A
BF A
CD E:A:B
CE A:B
CF A
DE B:A
DF A
EF A

5. Base station dwell time

Requirement:

Expected output:

Approach:

Load the data into a Hive table and, when querying, sort by phone number and time.

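6. Script-based replacement

Script (the file name is arbitrary):

#!/bin/bash
# Replace every $HADOOP_HOME$ placeholder with /home/aa in each file under directory $1
ls "$1" | while read -r line
do
    sed -i 's,\$HADOOP_HOME\$,\/home\/aa,g' "$1$line"
    echo "$1$line"
done

Command to run the script, replacing in all files under /home/hadoop/test/:
./ /home/hadoop/test/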

7. One-command execution

Script:

#!/bin/bash
# Run the command passed as $1 on each slave node over ssh
ssh -q hadoop@slave1 "$1"
ssh -q hadoop@slave2 "$1"

Command to run:

./ "ls -l"

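8. Big data interview roundup

1. Explain the basic MapReduce workflow.
Cover the job submission flow and the job execution flow.
2. How do you import data from your database into Hive, and did you run into any problems?
We import with Sqoop. Our database schema included text columns, which caused the import to run out of buffer space (see the cloud notes). It looked tricky at first, but after reading the Sqoop documentation we added the limit property, which solved it.
3. The company may choose Storm for real-time computation. Explain Storm.
Cover Storm's use cases, how the code is written, and how it runs.
4. Java collection data structures, for example HashMap.
See the Java interview handbook.
5. Do you know what is in the concurrent package, for example ConcurrentHashMap?
See the Java interview handbook.
6. The company has recently focused on natural language processing development. Have you worked with it?
No, I have not used it.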

秒客网