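[Hive_12] Hive User-Defined Functions

0. Overview

　　UDF 　　//user-defined function
　　　　　　//one row in, one row out, similar to format_number(age,'000')

　　UDTF 　　//user-defined table-generating function
　　　　　　 //one row in, multiple rows out, similar to explode(array)

　　UDAF 　　//user-defined aggregate function
　　　　　　 //multiple rows in, one row out, similar to sum(xxx)

　　The demo in this article uses a UDF to parse the temptags data in Hive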

1. UDF

　　1.1 Code example

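　　A minimal sketch of what a simple UDF could look like, written as plain Java so that it runs without Hive on the classpath. The class name ZeroPadSketch, the evaluate signature, and the zero-padding behaviour are hypothetical; they only mirror the format_number(age,'000') comparison from the overview. In a real project the class would typically extend org.apache.hadoop.hive.ql.exec.UDF, and Hive would find evaluate by reflection.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical stand-in for a simple Hive UDF: one value in, one value out.
// Assumption: in Hive this class would extend org.apache.hadoop.hive.ql.exec.UDF.
class ZeroPadSketch {

    // Formats the value with a DecimalFormat pattern such as "000".
    // Returns null for null input, because Hive passes SQL NULL as Java null.
    public String evaluate(Integer value, String pattern) {
        if (value == null || pattern == null) {
            return null;
        }
        DecimalFormat format =
                new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(value.longValue());
    }

    public static void main(String[] args) {
        ZeroPadSketch udf = new ZeroPadSketch();
        System.out.println(udf.evaluate(7, "000"));
        System.out.println(udf.evaluate(1234, "000"));
    }
}
```

　　Keeping the logic in a single evaluate method with boxed argument types makes it easy to unit test outside Hive before packaging the jar.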

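　　1.2 Using a user-defined function

　　1. Package the custom function into a jar and copy it to /soft/hive/lib
　　2. Restart Hive
　　3. Register the function

# Permanent function

　　create function myudf as 'com.share.udf.MyUDF';

# Temporary function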

　　create temporary function myudf as 'com.share.udf.MyUDF';

　　1.3 Demo

　　Using a UDF to parse the temptags data in Hive

　　0. Prepare the data

　　1. Create the table

    create table temptags(id int,json string) row format delimited fields terminated by '\t';

　　2. Load the data

    load data local inpath '/home/centos/files/temptags.txt' into table temptags;

　　3. Write the code

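　　A rough sketch of the shape such a parsing UDF could take. The original project uses fastjson (see the jar added in step 5); to stay dependency-free this sketch swaps in two regular expressions instead, which is not a general JSON parser. The layout of temptags.txt is not shown in this article, so the "tags" field name and the flat array of strings are purely hypothetical, as is the class name ParseJsonSketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for com.share.udf.ParseJson.
// Assumption: the JSON column looks like {"tags":["a","b"]}; the real layout may differ.
class ParseJsonSketch {

    // Finds "tags":[ ... ] and captures the text between the brackets.
    private static final Pattern TAGS_ARRAY =
            Pattern.compile("\"tags\"\\s*:\\s*\\[(.*?)\\]");

    // Matches one double-quoted string that contains no escaped quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the tags as a list, so that explode() could turn them into rows.
    public List<String> evaluate(String json) {
        List<String> tags = new ArrayList<>();
        if (json == null) {
            return tags;
        }
        Matcher array = TAGS_ARRAY.matcher(json);
        if (array.find()) {
            Matcher item = QUOTED.matcher(array.group(1));
            while (item.find()) {
                tags.add(item.group(1));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String json = "{\"id\":1,\"tags\":[\"tasty\",\"clean\"]}";
        System.out.println(new ParseJsonSketch().evaluate(json));
    }
}
```

　　Returning a List<String> is what lets the later queries feed parsejson(json) straight into explode.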

　　4. Package

　　5. Add fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　6. Restart Hive

　　7. Register a temporary function

    create temporary function parsejson as 'com.share.udf.ParseJson';

　　8. Test

select id ,parsejson(json) as tags from temptags;

# Explode the parsed tags so that each row holds one id and one tag

select id,  tag from temptags lateral view explode(parsejson(json)) xx as tag;

# Count how often each tag occurs for each merchant

select id, tag, count(*) as count
from (select id,  tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;

# Rank the tags within each merchant by count

select id, tag , count, row_number()over(partition by id order by count desc) as rank
from  (select id, tag, count(*) as count from (select id,  tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id,tag) b ;

# Concatenate each tag with its count and keep the top 10 tags per merchant

select id, concat(tag,'_',count)
from (select id, tag , count, row_number()over(partition by id order by count desc) as rank 
from  (select id, tag, count(*) as count from (select id,  tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id,tag) b )c
where rank<=10;

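# Aggregate and concatenate

    //concat_ws(',', List<>) joins the elements of an array with the given separator

    //collect_set(name) collects the column values into an array, removing duplicates

    //collect_list(name) collects the column values into an array, keeping duplicates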

select id, concat_ws(',',collect_set(concat(tag,'_',count))) as tags
from (select id, tag , count, row_number()over(partition by id order by count desc) as rank
from  (select id, tag, count(*) as count from (select id,  tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id,tag) b )c  where rank<=10 group by id;
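　　To make the nesting of the final query easier to follow, the same steps can be traced in plain Java. This is only an illustration of the logic, not how Hive executes it; the class TopTagsSketch and the method topTags are hypothetical names, and the input stands for the (id, parsejson(json)) rows. One difference: the sketch keeps the tags in rank order, whereas collect_set, as far as I know, makes no ordering promise.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical walk-through of the query: explode, count, rank, top N, join.
class TopTagsSketch {

    // rows: one entry per review, holding the merchant id and its parsed tags.
    static Map<Integer, String> topTags(List<Map.Entry<Integer, List<String>>> rows, int limit) {
        // lateral view explode + group by id, tag + count(*)
        Map<Integer, Map<String, Long>> counts = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> row : rows) {
            Map<String, Long> perId =
                    counts.computeIfAbsent(row.getKey(), k -> new LinkedHashMap<>());
            for (String tag : row.getValue()) {
                perId.merge(tag, 1L, Long::sum);
            }
        }
        // row_number() over (partition by id order by count desc), rank <= limit,
        // then concat(tag, '_', count) joined with ','
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Map<String, Long>> e : counts.entrySet()) {
            String joined = e.getValue().entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                    .limit(limit)
                    .map(t -> t.getKey() + "_" + t.getValue())
                    .collect(Collectors.joining(","));
            result.put(e.getKey(), joined);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, List<String>>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>(1, Arrays.asList("tasty", "clean")));
        rows.add(new SimpleEntry<>(1, Arrays.asList("tasty")));
        System.out.println(topTags(rows, 10));
    }
}
```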

　　1.4 Virtual columns: lateral view

　　123456 tastes good_10,clean environment_9

　　id　　 tags
　　1 　　[tastes good, clean environment]　　 =>　　 1 tastes good
　　　　　　　　　　　　　　　　　　　　　　　　　　　1 clean environment

select name, workplace from employee lateral view explode(work_place) xx as workplace;

　　1.5 ClassNotFoundException

　　How to resolve a ClassNotFoundException caused by a missing jar

　　Problem

　　Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson

　　Solution

　　1. Copy the fastjson jar and the myhive jar into /soft/hadoop/share/hadoop/common/lib

　　cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/

　　cp /soft/hive/lib/fastjson-1.2..jar /soft/hadoop/share/hadoop/common/lib/

　　2. Sync the jars to the other nodes

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2..jar

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

　　3. Restart Hadoop and Hive

　　stop-all.sh

　　start-all.sh

　　hive

2. UDTF

　　2.0 Overview

　　Word Count can be implemented in Hive in the following two ways

　　array => explode

　　string => split => explode

　　Here Word Count is implemented directly with a UDTF

　　string => myudtf

　　2.1 Write the code


　　2.2 Package

　　Add myhive-1.0-SNAPSHOT.jar to /soft/hive/lib

　　2.3 Restart Hive

　　2.4 Register a temporary function

　　create temporary function myudtf as 'com.share.udtf.MyUDTF';

　　2.5 Test

    select myudtf(line) from wc2;

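　　2.6 Flow analysis

　　1. In initialize, check the types or the number of the arguments (the method parameters)

　　2. Return the schema of the output table (field names + field types)

　　3. In the process function, read the argument values

　　4. After processing them, emit the results through the forward function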
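　　The four steps above can be sketched in plain Java without Hive on the classpath. In a real UDTF the class would typically extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF, where initialize receives and returns object inspectors and forward is inherited; here those are replaced by a plain argument count, a list of column descriptions, and a Consumer callback. The class name WordSplitUdtfSketch, the output column "word string", and the whitespace splitting are assumptions, since the original MyUDTF code is not shown.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for com.share.udtf.MyUDTF: one line in, one row per word out.
class WordSplitUdtfSketch {

    // Steps 1 and 2: validate the number of arguments, then describe the output columns.
    public List<String> initialize(int argumentCount) {
        if (argumentCount != 1) {
            throw new IllegalArgumentException(
                    "expected exactly one argument, got " + argumentCount);
        }
        return Collections.singletonList("word string");
    }

    // Steps 3 and 4: read the argument value, split it, and forward one row per word.
    // The forward callback stands in for GenericUDTF.forward.
    public void process(Object[] args, Consumer<String> forward) {
        if (args[0] == null) {
            return;
        }
        for (String word : args[0].toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward.accept(word);
            }
        }
    }

    public static void main(String[] args) {
        WordSplitUdtfSketch udtf = new WordSplitUdtfSketch();
        System.out.println(udtf.initialize(1));
        udtf.process(new Object[] {"hello hive hello"}, System.out::println);
    }
}
```

　　Counting the words would then be an ordinary group by over the rows the function emits.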
