Hive Study Notes 07: Common Optimizations

Date: 2021-05-11 10:13:20

Daily basics:

1. Rows to columns (pivot):

case ... when ... then ... else ... end as xxx
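
As a minimal sketch of this pattern, the query below pivots one row per student and subject into one row per student. The table `score` and its columns are hypothetical names chosen for illustration, not taken from these notes.

```sql
-- Hypothetical input table: score(name STRING, subject STRING, score INT)
-- with one row per (name, subject).
SELECT
  name,
  -- Each CASE keeps the score only for its own subject; MAX collapses
  -- the group to the single non-NULL value.
  MAX(CASE WHEN subject = 'math'    THEN score ELSE NULL END) AS math,
  MAX(CASE WHEN subject = 'english' THEN score ELSE NULL END) AS english
FROM score
GROUP BY name;
```

Each output column needs its own `case` expression, so the set of subjects has to be known when the query is written.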

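2. Delimiters in a table definition:

"fields terminated by": the delimiter between fields (columns).
"collection items terminated by": the delimiter between the elements (items) inside a single collection-typed field.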
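
A sketch of how these clauses fit into a table definition. The table `person`, its columns, and the chosen delimiter characters are assumptions for illustration. `map keys terminated by` is not covered in the notes above and is included only because the sketch has a map column.

```sql
-- Hypothetical table; a matching input line could look like:
--   tom,reading-running,city:beijing-street:xidan
CREATE TABLE person (
  name    STRING,
  hobbies ARRAY<STRING>,
  address MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- separates name, hobbies, address
  COLLECTION ITEMS TERMINATED BY '-'  -- separates array elements and map entries
  MAP KEYS TERMINATED BY ':'          -- separates a map key from its value
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```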

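3. Common partitions in a data warehouse

Data warehouse partitions: time (day), data source (app, m, pc)

  --Database: user attributes such as age and gender, plus favorites and purchase records
  --New users are added and profiles are modified every day, so a full daily snapshot such as dt=20180922 carries a lot of redundant information
  --overwrite 7: each day, overwrite that day's partition (e.g. dt=20180922) and keep 7 days
  --Each partition holds the full data up to that day; with 7 partitions, the data is stored 7 times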
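
The daily overwrite described above might look like the following sketch. The tables `user_profile` and `user_profile_staging`, their columns, and the dropped date are hypothetical. Only the partition keys (day and source) come from the notes.

```sql
-- Hypothetical snapshot table partitioned by day and data source.
CREATE TABLE IF NOT EXISTS user_profile (
  user_id BIGINT,
  age     INT,
  gender  STRING
)
PARTITIONED BY (dt STRING, source STRING);

-- Rewrite one day's full snapshot for one source; rerunning the job
-- replaces the partition instead of appending to it.
INSERT OVERWRITE TABLE user_profile PARTITION (dt = '20180922', source = 'app')
SELECT user_id, age, gender
FROM user_profile_staging
WHERE source = 'app';

-- Keep only 7 daily snapshots by dropping the oldest day (all sources).
ALTER TABLE user_profile DROP IF EXISTS PARTITION (dt = '20180915');
```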

4. Show column headers when viewing data in Hive:

set hive.cli.print.header = true;

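5. Bucketing: clustered by (xxx) into 4 buckets;

Before writing to a bucketed table, set this parameter (needed on Hive 1.x and earlier; Hive 2.0 and later always enforce bucketing and no longer have this setting):
set hive.enforce.bucketing = true;
Alternatively, set mapred.reduce.tasks yourself so that the number of reducers matches the number of buckets.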
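
A minimal sketch of creating and loading a bucketed table. The tables `student_bucketed` and `student_raw` and their columns are hypothetical.

```sql
-- Only needed on older Hive versions that still have this setting.
SET hive.enforce.bucketing=true;

-- Rows are assigned to one of 4 bucket files by hashing user_id.
CREATE TABLE student_bucketed (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- Load through a query so that Hive performs the bucketing itself;
-- copying files in directly would not distribute rows by hash.
INSERT OVERWRITE TABLE student_bucketed
SELECT user_id, name
FROM student_raw;
```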

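What buckets are for:
1. Data sampling. To sample on a column: select * from student tablesample(bucket x out of y on user_id)
Hive derives the sampling fraction from y (roughly 1/y of the data); x selects which bucket to start from.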
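
To make x and y concrete, here is a sketch that assumes `student` was created with 4 buckets on `user_id`. The bucket count is an assumption for illustration.

```sql
-- y = 4 on a 4-bucket table: read one bucket (the 1st), about 1/4 of the data.
SELECT * FROM student TABLESAMPLE (BUCKET 1 OUT OF 4 ON user_id);

-- y = 2: read 4 / 2 = 2 buckets (the 1st and the 3rd), about 1/2 of the data.
SELECT * FROM student TABLESAMPLE (BUCKET 1 OUT OF 2 ON user_id);

-- y = 8: read half of the 1st bucket, about 1/8 of the data.
SELECT * FROM student TABLESAMPLE (BUCKET 1 OUT OF 8 ON user_id);
```

When the sampling column matches the column the table is bucketed on, Hive can read only the selected bucket files instead of scanning the whole table.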

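6. Hive optimization

1. The number of map tasks in a job is determined by the input directory (the number and size of its files) together with the block size: set dfs.block.size

--When there are too many small files, merge them to reduce the number of map tasks

---set mapred.map.tasks = 10 (only a hint; the actual count still depends on the input splits)

---Map-side aggregation: set hive.map.aggr=true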
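
The small-file advice above can be sketched with the session settings below. The property names are standard Hive and Hadoop settings, but every numeric value is an illustrative assumption that should be tuned per cluster.

```sql
-- Combine many small input files into fewer, larger splits (fewer map tasks).
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=256000000;

-- Merge small output files at the end of map-only and map-reduce jobs,
-- so downstream jobs do not inherit the small-file problem.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;

-- Pre-aggregate in the mappers to reduce the data shuffled to reducers.
SET hive.map.aggr=true;
```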

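Reduce optimization:
---hive.exec.reducers.bytes.per.reducer= ; amount of data handled by each reduce task (third priority)
---hive.exec.reducers.max= ; maximum number of reducers (highest priority)
---Set the number of reducers directly: set mapred.reduce.tasks = 10 (second priority)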
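
A sketch of these three settings together; the values are illustrative assumptions. When the reducer count is not fixed, Hive estimates it as roughly the input size divided by the bytes per reducer, capped at the maximum.

```sql
-- Target amount of input per reducer (about 256 MB here).
SET hive.exec.reducers.bytes.per.reducer=256000000;

-- Upper bound on the estimated number of reducers.
SET hive.exec.reducers.max=100;

-- With about 10 GB of input, the estimate is on the order of
-- 10 GB / 256 MB, i.e. about 40 reducers, which is below the cap of 100.

-- Alternatively, fix the count explicitly for this session:
-- SET mapred.reduce.tasks=10;
```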

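Cases that run with a single reducer:
--order by (when a total order is not required, use distribute by + sort by, or cluster by, instead)
--Cartesian product: a join b with no on clause, or with an invalid on condition, becomes a Cartesian join and runs on a single reducer; always avoid it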
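
The difference between the sorting variants can be sketched as follows, using a hypothetical table `orders(user_id, amount)`.

```sql
-- Total order over the whole result: all rows go through one reducer.
SELECT user_id, amount FROM orders ORDER BY amount DESC;

-- Rows with the same user_id go to the same reducer and are sorted
-- within that reducer; the output is not globally ordered.
SELECT user_id, amount
FROM orders
DISTRIBUTE BY user_id
SORT BY user_id, amount DESC;

-- CLUSTER BY user_id is shorthand for
-- DISTRIBUTE BY user_id SORT BY user_id (ascending only).
SELECT user_id, amount FROM orders CLUSTER BY user_id;
```

The replacement is only valid when per-reducer ordering is enough, for example when the result is consumed per key.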

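Hive query optimization:
-Partition conditions in the where clause are applied early (partition pruning), so there is no need to wrap them in a subquery; join and group by directly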
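
A sketch of this point for an inner join, assuming two hypothetical tables `user_action` and `user_profile` that are both partitioned by `dt`.

```sql
-- The dt predicates sit in the outer WHERE clause, yet Hive still reads
-- only the dt = '20180922' partition of each table before joining.
SELECT a.user_id, COUNT(*) AS action_cnt
FROM user_action a
JOIN user_profile p
  ON a.user_id = p.user_id
WHERE a.dt = '20180922'
  AND p.dt = '20180922'
GROUP BY a.user_id;
```

Running `EXPLAIN` on the query is one way to check which partitions are actually scanned.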

-For a map join, put the small table first (leftmost)
- /*+MAPJOIN(TABLElist)*/: the listed table must be small, under 1 GB or 50 records
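
A minimal sketch of the hint, with a hypothetical small dimension table `dim_city` and a large fact table `fact_orders`.

```sql
-- Ask Hive to load dim_city (alias d) into memory in each mapper,
-- so the join is done map-side without a shuffle or reduce phase.
SELECT /*+ MAPJOIN(d) */
  f.order_id,
  d.city_name
FROM dim_city d
JOIN fact_orders f
  ON f.city_id = d.city_id;

-- Alternative: let Hive convert eligible joins automatically,
-- based on the size of the small table.
SET hive.auto.convert.join=true;
```

Depending on the Hive version and configuration, the explicit hint may be ignored in favor of automatic conversion, so it is worth confirming the plan with `EXPLAIN`.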

-union all/distinct

-Doing union all first and then the join or group by can effectively reduce the number of MR jobs
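
The last point can be sketched with two hypothetical log tables, `app_log` and `pc_log`, that share a `user_id` column.

```sql
-- Aggregating each source separately and then combining the results
-- needs one group-by stage per source. Combining the rows first lets
-- a single group-by stage cover both sources.
SELECT user_id, COUNT(*) AS visit_cnt
FROM (
  SELECT user_id FROM app_log WHERE dt = '20180922'
  UNION ALL
  SELECT user_id FROM pc_log  WHERE dt = '20180922'
) t
GROUP BY user_id;
```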


