hive 下篇

时间:2023-03-08 19:58:34

由于spark on hive 问题,导致无法插入数据,暂时使用spark进行hive操作

向分区表插入数据

hive> show partitions customers;
OK
partition
city=beijing
city=shenzhen
Time taken: 0.071 seconds, Fetched: 2 row(s)

查看HDFS:

drwxr-xr-x   - root supergroup          0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
-rwxr-xr-x   1 root supergroup         10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000

drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx   1 root supergroup         80 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

插入数据:

scala> hivecon.sql("insert into gamedw.customers partition(city='guangzhen')select * from gamedw.cust ").show
18/04/13 14:14:44 WARN hive.log: Updating partition stats fast for: customers  
18/04/13 14:14:44 WARN hive.log: Updated size to 106
++
||
++
++

scala> hivecon.sql("show partitions gamedw.customers ").show
+--------------+
|     partition|
+--------------+
|  city=beijing|
|city=guangzhen|
| city=shenzhen|
+--------------+

然后再查看hdfs :

drwx-wx-wx   - root supergroup          0 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers
drwx-wx-wx   - root supergroup          0 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing
-rwx-wx-wx   1 root supergroup         80 2018-03-29 16:35 /user/hive/warehouse/gamedw.db/customers/city=beijing/customers.txt
drwx-wx-wx   - root supergroup          0 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen
-rwx-wx-wx   1 root supergroup         96 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00000-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
-rwx-wx-wx   1 root supergroup         10 2018-04-13 14:14 /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00001-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
drwx-wx-wx   - root supergroup          0 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen
-rwx-wx-wx   1 root supergroup         80 2018-03-28 17:03 /user/hive/warehouse/gamedw.db/customers/city=shenzhen/customers.txt

由于表cust数据是两个文件,因此插入数据后,表customers,自动生成了一个分区,在HDFS中多了两个文件:

[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00001-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
nihao 1 5
[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/customers/city=guangzhen/part-00000-8513f59a-0fb6-45f8-bf87-1d27254d9b74-c000
tianyt_touch100 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47
liuyang 1 32
hello 0 100

同样插入如下数据:

scala> hivecon.sql("insert overwrite table  gamedw.customers partition(city='wuhan')select * from gamedw.cust where nianling<>5").show
18/04/13 14:31:13 WARN hive.log: Updating partition stats fast for: customers  
18/04/13 14:31:13 WARN hive.log: Updated size to 96
++
||
++
++

查看hdfs:

drwx-wx-wx   - root supergroup          0 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan
-rwx-wx-wx   1 root supergroup         96 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00000-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
-rwx-wx-wx   1 root supergroup          0 2018-04-13 14:31 /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00001-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000

[root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00001-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
[root@host ~]# hdfs dfs -text /user/hive/warehouse/gamedw.db/customers/city=wuhan/part-00000-d3a1ad44-e0d6-4a1a-9dd0-b42933085b46-c000
tianyt_touch100 1 50
wangwu 1 85
zhangsan 1 20
liuqin 0 56
wangwu 0 47
liuyang 1 32
hello 0 100

可以查出虽然一个文件的数据已经被过滤,但插入时依然会生成一个空的文件

一次添加两个分区:

scala> hivecon.sql("from gamedw.cust cst insert overwrite table  gamedw.customers partition(city='nanjing') select *   where cst.nianling<>5  insert overwrite table  gamedw.customers partition(city='luohe') select *   where cst.nianling=5 ").show

18/04/13 15:16:38 WARN hive.log: Updating partition stats fast for: customers 18/04/13 15:16:38 WARN hive.log: Updated size to 96

18/04/13 15:16:39 WARN hive.log: Updating partition stats fast for: customers 18/04/13 15:16:39 WARN hive.log: Updated size to 10

++ || ++ ++

hive> show partitions customers;
OK
partition
city=beijing
city=guangzhen
city=luohe
city=nanjing
city=shenzhen
city=wuhan
Time taken: 0.086 seconds, Fetched: 6 row(s)

hive> select * from customers where city='luohe' or city='nanjing';
OK
customers.custname      customers.sex   customers.age   customers.city
nihao   1       5       luohe
tianyt_touch100 1       50      nanjing
wangwu  1       85      nanjing
zhangsan        1       20      nanjing
liuqin  0       56      nanjing
wangwu  0       47      nanjing
liuyang 1       32      nanjing
hello   0       100     nanjing
Time taken: 1.037 seconds, Fetched: 8 row(s)

drwx-wx-wx   - root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe
-rwx-wx-wx   1 root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe/part-00000-0315be67-7d9a-451a-bdea-09f18f02b97c-c000
-rwx-wx-wx   1 root supergroup         10 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=luohe/part-00001-0315be67-7d9a-451a-bdea-09f18f02b97c-c000
drwx-wx-wx   - root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing
-rwx-wx-wx   1 root supergroup         96 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing/part-00000-4e959b24-fff2-45c0-a3d7-f4136d7ea96c-c000
-rwx-wx-wx   1 root supergroup          0 2018-04-13 15:16 /user/hive/warehouse/gamedw.db/customers/city=nanjing/part-00001-4e959b24-fff2-45c0-a3d7-f4136d7ea96c-c000

成功一次添加两个分区

之前的select 插入均为静态插入;

动态插入分区:

scala> hivecon.sql("insert overwrite table gamedw.customers partition(city) select custname,sex,nianling,'kaifeng' as city from gamedw.cust").show
org.apache.spark.SparkException: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:301)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
  ... 48 elided

hive> set hive.exec.dynamic.partition.mode=nonstrict;(在此设置无效,需要在spark设置)

scala> hivecon.sql("set hive.exec.dynamic.partition.mode=nonstrict").show

+--------------------+---------+ |

key|    value|

+--------------------+---------+ |

hive.exec.dynamic...|nonstrict|

+--------------------+---------+

scala> hivecon.sql("insert overwrite table gamedw.customers partition(city) select custname,sex,nianling,'kaifeng' as city from gamedw.cust").show

18/04/13 15:33:46 WARN hive.log: Updating partition stats fast for: customers

18/04/13 15:33:46 WARN hive.log: Updated size to 106

++ || ++ ++

hive> select * from customers where city='kaifeng';
OK
customers.custname      customers.sex   customers.age   customers.city
tianyt_touch100 1       50      kaifeng
wangwu  1       85      kaifeng
zhangsan        1       20      kaifeng
liuqin  0       56      kaifeng
wangwu  0       47      kaifeng
liuyang 1       32      kaifeng
hello   0       100     kaifeng
nihao   1       5       kaifeng
Time taken: 0.161 seconds, Fetched: 8 row(s)

bingo,动态插入成功

单个语句创建表并加载数据

scala> hivecon.sql("create table cust_copy as  select custname,sex,nianling from gamedw.cust").show

18/04/13 15:46:38 WARN metastore.HiveMetaStore: Location: hdfs://localhost:9000/user/hive/warehouse/cust_copy specified for non-external table:cust_copy

++ || ++ ++

查看HDFS:

drwx-wx-wx   - root supergroup          0 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy
-rwx-wx-wx   1 root supergroup         96 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy/part-00000-8e3d26e5-5b38-4308-b631-9815f6730acc-c000
-rwx-wx-wx   1 root supergroup         10 2018-04-13 15:58 /user/hive/warehouse/gamedw.db/cust_copy/part-00001-8e3d26e5-5b38-4308-b631-9815f6730acc-c000

drwxr-xr-x   - root supergroup          0 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust
-rwxr-xr-x   1 root supergroup         96 2018-04-11 17:14 /user/hive/warehouse/gamedw.db/cust/customers.txt
-rwxr-xr-x   1 root supergroup         10 2018-04-13 11:50 /user/hive/warehouse/gamedw.db/cust/part-00000-ebb85cb3-73e4-4d7b-aa07-8f007c8887bb-c000

导出数据:

如果表的数据文件刚好是需要的格式,直接在hdfs上拷贝数据即可

否则

insert overwrite local directory '/root/custtmp' select * from  tablename;

查询

employees表导入数据:

CREATE TABLE `employees`(
  `name` string,
  `salary` float,
  `subdinates` array<string>,
  `deducation` map<string,float>,
  `address` struct<street:string,city:string,state:string,zip:int>)

[root@host tmpdata]# cat -A employees.txt     
tianyongtao^A100^Awang^BZHANG^BLIU^Aaaa^C10^Bbb^C5^BCC^C8^AHENAN^BLUOHE^BLINYING^BTAICHEN$
wangyangming^A200^Ama^Bzhong^Aaaa^C6^Bbb^C12^Ahunan^Bchangsha$

[root@host tmpdata]# hdfs dfs -put employees.txt /user/hive/warehouse/gamedw.db/employees

hive> select *  from employees;
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
Time taken: 0.14 seconds, Fetched: 2 row(s)

hive> select name,salary,subdinates[0] subdinate1,deducation["bb"] bb,address.city from employees;
OK
name    salary  subdinate1      bb      city
tianyongtao     100.0   wang    5.0     LUOHE
wangyangming    200.0   ma      12.0    changsha
Time taken: 0.134 seconds, Fetched: 2 row(s)
数组索引是基于0的;map使用的是键值;引用结构体的一个元素使用的是的 结构体名.属性名 如上例中的address.city

hive> select name,salary,subdinates[2] subdinate1,deducation["cc"] cc,address.city from employees;
OK
name    salary  subdinate1      cc      city
tianyongtao     100.0   LIU     NULL    LUOHE
wangyangming    200.0   NULL    NULL    changsha
Time taken: 0.123 seconds, Fetched: 2 row(s)

数据根据索引找不到则返回null,map同理;

hive> select *  from employees where address.city='LUOHE';
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
Time taken: 0.126 seconds, Fetched: 1 row(s)
hive> select *  from employees where deducation["bb"]=12;
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
Time taken: 0.125 seconds, Fetched: 1 row(s)

HQL语句中用到的双引号,在spark需要使用转义字符

scala> hivecon.sql("select upper(name) Uname,lower(name) lname,salary-deducation[\"aaa\"]-deducation[\"bb\"] sal from gamedw.employees").show
+------------+------------+-----+
|       Uname|       lname|  sal|
+------------+------------+-----+
| TIANYONGTAO| tianyongtao| 85.0|
|WANGYANGMING|wangyangming|182.0|
+------------+------------+-----+

hive> select upper(name) Uname,lower(name) lname,salary-deducation["aaa"]-deducation["bb"] sal from employees;
OK
uname   lname   sal
TIANYONGTAO     tianyongtao     85.0
WANGYANGMING    wangyangming    182.0
Time taken: 0.124 seconds, Fetched: 2 row(s)

函数

scala> hivecon.sql("  select round(5.66) round,floor(8.55) floor,ceil(8.55) ceil,rand() rand,exp(8) exp,ln(16) ln,log10(100) log10,log2(5) log2,power(10,2) power,sqrt(100) sqrt,bin(5) bin,hex(10) hex,hex(9) hex1,abs(-100) abs,pmod(1,5) pmod,negative(-10) negative").show
+-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+
|round|floor|ceil|              rand|               exp|               ln|log10|             log2|power|sqrt|bin|hex|hex1|abs|pmod|negative|
+-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+
|    6|    8|   9|0.4827320089704821|2980.9579870417283|2.772588722239781|  2.0|2.321928094887362|100.0|10.0|101|  A|   9|100|   1|      10|
+-----+-----+----+------------------+------------------+-----------------+-----+-----------------+-----+----+---+---+----+---+----+--------+

e,作为数学常数,是自然对数函数的底数。有时称它为欧拉数(Euler number),以瑞士数学家欧拉命名;它就像圆周率π和虚数单位i,e是数学中最重要的常数之一。

hive 下篇其数值约为(小数点后100位):“e ≈ 2.71828 18284 59045 23536 02874 71352 66249 77572 47093 69995 95749 66967 62772 40766 30353 54759 45713 82178 52516 64274”。

hive函数exp(double d)返回e的d幂次方;

hive> select exp(1) exp1,exp(2) exp2,exp(8) exp8;
OK
exp1    exp2    exp8
2.718281828459045       7.38905609893065        2980.9579870417283
Time taken: 0.545 seconds, Fetched: 1 row(s)

hive函数ln(double d)返回以自然数e为底d的对数

hive> select ln(7.38905609893065);
OK
_c0
2.0
Time taken: 0.136 seconds, Fetched: 1 row(s)

log10(double d)返回以10为底d的对数;log2(double d)返回以2为底d的对数,log(double base,double d)返回以base为底d的对数

hive> select log10(100),power(10,2),log2(8),power(2,3),log(2,16);
OK
_c0     _c1     _c2     _c3     _c4
2.0     100.0   3.0     8.0     4.0
Time taken: 0.164 seconds, Fetched: 1 row(s)

bin(double i) 计算二进制i的string类型值,i是bigint类型;

hive> select bin(7),bin(5);
OK
_c0     _c1
111     101
Time taken: 0.147 seconds, Fetched: 1 row(s)

hex(bigint i) 计算十六进制i的string类型值,i是bigint类型

说明:如果变量是int类型,那么返回a的十六进制表示;如果变量是string类型,则返回该字符串的十六进制表示

hive> select hex(15),hex(9),hex(10),hex(31);
OK
_c0     _c1     _c2     _c3
F       9       A       1F
Time taken: 0.141 seconds, Fetched: 1 row(s)

conv(BIGINT num, int from_base, int to_base)

将数值num从from_base进制转化到to_base进制

hive> select conv(17,10,16);
OK
_c0
11
Time taken: 0.156 seconds, Fetched: 1 row(s)

语法: pmod(int a, int b),pmod(double a, double b)

返回值: int double

说明:返回正的a除以b的余数

hive> select pmod(8,3);
OK
_c0
2
Time taken: 0.549 seconds, Fetched: 1 row(s)

negative 负的

positive 正数

hive> select positive(-10),positive(10),negative(-10),negative(10);
OK
_c0     _c1     _c2     _c3
-10     10      10      -10
Time taken: 0.158 seconds, Fetched: 1 row(s)

hive> select e(),pi();
OK
_c0     _c1
2.718281828459045       3.141592653589793
Time taken: 0.144 seconds, Fetched: 1 row(s)

聚合函数

scala> hivecon.sql("select city,count(1),round(avg(age))  from gamedw.customers group by city").show
+---------+--------+----------------------------------+
|     city|count(1)|round(avg(CAST(age AS BIGINT)), 0)|
+---------+--------+----------------------------------+
|    wuhan|       7|                              56.0|
|  qingdao|       8|                              49.0|
|  beijing|       6|                              48.0|
|    luohe|       1|                               5.0|
|guangzhen|       8|                              49.0|
|  nanjing|       7|                              56.0|
|  kaifeng|       8|                              49.0|
| shenzhen|       6|                              48.0|
+---------+--------+----------------------------------+

count(*) 统计所有行数,包括null值的行

count(exp) 统计exp表达式为非null的行数

scala> hivecon.sql("select sum(age),sum(distinct age)  from gamedw.customers").show
+--------+-----------------+                                                   
|sum(age)|sum(DISTINCT age)|
+--------+-----------------+
|    2550|              395|
+--------+-----------------+

scala> hivecon.sql("select sum(age),sum(distinct age),avg(age),avg(distinct age)  from gamedw.customers").show
+--------+-----------------+--------+-----------------+                        
|sum(age)|sum(DISTINCT age)|avg(age)|avg(DISTINCT age)|
+--------+-----------------+--------+-----------------+
|    2550|              395|    50.0|           49.375|
+--------+-----------------+--------+-----------------+

表生产函数:

explode 可以接受数组和map参数

hive> select * from employees;
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
Time taken: 0.114 seconds, Fetched: 2 row(s)
hive> select  explode(subdinates) from employees;
OK
col
wang
ZHANG
LIU
ma
zhong
Time taken: 0.118 seconds, Fetched: 5 row(s)
hive> select  explode(subdinates),explode(deducation) from employees;
FAILED: SemanticException 1:28 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'deducation'
hive> select explode(deducation) from employees;
OK
key     value
aaa     10.0
bb      5.0
CC      8.0
aaa     6.0
bb      12.0
Time taken: 0.115 seconds, Fetched: 5 row(s)

hive> select explode(address) from employees;
FAILED: UDFArgumentException explode() takes an array or a map as a parameter

json

hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}', '$.store.fruit.type');
OK
_c0
["apple","pear"]
Time taken: 0.147 seconds, Fetched: 1 row(s)

parse_url

hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','QUERY');
OK
_c0
adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123
Time taken: 0.143 seconds, Fetched: 1 row(s)

hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','QUERY','gameid');
OK
_c0
3004
Time taken: 0.139 seconds, Fetched: 1 row(s)
hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','HOST');
OK
_c0
data.zqgame.com
Time taken: 0.14 seconds, Fetched: 1 row(s)
hive> select parse_url('http://data.zqgame.com:3689/Report/ad_main_bydate_shouyou.aspx?adcode=256004296705&gameid=3004&sdate=20180101&edate=20180123','PATH');
OK
_c0
/Report/ad_main_bydate_shouyou.aspx
Time taken: 0.146 seconds, Fetched: 1 row(s)

其他内置函数

hive> select instr('aggggaaeg','aa'),length('aaaggg1111'),locate('aa','aggggaaeg'),lower('aaaA'),lcase('aaaA'),upper('aggefR');
OK
_c0     _c1     _c2     _c3     _c4     _c5
6       10      6       aaaa    aaaa    AGGEFR
Time taken: 0.14 seconds, Fetched: 1 row(s)

hive> select cast('10' as int),ascii('a'),ascii(1),concat('how','are','you'),concat_ws(' ','how','are','you'),find_in_set('a','dgeage'),find_in_set('a','ggg,a,b'),format_number(20.25466,2);
OK
_c0     _c1     _c2     _c3     _c4     _c5     _c6     _c7
10      97      49      howareyou       how are you     0       2       20.25

scala> hivecon.sql("select lpad('aaaa',10,'bb'),lpad('aaaa',5,'bb'),ltrim(' aafdf'),repeat('55',5),space(10)").show
+------------------+-----------------+-------------+-------------+----------+
|lpad(aaaa, 10, bb)|lpad(aaaa, 5, bb)|ltrim( aafdf)|repeat(55, 5)| space(10)|
+------------------+-----------------+-------------+-------------+----------+
|        bbbbbbaaaa|            baaaa|        aafdf|   5555555555|          |
+------------------+-----------------+-------------+-------------+----------+

hive> select * from employees;
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
Time taken: 0.103 seconds, Fetched: 2 row(s)
hive> select name,size(subdinates),size(deducation) from employees;
OK
name    c1      c2
tianyongtao     3       3
wangyangming    2       2
Time taken: 0.106 seconds, Fetched: 2 row(s)

字符串装换为map:

hive> select str_to_map("gameid-100,name-'jzry'",',','-')
    > ;
OK
_c0
{"gameid":"100","name":"'jzry'"}
Time taken: 0.146 seconds, Fetched: 1 row(s)

hive> select from_unixtime( unix_timestamp(),'yyyyMMdd'), from_unixtime( unix_timestamp()),unix_timestamp(),unix_timestamp('2018-04-18','yyyy-MM-dd');
unix_timestamp(void) is deprecated. Use current_timestamp instead.
unix_timestamp(void) is deprecated. Use current_timestamp instead.
unix_timestamp(void) is deprecated. Use current_timestamp instead.
OK
_c0     _c1     _c2     _c3
20180418        2018-04-18 15:13:37     1524035617      1523980800
Time taken: 0.178 seconds, Fetched: 1 row(s)

hive> select current_timestamp(),year(current_timestamp()),month(current_timestamp()),day(current_timestamp()),hour(current_timestamp()),second(current_timestamp());
OK
_c0     _c1     _c2     _c3     _c4     _c5
2018-04-18 15:17:11.882 2018    4       18      15      11
Time taken: 0.133 seconds, Fetched: 1 row(s)

hive> select weekofyear(current_timestamp()),date_add(current_timestamp(),2),datediff('2018-01-01',current_timestamp());
OK
_c0     _c1     _c2
16      2018-04-20      -107
Time taken: 0.139 seconds, Fetched: 1 row(s)

查看前三行和mysql基本一样

hive> select * from cust limit 3;
OK
cust.custname   cust.sex        cust.nianling
tianyt_touch100 1       50
wangwu  1       85
zhangsan        1       20
Time taken: 0.109 seconds, Fetched: 3 row(s)

嵌套查询

hive> from (select  *  from cust) e select e.nianling where e.nianling>50;
OK
e.nianling
85
56
100
Time taken: 0.109 seconds, Fetched: 3 row(s)

case when 基本和mysql同

hive> select custname,case when nianling>80 then 'old' else 'young' end ifold  from cust;
OK
custname        ifold
tianyt_touch100 young
wangwu  old
zhangsan        young
liuqin  young
wangwu  young
liuyang young
hello   old
nihao   young
Time taken: 0.097 seconds, Fetched: 8 row(s)

like 和 Rlike

hive> select * from cust where custname like '%nihao%';
OK
cust.custname   cust.sex        cust.nianling
nihao   1       5
Time taken: 0.104 seconds, Fetched: 1 row(s)

hive> select * from cust where custname Rlike '.*(wang|nihao).*';
OK
cust.custname   cust.sex        cust.nianling
wangwu  1       85
wangwu  0       47
nihao   1       5
Time taken: 0.089 seconds, Fetched: 3 row(s)

Rlike后面的字符串表达式,字符串中的点号(.)表示可以和任何字符匹配,*重复左边的字符串(上例中为点号)0次到无数次;表达式(x|y)表示和x或者y匹配

group by  having  同mysql

inner join on 同mysql

a , b ,c三个表关联,通常情况下,hive对每对join连接生产一个mapreduce,hive按照从左到右执行,因此a,b生成一个mapreduce job,然后将mapreduce的结果与表连接生成一个mapreduce job.

join 的优化

当3个或者以上的表join连接时,如果每个连接都使用相同的连接键时,那么只会产生一个mapreduce job;

hive同时假定最后连接的表是最大的那张表,在对每行记录进行连接操作时,他会尝试将其他表缓存起来,然后扫描最后的表进行计算,因此连续查询的表的大小从左到右是依次增加的。

hive也提供了一个标记来提示查询优化器哪一张表大表 select /*+streamtable(b)*/ 其中b就是最大那张表

scala> hivecon.sql("select /*+streamtable(b)*/ a.sexname,b.custname  from gamedw.tbsex a inner join gamedw.cust b on a.id=b.sex").show

left out join

right out join

left semi join  左半开连接 返回左边表的记录,前提条件是后边表满足on条件

hive 不支右半开连接

join 没有 on  则生成 笛卡尔积。

map-side join  如果所有表中只有一张小表,那么可以在最大的表通过mapper的时候将小表放入内存。hive可以在map端执行连接过程,这是因为hive可以与内存中的小表逐一匹配,从而忽略掉连接时需要的reduce过程。想用这个优化,则可以通过添加一个标记来进行触发。 select /*+mapjoin(d)*/...... 启动d代表写入内存的小表

scala> hivecon.sql("select /*+mapjoin(a)*/ a.sexname,b.custname  from gamedw.tbsex a inner join gamedw.cust b on a.id=b.sex").show

order by 与 sort by

order by 同其他sql ;sort by 只会在每个reducer中对数据进行排序,也就是执行局部排序,这样可以保证每个reducer数据结果都是有序的,可以提高后面全局排序的效率。

含有sort by 的 distribute by

distribute by控制map的输出在reducer中是如何划分的。mpreduce job中出传输的数据都是按照键值对进行组织的,hive再将语句转换为mpreduce job时,其内部必须使用这个功能。

默认情况下,mpreduce计算框架会根据map输入的键计算响应的哈希值,然后按照得到的哈希值将键值对均匀发到多个reducer中,不幸的是,意味着当我们使用sort by 时,不同的reducer输出的内容会有明显的重叠。如果我们想将相同性别的数据一起处理,我们使用distribute by来保证将性别相同的记录发到同一个reducer中进行处理,然后使用sort by 按我们期望的对数据进行排序。
scala> hivecon.sql("select * from gamedw.cust distribute by sex  sort by sex").show

hive 要求distribute by 要在sort by前面

scala> hivecon.sql("select * from gamedw.cust cluster by sex").show

cluster by 等价于distribute by  +sort by

类型转换

可以将binary类型转换为字符串,也可以将字符串转为binary

分桶表 对于每一个表或者是分区,Hive可以进一步组织成桶,也就是说桶是更为细粒度的数据范围划分。Hive是针对某一列进行分桶。Hive采用对列值哈希,然后除以桶的个数求余的方式决定该条记录存放在哪个桶中。分桶的好处是可以获得更高的查询处理效率。使取样更高效。

使用CLUSTERED BY子句指定使用哪一个列来划分桶和要划分的桶的个数。

hive> create table tb_bucket(id int,name string)  clustered by(id)  sorted by(name) into 5 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' ;
OK
Time taken: 0.22 seconds

scala> hivecon.sql("insert overwrite table gamedw.tb_bucket select nianling,custname from gamedw.cust").show
++
||
++
++

hive> select * from tb_bucket;
OK
tb_bucket.id    tb_bucket.name
50      tianyt_touch100
85      wangwu
20      zhangsan
56      liuqin
47      wangwu
32      liuyang
100     hello
1001    mahuateng
5       nihao

drwx-wx-wx   - root supergroup          0 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket
-rwx-wx-wx   1 root supergroup         82 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00000-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
-rwx-wx-wx   1 root supergroup         15 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00001-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
-rwx-wx-wx   1 root supergroup          8 2018-04-20 14:38 /user/hive/warehouse/gamedw.db/tb_bucket/part-00002-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000

[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00000-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
50 tianyt_touch100
85 wangwu
20 zhangsan
56 liuqin
47 wangwu
32 liuyang
100 hello
[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00001-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
1001 mahuateng
[root@host ~]# hdfs dfs -cat /user/hive/warehouse/gamedw.db/tb_bucket/part-00002-b5147761-6fcf-4810-aa9b-2e0cbead8eee-c000
5 nihao

抽样查询

分桶抽样 我么可以使用rand()进行抽样,这个函数返回一个随机值。

hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
OK
s.id    s.name
20      zhangsan
32      liuyang
100     hello
5       nihao
Time taken: 0.099 seconds, Fetched: 4 row(s)
hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
OK
s.id    s.name
Time taken: 0.102 seconds
hive> select * from tb_bucket tablesample(bucket 3  out of 5 on rand()) s;
OK
s.id    s.name
50      tianyt_touch100
Time taken: 0.103 seconds, Fetched: 1 row(s)

我们也可以根据指定列而非rand()进行分桶,那么同一个语句执行n次,返回结果相同。

hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
OK
s.id    s.name
47      wangwu
32      liuyang
Time taken: 0.111 seconds, Fetched: 2 row(s)
hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
OK
s.id    s.name
47      wangwu
32      liuyang
Time taken: 0.101 seconds, Fetched: 2 row(s)
hive> select * from tb_bucket tablesample(bucket 3  out of 5 on id) s;
OK
s.id    s.name
47      wangwu
32      liuyang
Time taken: 0.129 seconds, Fetched: 2 row(s)

Hive中分桶表取样的语法是:

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

分桶语句中分母(y)表示的是数据将会被散列的个数,而分子(x)表示将会选择桶的个数。

我们也可以使用分桶语句对非分桶表进行分桶抽样

hive> select * from t_name tablesample(bucket 3  out of 5 on rand()) s;
OK
s.id    s.name  s.level
4       堂哥            2
6       堂妹            3
Time taken: 0.108 seconds, Fetched: 2 row(s)
hive> select * from t_name tablesample(bucket 3  out of 5 on rand()) s;
OK
s.id    s.name  s.level
1       爷爷            1
2       大爷            1
3       叔叔            1
Time taken: 0.102 seconds, Fetched: 3 row(s)

数据块抽样

这是另一种按百分比抽样的方式,基于行数;

hive> select * from t_name tablesample(10 percent) s;
OK
s.id    s.name  s.level
1       爷爷            1
2       大爷            1
Time taken: 0.096 seconds, Fetched: 2 row(s)

hive> select * from t_name tablesample(5 rows) s;
OK
s.id    s.name  s.level
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
Time taken: 0.11 seconds, Fetched: 5 row(s)

hive> select * from t_name tablesample(5m) s;
OK
s.id    s.name  s.level
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
Time taken: 0.098 seconds, Fetched: 12 row(s)
hive> select * from t_name tablesample(5k) s;
OK
s.id    s.name  s.level
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
1       爷爷            1
2       大爷            1
3       叔叔            1
4       堂哥            2
5       堂姐            2
6       堂妹            3
Time taken: 0.2 seconds, Fetched: 12 row(s)
hive> select * from t_name tablesample(5b) s;
OK
s.id    s.name  s.level
1       爷爷            1
Time taken: 0.153 seconds, Fetched: 1 row(s)

union all

可以将2个以及2个以上的表进行合并,查询的列个数以及对应的数据类型必须一致

scala> hivecon.sql("select custname,sex from gamedw.cust1 union all select custname,nianling from gamedw.cust").show

HiveQL 视图

可以降低查询复杂度,可以限制基于条件的过滤数据

视图可以指定字段名 create view viewname(fieldname,......) as ..........,当然也可以省掉 (fieldname,......)

hive> select * from employees;
OK
employees.name  employees.salary        employees.subdinates    employees.deducation    employees.address
tianyongtao     100.0   ["wang","ZHANG","LIU"]  {"aaa":10.0,"bb":5.0,"CC":8.0}  {"street":"HENAN","city":"LUOHE","state":"LINYING","zip":null}
wangyangming    200.0   ["ma","zhong"]  {"aaa":6.0,"bb":12.0}   {"street":"hunan","city":"changsha","state":null,"zip":null}
Time taken: 0.114 seconds, Fetched: 2 row(s)

hive> create view view_emp(aa,bb,city,name) as  select deducation['aaa'],deducation['bb'],address.city,name from employees;
OK
_c0     _c1     city    name
Time taken: 0.106 seconds
hive> select * from view_emp;
OK
view_emp.aa     view_emp.bb     view_emp.city   view_emp.name
10.0    5.0     LUOHE   tianyongtao
6.0     12.0    changsha        wangyangming
Time taken: 0.142 seconds, Fetched: 2 row(s)

索引

hive索引功能有限,但仍然可以加速一些操作,一张表的索引数据存放在另一张表中;可以通过explain命令查看语句是否使用了索引;

当逻辑分区太多的时候,建立索引就成为分区的另一种选择;索引会增加存储空间,建立索引也会消耗计算资源;

hive> create index ix_t_name_id on table t_name(id)  as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'   with deferred rebuild in table t_name_index;
OK
Time taken: 0.37 seconds

----------------------

create index ix_t_name_id     索引名

on table t_name(id)   t_name为表名

as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'   创建索引需要的实现类

with deferred rebuild

in table t_name_index;  创建索引后的表名

加载索引数据(重建索引)

lter index  ix_t_name_id  on  t_name rebuild;

查看表的索引

hive> show index on t_name;
OK
idx_name        tab_name        col_names       idx_tab_name    idx_type        comment
ix_t_name_id            t_name                  id                      t_name_index            compact            
Time taken: 0.036 seconds, Fetched: 1 row(s)

删除索引

hive> drop index if exists ix_t_name_id on t_name;
OK
Time taken: 2.467 seconds
hive> show index on t_name;
OK
idx_name        tab_name        col_names       idx_tab_name    idx_type        comment
Time taken: 0.03 seconds

模式设计

按天划分的表就是一种模式,在hive中可以使用按天分区;

分区可以优化查询,但也可能导致一些不利,分区多,每个分区文件太多就会对超出namenode信息处理能力,因为namenode会将所有系统文件的元数据的保存在内存中。

因此一个理想的分区方案应该是不会产生过多的分区和文件夹目录,并且每个文件夹的文件应该足够大,应该是文件系统块大小的若干倍。

按时间分区的策略应该是以不同的时间粒度来确定合适大小的数据积累量,随着时间的推移,分区数量的增长是均匀的,而且每个分区下的文件的大小应该是文件系统中块的若干倍。

这样可以保持是分区足够大,从而优化一般查询的数据吞吐量。同时要考虑粒度在将来是否是适用的。

另一种方案就是适用两个级别的分区,并且适用不同的维度。如时间+地区

唯一键和标准化

关系数据库通常适用唯一键,索引和标准化来存储数据集。hive没有主键,应尽量避免对非标准化数据进行连接操作(join);复杂的数据类型array map struct有助于单行存储一对多的数据;

同一种数据多种处理

可以从一个数据源产生多种聚合,而无需每次聚合都要重新扫描一次

hive > from table1

> INSERT OVERWRITE TABLE2 select *  where action='xx1'

> INSERT OVERWRITE TABLE3 select *  where action='xx2';

对于每个表的分区

ETL过程中,如果中间表也使用分区,一次可以处理多个分区的数据,缺点就是要管理中间表并删除旧分区

分桶表数据存储

分区提供了一个隔离数据,优化查询的方式;分桶是将数据分解成更容易管理的若*分的技术

假定一个表按日期 用户ID分区,用户过多,会导致出现非常多的分区,可能超出文件系统的处理能力;如果我们按照日期分区,按用户ID分桶,则字段值则会根据用户指定的值哈希分发到桶中;同一个用户ID通常会存储到同一个桶内;为了能正确的填充表,首先我们应该设置一个属性来强制hive为目标表的分桶初始化过程设置一个正确的reducer个数。

建表:

hive> show create table tb_bucket;
OK
createtab_stmt
CREATE TABLE `tb_bucket`(
  `userid` int,
  `log` string)
PARTITIONED BY (
  `dateid` int)
CLUSTERED BY (
  userid)
INTO 5 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'line.delim'='\n',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket'
TBLPROPERTIES (
  'transient_lastDdlTime'='1524555904')
Time taken: 0.02 seconds, Fetched: 22 row(s)

导入数据

hive> set hive.enforce.bucketing=true;

load data local inpath '/root/tmpdata/userlog.txt' into table tb_bucket partition(dateid=20180223);

drwx-wx-wx   - root supergroup          0 2018-04-24 15:53 /user/hive/warehouse/gamedw.db/tb_bucket
drwx-wx-wx   - root supergroup          0 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
-rwx-wx-wx   1 root supergroup     162285 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

抽样检查

hive> select * from gamedw.tb_bucket tablesample(bucket 5  out of 5 on userid) where userid=1285962;
OK
tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
Time taken: 0.135 seconds
hive> select * from gamedw.tb_bucket tablesample(bucket 4  out of 5 on userid) where userid=1285962;
OK
tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
Time taken: 0.13 seconds
hive> select * from gamedw.tb_bucket tablesample(bucket 3  out of 5 on userid) where userid=1285962;
OK
tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
1285962 219     20180223
1285962 0       20180223
1285962 29      20180223
1285962 16      20180223
1285962 13      20180223
1285962 0       20180223
1285962 8       20180223
1285962 11      20180223
1285962 8       20180223
1285962 5       20180223
1285962 6       20180223
1285962 7       20180223
1285962 1       20180223
1285962 6       20180223
1285962 14      20180223
1285962 26      20180223
1285962 32      20180223
1285962 3       20180223
1285962 1       20180223
1285962 4       20180223
1285962 165     20180223
Time taken: 0.123 seconds, Fetched: 21 row(s)
hive> select * from gamedw.tb_bucket tablesample(bucket 2  out of 5 on userid) where userid=1285962;
OK
tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
Time taken: 0.107 seconds
hive> select * from gamedw.tb_bucket tablesample(bucket 1  out of 5 on userid) where userid=1285962;
OK
tb_bucket.userid        tb_bucket.log   tb_bucket.dateid
Time taken: 0.139 seconds

分桶的优点,桶是固定的,没有数据的波动,利于数据抽样

为表增加列

alter table tbname add columns(colname type);

几乎总是使用压缩

调优

使用explain

explain extended 可以数据更多信息

join 的优化,大表要么放在最后边,要么使用 /*streamtable(tbname)*/标注,等

本地模式,输入数据量非常小,hive通过本地单台计算机处理所有任务,执行时间明显缩短

set hive.exec.mode.local.auto=true;

并行执行

hive> set hive.exec.parallel=true;

严格模式

调整maper reducer的个数

索引

动态分区调整

存档分区

hive有一种存储格式是har(hadoop archive)hadoop归档文件,har像hdfs系统中tar文件一样,是一个单独的文件,其内部可以存放多个文件和文件夹,某个特定分区,老旧文件访问比较低,当数量庞大,那就会需要hdfs的namenode消耗非常大的代价来管理这些文件。通过将分区下的文件归档成一个大的,但同时可以被HIVE访问的文件,可以减轻namenode的压力,缺点是har查询效率不高,也不是压缩文件,不能节省空间。

首先将hadoop的archive接口拷贝到$HIVE_HOME/auxlib目录下

[root@host lib]# mkdir $HIVE_HOME/auxlib
[root@host lib]# cp hadoop-archives-2.7.4.jar $HIVE_HOME/auxlib
[root@host lib]# pwd
/root/hadoop/hadoop-2.7.4/share/hadoop/tools/lib

查看归档分区的文件

drwx-wx-wx   - root supergroup          0 2018-04-25 11:10 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
-rwx-wx-wx   1 root supergroup        190 2018-04-25 11:04 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/log.txt
-rwx-wx-wx   1 root supergroup     162285 2018-04-24 15:54 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

重启hive

hive> set hive.archive.enabled=true;
hive> alter table tb_bucket archive partition (dateid=20180223);
intermediate.archived is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
in hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1/-ext-10000/partlevel
Please wait... (this may take a while)
Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1/-ext-10000/partlevel to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED
Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223 to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ORIGINAL
Moving hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223_INTERMEDIATE_ARCHIVED to hdfs://localhost:9000/user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
OK
Time taken: 3.562 seconds

查看归档文件

drwxr-xr-x   - root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
drwxr-xr-x   - root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har
-rw-r--r--   1 root supergroup          0 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_SUCCESS
-rw-r--r--   5 root supergroup        380 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_index
-rw-r--r--   5 root supergroup         22 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/_masterindex
-rw-r--r--   1 root supergroup     162475 2018-04-25 11:18 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/data.har/part-0

下面命令将har中的文件提取出来重新发在hdfs中

hive> alter table tb_bucket unarchive partition (dateid=20180223);

查看文件(多了一个空文件夹.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1):

drwx-wx-wx   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket
drwxr-xr-x   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223
drwxr-xr-x   - root supergroup          0 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/.hive-staging_hive_2018-04-25_11-18-41_670_3770144429856467164-1
-rw-r--r--   1 root supergroup        190 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/log.txt
-rw-r--r--   1 root supergroup     162285 2018-04-25 11:21 /user/hive/warehouse/gamedw.db/tb_bucket/dateid=20180223/userlog.txt

函数

用户自定义函数UDF,是允许用户可以扩展HQL的强大功能。ETL中,一个过程往往包含多个步骤,hive语言具多重方式将上一个步骤的输入通过管道传递给下一个步骤,然后在一个查询中产生众多输出。

发现和描述函数

hive> show functions; 列出所有当前hive回话中所加载的所有函数名称,包括内置和用户加载进来的函数

desc function 函数名:展示出函数的简单介绍

hive> desc function trim;
OK
tab_name
trim(str) - Removes the leading and trailing space characters from str
Time taken: 0.01 seconds, Fetched: 1 row(s)

hive> desc function extended trim;
OK
tab_name
trim(str) - Removes the leading and trailing space characters from str
Example:
  > SELECT trim('   facebook  ') FROM src LIMIT 1;
  'facebook'
Time taken: 0.005 seconds, Fetched: 4 row(s)

scala> hivecon.sql("select max(userid) from gamedw.tb_bucket ").show
18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
18/04/25 15:34:01 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-----------+
|max(userid)|
+-----------+
|    2265738|
+-----------+

scala> hivecon.sql("select userid,count(1) from gamedw.tb_bucket group by userid").show
18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
18/04/25 15:35:22 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------+--------+
| userid|count(1)|
+-------+--------+
|2230227|       5|
|2253575|       4|
|2265183|       1|
|2265258|       3|
|2265307|       1|
|2265351|       3|
|2140713|       2|
|2192466|       2|
|2264948|       1|
|2218390|       5|
|2232595|       5|
|2265050|       1|
|2230117|      20|
|1330951|       8|
|2265318|       2|
|2223714|       1|
|2265630|       1|
|2234815|      11|
|1914825|       7|
|2261768|       5|
+-------+--------+
only showing top 20 rows

表生成函数

hive> select explode(subdinates) from employees;
OK
col
wang
ZHANG
LIU
ma
zhong
Time taken: 0.112 seconds, Fetched: 5 row(s)

hive> select name, explode(subdinates) from employees; 无法在表中产生其他列
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

不过hive提供了一个lateral view 来实现,lateral view 需要指定视图名称以及生成的列名

hive> select name, sub from employees lateral view explode(subdinates) subview as sub;
OK
name    sub
tianyongtao     wang
tianyongtao     ZHANG
tianyongtao     LIU
wangyangming    ma
wangyangming    zhong
Time taken: 0.1 seconds, Fetched: 5 row(s)

streaming

streaming提供了另一种处理数据的方式,在streaming job中,hadoop streaming api会为外部进程开启一个I/O通道,然后数据会输送给这个进程,其会从标准输入中读取数据,然后通过标准输出来写结果数据,最后返回到streaming api job。

streaming执行效率会 比对应编写的udf或改写的inputformat对象的效率要低,管道中的序列化然后反序列化通常是低效的,而且以通常的方式很难调试整个程序。

hive提供了多种语法来使用steaming,包括map(),reduce(),transform()

恒等变换

最基本的steaming job就是恒等运算,/bin/cat 整个shell命令可以将传递给它的数据直接输出,所以满足恒等运算。

scala> hivecon.sql("select transform(custname,nianling) using '/bin/cat' as name,age from gamedw.cust where nianling>50").show
+---------+----+
|     name| age|
+---------+----+
|   wangwu|  85|
|   liuqin|  56|
|    hello| 100|
|mahuateng|1001|
+---------+----+

改变类型

转换为其他数据类型

scala> hivecon.sql("select transform(custname,nianling) using '/bin/cat' as (name STRING,age  DOUBLE) from gamedw.cust where nianling>50").show
+---------+------+
|     name|   age|
+---------+------+
|   wangwu|  85.0|
|   liuqin|  56.0|
|    hello| 100.0|
|mahuateng|1001.0|
+---------+------+

投影变换

使用cut命令提取或者映射出特定的字段

scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/cut -f1' as name,sex,age  from gamedw.cust where nianling>50").show
+---------+----+----+
|     name| sex| age|
+---------+----+----+
|   wangwu|null|null|
|   liuqin|null|null|
|    hello|null|null|
|mahuateng|null|null|
+---------+----+----+

操作转换
scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/cat' as name,sex,age  from gamedw.cust where nianling>50").show
+---------+---+----+
|     name|sex| age|
+---------+---+----+
|   wangwu|  1|  85|
|   liuqin|  0|  56|
|    hello|  0| 100|
|mahuateng|  1|1001|
+---------+---+----+

scala> hivecon.sql("select transform(custname,sex,nianling) using '/bin/sed s/0/5/' as name,sex,age  from gamedw.cust where nianling>50").show
+---------+---+----+
|     name|sex| age|
+---------+---+----+
|   wangwu|  1|  85|
|   liuqin|  5|  56|
|    hello|  5| 100|
|mahuateng|  1|1501|
+---------+---+----+

创建表的默认格式是textfile,便于与其他工具共享数据,便于查看和编辑。相对于二进制文件,文本占的空间要大

创建存储格式为sequencefile的表 ,sequencefile是二进制的键值对文件

hive> create table cust00(custname string,sex int,nianling int)stored as sequencefile;
OK
Time taken: 0.139 seconds

csv 逗号分隔值 tsv 制表符分隔值

xml udf

xml是非结构话数据,这使得hive成为xml处理的一个强大的数据库平台 。众多原因之一就是大型xml文档解析和处理所需要的复杂性和资源消耗使得hadoop非常适合成为一个xml数据库平台,因为hadoop可以并行的处理xml文档,hive也成为了处理xml相关数据的完美工具。HiveQL天生就支持访问xml的嵌套元素和值,进一步,可以允许对嵌套的字段,值,属性进行连接操作。xpath xml路径语言,使用xpath作为xml的查询语言的话,hive可以从xml提取数据并进入hive系统进行其他处理;xpath将xml建模成一个节点树。