Two Ways to Use Variables in Hive Development

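When writing data-analysis code in Hive, you often need to change run-time parameters. A typical case is the date value in a select statement: you want to look at different dates at different times, so the date must be changeable dynamically. When the code base is large and has many parameters, replacing literal values with variables becomes essential. This article summarizes two ways of passing parameters into Hive SQL to meet this kind of need.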

Preparing the test table and test data

First, prepare a test table and test data for the later examples:

hive> create database test;

OK

Time taken: 2.606 seconds

Then run the SQL file that creates the table and loads the data:

[czt@www.crazyant.net testHivePara]$ hive -f student.sql

Hive history file=/tmp/crazyant.net/hive_job_log_czt_201309131615_1720869864.txt

OK

Time taken: 2.131 seconds

OK

Time taken: 0.878 seconds

Copying data from file:/home/users/czt/testdata_student

Copying file: file:/home/users/czt/testdata_student

Loading data to table test.student

OK

Time taken: 1.76 seconds

The contents of student.sql are as follows:

use test; 

-- Student information table

create table IF NOT EXISTS student(

    sno        bigint    comment 'student number' ,

    sname    string    comment 'name' ,

    sage    bigint    comment 'age' ,

    pdate    string    comment 'enrollment date'

)

COMMENT 'Student information table'

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH

    '/home/users/czt/testdata_student'

INTO TABLE student;

The test data file testdata_student contains the following rows (fields are tab-separated):

1    name1    21    20130901

2    name2    22    20130901

3    name3    23    20130901

4    name4    24    20130901

5    name5    25    20130902

6    name6    26    20130902

7    name7    27    20130902

8    name8    28    20130902

9    name9    29    20130903

10    name10    30    20130903

11    name11    31    20130903

12    name12    32    20130904

13    name13    33    20130904
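Because the table is declared with FIELDS TERMINATED BY '\t', the data file needs real tab characters between fields. As a minimal sketch, the 13 rows listed above could be generated as follows; the output path ./testdata_student is an assumption and should be adjusted to match the path used in LOAD DATA.

```shell
#!/bin/bash
# Write the 13 sample rows as tab-separated text: sno, sname, sage, pdate.
# The output path is an assumption; change it to the path used by LOAD DATA.
outfile="./testdata_student"

for sno in 1 2 3 4 5 6 7 8 9 10 11 12 13; do
    # Map each student number to the enrollment date shown in the listing.
    case "$sno" in
        1|2|3|4)  pdate=20130901 ;;
        5|6|7|8)  pdate=20130902 ;;
        9|10|11)  pdate=20130903 ;;
        *)        pdate=20130904 ;;
    esac
    # printf expands \t to a real tab, which the table definition requires.
    printf '%d\tname%d\t%d\t%s\n' "$sno" "$sno" "$((sno + 20))" "$pdate"
done > "$outfile"
```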

Method 1: set variables in the shell and use them directly in hive -e

The test shell script (shellhive.sh):

#!/bin/bash

tablename="student"

limitcount="8"

hive -S -e "use test; select * from ${tablename} limit ${limitcount};"

Result of the run:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh

+ tablename=student

+ limitcount=8

+ hive -S -e 'use test; select * from student limit 8;'

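1    name1    21    20130901

2    name2    22    20130901

3    name3    23    20130901

4    name4    24    20130901

5    name5    25    20130902

6    name6    26    20130902

7    name7    27    20130902

8    name8    28    20130902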

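Hive is a SQL-like language and lacks the shell's flexibility and flow control, so combining shell with Hive is a very common development pattern: define a variable in the shell, then reference it directly in the hive -e statement.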
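The pattern can be made a little safer by building the statement in a variable and checking the parameters first, since an empty value would produce invalid SQL such as "limit ;". The following is only a sketch; the validation rule and the guard around the hive call are assumptions added for illustration, not part of the original script.

```shell
#!/bin/bash
# Shell variables are expanded inside the double-quoted string before
# hive sees it, so hive receives plain SQL text.
tablename="student"
limitcount="8"

# Refuse empty or non-numeric limits (illustrative check, not in the original).
case "$limitcount" in
    ''|*[!0-9]*) echo "limitcount must be a non-negative integer" >&2; exit 1 ;;
esac

sql="use test; select * from ${tablename} limit ${limitcount};"
echo "$sql"    # print the final statement so it can be inspected

# Run the statement only when the hive CLI is available on this machine.
if command -v hive >/dev/null 2>&1; then
    hive -S -e "$sql"
fi
```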

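Note: variables defined with -hiveconf cannot be referenced directly in a double-quoted hive -e string

Modify the shell script above so that the date parameter is defined with -hiveconf: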

#!/bin/bash

tablename="student"

limitcount=""

hive -S \

    -hiveconf enter_school_date="" \

    -hiveconf min_age="" \

    -e \

    "    use test; \

        select * from ${tablename} \

        where \

            pdate='${hiveconf:enter_school_date}' \

            and \

            sage>'${hiveconf:min_age}' \

        limit ${limitcount};"

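The run fails. The script is executed by the shell, so the shell itself tries to expand ${hiveconf:enter_school_date} and ${hiveconf:min_age} inside the double-quoted string. No such shell variables are defined, so empty strings are substituted in their place.

At run time the SQL statement is expanded into the following: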

+ hive -S -hiveconf enter_school_date= -hiveconf min_age= -e 'use test; explain select * from student where pdate='\'''\'' and sage>'\'''\'' limit ;'
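The empty substitution comes from the shell, not from Hive, so one possible workaround is to escape the dollar sign (or single-quote the statement) so that the ${hiveconf:...} references reach Hive untouched. The sketch below assumes this approach; the values 20130902 and 26 are placeholders chosen for illustration and are not taken from the original run.

```shell
#!/bin/bash
tablename="student"
limitcount="8"
# Placeholder values for illustration only.
enter_school_date="20130902"
min_age="26"

# \$ stops the shell from expanding the hiveconf references, while
# ${tablename} and ${limitcount} are still expanded by the shell.
sql="use test;
select * from ${tablename}
where pdate = '\${hiveconf:enter_school_date}'
  and sage > \${hiveconf:min_age}
limit ${limitcount};"

echo "$sql"    # the hiveconf references should still be visible here

# Run the statement only when the hive CLI is available on this machine.
if command -v hive >/dev/null 2>&1; then
    hive -S \
        -hiveconf enter_school_date="$enter_school_date" \
        -hiveconf min_age="$min_age" \
        -e "$sql"
fi
```

Printing the statement before running it makes it easy to confirm which variables were expanded by the shell and which were left for Hive.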

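Method 2: define variables with -hiveconf and use them in a SQL file

Line breaks and similar details are awkward to handle on the command line, so hive -e only suits short SQL statements. Longer logic is usually written in .hql files and invoked with hive -f. In that case, variables can be defined with -hiveconf and then referenced directly in the SQL file.

First write the calling shell script: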

#!/bin/bash

hive -hiveconf enter_school_date="" -hiveconf min_ag="" -f testvar.sql

The contents of the called file testvar.sql:

use test; 

select * from student

where

    pdate='${hiveconf:enter_school_date}'

    and

    sage > '${hiveconf:min_ag}'

limit ;

Execution:

[czt@www.crazyant.net testHivePara]$ sh -x shellhive.sh

+ hive -hiveconf enter_school_date= -hiveconf min_ag= -f testvar.sql

Hive history file=/tmp/czt/hive_job_log_czt_201309131651_2035045625.txt

OK

Time taken: 2.143 seconds

Total MapReduce jobs =

Launching Job  out of

Number of reduce tasks is set to  since there's no reduce operator

Kill Command = hadoop job -kill job_20130911213659_42303

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %

Ended Job = job_20130911213659_42303

OK

       name7

       name8

Time taken: 54.268 seconds
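When the values change from run to run, the calling script can take them as arguments instead of hard-coding them. The sketch below is one way to do that; the function name run_testvar, the validation rules, and the example values are hypothetical. The key is spelled min_ag to match the ${hiveconf:min_ag} reference in testvar.sql.

```shell
#!/bin/bash
# Hypothetical wrapper around "hive -f testvar.sql".
# $1 = enrollment date (yyyymmdd), $2 = minimum age.
run_testvar() {
    enter_school_date="$1"
    min_ag="$2"    # spelled min_ag to match ${hiveconf:min_ag} in testvar.sql

    # Reject malformed values before they reach the SQL file.
    case "$enter_school_date" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "enter_school_date must be 8 digits (yyyymmdd)" >&2; return 1 ;;
    esac
    case "$min_ag" in
        ''|*[!0-9]*) echo "min_ag must be a non-negative integer" >&2; return 1 ;;
    esac

    # Show the exact command line, then run it if hive is installed.
    echo "hive -hiveconf enter_school_date=${enter_school_date} -hiveconf min_ag=${min_ag} -f testvar.sql"
    if command -v hive >/dev/null 2>&1; then
        hive -hiveconf enter_school_date="${enter_school_date}" \
             -hiveconf min_ag="${min_ag}" \
             -f testvar.sql
    fi
}

# Example call with placeholder values (not taken from the original run).
run_testvar 20130902 26
```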

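Summary

This article described two ways to use variables in Hive. The first defines a variable in the shell and references it as ${var_name} directly in the SQL passed to hive -e. The second sets variables with -hiveconf, in the form hive -hiveconf key=value -f run.sql, and references them as ${hiveconf:varname} in the SQL file. Together these two methods cover the need to pass parameters to Hive during development, and they help improve both development efficiency and code quality.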
