select
首先可以通过desc tablename
查看表列信息。
hive> desc employees;
OK
col_name data_type comment
name string Employee name
salary float Employee salay
subordinates array<string> Names of subordiantes
deductions map<string,float> <deductions name, percentages>
address struct<street:string,city:string,state:string,zip:int> Home address
employees表中信息为如下:
employees.name employees.salary employees.subordinates employees.deductions employees.address
John Doe 10000.0 ["Mary Smith","Todd Jones"] {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1} {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith 8000.0 ["Bill King"] {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1} {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones 7000.0 ["Bill King"] {"Federal Taxes":0.2,"State Taxes":0.03,"Insurance":0.1} {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
查询显示表名
hive> set hive.cli.println.header=true;
查询所有列
hive> select * from employees;
查询指定列并指定列别名
hive> select name, salary as employees_salary from employees;
查询集合数据类型列
查询Array元素
subordinates为一个string类型的数组,数组下标从0开始。如果要看每条数据subordinates列的第一个元素,则用subordinates[0]的形式表示。
hive> select name, subordinates[0] as first_subordinates from employees;
OK
name first_subordinates
John Doe Mary Smith
Mary Smith Bill King
Todd Jones Bill King
Time taken: 0.592 seconds, Fetched: 3 row(s)
查询Map元素
deductions列数据类型为map
select name, deductions["Insurance"] as Insutance_deduction from employees;
OK
name insutance_deduction
John Doe 0.1
Mary Smith 0.1
Todd Jones 0.1
Time taken: 0.2 seconds, Fetched: 3 row(s)
查询Struct元素
hive (mydb)> select name, address.city as city from employees;
OK
name city
John Doe Chicago
Mary Smith Chicago
Todd Jones Oak Park
Time taken: 0.215 seconds, Fetched: 3 row(s)
查询计算列
hive (mydb)> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees;
OK
name salary federal_taxes real_wages
JOHN DOE 10000.0 0.2 8000.0
MARY SMITH 8000.0 0.2 6400.0
TODD JONES 7000.0 0.2 5600.0
Time taken: 0.213 seconds, Fetched: 3 row(s)
算数运算符
1.两种数据类型进行计算时,范围较小的数据类型会自动转换成范围较大的数据类型。
2.Hive遵循Java的数据类型的规则。当数据溢出时,不会自动转换成更大的数据类型。
运算函数
数学函数
聚合函数
表生成函数
可以将一列扩展成多列或者多行。
Limit函数
限定查询的行数。
hive (mydb)> select name, salary from employees limit 2;
OK
name salary
John Doe 10000.0
Mary Smith 8000.0
嵌套select 语句
hive (mydb)> from (
> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
> ) e
> select e.name, e.real_wages;
OK
e.name e.real_wages
JOHN DOE 8000.0
MARY SMITH 6400.0
TODD JONES 5600.0
Time taken: 0.206 seconds, Fetched: 3 row(s)
case语句
hive (mydb)> select name, salary,
> case
> when salary < 5000.0 then 'low'
> when salary >= 5000.0 and salary < 7000.0 then 'middle'
> when salary >= 7000.0 and salary < 10000.0 then 'high'
> else 'very high'
> end as bracket from employees;
OK
name salary bracket
John Doe 10000.0 very high
Mary Smith 8000.0 high
Todd Jones 7000.0 high
where语句
where语句用于过滤条件。
hive> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
> where round(salary * (1 - deductions["Federal Taxes"])) as real_wages > 6000;
OK
name salary federal_taxes real_wages
JOHN DOE 10000.0 0.2 8000.0
MARY SMITH 8000.0 0.2 6400.0
Time taken: 0.271 seconds, Fetched: 2 row(s)
注意,这里不能使用where real_wages > 6000
,因为where不能使用列别名。而且round(salary * (1 - deductions["Federal Taxes"]))
也不能写成round(salary * (1 - read_wages))
。如果想要在where语句中使用简洁的的写法,可以改写成嵌套select的形式。
hive> from (
> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
> ) e
> select e.*
> where e.real_wages > 6000;
OK
e.name e.salary e.federal_taxes e.real_wages
JOHN DOE 10000.0 0.2 8000.0
MARY SMITH 8000.0 0.2 6400.0
Time taken: 3.015 seconds, Fetched: 2 row(s)
也可以写成如下:
hive> select e.* from
> (select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees) e
> where e.real_wages > 6000;
浮点数比较问题
如下所示,要查询Federal Taxes大于0.2的数据,但是等于0.2的数据也被输出了。这是因为标志deductions为map
hive> select name, deductions["Federal Taxes"]
> from employees where deductions['Federal Taxes'] > 0.2;
OK
name _c1
John Doe 0.2
Mary Smith 0.2
Todd Jones 0.2
Time taken: 0.231 seconds, Fetched: 3 row(s)
要解决这个办法,可以使得表中数据类型和比较的数据类型保持一致。
第一种办法是,将map
hive> select name, deductions["Federal Taxes"]
> from employees where deductions['Federal Taxes'] > cast(0.2 as float);
OK
name _c1
模糊匹配
like
like严格来说是匹配通配符。比如%匹配任意多的字符串。
hive> select address from employees where address.street like '%Ave.';
OK
address
{"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Time taken: 0.184 seconds, Fetched: 2 row(s)
rlike
rlike匹配正则表达式。如下匹配的是'.*(Chicago|Ontario).*'
,.
表示任意字符串,*
表示将左边的字符串重复任意次,(Chicago|Ontario)
表示Chicago或Ontario。
hive> select address from employees
> where address.street rlike '.*(Chicago|Ontario).*';
OK
address
{"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Time taken: 0.181 seconds, Fetched: 2 row(s)
group by
按一个列或者多个列对数据进行分组,然后对每组的数据进行聚合操作。如下,按address.city进行分组,然后通过avg(salary)求每组的salary的均值。group by之后的可以有一个或多个列。下面的例子只使用了struct类型的address列的一个city字段。
hive> select address.city, avg(salary) from employees
> group by address.city;
address.city _c1
Chicago 9000.0
Oak Park 7000.0
如果要对聚合操作后的结果进行过滤,可以使用having。
hive> select address.city, avg(salary) from employees
> group by address.city
> having avg(salary) > 8000;
如果给avg(salary)起了别名,having是可以引用这个别名的。
hive> select address.city, avg(salary) as avg from employees
> group by address.city
> having avg > 8000;