Hive入门(四):HQL查询语句(一)

时间:2021-10-23 13:47:51

select

首先可以通过desc tablename查看表列信息。

hive> desc employees;
OK
col_name    data_type   comment
name                    string                  Employee name       
salary                  float                   Employee salay      
subordinates            array<string>           Names of subordiantes
deductions              map<string,float>       <deductions name, percentages>
address                 struct<street:string,city:string,state:string,zip:int>  Home address  

employees表中信息为如下:

employees.name  employees.salary    employees.subordinates  employees.deductions    employees.address
John Doe    10000.0 ["Mary Smith","Todd Jones"] {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}     {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith  8000.0  ["Bill King"]   {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}    {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones  7000.0  ["Bill King"]   {"Federal Taxes":0.2,"State Taxes":0.03,"Insurance":0.1}    {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}

查询显示表名

hive> set hive.cli.println.header=true;

查询所有列

hive> select * from employees;

查询指定列并指定列别名

hive> select name, salary as employees_salary from employees; 

查询集合数据类型列

查询Array元素

subordinates为一个string类型的数组,数组下标从0开始。如果要看每条数据subordinates列的第一个元素,则用subordinates[0]的形式表示。

hive> select name, subordinates[0] as first_subordinates from employees;
OK
name    first_subordinates
John Doe    Mary Smith
Mary Smith  Bill King
Todd Jones  Bill King
Time taken: 0.592 seconds, Fetched: 3 row(s)

查询Map元素

deductions列数据类型为map

select name, deductions["Insurance"] as Insutance_deduction from employees;
OK
name    insutance_deduction
John Doe    0.1
Mary Smith  0.1
Todd Jones  0.1
Time taken: 0.2 seconds, Fetched: 3 row(s)

查询Struct元素

hive (mydb)> select name, address.city as city from employees;
OK
name    city
John Doe    Chicago
Mary Smith  Chicago
Todd Jones  Oak Park
Time taken: 0.215 seconds, Fetched: 3 row(s)

查询计算列

hive (mydb)> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
           > round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees;
OK
name    salary  federal_taxes   real_wages
JOHN DOE    10000.0 0.2 8000.0
MARY SMITH  8000.0  0.2 6400.0
TODD JONES  7000.0  0.2 5600.0
Time taken: 0.213 seconds, Fetched: 3 row(s)

算数运算符

1.两种数据类型进行计算时,范围较小的数据类型会自动转换成范围较大的数据类型。
2.Hive遵循Java的数据类型的规则。当数据溢出时,不会自动转换成更大的数据类型。

运算函数

数学函数

聚合函数

表生成函数

可以将一列扩展成多列或者多行。

Limit函数

限定查询的行数。

hive (mydb)> select name, salary from employees limit 2;
OK
name    salary
John Doe    10000.0
Mary Smith  8000.0

嵌套select 语句

hive (mydb)> from (
           > select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
           > round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
           > ) e
           > select e.name, e.real_wages;
OK
e.name  e.real_wages
JOHN DOE    8000.0
MARY SMITH  6400.0
TODD JONES  5600.0
Time taken: 0.206 seconds, Fetched: 3 row(s)

case语句

hive (mydb)> select name, salary, 
           > case
           >   when salary < 5000.0 then 'low'
           >   when salary >= 5000.0 and salary < 7000.0 then 'middle'
           >   when salary >= 7000.0 and salary < 10000.0 then 'high'
           > else 'very high'
           > end as bracket from employees;
OK
name    salary  bracket
John Doe    10000.0 very high
Mary Smith  8000.0  high
Todd Jones  7000.0  high

where语句

where语句用于过滤条件。

hive> select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
    > where round(salary * (1 - deductions["Federal Taxes"])) as real_wages > 6000;
OK
name    salary  federal_taxes   real_wages
JOHN DOE    10000.0 0.2 8000.0
MARY SMITH  8000.0  0.2 6400.0
Time taken: 0.271 seconds, Fetched: 2 row(s)

注意,这里不能使用where real_wages > 6000,因为where不能使用列别名。而且round(salary * (1 - deductions["Federal Taxes"]))也不能写成round(salary * (1 - read_wages))。如果想要在where语句中使用简洁的的写法,可以改写成嵌套select的形式。

hive> from (
    > select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees
    > ) e
    > select e.*
    > where e.real_wages > 6000;
OK
e.name  e.salary    e.federal_taxes e.real_wages
JOHN DOE    10000.0 0.2 8000.0
MARY SMITH  8000.0  0.2 6400.0

Time taken: 3.015 seconds, Fetched: 2 row(s)

也可以写成如下:

hive> select e.* from
    > (select upper(name) as name, salary, deductions["Federal Taxes"] as Federal_Taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as real_wages from employees) e
    > where e.real_wages > 6000;

浮点数比较问题

如下所示,要查询Federal Taxes大于0.2的数据,但是等于0.2的数据也被输出了。这是因为标志deductions为map

hive> select name, deductions["Federal Taxes"]
    > from employees where deductions['Federal Taxes'] > 0.2;
OK
name    _c1
John Doe    0.2
Mary Smith  0.2
Todd Jones  0.2
Time taken: 0.231 seconds, Fetched: 3 row(s)

要解决这个办法,可以使得表中数据类型和比较的数据类型保持一致。
第一种办法是,将map

hive> select name, deductions["Federal Taxes"]
    > from employees where deductions['Federal Taxes'] > cast(0.2 as float);
OK
name    _c1

模糊匹配

like

like严格来说是匹配通配符。比如%匹配任意多的字符串。

hive> select address from employees where address.street like '%Ave.';
OK
address
{"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Time taken: 0.184 seconds, Fetched: 2 row(s)

rlike

rlike匹配正则表达式。如下匹配的是'.*(Chicago|Ontario).*'.表示任意字符串,*表示将左边的字符串重复任意次,(Chicago|Ontario)表示Chicago或Ontario。

hive> select address from employees
    > where address.street rlike '.*(Chicago|Ontario).*';
OK
address
{"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
{"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Time taken: 0.181 seconds, Fetched: 2 row(s)

group by

按一个列或者多个列对数据进行分组,然后对每组的数据进行聚合操作。如下,按address.city进行分组,然后通过avg(salary)求每组的salary的均值。group by之后的可以有一个或多个列。下面的例子只使用了struct类型的address列的一个city字段。

hive> select address.city, avg(salary) from employees
    > group by address.city;
address.city    _c1
Chicago 9000.0
Oak Park    7000.0

如果要对聚合操作后的结果进行过滤,可以使用having。

hive> select address.city, avg(salary) from employees
    > group by address.city
    > having avg(salary) > 8000;

如果给avg(salary)起了别名,having是可以引用这个别名的。

hive> select address.city, avg(salary) as avg from employees
    > group by address.city
    > having avg > 8000;