Hive函数:SUM,AVG,MIN,MAX

时间:2022-12-17 20:09:57

转自:http://lxw1234.com/archives/2015/04/176.htm,Hive分析窗口函数(一) SUM,AVG,MIN,MAX

之前看到大数据田地有关于max()over(partition by)的用法,今天恰好工作中用到了它,但是使用中遇到了一个问题:在max(rsrp)over(partition by buildingid,height) as max_rsrp返回的结果不是分组中的最大值。最中找到了问题的原因:max_rsrp数据类型为string而不是double类型,导致的一个bug问题。

再处理的过程中也再次把大数据田地的中关于sum,avg,max,min的函数用法做了demo,因此有了该参考后的文章。

数据准备:

echo ''>data_file.txt
vim data_file.txt 
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,6
cookie2,2015-04-11,5
cookie2,2015-04-12,7
cookie2,2015-04-13,4
cookie2,2015-04-14,3
cookie2,2015-04-15,5
cookie2,2015-04-16,5
hadoop fs -rm -r /user/jrf/test_data
hadoop fs -mkdir /user/jrf/test_data
hadoop fs -copyFromLocal data_file.txt /user/jrf/test_data/
drop table if exists test_data;
create EXTERNAL TABLE test_data ( cookieid string, createtime string, --day pv INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile location '/user/jrf/test_data/';
select * from test_data; +---------------------+-----------------------+---------------+--+ | test_data.cookieid | test_data.createtime | test_data.pv | +---------------------+-----------------------+---------------+--+ | cookie1 | 2015-04-10 | 1 | | cookie1 | 2015-04-11 | 5 | | cookie1 | 2015-04-12 | 7 | | cookie1 | 2015-04-13 | 3 | | cookie1 | 2015-04-14 | 2 | | cookie1 | 2015-04-15 | 4 | | cookie1 | 2015-04-16 | 4 | | cookie2 | 2015-04-10 | 6 | | cookie2 | 2015-04-11 | 5 | | cookie2 | 2015-04-12 | 7 | | cookie2 | 2015-04-13 | 4 | | cookie2 | 2015-04-14 | 3 | | cookie2 | 2015-04-15 | 5 | | cookie2 | 2015-04-16 | 5 | +---------------------+-----------------------+---------------+--+

SUM — 注意,结果和ORDER BY相关,默认为升序

SELECT cookieid,createtime,pv,
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
SUM(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行  
FROM test_data order by cookieid,createtime;
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookieid  | createtime  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  | pv6  |
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookie1   | 2015-04-10  | 1   | 1    | 1    | 26   | 1    | 6    | 26   |
| cookie1   | 2015-04-11  | 5   | 6    | 6    | 26   | 6    | 13   | 25   |
| cookie1   | 2015-04-12  | 7   | 13   | 13   | 26   | 13   | 16   | 20   |
| cookie1   | 2015-04-13  | 3   | 16   | 16   | 26   | 16   | 18   | 13   |
| cookie1   | 2015-04-14  | 2   | 18   | 18   | 26   | 17   | 21   | 10   |
| cookie1   | 2015-04-15  | 4   | 22   | 22   | 26   | 16   | 20   | 8    |
| cookie1   | 2015-04-16  | 4   | 26   | 26   | 26   | 13   | 13   | 4    |
| cookie2   | 2015-04-10  | 6   | 6    | 6    | 35   | 6    | 11   | 35   |
| cookie2   | 2015-04-11  | 5   | 11   | 11   | 35   | 11   | 18   | 29   |
| cookie2   | 2015-04-12  | 7   | 18   | 18   | 35   | 18   | 22   | 24   |
| cookie2   | 2015-04-13  | 4   | 22   | 22   | 35   | 22   | 25   | 17   |
| cookie2   | 2015-04-14  | 3   | 25   | 25   | 35   | 19   | 24   | 13   |
| cookie2   | 2015-04-15  | 5   | 30   | 30   | 35   | 19   | 24   | 10   |
| cookie2   | 2015-04-16  | 5   | 35   | 35   | 35   | 17   | 17   | 5    |
+-----------+-------------+-----+------+------+------+------+------+------+--+
pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
pv2: 同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

–其他AVG,MIN,MAX,和SUM用法一样。

--AVG
SELECT cookieid,createtime,pv,
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
AVG(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行
AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行  
FROM test_data  order by cookieid,createtime;
+-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+
| cookieid  | createtime  | pv  |         pv1         |         pv2         |         pv3         |        pv4         |        pv5         |         pv6         |
+-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+
| cookie1   | 2015-04-10  | 1   | 1.0                 | 1.0                 | 3.7142857142857144  | 1.0                | 3.0                | 3.7142857142857144  |
| cookie1   | 2015-04-11  | 5   | 3.0                 | 3.0                 | 3.7142857142857144  | 3.0                | 4.333333333333333  | 4.166666666666667   |
| cookie1   | 2015-04-12  | 7   | 4.333333333333333   | 4.333333333333333   | 3.7142857142857144  | 4.333333333333333  | 4.0                | 4.0                 |
| cookie1   | 2015-04-13  | 3   | 4.0                 | 4.0                 | 3.7142857142857144  | 4.0                | 3.6                | 3.25                |
| cookie1   | 2015-04-14  | 2   | 3.6                 | 3.6                 | 3.7142857142857144  | 4.25               | 4.2                | 3.3333333333333335  |
| cookie1   | 2015-04-15  | 4   | 3.6666666666666665  | 3.6666666666666665  | 3.7142857142857144  | 4.0                | 4.0                | 4.0                 |
| cookie1   | 2015-04-16  | 4   | 3.7142857142857144  | 3.7142857142857144  | 3.7142857142857144  | 3.25               | 3.25               | 4.0                 |
| cookie2   | 2015-04-10  | 6   | 6.0                 | 6.0                 | 5.0                 | 6.0                | 5.5                | 5.0                 |
| cookie2   | 2015-04-11  | 5   | 5.5                 | 5.5                 | 5.0                 | 5.5                | 6.0                | 4.833333333333333   |
| cookie2   | 2015-04-12  | 7   | 6.0                 | 6.0                 | 5.0                 | 6.0                | 5.5                | 4.8                 |
| cookie2   | 2015-04-13  | 4   | 5.5                 | 5.5                 | 5.0                 | 5.5                | 5.0                | 4.25                |
| cookie2   | 2015-04-14  | 3   | 5.0                 | 5.0                 | 5.0                 | 4.75               | 4.8                | 4.333333333333333   |
| cookie2   | 2015-04-15  | 5   | 5.0                 | 5.0                 | 5.0                 | 4.75               | 4.8                | 5.0                 |
| cookie2   | 2015-04-16  | 5   | 5.0                 | 5.0                 | 5.0                 | 4.25               | 4.25               | 5.0                 |
+-----------+-------------+-----+---------------------+---------------------+---------------------+--------------------+--------------------+---------------------+--+
--MIN
SELECT cookieid,createtime,pv,
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2,--从起点到当前行,结果同pv1 
MIN(pv) OVER(PARTITION BY cookieid) AS pv3,--分组内所有行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,--当前行+往前3行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,--当前行+往前3行+往后1行
MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行  
FROM test_data order by cookieid,createtime;
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookieid  | createtime  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  | pv6  |
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookie1   | 2015-04-10  | 1   | 1    | 1    | 1    | 1    | 1    | 1    |
| cookie1   | 2015-04-11  | 5   | 1    | 1    | 1    | 1    | 1    | 2    |
| cookie1   | 2015-04-12  | 7   | 1    | 1    | 1    | 1    | 1    | 2    |
| cookie1   | 2015-04-13  | 3   | 1    | 1    | 1    | 1    | 1    | 2    |
| cookie1   | 2015-04-14  | 2   | 1    | 1    | 1    | 2    | 2    | 2    |
| cookie1   | 2015-04-15  | 4   | 1    | 1    | 1    | 2    | 2    | 4    |
| cookie1   | 2015-04-16  | 4   | 1    | 1    | 1    | 2    | 2    | 4    |
| cookie2   | 2015-04-10  | 6   | 6    | 6    | 3    | 6    | 5    | 3    |
| cookie2   | 2015-04-11  | 5   | 5    | 5    | 3    | 5    | 5    | 3    |
| cookie2   | 2015-04-12  | 7   | 5    | 5    | 3    | 5    | 4    | 3    |
| cookie2   | 2015-04-13  | 4   | 4    | 4    | 3    | 4    | 3    | 3    |
| cookie2   | 2015-04-14  | 3   | 3    | 3    | 3    | 3    | 3    | 3    |
| cookie2   | 2015-04-15  | 5   | 3    | 3    | 3    | 3    | 3    | 5    |
| cookie2   | 2015-04-16  | 5   | 3    | 3    | 3    | 3    | 3    | 5    |
+-----------+-------------+-----+------+------+------+------+------+------+--+
--MAX
SELECT cookieid,createtime,pv,
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
MAX(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 --当前行+往后所有行  
FROM test_data order by cookieid,createtime;
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookieid  | createtime  | pv  | pv1  | pv2  | pv3  | pv4  | pv5  | pv6  |
+-----------+-------------+-----+------+------+------+------+------+------+--+
| cookie1   | 2015-04-10  | 1   | 1    | 1    | 7    | 1    | 5    | 7    |
| cookie1   | 2015-04-11  | 5   | 5    | 5    | 7    | 5    | 7    | 7    |
| cookie1   | 2015-04-12  | 7   | 7    | 7    | 7    | 7    | 7    | 7    |
| cookie1   | 2015-04-13  | 3   | 7    | 7    | 7    | 7    | 7    | 4    |
| cookie1   | 2015-04-14  | 2   | 7    | 7    | 7    | 7    | 7    | 4    |
| cookie1   | 2015-04-15  | 4   | 7    | 7    | 7    | 7    | 7    | 4    |
| cookie1   | 2015-04-16  | 4   | 7    | 7    | 7    | 4    | 4    | 4    |
| cookie2   | 2015-04-10  | 6   | 6    | 6    | 7    | 6    | 6    | 7    |
| cookie2   | 2015-04-11  | 5   | 6    | 6    | 7    | 6    | 7    | 7    |
| cookie2   | 2015-04-12  | 7   | 7    | 7    | 7    | 7    | 7    | 7    |
| cookie2   | 2015-04-13  | 4   | 7    | 7    | 7    | 7    | 7    | 5    |
| cookie2   | 2015-04-14  | 3   | 7    | 7    | 7    | 7    | 7    | 5    |
| cookie2   | 2015-04-15  | 5   | 7    | 7    | 7    | 7    | 7    | 5    |
| cookie2   | 2015-04-16  | 5   | 7    | 7    | 7    | 5    | 5    | 5    |
+-----------+-------------+-----+------+------+------+------+------+------+--+


SELECT cookieid,
createtime,
pv,
min(pv) OVER(PARTITION BY cookieid) AS min_pv,
max(pv) OVER(PARTITION BY cookieid) AS max_pv
FROM test_data;
+-----------+-------------+-----+---------+---------+--+
| cookieid  | createtime  | pv  | min_pv  | max_pv  |
+-----------+-------------+-----+---------+---------+--+
| cookie1   | 2015-04-10  | 1   | 1       | 7       |
| cookie1   | 2015-04-16  | 4   | 1       | 7       |
| cookie1   | 2015-04-15  | 4   | 1       | 7       |
| cookie1   | 2015-04-14  | 2   | 1       | 7       |
| cookie1   | 2015-04-13  | 3   | 1       | 7       |
| cookie1   | 2015-04-12  | 7   | 1       | 7       |
| cookie1   | 2015-04-11  | 5   | 1       | 7       |
| cookie2   | 2015-04-16  | 5   | 3       | 7       |
| cookie2   | 2015-04-15  | 5   | 3       | 7       |
| cookie2   | 2015-04-14  | 3   | 3       | 7       |
| cookie2   | 2015-04-13  | 4   | 3       | 7       |
| cookie2   | 2015-04-12  | 7   | 3       | 7       |
| cookie2   | 2015-04-11  | 5   | 3       | 7       |
| cookie2   | 2015-04-10  | 6   | 3       | 7       |
+-----------+-------------+-----+---------+---------+--+