分组topN。求出每个部门工资最高的员工信息
准备数据
建表语句
create table employee (empid int ,deptid int ,salary decimal(10,2));
插入数据
insert into employee values(1,10,5500.00);
insert into employee values(2,10,4500.00);
insert into employee values(3,20,1900.00);
insert into employee values(4,20,4800.00);
insert into employee values(5,40,6500.00);
insert into employee values(6,40,14500.00);
insert into employee values(7,40,44500.00);
insert into employee values(8,50,6500.00);
insert into employee values(9,50,7500.00);
数据查询
数据显示为:
+-------+--------+----------+
| empid | deptid | salary |
+-------+--------+----------+
| 3 | 20 | 1900.00 |
| 8 | 50 | 6500.00 |
| 9 | 50 | 7500.00 |
| 1 | 10 | 5500.00 |
| 6 | 40 | 14500.00 |
| 5 | 40 | 6500.00 |
| 4 | 20 | 4800.00 |
| 7 | 40 | 44500.00 |
| 2 | 10 | 4500.00 |
+-------+--------+----------+
需求
需求:根据部门分组,显示每个部门工资最高的员工
根据员工的工资等级来排序,使用排名函数row_number()
SELECT empid
, deptid
, salary
, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee;
结果显示
+-------+--------+----------+------+
| empid | deptid | salary | rank |
+-------+--------+----------+------+
| 4 | 20 | 4800.00 | 1 |
| 3 | 20 | 1900.00 | 2 |
| 1 | 10 | 5500.00 | 1 |
| 2 | 10 | 4500.00 | 2 |
| 7 | 40 | 44500.00 | 1 |
| 6 | 40 | 14500.00 | 2 |
| 5 | 40 | 6500.00 | 3 |
| 9 | 50 | 7500.00 | 1 |
| 8 | 50 | 6500.00 | 2 |
+-------+--------+----------+------+
怎么显示出分数排名第一的员工的信息了,很容易犯的错误是直接使用where或者having来进行筛选
使用where来直接筛选
SELECT empid
, deptid
, salary
, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee
WHERE rank = 1
结果报错:表中没有rank
字段
ERROR: AnalysisException: Could not resolve column/field reference: 'rank'
使用having来进行筛选
FAILED: SemanticException HAVING specified without GROUP BY
结果报错,having语句要接在group by子句的后面
很常见的一个解决方式是,构建出临时表,使用子查询来进行排序筛选或者分页
SELECT * from
(SELECT empid, deptid, salary, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC) AS rank
FROM employee) t where rank = 1
结果显示
+-------+--------+----------+------+
| empid | deptid | salary | rank |
+-------+--------+----------+------+
| 4 | 20 | 4800.00 | 1 |
| 1 | 10 | 5500.00 | 1 |
| 7 | 40 | 44500.00 | 1 |
| 9 | 50 | 7500.00 | 1 |
+-------+--------+----------+------+
如果有多个数据都是最大,第一名会怎么取呢?
再插入三条数据
insert into employee values(10,20,4800.00);
insert into employee values(11,10,5500.00);
insert into employee values(12,50,7500.00);
可以看到结果是变化
+-------+--------+----------+------+
| empid | deptid | salary | rank |
+-------+--------+----------+------+
| 10 | 20 | 4800.00 | 1 |
| 11 | 10 | 5500.00 | 1 |
| 7 | 40 | 44500.00 | 1 |
| 9 | 50 | 7500.00 | 1 |
+-------+--------+----------+------+
可能也有
+-------+--------+----------+------+
| empid | deptid | salary | rank |
+-------+--------+----------+------+
| 10 | 20 | 4800.00 | 1 |
| 1 | 10 | 5500.00 | 1 |
| 7 | 40 | 44500.00 | 1 |
| 12 | 50 | 7500.00 | 1 |
+-------+--------+----------+------+
造成结果不一致,常用的方法是
- 业务属性确保 salary字段不能唯一,则需要针对排序结果再加上 一个唯一字段的排序 比如empid
SELECT * from(SELECT empid, deptid, salary, row_number() OVER (PARTITION BY deptid ORDER BY salary DESC,empid) AS rank
FROM employee) t where rank = 1
这样结果是唯一的
+-------+--------+----------+------+
| empid | deptid | salary | rank |
+-------+--------+----------+------+
| 1 | 10 | 5500.00 | 1 |
| 7 | 40 | 44500.00 | 1 |
| 9 | 50 | 7500.00 | 1 |
| 4 | 20 | 4800.00 | 1 |
+-------+--------+----------+------+
一定会是这几条数据
- 是否可以去掉不必要的order by,这个是由业务逻辑决定的
总结:
1)row_number函数: row_number() 按指定的列进行分组生成行序列, 从 1 开始, 如果两行记录的分组列相同, 行序列也是+1。有重复值,数据的顺序是不固定的,可以采用再加上一个字段来达到类似去重复的效果。
2)over 函数:是一个窗口函数.
over(order by salary) 按照salary排序进行累计,order by是个默认的开窗函数.
over(partition by deptid)按照部门分区
over(partition by deptid order by salary desc)按照部分分区,并按着薪水降序排序.