Advanced SQL Techniques: How to Implement Cumulative Deduplication Across Multiple Metrics?

Date: 2024-10-03 07:19:39

Contents

0 Requirements

1 Data Preparation

2 Data Analysis

Method 1: count(distinct) over()

Method 2: size(collect_set()) over()

Method 3: A general solution

3 Summary


0 Requirements

Suppose table A is an event log table; a user with at least one record on a given day is considered active that day.
Table A:
time_id             user_id
2018-01-01 10:00:00 001
2018-01-01 11:03:00 002
2018-01-01 13:18:00 001
2018-01-02 08:34:00 004
2018-01-02 10:08:00 002
2018-01-02 10:40:00 003
2018-01-02 14:21:00 002
2018-01-02 15:39:00 004
2018-01-03 08:34:00 005
2018-01-03 10:08:00 003
2018-01-03 10:40:00 001
2018-01-03 14:21:00 005
Assume users are highly active, producing on average around a thousand event records per day.

Problem: cumulative deduplication.
The expected output is shown below: for each date, the number of distinct active users that day, plus the month-to-date cumulative number of distinct active users. For example, on 2018-01-02 the active users are {002, 003, 004} (3 users), while the month-to-date distinct set is {001, 002, 003, 004} (4 users).

date_id      user_cnt_act  user_cnt_act_month
2018-01-01   2             2
2018-01-02   3             4
2018-01-03   3             5

1 Data Preparation

create table t3 as
select '2018-01-01 10:00:00' as time_id, '001' as user_id
UNION ALL
select '2018-01-01 11:03:00' as time_id, '002' as user_id
UNION ALL
select '2018-01-01 13:18:00' as time_id, '001' as user_id
UNION ALL
select '2018-01-02 08:34:00' as time_id, '004' as user_id
UNION ALL
select '2018-01-02 10:08:00' as time_id, '002' as user_id
UNION ALL
select '2018-01-02 10:40:00' as time_id, '003' as user_id
UNION ALL
select '2018-01-02 14:21:00' as time_id, '002' as user_id
UNION ALL
select '2018-01-02 15:39:00' as time_id, '004' as user_id
UNION ALL
select '2018-01-03 08:34:00' as time_id, '005' as user_id
UNION ALL
select '2018-01-03 10:08:00' as time_id, '003' as user_id
UNION ALL
select '2018-01-03 10:40:00' as time_id, '001' as user_id
UNION ALL
select '2018-01-03 14:21:00' as time_id, '005' as user_id

2 Data Analysis

This is a classic window-function problem: first deduplicate with a GROUP BY at the (day, user) grain, then use window functions to compute the cumulative values.

Method 1: count(distinct) over()

Step 1: deduplicate at the (day, user) grain

select substr(time_id,1,10) as date_id
      ,user_id
from t3
group by substr(time_id,1,10),user_id
+-------------+----------+
| date_id     | user_id  |
+-------------+----------+
| 2018-01-01  | 001      |
| 2018-01-02  | 003      |
| 2018-01-03  | 003      |
| 2018-01-01  | 002      |
| 2018-01-02  | 004      |
| 2018-01-03  | 005      |
| 2018-01-03  | 001      |
| 2018-01-02  | 002      |
+-------------+----------+

Step 2: compute the metrics with window functions on top of the deduplicated set

select date_id
      ,user_id
      ,count(user_id) over(partition by date_id) as user_cnt_act
      -- note: the month-to-date metric must count distinct users
      ,count(distinct user_id) over(partition by substr(date_id,1,7) order by date_id) as user_cnt_act_month
from
(
    select substr(time_id,1,10) as date_id
          ,user_id
    from t3
    group by substr(time_id,1,10),user_id
) t
+-------------+----------+---------------+---------------------+
| date_id     | user_id  | user_cnt_act  | user_cnt_act_month  |
+-------------+----------+---------------+---------------------+
| 2018-01-01  | 001      | 2             | 2                   |
| 2018-01-01  | 002      | 2             | 2                   |
| 2018-01-02  | 002      | 3             | 4                   |
| 2018-01-02  | 003      | 3             | 4                   |
| 2018-01-02  | 004      | 3             | 4                   |
| 2018-01-03  | 001      | 3             | 5                   |
| 2018-01-03  | 003      | 3             | 5                   |
| 2018-01-03  | 005      | 3             | 5                   |
+-------------+----------+---------------+---------------------+

Step 3: aggregate by day. A window function emits one value for every row in its window, so the intermediate result is inflated to one row per (date, user) pair. (Note that all three 2018-01-02 rows above already carry the month-to-date value 4: rows sharing the same date_id are peers in the ORDER BY frame.) A final aggregation collapses this back to one row per day. The final SQL:

select date_id
      ,max(user_cnt_act) as user_cnt_act
      ,max(user_cnt_act_month) as user_cnt_act_month
from(
    select date_id
          ,user_id
          ,count(user_id) over(partition by date_id) as user_cnt_act
          ,count(distinct user_id) over(partition by substr(date_id,1,7) order by date_id) as user_cnt_act_month
    from
    (
        select substr(time_id,1,10) as date_id
              ,user_id
        from t3
        group by substr(time_id,1,10),user_id
    ) t
) t
group by date_id
+-------------+---------------+---------------------+
| date_id     | user_cnt_act  | user_cnt_act_month  |
+-------------+---------------+---------------------+
| 2018-01-01  | 2             | 2                   |
| 2018-01-03  | 3             | 5                   |
| 2018-01-02  | 3             | 4                   |
+-------------+---------------+---------------------+

Alternatively, since count(distinct) already ignores duplicate rows, the windowing can run directly over the raw event rows:

select date_id
      ,max(user_cnt_act) as user_cnt_act
      ,max(user_cnt_act_month) as user_cnt_act_month
from(
    select substr(time_id,1,10) as date_id
          ,user_id
          ,count(distinct user_id) over(partition by substr(time_id,1,10)) as user_cnt_act
          ,count(distinct user_id) over(partition by substr(time_id,1,7) order by substr(time_id,1,10)) as user_cnt_act_month
    from t3
) t
group by date_id
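
One caveat: Hive does not guarantee any row order for a GROUP BY result (note how 2018-01-03 precedes 2018-01-02 in the output above). If deterministic output matters, append an ORDER BY; a minimal sketch, identical to the query above except for the final line:

-- same query as above, with an ORDER BY appended for deterministic output
select date_id
      ,max(user_cnt_act) as user_cnt_act
      ,max(user_cnt_act_month) as user_cnt_act_month
from(
    select substr(time_id,1,10) as date_id
          ,user_id
          ,count(distinct user_id) over(partition by substr(time_id,1,10)) as user_cnt_act
          ,count(distinct user_id) over(partition by substr(time_id,1,7) order by substr(time_id,1,10)) as user_cnt_act_month
    from t3
) t
group by date_id
order by date_id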

Method 2: size(collect_set()) over()

Older Hive versions do not support count(distinct ...) over(). The official documentation says:

Distinct support in Hive 2.1.0 and later (see HIVE-9534)
Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition. Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. The supported syntax is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c)
ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see HIVE-13453). An example is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

Hive 2.1.0 and later support DISTINCT with a plain PARTITION BY clause:

COUNT(DISTINCT a) OVER (PARTITION BY c)

Hive 2.2.0 and later also support ORDER BY and a window specification:

COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)

If your Hive version does not support this syntax, you can instead deduplicate with collect_set() and count the elements with size(). The SQL is:

select dt, s1, s2
from (
    select dt,
           size(collect_set(user_id) over (partition by dt)) as s1,
           size(collect_set(user_id) over (partition by date_format(dt, 'yyyy-MM') order by dt)) as s2
    from (
        select to_date(time_id) as dt, user_id
        from t3
        group by to_date(time_id), user_id
    ) t1
) t2
group by dt, s1, s2
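
To see why the ORDER BY variant accumulates, it can help to inspect the arrays before size() is applied. A sketch for ad-hoc verification (not part of the original solution): with ORDER BY and the default window frame, rows sharing the same dt are peers, so each row sees all distinct users of the month up to and including its day.

select dt
      ,collect_set(user_id) over (partition by date_format(dt, 'yyyy-MM') order by dt) as users_mtd
from (
    select to_date(time_id) as dt, user_id
    from t3
    group by to_date(time_id), user_id
) t1
-- each row carries the set of distinct users seen in the month through dt;
-- e.g. each 2018-01-02 row should contain the four ids 001, 002, 003, 004
-- (array element order may vary)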

Method 3: A general solution

The idea: flag each user's first active day within the month, then take a running sum of the flags; this yields the month-to-date distinct count without any count(distinct) over(). Note: use 'yyyy-MM' (calendar year) rather than 'YYYY' (week-based year) in date_format, since the latter can mis-bucket dates near year boundaries.

with d as (
    select date(time_id) as dt
         , user_id
         , case
               when date(time_id) =
                    min(date(time_id)) over (partition by user_id, date_format(date(time_id), 'yyyy-MM')) then 1
               else 0 end as flag --- marks each user's first active day in the month
    from t3
    group by date(time_id), user_id
)
select dt
     , count(*) as active_cnt
     , sum(sum(flag)) over (partition by date_format(dt, 'yyyy-MM') order by dt) as active_mtd
from d
group by dt;
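
Why this works: a user contributes flag = 1 only on their first active day of the month, so sum(flag) per day is the number of users seen for the first time that day, and its running sum equals the month-to-date distinct count. To verify with the sample data, the CTE body can be run standalone (a sketch, with an ORDER BY added for readability):

select date(time_id) as dt
     , user_id
     , case
           when date(time_id) =
                min(date(time_id)) over (partition by user_id, date_format(date(time_id), 'yyyy-MM')) then 1
           else 0 end as flag
from t3
group by date(time_id), user_id
order by dt, user_id;
-- expected flags: 2018-01-01 -> 001:1, 002:1          (2 new users, MTD 2)
--                 2018-01-02 -> 002:0, 003:1, 004:1   (2 new users, MTD 4)
--                 2018-01-03 -> 001:0, 003:0, 005:1   (1 new user,  MTD 5)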

3 Summary

This problem is a classic use of window functions for cumulative deduplication. Keep in mind that a window function emits one value for every row in its window, inflating the intermediate result, so the final result must be aggregated back down to one row per group.
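
As a reusable sketch of the Method 3 pattern (the table and column names below are placeholders for illustration, not from the article):

-- hypothetical names: events(event_time, entity_id)
select dt
     , count(*) as active_cnt                        -- distinct entities active that day
     , sum(sum(first_seen)) over (partition by date_format(dt, 'yyyy-MM')
                                  order by dt) as active_mtd  -- month-to-date distinct count
from (
    select date(event_time) as dt
         , entity_id
         , case when date(event_time) =
                     min(date(event_time)) over (partition by entity_id,
                                                 date_format(date(event_time), 'yyyy-MM'))
                then 1 else 0 end as first_seen      -- 1 only on first active day in month
    from events
    group by date(event_time), entity_id
) d
group by dt;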