sql如何实现用户的连续最大登陆数或者连续最大签到数呢?
思路拆解:连续登陆/签到是指登陆的天数隔为1
方法一:
观察下面列表,假设是某一用户的登陆数据,可以看到只需要统计连续最大的1就是该用户的最大登陆数
所以思路是
相邻的日期相减,拿到间隔数字,(连续最大的1的个数字+1)即为连续最大登陆数比如2020-01-01 到2020-01-03 中间有2个连续的1
日期 | 是否连续 | 间隔 |
2020/1/1 | —— | 0 |
2020/1/2 | 是 | 1 |
2020/1/3 | 是 | 1 |
2020/1/5 | 否 | 2 |
2020/1/6 | 是 | 1 |
2020/1/7 | 是 | 1 |
2020/1/8 | 是 | 1 |
2020/1/13 | 否 | 5 |
2020/1/15 | 否 | 2 |
2020/1/19 | 否 | 4 |
2020/1/20 | 是 | 1 |
2020/1/21 | 是 | 1 |
日期间隔我们可以把天转化为数组,然后用arrayDifference() 判断数组间的两个相邻间隔得值
(select uid,groupArray(d_num) as arr_days
from
(select
d,
uid,
toDayOfYear(d) as d_num --由于arrayDifference()函数不支持时间格式,我们把时间转化成1-365的数字
from dau where uid=1
and d>='2020-01-01'
order by d
)
group by uid
结果如下:
select uid, arrayDifference(arr_days) as gap
from
(select uid,groupArray(d_num) as arr_days from
(select
d,uid,toDayOfYear(d) as d_num from dau where uid=1
and d>='2020-01-01'
order by d
)
group by uid
)
结果如下:
接下来提取相邻的1,作为一组,判断出长度最大的一组即为我们的连续登陆数,考虑split相关的函数,但是除了1之外的其他数字不确定不能用来作为拆分的基础 我们放弃split相关的函数
我发现 alphaTokens()函数可以识别出a-z范围的字母,我们可以利用这个特性作为处理以上结果
我们把结果中的1全部替换成a
select uid,alphaTokens(replaceAll(replaceAll(toString(gap),'1','a'),',','') ) as arr
from
(select uid, arrayDifference(arr_days) as gap
from
(select uid,groupArray(d_num) as arr_days from
(select
d,uid,toDayOfYear(d) as d_num from dau where uid=1
and d>='2020-01-01'
order by d
)
group by uid
)
)
得到如下结果
接下来我们取最长的一组就是改uid的最大连续登陆数
select uid,length(arrayJoin(arr_a))+1 as `连续登陆天数`
from
(select uid,alphaTokens(replaceAll(replaceAll(toString(gap),'1','a'),',','') ) as arr_a
from
(select uid, arrayDifference(arr_days) as gap
from
(select uid,groupArray(d_num) as arr_days from
(select
d,uid,toDayOfYear(d) as d_num from dau where uid=1
and d>='2020-01-01'
order by d
)
group by uid
)
)
)
得到结果:
我们取max值
select uid,max(`连续登陆天数`) from
(select uid,length(arrayJoin(arr_a))+1 as `连续登陆天数`
from
(select uid,alphaTokens(replaceAll(replaceAll(toString(gap),'1','a'),',','') ) as arr_a
from
(select uid, arrayDifference(arr_days) as gap
from
(select uid,groupArray(d_num) as arr_days from
(select
d,uid,toDayOfYear(d) as d_num from dau where uid=1
and d>='2020-01-01'
order by d
)
group by uid
)
)
)
)
group by uid
得到用户uid=1的 最大连续登陆结果
方法二、
方法一的直接去拆解需要多次转化,有没有更简便得方法呢?
我们继续看,因为连续登陆得间隔是1天,如果我们把时间都减去这(n+1)天,会得到一个固定的天,比如2020-01-01 减去1,是 2019-12-31,2020-01-02减去2,也是2019-12-31,我们只需要判断这个相同天值得数量就可以判断连续最大登陆数了
日期 | 是否连续 | 间隔 |
2020/1/1 | ||
2020/1/2 | 是 | 1 |
2020/1/3 | 是 | 1 |
2020/1/5 | 否 | 2 |
2020/1/6 | 是 | 1 |
2020/1/7 | 是 | 1 |
2020/1/8 | 是 | 1 |
2020/1/13 | 否 | 5 |
2020/1/15 | 否 | 2 |
2020/1/19 | 否 | 4 |
2020/1/20 | 是 | 1 |
2020/1/21 | 是 | 1 |
select uid,
(rowNumberInAllBlocks()+1) id, --对取到得结果加上序号,默认从0开始
d,
d-id as dt
from
(select d,uid from dws.dw_dau where uid=1
and d>='2020-01-01'
order by d
)
得到结果
select uid,dt,count() num
from
(select uid,(rowNumberInAllBlocks()+1) id,d,d-id as dt from
(select d,uid from dau where uid=1
and d>='2020-01-01'
order by d
)
)
group by uid,dt
得到结果
取最大值即为最大连续登陆数
现实得产品需求中会有,按月去计算用户得连续登陆数,如何实现呢?只需要加上月就行了
select uid,month,max(num)
from
(select uid,month,dt,count() num
from
(select uid,month,(rowNumberInAllBlocks()+1) id,d,d-id as dt from
(select d,uid,toMonth(d) month from dau where uid=1
and d>='2020-01-01'
order by d
)
)
group by uid,dt,month
)
group by uid,month
order by month
得到结果