understand the meaning of DATEDIFF + LAG

时间:2021-08-16 23:08:27

I need to understand what the meaning in this calculation :

我需要了解这个计算中的含义:

DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 

And how can i implement it without using lag and datediff, for Amazon Redshift, which doesn't have datediff .

我怎样才能实现它而不使用lag和datediff,对于没有dateiff的Amazon Redshift。

This is the full query :

这是完整的查询:

SELECT udid
         ,recday AS day
         ,count(*) AS session_count
         ,DATEDIFF(days, lag(recday, 1) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction1
       ,DATEDIFF(days, lag(recday, 2) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction2
         ,DATEDIFF(days, lag(recday, 3) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction3
       ,DATEDIFF(days, lag(recday, 4) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction4
          ,DATEDIFF(days, lag(recday, 5) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction5
        ,DATEDIFF(days, lag(recday, 6) OVER (PARTITION BY  udid
                                           ORDER BY recday), recday) 
          AS repeat_transaction6
   FROM   vvdays

this is how my data looks -

这就是我的数据的样子 -

10000001 2016-02-03 17:26:03.0 10000001 2016-02-08 21:36:07.0 10000001 2016-02-10 07:48:06.0 10000012 2016-02-06 22:06:42.0 10000012 2016-02-06 22:07:42.0 10000028 2016-02-04 13:18:48.0 10000028 2016-02-04 13:30:42.0 10000028 2016-02-04 13:30:55.0 10000028 2016-02-05 16:48:41.0 10000028 2016-02-05 16:58:34.0 10000028 2016-02-07 15:44:33.0 10000028 2016-02-07 16:29:00.0 10000039 2016-02-03 21:16:49.0 10000039 2016-02-03 21:17:50.0 10000039 2016-02-03 21:18:49.0 10000039 2016-02-03 21:19:49.0 10000039 2016-02-03 21:20:50.0 10000039 2016-02-03 21:21:50.0 10000039 2016-02-03 21:22:51.0 10000039 2016-02-03 21:23:53.0 10000039 2016-02-03 21:24:49.0 10000039 2016-02-03 21:25:50.0 10000039 2016-02-03 21:26:50.0 10000039 2016-02-03 21:27:49.0 10000039 2016-02-05 21:58:59.0 10000039 2016-02-05 21:59:58.0 10000039 2016-02-05 22:00:58.0 10000039 2016-02-05 22:01:58.0 10000039 2016-02-05 22:02:59.0 10000039 2016-02-05 22:03:58.0 10000039 2016-02-05 22:05:00.0 10000039 2016-02-05 22:05:58.0 10000039 2016-02-05 22:06:58.0

10000001 2016-02-03 17:26:03.0 10000001 2016-02-08 21:36:07.0 10000001 2016-02-10 07:48:06.0 10000012 2016-02-06 22:06:42.0 10000012 2016-02-06 22:07:42.0 10000028 2016-02-04 13:18:48.0 10000028 2016-02-04 13:30:42.0 10000028 2016-02-04 13:30:55.0 10000028 2016-02-05 16:48:41.0 10000028 2016-02-05 16:58:34.0 10000028 2016-02-07 15:44:33.0 10000028 2016-02-07 16:29:00.0 10000039 2016-02-03 21:16:49.0 10000039 2016-02-03 21 :17:50.0 10000039 2016-02-03 21:18:49.0 10000039 2016-02-03 21:19:49.0 10000039 2016-02-03 21:20:50.0 10000039 2016-02-03 21:21:50.0 10000039 2016 -02-03 21:22:51.0 10000039 2016-02-03 21:23:53.0 10000039 2016-02-03 21:24:49.0 10000039 2016-02-03 21:25:50.0 10000039 2016-02-03 21: 26:50.0 10000039 2016-02-03 21:27:49.0 10000039 2016-02-05 21:58:59.0 10000039 2016-02-05 21:59:58.0 10000039 2016-02-05 22:00:58.0 10000039 2016- 02-05 22:01:58.0 10000039 2016-02-05 22:02:59.0 10000039 2016-02-05 22:03:58.0 10000039 2016-02-05 22:05:00.0 10000039 2016-02-05 22:05 :58.0 10000039 2016-02-05 22:06:58.0

1 个解决方案

#1


0  

Without seeing your data, I'm guessing that your table 'vvdays' contains the two fields 'udid' and 'recday'. The LAG function is getting the second, third, fourth, fifth, sixth and seventh rows of data based upon the udid. The DATEDIFF is then comparing the first 'recday' to the other lines and returning the number of days between these two dates.

没有看到你的数据,我猜你的表'vvdays'包含两个字段'udid'和'recday'。 LAG函数基于udid获取第二,第三,第四,第五,第六和第七行数据。然后,DATEDIFF将第一个'recday'与其他行进行比较,并返回这两个日期之间的天数。

How you replicate this in redshift is another issue, you could look at using UNPIVOT to get the top 7 results into the same row and then running your DATEDIFF equivalent functions on the fields themselves.

你如何在redshift中复制它是另一个问题,你可以看看使用UNPIVOT将前7个结果放到同一行,然后在字段本身上运行DATEDIFF等效函数。

EDIT: Ok, I've designed a really hacky way of getting this working;

编辑:好的,我设计了一种真正的hacky方式来实现这个功能;

Create temp table for testing purposes;

创建临时表以进行测试;

CREATE TABLE #vvdays (udid int, recday datetime)

Inserting some data EDIT: now using OP's supplied data;

插入一些数据编辑:现在使用OP提供的数据;

VALUES 
('10000001', '2016-02-03 17:26:03.0') 
,('10000001', '2016-02-08 21:36:07.0') 
,('10000001', '2016-02-10 07:48:06.0') 
,('10000012', '2016-02-06 22:06:42.0') 
,('10000012', '2016-02-06 22:07:42.0') 
,('10000028', '2016-02-04 13:18:48.0') 
,('10000028', '2016-02-04 13:30:42.0') 
,('10000028', '2016-02-04 13:30:55.0') 
,('10000028', '2016-02-05 16:48:41.0') 
,('10000028', '2016-02-05 16:58:34.0') 
,('10000028', '2016-02-07 15:44:33.0') 
,('10000028', '2016-02-07 16:29:00.0') 
,('10000039', '2016-02-03 21:16:49.0') 
,('10000039', '2016-02-03 21:17:50.0') 
,('10000039', '2016-02-03 21:18:49.0') 
,('10000039', '2016-02-03 21:19:49.0') 
,('10000039', '2016-02-03 21:20:50.0') 
,('10000039', '2016-02-03 21:21:50.0') 
,('10000039', '2016-02-03 21:22:51.0') 
,('10000039', '2016-02-03 21:23:53.0') 
,('10000039', '2016-02-03 21:24:49.0') 
,('10000039', '2016-02-03 21:25:50.0') 
,('10000039', '2016-02-03 21:26:50.0') 
,('10000039', '2016-02-03 21:27:49.0') 
,('10000039', '2016-02-05 21:58:59.0') 
,('10000039', '2016-02-05 21:59:58.0') 
,('10000039', '2016-02-05 22:00:58.0') 
,('10000039', '2016-02-05 22:01:58.0') 
,('10000039', '2016-02-05 22:02:59.0') 
,('10000039', '2016-02-05 22:03:58.0') 
,('10000039', '2016-02-05 22:05:00.0') 
,('10000039', '2016-02-05 22:05:58.0') 
,('10000039', '2016-02-05 22:06:58.0')

Really horrible query to get this working. With the restrictions you've mentioned and my lack of Amazon specific knowledge I've done the first two values for you below. If you did it this way you're going to end up with a massive statement but it will work. I'd heavily recommend researching further to see what equivalent functions you have available to you;

真的很可怕的查询让这个工作。由于您提到的限制以及我缺乏亚马逊特定的知识,我已经为您完成了前两个值。如果你以这种方式做到这一点,你将最终得到一个大规模的声明,但它会起作用。我强烈建议进一步研究,看看你有哪些等效功能;

SELECT day1.udid
    ,MAX(day1.recday) day1
    ,MAX(day2.recday) day2
    ,DATEDIFF(day,MAX(day2.recday),MAX(day1.recday)) Day2Diff
    ,MAX(day3.recday) day3
    ,DATEDIFF(day,MAX(day3.recday),MAX(day1.recday)) Day3Diff
FROM #vvdays day1
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON day1.udid = day2.udid
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON a.udid = day2.udid
    WHERE a.recday NOT IN (b.recday, day2.recday)
    GROUP BY a.udid
    ) day3 ON day1.udid = day3.udid
GROUP BY day1.udid

The reason I have used MAX in for 'day1' is to return the first date. I have used it in 'day2' at the top level purely to turn it into an aggregate field, you'll only get one result here, it's a false aggregate that just lets the GROUP BY work correctly.

我在'day1'中使用MAX的原因是返回第一个日期。我在*的'day2'中使用它纯粹是为了把它变成一个聚合字段,你只会在这里得到一个结果,这是一个错误的聚合,让GROUP BY正常工作。

#1


0  

Without seeing your data, I'm guessing that your table 'vvdays' contains the two fields 'udid' and 'recday'. The LAG function is getting the second, third, fourth, fifth, sixth and seventh rows of data based upon the udid. The DATEDIFF is then comparing the first 'recday' to the other lines and returning the number of days between these two dates.

没有看到你的数据,我猜你的表'vvdays'包含两个字段'udid'和'recday'。 LAG函数基于udid获取第二,第三,第四,第五,第六和第七行数据。然后,DATEDIFF将第一个'recday'与其他行进行比较,并返回这两个日期之间的天数。

How you replicate this in redshift is another issue, you could look at using UNPIVOT to get the top 7 results into the same row and then running your DATEDIFF equivalent functions on the fields themselves.

你如何在redshift中复制它是另一个问题,你可以看看使用UNPIVOT将前7个结果放到同一行,然后在字段本身上运行DATEDIFF等效函数。

EDIT: Ok, I've designed a really hacky way of getting this working;

编辑:好的,我设计了一种真正的hacky方式来实现这个功能;

Create temp table for testing purposes;

创建临时表以进行测试;

CREATE TABLE #vvdays (udid int, recday datetime)

Inserting some data EDIT: now using OP's supplied data;

插入一些数据编辑:现在使用OP提供的数据;

VALUES 
('10000001', '2016-02-03 17:26:03.0') 
,('10000001', '2016-02-08 21:36:07.0') 
,('10000001', '2016-02-10 07:48:06.0') 
,('10000012', '2016-02-06 22:06:42.0') 
,('10000012', '2016-02-06 22:07:42.0') 
,('10000028', '2016-02-04 13:18:48.0') 
,('10000028', '2016-02-04 13:30:42.0') 
,('10000028', '2016-02-04 13:30:55.0') 
,('10000028', '2016-02-05 16:48:41.0') 
,('10000028', '2016-02-05 16:58:34.0') 
,('10000028', '2016-02-07 15:44:33.0') 
,('10000028', '2016-02-07 16:29:00.0') 
,('10000039', '2016-02-03 21:16:49.0') 
,('10000039', '2016-02-03 21:17:50.0') 
,('10000039', '2016-02-03 21:18:49.0') 
,('10000039', '2016-02-03 21:19:49.0') 
,('10000039', '2016-02-03 21:20:50.0') 
,('10000039', '2016-02-03 21:21:50.0') 
,('10000039', '2016-02-03 21:22:51.0') 
,('10000039', '2016-02-03 21:23:53.0') 
,('10000039', '2016-02-03 21:24:49.0') 
,('10000039', '2016-02-03 21:25:50.0') 
,('10000039', '2016-02-03 21:26:50.0') 
,('10000039', '2016-02-03 21:27:49.0') 
,('10000039', '2016-02-05 21:58:59.0') 
,('10000039', '2016-02-05 21:59:58.0') 
,('10000039', '2016-02-05 22:00:58.0') 
,('10000039', '2016-02-05 22:01:58.0') 
,('10000039', '2016-02-05 22:02:59.0') 
,('10000039', '2016-02-05 22:03:58.0') 
,('10000039', '2016-02-05 22:05:00.0') 
,('10000039', '2016-02-05 22:05:58.0') 
,('10000039', '2016-02-05 22:06:58.0')

Really horrible query to get this working. With the restrictions you've mentioned and my lack of Amazon specific knowledge I've done the first two values for you below. If you did it this way you're going to end up with a massive statement but it will work. I'd heavily recommend researching further to see what equivalent functions you have available to you;

真的很可怕的查询让这个工作。由于您提到的限制以及我缺乏亚马逊特定的知识,我已经为您完成了前两个值。如果你以这种方式做到这一点,你将最终得到一个大规模的声明,但它会起作用。我强烈建议进一步研究,看看你有哪些等效功能;

SELECT day1.udid
    ,MAX(day1.recday) day1
    ,MAX(day2.recday) day2
    ,DATEDIFF(day,MAX(day2.recday),MAX(day1.recday)) Day2Diff
    ,MAX(day3.recday) day3
    ,DATEDIFF(day,MAX(day3.recday),MAX(day1.recday)) Day3Diff
FROM #vvdays day1
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON day1.udid = day2.udid
LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    LEFT JOIN (
    SELECT a.udid
        ,MAX(a.recday) recday
    FROM #vvdays a
    LEFT JOIN (
        SELECT udid
            ,MAX(recday) recday
        FROM #vvdays
        GROUP BY udid
        ) b ON a.udid = b.udid
    WHERE a.recday <> b.recday
    GROUP BY a.udid
    ) day2 ON a.udid = day2.udid
    WHERE a.recday NOT IN (b.recday, day2.recday)
    GROUP BY a.udid
    ) day3 ON day1.udid = day3.udid
GROUP BY day1.udid

The reason I have used MAX in for 'day1' is to return the first date. I have used it in 'day2' at the top level purely to turn it into an aggregate field, you'll only get one result here, it's a false aggregate that just lets the GROUP BY work correctly.

我在'day1'中使用MAX的原因是返回第一个日期。我在*的'day2'中使用它纯粹是为了把它变成一个聚合字段,你只会在这里得到一个结果,这是一个错误的聚合,让GROUP BY正常工作。