I'm trying to create sessions from event data (in Vertica, but SQL for any other OLAP database should work).
I have a simple table with columns named Vehicle-ID, Event, and Event-Code; "Session-ID" is the column I want to populate.
I have tried partitioning, LEAD, LAG, and other analytic functions, but with no luck. The logic for creating a session is as follows.
A session starts when you first encounter a started event (1) and ends when the last stopped event (2) is obtained. Once a session has started, any further start events are ignored, and we look for the last stop event. Example: Session-Id-1.
If, for some reason, the next event after a stop event is not a start (i.e. it is running, etc.), it is a bad session, and we want to capture the bad session until we find a new start. Example: Session-Id-2.
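The two rules above can be written down procedurally, which gives a reference to check any SQL result against. This is only a minimal sketch in Python, not a database solution: the function name assign_sessions and the state labels are made up for illustration, and it assumes the events of a single vehicle arrive already sorted by time. It also assumes that a bad session keeps absorbing every event (including stops) until the next 'Started'.

```python
from typing import Iterable, List, Tuple

STARTED, STOPPED = "Started", "Stopped"


def assign_sessions(events: Iterable[str]) -> List[Tuple[int, bool]]:
    """Return (session_id, is_bad) for each event of ONE vehicle, in time order."""
    result = []
    session_id = 0
    state = None  # None (no rows yet), "open", "stopped", or "bad"
    for event in events:
        if event == STARTED:
            # A start opens a new session unless a good session is still open;
            # extra starts inside an open session are ignored.
            if state != "open":
                session_id += 1
                state = "open"
        elif state is None or (state == "stopped" and event != STOPPED):
            # A non-start event where a start was expected: begin a bad session.
            session_id += 1
            state = "bad"
        elif state == "open" and event == STOPPED:
            # First stop of a good session; further stops stay in the session.
            state = "stopped"
        # Every other combination stays in the current session.
        result.append((session_id, state == "bad"))
    return result


# The first 20 events of the sample data (09:01:00 to 09:20:00).
sample = (
    "Started Started Running Started Running Running Running Stopped Stopped "
    "Running Running Running "
    "Started Started Running Started Running Running Running Stopped"
).split()

ids = [sid for sid, _ in assign_sessions(sample)]
```

On the sample, ids holds nine 1s, three 2s (the bad session), and eight 3s.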
I'm trying to create markers using LEAD and LAG, which look at the records before and after and add markers like first_start, final_end, etc., but it's getting clumsy.
Updated SQL query to create the sessions:
SELECT *,
       SUM(FLAG) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ROWS UNBOUNDED PRECEDING) AS SESSION_ID
FROM (
  SELECT *,
         CASE WHEN Prev_Start_time < Prev_Stop_time AND Event != 'Started' THEN 1 ELSE 0 END AS bad_data,
         CASE WHEN
              ( Event = 'Started' AND Prev_Start_time < Prev_Stop_time ) OR
              --( Event = 'Stopped' and Prev_Event = 'Stopped' ) OR
              ( Event = 'Running' AND Prev_Start_time < Prev_Stop_time ) OR
              ( Prev_Event IS NULL )
         THEN 1 END AS FLAG
         --Case when ( Event = 'Stopped' and Next_Event = 'Stopped' ) OR ( Event != 'Started' and Prev_Start_time < Prev_Stop_time) OR ( Prev_Event IS NULL) THEN 1 END AS FLAG
  FROM (
    WITH
    input(Vehicle_ID,Event_time,Event,Event_Code) AS (
                SELECT 1,TIME '09:01:00','Started',1
      UNION ALL SELECT 1,TIME '09:02:00','Started',1
      UNION ALL SELECT 1,TIME '09:03:00','Running',3
      UNION ALL SELECT 1,TIME '09:04:00','Started',1
      UNION ALL SELECT 1,TIME '09:05:00','Running',3
      UNION ALL SELECT 1,TIME '09:06:00','Running',3
      UNION ALL SELECT 1,TIME '09:07:00','Running',3
      UNION ALL SELECT 1,TIME '09:08:00','Stopped',2
      UNION ALL SELECT 1,TIME '09:09:00','Stopped',2
      UNION ALL SELECT 1,TIME '09:10:00','Running',3
      UNION ALL SELECT 1,TIME '09:11:00','Running',3
      UNION ALL SELECT 1,TIME '09:12:00','Running',3
      UNION ALL SELECT 1,TIME '09:13:00','Started',1
      UNION ALL SELECT 1,TIME '09:14:00','Started',1
      UNION ALL SELECT 1,TIME '09:15:00','Running',3
      UNION ALL SELECT 1,TIME '09:16:00','Started',1
      UNION ALL SELECT 1,TIME '09:17:00','Running',3
      UNION ALL SELECT 1,TIME '09:18:00','Running',3
      UNION ALL SELECT 1,TIME '09:19:00','Running',3
      UNION ALL SELECT 1,TIME '09:20:00','Stopped',2
      UNION ALL SELECT 1,TIME '09:21:00','Started',1
      UNION ALL SELECT 1,TIME '09:22:00','Started',1
      UNION ALL SELECT 1,TIME '09:23:00','Running',3
      UNION ALL SELECT 1,TIME '09:24:00','Started',1
      UNION ALL SELECT 1,TIME '09:25:00','Running',3
      UNION ALL SELECT 1,TIME '09:26:00','Running',3
      UNION ALL SELECT 1,TIME '09:27:00','Running',3
      UNION ALL SELECT 1,TIME '09:28:00','Stopped',2
    )
    SELECT *,
           MAX( CASE Event WHEN 'Started' THEN Event_time END ) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS Prev_Start_time,
           MAX( CASE Event WHEN 'Stopped' THEN Event_time END ) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS Prev_Stop_time,
           LAG(Event) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ) AS Prev_Event,
           LEAD(Event) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ) AS Next_Event
    FROM input ) AS T1
) AS T2
Output of the updated query:
Vehicle_ID|Event_time|Event  |Event_Code|Prev_Start_time|Prev_Stop_time|Prev_Event|Next_Event|bad_data|FLAG|SESSION_ID
         1|9:01:00   |Started|         1|NULL           |NULL          |NULL      |Started   |       0|   1|         1
         1|9:02:00   |Started|         1|9:01:00        |NULL          |Started   |Running   |       0|NULL|         1
         1|9:03:00   |Running|         3|9:02:00        |NULL          |Started   |Started   |       0|NULL|         1
         1|9:04:00   |Started|         1|9:02:00        |NULL          |Running   |Running   |       0|NULL|         1
         1|9:05:00   |Running|         3|9:04:00        |NULL          |Started   |Running   |       0|NULL|         1
         1|9:06:00   |Running|         3|9:04:00        |NULL          |Running   |Running   |       0|NULL|         1
         1|9:07:00   |Running|         3|9:04:00        |NULL          |Running   |Stopped   |       0|NULL|         1
         1|9:08:00   |Stopped|         2|9:04:00        |NULL          |Running   |Stopped   |       0|NULL|         1
         1|9:09:00   |Stopped|         2|9:04:00        |9:08:00       |Stopped   |Running   |       1|NULL|         1
         1|9:10:00   |Running|         3|9:04:00        |9:09:00       |Stopped   |Running   |       1|   1|         2
         1|9:11:00   |Running|         3|9:04:00        |9:09:00       |Running   |Running   |       1|   1|         3
         1|9:12:00   |Running|         3|9:04:00        |9:09:00       |Running   |Started   |       1|   1|         4
         1|9:13:00   |Started|         1|9:04:00        |9:09:00       |Running   |Started   |       0|   1|         5
         1|9:14:00   |Started|         1|9:13:00        |9:09:00       |Started   |Running   |       0|NULL|         5
         1|9:15:00   |Running|         3|9:14:00        |9:09:00       |Started   |Started   |       0|NULL|         5
         1|9:16:00   |Started|         1|9:14:00        |9:09:00       |Running   |Running   |       0|NULL|         5
         1|9:17:00   |Running|         3|9:16:00        |9:09:00       |Started   |Running   |       0|NULL|         5
         1|9:18:00   |Running|         3|9:16:00        |9:09:00       |Running   |Running   |       0|NULL|         5
         1|9:19:00   |Running|         3|9:16:00        |9:09:00       |Running   |Stopped   |       0|NULL|         5
         1|9:20:00   |Stopped|         2|9:16:00        |9:09:00       |Running   |Started   |       0|NULL|         5
         1|9:21:00   |Started|         1|9:16:00        |9:20:00       |Stopped   |Started   |       0|   1|         6
         1|9:22:00   |Started|         1|9:21:00        |9:20:00       |Started   |Running   |       0|NULL|         6
         1|9:23:00   |Running|         3|9:22:00        |9:20:00       |Started   |Started   |       0|NULL|         6
         1|9:24:00   |Started|         1|9:22:00        |9:20:00       |Running   |Running   |       0|NULL|         6
         1|9:25:00   |Running|         3|9:24:00        |9:20:00       |Started   |Running   |       0|NULL|         6
         1|9:26:00   |Running|         3|9:24:00        |9:20:00       |Running   |Running   |       0|NULL|         6
         1|9:27:00   |Running|         3|9:24:00        |9:20:00       |Running   |Stopped   |       0|NULL|         6
         1|9:28:00   |Stopped|         2|9:24:00        |9:20:00       |Running   |NULL      |       0|NULL|         6
2 Answers
#1
1
The following seems to match your description. I assume there's a column (called whatever in the query below) that orders your data uniquely (probably a timestamp).
This will result in two STAT-steps in Teradata:
SELECT dt.*
,Sum(flag) -- (cumulative sum or COUNT(*) to create the session id
Over (PARTITION BY Vehicle_ID
ORDER BY whatever
ROWS Unbounded Preceding) AS session_id_
FROM
(
SELECT mytable.*
-- previous start
,Max(CASE event_code WHEN 1 THEN whatever END)
Over (PARTITION BY Vehicle_ID
ORDER BY whatever
ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS prev_start
-- previous stop
,Max(CASE event_code WHEN 2 THEN whatever END)
Over (PARTITION BY Vehicle_ID
ORDER BY whatever
ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS prev_stop
-- previous event
,Lag(event_code)
Over (PARTITION BY Vehicle_ID
ORDER BY whatever) AS lag_event
-- no new session started after previous stop and current event is not start = bad data
,CASE WHEN prev_start < prev_stop AND event_code <> 1 THEN 1 ELSE 0 END AS bad_data
-- new session starts at
,CASE WHEN (event_code <> 2 AND (lag_event = 2) ) -- first row after a stop (ignore consecutive stops)
OR (event_code = 1 AND prev_start < prev_stop) -- first row after bad data
OR lag_event IS NULL -- first row
THEN 1
END AS flag
FROM mytable
) AS dt
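Oracle and SQL Server do not let you reference a column alias in the same SELECT list that defines it, so there you need to add another nesting level (compute prev_start, prev_stop, and lag_event in an inner derived table) to be able to use the aliases within the CASEs.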
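To make the flag-plus-running-sum idea in the query above easier to follow, here is a rough Python re-creation of the same three conditions. It is only a sketch: the function name session_ids and the (order_key, event_code) row layout are assumptions for illustration, it handles one vehicle (one partition) at a time, and it emits 0 where the SQL CASE yields NULL, which leaves the running sum unchanged.

```python
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def session_ids(rows: Sequence[Tuple[int, int]]) -> List[int]:
    """rows: (order_key, event_code) for ONE vehicle, sorted by order_key.

    event_code follows the question: 1 = started, 2 = stopped, 3 = running.
    """
    flags = []
    prev_start: Optional[int] = None  # latest start key among preceding rows
    prev_stop: Optional[int] = None   # latest stop key among preceding rows
    lag_event: Optional[int] = None   # event_code of the previous row
    for key, code in rows:
        # In SQL a comparison with NULL is never true, so guard against None.
        start_before_stop = (
            prev_start is not None
            and prev_stop is not None
            and prev_start < prev_stop
        )
        new_session = (
            (code != 2 and lag_event == 2)         # first row after a stop
            or (code == 1 and start_before_stop)   # first start after bad data
            or lag_event is None                   # first row of the partition
        )
        flags.append(1 if new_session else 0)
        # Update the "preceding rows" state only after the current row is flagged.
        if code == 1:
            prev_start = key
        elif code == 2:
            prev_stop = key
        lag_event = code
    # The cumulative sum of the flags is the session id.
    return list(accumulate(flags))


codes = [1, 1, 3, 1, 3, 3, 3, 2, 2, 3, 3, 3, 1, 1, 3, 1, 3, 3, 3, 2]
rows = list(enumerate(codes, start=1))  # use the row position as order key
result = session_ids(rows)
```

For the 20 sample rows from 09:01:00 to 09:20:00 this gives session 1 for the first nine rows, session 2 for the three bad 'Running' rows, and session 3 for the rest.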
#2
3
In Vertica, I'd use the MATCH clause. It also leaves out the unwanted rows: the 'Running' events that don't belong to any session. Try this:
WITH
-- your input as you gave it
input(tm,Vehicle_ID,Col1,Event,Event_Code,Session_ID) AS (
SELECT TIME '09:01:00',1,'A','Started',1,1
UNION ALL SELECT TIME '09:02:00',1,'B','Started',1,1
UNION ALL SELECT TIME '09:03:00',1,'C','Running',3,1
UNION ALL SELECT TIME '09:04:00',1,'A','Started',1,1
UNION ALL SELECT TIME '09:05:00',1,'B','Running',3,1
UNION ALL SELECT TIME '09:06:00',1,'C','Running',3,1
UNION ALL SELECT TIME '09:07:00',1,'A','Running',3,1
UNION ALL SELECT TIME '09:08:00',1,'A','Stopped',2,1
UNION ALL SELECT TIME '09:09:00',1,'B','Stopped',2,1
UNION ALL SELECT TIME '09:10:00',1,'C','Running',3,2
UNION ALL SELECT TIME '09:11:00',1,'A','Running',3,2
UNION ALL SELECT TIME '09:12:00',1,'B','Running',3,2
UNION ALL SELECT TIME '09:13:00',1,'A','Started',1,3
UNION ALL SELECT TIME '09:14:00',1,'B','Started',1,3
UNION ALL SELECT TIME '09:15:00',1,'C','Running',3,3
UNION ALL SELECT TIME '09:16:00',1,'A','Started',1,3
UNION ALL SELECT TIME '09:17:00',1,'B','Running',3,3
UNION ALL SELECT TIME '09:18:00',1,'C','Running',3,3
UNION ALL SELECT TIME '09:19:00',1,'A','Running',3,3
UNION ALL SELECT TIME '09:20:00',1,'A','Stopped',2,3
)
-- here is where the real select starts ..
SELECT
pattern_id()
, match_id()
, event_name()
, *
FROM input
MATCH(
PARTITION BY vehicle_id
ORDER BY tm
DEFINE
started_event AS (event='Started')
, running_event AS (event='Running')
, stopped_event AS (event='Stopped')
PATTERN p AS (started_event+ (running_event|started_event)* stopped_event+)
)
;
pattern_id|match_id|event_name |tm |Vehicle_ID|Col1|Event |Event_Code|Session_ID
1| 1|started_event|09:01:00| 1|A |Started| 1| 1
1| 2|started_event|09:02:00| 1|B |Started| 1| 1
1| 3|running_event|09:03:00| 1|C |Running| 3| 1
1| 4|started_event|09:04:00| 1|A |Started| 1| 1
1| 5|running_event|09:05:00| 1|B |Running| 3| 1
1| 6|running_event|09:06:00| 1|C |Running| 3| 1
1| 7|running_event|09:07:00| 1|A |Running| 3| 1
1| 8|stopped_event|09:08:00| 1|A |Stopped| 2| 1
1| 9|stopped_event|09:09:00| 1|B |Stopped| 2| 1
2| 1|started_event|09:13:00| 1|A |Started| 1| 3
2| 2|started_event|09:14:00| 1|B |Started| 1| 3
2| 3|running_event|09:15:00| 1|C |Running| 3| 3
2| 4|started_event|09:16:00| 1|A |Started| 1| 3
2| 5|running_event|09:17:00| 1|B |Running| 3| 3
2| 6|running_event|09:18:00| 1|C |Running| 3| 3
2| 7|running_event|09:19:00| 1|A |Running| 3| 3
2| 8|stopped_event|09:20:00| 1|A |Stopped| 2| 3
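The PATTERN above reads like a regular expression over rows, so a loose analogy in Python may help to see which rows it picks up. This is just an illustration with an ordinary regex over one character per event; it is not how Vertica evaluates MATCH, and the names SYMBOL, PATTERN, and matched_spans are invented here.

```python
import re
from typing import List, Tuple

# One character per event so the event sequence becomes a plain string.
SYMBOL = {"Started": "S", "Running": "R", "Stopped": "T"}

# Same shape as: started_event+ (running_event|started_event)* stopped_event+
PATTERN = re.compile(r"S+[RS]*T+")


def matched_spans(events: List[str]) -> List[Tuple[int, int]]:
    """Return (first_index, last_index), 0-based, of each match for one vehicle."""
    encoded = "".join(SYMBOL[e] for e in events)
    return [(m.start(), m.end() - 1) for m in PATTERN.finditer(encoded)]


events = (
    "Started Started Running Started Running Running Running Stopped Stopped "
    "Running Running Running "
    "Started Started Running Started Running Running Running Stopped"
).split()

spans = matched_spans(events)
```

Here spans is [(0, 8), (12, 19)]: rows 09:01:00 to 09:09:00 and rows 09:13:00 to 09:20:00. The three 'Running' rows in between (indexes 9 to 11) cannot begin a match because the pattern has to start with a started event, which is why they drop out of the result.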