I'm trying to analyze a funnel using event data in Redshift, and I'm having difficulty finding an efficient query to extract that data.
For example, in Redshift I have:
timestamp         action        user id
----------------  ------------  -------
2015-05-05 12:00  homepage      1
2015-05-05 12:01  product page  1
2015-05-05 12:02  homepage      2
2015-05-05 12:03  checkout      1
I would like to extract the funnel statistics. For example:
homepage_count  product_page_count  checkout_count
--------------  ------------------  --------------
100             50                  25
Where homepage_count represents the number of distinct users who visited the homepage, product_page_count represents the number of distinct users who visited the product page after visiting the homepage, and checkout_count represents the number of distinct users who checked out after visiting the homepage and the product page.
What would be the best query to achieve that with Amazon Redshift? Is it possible to do this with a single query?
3 Answers
#1
4
I think the best method might be to add flags to the data for the first visit of each type for each user and then use these for aggregation logic:
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select userid,
             -- first time each user performed each action (null if never);
             -- "timestamp" is double-quoted in case it is treated as a reserved word
             min(case when action = 'homepage' then "timestamp" end) as ts_homepage,
             min(case when action = 'product page' then "timestamp" end) as ts_productpage,
             min(case when action = 'checkout' then "timestamp" end) as ts_checkout
      from events  -- placeholder: substitute your own event table name
      group by userid
     ) t;
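As a quick sanity check, the same first-visit approach can be run against the four sample rows from the question. This is only a sketch: the inline sample_events CTE, the first_visits CTE, and the ts column name (used here in place of timestamp) are assumptions made for illustration, not part of the original schema.

```sql
-- Hypothetical inline sample data mirroring the four rows in the question.
with sample_events as (
    select cast('2015-05-05 12:00' as timestamp) as ts,
           cast('homepage' as varchar(20)) as action,
           1 as userid
    union all
    select cast('2015-05-05 12:01' as timestamp), 'product page', 1
    union all
    select cast('2015-05-05 12:02' as timestamp), 'homepage', 2
    union all
    select cast('2015-05-05 12:03' as timestamp), 'checkout', 1
),
-- One row per user with the first time each funnel step was seen (null if never).
first_visits as (
    select userid,
           min(case when action = 'homepage' then ts end) as ts_homepage,
           min(case when action = 'product page' then ts end) as ts_productpage,
           min(case when action = 'checkout' then ts end) as ts_checkout
    from sample_events
    group by userid
)
-- A comparison against a null timestamp is never true, so users who
-- skipped a step fall through to the "else 0" branch.
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
       sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
       sum(case when ts_checkout > ts_productpage
                 and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from first_visits;
```

With these rows the result should be 2, 1, 1: both users reached the homepage, and only user 1 went on to the product page and then checkout.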
#2
0
The above answer is correct. I have modified it for people using AWS Mobile Analytics with Redshift.
select sum(case when ts_homepage is not null then 1 else 0 end) as homepage_count,
sum(case when ts_productpage > ts_homepage then 1 else 0 end) as productpage_count,
sum(case when ts_checkout > ts_productpage and ts_productpage > ts_homepage then 1 else 0 end) as checkout_count
from (select client_id,
min(case when event_type = 'App Launch' then event_timestamp end) as ts_homepage,
min(case when event_type = 'SignUp Success' then event_timestamp end) as ts_productpage,
min(case when event_type = 'Start Quiz' then event_timestamp end) as ts_checkout
from awsma.v_event
group by client_id
) ts;
#3
0
In case a more precise model is required: the product page can be opened twice, the first time before the homepage and the second time after it. This case should usually be counted as a conversion as well.
Redshift SQL query:
SELECT
    -- Note: these conditions only require each step to have occurred by some row;
    -- compare the cur_* columns (e.g. cur_productpage_time > cur_homepage_time)
    -- if strict ordering is needed.
    COUNT(DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
                        THEN user_id END) AS Step1,
    COUNT(DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
                         AND cur_productpage_time IS NOT NULL
                        THEN user_id END) AS Step2,
    COUNT(DISTINCT CASE WHEN cur_homepage_time IS NOT NULL
                         AND cur_productpage_time IS NOT NULL
                         AND cur_checkout_time IS NOT NULL
                        THEN user_id END) AS Step3
FROM (
    SELECT
        user_id,
        "timestamp",
        -- carry the most recent time of each event forward onto every later row
        COALESCE(homepage_time,
                 LAG(homepage_time) IGNORE NULLS
                     OVER (PARTITION BY user_id ORDER BY "timestamp")
        ) AS cur_homepage_time,
        COALESCE(productpage_time,
                 LAG(productpage_time) IGNORE NULLS
                     OVER (PARTITION BY user_id ORDER BY "timestamp")
        ) AS cur_productpage_time,
        COALESCE(checkout_time,
                 LAG(checkout_time) IGNORE NULLS
                     OVER (PARTITION BY user_id ORDER BY "timestamp")
        ) AS cur_checkout_time
    FROM (
        -- one nullable time column per funnel step
        SELECT
            "timestamp",
            user_id,
            CASE WHEN event = 'homepage' THEN "timestamp" END AS homepage_time,
            CASE WHEN event = 'product page' THEN "timestamp" END AS productpage_time,
            CASE WHEN event = 'checkout' THEN "timestamp" END AS checkout_time
        FROM events
        WHERE "timestamp" > '2016-05-01' AND "timestamp" < '2017-01-01'
    ) event_times
) event_windows;
This query fills each row's cur_homepage_time, cur_productpage_time and cur_checkout_time with the most recent timestamp at which the corresponding event occurred, up to and including that row. So if an event has occurred at or before a given row's time, the corresponding column is not NULL.
More info here.
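To make the "product page opened before and after the homepage" case concrete, here is a rough sketch of a different technique from the LAG-based query: it anchors each step on the first qualifying event after the previous step, using joins. The sample_events data, the ts column name, and the first_home, first_product and first_checkout CTE names are all assumptions for illustration.

```sql
-- Hypothetical data: user 1 opens the product page both before and after the homepage.
with sample_events as (
    select cast('2015-05-05 12:00' as timestamp) as ts,
           cast('product page' as varchar(20)) as action,
           1 as userid
    union all
    select cast('2015-05-05 12:01' as timestamp), 'homepage', 1
    union all
    select cast('2015-05-05 12:02' as timestamp), 'product page', 1
    union all
    select cast('2015-05-05 12:03' as timestamp), 'checkout', 1
    union all
    select cast('2015-05-05 12:02' as timestamp), 'homepage', 2
),
-- Step 1: first homepage visit per user.
first_home as (
    select userid, min(ts) as ts_homepage
    from sample_events
    where action = 'homepage'
    group by userid
),
-- Step 2: first product page visit strictly after that homepage visit.
first_product as (
    select h.userid, min(e.ts) as ts_productpage
    from first_home h
    join sample_events e
      on e.userid = h.userid
     and e.action = 'product page'
     and e.ts > h.ts_homepage
    group by h.userid
),
-- Step 3: first checkout strictly after that product page visit.
first_checkout as (
    select p.userid, min(e.ts) as ts_checkout
    from first_product p
    join sample_events e
      on e.userid = p.userid
     and e.action = 'checkout'
     and e.ts > p.ts_productpage
    group by p.userid
)
select (select count(*) from first_home) as homepage_count,
       (select count(*) from first_product) as productpage_count,
       (select count(*) from first_checkout) as checkout_count;
```

On this data the result should be 2, 1, 1. A query that compares only the earliest timestamp of each action would give 2, 0, 0 instead, because user 1's earliest product page visit (12:00) precedes their first homepage visit (12:01).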