I need some help writing/optimizing a query to retrieve the latest version of each row by type and performing some calculations depending on the type. I think would be best if I illustrate it with an example.
我需要一些帮助来编写/优化查询,根据类型检索每一行的最新版本,并根据类型执行一些计算。我认为最好用一个例子来说明。
Given the following dataset:
得到以下数据:
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id | event_type | event_timestamp | message_id | sent_at | status | rate |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1 | create | 2016-11-25 09:17:48 | 1 | 2016-11-25 09:17:48 | 0 | 0.500000 |
| 2 | status_update | 2016-11-25 09:24:38 | 1 | 2016-11-25 09:28:49 | 1 | 0.500000 |
| 3 | create | 2016-11-25 09:47:48 | 2 | 2016-11-25 09:47:48 | 0 | 0.500000 |
| 4 | status_update | 2016-11-25 09:54:38 | 2 | 2016-11-25 09:48:49 | 1 | 0.500000 |
| 5 | rate_update | 2016-11-25 09:55:07 | 2 | 2016-11-25 09:50:07 | 0 | 1.000000 |
| 6 | create | 2016-11-26 09:17:48 | 3 | 2016-11-26 09:17:48 | 0 | 0.500000 |
| 7 | create | 2016-11-27 09:17:48 | 4 | 2016-11-27 09:17:48 | 0 | 0.500000 |
| 8 | rate_update | 2016-11-27 09:55:07 | 4 | 2016-11-27 09:50:07 | 0 | 2.000000 |
| 9 | rate_update | 2016-11-27 09:55:07 | 2 | 2016-11-25 09:55:07 | 0 | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
The expected result should be:
预期的结果应是:
+------------+--------------------+--------------------+-----------------------+
| sent_at | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total) |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 | 2 | 2 | 2.500000 |
| 2016-11-26 | 1 | 0 | 0.500000 |
| 2016-11-27 | 1 | 0 | 2.000000 |
+------------+--------------------+--------------------+-----------------------+
At the end of the post is the query that is used to obtain this result. I'm willing to bet that there should be a way to optimize it, since it's using subqueries with joins, and from what I have read about BigQuery, joins should best be avoided. But first some background:
在文章的末尾是用于获取这个结果的查询。我愿意打赌,应该有一种方法来优化它,因为它使用的是连接的子查询,并且从我读到的关于BigQuery的内容中,应该尽量避免连接。但首先,一些背景:
In essence, the dataset represents an append-only table, to which multipe events are written. The size of the data is in the hundreds of millions and will grow to billions+. Since Updates in BigQuery are not practical, and the data is being streamed to BQ, I need a way to retrieve the most recent of each events, perform some calculations based on the certain conditions and return an accurate result. The query is generated dynamically, based on user input, so more fields/calculations can be included, but have been ommited for simplicity.
本质上,数据集表示一个只添加的表,其中写入了multipe事件。数据的规模是数亿,并且将增长到数十亿。由于BigQuery中的更新不实用,并且数据被流到BQ,所以我需要一种方法来检索每个事件的最新信息,基于特定的条件执行一些计算并返回一个准确的结果。查询是基于用户输入动态生成的,因此可以包含更多的字段/计算,但是为了简单起见,已经省略了。
- There is only one
create
event, butn
of any other kind - 只有一个创建事件,但是其他类型的n。
- For each group of events, only the latest should be taken into account when doing the calculations.
- status_update - updates the status
- status_update -更新状态。
- rate_update - updates the rate
- rate_update—更新速率
- create - self explanatory
- 创建——自我解释
- 对于每组事件,在进行计算时只考虑最新的事件。status_update -更新状态rate_update -更新速率创建-自解释
- Every event that is not
create
may not carry the rest of the information of the original/may not be accurate(except for message_id and the field that the event is operating on) (the dataset is simplified, but imagine there are many more columns, and more events will be added later)- E.g. a
rate_update
may or may not have the status field set, or be a value that is not the final, so no calculation can be made on the status field from arate_update
event and the same goes forstatus_update
- 例如,rate_update可能有也可能没有状态字段集,或者不是最终的值,因此rate_update事件不能对状态字段进行计算,status_update也是如此
- E.g. a
- 每个事件不创建不得携带其他信息的原始/可能并不准确(除了message_id和事件操作的字段)(数据集是简化,但想象有很多列,和更多的事件将被添加后)如rate_update可能有也可能没有状态字段集,或者是一个值,不是最终的,所以没有计算可以在状态字段从rate_update事件和status_update也是一样
- It can be assumed that the table is partitioned by date and every query will make use of the partions. Those conditions have been omitted in favor of simplicity for now.
- 可以假定表是按日期划分的,每个查询都将使用解析。这些条件现在被省略了,以简化。
So I guess I have a couple of questions:
我想我有几个问题:
- How can this query be optimized?
- 如何优化这个查询?
- Will it be a better idea to put the events, other than
create
in their own tables, where the only fields available will be the ones relevant for the events, and needed for the joins(message_id, event_timestamp)? Will this reduce the amount of data processed? - 除了在自己的表中创建事件之外,将事件放置在惟一可用的字段是与事件相关的字段,以及连接所需的字段(message_id、event_timestamp)中,这样做是否更好?这会减少处理的数据量吗?
- What would be the most optimal way to add more events in the future, which will have their own conditions and calculations?
- 什么是在未来添加更多事件的最优方法,而这些事件将有自己的条件和计算?
Actually any advice on how to query this dataset efficiently and friendly is more than welcome! Thank you! :)
实际上,任何关于如何高效友好地查询数据集的建议都是非常受欢迎的!谢谢你!:)
The monstrosity I've come up with is the following. The INNER JOINS
are used to retrieve the latest version of each row, as per this resource
我想到的怪物是这样的。内部连接用于检索每一行的最新版本,根据此资源。
select
sent_at as sent_at,
sum(submitted_msg) as submitted,
sum(delivered_msg) as delivered,
sum(sales_rate_total) as sales_rate_total
FROM (
#DELIVERED
SELECT
d.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
0 as submitted_msg,
sum(if(status=1,1,0)) as delivered_msg,
0 as sales_rate_total
FROM `events` d
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type = "status_update"
group by 1
) g on d.message_id = g.message_id and d.event_timestamp = g.ts
GROUP BY 1,2
UNION ALL
#SALES RATE
SELECT
s.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
0 as submitted_msg,
0 as delivered_msg,
sum(sales_rate) as sales_rate_total
FROM `events` s
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type in ("rate_update", "create")
group by 1
) f on s.message_id = f.message_id and s.event_timestamp = f.ts
GROUP BY 1,2
UNION ALL
#SUBMITTED & REST
SELECT
r.message_id,
FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
sum(if(status=0,1,0)) as submitted_msg,
0 as delivered_msg,
0 as sales_rate_total
FROM `events` r
INNER JOIN
(
select message_id, max(event_timestamp) as ts
from `events`
where event_type = "create"
group by 1
) e on r.message_id = e.message_id and r.event_timestamp = e.ts
GROUP BY 1, 2
) k
group by 1
2 个解决方案
#1
4
How can this query be optimized?
如何优化这个查询?
Try below version
试试下面的版本
#standardSQL
WITH types AS (
SELECT
FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
message_id,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
FROM events
), latest AS (
SELECT
sent_at,
message_id,
ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
ANY_VALUE(sales_rate) AS sales_rate
FROM types
GROUP BY 1, 2
)
SELECT
sent_at,
SUM(submitted) AS submitted,
SUM(delivered) AS delivered,
SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
It's compact enough to easily manage, no redundancy, no joins at all, etc.
If your table partitioned - you can easily use it by adjusting query just in one place
它非常紧凑,易于管理,没有冗余,没有连接,等等。如果你的表分区,你可以很容易地通过调整一个地方的查询来使用它。
You can use below dummy data if want to check above query on low volume first
您可以使用以下虚拟数据,如果想要先检查上面的低音量查询。
WITH events AS (
SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate
)
#2
0
For every table that holds multiple events and where we need to pick the latest we have a view in place.
对于每个包含多个事件的表,以及我们需要选择最新的位置,我们都有一个视图。
View: user_profile_latest
观点:user_profile_latest
SELECT * from (
select rank() over (partition by user_id order by bq.created DESC, bq.insert_id desc) as _rank,
*
FROM [user_profile_event]
) where _rank=1
We maintain a record BQ with created and insert_id for deduplication purposes.
为了进行重复数据删除,我们使用已创建的和insert_id维护一个记录BQ。
#1
4
How can this query be optimized?
如何优化这个查询?
Try below version
试试下面的版本
#standardSQL
WITH types AS (
SELECT
FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
message_id,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "create") DESC, event_timestamp DESC) AS submitted_status,
FIRST_VALUE(status) OVER(PARTITION BY message_id ORDER BY (event_type = "status_update") DESC, event_timestamp DESC) AS delivered_status,
FIRST_VALUE(rate) OVER(PARTITION BY message_id ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC) AS sales_rate
FROM events
), latest AS (
SELECT
sent_at,
message_id,
ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
ANY_VALUE(sales_rate) AS sales_rate
FROM types
GROUP BY 1, 2
)
SELECT
sent_at,
SUM(submitted) AS submitted,
SUM(delivered) AS delivered,
SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
It's compact enough to easily manage, no redundancy, no joins at all, etc.
If your table partitioned - you can easily use it by adjusting query just in one place
它非常紧凑,易于管理,没有冗余,没有连接,等等。如果你的表分区,你可以很容易地通过调整一个地方的查询来使用它。
You can use below dummy data if want to check above query on low volume first
您可以使用以下虚拟数据,如果想要先检查上面的低音量查询。
WITH events AS (
SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate
)
#2
0
For every table that holds multiple events and where we need to pick the latest we have a view in place.
对于每个包含多个事件的表,以及我们需要选择最新的位置,我们都有一个视图。
View: user_profile_latest
观点:user_profile_latest
SELECT * from (
select rank() over (partition by user_id order by bq.created DESC, bq.insert_id desc) as _rank,
*
FROM [user_profile_event]
) where _rank=1
We maintain a record BQ with created and insert_id for deduplication purposes.
为了进行重复数据删除,我们使用已创建的和insert_id维护一个记录BQ。