I have about 1 million events in a PostgreSQL database that are of this format:
 id | stream_id | timestamp
----+-----------+-----------
  1 |         7 | ....
  2 |         8 | ....
There are about 50,000 unique streams.
I need to find all pairs of consecutive events where the time between the two events exceeds a certain period. In other words, I need to find event pairs between which no event occurred for that period of time.
For example:
a  b  c  d  e  f                g  h  i  j  k
|  |  |  |  |  |                |  |  |  |  |
                \____2 mins____/
In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.
I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.
I also have the data in MongoDB.
What's the best way to perform this query?
2 Solutions
#1
4
You can do this with the lag() window function over a partition by stream_id, ordered by the timestamp. The lag() function gives you access to earlier rows in the partition; without an explicit offset, it returns the value from the immediately preceding row. So if the partition on stream_id is ordered by time, the previous row is the previous event for that stream_id.
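-- Window functions cannot appear in WHERE, so compute the gap in a subquery first.
SELECT stream_id, start_id, end_id, diff
FROM (
    SELECT stream_id, lag(id) OVER pair AS start_id, id AS end_id,
           "timestamp" - lag("timestamp") OVER pair AS diff
    FROM my_table
    WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS gaps
WHERE diff > interval '2 minutes';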
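If it helps to see the logic outside of SQL, here is a rough Python sketch of what the partition-and-lag approach computes: group events by stream, sort each group by time, and compare every event with the one before it. The function name find_gaps and the sample rows are made up for illustration and are not part of the question's data; this is an in-memory sketch of the idea, not a replacement for running the query in the database.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_gaps(events, threshold):
    """Return (stream_id, start_id, end_id, diff) for each pair of
    consecutive events in a stream that are more than `threshold` apart.

    `events` is an iterable of (id, stream_id, timestamp) tuples.
    """
    # PARTITION BY stream_id
    by_stream = defaultdict(list)
    for event_id, stream_id, ts in events:
        by_stream[stream_id].append((ts, event_id))

    gaps = []
    for stream_id, rows in by_stream.items():
        rows.sort()  # ORDER BY "timestamp" within the partition
        # Pairing each row with its predecessor plays the role of lag().
        for (prev_ts, prev_id), (ts, event_id) in zip(rows, rows[1:]):
            diff = ts - prev_ts
            if diff > threshold:
                gaps.append((stream_id, prev_id, event_id, diff))
    return gaps


# Hypothetical sample rows: one stream with a gap between events 3 and 4.
events = [
    (1, 7, datetime(2015, 6, 1, 15, 20, 30)),
    (2, 7, datetime(2015, 6, 1, 15, 20, 31)),
    (3, 7, datetime(2015, 6, 1, 15, 20, 32)),
    (4, 7, datetime(2015, 6, 1, 15, 25, 30)),
    (5, 7, datetime(2015, 6, 1, 15, 25, 31)),
]

for gap in find_gaps(events, timedelta(minutes=2)):
    print(gap)
```

The first event of each stream has no predecessor, so it never appears as the end of a pair; in the SQL version the same thing happens because lag() yields NULL there and the comparison filters the row out.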
#2
4
In PostgreSQL this can be done very easily with the help of the lag() window function. See the fiddle below for an example:
PostgreSQL 9.3 Schema Setup:
CREATE TABLE Table1
("id" int, "stream_id" int, "timestamp" timestamp)
;
INSERT INTO Table1
("id", "stream_id", "timestamp")
VALUES
(1, 7, '2015-06-01 15:20:30'),
(2, 7, '2015-06-01 15:20:31'),
(3, 7, '2015-06-01 15:20:32'),
(4, 7, '2015-06-01 15:25:30'),
(5, 7, '2015-06-01 15:25:31')
;
Query 1:
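-- Ordering by id assumes ids increase with time within a stream;
-- order by "timestamp" instead if that may not hold.
with c as (
    select *,
           lag("timestamp") over (partition by stream_id order by id) as pre_time,
           lag(id) over (partition by stream_id order by id) as pre_id
    from Table1
)
select * from c where "timestamp" - pre_time > interval '2 sec';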
| id | stream_id | timestamp | pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
| 4 | 7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 | 3 |