MySQL GROUP BY DateTime +/- 3 seconds

Suppose I have a table with 3 columns:

id (PK, int)
timestamp (datetime)
title (text)

I have the following records:

1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title
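
For anyone who wants to reproduce this, the table and rows above could be set up roughly as follows. The table name log_events is an assumption made for illustration; only the three columns come from the description.

```sql
-- Hypothetical table name; the columns follow the description above.
CREATE TABLE log_events (
    id          INT PRIMARY KEY,
    `timestamp` DATETIME NOT NULL,
    title       TEXT
);

-- The three sample records from the question.
INSERT INTO log_events (id, `timestamp`, title) VALUES
    (1, '2010-01-01 15:00:00', 'Some Title'),
    (2, '2010-01-01 15:00:02', 'Some Title'),
    (3, '2010-01-02 15:00:00', 'Some Title');
```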

I need to GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.

There is a similar question here: Mysql DateTime group by 15 mins

I also found this: http://www.artfulsoftware.com/infotree/queries.php#106

I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
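
To make the fixed-bin problem concrete, here is a small sketch that uses literal times instead of a real table. TIME_TO_SEC is used purely for illustration, and the 5-second bin width is the one from the example above.

```sql
-- Fixed 5-second bins: each time is assigned to FLOOR(seconds / 5).
SELECT t AS time_value,
       FLOOR(TIME_TO_SEC(t) / 5) AS bin
FROM (
    SELECT '15:00:01' AS t
    UNION ALL SELECT '15:00:04'
    UNION ALL SELECT '15:00:06'
) sample_times;
-- 15:00:01 and 15:00:04 share bin 10800, while 15:00:06 lands in bin 10801,
-- even though 15:00:04 and 15:00:06 are only 2 seconds apart.
```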

Does this make sense? Please let me know if further clarification is needed.

EDIT: For the set of numbers {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them as {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each row's group depends on whether the row is within 3 seconds of the previous one. I know this changes things a bit; I'm sorry for being wishy-washy on this.
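
On MySQL 8.0 or later, this "chain" grouping can be sketched with window functions: flag each row that starts a new chain, then number the chains with a running sum of the flags. This is only a sketch under assumptions: the demo_events table and its data are invented to mirror the set of numbers above (treated as seconds), and older MySQL versions have no window functions.

```sql
-- Assumes MySQL 8.0+; table name and data are hypothetical.
CREATE TEMPORARY TABLE demo_events (id INT PRIMARY KEY, ts DATETIME NOT NULL);
INSERT INTO demo_events VALUES
    (1, '2010-01-01 15:00:01'), (2, '2010-01-01 15:00:02'),
    (3, '2010-01-01 15:00:03'), (4, '2010-01-01 15:00:04'),
    (5, '2010-01-01 15:00:05'), (6, '2010-01-01 15:00:06'),
    (7, '2010-01-01 15:00:07'), (8, '2010-01-01 15:00:50'),
    (9, '2010-01-01 15:00:51'), (10, '2010-01-01 15:01:00');

SELECT MIN(ts) AS chain_start, MAX(ts) AS chain_end, COUNT(*) AS rows_in_chain
FROM (
    -- Running sum of the start flags gives every row its chain number.
    SELECT id, ts, SUM(is_start) OVER (ORDER BY ts, id) AS chain_no
    FROM (
        -- A row continues the chain when the previous row is at most
        -- 3 seconds earlier; otherwise (or for the first row) it starts one.
        SELECT id, ts,
               CASE WHEN LAG(ts) OVER w >= ts - INTERVAL 3 SECOND
                    THEN 0 ELSE 1 END AS is_start
        FROM demo_events
        WINDOW w AS (ORDER BY ts, id)
    ) flagged
) numbered
GROUP BY chain_no
ORDER BY chain_no;
```

With this data the query should return three chains of 7, 2 and 1 rows. To chain per title, as in the log-matching use case, a PARTITION BY title could be added to both window definitions.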

I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.

5 answers

#1

I'm using Tom H.'s excellent idea but doing it a little differently here:

Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and find the rows that match the times.

Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:

SELECT DISTINCT a.Timestamp
FROM `Table` a
LEFT JOIN `Table` b
ON (b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
    AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL

And then for each row, we can find the largest chain-starting timestamp that is no later than the row's own timestamp with Query #2:

SELECT `Table`.id, MAX(StartOfChains.Timestamp) AS ChainStartTime
FROM `Table`
JOIN ([query #1]) StartOfChains
ON `Table`.Timestamp >= StartOfChains.Timestamp
GROUP BY `Table`.id

Once we have that, we can GROUP BY it as you wanted.

SELECT COUNT(*) -- or whatever
FROM `Table`
JOIN ([query #2]) GroupingQuery
ON `Table`.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
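
Substituting the placeholders, the three pieces might fit together as below. This is a sketch that keeps the answer's placeholder names (`Table`, Timestamp); the aliases t, r, s and g and the rows_in_chain column are invented here.

```sql
SELECT g.ChainStartTime, COUNT(*) AS rows_in_chain
FROM `Table` t
JOIN (
    -- Query #2: the latest chain start at or before each row.
    SELECT r.id, MAX(s.Timestamp) AS ChainStartTime
    FROM `Table` r
    JOIN (
        -- Query #1: times with no earlier time in the preceding 3 seconds.
        SELECT DISTINCT a.Timestamp
        FROM `Table` a
        LEFT JOIN `Table` b
          ON b.Timestamp >= a.Timestamp - INTERVAL 3 SECOND
         AND b.Timestamp <  a.Timestamp
        WHERE b.Timestamp IS NULL
    ) s ON r.Timestamp >= s.Timestamp
    GROUP BY r.id
) g ON t.id = g.id
GROUP BY g.ChainStartTime;
```

As written it chains purely on time. To chain per title, a condition such as b.title = a.title would presumably be needed in the inner join, with the title carried through the outer levels.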

I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!

#2

Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:

SELECT
    MT1.my_id,
    MT1.title,
    MT1.my_time
FROM
    My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
    MT2.title = MT1.title AND
    (
        MT2.my_time < MT1.my_time OR
        (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
    ) AND
    MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
WHERE
    MT2.my_id IS NULL

Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs (before version 8.0), you might want to throw the above results into a temporary table, as that would save you repeating the same subquery in the joins below. Note that MySQL cannot refer to the same TEMPORARY table twice in one query, so the query below would need a regular work table or two copies.

SELECT
    SQ1.my_id,
    COUNT(*)  -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
    MT3.title = SQ1.title AND
    MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
    SELECT
        MT1.my_id,
        MT1.title,
        MT1.my_time
    FROM
        My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE
        MT2.my_id IS NULL
) SQ2 ON
    SQ2.title = SQ1.title AND
    SQ2.my_time > SQ1.my_time AND
    SQ2.my_time <= MT3.my_time
WHERE
    SQ2.my_id IS NULL
GROUP BY
    SQ1.my_id

This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.
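
For readers on MySQL 8.0 or later, where CTEs are available, the same logic might be written as in the sketch below. The CTE name chain_starts and the output column names are assumptions; the table and column names are the ones used in the query above.

```sql
-- Assumes MySQL 8.0+ (common table expressions).
WITH chain_starts AS (
    -- Rows with no same-title row in the 3 seconds before them
    -- (ties on my_time are broken by my_id).
    SELECT MT1.my_id, MT1.title, MT1.my_time
    FROM My_Table MT1
    LEFT OUTER JOIN My_Table MT2 ON
        MT2.title = MT1.title AND
        (
            MT2.my_time < MT1.my_time OR
            (MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
        ) AND
        MT2.my_time >= MT1.my_time - INTERVAL 3 SECOND
    WHERE MT2.my_id IS NULL
)
SELECT
    SQ1.my_id AS chain_start_id,
    COUNT(*)  AS rows_in_chain
FROM chain_starts SQ1
INNER JOIN My_Table MT3 ON
    MT3.title = SQ1.title AND
    MT3.my_time >= SQ1.my_time
-- Exclude pairings where a later chain start sits between SQ1 and the row.
LEFT OUTER JOIN chain_starts SQ2 ON
    SQ2.title = SQ1.title AND
    SQ2.my_time > SQ1.my_time AND
    SQ2.my_time <= MT3.my_time
WHERE SQ2.my_id IS NULL
GROUP BY SQ1.my_id;
```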

Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.

EDIT: Changed the queries to handle exact matches by timestamp.

#3

Warning: Long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it doesn't do anything, since we can't do recursive CTE things in MySQL (before version 8.0).

I'm going to use this data as the example instead of yours:

id    Timestamp
1     1:00:00
2     1:00:03
3     1:00:06
4     1:00:10

Here is the first query to write:

SELECT a.id AS aid, b.id AS bid
FROM `Table` a
JOIN `Table` b
ON a.Timestamp BETWEEN b.Timestamp - INTERVAL 3 SECOND
                   AND b.Timestamp + INTERVAL 3 SECOND

It returns:

aid     bid
1       1
1       2
2       1
2       2
2       3
3       2
3       3
4       4

Let's create a table to hold those pairs, with a primary key so that it won't allow duplicates:

CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) -- important for later
)
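
The step that fills the table is not spelled out, so here is one way it might look: the first query, wrapped in an INSERT. It keeps the placeholder names `Table` and Timestamp from above and treats "within 3 seconds" as inclusive, which matches the pairs listed earlier.

```sql
-- Seed Adjacency with every pair of rows at most 3 seconds apart
-- (each row is also paired with itself).
INSERT IGNORE INTO Adjacency (aid, bid)
SELECT a.id, b.id
FROM `Table` a
JOIN `Table` b
  ON a.Timestamp BETWEEN b.Timestamp - INTERVAL 3 SECOND
                     AND b.Timestamp + INTERVAL 3 SECOND;
```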

Now the challenge is to find something like the transitive closure of that relation.

To do so, let's find the next level of links. By that I mean: since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:

INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)

This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
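
One way to automate the repetition, sketched here as a stored procedure, is to loop until ROW_COUNT() reports that the last INSERT added nothing. The procedure name close_adjacency is made up, and DELIMITER is a mysql command-line client directive, so other clients may need a different way to define the procedure.

```sql
DELIMITER //

CREATE PROCEDURE close_adjacency()
BEGIN
    DECLARE added INT DEFAULT 1;

    -- Keep adding two-step links until a pass inserts no new rows.
    WHILE added > 0 DO
        INSERT IGNORE INTO Adjacency (aid, bid)
        SELECT adj1.aid, adj2.bid
        FROM Adjacency adj1
        JOIN Adjacency adj2
          ON adj1.bid = adj2.aid;

        -- Rows skipped by IGNORE as duplicates are not counted.
        SET added = ROW_COUNT();
    END WHILE;
END //

DELIMITER ;

CALL close_adjacency();
```

The loop terminates because Adjacency can hold at most one row per (aid, bid) pair.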

Once this is over, you will have a transitively-closed relation like this:

aid     bid
1       1
1       2
1       3     --added
2       1
2       2
2       3
3       1     --added
3       2
3       3
4       4

And now for the punchline:

SELECT aid, GROUP_CONCAT( bid ORDER BY bid ) AS Neighbors
FROM Adjacency
GROUP BY aid

returns:

aid     Neighbors
1       1,2,3
2       1,2,3
3       1,2,3
4       4

So

SELECT DISTINCT Neighbors
FROM (
     SELECT aid, GROUP_CONCAT( bid ORDER BY bid ) AS Neighbors
     FROM Adjacency
     GROUP BY aid
     ) Groupings

returns

Neighbors
1,2,3
4

Whew!

#4

I like @Chris Cunningham's answer, but here's another take on it.

First, my understanding of your problem statement (correct me if I'm wrong):

You want to look at your event log as a sequence, ordered by the time of the event, and partition it into groups, defining the boundary as being an interval of more than 3 seconds between two adjacent rows in the sequence.

I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.

So, first our event log table:

--
-- our event log table
--
create table dbo.eventLog
(
  id       int          not null ,
  dtLogged datetime     not null ,
  title    varchar(200) not null ,

  primary key nonclustered ( id ) ,
  unique clustered ( dtLogged , id ) ,

)

Given the above understanding of the problem statement, the following query should give you the upper and lower bounds of your groups. It's a simple, nested select statement with two GROUP BYs to collapse things:

The innermost select defines the upper bound of each group. That upper boundary defines a group.
The outer select defines the lower bound of each group.

Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.

[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]

select dtFrom = min( t.dtFrom ) ,
       dtThru =      t.dtThru
from ( select dtFrom = t1.dtLogged ,
              dtThru = min( t2.dtLogged )
       from      dbo.EventLog t1
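       left join dbo.EventLog t2 on t2.dtLogged > t1.dtLogged
                                and not exists ( select *   -- t2 starts a new group: no row within the 3 seconds before it
                                                 from dbo.EventLog t3
                                                 where t3.dtLogged < t2.dtLogged
                                                   and datediff(second,t3.dtLogged,t2.dtLogged) <= 3 )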
       group by t1.dtLogged
     ) t
group by t.dtThru
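
A possible MySQL rendering of the bounds query, offered as a sketch rather than a tested port. It spells out the "gap between adjacent rows" rule explicitly: a row is treated as the start of a new group only when no earlier row lies within the 3 seconds before it. The table name eventLog without the dbo schema, and the aliases e, p and s, are assumptions.

```sql
SELECT MIN(t.dtFrom) AS dtFrom,
       t.dtThru
FROM (
    -- For each row, dtThru is the start of the next group (NULL for the last group).
    SELECT t1.dtLogged     AS dtFrom,
           MIN(s.dtLogged) AS dtThru
    FROM eventLog t1
    LEFT JOIN (
        -- Group starts: rows with no earlier row in the preceding 3 seconds.
        SELECT e.dtLogged
        FROM eventLog e
        WHERE NOT EXISTS (
            SELECT 1
            FROM eventLog p
            WHERE p.dtLogged <  e.dtLogged
              AND p.dtLogged >= e.dtLogged - INTERVAL 3 SECOND
        )
    ) s ON s.dtLogged > t1.dtLogged
    GROUP BY t1.dtLogged
) t
GROUP BY t.dtThru
ORDER BY dtFrom;
```

Here dtThru is exclusive: it is the first timestamp of the following group, so rows belong to a group when dtFrom <= dtLogged < dtThru.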

You could then pull rows from the event log and tag them with the group to which they belong thus:

select *
from ( select dtFrom = min( t.dtFrom ) ,
              dtThru =      t.dtThru
       from ( select dtFrom = t1.dtLogged ,
                     dtThru = min( t2.dtLogged )
              from      dbo.EventLog t1
              left join dbo.EventLog t2 on t2.dtLogged > t1.dtLogged
                                       and not exists ( select *   -- t2 starts a new group
                                                        from dbo.EventLog t3
                                                        where t3.dtLogged < t2.dtLogged
                                                          and datediff(second,t3.dtLogged,t2.dtLogged) <= 3 )
              group by t1.dtLogged
            ) t
       group by t.dtThru
     ) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
                   and ( period.dtThru is null or t.dtLogged < period.dtThru )
order by period.dtFrom , period.dtThru , t.dtLogged

Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.

#5

Simple query:

SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);
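
With ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5), SELECT * combined with GROUP BY on an expression is rejected, so a variant with explicit aggregates may be needed. The sketch below reuses the time_history table and time_stamp column from the query above; the output column names are made up, and FLOOR is swapped in for ROUND so that each bucket covers one contiguous 3-second window.

```sql
-- Fixed 3-second buckets, with one aggregate row per bucket.
SELECT FLOOR(UNIX_TIMESTAMP(time_stamp) / 3) AS bucket,
       MIN(time_stamp) AS first_seen,
       MAX(time_stamp) AS last_seen,
       COUNT(*)        AS rows_in_bucket
FROM time_history
GROUP BY bucket
ORDER BY bucket;
```

These are still fixed bins, so two rows only a second or two apart can fall on either side of a bucket boundary, which is the limitation raised in the question.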
