Let's say we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other datatype. The records are relatively consistent, entered at roughly x-minute intervals. Sometimes, however, no entry is made for many multiples of x, producing a 'gap' in the data.
In terms of efficiency, what is the best way to find these gaps of at least duration Y (both new and old) with a query?
2 Answers
#1
17
To start with, let us summarize the number of entries by hour in your table.
SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) AS hour,
       COUNT(*) AS samplecount
  FROM table   /* placeholder: substitute your table's name */
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)
Now, if you log something every six minutes (ten times an hour), all your samplecount values should be ten. The expression CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy, but it simply truncates each timestamp to the hour in which it occurs by zeroing out the minutes and seconds.
This is reasonably efficient, and will get you started. It's very efficient if you can put an index on your entry_time column and restrict your query to, let's say, yesterday's samples as shown here.
SELECT CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) AS hour,
       COUNT(*) AS samplecount
  FROM table
 WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
   AND entry_time < CURRENT_DATE
 GROUP BY CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME)
But it isn't much good at detecting whole hours that go by with no samples at all, because an hour with zero entries produces no row in the summary. It's also a little sensitive to jitter in your sampling. That is, if your top-of-the-hour sample is sometimes half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly summary counts will be off. So this hour summary thing (or day summary, or minute summary, etc.) is not bulletproof.
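To make that limitation concrete, here is a minimal Python sketch of the same hourly bucketing, outside the database. The sample data and the function name hourly_counts are made up for illustration; they are not from the question. It shows a short hour being flagged while a completely empty hour simply never appears in the counts.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_counts(entry_times):
    """Count samples per hour, truncating each timestamp to the top of its hour."""
    return Counter(t.replace(minute=0, second=0, microsecond=0) for t in entry_times)

# Hypothetical data: one sample every six minutes from 09:00 through 11:54,
# then drop the whole 10:00 hour and the single 11:30 sample.
start = datetime(2024, 1, 1, 9, 0)
samples = [start + timedelta(minutes=6 * i) for i in range(30)]
samples = [t for t in samples if t.hour != 10 and t != datetime(2024, 1, 1, 11, 30)]

counts = hourly_counts(samples)

# Hours that have some samples, but fewer than the expected ten.
short_hours = sorted(h for h, n in counts.items() if n < 10)

# The 10:00 hour has no samples, so it has no bucket at all and is never flagged.
empty_hour_is_invisible = datetime(2024, 1, 1, 10, 0) not in counts
```

Only the 11:00 hour ends up in short_hours; the hour with nothing in it goes unnoticed, which is the same blind spot the GROUP BY query has.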
You need a self-join query to get stuff perfectly right; it's a bit more of a hairball and not nearly as efficient.
Let's start by creating ourselves a virtual table (subquery) like this with numbered samples. (This is a pain in MySQL; some other expensive DBMSs make it easier. No matter.)
SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
  FROM (
        SELECT entry_time, value
          FROM table
         ORDER BY entry_time
       ) c,   /* alias case matches c.entry_time; aliases are case-sensitive on some platforms */
       (SELECT @sample:=0) s
This little virtual table gives entry_num, entry_time, value.
Next step, we join it to itself.
SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS gap   /* difference of the timestamps, not the values */
  FROM (
        /* virtual table */
       ) one
  JOIN (
        /* same virtual table */
       ) two ON (two.entry_num - 1 = one.entry_num)
This lines up the two tables next to each other, offset by a single entry, governed by the ON clause of the JOIN.
Finally we choose the rows from this joined table with an interval larger than your threshold; their entry_time values are the times of the samples right before the missing ones.
The overall self-join query is this. I told you it was a hairball.
SELECT one.entry_num, one.entry_time, one.value,
       TIMEDIFF(two.entry_time, one.entry_time) AS gap
  FROM (
        SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
          FROM (
                SELECT entry_time, value
                  FROM table
                 ORDER BY entry_time
               ) c,
               (SELECT @sample:=0) s
       ) one
  JOIN (
        SELECT @sample2:=@sample2+1 AS entry_num, c.entry_time, c.value
          FROM (
                SELECT entry_time, value
                  FROM table
                 ORDER BY entry_time
               ) c,
               (SELECT @sample2:=0) s
       ) two ON (two.entry_num - 1 = one.entry_num)
 /* example threshold of 15 minutes; substitute your own Y */
 WHERE two.entry_time > one.entry_time + INTERVAL 15 MINUTE
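If the offset-by-one join is hard to picture, the same idea can be sketched in a few lines of Python. This is only an illustration of the technique, not a replacement for the query; the function name find_gaps, the sample timestamps and the 15-minute threshold are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

def find_gaps(entry_times, threshold):
    """Return (before, after, gap) for adjacent samples more than `threshold` apart."""
    ordered = sorted(entry_times)
    # Number the samples 1, 2, 3, ... the way the @sample counter does.
    numbered = dict(enumerate(ordered, start=1))
    gaps = []
    for entry_num, before in numbered.items():
        # Look up the row whose number is one higher: the "two.entry_num - 1 = one.entry_num" match.
        after = numbered.get(entry_num + 1)
        if after is not None and after - before > threshold:
            gaps.append((before, after, after - before))
    return gaps

# Hypothetical samples every six minutes, with a hole between 10:12 and 10:36.
samples = [datetime(2024, 1, 1, 10, m) for m in (0, 6, 12, 36, 42)]
gaps = find_gaps(samples, timedelta(minutes=15))
```

Here gaps holds a single entry: the sample at 10:12, the next sample at 10:36, and the 24 minutes between them. The last sample has no successor, so it drops out, just as it does in the inner join.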
If you have to do this in production on a large table, you may want to do it for a subset of your data. For example, you could do it each day for the previous two days' samples. This would be decently efficient, and would also make sure you didn't overlook any missing samples right at midnight. To do this, your little row-numbered virtual tables would look like this.
SELECT @sample:=@sample+1 AS entry_num, c.entry_time, c.value
  FROM (
        SELECT entry_time, value
          FROM table
         WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
           AND entry_time < CURRENT_DATE /* the previous two days, but not today */
         ORDER BY entry_time             /* ORDER BY must come after WHERE */
       ) c,
       (SELECT @sample:=0) s
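Why two days rather than one? A rough Python sketch, again with made-up data and a hypothetical helper called gaps_in_window, shows what happens to a hole that straddles midnight when the window is one day wide versus two.

```python
from datetime import date, datetime, time, timedelta

def gaps_in_window(entry_times, today, days_back, threshold):
    """Find gaps among samples in [today - days_back, today), midnight to midnight."""
    window_end = datetime.combine(today, time.min)
    window_start = window_end - timedelta(days=days_back)
    in_window = sorted(t for t in entry_times if window_start <= t < window_end)
    # Compare each sample with the one that follows it inside the window.
    return [(a, b) for a, b in zip(in_window, in_window[1:]) if b - a > threshold]

# Hypothetical samples with a hole that straddles midnight between Jan 1 and Jan 2.
samples = [
    datetime(2024, 1, 1, 23, 48),
    datetime(2024, 1, 1, 23, 54),
    datetime(2024, 1, 2, 0, 30),
    datetime(2024, 1, 2, 0, 36),
]
today = date(2024, 1, 3)
threshold = timedelta(minutes=15)

one_day = gaps_in_window(samples, today, 1, threshold)   # sees only Jan 2
two_days = gaps_in_window(samples, today, 2, threshold)  # sees Jan 1 and Jan 2
```

With a one-day window the sample just before midnight is outside the window, so there is nothing to compare the 00:30 sample against and one_day comes back empty. The two-day window includes both sides of midnight and reports the 23:54 to 00:30 gap.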
#2
1
A very efficient way to do this is with a stored procedure using cursors. I think this is simpler and more efficient than the other answers.
This procedure creates a cursor and iterates it through the datetime records that you are checking. If there is ever a gap of more than what you specify, it will write the gap's begin and end to a table.
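CREATE PROCEDURE findgaps()
BEGIN
  DECLARE done INT DEFAULT FALSE;
  DECLARE a, b DATETIME;
  DECLARE cur CURSOR FOR SELECT dateTimeCol FROM targetTable
                          ORDER BY dateTimeCol ASC;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
  OPEN cur;
  FETCH cur INTO a;
  read_loop: LOOP
    SET b = a;            -- b holds the previous row's timestamp
    FETCH cur INTO a;     -- a holds the current row's timestamp
    IF done THEN
      LEAVE read_loop;
    END IF;
    -- TIMESTAMPDIFF measures the gap in minutes; DATEDIFF would only count whole days
    IF TIMESTAMPDIFF(MINUTE, b, a) > [range you specify] THEN
      INSERT INTO tmp_table (gap_begin, gap_end)
      VALUES (b, a);      -- the gap runs from the earlier row (b) to the later row (a)
    END IF;
  END LOOP;
  CLOSE cur;
END;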
In this case it is assumed that 'tmp_table' exists. You could easily define this as a TEMPORARY table in the procedure, but I left it out of this example.
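For comparison, the cursor loop boils down to a single pass that remembers only the previous timestamp. Here is a small Python sketch of that logic, assuming the rows arrive already sorted; the name find_gaps_one_pass and the sample data are hypothetical, and the list it returns stands in for tmp_table.

```python
from datetime import datetime, timedelta

def find_gaps_one_pass(ordered_times, threshold):
    """Walk the rows once, keeping only the previous timestamp in memory."""
    rows = iter(ordered_times)
    gaps = []
    try:
        current = next(rows)      # the first FETCH, before the loop starts
    except StopIteration:
        return gaps               # no rows at all: nothing to compare
    for following in rows:
        previous, current = current, following
        if current - previous > threshold:
            gaps.append((previous, current))   # (gap_begin, gap_end)
    return gaps

# Hypothetical samples every six minutes, with a hole between 10:12 and 10:36.
samples = [datetime(2024, 1, 1, 10, m) for m in (0, 6, 12, 36, 42)]
gaps = find_gaps_one_pass(samples, timedelta(minutes=15))
```

Because it never holds more than two timestamps at a time, this style works on any iterator of rows, which is roughly what the cursor gives you inside the procedure.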