I have two tables, the first is a big table (millions of rows), with the most interesting column being an integer I'll just call "key." I believe this solution would be identical for date or datetime ranges as well though.
The second table is much smaller (thousands of rows), with a bunch of attributes that are interesting to me, each defined over a range of keys. It has the following structure:
key_lower_bound : int
key_upper_bound : int
interesting_value1 : float
interesting_value2 : int
interesting_value3 : varchar(50)
...
I want to look up all of the values in the first table and "join" them with the second table based on whether the key in the first table falls inside the interval [key_lower_bound, key_upper_bound).
This is sort of like a sparse inner product or sparse dot product mathematically, but it's a little weird since there are these ranges involved in the second table. Still, if I were to write this up in code it would be an O(|first table| + |second table|) algorithm. I would keep a pointer into both (sorted) lists and walk through each of them in order to determine whether each key in the first table belongs to a range in the second table. The trick is that I am not iterating through the second list each time I examine a key in the first table, because both lists are sorted.
When I construct the most obvious SQL query (involving checking that key is >= key_lower_bound and < key_upper_bound) it takes WAY too long.
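For concreteness, the naive query I'm describing looks roughly like this (FirstTable and SecondTable are placeholder names, and the >= / < pair reflects the half-open interval above):

select f.*, s.interesting_value1, s.interesting_value2, s.interesting_value3
from FirstTable f
inner join SecondTable s
on f.key >= s.key_lower_bound
and f.key < s.key_upper_bound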
There is some kind of quadratic behavior going on with that naive query, because I think the query engine is comparing each key against each row in the second table, when in reality, if the second table is sorted by key_lower_bound, this shouldn't be necessary. So I'm getting O(|first table| x |second table|) behavior instead of the desired O(|first table| + |second table|) behavior.
Is it possible to get a linear SQL query to do this?
4 Answers
#1
Well, I have played with the problem and have a couple of suggestions. But first, let's populate a helper table:
CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
INSERT INTO dbo.Numbers(n)
SELECT n + @i FROM dbo.Numbers;
SET @i = @i * 2;
END;
GO
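After the doubling loop finishes, dbo.Numbers should hold 1,048,576 consecutive integers (each pass doubles the row count while @i is still below 1,024,000). A quick sanity check:

SELECT COUNT(*) AS NumberCount, MIN(n) AS MinN, MAX(n) AS MaxN
FROM dbo.Numbers;
-- expected: 1048576, 1, 1048576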
and the test data: one-minute commercials every minute for one year, and one customer call per minute for the same year:
CREATE TABLE dbo.Commercials(
StartedAt DATETIME NOT NULL
CONSTRAINT PK_Commercials PRIMARY KEY,
EndedAt DATETIME NOT NULL,
CommercialName VARCHAR(30) NOT NULL);
GO
INSERT INTO dbo.Commercials(StartedAt, EndedAt, CommercialName)
SELECT DATEADD(minute, n - 1, '20080101')
,DATEADD(minute, n, '20080101')
,'Show #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
CREATE TABLE dbo.Calls(CallID INT NOT NULL
CONSTRAINT PK_Calls PRIMARY KEY,
AirTime DATETIME NOT NULL,
SomeInfo CHAR(300));
GO
INSERT INTO dbo.Calls(CallID,
AirTime,
SomeInfo)
SELECT n
,DATEADD(minute, n - 1, '20080101')
,'Call during Commercial #'+CAST(n AS VARCHAR(6))
FROM dbo.Numbers
WHERE n<=24*365*60;
GO
CREATE UNIQUE INDEX Calls_AirTime
ON dbo.Calls(AirTime) INCLUDE(SomeInfo);
GO
The original attempt to select all the calls made during commercials over a three-hour window in the middle of the year is terribly slow:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 30 ms.
(1 row(s) affected)
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 71704 ms, elapsed time = 36316 ms.
The reason is simple: we know that commercials do not overlap, so one call fits into at most one commercial, but the optimizer does not know it. We know that commercials are short, but the optimizer does not know that either. Both assumptions can be enforced as constraints, but the optimizer still will not use them.
Assuming that commercials are no longer than 15 minutes, we can tell that to the optimizer, and the query is very fast:
SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c
ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
) AS t;
SQL Server parse and compile time:
CPU time = 15 ms, elapsed time = 15 ms.
(1 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 24 ms.
Assuming that commercials do not overlap, so that one call fits into at most one commercial, we can tell that to the optimizer, and the query is again very fast:
SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Calls c CROSS APPLY(
SELECT TOP 1 s.StartedAt, s.EndedAt FROM dbo.Commercials s
WHERE c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
ORDER BY s.StartedAt DESC) AS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 7 ms.
(1 row(s) affected)
Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 31 ms.
#2
For the first table I would put a clustered index on "key". For the second table I would put a clustered index on "key_lower_bound". Then I would try:
select *
from FirstTable f
inner join SecondTable s
on f.key between s.key_lower_bound and s.key_upper_bound
I would then add a second non-clustered index on "key_upper_bound" to see if that improved the performance.
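A sketch of that index DDL, using the question's placeholder table and column names (and assuming neither table already has a clustered primary key):

CREATE CLUSTERED INDEX IX_FirstTable_key ON FirstTable([key]);
CREATE CLUSTERED INDEX IX_SecondTable_lower ON SecondTable(key_lower_bound);
-- optional second index to compare against:
CREATE NONCLUSTERED INDEX IX_SecondTable_upper ON SecondTable(key_upper_bound);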
#3
In my experience there is no easy and robust solution. I have successfully used denormalization in many similar cases, copying key_lower_bound and key_upper_bound to the big table and having a foreign key in the big table refer to the table with intervals. You also create a check constraint to make sure that (key > key_lower_bound and key < key_upper_bound), but this check only involves columns in one table, so it works OK. This is definitely denormalization, but the data never gets out of sync, because the FK constraint ensures that (key_lower_bound, key_upper_bound) in the big table matches the interval in the parent table. Because you don't need a join, your select performs very fast.
Similar problem solved by denormalization:
Let me know if you need full DDL, it is very easy to write up.
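A minimal sketch of what that DDL could look like (names are hypothetical, only one interesting value is shown, and the interval is treated as half-open to match the question):

CREATE TABLE dbo.Intervals(
key_lower_bound INT NOT NULL,
key_upper_bound INT NOT NULL,
interesting_value1 FLOAT NULL,
CONSTRAINT PK_Intervals PRIMARY KEY(key_lower_bound, key_upper_bound),
CONSTRAINT CHK_Intervals_Order CHECK(key_lower_bound < key_upper_bound));
GO
CREATE TABLE dbo.BigTable(
[key] INT NOT NULL CONSTRAINT PK_BigTable PRIMARY KEY,
key_lower_bound INT NOT NULL,
key_upper_bound INT NOT NULL,
-- the composite FK keeps the copied bounds in sync with the parent interval
CONSTRAINT FK_BigTable_Intervals FOREIGN KEY(key_lower_bound, key_upper_bound)
REFERENCES dbo.Intervals(key_lower_bound, key_upper_bound),
-- this check only touches columns of this table, so it is cheap to enforce
CONSTRAINT CHK_BigTable_KeyInRange CHECK([key] >= key_lower_bound AND [key] < key_upper_bound));
GO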
#4
To perform the linear algorithm that you describe would require 2 things that the database doesn't have:
- The ability to understand that each row in your small table contains a distinct (disjoint) mapping of many LargeTable.Key values to a single SmallTable.Range, and
- The tables would be required to be stored as linked lists or arrays (which they are not).
I believe the closest you will get to the behavior you are describing is a merge join:
select t1.key
from largeTable t1
inner merge join smallTable t2
on t1.key >= t2.key_lower_bound
and t1.key < t2.key_upper_bound
You should understand that a table is stored as a B-tree or heap - so it is optimized to look for particular nodes - not for scanning. Scanning means you must keep up to log_B(N) pointers (e.g. in a stack) to remember your place in the tree without having to traverse back in. And this isn't even talking about disk access patterns.
As a secondary performance idea, you should try defining a single value that represents the range and using that as the primary key of the smallTable, which can be referenced from the largeTable as a foreign key. This is more efficient than a compound key (which is essentially what your lower_bound and upper_bound columns represent). Perhaps a hashed value, such as PK = lower_bound & (upper_bound << some number of bits).
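One way to realize that idea, sketched in T-SQL with the question's placeholder names and assuming the bounds are non-negative INTs, is a persisted computed column that packs both bounds into a single BIGINT, which could then be indexed or used as a surrogate key:

ALTER TABLE SecondTable
ADD range_key AS (CAST(key_lower_bound AS BIGINT) * 4294967296 -- shift the lower bound up by 32 bits
+ CAST(key_upper_bound AS BIGINT)) PERSISTED;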
Just another reference that should illustrate why it's difficult for SQL to put this algorithm together. If you can use Matlab to process your stuff - that's probably a better bet :)