如何选择行和附近的行

时间:2021-05-13 10:18:30

SQL Fiddle

SQL小提琴

Background

I have a table of values that some need attention:

我有一些需要注意的值表:

| ID      | AddedDate   |
|---------|-------------|
|       1 | 2010-04-01  |
|       2 | 2010-04-01  |
|       3 | 2010-04-02  |
|       4 | 2010-04-02  |
|       5 | NULL        | <----------- needs attention
|       6 | 2010-04-02  |
|       7 | 2010-04-03  |
|       8 | 2010-04-04  |
|       9 | 2010-04-04  |
| 2432659 | 2016-06-15  |
| 2432650 | 2016-06-16  |
| 2432651 | 2016-06-17  |
| 2432672 | 2016-06-18  |
| 2432673 | NULL        | <----------- needs attention
| 2432674 | 2016-06-20  |
| 2432685 | 2016-06-21  |

I want to select the rows where AddedDate is null, and i want to select rows around it. In this example question it would be sufficient to say rows where the ID is ±3. This means i want:

我想选择AddedDate为null的行,我想选择它周围的行。在这个示例问题中,说出ID为±3的行就足够了。这意味着我想:

| ID      | AddedDate   |
|---------|-------------|
|       2 | 2010-04-01  | ─╮
|       3 | 2010-04-02  |  │
|       4 | 2010-04-02  |  │
|       5 | NULL        |  ├──ID values ±3
|       6 | 2010-04-02  |  │
|       7 | 2010-04-03  |  │
|       8 | 2010-04-04  | ─╯

| 2432672 | 2016-06-18  | ─╮
| 2432673 | NULL        |  ├──ID values ±3
| 2432674 | 2016-06-20  | ─╯

Note: In reality it's a table of 9M rows, and 15k need attention.

注意:实际上它是一个9M行的表,15k需要注意。

Attempts

First i create a query that builds the ranges i'm interested in returning:

首先,我创建一个查询来构建我有兴趣返回的范围:

SELECT
  ID-3 AS [Low ID],
  ID+3 AS [High ID]
FROM Items
WHERE AddedDate IS NULL

Low ID   High ID
-------  -------
2        8 
2432670  2432676

So my initial attempt to use this does work:

所以我最初尝试使用它确实有效:

WITH dt AS (
   SELECT ID-3 AS Low, ID+3 AS High
   FROM Items
   WHERE AddedDate IS NULL
)
SELECT * FROM Items
WHERE EXISTS(
    SELECT 1 FROM dt
    WHERE Items.ID BETWEEN dt.Low AND dt.High)

But when i try it on real data:

但是当我在真实数据上尝试时:

  • 9 million total rows
  • 总行数900万
  • 15,000 interesting rows
  • 15,000个有趣的行
  • subtree cost of 63,318,400
  • 子树成本为63,318,400
  • it takes hours (before i give up and cancel it)
  • 这需要几个小时(在我放弃并取消之前)

如何选择行和附近的行

There's probably a more efficient way.

可能有一种更有效的方式。

Bonus Reading

4 个解决方案

#1


4  

This is your existing logic rewritten using an moving max:

这是您使用移动最大值重写的现有逻辑:

WITH dt AS (
   SELECT
      ID, AddedDate,
      -- check if there's a NULL within a range of +/- 3 rows
      -- and remember it's ID 
      max(case when AddedDate is null then id end)
      over (order by id 
            rows between 3 preceding and 3 following) as NullID
   FROM Items 
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3

#2


3  

Here is one method that uses the windowing clause:

这是一个使用windowing子句的方法:

select i.*
from (select i.*,
             count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
             count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
             count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
             count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
      from items
     ) i
where cnt_ad_prec <> cnt_prec or
      cnt_ad_foll <> cnt_foll or
      addeddate is null;
order by id;

This returns all rows that have NULL in the column or are within three rows of a NULL.

这将返回列中具有NULL或位于NULL的三行内的所有行。

The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.

与计数进行比较的需要是避免最小和最大ID上的边缘问题。

#3


3  

Another way:

其他方式:

SELECT i1.*
    FROM Items i1, Items i2
        WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3

I hope there is index on AddedDate column.

我希望AddedDate列上有索引。

#4


1  

Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:

只是为了尝试与其他答案不同的方法...如何使用表变量来存储您想要的ID。然后你加入。我希望插入执行得足够快,然后SELECT可以利用Items中的聚簇索引。不幸的是,我没有你的数据来测试它的效率:

DECLARE @userData TABLE(
    idInRange int NOT NULL
)

INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r 
FROM Items i 
CROSS JOIN (
  SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
  SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r 
) yourRange
WHERE AddedDate IS NULL;

SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange

Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap

编辑在填充表变量时添加DISTINCT,以避免重复行,以防万一有两个连续的NULL日期并且它们的id范围重叠

#1


4  

This is your existing logic rewritten using an moving max:

这是您使用移动最大值重写的现有逻辑:

WITH dt AS (
   SELECT
      ID, AddedDate,
      -- check if there's a NULL within a range of +/- 3 rows
      -- and remember it's ID 
      max(case when AddedDate is null then id end)
      over (order by id 
            rows between 3 preceding and 3 following) as NullID
   FROM Items 
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3

#2


3  

Here is one method that uses the windowing clause:

这是一个使用windowing子句的方法:

select i.*
from (select i.*,
             count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
             count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
             count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
             count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
      from items
     ) i
where cnt_ad_prec <> cnt_prec or
      cnt_ad_foll <> cnt_foll or
      addeddate is null;
order by id;

This returns all rows that have NULL in the column or are within three rows of a NULL.

这将返回列中具有NULL或位于NULL的三行内的所有行。

The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.

与计数进行比较的需要是避免最小和最大ID上的边缘问题。

#3


3  

Another way:

其他方式:

SELECT i1.*
    FROM Items i1, Items i2
        WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3

I hope there is index on AddedDate column.

我希望AddedDate列上有索引。

#4


1  

Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:

只是为了尝试与其他答案不同的方法...如何使用表变量来存储您想要的ID。然后你加入。我希望插入执行得足够快,然后SELECT可以利用Items中的聚簇索引。不幸的是,我没有你的数据来测试它的效率:

DECLARE @userData TABLE(
    idInRange int NOT NULL
)

INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r 
FROM Items i 
CROSS JOIN (
  SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
  SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r 
) yourRange
WHERE AddedDate IS NULL;

SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange

Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap

编辑在填充表变量时添加DISTINCT,以避免重复行,以防万一有两个连续的NULL日期并且它们的id范围重叠