SQL小提琴
Background
I have a table of values that some need attention:
我有一些需要注意的值表:
| ID | AddedDate |
|---------|-------------|
| 1 | 2010-04-01 |
| 2 | 2010-04-01 |
| 3 | 2010-04-02 |
| 4 | 2010-04-02 |
| 5 | NULL | <----------- needs attention
| 6 | 2010-04-02 |
| 7 | 2010-04-03 |
| 8 | 2010-04-04 |
| 9 | 2010-04-04 |
| 2432659 | 2016-06-15 |
| 2432650 | 2016-06-16 |
| 2432651 | 2016-06-17 |
| 2432672 | 2016-06-18 |
| 2432673 | NULL | <----------- needs attention
| 2432674 | 2016-06-20 |
| 2432685 | 2016-06-21 |
I want to select the rows where AddedDate
is null, and i want to select rows around it. In this example question it would be sufficient to say rows where the ID
is ±3. This means i want:
我想选择AddedDate为null的行,我想选择它周围的行。在这个示例问题中,说出ID为±3的行就足够了。这意味着我想:
| ID | AddedDate |
|---------|-------------|
| 2 | 2010-04-01 | ─╮
| 3 | 2010-04-02 | │
| 4 | 2010-04-02 | │
| 5 | NULL | ├──ID values ±3
| 6 | 2010-04-02 | │
| 7 | 2010-04-03 | │
| 8 | 2010-04-04 | ─╯
| 2432672 | 2016-06-18 | ─╮
| 2432673 | NULL | ├──ID values ±3
| 2432674 | 2016-06-20 | ─╯
Note: In reality it's a table of 9M rows, and 15k need attention.
注意:实际上它是一个9M行的表,15k需要注意。
Attempts
First i create a query that builds the ranges i'm interested in returning:
首先,我创建一个查询来构建我有兴趣返回的范围:
SELECT
ID-3 AS [Low ID],
ID+3 AS [High ID]
FROM Items
WHERE AddedDate IS NULL
Low ID High ID
------- -------
2 8
2432670 2432676
So my initial attempt to use this does work:
所以我最初尝试使用它确实有效:
WITH dt AS (
SELECT ID-3 AS Low, ID+3 AS High
FROM Items
WHERE AddedDate IS NULL
)
SELECT * FROM Items
WHERE EXISTS(
SELECT 1 FROM dt
WHERE Items.ID BETWEEN dt.Low AND dt.High)
But when i try it on real data:
但是当我在真实数据上尝试时:
- 9 million total rows
- 总行数900万
- 15,000 interesting rows
- 15,000个有趣的行
- subtree cost of 63,318,400
- 子树成本为63,318,400
- it takes hours (before i give up and cancel it)
- 这需要几个小时(在我放弃并取消之前)
There's probably a more efficient way.
可能有一种更有效的方式。
Bonus Reading
- Select a row and rows around it
- 选择它周围的行和行
- Select Rows with matching columns from SQL Server
- 从SQL Server中选择具有匹配列的行
- How can I search for rows "around" a given string value?
- 如何搜索给定字符串值“周围”的行?
- How to get N rows starting from row M from sorted table in T-SQL
- 如何从T-SQL中的排序表中的行M开始获取N行
4 个解决方案
#1
4
This is your existing logic rewritten using an moving max:
这是您使用移动最大值重写的现有逻辑:
WITH dt AS (
SELECT
ID, AddedDate,
-- check if there's a NULL within a range of +/- 3 rows
-- and remember it's ID
max(case when AddedDate is null then id end)
over (order by id
rows between 3 preceding and 3 following) as NullID
FROM Items
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3
#2
3
Here is one method that uses the windowing clause:
这是一个使用windowing子句的方法:
select i.*
from (select i.*,
count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
from items
) i
where cnt_ad_prec <> cnt_prec or
cnt_ad_foll <> cnt_foll or
addeddate is null;
order by id;
This returns all rows that have NULL
in the column or are within three rows of a NULL
.
这将返回列中具有NULL或位于NULL的三行内的所有行。
The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.
与计数进行比较的需要是避免最小和最大ID上的边缘问题。
#3
3
Another way:
其他方式:
SELECT i1.*
FROM Items i1, Items i2
WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3
I hope there is index on AddedDate
column.
我希望AddedDate列上有索引。
#4
1
Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:
只是为了尝试与其他答案不同的方法...如何使用表变量来存储您想要的ID。然后你加入。我希望插入执行得足够快,然后SELECT可以利用Items中的聚簇索引。不幸的是,我没有你的数据来测试它的效率:
DECLARE @userData TABLE(
idInRange int NOT NULL
)
INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r
FROM Items i
CROSS JOIN (
SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r
) yourRange
WHERE AddedDate IS NULL;
SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange
Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap
编辑在填充表变量时添加DISTINCT,以避免重复行,以防万一有两个连续的NULL日期并且它们的id范围重叠
#1
4
This is your existing logic rewritten using an moving max:
这是您使用移动最大值重写的现有逻辑:
WITH dt AS (
SELECT
ID, AddedDate,
-- check if there's a NULL within a range of +/- 3 rows
-- and remember it's ID
max(case when AddedDate is null then id end)
over (order by id
rows between 3 preceding and 3 following) as NullID
FROM Items
)
SELECT *
FROM dt
where id between NullID-3 and NullID+3
#2
3
Here is one method that uses the windowing clause:
这是一个使用windowing子句的方法:
select i.*
from (select i.*,
count(*) over (order by id rows between 3 preceding and 1 preceding) as cnt_prec,
count(*) over (order by id rows between 1 following and 3 following) as cnt_foll,
count(addeddate) over (order by id rows between 3 preceding and 1 preceding) as cnt_ad_prec,
count(addeddate) over (order by id rows between 1 following and 3 following) as cnt_ad_foll
from items
) i
where cnt_ad_prec <> cnt_prec or
cnt_ad_foll <> cnt_foll or
addeddate is null;
order by id;
This returns all rows that have NULL
in the column or are within three rows of a NULL
.
这将返回列中具有NULL或位于NULL的三行内的所有行。
The need for the comparison to the count is to avoid the edge issues on the smallest and largest ids.
与计数进行比较的需要是避免最小和最大ID上的边缘问题。
#3
3
Another way:
其他方式:
SELECT i1.*
FROM Items i1, Items i2
WHERE i2.AddedDate IS NULL AND ABS(i1.ID - i2.ID) <= 3
I hope there is index on AddedDate
column.
我希望AddedDate列上有索引。
#4
1
Just to try a different approach than other answers... How about using a table variable to store the ids you want. Then you join. My hope is that the insert executes fast enough, and then the SELECT can take advantage of the clustered index in Items. Unfortunately I don't have here your amount of data to test its efficiency:
只是为了尝试与其他答案不同的方法...如何使用表变量来存储您想要的ID。然后你加入。我希望插入执行得足够快,然后SELECT可以利用Items中的聚簇索引。不幸的是,我没有你的数据来测试它的效率:
DECLARE @userData TABLE(
idInRange int NOT NULL
)
INSERT INTO @userData (idInRange)
SELECT DISTINCT i.Id + r
FROM Items i
CROSS JOIN (
SELECT -3 as r UNION ALL SELECT -2 as r UNION ALL SELECT -1 as r UNION ALL SELECT 0 as r UNION ALL
SELECT 1 as r UNION ALL SELECT 2 as r UNION ALL SELECT 3 as r
) yourRange
WHERE AddedDate IS NULL;
SELECT i.*
FROM @userData u
INNER JOIN Items i ON i.ID = u.idInRange
Edited to add a DISTINCT when filling the table variable, to avoid duplicated rows just in case there are two contiguous NULL dates and their id ranges overlap
编辑在填充表变量时添加DISTINCT,以避免重复行,以防万一有两个连续的NULL日期并且它们的id范围重叠