Find the first non-null value for multiple columns

Time: 2021-11-27 10:19:39

I'm attempting to get the first non-null value in a set of many columns. I'm aware that I could accomplish this using a sub-query per column. In the name of performance, which really does count in this scenario, I'd like to do this in a single pass.

Take the following example data:

col1     col2     col3     sortCol
====================================
NULL     4        8        1
1        NULL     0        2
5        7        NULL     3

My dream query would find the first non-null value in each of the data columns, sorted on the sortCol.

For example, selecting the magical aggregate of the first three columns, sorted by sortCol descending, would return:

col1     col2     col3
========================
5        7        0

Or when sorting ascending:

col1     col2     col3
========================
1        4        8

Does anyone know a solution?

4 solutions

#1


7  

Have you actually performance tested this solution before rejecting it?

SELECT
    (SELECT TOP(1) col1 FROM Table1 WHERE col1 IS NOT NULL ORDER BY SortCol) AS col1,
    (SELECT TOP(1) col2 FROM Table1 WHERE col2 IS NOT NULL ORDER BY SortCol) AS col2,
    (SELECT TOP(1) col3 FROM Table1 WHERE col3 IS NOT NULL ORDER BY SortCol) AS col3

If this is slow it's probably because you don't have an appropriate index. What indexes do you have?
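For anyone who wants to try the subquery approach on the question's sample data, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for SQL Server, so `LIMIT 1` takes the place of `TOP(1)`. The helper name `first_non_null` and the index name `IX_Table1_SortCol` are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (col1 INT, col2 INT, col3 INT, sortCol INT)")
# Hypothetical index name; an index on the sort column is what lets each
# subquery stop at the first qualifying row.
conn.execute("CREATE INDEX IX_Table1_SortCol ON Table1 (sortCol)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(None, 4, 8, 1), (1, None, 0, 2), (5, 7, None, 3)],
)


def first_non_null(direction):
    """Return the first non-null value of each column, ordered by sortCol."""
    assert direction in ("ASC", "DESC")  # interpolated into SQL, so whitelist it
    subqueries = ", ".join(
        f"(SELECT {c} FROM Table1 WHERE {c} IS NOT NULL "
        f"ORDER BY sortCol {direction} LIMIT 1) AS {c}"
        for c in ("col1", "col2", "col3")
    )
    return conn.execute(f"SELECT {subqueries}").fetchone()


print(first_non_null("ASC"))   # (1, 4, 8)
print(first_non_null("DESC"))  # (5, 7, 0)
```

Both results match the tables in the question. Whether the same shape performs well on SQL Server still depends on the indexes you actually have.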

#2


6  

The problem with implementing this as an aggregation (which you could indeed do if, for example, you implemented a "First-Non-Null" SQL CLR aggregate) is the IO wasted reading every row when you're typically only interested in the first few. The aggregation won't stop after the first non-null value, even though its implementation ignores everything after it. Aggregations are also unordered, so your result would depend on the ordering of the index selected by the query engine.

The subquery solution, by contrast, reads minimal rows for each query (since you only need the first matching row) and supports any ordering. It will also work on database platforms where it's not possible to define custom aggregates.
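The contrast between the two approaches can be sketched roughly as follows. This is an illustration only, assuming Python's sqlite3 module instead of SQL Server: the `FirstNonNull` class is a hypothetical Python aggregate (not the CLR one), and its `calls` counter is added purely to show how many rows the aggregate is handed.

```python
import sqlite3


class FirstNonNull:
    """Hypothetical aggregate: keeps the first non-null value it is handed."""

    calls = 0  # class-level counter of rows passed to step()

    def __init__(self):
        self.value = None

    def step(self, value):
        FirstNonNull.calls += 1
        if self.value is None and value is not None:
            self.value = value

    def finalize(self):
        return self.value


conn = sqlite3.connect(":memory:")
conn.create_aggregate("first_non_null", 1, FirstNonNull)
conn.execute("CREATE TABLE Table1 (sortCol INTEGER PRIMARY KEY, col1 INT)")
# Row 1 holds NULL; every later row holds a value.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(i, None if i == 1 else i * 10) for i in range(1, 1001)],
)

# Aggregate: every one of the 1000 rows is fed to step(), even though the
# answer is known after the second row.
conn.execute("SELECT first_non_null(col1) FROM Table1").fetchone()
print(FirstNonNull.calls)  # 1000

# Subquery form: only the first qualifying row in sortCol order is needed.
row = conn.execute(
    "SELECT col1 FROM Table1 WHERE col1 IS NOT NULL ORDER BY sortCol LIMIT 1"
).fetchone()
print(row)  # (20,)
```

The sketch deliberately does not rely on the value the aggregate returns, since that depends on the order in which the engine happens to scan the rows.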

Which one performs better will likely depend on the number of rows and columns in your table and how sparse your data is. Additional rows require reading more rows for the aggregate approach. Additional columns require additional subqueries. Sparser data requires checking more rows within each of the subqueries.

Here are some results for various table sizes:

Rows  Cols  Aggregation IO  Aggregation CPU  Subquery IO  Subquery CPU
3     3     2               0                6            0
1728  3     8               63               6            0
1728  8     12              266              16           0

The IO measured here is the number of logical reads. Notice that the number of logical reads for the subquery approach doesn't change with the number of rows in the table. Also keep in mind that the logical reads performed by each additional subquery will likely be for the same pages of data (containing the first few rows). Aggregation, on the other hand, has to process the entire table and involves some CPU time to do so.

This is the code I used for testing. The clustered index on SortCol is required because, in this case, it determines the order in which the aggregate sees the rows.

Defining the table and inserting test data:

CREATE TABLE Table1 (Col1 int null, Col2 int null, Col3 int null, SortCol int);
CREATE CLUSTERED INDEX IX_Table1 ON Table1 (SortCol);

WITH R (i) AS
(
 SELECT null

 UNION ALL

 SELECT 0

 UNION ALL

 SELECT i + 1
 FROM R
 WHERE i < 10
)
INSERT INTO Table1
SELECT a.i, b.i, c.i, ROW_NUMBER() OVER (ORDER BY NEWID())
FROM R a, R b, R c;
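To see where the 1728 rows come from: the CTE yields 12 values (NULL plus 0 through 10), and the three-way cross join gives 12 × 12 × 12 combinations. A rough Python equivalent, assuming sqlite3 as a stand-in and `random.shuffle` playing the role of `ORDER BY NEWID()`, might look like this:

```python
import itertools
import random
import sqlite3

values = [None] + list(range(0, 11))              # NULL, 0, 1, ..., 10 -> 12 values
rows = list(itertools.product(values, repeat=3))  # every (Col1, Col2, Col3) combination

sort_keys = list(range(1, len(rows) + 1))
random.shuffle(sort_keys)  # random but unique sort keys, like ROW_NUMBER() OVER (ORDER BY NEWID())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Col1 INT, Col2 INT, Col3 INT, SortCol INT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?, ?)",
    [(a, b, c, k) for (a, b, c), k in zip(rows, sort_keys)],
)

total = conn.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
null_col1 = conn.execute(
    "SELECT COUNT(*) FROM Table1 WHERE Col1 IS NULL"
).fetchone()[0]
print(total, null_col1)  # 1728 144
```

So one row in twelve has a NULL in any given column, which means each subquery rarely has to look past the first row or two.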

Querying the table:

SET STATISTICS IO ON;

--aggregation
SELECT TOP(0) * FROM Table1 --shortcut to convert columns back to their types
UNION ALL
SELECT
 dbo.FirstNonNull(Col1),
 dbo.FirstNonNull(Col2),
 dbo.FirstNonNull(Col3),
 null
FROM Table1;


--subquery
SELECT
    (SELECT TOP(1) Col1 FROM Table1 WHERE Col1 IS NOT NULL ORDER BY SortCol) AS Col1,
    (SELECT TOP(1) Col2 FROM Table1 WHERE Col2 IS NOT NULL ORDER BY SortCol) AS Col2,
    (SELECT TOP(1) Col3 FROM Table1 WHERE Col3 IS NOT NULL ORDER BY SortCol) AS Col3;

The CLR "first-non-null" aggregate to test:

 using System;
 using System.Data.SqlTypes;
 using System.IO;
 using Microsoft.SqlServer.Server;

 [Serializable]
 [SqlUserDefinedAggregate(
  Format.UserDefined,
  IsNullIfEmpty = true,
  IsInvariantToNulls = true,
  IsInvariantToDuplicates = true,
  IsInvariantToOrder = false, 
#if(SQL90)
  MaxByteSize = 8000
#else
  MaxByteSize = -1
#endif
 )]
 public sealed class FirstNonNull : IBinarySerialize
 {
  private SqlBinary Value;

  public void Init()
  {
   Value = SqlBinary.Null;
  }

  public void Accumulate(SqlBinary next)
  {
   if (Value.IsNull && !next.IsNull)
   {
    Value = next;
   }
  }

  public void Merge(FirstNonNull other)
  {
   Accumulate(other.Value);
  }

  public SqlBinary Terminate()
  {
   return Value;
  }

  #region IBinarySerialize Members

  public void Read(BinaryReader r)
  {
   int Length = r.ReadInt32();

   if (Length < 0)
   {
    Value = SqlBinary.Null;
   }
   else
   {
    // ReadBytes keeps reading until Length bytes arrive or the stream ends;
    // Read(buffer, 0, Length) is allowed to return fewer bytes.
    Value = new SqlBinary(r.ReadBytes(Length));
   }
  }

  public void Write(BinaryWriter w)
  {
   if (Value.IsNull)
   {
    w.Write(-1);
   }
   else
   {
    w.Write(Value.Length);
    w.Write(Value.Value);
   }
  }

  #endregion
 }

#3


1  

Not exactly elegant, but this does it in a single query. It will probably render any indexes useless, though, so as mentioned the multiple-subquery method is likely to be faster.


create table Foo (data1 tinyint, data2 tinyint, data3 tinyint, seq int not null)
go

insert into Foo (data1, data2, data3, seq)
values (NULL, 4, 8, 1), (1, NULL, 0, 2), (5, 7, NULL, 3)
go

with unpivoted as (
    select seq, value, col
    from (select seq, data1, data2, data3 from Foo) a
    unpivot (value FOR col IN (data1, data2, data3)) b
), firstSeq as (
    select min(seq) as seq, col
    from unpivoted
    group by col
), data as (
    select b.col, b.value
    from firstSeq a
    inner join unpivoted b on a.seq = b.seq and a.col = b.col
)
select * from data pivot (min(value) for col in (data1, data2, data3)) d
go

drop table Foo
go
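The same unpivot / min(seq) / pivot idea can be tried outside SQL Server. The sketch below assumes SQLite (via Python's sqlite3), which has no UNPIVOT or PIVOT, so `UNION ALL` and conditional aggregation are swapped in for them. It is an approximation of the query above, not a port of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (data1 INT, data2 INT, data3 INT, seq INT NOT NULL)")
conn.executemany(
    "INSERT INTO Foo VALUES (?, ?, ?, ?)",
    [(None, 4, 8, 1), (1, None, 0, 2), (5, 7, None, 3)],
)

query = """
WITH unpivoted AS (          -- one row per non-null (seq, col, value)
    SELECT seq, 'data1' AS col, data1 AS value FROM Foo WHERE data1 IS NOT NULL
    UNION ALL
    SELECT seq, 'data2', data2 FROM Foo WHERE data2 IS NOT NULL
    UNION ALL
    SELECT seq, 'data3', data3 FROM Foo WHERE data3 IS NOT NULL
), firstSeq AS (             -- lowest seq that has a value, per column
    SELECT MIN(seq) AS seq, col FROM unpivoted GROUP BY col
), data AS (
    SELECT u.col, u.value
    FROM firstSeq f
    JOIN unpivoted u ON u.seq = f.seq AND u.col = f.col
)
SELECT                       -- conditional aggregation stands in for PIVOT
    MIN(CASE WHEN col = 'data1' THEN value END) AS data1,
    MIN(CASE WHEN col = 'data2' THEN value END) AS data2,
    MIN(CASE WHEN col = 'data3' THEN value END) AS data3
FROM data
"""
result = conn.execute(query).fetchone()
print(result)  # (1, 4, 8)
```

The explicit `IS NOT NULL` filters mirror what UNPIVOT does implicitly: it drops NULL values, which is why `MIN(seq)` per column lands on the first non-null row.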

#4


1  

Here's another way to do it. This will be of most use if your database disallows top(N) in subqueries (such as mine, Teradata).

For comparison, here's the solution the other folks mentioned, using top(1):

select top(1) Col1 
from Table1 
where Col1 is not null 
order by SortCol asc

In an ideal world, that seems to me like the best way to do it - clean, intuitive, efficient (apparently).

Alternatively you can do this:

select max(Col1) -- max() guarantees a unique result
from Table1 
where SortCol in (
    select min(SortCol) 
    from Table1 
    where Col1 is not null
)

Both solutions retrieve the 'first' record along an ordered column. Top(1) is definitely more elegant and probably more efficient. The second method does the same thing conceptually, just with a more manual, explicit implementation.

The reason for the max() in the root select is that you can get multiple results if the value min(SortCol) shows up in more than one row in Table1. I'm not sure how Top(1) handles this scenario, by the way.
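The tie case is easy to reproduce. Here is a small sketch, assuming SQLite through Python's sqlite3 module rather than Teradata or SQL Server, with made-up rows in which two records share the lowest SortCol that has a non-null Col1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Col1 INT, SortCol INT)")
# Rows (3, 2) and (9, 2) tie on the lowest SortCol with a non-null Col1.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(None, 1), (3, 2), (9, 2), (5, 3)],
)

inner = "SELECT MIN(SortCol) FROM Table1 WHERE Col1 IS NOT NULL"

# Without max(): the IN filter matches both tied rows.
without_max = conn.execute(
    f"SELECT Col1 FROM Table1 WHERE SortCol IN ({inner}) ORDER BY Col1"
).fetchall()

# With max(): the ties collapse into a single value.
with_max = conn.execute(
    f"SELECT MAX(Col1) FROM Table1 WHERE SortCol IN ({inner})"
).fetchone()

print(without_max)  # [(3,), (9,)]
print(with_max)     # (9,)
```

Note that max() picks the largest of the tied values, which is a deterministic but arbitrary choice; min() would serve just as well if that suits your data better.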
