I have a legacy product that I have to maintain. One of the table is somewhat similar to the following example:
我有一个遗留产品需要维护。其中一张表格与下面的例子有些相似:
DECLARE @t TABLE
(
id INT,
DATA NVARCHAR(30)
);
INSERT INTO @t
SELECT 1,
'name: Jim Ey'
UNION ALL
SELECT 2,
'age: 43'
UNION ALL
SELECT 3,
'----------------'
UNION ALL
SELECT 4,
'name: Johnson Dom'
UNION ALL
SELECT 5,
'age: 34'
UNION ALL
SELECT 6,
'----------------'
UNION ALL
SELECT 7,
'name: Jason Thwe'
UNION ALL
SELECT 8,
'age: 22'
SELECT *
FROM @t;
/*
You will get the following result
id DATA
----------- ------------------------------
1 name: Jim Ey
2 age: 43
3 ----------------
4 name: Johnson Dom
5 age: 34
6 ----------------
7 name: Jason Thwe
8 age: 22
*/
Now I want to get the information in the following form:
现在我想以以下形式获取信息:
name age
-------------- --------
Jim Ey 43
Johnson Dom 34
Jason Thwe 22
What's the easiest way to do this? Thanks.
最简单的方法是什么?谢谢。
4 个解决方案
#1
4
Out of (slightly morbid) curiosity I tried to come up with a means of transforming the exact input data you have provided.
出于(有点病态的)好奇心,我试图想出一种方法来转换您提供的确切输入数据。
Far better, of course, would be to properly structure the original data. With a legacy system, this may not be possible, but an ETL process could be created to bring this information into an intermediate location so that an ugly query like this would not need to be run in real time.
当然,更好的办法是正确地构造原始数据。对于遗留系统,这可能是不可能的,但是可以创建一个ETL进程来将这些信息带到一个中间位置,这样就不需要实时运行这种丑陋的查询了。
Example #1
This example assumes that all IDs are consistent and sequential (otherwise, an additional ROW_NUMBER()
column or a new identity column would need to be used to guarantee correct remainder operations on ID).
这个例子假设所有的ID都是一致的和连续的(否则,需要使用一个额外的ROW_NUMBER()列或一个新的标识列来保证ID上正确的剩余操作)。
SELECT
Name = REPLACE( Name, 'name: ', '' ),
Age = REPLACE( Age, 'age: ', '' )
FROM
(
SELECT
Name = T2.Data,
Age = T1.Data,
RowNumber = ROW_NUMBER() OVER( ORDER BY T1.Id ASC )
FROM @t T1
INNER JOIN @t T2 ON T1.id = T2.id +1 -- offset by one to combine two rows
WHERE T1.id % 3 != 0 -- skip delimiter records
) Q1
-- skip every other record (minus delimiters, which have already been stripped)
WHERE RowNumber % 2 != 0
Example #2: No Dependency on Sequential IDs
This is a more practical example because the actual ID values do not matter, only the row sequence.
这是一个更实际的示例,因为实际的ID值并不重要,只有行序列。
DECLARE @NumberedData TABLE( RowNumber INT, Data VARCHAR( 100 ) );
INSERT @NumberedData( RowNumber, Data )
SELECT
RowNumber = ROW_NUMBER() OVER( ORDER BY id ASC ),
Data
FROM @t;
SELECT
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
Example #3: Cursor
Again, it would be best to avoid running a query like this in real time and use a scheduled, transactional ETL process. In my experience, semi-structured data like this is prone to anomalies.
同样,最好避免实时运行这样的查询,并使用预定的事务ETL进程。根据我的经验,像这样的半结构化数据容易出现异常。
While examples #1 and #2 (and the solutions provided by others) demonstrate clever ways of working with the data, a more practical way to transform this data would be a cursor. Why? it may actually perform better (no nested queries, recursion, pivoting, or row numbering) and even if it is slower it provides much better opportunities for error handling.
虽然示例#1和#2(以及其他提供的解决方案)演示了处理数据的巧妙方法,但是转换这些数据的更实际的方法是使用游标。为什么?它实际上可能执行得更好(没有嵌套查询、递归、旋转或行编号),即使它更慢,它也为错误处理提供了更好的机会。
-- this could be a table variable, temp table, or staging table
DECLARE @Results TABLE ( Name VARCHAR( 100 ), Age INT );
DECLARE @Index INT = 0, @Data VARCHAR( 100 ), @Name VARCHAR( 100 ), @Age INT;
DECLARE Person_Cursor CURSOR FOR SELECT Data FROM @t;
OPEN Person_Cursor;
FETCH NEXT FROM Person_Cursor INTO @Data;
WHILE( 1 = 1 )BEGIN -- busy loop so we can handle the iteration following completion
IF( @Index = 2 ) BEGIN
INSERT @Results( Name, Age ) VALUES( @Name, @Age );
SET @Index = 0;
END
ELSE BEGIN
-- optional: examine @Data for integrity
IF( @Index = 0 ) SET @Name = REPLACE( @Data, 'name: ', '' );
IF( @Index = 1 ) SET @Age = CAST( REPLACE( @Data, 'age: ', '' ) AS INT );
SET @Index = @Index + 1;
END
-- optional: examine @Index to see that there are no superfluous trailing
-- rows or rows omitted at the end.
IF( @@FETCH_STATUS != 0 ) BREAK;
FETCH NEXT FROM Person_Cursor INTO @Data;
END
CLOSE Person_Cursor;
DEALLOCATE Person_Cursor;
Performance
I created sample source data of 100K rows and the three aforementioned examples seem roughly equivalent for transforming data.
我创建了100K行的示例源数据,前面提到的三个示例在转换数据时大致相同。
I created a million rows of source data and a query similar to the following gives excellent performance for selecting a subset of rows (such as would be used in a grid on a web page or a report).
我创建了100万行源数据,类似于下面的查询在选择行的子集(比如在web页面或报表的网格中使用)方面提供了出色的性能。
-- INT IDENTITY( 1, 1 ) numbers the rows for us
DECLARE @NumberedData TABLE( RowNumber INT IDENTITY( 1, 1 ), Data VARCHAR( 100 ) );
-- subset selection; ordering/filtering can be done here but it will need to preserve
-- the original 3 rows-per-result structure and it will impact performance
INSERT @NumberedData( Data )
SELECT TOP 1000 Data FROM @t;
SELECT
N1.RowNumber,
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
I'm seeing execution times of 4-10ms (i7-3960x) against a set of a million records.
我看到执行时间是4-10ms (i7-3960x),而记录是100万。
#2
1
Given that table you can do this:
考虑到这个表格,你可以这样做:
;WITH DATA
AS
(
SELECT
SUBSTRING(t.DATA,CHARINDEX(':',t.DATA)+2,LEN(t.DATA)) AS value,
SUBSTRING(t.DATA,0,CHARINDEX(':',t.DATA)) AS ValueType,
ID,
ROW_NUMBER() OVER(ORDER BY ID) AS RowNbr
FROM
@t AS t
WHERE
NOT t.DATA='----------------'
)
, RecursiveCTE
AS
(
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
NEWID() AS ID
FROM
Data
WHERE
Data.RowNbr=1
UNION ALL
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
CASE
WHEN Data.ValueType='age'
THEN RecursiveCTE.ID
ELSE NEWID()
END AS ID
FROM
Data
JOIN RecursiveCTE
ON RecursiveCTE.RowNbr+1=Data.RowNbr
)
SELECT
pvt.name,
pvt.age
FROM
(
SELECT
ID,
value,
ValueType
FROM
RecursiveCTE
) AS SourceTable
PIVOT
(
MAX(Value)
FOR ValueType IN ([name],[age])
) AS pvt
Output
输出
Name Age
------------------
Jim Ey 43
Jason Thwe 22
Johnson Dom 34
#3
1
One solution without self-joins, recursion and with a single pass over the rows from @t
:
一个没有自连接、递归和从@t传递行的解决方案:
SELECT *
FROM
(
SELECT
CASE
WHEN a.DATA LIKE 'name:%' THEN 'Name'
ELSE 'Age'
END AS Attribute,
CASE
WHEN a.DATA LIKE 'name:%' THEN SUBSTRING(a.DATA, 7, 4000) --or LTRIM(SUBSTRING(...,6,...))
ELSE SUBSTRING(a.DATA, 6, 4000) --or LTRIM(SUBSTRING(...,5,...))
END AS Value,
(ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank
FROM @t a
WHERE a.DATA LIKE 'name:%' OR a.DATA LIKE 'age:%'
) b
PIVOT( MAX(b.Value) FOR b.Attribute IN ([Name], [Age]) ) pvt
Results:
结果:
PseudoDenseRank Name Age
--------------- ----------- ---
1 Jim Ey 43
2 Johnson Dom 34
3 Jason Thwe 22
Note 1: derived table b
will group name:%
and age:%
rows using (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2
. Results for derived table b
:
注1:派生表b将使用(ROW_NUMBER() /(按id排序)+ 1)/ 2对名称:%和年龄:%行进行分组。派生表b的结果:
Attribute Value ROW_NUMBER() OVER(ORDER BY id) PseudoDenseRank
--------- ----------- ------------------------------ ---------------
Name Jim Ey 1 1
Age 43 2 1
Name Johnson Dom 3 2
Age 34 4 2
Name Jason Thwe 5 3
Age 22 6 3
Note 2: if the values from id
column doesn't have gaps (ex. (id 1, name:Jim Ey), (id 3 age: 43) ) then you can use (a.id + 1) / 2 AS PseudoDenseRank
instead of (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank
.
注意2:如果id列中的值没有间隙(例如(id 1, name:Jim Ey), (id 3 age: 43)),那么可以使用(a)。id + 1) / 2为伪锯齿,而不是(ROW_NUMBER() /(按id排序)+ 1)/ 2为伪锯齿。
Note 3: if you use (a.id + 1) / 2 AS PseudoDenseRank
solution (to group name and age rows) then the first id value should be an odd number. If the first id value is a even number then you should use this expression a.id / 2 AS PseudoDenseRank
.
注3:如果你使用(a)id + 1) / 2作为伪加密解决方案(针对组名和年龄行),那么第一个id值应该是一个奇数。如果第一个id值是一个偶数,那么您应该使用这个表达式a。id / 2为伪锯齿。
#4
0
Here's another option if you upgrade to SQL Server 2012, which implements the OVER clause for aggregate functions. This approach will allow you to choose only those tags that you know you want and find them regardless of how many rows there are between names.
如果您升级到SQL Server 2012,这里还有一个选项,它实现聚合函数的OVER子句。这种方法将允许您只选择您知道需要的标记并找到它们,而不管名称之间有多少行。
This will also work if the names and ages are not always in the same order within a group of rows representing a single person.
如果姓名和年龄在一组代表单个人的行中并不总是以相同的顺序排列,那么也可以这样做。
with Ready2Pivot(tag,val,part) as (
select
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,1,CHARINDEX(':',DATA)-1) END as tag,
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,CHARINDEX(':',DATA)+1,8000) END as val,
max(id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END)
over (
order by id
)
from @t
where DATA like '_%:%'
)
select [name], [age]
from Ready2Pivot
pivot (
max(val)
for tag in ([name], [age])
) as p
If your legacy data has an entry with extra items (say "altName: Jimmy"), this query will ignore it. If your legacy data has no row (and no id number) for someone's age, it will give you NULL in that spot. It will associate all information with the closest preceding row with "name: ..." as the DATA, so it is important that every group of rows has a "name: ..." row.
如果您的遗留数据有一个条目(比如“altName: Jimmy”),那么这个查询将忽略它。如果您的遗留数据对某人的年龄没有行(而且没有id号),那么它将在该位置给您NULL。它将把所有的信息与前面最近的行“name:…”关联起来作为数据,所以每组行都有一个“name:…”行是很重要的。
#1
4
Out of (slightly morbid) curiosity I tried to come up with a means of transforming the exact input data you have provided.
出于(有点病态的)好奇心,我试图想出一种方法来转换您提供的确切输入数据。
Far better, of course, would be to properly structure the original data. With a legacy system, this may not be possible, but an ETL process could be created to bring this information into an intermediate location so that an ugly query like this would not need to be run in real time.
当然,更好的办法是正确地构造原始数据。对于遗留系统,这可能是不可能的,但是可以创建一个ETL进程来将这些信息带到一个中间位置,这样就不需要实时运行这种丑陋的查询了。
Example #1
This example assumes that all IDs are consistent and sequential (otherwise, an additional ROW_NUMBER()
column or a new identity column would need to be used to guarantee correct remainder operations on ID).
这个例子假设所有的ID都是一致的和连续的(否则,需要使用一个额外的ROW_NUMBER()列或一个新的标识列来保证ID上正确的剩余操作)。
SELECT
Name = REPLACE( Name, 'name: ', '' ),
Age = REPLACE( Age, 'age: ', '' )
FROM
(
SELECT
Name = T2.Data,
Age = T1.Data,
RowNumber = ROW_NUMBER() OVER( ORDER BY T1.Id ASC )
FROM @t T1
INNER JOIN @t T2 ON T1.id = T2.id +1 -- offset by one to combine two rows
WHERE T1.id % 3 != 0 -- skip delimiter records
) Q1
-- skip every other record (minus delimiters, which have already been stripped)
WHERE RowNumber % 2 != 0
Example #2: No Dependency on Sequential IDs
This is a more practical example because the actual ID values do not matter, only the row sequence.
这是一个更实际的示例,因为实际的ID值并不重要,只有行序列。
DECLARE @NumberedData TABLE( RowNumber INT, Data VARCHAR( 100 ) );
INSERT @NumberedData( RowNumber, Data )
SELECT
RowNumber = ROW_NUMBER() OVER( ORDER BY id ASC ),
Data
FROM @t;
SELECT
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
Example #3: Cursor
Again, it would be best to avoid running a query like this in real time and use a scheduled, transactional ETL process. In my experience, semi-structured data like this is prone to anomalies.
同样,最好避免实时运行这样的查询,并使用预定的事务ETL进程。根据我的经验,像这样的半结构化数据容易出现异常。
While examples #1 and #2 (and the solutions provided by others) demonstrate clever ways of working with the data, a more practical way to transform this data would be a cursor. Why? it may actually perform better (no nested queries, recursion, pivoting, or row numbering) and even if it is slower it provides much better opportunities for error handling.
虽然示例#1和#2(以及其他提供的解决方案)演示了处理数据的巧妙方法,但是转换这些数据的更实际的方法是使用游标。为什么?它实际上可能执行得更好(没有嵌套查询、递归、旋转或行编号),即使它更慢,它也为错误处理提供了更好的机会。
-- this could be a table variable, temp table, or staging table
DECLARE @Results TABLE ( Name VARCHAR( 100 ), Age INT );
DECLARE @Index INT = 0, @Data VARCHAR( 100 ), @Name VARCHAR( 100 ), @Age INT;
DECLARE Person_Cursor CURSOR FOR SELECT Data FROM @t;
OPEN Person_Cursor;
FETCH NEXT FROM Person_Cursor INTO @Data;
WHILE( 1 = 1 )BEGIN -- busy loop so we can handle the iteration following completion
IF( @Index = 2 ) BEGIN
INSERT @Results( Name, Age ) VALUES( @Name, @Age );
SET @Index = 0;
END
ELSE BEGIN
-- optional: examine @Data for integrity
IF( @Index = 0 ) SET @Name = REPLACE( @Data, 'name: ', '' );
IF( @Index = 1 ) SET @Age = CAST( REPLACE( @Data, 'age: ', '' ) AS INT );
SET @Index = @Index + 1;
END
-- optional: examine @Index to see that there are no superfluous trailing
-- rows or rows omitted at the end.
IF( @@FETCH_STATUS != 0 ) BREAK;
FETCH NEXT FROM Person_Cursor INTO @Data;
END
CLOSE Person_Cursor;
DEALLOCATE Person_Cursor;
Performance
I created sample source data of 100K rows and the three aforementioned examples seem roughly equivalent for transforming data.
我创建了100K行的示例源数据,前面提到的三个示例在转换数据时大致相同。
I created a million rows of source data and a query similar to the following gives excellent performance for selecting a subset of rows (such as would be used in a grid on a web page or a report).
我创建了100万行源数据,类似于下面的查询在选择行的子集(比如在web页面或报表的网格中使用)方面提供了出色的性能。
-- INT IDENTITY( 1, 1 ) numbers the rows for us
DECLARE @NumberedData TABLE( RowNumber INT IDENTITY( 1, 1 ), Data VARCHAR( 100 ) );
-- subset selection; ordering/filtering can be done here but it will need to preserve
-- the original 3 rows-per-result structure and it will impact performance
INSERT @NumberedData( Data )
SELECT TOP 1000 Data FROM @t;
SELECT
N1.RowNumber,
Name = REPLACE( N2.Data, 'name: ', '' ),
Age = REPLACE( N1.Data, 'age: ', '' )
FROM @NumberedData N1
INNER JOIN @NumberedData N2 ON N1.RowNumber = N2.RowNumber + 1
WHERE ( N1.RowNumber % 3 ) = 2;
DELETE @NumberedData;
I'm seeing execution times of 4-10ms (i7-3960x) against a set of a million records.
我看到执行时间是4-10ms (i7-3960x),而记录是100万。
#2
1
Given that table you can do this:
考虑到这个表格,你可以这样做:
;WITH DATA
AS
(
SELECT
SUBSTRING(t.DATA,CHARINDEX(':',t.DATA)+2,LEN(t.DATA)) AS value,
SUBSTRING(t.DATA,0,CHARINDEX(':',t.DATA)) AS ValueType,
ID,
ROW_NUMBER() OVER(ORDER BY ID) AS RowNbr
FROM
@t AS t
WHERE
NOT t.DATA='----------------'
)
, RecursiveCTE
AS
(
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
NEWID() AS ID
FROM
Data
WHERE
Data.RowNbr=1
UNION ALL
SELECT
Data.RowNbr,
Data.value,
Data.ValueType,
CASE
WHEN Data.ValueType='age'
THEN RecursiveCTE.ID
ELSE NEWID()
END AS ID
FROM
Data
JOIN RecursiveCTE
ON RecursiveCTE.RowNbr+1=Data.RowNbr
)
SELECT
pvt.name,
pvt.age
FROM
(
SELECT
ID,
value,
ValueType
FROM
RecursiveCTE
) AS SourceTable
PIVOT
(
MAX(Value)
FOR ValueType IN ([name],[age])
) AS pvt
Output
输出
Name Age
------------------
Jim Ey 43
Jason Thwe 22
Johnson Dom 34
#3
1
One solution without self-joins, recursion and with a single pass over the rows from @t
:
一个没有自连接、递归和从@t传递行的解决方案:
SELECT *
FROM
(
SELECT
CASE
WHEN a.DATA LIKE 'name:%' THEN 'Name'
ELSE 'Age'
END AS Attribute,
CASE
WHEN a.DATA LIKE 'name:%' THEN SUBSTRING(a.DATA, 7, 4000) --or LTRIM(SUBSTRING(...,6,...))
ELSE SUBSTRING(a.DATA, 6, 4000) --or LTRIM(SUBSTRING(...,5,...))
END AS Value,
(ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank
FROM @t a
WHERE a.DATA LIKE 'name:%' OR a.DATA LIKE 'age:%'
) b
PIVOT( MAX(b.Value) FOR b.Attribute IN ([Name], [Age]) ) pvt
Results:
结果:
PseudoDenseRank Name Age
--------------- ----------- ---
1 Jim Ey 43
2 Johnson Dom 34
3 Jason Thwe 22
Note 1: derived table b
will group name:%
and age:%
rows using (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2
. Results for derived table b
:
注1:派生表b将使用(ROW_NUMBER() /(按id排序)+ 1)/ 2对名称:%和年龄:%行进行分组。派生表b的结果:
Attribute Value ROW_NUMBER() OVER(ORDER BY id) PseudoDenseRank
--------- ----------- ------------------------------ ---------------
Name Jim Ey 1 1
Age 43 2 1
Name Johnson Dom 3 2
Age 34 4 2
Name Jason Thwe 5 3
Age 22 6 3
Note 2: if the values from id
column doesn't have gaps (ex. (id 1, name:Jim Ey), (id 3 age: 43) ) then you can use (a.id + 1) / 2 AS PseudoDenseRank
instead of (ROW_NUMBER() OVER(ORDER BY id) + 1) / 2 AS PseudoDenseRank
.
注意2:如果id列中的值没有间隙(例如(id 1, name:Jim Ey), (id 3 age: 43)),那么可以使用(a)。id + 1) / 2为伪锯齿,而不是(ROW_NUMBER() /(按id排序)+ 1)/ 2为伪锯齿。
Note 3: if you use (a.id + 1) / 2 AS PseudoDenseRank
solution (to group name and age rows) then the first id value should be an odd number. If the first id value is a even number then you should use this expression a.id / 2 AS PseudoDenseRank
.
注3:如果你使用(a)id + 1) / 2作为伪加密解决方案(针对组名和年龄行),那么第一个id值应该是一个奇数。如果第一个id值是一个偶数,那么您应该使用这个表达式a。id / 2为伪锯齿。
#4
0
Here's another option if you upgrade to SQL Server 2012, which implements the OVER clause for aggregate functions. This approach will allow you to choose only those tags that you know you want and find them regardless of how many rows there are between names.
如果您升级到SQL Server 2012,这里还有一个选项,它实现聚合函数的OVER子句。这种方法将允许您只选择您知道需要的标记并找到它们,而不管名称之间有多少行。
This will also work if the names and ages are not always in the same order within a group of rows representing a single person.
如果姓名和年龄在一组代表单个人的行中并不总是以相同的顺序排列,那么也可以这样做。
with Ready2Pivot(tag,val,part) as (
select
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,1,CHARINDEX(':',DATA)-1) END as tag,
CASE WHEN DATA like '_%:%' THEN SUBSTRING(DATA,CHARINDEX(':',DATA)+1,8000) END as val,
max(id * CASE WHEN DATA LIKE 'name:%' THEN 1 ELSE 0 END)
over (
order by id
)
from @t
where DATA like '_%:%'
)
select [name], [age]
from Ready2Pivot
pivot (
max(val)
for tag in ([name], [age])
) as p
If your legacy data has an entry with extra items (say "altName: Jimmy"), this query will ignore it. If your legacy data has no row (and no id number) for someone's age, it will give you NULL in that spot. It will associate all information with the closest preceding row with "name: ..." as the DATA, so it is important that every group of rows has a "name: ..." row.
如果您的遗留数据有一个条目(比如“altName: Jimmy”),那么这个查询将忽略它。如果您的遗留数据对某人的年龄没有行(而且没有id号),那么它将在该位置给您NULL。它将把所有的信息与前面最近的行“name:…”关联起来作为数据,所以每组行都有一个“name:…”行是很重要的。