使用RANK或ROW_NUMBER创建可分组ID以在SQL Server中使用难以捉摸的顺序替换来连接行值

时间:2022-02-21 21:21:49

Before you make a judgment, this question is different from most of the other questions on this subject. Yes, it's true that I want to concatenate the text in certain rows; however, in most of the other cases, there's an ID value that is the same for each of the rows that a person desires to concatenate. In my case, it appears that I would need to create an ID value, but the problem is elusive because I can't seem to get the ROW_NUMBER() or RANK() functions to partition the values in the way that I'm seeking.

在做出判断之前,这个问题与大多数关于这个问题的其他问题不同。是的,我想在某些行中连接文本;但是,在大多数其他情况下,对于人们希望连接的每一行,ID值都是相同的。在我的情况下,似乎我需要创建一个ID值,但问题是难以捉摸的,因为我似乎无法获得ROW_NUMBER()或RANK()函数来按照我正在寻找的方式对值进行分区。

In the data, as the ID increases sequentially, I want to setup a column value like a ROW_NUMBER() but I want its count to reset every time the SpeakerID changes.

在数据中,随着ID顺序增加,我想设置一个类似ROW_NUMBER()的列值,但我希望每次SpeakerID更改时它的计数都会重置。

I have data that looks like this:

我的数据看起来像这样:

<table><tbody><tr><th>ID</th><th>ConversationLine</th><th>SpeakerName</th><th>SpeakerID</th><th>TeacherLineIfSpeaking</th><th>StudentLineIfSpeaking</th><th>TeacherIDifSpeaking</th><th>StudentIDifSpeaking</th><th>CleanLineID</th><th>ConvID</th></tr><tr><td>1</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>Mr. Roberts </td><td>299875</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>NULL</td><td>299875</td><td>NULL</td><td>1</td><td>1</td></tr><tr><td>2</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>Mr. Roberts </td><td>299875</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>NULL</td><td>299875</td><td>NULL</td><td>2</td><td>1</td></tr><tr><td>3</td><td> The new phone right?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> The new phone right?</td><td>NULL</td><td>9695521</td><td>3</td><td>1</td></tr><tr><td>4</td><td> correct....</td><td>Mr. Roberts </td><td>299875</td><td> correct....</td><td>NULL</td><td>299875</td><td>NULL</td><td>4</td><td>1</td></tr><tr><td>5</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>Mr. Roberts </td><td>299875</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>NULL</td><td>299875</td><td>NULL</td><td>5</td><td>1</td></tr><tr><td>6</td><td> I don't know ?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I don't know ?</td><td>NULL</td><td>9695521</td><td>6</td><td>1</td></tr><tr><td>7</td><td> Area of a rectangle = length x width</td><td>Mr. Roberts </td><td>299875</td><td> Area of a rectangle = length x width</td><td>NULL</td><td>299875</td><td>NULL</td><td>7</td><td>1</td></tr><tr><td>8</td><td> start with 'difference in areas = '</td><td>Mr. Roberts </td><td>299875</td><td> start with 'difference in areas = '</td><td>NULL</td><td>299875</td><td>NULL</td><td>8</td><td>1</td></tr><tr><td>9</td><td> after you clear your student answer box</td><td>Mr. Roberts </td><td>299875</td><td> after you clear your student answer box</td><td>NULL</td><td>299875</td><td>NULL</td><td>9</td><td>1</td></tr><tr><td>10</td><td> I already did</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I already did</td><td>NULL</td><td>9695521</td><td>10</td><td>1</td></tr></tbody></table>

what I want is like this (notice the new column, second from left):

我想要的是这样的(注意新列,左起第二个):

   <table><tbody><tr><th>ID</th><th>ChatID</th><th>ConversationLine</th><th>SpeakerName</th><th>SpeakerID</th><th>TeacherLineIfSpeaking</th><th>StudentLineIfSpeaking</th><th>TeacherIDifSpeaking</th><th>StudentIDifSpeaking</th><th>CleanLineID</th><th>ConvID</th></tr><tr><td>1</td><td> 1</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>Mr. Roberts </td><td>299875</td><td> Hi! Let's look over your problem again. Would you like me to type or talk?</td><td>NULL</td><td>299875</td><td>NULL</td><td>1</td><td>1</td></tr><tr><td>2</td><td> 1</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>Mr. Roberts </td><td>299875</td><td> Hi Gabriela... which phone has the larger area for the screen?</td><td>NULL</td><td>299875</td><td>NULL</td><td>2</td><td>1</td></tr><tr><td>3</td><td> 2</td><td> The new phone right?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> The new phone right?</td><td>NULL</td><td>9695521</td><td>3</td><td>1</td></tr><tr><td>4</td><td> 3</td><td> correct....</td><td>Mr. Roberts </td><td>299875</td><td> correct....</td><td>NULL</td><td>299875</td><td>NULL</td><td>4</td><td>1</td></tr><tr><td>5</td><td> 3</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>Mr. Roberts </td><td>299875</td><td> what will you need to do to calculate the area of either screen since we can assume the shape is a rectangle?</td><td>NULL</td><td>299875</td><td>NULL</td><td>5</td><td>1</td></tr><tr><td>6</td><td> 4</td><td> I don't know ?</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I don't know ?</td><td>NULL</td><td>9695521</td><td>6</td><td>1</td></tr><tr><td>7</td><td> 5</td><td> Area of a rectangle = length x width</td><td>Mr. Roberts </td><td>299875</td><td> Area of a rectangle = length x width</td><td>NULL</td><td>299875</td><td>NULL</td><td>7</td><td>1</td></tr><tr><td>8</td><td> 5</td><td> start with 'difference in areas = '</td><td>Mr. Roberts </td><td>299875</td><td> start with 'difference in areas = '</td><td>NULL</td><td>299875</td><td>NULL</td><td>8</td><td>1</td></tr><tr><td>9</td><td> 5</td><td> after you clear your student answer box</td><td>Mr. Roberts </td><td>299875</td><td> after you clear your student answer box</td><td>NULL</td><td>299875</td><td>NULL</td><td>9</td><td>1</td></tr><tr><td>10</td><td> 6</td><td> I already did</td><td>Gabriela </td><td>9695521</td><td>NULL</td><td> I already did</td><td>NULL</td><td>9695521</td><td>10</td><td>1</td></tr></tbody></table>

I realize that once I have the ChatID that I can use to group the values, I can use a recursive CTE or STUFF(..) with FOR XML or COALESCE with a variable or a CLR function, etc. to actually perform the concatenation.

我意识到,一旦我拥有了可用于对值进行分组的ChatID,我就可以使用带有变量或CLR函数等的FOR XML或COALESCE的递归CTE或STUFF(..)来实际执行连接。

I'm running SQL Server 2016.

我正在运行SQL Server 2016。

Also, one other thing I should mention is that there's no predictability regarding how long the dialogue sequences might go. It's possible that a speaker could have a sequence length of 40 consecutive messages (i.e. rows) before the dialogue changes speakers, so techniques that use a fixed number of inner joins are insufficient. Also, the performance of the solution needs to be reasonable because there are over 16 million rows in this database.

另外,我要提到的另一件事是对话序列可能会持续多长时间没有可预测性。在对话改变发言者之前,说话者可能具有40个连续消息(即行)的序列长度,因此使用固定数量的内部联接的技术是不够的。此外,解决方案的性能需要合理,因为此数据库中有超过1600万行。

Here's a trimmed down version of the data in a tabular format (minus some of the extra columns that made it not format so well). Input:

这是一个表格格式的数据的精简版本(减去一些额外的列,使其格式不是那么好)。输入:

╔════╦═══════════════════════╦══════════════╦═══════════╦═════════════╦════════╗
║ ID ║   ConversationLine    ║ SpeakerName  ║ SpeakerID ║ CleanLineID ║ ConvID ║
╠════╬═══════════════════════╬══════════════╬═══════════╬═════════════╬════════╣
║  1 ║  Hi! Let's look...    ║ Mr. Roberts  ║    299875 ║           1 ║      1 ║
║  2 ║  Hi Gabriela...       ║ Mr. Roberts  ║    299875 ║           2 ║      1 ║
║  3 ║  The new phone right? ║ Gabriela     ║   9695521 ║           3 ║      1 ║
║  4 ║  correct....          ║ Mr. Roberts  ║    299875 ║           4 ║      1 ║
║  5 ║  what will you ...?   ║ Mr. Roberts  ║    299875 ║           5 ║      1 ║
║  6 ║  I don't know ?       ║ Gabriela     ║   9695521 ║           6 ║      1 ║
║  7 ║  Area of  = ...       ║ Mr. Roberts  ║    299875 ║           7 ║      1 ║
║  8 ║  start with ...       ║ Mr. Roberts  ║    299875 ║           8 ║      1 ║
║  9 ║  after you ...        ║ Mr. Roberts  ║    299875 ║           9 ║      1 ║
║ 10 ║  I already did        ║ Gabriela     ║   9695521 ║          10 ║      1 ║
╚════╩═══════════════════════╩══════════════╩═══════════╩═════════════╩════════╝

and the desired output:

和期望的输出:

╔════╦════════╦══════════════════════╦═════════════╦═══════════╦═════════════╦════════╗
║ ID ║ ChatID ║   ConversationLine   ║ SpeakerName ║ SpeakerID ║ CleanLineID ║ ConvID ║
╠════╬════════╬══════════════════════╬═════════════╬═══════════╬═════════════╬════════╣
║  1 ║      1 ║ Hi! Let's look...    ║ Mr. Roberts ║    299875 ║           1 ║      1 ║
║  2 ║      1 ║ Hi Gabriela...       ║ Mr. Roberts ║    299875 ║           2 ║      1 ║
║  3 ║      2 ║ The new phone right? ║ Gabriela    ║   9695521 ║           3 ║      1 ║
║  4 ║      3 ║ correct....          ║ Mr. Roberts ║    299875 ║           4 ║      1 ║
║  5 ║      3 ║ what will you ...?   ║ Mr. Roberts ║    299875 ║           5 ║      1 ║
║  6 ║      4 ║ I don't know ?       ║ Gabriela    ║   9695521 ║           6 ║      1 ║
║  7 ║      5 ║ Area of  = ...       ║ Mr. Roberts ║    299875 ║           7 ║      1 ║
║  8 ║      5 ║ start with ...       ║ Mr. Roberts ║    299875 ║           8 ║      1 ║
║  9 ║      5 ║ after you ...        ║ Mr. Roberts ║    299875 ║           9 ║      1 ║
║ 10 ║      6 ║ I already did        ║ Gabriela    ║   9695521 ║          10 ║      1 ║
╚════╩════════╩══════════════════════╩═════════════╩═══════════╩═════════════╩════════╝

Edit: For those interested in the query execution plans of the proposed solutions, here's the estimated query execution plan for @kannan-kandasamy's solution (which you probably need to open in a new window to zoom in and see it because the picture is so wide): 使用RANK或ROW_NUMBER创建可分组ID以在SQL Server中使用难以捉摸的顺序替换来连接行值

编辑:对于那些对提出的解决方案的查询执行计划感兴趣的人,这里是@ kannan-kandasamy解决方案的估计查询执行计划(您可能需要在新窗口中打开以放大并查看它,因为图片太宽了):

Here's the estimated query execution plan for @vkp's solution: 使用RANK或ROW_NUMBER创建可分组ID以在SQL Server中使用难以捉摸的顺序替换来连接行值

以下是@ vkp解决方案的估计查询执行计划:

Edit 2: Here they are in the same batch: 使用RANK或ROW_NUMBER创建可分组ID以在SQL Server中使用难以捉摸的顺序替换来连接行值

编辑2:这里他们是同一批次:

What's interesting is that when I run them in the same batch, it shows @vkp's solution as requiring 99% of the batch cost. But when I ran both queries to SELECT INTO a new table, @vkp's solution ran in less than 1/5th of the time.

有趣的是,当我在同一批次中运行它们时,它显示@ vkp的解决方案需要99%的批量成本。但是,当我向SELECT INTO运行两个查询新表时,@ vkp的解决方案在不到1/5的时间内运行。

Here are the clustered index scan properties of @kannan-kandasamy's solution: 使用RANK或ROW_NUMBER创建可分组ID以在SQL Server中使用难以捉摸的顺序替换来连接行值 The clustered index scan properties of @vkp's solution appear to be identical for every statistic and logical value (except it says the estimated operator cost is 1% for @vkp's solution but 91% for @kannan-kandasamy's solution even though the actual operator cost values are identical).

以下是@ kannan-kandasamy解决方案的聚集索引扫描属性:@ vkp解决方案的聚簇索引扫描属性对于每个统计值和逻辑值看起来都是相同的(除了它表示@ vkp解决方案的估计运算符成本为1%,但91即使实际的运营商成本值相同,@ kannan-kandasamy解决方案的百分比也是如此。

2 个解决方案

#1


1  

This can be done with a difference of row numbers approach. (Run the inner-most query to see how consecutive rows with same speaker_id are assigned to the same group). Then get the start id of each group and use dense_rank to get the chat_id's as required, in order.

这可以通过行数方法的差异来完成。 (运行最内部查询以查看具有相同speaker_id的连续行如何分配给同一组)。然后获取每个组的起始ID并使用dense_rank按顺序获取chat_id。

select t.*,dense_rank() over(order by id_strt) as chat_id
from (select t.*,min(id) over(partition by grp,speakerid) as id_strt
      from (select t.*
            ,row_number() over(order by id)-row_number() over(partition by speakerid order by id) as grp
            from t
           ) t
     ) t

If you just need the chat_id's to identify a group and concatenate values, the inner-most query would suffice. When you are grouping by, just group by grp,speakerid.

如果你只需要chat_id来识别一个组并连接值,那么最内层的查询就足够了。当你分组时,只需按grp,speakerid分组。

#2


1  

You can use lead and windowing sum for achieving this:

您可以使用铅和窗口总和来实现此目的:

select *, ChatId = sum(case when speakerid <> NextSpeakerid then 1 else 0 end) over(order by id)+1 from (
    select *, NextSpeakerid = lag(speakerid, 1, null) over(order by id) from #yourgroup
) a

Output for this query:

此查询的输出:

+----+--------------+-----------+--------+
| ID | SpeakerName  | speakerid | ChatID |
+----+--------------+-----------+--------+
|  1 | Mr. Roberts  |  25239875 |      1 |
|  2 | Mr. Roberts  |  25239875 |      1 |
|  3 | Gabriela     |  19645521 |      2 |
|  4 | Mr. Roberts  |  25239875 |      3 |
|  5 | Mr. Roberts  |  25239875 |      3 |
|  6 | Gabriela     |  19645521 |      4 |
|  7 | Mr. Roberts  |  25239875 |      5 |
|  8 | Mr. Roberts  |  25239875 |      5 |
|  9 | Mr. Roberts  |  25239875 |      5 |
| 10 | Gabriela     |  19645521 |      6 |
+----+--------------+-----------+--------+

Your table:

create table #yourgroup (ID int identity(1,1), speakername varchar(20), speakerid int)

insert into #yourgroup ( speakername, speakerid) values
 ('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )

#1


1  

This can be done with a difference of row numbers approach. (Run the inner-most query to see how consecutive rows with same speaker_id are assigned to the same group). Then get the start id of each group and use dense_rank to get the chat_id's as required, in order.

这可以通过行数方法的差异来完成。 (运行最内部查询以查看具有相同speaker_id的连续行如何分配给同一组)。然后获取每个组的起始ID并使用dense_rank按顺序获取chat_id。

select t.*,dense_rank() over(order by id_strt) as chat_id
from (select t.*,min(id) over(partition by grp,speakerid) as id_strt
      from (select t.*
            ,row_number() over(order by id)-row_number() over(partition by speakerid order by id) as grp
            from t
           ) t
     ) t

If you just need the chat_id's to identify a group and concatenate values, the inner-most query would suffice. When you are grouping by, just group by grp,speakerid.

如果你只需要chat_id来识别一个组并连接值,那么最内层的查询就足够了。当你分组时,只需按grp,speakerid分组。

#2


1  

You can use lead and windowing sum for achieving this:

您可以使用铅和窗口总和来实现此目的:

select *, ChatId = sum(case when speakerid <> NextSpeakerid then 1 else 0 end) over(order by id)+1 from (
    select *, NextSpeakerid = lag(speakerid, 1, null) over(order by id) from #yourgroup
) a

Output for this query:

此查询的输出:

+----+--------------+-----------+--------+
| ID | SpeakerName  | speakerid | ChatID |
+----+--------------+-----------+--------+
|  1 | Mr. Roberts  |  25239875 |      1 |
|  2 | Mr. Roberts  |  25239875 |      1 |
|  3 | Gabriela     |  19645521 |      2 |
|  4 | Mr. Roberts  |  25239875 |      3 |
|  5 | Mr. Roberts  |  25239875 |      3 |
|  6 | Gabriela     |  19645521 |      4 |
|  7 | Mr. Roberts  |  25239875 |      5 |
|  8 | Mr. Roberts  |  25239875 |      5 |
|  9 | Mr. Roberts  |  25239875 |      5 |
| 10 | Gabriela     |  19645521 |      6 |
+----+--------------+-----------+--------+

Your table:

create table #yourgroup (ID int identity(1,1), speakername varchar(20), speakerid int)

insert into #yourgroup ( speakername, speakerid) values
 ('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Mr. Roberts  ', 25239875 )
,('Gabriela     ', 19645521 )