I have a table that contains customers information. Each customer is assigned a Customer ID that they retain as they open more accounts. Two customers may be on the same account, each with their own ID. The account numbers are not ordered by date.
我有一个包含客户信息的表。为每个客户分配一个客户ID,他们在打开更多帐户时会保留这些客户ID。两个客户可能在同一个帐户中,每个客户都有自己的ID。帐号未按日期排序。
I would like to find the most recent account of each customer or pair of customers. If two customers have ever been on an account together, I want to return the most recent account either customer has been on.
我想找到每个客户或一对客户的最新帐户。如果两个客户曾经在一个帐户上,我想要返回客户所在的最新帐户。
Here is a sample table with some of the possible cases.
这是一个包含一些可能情况的示例表。
Example table ACCT:
示例表ACCT:
acctnumber date Cust1ID Cust2ID
10000 2.1.16 1110 --Case0-customer has only ever had one account
10001 2.1.16 1111 --Case1-one customer has multiple accounts
10050 2.1.17 1111
10008 2.1.16 1120 --Case2-customer has an account and later gets
10058 2.3.17 1120 1121 --a second account with another customer
10002 2.1.16 1112 --Case3-customer has an account and later becomes
10052 2.2.17 1113 1112 --the second customer on another account
10003 2.2.16 1114 1115 --Case4-customer and second customer switch which
7060 2.4.17 1115 1114 --is first and second
10004 2.2.16 1116 1117 --Case5-second customer later gets separate account
10067 2.5.17 1117
10005 2.1.16 1118 --Case6-customer gets second account with another
10054 2.3.17 1118 1119 --who later gets a separate account
10101 6.2.17 1119
10007 2.1.16 1122 1123 --Case7-customers play musical chairs
10057 2.3.17 1123 1124
10107 6.2.17 1124 1125
Desired Results:
期望的结果:
acctnumber date Cust1ID Cust2ID
10000 2.1.16 1110 --Case0
10050 2.1.17 1111 --Case1
10058 2.3.17 1120 1121 --Case2
10052 2.2.17 1113 1112 --Case3
7060 2.4.17 1115 1114 --Case4
10067 2.5.17 1117 --Case5
10101 6.2.17 1119 --Case6
10107 6.2.17 1124 1125 --Case7
Alternatively, I would accept Case 7 outputting the two separate customer groups:
或者,我会接受案例7输出两个独立的客户群:
10007 2.1.16 1122 1123 --Case7 group a
10107 6.2.17 1124 1125 --Case7 group b
I have attempted joining the table to a copy of itself whenever acctnumbers are unique and any of the Cust IDs match. However, this removes customers who have only had one account so I added a union of cust that have no matches on the custid or account number and groups by custid.
每当acctnumbers唯一且任何Cust ID匹配时,我都尝试将表连接到自身的副本。但是,这会删除只有一个帐户的客户,因此我添加了一个由custid在custid或帐号和组上没有匹配的cust联合。
Unfortunately, the second piece does not only include custids from case 0 and there are some custids which are excluded all together that shouldn't be.
不幸的是,第二部分不仅包括来自案例0的custids,还有一些custids被排除在一起,不应该是。
select
max(date1) as date,
cust1id1 as cust1id
from
(
select
acctnumber as [acctnumber1],
date as [date1],
cust1id as [cust1id1],
cust2id as [cust2id1]
from
acct
) t1
join
(
select
acctnumber as [acctnumber2],
date as [date2],
cust1id as [cust1id2],
cust2id as [cust2id2]
from
acct
) t2
on t1.date1 > t2.date2 and
(t1.cust1id1 = t2.cust1id2 or
t1.cust1id1 = t2.cust2id2 or
t1.cust2id1 = t2.cust2id2)
Group by
cust1id1
union
select
max(date1) as date,
cust1id1 as cust1id
from
(
select
acctnumber as [acctnumber1],
date as [date1],
cust1id as [cust1id1],
cust2id as [cust2id1]
from
acct
) t1
join
(
select
acctnumber as [acctnumber2],
date as [date2],
cust1id as [cust1id2],
cust2id as [cust2id2]
from
acct
) t2
on (t1.acctnumber1 != t2.acctnumber2 and
t1.cust1id1 != t2.cust1id2 and
t1.cust1id1 != t2.cust2id2 and
t1.cust2id1 != t2.cust2id2)
group by
cust1id1
Update
Thank you for all the great answers and comments so far.
感谢您迄今为止所有出色的答案和评论。
@VladimirBaranov has brought up a rare case that I had not previously considered in comments to other answers.
@VladimirBaranov提出了一个罕见的案例,我以前没有在评论中考虑其他答案。
acctnumber date Cust1ID Cust2ID
50001 '2016-01-01' 2001 NULL --Case8a-customers with an account together
50002 '2017-02-02' 2001 2002 --each later get a separate account
50003 '2017-03-03' 2001 NULL
50004 '2017-04-04' 2002 NULL
50005 '2016-01-01' 2003 NULL --Case8b-customers with an account together
50006 '2017-02-02' 2003 2004 --each later get a separate account
50007 '2017-03-03' 2004 NULL
50008 '2017-04-04' 2003 NULL
50009 '2016-01-01' 2005 NULL --Case9a -customer has one or more individual accounts
50010 '2017-02-02' 2005 2006 --and later gets a second customer on a least one account
50011 '2017-03-03' 2005 2007 --and later still gets a third customer on at least one account
50012 '2017-04-04' 2005 NULL --Case9b starts the same as Case9a, but after all accounts with other customers are complete, the original primary customer begins opening individual accounts again
50013 '2016-01-01' 2008 NULL --Case5b -customer has one or more individual accounts
50014 '2017-02-02' 2008 2009 --and later gets a second customer on a least one account
50015 '2017-04-04' 2008 NULL
Because Case 8 would represent the location acknowledging the customers are worthy of holding separate accounts, we would want to then consider their group as splitting.
因为案例8代表了承认客户值得拥有单独账户的位置,我们会想要将他们的小组视为分裂。
desired result
期望的结果
acctnumber date Cust1ID Cust2ID
50003 '2017-03-03' 2001 NULL --Case8a
50004 '2017-04-04' 2002 NULL --Case8a
50007 '2017-03-03' 2004 NULL --Case8b
50008 '2017-04-04' 2003 NULL --Case8b
50011 '2017-03-03' 2005 2007 --Case9a
50012 '2017-04-04' 2005 NULL --Case9b
50015 '2017-04-04' 2008 NULL --Case5b
Similarly to case 7, it will be a bonus if Case8 is handled, but not expected.
与案例7类似,如果处理Case8将是一个奖励,但不是预期的。
Case 9 is important and the result for 9a and 9b should be handled.
案例9很重要,应该处理9a和9b的结果。
In Case 5b, the desired result is to have the most recent account from the first customer.
在案例5b中,期望的结果是拥有来自第一个客户的最新帐户。
Reviewing the answers I see many have index, create, update and other clauses specific to being able to edit the database. Unfortunately, I am on the consumer side of this database so I have read only access, and any query with those keywords will be automatically rejected.
回顾我看到的答案很多都有索引,创建,更新和其他特定于能够编辑数据库的子句。不幸的是,我在这个数据库的消费者方面,所以我只读访问权限,任何带有这些关键字的查询都将被自动拒绝。
I noticed an issue with my original set of 7 cases. In more recent accounts, when a customer is no longer on the account, it was always the second borrower that remained. This was entirely unintentional, you can look at any of those examples and either customer can potentially be the remaining customer on the most recent account. I have tried to show this with a new example as Case 5b.
我注意到我原来的一组7个案例存在问题。在最近的帐户中,当客户不再在帐户上时,它始终是剩下的第二个借款人。这完全是无意的,您可以查看这些示例中的任何一个,并且客户可能是最近帐户中的剩余客户。我试图用案例5b的新例子来证明这一点。
Also, in most scenarios the customers have many accounts, and mix and matching the above cases overtime is common. For example, a single customer can have five accounts (Case 1), then later opens one or more accounts with another customer (Case 3) sometimes switching the primary account holder (Case 4) then afterwards the first customer begins opening individual accounts again (Case 5b).
此外,在大多数情况下,客户有很多帐户,并且混合和匹配上述情况加班很常见。例如,单个客户可以拥有五个帐户(案例1),然后稍后与另一个客户打开一个或多个帐户(案例3),有时会切换主帐户持有人(案例4),然后第一个客户再次开始打开个人帐户(案例5b)。
9 个解决方案
#1
2
I'd like to thank Jeff Breadner for the DDL with sample data.
我要感谢Jeff Breadner为DDL提供样本数据。
You'll have to run the below query step-by-step, CTE-by-CTE and examine intermediate results to understand what it does. It assumes that AcctNumber
is unique in the given table.
您必须逐步运行以下查询,CTE-by-CTE并检查中间结果以了解它的作用。它假定AcctNumber在给定表中是唯一的。
At first I want to find the latest account for each individual customer. It is a simple top-n-per-group
query and I'm using a ROW_NUMBER
approach here.
首先,我想找到每个客户的最新帐户。这是一个简单的top-n-group-group查询,我在这里使用ROW_NUMBER方法。
CTE_Customers
makes a plain list of all individual customers by putting together Cust1ID
and Cust2ID
. CTE_RN
assigns them row numbers. CTE_LatestAccounts
gives latest account for each individual customer:
CTE_Customers通过将Cust1ID和Cust2ID放在一起,生成所有个人客户的简单列表。 CTE_RN为它们分配行号。 CTE_LatestAccounts为每个客户提供最新帐户:
+------------------+------------+--------+
| LatestAcctNumber | LatestDT | CustID |
+------------------+------------+--------+
| 10000 | 2016-02-01 | 1110 |
| 10050 | 2017-02-01 | 1111 |
| 10052 | 2017-02-02 | 1112 |
| 10052 | 2017-02-02 | 1113 |
| 7060 | 2017-02-04 | 1114 |
| 7060 | 2017-02-04 | 1115 |
| 10004 | 2016-02-02 | 1116 |
| 10067 | 2017-02-05 | 1117 |
| 10054 | 2017-02-03 | 1118 |
| 10101 | 2017-06-02 | 1119 |
| 10058 | 2017-02-03 | 1120 |
| 10058 | 2017-02-03 | 1121 |
| 10007 | 2016-02-01 | 1122 |
| 10057 | 2017-02-03 | 1123 |
| 10107 | 2017-06-02 | 1124 |
| 10107 | 2017-06-02 | 1125 |
+------------------+------------+--------+
The task is complicated by having customer pairs which "propagates" the latest account to another customer.
通过将最新帐户“传播”给另一个客户的客户对,任务变得复杂。
Customer pairs are defined in the original table, so CTE_MaxLatestAccounts
takes each row from the original table and joins latest accounts to it twice - for Cust1D
and Cust2ID
. For each pair I'm picking one of the two latest accounts - the most recent one. Thus a customer that belongs to a pair may get an account from its partner.
客户对在原始表中定义,因此CTE_MaxLatestAccounts从原始表中获取每一行,并将最新帐户连接到它两次 - 对于Cust1D和Cust2ID。对于每一对,我选择了最近的两个账户中的一个 - 最近的账户。因此,属于一对的客户可以从其合作伙伴处获得帐户。
+---------+---------+-------------+---------------------+
| Cust1ID | Cust2ID | MaxLatestDT | MaxLatestAcctNumber |
+---------+---------+-------------+---------------------+
| 1110 | NULL | 2016-02-01 | 10000 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1120 | NULL | 2017-02-03 | 10058 |
| 1120 | 1121 | 2017-02-03 | 10058 |
| 1112 | NULL | 2017-02-02 | 10052 |
| 1113 | 1112 | 2017-02-02 | 10052 |
| 1114 | 1115 | 2017-02-04 | 7060 |
| 1115 | 1114 | 2017-02-04 | 7060 |
| 1116 | 1117 | 2017-02-05 | 10067 |
| 1117 | NULL | 2017-02-05 | 10067 |
| 1118 | NULL | 2017-02-03 | 10054 |
| 1118 | 1119 | 2017-06-02 | 10101 |
| 1119 | NULL | 2017-06-02 | 10101 |
| 1122 | 1123 | 2017-02-03 | 10057 |
| 1123 | 1124 | 2017-06-02 | 10107 |
| 1124 | 1125 | 2017-06-02 | 10107 |
+---------+---------+-------------+---------------------+
The MaxLatestAcctNumber
here is applicable to both Cust1ID
and Cust2ID
. The same customer may be listed several times here and we need to choose an entry again with the latest account. Here this is the latest account of a pair, not for individual customer.
此处的MaxLatestAcctNumber适用于Cust1ID和Cust2ID。同一客户可能会在此处多次列出,我们需要使用最新帐户再次选择一个条目。这是一对货币的最新账户,不适用于个人客户。
The approach is the same as in the beginning. Put both Cust1ID
and Cust2ID
customers in a list: CTE_CustomersWithLatestAccountFromPair
. Assign row numbers in CTE_CustomersWithLatestAccountFromPairRN
and pick final account in CTE_FinalAccounts
.
方法与开始时的方法相同。将Cust1ID和Cust2ID客户放在一个列表中:CTE_CustomersWithLatestAccountFromPair。在CTE_CustomersWithLatestAccountFromPairRN中分配行号,并在CTE_FinalAccounts中选择最终帐户。
+---------------------+
| MaxLatestAcctNumber |
+---------------------+
| 10000 |
| 10050 |
| 10052 |
| 10052 |
| 7060 |
| 7060 |
| 10067 |
| 10067 |
| 10101 |
| 10101 |
| 10058 |
| 10058 |
| 10057 |
| 10107 |
| 10107 |
| 10107 |
+---------------------+
Now we just need to filter the original table and leave only those rows (accounts) that appear in this list. See the final result below.
现在我们只需要过滤原始表并仅保留此列表中显示的那些行(帐户)。请参阅下面的最终结果。
Sample data
样本数据
declare @ACCT table (
AcctNumber int,
dt date,
Cust1ID int,
Cust2ID int
);
insert into @ACCT values
(10000, '2016-02-01', 1110, null),
(10001, '2016-02-01', 1111, null),
(10050, '2017-02-01', 1111, null),
(10008, '2016-02-01', 1120, null),
(10058, '2017-02-03', 1120, 1121),
(10002, '2016-02-01', 1112, null),
(10052, '2017-02-02', 1113, 1112),
(10003, '2016-02-02', 1114, 1115),
(7060, '2017-02-04', 1115, 1114),
(10004, '2016-02-02', 1116, 1117),
(10067, '2017-02-05', 1117, null),
(10005, '2016-02-01', 1118, null),
(10054, '2017-02-03', 1118, 1119),
(10101, '2017-06-02', 1119, null),
(10007, '2016-02-01', 1122, 1123),
(10057, '2017-02-03', 1123, 1124),
(10107, '2017-06-02', 1124, 1125);
Query
询问
WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
,CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN CTE_LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN CTE_LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;
Result
结果
+------------+------------+---------+---------+
| AcctNumber | dt | Cust1ID | Cust2ID |
+------------+------------+---------+---------+
| 10000 | 2016-02-01 | 1110 | NULL |
| 10050 | 2017-02-01 | 1111 | NULL |
| 10058 | 2017-02-03 | 1120 | 1121 |
| 10052 | 2017-02-02 | 1113 | 1112 |
| 7060 | 2017-02-04 | 1115 | 1114 |
| 10067 | 2017-02-05 | 1117 | NULL |
| 10101 | 2017-06-02 | 1119 | NULL |
| 10057 | 2017-02-03 | 1123 | 1124 |
| 10107 | 2017-06-02 | 1124 | 1125 |
+------------+------------+---------+---------+
This result matches your desired result, except the last case 7.
除最后一个案例7外,此结果与您想要的结果相符。
My query doesn't attempt to follow the chain of linked customers of arbitrary length and is limited to processing one pair at a time. That's why the case 7 result is not one row. The query will always pick row/account with the very last date (10107
) and it may also pick account(s) in the middle of the chain. In this case it picked a row 10057
, not 10007
, because this is a later account for customers 1122
and 1123
.
我的查询不会尝试跟随任意长度的链接客户链,并且仅限于一次处理一对。这就是案例7结果不是一行的原因。查询将始终选择具有最后日期(10107)的行/帐户,并且它还可以选择链中间的帐户。在这种情况下,它选择了行10057,而不是10007,因为这是客户1122和1123的后续帐户。
When I looked at the execution plan I saw that the query behind CTE_LatestAccounts
is run essentially four times.
当我查看执行计划时,我发现CTE_LatestAccounts背后的查询基本上运行了四次。
It is likely that if you save result of CTE_LatestAccounts
into a temp table with proper indexes the overall performance would be better.
如果将CTE_LatestAccounts的结果保存到具有适当索引的临时表中,则整体性能可能会更好。
Something like this:
像这样的东西:
DECLARE @LatestAccounts TABLE
(LatestAcctNumber int, LatestDT date, CustID int PRIMARY KEY);
WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
INSERT INTO @LatestAccounts (LatestAcctNumber, LatestDT, CustID)
SELECT LatestAcctNumber, LatestDT, CustID
FROM CTE_LatestAccounts;
WITH
CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN @LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN @LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;
If you really need to merge/group all linked customers into one row when the length of the chain is arbitrary, you can do it with recursive query like shown, for example, here: How to find all connected subgraphs of an undirected graph
如果您确实需要在链的长度是任意的时将所有链接的客户合并/分组到一行,您可以使用递归查询来执行此操作,例如,此处:如何查找无向图的所有连接子图
Once you have tagged each customer with some GroupID, find the latest account for each individual customer as in the beginning of this query. Then find the latest account among the group (rather than for the simple pair as in this query).
使用某个GroupID标记每个客户后,在此查询开头查找每个客户的最新帐户。然后在组中找到最新的帐户(而不是像此查询中的简单对)。
The query that finds all subgraphs of an undirected graph in the linked question may be quite slow for a large dataset and there are efficient non-set based algorithms to do it.
在链接问题中查找无向图的所有子图的查询对于大型数据集来说可能非常慢,并且存在有效的非基于集合的算法。
If you know that the maximum length of the chain can't exceed some number, it is possible to make this recursive query more efficient.
如果您知道链的最大长度不能超过某个数字,则可以使此递归查询更有效。
#2
2
To apply logic to each subset a good operator to use is the CROSS APPLY
operator. This allows us to find the most recent account for each Customer Id.
要将逻辑应用于每个子集,要使用的好运算符是CROSS APPLY运算符。这样我们就可以找到每个客户ID的最新帐户。
Setup
建立
DECLARE @Stage TABLE
(
AcctNumber INT
,[Date] DATETIME
,Cust1Id INT
,Cust2Id INT
)
INSERT INTO @Stage (AcctNumber, [Date] ,Cust1Id ,Cust2Id)
VALUES
(10000,'2.1.16',1110,NULL)
,(10001,'2.1.16',1111,NULL)
,(10050,'2.1.17',1111,NULL)
,(10008,'2.1.16',1120,NULL)
,(10058,'2.3.17',1120,1121)
,(10002,'2.1.16',1112,NULL)
,(10052,'2.2.17',1113,1112)
,(10003,'2.2.16',1114,1115)
,(7060,'2.4.17',1115,1114)
,(10004,'2.2.16',1116,1117)
,(10067,'2.5.17',1117,NULL)
,(10005,'2.1.16',1118,NULL)
,(10054,'2.3.17',1118,1119)
,(10101,'6.2.17',1119,NULL)
,(10007,'2.1.16',1122,1123)
,(10057,'2.3.17',1123,1124)
,(10107,'6.2.17',1124,1125)
--Additional Cases to cover
,(50001, '2016-01-01', 2001, NULL)
,(50002, '2017-02-02', 2001, 2002)
,(50003, '2017-03-03', 2001, NULL)
,(50004, '2017-04-04', 2002, NULL)
,(50005, '2016-01-01', 2003, NULL)
,(50006, '2017-02-02', 2003, 2004)
,(50007, '2017-03-03', 2004, NULL)
,(50008, '2017-04-04', 2003, NULL)
Execution
执行
Cross Apply
交叉申请
;WITH Results AS(
SELECT DISTINCT S2.*
FROM @Stage S1
CROSS APPLY (
SELECT TOP 1 S2.*
FROM @Stage S2
WHERE
(S1.Cust1Id = S2.Cust1Id
OR S1.Cust1Id = S2.Cust2Id
OR S1.Cust2Id = S2.Cust1Id
OR S1.Cust2Id = S2.Cust2Id)
ORDER BY S2.[Date] DESC
) S2
)
SELECT R1.*
FROM Results R1
LEFT JOIN Results R2
ON R1.Cust2Id = R2.Cust1Id
WHERE R1.[Date] > R2.[Date]
OR R2.AcctNumber IS NULL
The CROSS APPLY
operators walk the cases backward to apply the logic to each joint account case while ensuring the most recent account is carried over. This alone covers most of the cases. The only lingering cases are the ones with 3 accounts being shifted between 3 customers. The self join and WHERE
clause in the final select cover these.
CROSS APPLY操作员向后走动案例以将逻辑应用于每个联合帐户案例,同时确保最近的帐户被结转。仅这一点就涵盖了大多数情况。唯一挥之不去的案例是3个客户在3个客户之间转移的案例。最终选择中的self join和WHERE子句涵盖了这些。
Results
结果
+------------+------------+---------+---------+
| AcctNumber | Date | Cust1Id | Cust2Id |
| 7060 | 2017-02-04 | 1115 | 1114 |
| 10000 | 2016-02-01 | 1110 | NULL |
| 10050 | 2017-02-01 | 1111 | NULL |
| 10052 | 2017-02-02 | 1113 | 1112 |
| 10058 | 2017-02-03 | 1120 | 1121 |
| 10067 | 2017-02-05 | 1117 | NULL |
| 10101 | 2017-06-02 | 1119 | NULL |
| 10107 | 2017-06-02 | 1124 | 1125 |
| 50003 | 2017-03-03 | 2001 | NULL |
| 50004 | 2017-04-04 | 2002 | NULL |
| 50007 | 2017-03-03 | 2004 | NULL |
| 50008 | 2017-04-04 | 2003 | NULL |
+------------+------------+---------+---------+
#3
1
I'm sure there is a much easier approach, but this is what I've had in mind :
我确信有一个更简单的方法,但这是我的想法:
SELECT
a.acctnumber,
a.date,
a.Cust1ID,
a.Cust2ID
FROM acct a
OUTER APPLY (
SELECT acctnumber
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY acctnumber ORDER BY [date] DESC) AS ACC_RN,
ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY [date] DESC) AS RN
FROM (
SELECT
a1.acctnumber,
a1.[date],
a1.Cust1ID AS CustomerID
FROM acct a1
UNION
SELECT
a2.acctnumber,
a2.[date],
a2.Cust2ID
FROM acct a2
) D
) C
WHERE
RN = 1
AND CustomerID IS NOT NULL
AND ACC_RN = 2
) acc
WHERE a.acctnumber IN(acc.acctnumber)
#4
1
Could you just use a left-join to join accounts with other "linked" accounts with potentially later dates, and then just filter out records where the "Later Account" table is not null? Something like this:
您是否可以使用左连接来加入具有可能更晚日期的其他“链接”帐户的帐户,然后只过滤掉“后期帐户”表不为空的记录?像这样的东西:
select ThisAccount.*
from Accounts ThisAccount
left join Accounts LaterAccount on
LaterAccount.AcctNumber <> ThisAccount.AcctNumber
and LaterAccount.dt > ThisAccount.dt
and
( LaterAccount.Cust1ID = ThisAccount.Cust1ID
or LaterAccount.Cust2ID = ThisAccount.Cust1ID
or LaterAccount.Cust1ID = ThisAccount.Cust2ID
or LaterAccount.Cust2ID = ThisAccount.Cust2ID
)
where LaterAccount.AcctNumber is null
order by ThisAccount.AcctNumber
This should return the results as expected:
这应该按预期返回结果:
AcctNo Dt Cust1 Cust2
7060 2017-02-04 1115 1114
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10052 2017-02-02 1113 1112
10058 2017-02-03 1120 1121
10067 2017-02-05 1117 NULL
10101 2017-06-02 1119 NULL
10107 2017-06-02 1124 1125
50003 2017-03-03 2001 NULL
50004 2017-04-04 2002 NULL
50007 2017-03-03 2004 NULL
50008 2017-04-04 2003 NULL
#5
0
My answer is wrong, sorry for posting prematurely. I'm working on a different idea, I'll be back shortly.
我的回答是错误的,抱歉过早发布。我正在研究一个不同的想法,我很快就会回来。
Original response:
原始回复:
Assuming your date format is MM.DD.YY, I've got the code as shown below. I don't understand why your desired result set doesn't include rows for CustID 1116 or 1118, but I do see how including them will duplicate 1117 and 1119 respectively, unless the source data are modified to remove these duplicate 1117 and 1119 values from the results. For now, I have this interim solution, pending your response.
假设您的日期格式为MM.DD.YY,我的代码如下所示。我不明白为什么你想要的结果集不包括CustID 1116或1118的行,但我确实看到它们将如何分别复制1117和1119,除非修改源数据以从中删除这些重复的1117和1119值结果。目前,我有这个临时解决方案,等待您的回复。
declare @ACCT table (
acctnumber int,
date date,
Cust1ID int,
Cust2ID int
);
insert into @ACCT values (10000, '2016-02-01', 1110, null);
insert into @ACCT values (10001, '2016-02-01', 1111, null);
insert into @ACCT values (10050, '2017-02-01', 1111, null);
insert into @ACCT values (10008, '2016-02-01', 1120, null);
insert into @ACCT values (10058, '2017-02-03', 1120, 1121);
insert into @ACCT values (10002, '2016-02-01', 1112, null);
insert into @ACCT values (10052, '2017-02-02', 1113, 1112);
insert into @ACCT values (10003, '2016-02-02', 1114, 1115);
insert into @ACCT values (7060, '2017-02-04', 1115, 1114);
insert into @ACCT values (10004, '2016-02-02', 1116, 1117);
insert into @ACCT values (10067, '2017-02-05', 1117, null);
insert into @ACCT values (10005, '2016-02-01', 1118, null);
insert into @ACCT values (10054, '2017-02-03', 1118, 1119);
insert into @ACCT values (10101, '2017-06-02', 1119, null);
insert into @ACCT values (10007, '2016-02-01', 1122, 1123);
insert into @ACCT values (10057, '2017-02-03', 1123, 1124);
insert into @ACCT values (10107, '2017-06-02', 1124, 1125);
with
OneCustId as (
select
acctnumber,[date], Cust1ID as CustID
from
@ACCT
union
select
acctnumber, [date], Cust2ID
from
@ACCT
),
SortedByLastUsage as (
select
acctnumber, [date], CustID, row_number() over (partition by CustID order by [date] desc) as RowID
from
OneCustId
),
LastUsage as (
select
acctnumber, [date], CustID
from
SortedByLastUsage
where
RowID = 1
)
select distinct
ACCT.acctnumber, ACCT.[date], ACCT.Cust1ID, ACCT.Cust2ID
from
@ACCT ACCT
inner join LastUsage on
ACCT.acctnumber = LastUsage.acctnumber and
ACCT.[date] = LastUsage.[date] and
LastUsage.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
order by
Cust1ID, Cust2ID
The result set:
结果集:
acctnumber date Cust1ID Cust2ID
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10052 2017-02-02 1113 1112
7060 2017-02-04 1115 1114
10004 2016-02-02 1116 1117
10067 2017-02-05 1117 NULL
10054 2017-02-03 1118 1119
10101 2017-06-02 1119 NULL
10058 2017-02-03 1120 1121
10007 2016-02-01 1122 1123
10057 2017-02-03 1123 1124
10107 2017-06-02 1124 1125
#6
0
I'm leaving my original answer in place, because the approach might work for someone else searching for this down the line.
我将原来的答案保留下来,因为这种方法可能适用于其他寻找此问题的人。
I can't figure out how to do this without a cursor. As such, any other answer that provides the right answer (that doesn't use a cursor) is going to outperform this one. I'm not smart enough to figure out what that looks like, but it would have to include a nasty recursive CTE.
没有光标,我无法弄清楚如何做到这一点。因此,提供正确答案(不使用游标)的任何其他答案都将胜过这一答案。我不够聪明,不知道它看起来像什么,但它必须包括一个令人讨厌的递归CTE。
The real trick is getting all accounts that were ever related to each other grouped together. That is done in the big cursored if/then/else chain at the top, which could be cleaned up a bit. I've left my debug print
statements in place, they can obviously be removed.
真正的诀窍是将所有彼此相关的帐户组合在一起。这是在顶部的大型cursored if / then / else链中完成的,可以稍微清理一下。我已经将调试打印语句留在原位,显然可以删除它们。
You could also make the Associations table permanent, instead of using a table variable.
您还可以使关联表永久化,而不是使用表变量。
Again, performance-wise, this is going to be really, really bad, but it does work. I'm looking forward to seeing what others come up with. Thanks for the high-quality question, too, that made life a lot easier.
再次,在性能方面,这将是真的,非常糟糕,但它确实有效。我很期待看到别人想出的东西。感谢高质量的问题,这让生活变得更加轻松。
The code:
代码:
declare @Associations table (
GroupID int,
CustID int
);
declare @NextGroupID int = 0;
declare @FoundGroup1ID int;
declare @FoundGroup2ID int;
declare @Cust1 int;
declare @Cust2 int;
declare db_cursor cursor for
select Cust1ID, Cust2ID from @ACCT;
open db_cursor;
fetch next from db_cursor into @Cust1, @Cust2;
while @@fetch_status = 0
begin
set @FoundGroup1ID = null;
set @FoundGroup2ID = null;
print '----------------------------'
print 'Cust1 = ' + isnull(cast(@Cust1 as varchar(max)), 'NULL')
print 'Cust2 = ' + isnull(cast(@Cust2 as varchar(max)), 'NULL')
select @FoundGroup1ID = GroupID from @Associations where CustID = @Cust1
print 'FoundGroup1ID = ' + isnull(cast(@FoundGroup1ID as varchar(max)), 'NULL')
if @Cust2 is null
begin
if @FoundGroup1ID is null
begin
set @NextGroupID = @NextGroupID +1
print 'Adding Cust1 to new group ' + cast(@NextGroupID as varchar(max))
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
end
end
else -- @Cust2 is not null
begin
print 'FoundGroup2ID = ' + isnull(cast(@FoundGroup2ID as varchar(max)), 'NULL')
select @FoundGroup2ID = GroupID from @Associations where CustID = @Cust2
if @FoundGroup1ID is null and @FoundGroup2ID is null
begin
set @NextGroupID = @NextGroupID +1
print 'Adding both to new group ' + cast(@NextGroupID as varchar(max))
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust2)
end
else if @FoundGroup1ID is not null and @FoundGroup2ID is null
begin
print 'Adding Cust2 to existing group ' + cast(@FoundGroup1ID as varchar(max))
insert into @Associations (GroupID, CustID) values (@FoundGroup1ID, @Cust2)
end
else if @FoundGroup1ID is null and @FoundGroup2ID is not null
begin
print 'Adding Cust1 to existing group ' + cast(@FoundGroup2ID as varchar(max))
insert into @Associations (GroupID, CustID) values (@FoundGroup2ID, @Cust1)
end
else -- Neither is null
begin
print 'Switching all of GroupID ' + cast(@FoundGroup2ID as varchar(max)) + ' to GroupID ' + cast(@FoundGroup1ID as varchar(max))
update @Associations set GroupID = @FoundGroup1ID where GroupID = @FoundGroup2ID
end
end
fetch next from db_cursor into @Cust1, @Cust2;
end
close db_cursor;
deallocate db_cursor;
;with
AddedGroupID as (
select
ACCT.acctnumber,
ACCT.[date],
ACCT.Cust1ID,
ACCT.Cust2ID,
Associations.GroupID,
row_number() over (partition by Associations.GroupID order by ACCT.[date] desc) as RowID
from
@ACCT ACCT
inner join @Associations Associations on
Associations.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
)
select
acctnumber, [date], Cust1ID, Cust2ID
from
AddedGroupID
where
RowID = 1
The results:
结果:
acctnumber date Cust1ID Cust2ID
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10058 2017-02-03 1120 1121
10052 2017-02-02 1113 1112
7060 2017-02-04 1115 1114
10067 2017-02-05 1117 NULL
10101 2017-06-02 1119 NULL
10107 2017-06-02 1124 1125
#7
0
we should not worry about using EXISTS as it operate fast in such case and i suppose is simplest possible solution:
我们不应该担心使用EXISTS,因为它在这种情况下运行速度很快,我认为这是最简单的解决方案:
SELECT
A.ACCTNUMBER, A.DT as "date", A.CUST1ID, A.CUST2ID
FROM
ACCT A
WHERE
NOT EXISTS
(SELECT
*
FROM
ACCT A2
WHERE
(A2.CUST1ID = A.CUST1ID
OR A2.CUST2ID = A.CUST1ID
OR (A.CUST2ID IS NOT NULL AND A2.CUST1ID = A.CUST2ID)
OR (A.CUST2ID IS NOT NULL AND A2.CUST2ID = A.CUST2ID)
)
AND A2.DT>A.DT
)
i have assumed that you have separate indexes on CUST1ID and another on CUST2ID. You can compare result without ascending index on DT ("date") field and with it. It can speed up your query or slow down - i do not know how your real data looks like
我假设你在CUST1ID上有另外的索引而在CUST2ID上有另一个索引。您可以在DT(“日期”)字段中比较结果而不使用升序索引。它可以加快您的查询速度或减慢速度 - 我不知道您的真实数据是什么样的
#8
0
Try below query. It lengthy, because there's need to apply repeatedly windowed functions (you can't nest them in a single query), but the query itself is pretty simple. The core idea is to split customers that never share account from customers that do. After that, for the single-account customer, the grouping column is easy, it's Cust1ID
, but for other, you have to do some operations described below, to get grouping column:
请尝试以下查询。它很冗长,因为需要重复应用窗口函数(不能将它们嵌套在单个查询中),但查询本身非常简单。核心思想是将从不与客户共享帐户的客户分开。之后,对于单帐户客户,分组列很容易,它是Cust1ID,但对于其他人,您必须执行下面描述的操作,以获取分组列:
To obtain grouping column (for multiple-account cutomers), you have to apply following logic:
要获取分组列(对于多帐户cutomers),您必须应用以下逻辑:
Put all 1st customers together with 2nd customers in same column using UNION ALL
(CTE called cte
in query). Then, when you sort by that column, and checking both IDs with IDs of the following row, you can check if they are "connected", i.e. they have at least one ID the same:
使用UNION ALL(CTE在查询中称为cte)将所有第一客户与第二客户放在同一列中。然后,当您按该列排序,并使用下一行的ID检查这两个ID时,您可以检查它们是否“已连接”,即它们至少有一个ID相同:
case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup
This way you can distinct groups and within that groups take maximum respectively to the date (dt
column).
通过这种方式,您可以将不同的组和该组中的组分别最大化到日期(dt列)。
Sample data:
样本数据:
declare @tbl table (acctnumber int, dt date , Cust1ID int, Cust2ID int);
insert into @tbl values
(10000, '2.1.16', 1110, null),
(10001, '2.1.16', 1111, null),
(10050, '2.1.17', 1111, null),
(10008, '2.1.16', 1120, null),
(10058, '2.3.17', 1120, 1121),
(10002, '2.1.16', 1112, null),
(10052, '2.2.17', 1113, 1112),
(10003, '2.2.16', 1114, 1115),
(7060, '2.4.17', 1115, 1114),
(10004, '2.2.16', 1116, 1117),
(10067, '2.5.17', 1117, null),
(10005, '2.1.16', 1118, null),
(10054, '2.3.17', 1118, 1119),
(10101, '6.2.17', 1119, null),
(10007, '2.1.16', 1122, 1123),
(10057, '2.3.17', 1123, 1124),
(10107, '6.2.17', 1124, 1125)
T-SQL:
T-SQL:
;with SingleAccounts as (
select cust1id from @tbl
where Cust2ID is null
except
select cust1id from @tbl
where Cust2ID is not null
except
select cust2id from @tbl
), cte as (
select acctnumber, dt, Cust1ID, Cust2ID from @tbl
where Cust1ID not in (select Cust1ID from SingleAccounts)
union all
select acctnumber, dt, Cust2ID, Cust1ID from @tbl
where Cust1ID not in (select Cust1ID from SingleAccounts) and Cust2ID is not null
), SingleAmountsResult as (
select acctnumber, dt, cust1id, cust2id,
ROW_NUMBER() over (partition by cust1id order by dt desc) rn
from @tbl
where cust1id in (select Cust1ID from SingleAccounts)
), FinalResult as (
select acctnumber, dt, cust1id, cust2id from SingleAmountsResult
where rn = 1
union all
select acctnumber, dt, cust1id, cust2id
from (
select acctnumber, dt, cust1id, cust2id,
ROW_NUMBER() over (partition by GroupingColumn order by dt desc) rn
from (
select acctnumber, dt, cust1id, cust2id,
SUM(NewGroup) over (order by cust1id, cust2id) GroupingColumn
from (
select acctnumber, dt, cust1id, cust2id,
case when LAG(SameGroup) over (order by cust1id, cust2id) = 0 then 1 else 0 end NewGroup
from (
select acctnumber, dt, cust1id, cust2id,
case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup
from (
select acctnumber, dt, cust1id, cust2id,
LEAD(cust1id) over (order by cust1id, cust2id) cust1idLead,
LEAD(cust2id) over (order by cust1id, cust2id) cust2idLead
from cte
) a
) a
) a
) a
) a where rn = 1
)
--this final query gets you correct Cust1ID and Cust2ID, as FinalResult might have them switched
select * from @tbl
intersect
select * from (
select acctnumber, dt, cust1id, cust2id from FinalResult
union all
select acctnumber, dt, cust2id, cust1id from FinalResult
) fr
UPDATE
UPDATE
This code, accordingly to OP explanation, treat all customers ID that have ever been on one account together as same group (and this is transitive1)), thus, for additional cases, 8a and 8b the results are:
相应于OP解释,此代码将所有曾在一个帐户*享的客户ID视为同一组(并且这是传递1)),因此,对于其他情况,8a和8b的结果为:
acctnumber | dt | Cust1ID | Cust2ID
50004 | 2017-04-04 | 2002 | NULL
50008 | 2017-04-04 | 2003 | NULL
as there will be only 2 groups!
因为只有2组!
1) it means, if element
a
is in group with elementb
andb
is in the same group as elementc
then it implies thata
andc
are also in the same group.1)这意味着,如果元素a与元素b在组中,并且b与元素c在同一组中,则意味着a和c也在同一组中。
#9
0
This is quite complex...
这很复杂......
First you want to identify groups of customers. That is all customers who were directly or indirectly related. With customer pairs A/B, B/C, D/E, D/F, G/A, H/A, H/F you'd have just one single group for instance. In SQL this requires a recursive query.
首先,您要识别客户群。这是所有直接或间接相关的客户。对于客户对A / B,B / C,D / E,D / F,G / A,H / A,H / F,您只有一个组。在SQL中,这需要递归查询。
SQL Server lacks a cycle detection in recursive queries. So from customers A/B you'd get to all pairs containing A or B, which is B/C, A/B G/A, H/A, and A/B itself for that matter. Even, if we detect this direct circle (same pair), we'd go on with B/C looking for all records that contain B or C. And one of these is A/B again and once more we are in a cycle. One way to deal with this is to build a string of yet visited customers and not visit them again.
SQL Server在递归查询中缺少循环检测。因此,从客户A / B,您将获得包含A或B的所有对,即B / C,A / B G / A,H / A和A / B本身。甚至,如果我们检测到这个直接的圆圈(同一对),我们继续用B / C寻找包含B或C的所有记录。其中一个是A / B再次,我们再次进入一个循环。解决这个问题的一种方法是建立一系列尚未访问的客户,而不是再次访问它们。
Our result is all cutomers with all directly or indirectly connected other customers. Using aggregation, we can take the minimum partner per customer and use this as a group key. In above example all customers are related to A, so A is all their minimum partner, showing that all belong to the same group. If we add two records X/Y and Z/-, then we have two more groups: X and Y belonging to the X group, and Z belonging to the Z group.
我们的结果是所有直接或间接连接其他客户的客户。使用聚合,我们可以为每个客户提供最小的合作伙伴,并将其用作组密钥。在上面的示例中,所有客户都与A相关,因此A是他们的最小合作伙伴,表明所有客户都属于同一个组。如果我们添加两个记录X / Y和Z / - ,那么我们还有两个组:X和Y属于X组,Z属于Z组。
These groups we use to look up our original records again. With ROW_NUMBER
we number each group's last record with #1. Then we keep only those and we are done.
我们用这些组再次查找原始记录。使用ROW_NUMBER,我们用#1为每个组的最后一个记录编号。然后我们只保留那些,我们就完成了。
with all_cust(custid) as
(
select cust1id from mytable
union
select cust2id from mytable where cust2id is not null
)
, cte(c1, c2, sofar) as
(
select custid, custid, '<' + cast(custid as varchar(max)) + '>' from all_cust
union all
select cte.c1, case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end,
cte.sofar + '<' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '>'
from mytable m
join cte on cte.c2 in (m.cust1id, m.cust2id)
and cte.sofar not like '%' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '%'
)
, groups(custid, grp) as
(
select c1, min(c2) from cte group by c1
)
, ranked as
(
select *, row_number() over (partition by g.grp order by date desc) as rn
from groups g
join mytable m on g.custid in (m.cust1id, m.cust2id)
)
select acctnumber, date, cust1id, cust2id
from ranked
where rn = 1
order by cust1id;
Rextester demo: http://rextester.com/RWCQ83881
Rextester演示:http://rextester.com/RWCQ83881
#1
2
I'd like to thank Jeff Breadner for the DDL with sample data.
我要感谢Jeff Breadner为DDL提供样本数据。
You'll have to run the below query step-by-step, CTE-by-CTE and examine intermediate results to understand what it does. It assumes that AcctNumber
is unique in the given table.
您必须逐步运行以下查询,CTE-by-CTE并检查中间结果以了解它的作用。它假定AcctNumber在给定表中是唯一的。
At first I want to find the latest account for each individual customer. It is a simple top-n-per-group
query and I'm using a ROW_NUMBER
approach here.
首先,我想找到每个客户的最新帐户。这是一个简单的top-n-group-group查询,我在这里使用ROW_NUMBER方法。
CTE_Customers
makes a plain list of all individual customers by putting together Cust1ID
and Cust2ID
. CTE_RN
assigns them row numbers. CTE_LatestAccounts
gives latest account for each individual customer:
CTE_Customers通过将Cust1ID和Cust2ID放在一起,生成所有个人客户的简单列表。 CTE_RN为它们分配行号。 CTE_LatestAccounts为每个客户提供最新帐户:
+------------------+------------+--------+
| LatestAcctNumber | LatestDT | CustID |
+------------------+------------+--------+
| 10000 | 2016-02-01 | 1110 |
| 10050 | 2017-02-01 | 1111 |
| 10052 | 2017-02-02 | 1112 |
| 10052 | 2017-02-02 | 1113 |
| 7060 | 2017-02-04 | 1114 |
| 7060 | 2017-02-04 | 1115 |
| 10004 | 2016-02-02 | 1116 |
| 10067 | 2017-02-05 | 1117 |
| 10054 | 2017-02-03 | 1118 |
| 10101 | 2017-06-02 | 1119 |
| 10058 | 2017-02-03 | 1120 |
| 10058 | 2017-02-03 | 1121 |
| 10007 | 2016-02-01 | 1122 |
| 10057 | 2017-02-03 | 1123 |
| 10107 | 2017-06-02 | 1124 |
| 10107 | 2017-06-02 | 1125 |
+------------------+------------+--------+
The task is complicated by having customer pairs which "propagates" the latest account to another customer.
通过将最新帐户“传播”给另一个客户的客户对,任务变得复杂。
Customer pairs are defined in the original table, so CTE_MaxLatestAccounts
takes each row from the original table and joins latest accounts to it twice - for Cust1D
and Cust2ID
. For each pair I'm picking one of the two latest accounts - the most recent one. Thus a customer that belongs to a pair may get an account from its partner.
客户对在原始表中定义,因此CTE_MaxLatestAccounts从原始表中获取每一行,并将最新帐户连接到它两次 - 对于Cust1D和Cust2ID。对于每一对,我选择了最近的两个账户中的一个 - 最近的账户。因此,属于一对的客户可以从其合作伙伴处获得帐户。
+---------+---------+-------------+---------------------+
| Cust1ID | Cust2ID | MaxLatestDT | MaxLatestAcctNumber |
+---------+---------+-------------+---------------------+
| 1110 | NULL | 2016-02-01 | 10000 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1120 | NULL | 2017-02-03 | 10058 |
| 1120 | 1121 | 2017-02-03 | 10058 |
| 1112 | NULL | 2017-02-02 | 10052 |
| 1113 | 1112 | 2017-02-02 | 10052 |
| 1114 | 1115 | 2017-02-04 | 7060 |
| 1115 | 1114 | 2017-02-04 | 7060 |
| 1116 | 1117 | 2017-02-05 | 10067 |
| 1117 | NULL | 2017-02-05 | 10067 |
| 1118 | NULL | 2017-02-03 | 10054 |
| 1118 | 1119 | 2017-06-02 | 10101 |
| 1119 | NULL | 2017-06-02 | 10101 |
| 1122 | 1123 | 2017-02-03 | 10057 |
| 1123 | 1124 | 2017-06-02 | 10107 |
| 1124 | 1125 | 2017-06-02 | 10107 |
+---------+---------+-------------+---------------------+
The MaxLatestAcctNumber
here is applicable to both Cust1ID
and Cust2ID
. The same customer may be listed several times here and we need to choose an entry again with the latest account. Here this is the latest account of a pair, not for individual customer.
此处的MaxLatestAcctNumber适用于Cust1ID和Cust2ID。同一客户可能会在此处多次列出,我们需要使用最新帐户再次选择一个条目。这是一对货币的最新账户,不适用于个人客户。
The approach is the same as in the beginning. Put both Cust1ID
and Cust2ID
customers in a list: CTE_CustomersWithLatestAccountFromPair
. Assign row numbers in CTE_CustomersWithLatestAccountFromPairRN
and pick final account in CTE_FinalAccounts
.
方法与开始时的方法相同。将Cust1ID和Cust2ID客户放在一个列表中:CTE_CustomersWithLatestAccountFromPair。在CTE_CustomersWithLatestAccountFromPairRN中分配行号,并在CTE_FinalAccounts中选择最终帐户。
+---------------------+
| MaxLatestAcctNumber |
+---------------------+
| 10000 |
| 10050 |
| 10052 |
| 10052 |
| 7060 |
| 7060 |
| 10067 |
| 10067 |
| 10101 |
| 10101 |
| 10058 |
| 10058 |
| 10057 |
| 10107 |
| 10107 |
| 10107 |
+---------------------+
Now we just need to filter the original table and leave only those rows (accounts) that appear in this list. See the final result below.
现在我们只需要过滤原始表并仅保留此列表中显示的那些行(帐户)。请参阅下面的最终结果。
Sample data
样本数据
declare @ACCT table (
AcctNumber int,
dt date,
Cust1ID int,
Cust2ID int
);
insert into @ACCT values
(10000, '2016-02-01', 1110, null),
(10001, '2016-02-01', 1111, null),
(10050, '2017-02-01', 1111, null),
(10008, '2016-02-01', 1120, null),
(10058, '2017-02-03', 1120, 1121),
(10002, '2016-02-01', 1112, null),
(10052, '2017-02-02', 1113, 1112),
(10003, '2016-02-02', 1114, 1115),
(7060, '2017-02-04', 1115, 1114),
(10004, '2016-02-02', 1116, 1117),
(10067, '2017-02-05', 1117, null),
(10005, '2016-02-01', 1118, null),
(10054, '2017-02-03', 1118, 1119),
(10101, '2017-06-02', 1119, null),
(10007, '2016-02-01', 1122, 1123),
(10057, '2017-02-03', 1123, 1124),
(10107, '2017-06-02', 1124, 1125);
Query
询问
WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
,CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN CTE_LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN CTE_LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;
Result
结果
+------------+------------+---------+---------+
| AcctNumber | dt | Cust1ID | Cust2ID |
+------------+------------+---------+---------+
| 10000 | 2016-02-01 | 1110 | NULL |
| 10050 | 2017-02-01 | 1111 | NULL |
| 10058 | 2017-02-03 | 1120 | 1121 |
| 10052 | 2017-02-02 | 1113 | 1112 |
| 7060 | 2017-02-04 | 1115 | 1114 |
| 10067 | 2017-02-05 | 1117 | NULL |
| 10101 | 2017-06-02 | 1119 | NULL |
| 10057 | 2017-02-03 | 1123 | 1124 |
| 10107 | 2017-06-02 | 1124 | 1125 |
+------------+------------+---------+---------+
This result matches your desired result, except the last case 7.
除最后一个案例7外,此结果与您想要的结果相符。
My query doesn't attempt to follow the chain of linked customers of arbitrary length and is limited to processing one pair at a time. That's why the case 7 result is not one row. The query will always pick row/account with the very last date (10107
) and it may also pick account(s) in the middle of the chain. In this case it picked a row 10057
, not 10007
, because this is a later account for customers 1122
and 1123
.
我的查询不会尝试跟随任意长度的链接客户链,并且仅限于一次处理一对。这就是案例7结果不是一行的原因。查询将始终选择具有最后日期(10107)的行/帐户,并且它还可以选择链中间的帐户。在这种情况下,它选择了行10057,而不是10007,因为这是客户1122和1123的后续帐户。
When I looked at the execution plan I saw that the query behind CTE_LatestAccounts
is run essentially four times.
当我查看执行计划时,我发现CTE_LatestAccounts背后的查询基本上运行了四次。
It is likely that if you save result of CTE_LatestAccounts
into a temp table with proper indexes the overall performance would be better.
如果将CTE_LatestAccounts的结果保存到具有适当索引的临时表中,则整体性能可能会更好。
Something like this:
像这样的东西:
DECLARE @LatestAccounts TABLE
(LatestAcctNumber int, LatestDT date, CustID int PRIMARY KEY);
WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
INSERT INTO @LatestAccounts (LatestAcctNumber, LatestDT, CustID)
SELECT LatestAcctNumber, LatestDT, CustID
FROM CTE_LatestAccounts;
WITH
CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN @LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN @LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL
UNION ALL
SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;
If you really need to merge/group all linked customers into one row when the length of the chain is arbitrary, you can do it with recursive query like shown, for example, here: How to find all connected subgraphs of an undirected graph
如果您确实需要在链的长度是任意的时将所有链接的客户合并/分组到一行,您可以使用递归查询来执行此操作,例如,此处:如何查找无向图的所有连接子图
Once you have tagged each customer with some GroupID, find the latest account for each individual customer as in the beginning of this query. Then find the latest account among the group (rather than for the simple pair as in this query).
使用某个GroupID标记每个客户后,在此查询开头查找每个客户的最新帐户。然后在组中找到最新的帐户(而不是像此查询中的简单对)。
The query that finds all subgraphs of an undirected graph in the linked question may be quite slow for a large dataset and there are efficient non-set based algorithms to do it.
在链接问题中查找无向图的所有子图的查询对于大型数据集来说可能非常慢,并且存在有效的非基于集合的算法。
If you know that the maximum length of the chain can't exceed some number, it is possible to make this recursive query more efficient.
如果您知道链的最大长度不能超过某个数字,则可以使此递归查询更有效。
#2
2
To apply logic to each subset a good operator to use is the CROSS APPLY
operator. This allows us to find the most recent account for each Customer Id.
要将逻辑应用于每个子集,要使用的好运算符是CROSS APPLY运算符。这样我们就可以找到每个客户ID的最新帐户。
Setup
建立
DECLARE @Stage TABLE
(
AcctNumber INT
,[Date] DATETIME
,Cust1Id INT
,Cust2Id INT
)
INSERT INTO @Stage (AcctNumber, [Date] ,Cust1Id ,Cust2Id)
VALUES
(10000,'2.1.16',1110,NULL)
,(10001,'2.1.16',1111,NULL)
,(10050,'2.1.17',1111,NULL)
,(10008,'2.1.16',1120,NULL)
,(10058,'2.3.17',1120,1121)
,(10002,'2.1.16',1112,NULL)
,(10052,'2.2.17',1113,1112)
,(10003,'2.2.16',1114,1115)
,(7060,'2.4.17',1115,1114)
,(10004,'2.2.16',1116,1117)
,(10067,'2.5.17',1117,NULL)
,(10005,'2.1.16',1118,NULL)
,(10054,'2.3.17',1118,1119)
,(10101,'6.2.17',1119,NULL)
,(10007,'2.1.16',1122,1123)
,(10057,'2.3.17',1123,1124)
,(10107,'6.2.17',1124,1125)
--Additional Cases to cover
,(50001, '2016-01-01', 2001, NULL)
,(50002, '2017-02-02', 2001, 2002)
,(50003, '2017-03-03', 2001, NULL)
,(50004, '2017-04-04', 2002, NULL)
,(50005, '2016-01-01', 2003, NULL)
,(50006, '2017-02-02', 2003, 2004)
,(50007, '2017-03-03', 2004, NULL)
,(50008, '2017-04-04', 2003, NULL)
Execution
执行
Cross Apply
交叉申请
;WITH Results AS(
SELECT DISTINCT S2.*
FROM @Stage S1
CROSS APPLY (
SELECT TOP 1 S2.*
FROM @Stage S2
WHERE
(S1.Cust1Id = S2.Cust1Id
OR S1.Cust1Id = S2.Cust2Id
OR S1.Cust2Id = S2.Cust1Id
OR S1.Cust2Id = S2.Cust2Id)
ORDER BY S2.[Date] DESC
) S2
)
SELECT R1.*
FROM Results R1
LEFT JOIN Results R2
ON R1.Cust2Id = R2.Cust1Id
WHERE R1.[Date] > R2.[Date]
OR R2.AcctNumber IS NULL
The CROSS APPLY
operators walk the cases backward to apply the logic to each joint account case while ensuring the most recent account is carried over. This alone covers most of the cases. The only lingering cases are the ones with 3 accounts being shifted between 3 customers. The self join and WHERE
clause in the final select cover these.
CROSS APPLY操作员向后走动案例以将逻辑应用于每个联合帐户案例,同时确保最近的帐户被结转。仅这一点就涵盖了大多数情况。唯一挥之不去的案例是3个客户在3个客户之间转移的案例。最终选择中的self join和WHERE子句涵盖了这些。
Results
结果
+------------+------------+---------+---------+
| AcctNumber | Date | Cust1Id | Cust2Id |
| 7060 | 2017-02-04 | 1115 | 1114 |
| 10000 | 2016-02-01 | 1110 | NULL |
| 10050 | 2017-02-01 | 1111 | NULL |
| 10052 | 2017-02-02 | 1113 | 1112 |
| 10058 | 2017-02-03 | 1120 | 1121 |
| 10067 | 2017-02-05 | 1117 | NULL |
| 10101 | 2017-06-02 | 1119 | NULL |
| 10107 | 2017-06-02 | 1124 | 1125 |
| 50003 | 2017-03-03 | 2001 | NULL |
| 50004 | 2017-04-04 | 2002 | NULL |
| 50007 | 2017-03-03 | 2004 | NULL |
| 50008 | 2017-04-04 | 2003 | NULL |
+------------+------------+---------+---------+
#3
1
I'm sure there is a much easier approach, but this is what I've had in mind :
我确信有一个更简单的方法,但这是我的想法:
SELECT
a.acctnumber,
a.date,
a.Cust1ID,
a.Cust2ID
FROM acct a
OUTER APPLY (
SELECT acctnumber
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY acctnumber ORDER BY [date] DESC) AS ACC_RN,
ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY [date] DESC) AS RN
FROM (
SELECT
a1.acctnumber,
a1.[date],
a1.Cust1ID AS CustomerID
FROM acct a1
UNION
SELECT
a2.acctnumber,
a2.[date],
a2.Cust2ID
FROM acct a2
) D
) C
WHERE
RN = 1
AND CustomerID IS NOT NULL
AND ACC_RN = 2
) acc
WHERE a.acctnumber IN(acc.acctnumber)
#4
1
Could you just use a left-join to join accounts with other "linked" accounts with potentially later dates, and then just filter out records where the "Later Account" table is not null? Something like this:
您是否可以使用左连接来加入具有可能更晚日期的其他“链接”帐户的帐户,然后只过滤掉“后期帐户”表不为空的记录?像这样的东西:
select ThisAccount.*
from Accounts ThisAccount
left join Accounts LaterAccount on
LaterAccount.AcctNumber <> ThisAccount.AcctNumber
and LaterAccount.dt > ThisAccount.dt
and
( LaterAccount.Cust1ID = ThisAccount.Cust1ID
or LaterAccount.Cust2ID = ThisAccount.Cust1ID
or LaterAccount.Cust1ID = ThisAccount.Cust2ID
or LaterAccount.Cust2ID = ThisAccount.Cust2ID
)
where LaterAccount.AcctNumber is null
order by ThisAccount.AcctNumber
This should return the results as expected:
这应该按预期返回结果:
AcctNo Dt Cust1 Cust2
7060 2017-02-04 1115 1114
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10052 2017-02-02 1113 1112
10058 2017-02-03 1120 1121
10067 2017-02-05 1117 NULL
10101 2017-06-02 1119 NULL
10107 2017-06-02 1124 1125
50003 2017-03-03 2001 NULL
50004 2017-04-04 2002 NULL
50007 2017-03-03 2004 NULL
50008 2017-04-04 2003 NULL
#5
0
My answer is wrong, sorry for posting prematurely. I'm working on a different idea, I'll be back shortly.
我的回答是错误的,抱歉过早发布。我正在研究一个不同的想法,我很快就会回来。
Original response:
原始回复:
Assuming your date format is MM.DD.YY, I've got the code as shown below. I don't understand why your desired result set doesn't include rows for CustID 1116 or 1118, but I do see how including them will duplicate 1117 and 1119 respectively, unless the source data are modified to remove these duplicate 1117 and 1119 values from the results. For now, I have this interim solution, pending your response.
假设您的日期格式为MM.DD.YY,我的代码如下所示。我不明白为什么你想要的结果集不包括CustID 1116或1118的行,但我确实看到它们将如何分别复制1117和1119,除非修改源数据以从中删除这些重复的1117和1119值结果。目前,我有这个临时解决方案,等待您的回复。
declare @ACCT table (
acctnumber int,
date date,
Cust1ID int,
Cust2ID int
);
insert into @ACCT values (10000, '2016-02-01', 1110, null);
insert into @ACCT values (10001, '2016-02-01', 1111, null);
insert into @ACCT values (10050, '2017-02-01', 1111, null);
insert into @ACCT values (10008, '2016-02-01', 1120, null);
insert into @ACCT values (10058, '2017-02-03', 1120, 1121);
insert into @ACCT values (10002, '2016-02-01', 1112, null);
insert into @ACCT values (10052, '2017-02-02', 1113, 1112);
insert into @ACCT values (10003, '2016-02-02', 1114, 1115);
insert into @ACCT values (7060, '2017-02-04', 1115, 1114);
insert into @ACCT values (10004, '2016-02-02', 1116, 1117);
insert into @ACCT values (10067, '2017-02-05', 1117, null);
insert into @ACCT values (10005, '2016-02-01', 1118, null);
insert into @ACCT values (10054, '2017-02-03', 1118, 1119);
insert into @ACCT values (10101, '2017-06-02', 1119, null);
insert into @ACCT values (10007, '2016-02-01', 1122, 1123);
insert into @ACCT values (10057, '2017-02-03', 1123, 1124);
insert into @ACCT values (10107, '2017-06-02', 1124, 1125);
with
OneCustId as (
select
acctnumber,[date], Cust1ID as CustID
from
@ACCT
union
select
acctnumber, [date], Cust2ID
from
@ACCT
),
SortedByLastUsage as (
select
acctnumber, [date], CustID, row_number() over (partition by CustID order by [date] desc) as RowID
from
OneCustId
),
LastUsage as (
select
acctnumber, [date], CustID
from
SortedByLastUsage
where
RowID = 1
)
select distinct
ACCT.acctnumber, ACCT.[date], ACCT.Cust1ID, ACCT.Cust2ID
from
@ACCT ACCT
inner join LastUsage on
ACCT.acctnumber = LastUsage.acctnumber and
ACCT.[date] = LastUsage.[date] and
LastUsage.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
order by
Cust1ID, Cust2ID
The result set:
结果集:
acctnumber date Cust1ID Cust2ID
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10052 2017-02-02 1113 1112
7060 2017-02-04 1115 1114
10004 2016-02-02 1116 1117
10067 2017-02-05 1117 NULL
10054 2017-02-03 1118 1119
10101 2017-06-02 1119 NULL
10058 2017-02-03 1120 1121
10007 2016-02-01 1122 1123
10057 2017-02-03 1123 1124
10107 2017-06-02 1124 1125
#6
0
I'm leaving my original answer in place, because the approach might work for someone else searching for this down the line.
我将原来的答案保留下来,因为这种方法可能适用于其他寻找此问题的人。
I can't figure out how to do this without a cursor. As such, any other answer that provides the right answer (that doesn't use a cursor) is going to outperform this one. I'm not smart enough to figure out what that looks like, but it would have to include a nasty recursive CTE.
没有光标,我无法弄清楚如何做到这一点。因此,提供正确答案(不使用游标)的任何其他答案都将胜过这一答案。我不够聪明,不知道它看起来像什么,但它必须包括一个令人讨厌的递归CTE。
The real trick is getting all accounts that were ever related to each other grouped together. That is done in the big cursored if/then/else chain at the top, which could be cleaned up a bit. I've left my debug print
statements in place, they can obviously be removed.
真正的诀窍是将所有彼此相关的帐户组合在一起。这是在顶部的大型cursored if / then / else链中完成的,可以稍微清理一下。我已经将调试打印语句留在原位,显然可以删除它们。
You could also make the Associations table permanent, instead of using a table variable.
您还可以使关联表永久化,而不是使用表变量。
Again, performance-wise, this is going to be really, really bad, but it does work. I'm looking forward to seeing what others come up with. Thanks for the high-quality question, too, that made life a lot easier.
再次,在性能方面,这将是真的,非常糟糕,但它确实有效。我很期待看到别人想出的东西。感谢高质量的问题,这让生活变得更加轻松。
The code:
代码:
declare @Associations table (
GroupID int,
CustID int
);
declare @NextGroupID int = 0;
declare @FoundGroup1ID int;
declare @FoundGroup2ID int;
declare @Cust1 int;
declare @Cust2 int;
declare db_cursor cursor for
select Cust1ID, Cust2ID from @ACCT;
open db_cursor;
fetch next from db_cursor into @Cust1, @Cust2;
while @@fetch_status = 0
begin
set @FoundGroup1ID = null;
set @FoundGroup2ID = null;
print '----------------------------'
print 'Cust1 = ' + isnull(cast(@Cust1 as varchar(max)), 'NULL')
print 'Cust2 = ' + isnull(cast(@Cust2 as varchar(max)), 'NULL')
select @FoundGroup1ID = GroupID from @Associations where CustID = @Cust1
print 'FoundGroup1ID = ' + isnull(cast(@FoundGroup1ID as varchar(max)), 'NULL')
if @Cust2 is null
begin
if @FoundGroup1ID is null
begin
set @NextGroupID = @NextGroupID +1
print 'Adding Cust1 to new group ' + cast(@NextGroupID as varchar(max))
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
end
end
else -- @Cust2 is not null
begin
print 'FoundGroup2ID = ' + isnull(cast(@FoundGroup2ID as varchar(max)), 'NULL')
select @FoundGroup2ID = GroupID from @Associations where CustID = @Cust2
if @FoundGroup1ID is null and @FoundGroup2ID is null
begin
set @NextGroupID = @NextGroupID +1
print 'Adding both to new group ' + cast(@NextGroupID as varchar(max))
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust2)
end
else if @FoundGroup1ID is not null and @FoundGroup2ID is null
begin
print 'Adding Cust2 to existing group ' + cast(@FoundGroup1ID as varchar(max))
insert into @Associations (GroupID, CustID) values (@FoundGroup1ID, @Cust2)
end
else if @FoundGroup1ID is null and @FoundGroup2ID is not null
begin
print 'Adding Cust1 to existing group ' + cast(@FoundGroup2ID as varchar(max))
insert into @Associations (GroupID, CustID) values (@FoundGroup2ID, @Cust1)
end
else -- Neither is null
begin
print 'Switching all of GroupID ' + cast(@FoundGroup2ID as varchar(max)) + ' to GroupID ' + cast(@FoundGroup1ID as varchar(max))
update @Associations set GroupID = @FoundGroup1ID where GroupID = @FoundGroup2ID
end
end
fetch next from db_cursor into @Cust1, @Cust2;
end
close db_cursor;
deallocate db_cursor;
;with
AddedGroupID as (
select
ACCT.acctnumber,
ACCT.[date],
ACCT.Cust1ID,
ACCT.Cust2ID,
Associations.GroupID,
row_number() over (partition by Associations.GroupID order by ACCT.[date] desc) as RowID
from
@ACCT ACCT
inner join @Associations Associations on
Associations.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
)
select
acctnumber, [date], Cust1ID, Cust2ID
from
AddedGroupID
where
RowID = 1
The results:
结果:
acctnumber date Cust1ID Cust2ID
10000 2016-02-01 1110 NULL
10050 2017-02-01 1111 NULL
10058 2017-02-03 1120 1121
10052 2017-02-02 1113 1112
7060 2017-02-04 1115 1114
10067 2017-02-05 1117 NULL
10101 2017-06-02 1119 NULL
10107 2017-06-02 1124 1125
#7
0
we should not worry about using EXISTS as it operate fast in such case and i suppose is simplest possible solution:
我们不应该担心使用EXISTS,因为它在这种情况下运行速度很快,我认为这是最简单的解决方案:
SELECT
A.ACCTNUMBER, A.DT as "date", A.CUST1ID, A.CUST2ID
FROM
ACCT A
WHERE
NOT EXISTS
(SELECT
*
FROM
ACCT A2
WHERE
(A2.CUST1ID = A.CUST1ID
OR A2.CUST2ID = A.CUST1ID
OR (A.CUST2ID IS NOT NULL AND A2.CUST1ID = A.CUST2ID)
OR (A.CUST2ID IS NOT NULL AND A2.CUST2ID = A.CUST2ID)
)
AND A2.DT>A.DT
)
i have assumed that you have separate indexes on CUST1ID and another on CUST2ID. You can compare result without ascending index on DT ("date") field and with it. It can speed up your query or slow down - i do not know how your real data looks like
我假设你在CUST1ID上有另外的索引而在CUST2ID上有另一个索引。您可以在DT(“日期”)字段中比较结果而不使用升序索引。它可以加快您的查询速度或减慢速度 - 我不知道您的真实数据是什么样的
#8
0
Try below query. It lengthy, because there's need to apply repeatedly windowed functions (you can't nest them in a single query), but the query itself is pretty simple. The core idea is to split customers that never share account from customers that do. After that, for the single-account customer, the grouping column is easy, it's Cust1ID
, but for other, you have to do some operations described below, to get grouping column:
请尝试以下查询。它很冗长,因为需要重复应用窗口函数(不能将它们嵌套在单个查询中),但查询本身非常简单。核心思想是将从不与客户共享帐户的客户分开。之后,对于单帐户客户,分组列很容易,它是Cust1ID,但对于其他人,您必须执行下面描述的操作,以获取分组列:
To obtain grouping column (for multiple-account cutomers), you have to apply following logic:
要获取分组列(对于多帐户cutomers),您必须应用以下逻辑:
Put all 1st customers together with 2nd customers in same column using UNION ALL
(CTE called cte
in query). Then, when you sort by that column, and checking both IDs with IDs of the following row, you can check if they are "connected", i.e. they have at least one ID the same:
使用UNION ALL(CTE在查询中称为cte)将所有第一客户与第二客户放在同一列中。然后,当您按该列排序,并使用下一行的ID检查这两个ID时,您可以检查它们是否“已连接”,即它们至少有一个ID相同:
case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup
This way you can distinct groups and within that groups take maximum respectively to the date (dt
column).
通过这种方式,您可以将不同的组和该组中的组分别最大化到日期(dt列)。
Sample data:
样本数据:
declare @tbl table (acctnumber int, dt date , Cust1ID int, Cust2ID int);
insert into @tbl values
(10000, '2.1.16', 1110, null),
(10001, '2.1.16', 1111, null),
(10050, '2.1.17', 1111, null),
(10008, '2.1.16', 1120, null),
(10058, '2.3.17', 1120, 1121),
(10002, '2.1.16', 1112, null),
(10052, '2.2.17', 1113, 1112),
(10003, '2.2.16', 1114, 1115),
(7060, '2.4.17', 1115, 1114),
(10004, '2.2.16', 1116, 1117),
(10067, '2.5.17', 1117, null),
(10005, '2.1.16', 1118, null),
(10054, '2.3.17', 1118, 1119),
(10101, '6.2.17', 1119, null),
(10007, '2.1.16', 1122, 1123),
(10057, '2.3.17', 1123, 1124),
(10107, '6.2.17', 1124, 1125)
T-SQL:
T-SQL:
;with SingleAccounts as (
select cust1id from @tbl
where Cust2ID is null
except
select cust1id from @tbl
where Cust2ID is not null
except
select cust2id from @tbl
), cte as (
select acctnumber, dt, Cust1ID, Cust2ID from @tbl
where Cust1ID not in (select Cust1ID from SingleAccounts)
union all
select acctnumber, dt, Cust2ID, Cust1ID from @tbl
where Cust1ID not in (select Cust1ID from SingleAccounts) and Cust2ID is not null
), SingleAmountsResult as (
select acctnumber, dt, cust1id, cust2id,
ROW_NUMBER() over (partition by cust1id order by dt desc) rn
from @tbl
where cust1id in (select Cust1ID from SingleAccounts)
), FinalResult as (
select acctnumber, dt, cust1id, cust2id from SingleAmountsResult
where rn = 1
union all
select acctnumber, dt, cust1id, cust2id
from (
select acctnumber, dt, cust1id, cust2id,
ROW_NUMBER() over (partition by GroupingColumn order by dt desc) rn
from (
select acctnumber, dt, cust1id, cust2id,
SUM(NewGroup) over (order by cust1id, cust2id) GroupingColumn
from (
select acctnumber, dt, cust1id, cust2id,
case when LAG(SameGroup) over (order by cust1id, cust2id) = 0 then 1 else 0 end NewGroup
from (
select acctnumber, dt, cust1id, cust2id,
case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup
from (
select acctnumber, dt, cust1id, cust2id,
LEAD(cust1id) over (order by cust1id, cust2id) cust1idLead,
LEAD(cust2id) over (order by cust1id, cust2id) cust2idLead
from cte
) a
) a
) a
) a
) a where rn = 1
)
--this final query gets you correct Cust1ID and Cust2ID, as FinalResult might have them switched
select * from @tbl
intersect
select * from (
select acctnumber, dt, cust1id, cust2id from FinalResult
union all
select acctnumber, dt, cust2id, cust1id from FinalResult
) fr
UPDATE
UPDATE
This code, accordingly to OP explanation, treat all customers ID that have ever been on one account together as same group (and this is transitive1)), thus, for additional cases, 8a and 8b the results are:
相应于OP解释,此代码将所有曾在一个帐户*享的客户ID视为同一组(并且这是传递1)),因此,对于其他情况,8a和8b的结果为:
acctnumber | dt | Cust1ID | Cust2ID
50004 | 2017-04-04 | 2002 | NULL
50008 | 2017-04-04 | 2003 | NULL
as there will be only 2 groups!
因为只有2组!
1) it means, if element
a
is in group with elementb
andb
is in the same group as elementc
then it implies thata
andc
are also in the same group.1)这意味着,如果元素a与元素b在组中,并且b与元素c在同一组中,则意味着a和c也在同一组中。
#9
0
This is quite complex...
这很复杂......
First you want to identify groups of customers. That is all customers who were directly or indirectly related. With customer pairs A/B, B/C, D/E, D/F, G/A, H/A, H/F you'd have just one single group for instance. In SQL this requires a recursive query.
首先,您要识别客户群。这是所有直接或间接相关的客户。对于客户对A / B,B / C,D / E,D / F,G / A,H / A,H / F,您只有一个组。在SQL中,这需要递归查询。
SQL Server lacks a cycle detection in recursive queries. So from customers A/B you'd get to all pairs containing A or B, which is B/C, A/B G/A, H/A, and A/B itself for that matter. Even, if we detect this direct circle (same pair), we'd go on with B/C looking for all records that contain B or C. And one of these is A/B again and once more we are in a cycle. One way to deal with this is to build a string of yet visited customers and not visit them again.
SQL Server在递归查询中缺少循环检测。因此,从客户A / B,您将获得包含A或B的所有对,即B / C,A / B G / A,H / A和A / B本身。甚至,如果我们检测到这个直接的圆圈(同一对),我们继续用B / C寻找包含B或C的所有记录。其中一个是A / B再次,我们再次进入一个循环。解决这个问题的一种方法是建立一系列尚未访问的客户,而不是再次访问它们。
Our result is all cutomers with all directly or indirectly connected other customers. Using aggregation, we can take the minimum partner per customer and use this as a group key. In above example all customers are related to A, so A is all their minimum partner, showing that all belong to the same group. If we add two records X/Y and Z/-, then we have two more groups: X and Y belonging to the X group, and Z belonging to the Z group.
我们的结果是所有直接或间接连接其他客户的客户。使用聚合,我们可以为每个客户提供最小的合作伙伴,并将其用作组密钥。在上面的示例中,所有客户都与A相关,因此A是他们的最小合作伙伴,表明所有客户都属于同一个组。如果我们添加两个记录X / Y和Z / - ,那么我们还有两个组:X和Y属于X组,Z属于Z组。
These groups we use to look up our original records again. With ROW_NUMBER
we number each group's last record with #1. Then we keep only those and we are done.
我们用这些组再次查找原始记录。使用ROW_NUMBER,我们用#1为每个组的最后一个记录编号。然后我们只保留那些,我们就完成了。
with all_cust(custid) as
(
select cust1id from mytable
union
select cust2id from mytable where cust2id is not null
)
, cte(c1, c2, sofar) as
(
select custid, custid, '<' + cast(custid as varchar(max)) + '>' from all_cust
union all
select cte.c1, case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end,
cte.sofar + '<' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '>'
from mytable m
join cte on cte.c2 in (m.cust1id, m.cust2id)
and cte.sofar not like '%' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '%'
)
, groups(custid, grp) as
(
select c1, min(c2) from cte group by c1
)
, ranked as
(
select *, row_number() over (partition by g.grp order by date desc) as rn
from groups g
join mytable m on g.custid in (m.cust1id, m.cust2id)
)
select acctnumber, date, cust1id, cust2id
from ranked
where rn = 1
order by cust1id;
Rextester demo: http://rextester.com/RWCQ83881
Rextester演示:http://rextester.com/RWCQ83881