Hi, I have a stored procedure that fetches records for a search. It returns millions of records. We found a bug in it: under certain conditions it also returns duplicate records. I have tracked down the cause. Here is the query in question:
WITH cteAutoApprove (AcctID, AutoApproved, DecisionDate)
AS (
    SELECT
        A.AcctID,
        CAST(autoEnter AS SMALLINT) AS AutoApproved,
        DecisionDate
    FROM
    (
        SELECT
            awt.AcctID,
            MIN(awt.dtEnter) AS DecisionDate
        FROM
            dbo.AccountWorkflowTask awt
            JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
            JOIN Task T ON T.TaskID = wt.TaskID
        WHERE
        (
            (T.TaskStageID = 3 AND awt.ReasonIDExit IS NULL)
            OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
        )
        GROUP BY
            awt.AcctID
    ) A
    JOIN AccountWorkflowTask awt1
        ON awt1.dtEnter = A.DecisionDate AND awt1.AcctID = A.AcctID
),
This CTE returns duplicate records because of the join condition awt1.dtEnter = A.DecisionDate: for some accounts, more than one AccountWorkflowTask row has exactly the same dtEnter, so the join back matches all of them.
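To make the cause concrete, here is a minimal sketch that reproduces the effect. It is not the real schema: it uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, a cut-down AccountWorkflowTask table with only three columns, and made-up rows. The filter on WorkflowTask and Task is left out because it does not affect the duplication.

```python
import sqlite3

# Hypothetical, simplified stand-in for dbo.AccountWorkflowTask; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AccountWorkflowTask (AcctID INTEGER, dtEnter TEXT, autoEnter INTEGER)"
)
conn.executemany(
    "INSERT INTO AccountWorkflowTask VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 10:00:00", 1),
        (1, "2020-01-01 10:00:00", 0),  # same account, same earliest dtEnter
        (1, "2020-01-02 09:00:00", 0),
        (2, "2020-01-03 08:00:00", 1),
    ],
)

# Same shape as the CTE: MIN(dtEnter) per account, then join back on the date.
join_back = conn.execute(
    """
    SELECT A.AcctID, awt1.autoEnter, A.DecisionDate
    FROM (
        SELECT AcctID, MIN(dtEnter) AS DecisionDate
        FROM AccountWorkflowTask
        GROUP BY AcctID
    ) A
    JOIN AccountWorkflowTask awt1
        ON awt1.dtEnter = A.DecisionDate AND awt1.AcctID = A.AcctID
    ORDER BY A.AcctID, awt1.autoEnter
    """
).fetchall()

# Account 1 comes back twice because two of its rows share the minimum dtEnter.
for row in join_back:
    print(row)
```

The subquery itself yields one row per account; the duplication only appears when the join back matches every row that carries the minimum date.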
My question is how to prevent this. I cannot use DISTINCT here because it would slow the search procedure down. Should I use RANK or DENSE_RANK so that the query stays fast? Or is there some other technique? Please help, I am stuck on this.
1 solution
#1
This looks like a good candidate for ROW_NUMBER. (Not RANK: rows with the same date on the same AcctID all get the same rank, so you would still get multiple records.) I can't test the query here, so treat this as an untested sketch:
SELECT
    A.AcctID,
    CAST(autoEnter AS SMALLINT) AS AutoApproved,
    DecisionDate
FROM
(
    SELECT
        awt.AcctID,
        awt.dtEnter AS DecisionDate,
        autoEnter,
        ROW_NUMBER() OVER (PARTITION BY awt.AcctID ORDER BY awt.dtEnter) rnr
    FROM
        dbo.AccountWorkflowTask awt
        JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
        JOIN Task T ON T.TaskID = wt.TaskID
    WHERE
    (
        (T.TaskStageID = 3 AND awt.ReasonIDExit IS NULL)
        OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
    )
) A
WHERE rnr = 1
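This way the GROUP BY is no longer necessary: ROW_NUMBER picks the first date per account. The second join is not needed either, because the subquery already contains all the data (and the optimizer is smart enough not to do anything with the rows it doesn't need). Note that when several rows share the earliest dtEnter, ROW_NUMBER keeps one of them arbitrarily unless you add a tie-breaker column to the ORDER BY.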
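The difference between RANK and ROW_NUMBER on tied dates can be checked with a small sketch. As assumptions: SQLite (3.25 or later, for window functions) via Python's sqlite3 stands in for SQL Server, the table is a cut-down hypothetical version of AccountWorkflowTask with made-up rows, and the helper first_rows is invented for this illustration. The tie-breaker `autoEnter DESC` is also only an example; which of the tied rows should win is a business decision.

```python
import sqlite3

# Hypothetical, simplified stand-in for dbo.AccountWorkflowTask; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AccountWorkflowTask (AcctID INTEGER, dtEnter TEXT, autoEnter INTEGER)"
)
conn.executemany(
    "INSERT INTO AccountWorkflowTask VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 10:00:00", 1),
        (1, "2020-01-01 10:00:00", 0),  # tie on the earliest dtEnter for account 1
        (1, "2020-01-02 09:00:00", 0),
        (2, "2020-01-03 08:00:00", 1),
    ],
)


def first_rows(window_fn, order_by):
    """Return the rows numbered 1 per account by the given window function."""
    sql = f"""
        SELECT AcctID, autoEnter, DecisionDate
        FROM (
            SELECT AcctID, autoEnter, dtEnter AS DecisionDate,
                   {window_fn}() OVER (PARTITION BY AcctID ORDER BY {order_by}) AS rnr
            FROM AccountWorkflowTask
        ) A
        WHERE rnr = 1
        ORDER BY AcctID, autoEnter
    """
    return conn.execute(sql).fetchall()


# RANK gives both tied rows the value 1, so account 1 is still duplicated.
with_rank = first_rows("RANK", "dtEnter")

# ROW_NUMBER numbers the tied rows 1 and 2; the extra ORDER BY column
# (an assumed tie-breaker) makes the surviving row deterministic.
with_row_number = first_rows("ROW_NUMBER", "dtEnter, autoEnter DESC")

print(len(with_rank))        # 3 rows: account 1 twice, account 2 once
print(with_row_number)       # one row per account
```

In the real query the tie-breaker would be whatever column identifies the row that should count as the decision, for example a unique key on AccountWorkflowTask if one exists.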
P.S. Because SQL Server window functions are very efficient, replacing the MIN() plus join construction with ROW_NUMBER will most likely improve performance as well, even if there were no duplicate rows.