What technique should I use to optimize a SQL query

Time: 2022-12-08 14:41:27

Hi, I have a stored procedure that fetches records for a search. It returns millions of records. A bug was found in the search procedure: under certain conditions it also returns duplicate records. I have found the cause of the duplicates. Below is the query in question:

With cteAutoApprove (AcctID, AutoApproved,DecisionDate)                
AS (
select 
    A.AcctID,
    CAST(autoEnter AS SMALLINT) AS AutoApproved, 
    DecisionDate 
from 
(
    SELECT 
        awt.AcctID, 
        MIN(awt.dtEnter) AS DecisionDate
    FROM
        dbo.AccountWorkflowTask awt 
        JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
        Join Task T on T.TaskID = wt.TaskID
    WHERE
        (
            (T.TaskStageID = 3 and awt.ReasonIDExit is NULL) 
            OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
        )
    GROUP BY 
        awt.AcctID
) A 
Join AccountWorkflowTask awt1 
    on awt1.dtEnter=A.DecisionDate and awt1.AcctID=a.AcctID
), 

This CTE returned duplicate records because of the join condition awt1.dtEnter = A.DecisionDate: for some accounts, more than one row has exactly the same dtEnter, so the join back to AccountWorkflowTask matches several rows for that account.
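The effect described above can be reproduced on a toy table. This is only a sketch: it uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, the table is reduced to three columns, the sample rows are invented, and the WorkflowTask/Task filters are left out.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AccountWorkflowTask (AcctID INTEGER, dtEnter TEXT, autoEnter INTEGER);
    -- Invented sample data: account 1 has two rows sharing its earliest dtEnter.
    INSERT INTO AccountWorkflowTask VALUES
        (1, '2022-01-01 10:00:00', 0),
        (1, '2022-01-01 10:00:00', 1),
        (1, '2022-01-02 09:00:00', 0),
        (2, '2022-01-03 08:00:00', 1);
""")

# Same shape as the CTE: take MIN(dtEnter) per account, then join back on that date.
min_join_rows = conn.execute("""
    SELECT A.AcctID, awt1.autoEnter, A.DecisionDate
    FROM (
        SELECT AcctID, MIN(dtEnter) AS DecisionDate
        FROM AccountWorkflowTask
        GROUP BY AcctID
    ) A
    JOIN AccountWorkflowTask awt1
        ON awt1.dtEnter = A.DecisionDate AND awt1.AcctID = A.AcctID
    ORDER BY A.AcctID, awt1.autoEnter
""").fetchall()

# Account 1 appears twice because both tied rows match the join condition.
print(min_join_rows)
```

Two accounts go in, but three rows come out: the tie on dtEnter for account 1 is what multiplies the result.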

My question is: what should I use to prevent this? I cannot use DISTINCT here, as it will definitely slow down the search procedure. Should I use RANK or DENSE_RANK so that the query stays optimized and takes less time to return its result, or some other technique? Please help, as I am stuck here.

1 Solution

#1


This does seem like a good candidate for row_number (not rank: with the same dates on the same acctid, you'd still get multiple records). Obviously I can't test the query here, but winging it:

select 
    A.AcctID,
    CAST(A.autoEnter AS SMALLINT) AS AutoApproved, 
    A.DecisionDate 
from 
(
    SELECT 
        awt.AcctID, 
        awt.dtEnter AS DecisionDate,
        awt.autoEnter,
        -- number the rows per account, earliest dtEnter first;
        -- rows tied on dtEnter still get distinct numbers
        row_number() over (partition by awt.AcctID order by awt.dtEnter) rnr
    FROM
        dbo.AccountWorkflowTask awt 
        JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
        JOIN dbo.Task T ON T.TaskID = wt.TaskID
    WHERE
        (
            (T.TaskStageID = 3 and awt.ReasonIDExit is NULL) 
            OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
        )
) A 
-- keep exactly one row per account
where A.rnr = 1

This way, the group by is no longer necessary: row_number picks the first date. The second join is not necessary either, since the subquery already contains all the data (and the optimizer is smart enough not to do anything with the rows it doesn't need).
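To illustrate why row_number rather than rank removes the duplicates, here is a small sketch. It again uses SQLite through Python's sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later); the three-column table, the sample rows, and the helper first_rows are invented for the demo, and the WorkflowTask/Task filters are left out.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AccountWorkflowTask (AcctID INTEGER, dtEnter TEXT, autoEnter INTEGER);
    -- Invented sample data: account 1 has two rows sharing its earliest dtEnter.
    INSERT INTO AccountWorkflowTask VALUES
        (1, '2022-01-01 10:00:00', 0),
        (1, '2022-01-01 10:00:00', 1),
        (1, '2022-01-02 09:00:00', 0),
        (2, '2022-01-03 08:00:00', 1);
""")

def first_rows(ranking_function):
    """Return the rows numbered 1 per AcctID by the given window function."""
    query = f"""
        SELECT AcctID, autoEnter, dtEnter
        FROM (
            SELECT AcctID, autoEnter, dtEnter,
                   {ranking_function}() OVER (
                       PARTITION BY AcctID
                       ORDER BY dtEnter
                   ) AS rnr
            FROM AccountWorkflowTask
        ) A
        WHERE rnr = 1
        ORDER BY AcctID, autoEnter
    """
    return conn.execute(query).fetchall()

row_number_rows = first_rows("ROW_NUMBER")  # one row per account
rank_rows = first_rows("RANK")              # tied rows all get rank 1

print(len(row_number_rows), len(rank_rows))
```

With row_number each account keeps a single row, while rank keeps both tied rows for account 1. Which of the tied rows row_number keeps is not defined when the ORDER BY contains only dtEnter; if that matters, adding a unique column to the ORDER BY as a tie-breaker makes the choice deterministic.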

PS. Because SQL Server window functions work very efficiently, using row_number instead of the min() + join construction will most likely give a performance boost, even if there were no duplicate rows.
