How to delete huge duplicate data, keeping one record, with no identity column / primary key

Date: 2022-01-15 02:14:15

I want to delete duplicate rows from a table that has no primary key defined (a non-normalised DB).

My problem: the table has approximately 540 million records. Previously I used a CTE to delete the duplicates, but it took more than 8 hours. I want to optimize the query. For example, suppose table1 contains the following data:

ID FNAME LNAME 
1 AAA CCC
2 BBB DDD
1 AAA CCC
2 BBB DDD
1 AAA CCC
2 BBB DDD
3 BCB DGD

Remove the duplicate rows with a single query, so that the table ends up like this:

ID FNAME LNAME
1 AAA CCC
2 BBB DDD
3 BCB DGD

Previously I used this type of query:

;with TBLCTE(EmpID,Ranking)
AS
(
select
EmpID,
Ranking = DENSE_RANK() over (PARTITION BY EmpID order by newID())
from @TBL
)
delete from TBLCTE where Ranking > 1
select * from @TBL order by EmpID

But it takes too much time.

I want a solution that meets these conditions:

  1. No primary key or identity column
  2. The table holds more than 540 million rows, so the query should delete the records in less time.

5 Solutions

#1


1  

Try this:

 WITH TempId AS (
 SELECT *, 
 row_number() OVER(PARTITION BY ID, FNAME,LNAME ORDER BY ID) AS [Num]
 FROM Employee)

DELETE TempId WHERE [Num] > 1

Select * from Employee

See the solution in this Fiddle: http://sqlfiddle.com/#!6/394a9/1

#2


1  

Transfer the data from OldTable into NewTable in chunks:

DECLARE @ChunkSize INT = 1000;

WHILE (EXISTS(
    SELECT TOP(1)1 FROM OldTable ot 
    WHERE 
        NOT EXISTS(
            SELECT TOP(1) 1 
              FROM NewTable nt 
            WHERE 
              nt.FNAME = ot.FNAME AND nt.LNAME = ot.LNAME)))
BEGIN

    BEGIN TRANSACTION;


        INSERT INTO NewTable(FNAME, LNAME)
        SELECT DISTINCT TOP(@ChunkSize) 
            FNAME, LNAME
          FROM 
            OldTable ot 
        WHERE 
            NOT EXISTS(
                SELECT TOP(1) 1 
                  FROM NewTable nt 
                WHERE 
                    nt.FNAME = ot.FNAME AND 
                    nt.LNAME = ot.LNAME);

    COMMIT TRANSACTION;

END;

Clear the source table:

TRUNCATE TABLE OldTable;

Transfer the data back to OldTable:

WHILE (EXISTS(
    SELECT TOP(1)1 FROM NewTable nt 
    WHERE 
        NOT EXISTS(
            SELECT TOP(1) 1 
              FROM OldTable ot 
            WHERE 
              nt.FNAME = ot.FNAME AND nt.LNAME = ot.LNAME)))
BEGIN

    BEGIN TRANSACTION;


        INSERT INTO OldTable(FNAME, LNAME)
        SELECT DISTINCT TOP(@ChunkSize) 
            FNAME, LNAME
          FROM 
            NewTable nt 
        WHERE 
            NOT EXISTS(
                SELECT TOP(1) 1 
                  FROM OldTable ot 
                WHERE 
                    nt.FNAME = ot.FNAME AND 
                    nt.LNAME = ot.LNAME);

    COMMIT TRANSACTION;

END;

Clear the transfer table:

TRUNCATE TABLE NewTable;

SELECT TOP(1000) * FROM OldTable

I think SSIS is the fastest way to transfer the data to another table and then bring it back.

Right-click your DB and choose Tasks->Import data. Select the same DB as the data source, choose the "Write custom query..." option, enter SELECT DISTINCT FNAME, LNAME FROM OldTable, and choose NewTable as the destination table.

Finally, run the import.
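For readers who prefer to stay in T-SQL, the import step above can be approximated with a single set-based insert. This is only a sketch under stated assumptions: NewTable is assumed to already exist as an empty heap (no indexes) with the same columns as the question's sample table, and the `dbo` schema and the inclusion of the ID column are illustrative choices, not part of the answer above.

```sql
-- Sketch only: table and column names are assumptions taken from the sample data.
-- The TABLOCK hint takes a table-level lock on the target; on a heap, under the
-- SIMPLE or BULK_LOGGED recovery model, this can allow a minimally logged insert.
INSERT INTO dbo.NewTable WITH (TABLOCK) (ID, FNAME, LNAME)
SELECT DISTINCT ID, FNAME, LNAME
FROM dbo.OldTable;
```

Whether this beats the import wizard depends on the recovery model, the available tempdb space for the DISTINCT sort or hash, and the hardware, so it is worth timing on a sample first.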

#3


0  

Try this:

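-- Copy the rows into a temp table and index the column used for partitioning
select * into #temp from @TBL
create nonclustered index Temp_Index on #temp(EmpID)

;with TBLCTE(EmpID,Ranking)
AS
(
select
EmpID,
Ranking = DENSE_RANK() over (PARTITION BY EmpID order by newID())
from #temp
)
-- Delete through the CTE: Ranking exists only there, not in #temp
delete from TBLCTE where Ranking > 1

-- TRUNCATE TABLE is not allowed on a table variable, so use DELETE
delete from @TBL
insert into @TBL
select * from #temp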

#4


0  

You have to delete the duplicated data in smaller chunks: write a loop and process the chunks one after another. The smallest possible chunk is all the duplicates of one unique record.
Your statement takes so long because it has to build the whole snapshot in memory.
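A minimal sketch of such a loop, assuming the table is called dbo.Employee with the ID, FNAME and LNAME columns from the question; the table name and the batch size are illustrative, not something this answer specified:

```sql
-- Sketch only: dbo.Employee and @BatchSize are assumptions.
DECLARE @BatchSize INT = 100000;
DECLARE @Deleted   INT = 1;

WHILE @Deleted > 0
BEGIN
    -- Number the copies of each distinct row; any row with rn > 1 is a duplicate.
    ;WITH Dupes AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY ID, FNAME, LNAME
                                  ORDER BY (SELECT NULL)) AS rn
        FROM dbo.Employee
    )
    -- Remove at most @BatchSize duplicates per iteration.
    DELETE TOP (@BatchSize) FROM Dupes WHERE rn > 1;

    -- Stop once an iteration finds nothing left to delete.
    SET @Deleted = @@ROWCOUNT;
END;
```

Each iteration commits on its own, which keeps individual transactions small. The trade-off is that every pass recomputes the row numbers, so without an index on the partitioning columns the loop rescans the table each time.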

#5


0  

You can do this:

1. Insert the distinct records into a temporary table.

2. Truncate the original table.

3. Insert the records back into the original table.

SELECT DISTINCT T.*
INTO #TempTable
FROM T;

TRUNCATE TABLE T;

INSERT INTO T
    SELECT tt.*
    FROM #TempTable tt;

OR

1. Insert the distinct records into a new table.

2. Drop the original table.

3. Rename the new table to the original name.

SELECT DISTINCT *
INTO NewTable
FROM OldTable;

DROP TABLE OldTable;

-- Give the de-duplicated table the original name (old name first, new name second)
EXEC sp_rename 'NewTable', 'OldTable'
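The drop-and-rename steps can also be wrapped in a transaction, so that a failure midway does not leave the database without a table under the original name. This is a hedged sketch: the `dbo` schema is an assumption, and it presumes NewTable has already been populated by the SELECT DISTINCT ... INTO step.

```sql
-- Sketch only: schema and table names are assumptions.
BEGIN TRY
    BEGIN TRANSACTION;

    DROP TABLE dbo.OldTable;

    -- sp_rename takes the current (optionally schema-qualified) name first,
    -- then the new name without a schema prefix.
    EXEC sp_rename 'dbo.NewTable', 'OldTable';

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo the drop if the rename (or anything else) failed.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;
```

Note that SELECT ... INTO copies only the columns and data, so any indexes, constraints and permissions on the original table have to be recreated on the renamed one.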

You can replace SELECT DISTINCT with a faster way of getting the distinct records, but the procedure stays the same.

Here is a super-fast DISTINCT using a recursive CTE, by Paul White. See this for reference:

-- The index must be on the table and column that the query below reads
CREATE  CLUSTERED INDEX c ON dbo.Test(data);


WITH    RecursiveCTE
AS      (
        SELECT  data = MIN(T.data)
        FROM    dbo.Test T
        UNION   ALL
        SELECT  R.data
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE :)
                SELECT  T.data,
                        rn = ROW_NUMBER() OVER (ORDER BY T.data)
                FROM    dbo.Test T
                JOIN    RecursiveCTE R
                        ON  R.data < T.data
                ) R
        WHERE   R.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0);
