I want to delete duplicate rows from a table that has no primary key defined (a non-normalised DB).
My problem is that the table has approximately 540 million records. Previously I used a CTE to delete the duplicates, but it took more than 8 hours. I want to optimize the query. For example, suppose table1 contains the following data:
ID FNAME LNAME
1 AAA CCC
2 BBB DDD
1 AAA CCC
2 BBB DDD
1 AAA CCC
2 BBB DDD
3 BCB DGD
I want to remove the duplicate rows with a single query so that the table ends up like this:
ID FNAME LNAME
1 AAA CCC
2 BBB DDD
3 BCB DGD
Previously I used this type of query:
;with TBLCTE(EmpID,Ranking)
AS
(
select
EmpID,
Ranking = DENSE_RANK() over (PARTITION BY EmpID order by newID())
from @TBL
)
delete from TBLCTE where Ranking > 1
select * from @TBL order by EmpID
But it takes too much time.
I want a solution that meets these conditions:
- The table has no primary key or identity column.
- The table has more than 540 million rows, so the query should delete the duplicates in less time.
5 Solutions
#1
1
Try this:
WITH TempId AS (
SELECT *,
row_number() OVER(PARTITION BY ID, FNAME,LNAME ORDER BY ID) AS [Num]
FROM Employee)
DELETE TempId WHERE [Num] > 1
Select * from Employee
Find the solution in this Fiddle: http://sqlfiddle.com/#!6/394a9/1
#2
1
Transfer the data from OldTable into NewTable in chunks:
DECLARE @ChunkSize INT = 1000;
WHILE (EXISTS(
SELECT TOP(1)1 FROM OldTable ot
WHERE
NOT EXISTS(
SELECT TOP(1) 1
FROM NewTable nt
WHERE
nt.FNAME = ot.FNAME AND nt.LNAME = ot.LNAME)))
BEGIN
BEGIN TRANSACTION;
INSERT INTO NewTable(FNAME, LNAME)
SELECT DISTINCT TOP(@ChunkSize)
FNAME, LNAME
FROM
OldTable ot
WHERE
NOT EXISTS(
SELECT TOP(1) 1
FROM NewTable nt
WHERE
nt.FNAME = ot.FNAME AND
nt.LNAME = ot.LNAME);
COMMIT TRANSACTION;
END;
Clear the source table:
TRUNCATE TABLE OldTable;
Transfer the data back to OldTable:
WHILE (EXISTS(
SELECT TOP(1)1 FROM NewTable nt
WHERE
NOT EXISTS(
SELECT TOP(1) 1
FROM OldTable ot
WHERE
nt.FNAME = ot.FNAME AND nt.LNAME = ot.LNAME)))
BEGIN
BEGIN TRANSACTION;
INSERT INTO OldTable(FNAME, LNAME)
SELECT DISTINCT TOP(@ChunkSize)
FNAME, LNAME
FROM
NewTable nt
WHERE
NOT EXISTS(
SELECT TOP(1) 1
FROM OldTable ot
WHERE
nt.FNAME = ot.FNAME AND
nt.LNAME = ot.LNAME);
COMMIT TRANSACTION;
END;
Clear the transfer table:
TRUNCATE TABLE NewTable;
SELECT TOP(1000) * FROM OldTable
I think SSIS is the fastest way to transfer the data to another table and then move it back.
Right-click your DB and choose Tasks->Import Data. Select the same DB as the data source, choose the "Write custom query..." option, enter SELECT DISTINCT FNAME, LNAME FROM OldTable, and choose NewTable as the destination table.
Finally, run the import.
#3
0
Try this:
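-- Copy the rows into a temp table so that they can be indexed
Select * into #temp from @TBL
create nonclustered index Temp_Index on #temp(EmpId)
;with TBLCTE(EmpID,Ranking)
AS
(
select
EmpID,
Ranking = DENSE_RANK() over (PARTITION BY EmpID order by newID())
from #temp
)
-- Delete through the CTE: Ranking is not a column of #temp itself
delete from TBLCTE where Ranking > 1
-- TRUNCATE TABLE is not allowed on a table variable, so use DELETE
delete from @TBL
insert into @TBL
Select * from #temp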
#4
0
You have to delete the duplicated data in smaller chunks: write a loop and process the chunks one after another. The smallest possible chunk is the set of all duplicates of one unique record.
Your statement takes so long because it has to build the whole snapshot in memory.
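As a rough sketch of such a loop, assuming the Employee table and the ID, FNAME, LNAME columns from answer #1 (the batch size and variable names are illustrative, not part of this answer):

```sql
-- Sketch only: Employee, its columns, and the batch size are assumptions.
DECLARE @BatchSize INT = 100000;
DECLARE @Deleted INT = 1;

WHILE @Deleted > 0
BEGIN
    -- Number the rows within each group of identical values;
    -- every row with rn > 1 is a duplicate.
    ;WITH Dupes AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY ID, FNAME, LNAME
                                  ORDER BY (SELECT NULL)) AS rn
        FROM Employee
    )
    -- Delete at most @BatchSize duplicates per statement, so each
    -- delete commits on its own (under the default autocommit mode).
    DELETE TOP (@BatchSize) FROM Dupes WHERE rn > 1;

    -- Stop once a pass finds nothing left to delete.
    SET @Deleted = @@ROWCOUNT;
END;
```

Each pass recomputes the row numbers, so without an index on the partitioning columns every batch may scan and sort the whole table. Whether this beats a single large delete on 540 million rows depends on indexing and the recovery model, so it is worth timing on a copy of the data first.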
#5
0
You can do this:
1. Insert the distinct records into a temporary table.
2. Truncate the original table.
3. Insert the records back into the original table.
SELECT DISTINCT T.*
INTO #TEMPTABLE
FROM T;
TRUNCATE TABLE T;
INSERT INTO t
SELECT tt.*
FROM #temptable tt;
Or:
1. Insert the distinct records into a new table.
2. Drop the original table.
3. Rename the new table to the original name.
SELECT DISTINCT *
INTO NewTable
FROM OldTable;
DROP TABLE OldTable;
-- sp_rename takes the current name first, then the new name
EXEC sp_rename 'NewTable', 'OldTable'
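A slightly more cautious variant of the drop-and-rename steps, sketched under the assumption that there is enough disk space to keep both tables for a while (OldTable_backup is a hypothetical name):

```sql
-- Sketch only: OldTable_backup is an illustrative name.
SELECT DISTINCT *
INTO NewTable
FROM OldTable;

-- Swap the names instead of dropping the original right away.
BEGIN TRANSACTION;
EXEC sp_rename 'OldTable', 'OldTable_backup';
EXEC sp_rename 'NewTable', 'OldTable';
COMMIT TRANSACTION;

-- Once the deduplicated data has been verified:
-- DROP TABLE OldTable_backup;
```

Note that SELECT ... INTO does not copy indexes, constraints, or triggers, so any that existed on the original table have to be recreated on the new one.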
You can replace SELECT DISTINCT with a faster way to get the distinct records, but the procedure stays the same.
For reference, here is a super-fast DISTINCT using a recursive CTE, by Paul White:
-- The index must be on the table and column that the query below reads
CREATE CLUSTERED INDEX c ON dbo.Test (data);
WITH RecursiveCTE
AS (
SELECT data = MIN(T.data)
FROM dbo.Test T
UNION ALL
SELECT R.data
FROM (
-- A cunning way to use TOP in the recursive part of a CTE :)
SELECT T.data,
rn = ROW_NUMBER() OVER (ORDER BY T.data)
FROM dbo.Test T
JOIN RecursiveCTE R
ON R.data < T.data
) R
WHERE R.rn = 1
)
SELECT *
FROM RecursiveCTE
OPTION (MAXRECURSION 0);