What is the best way to delete duplicate values from a MySQL table?

Time: 2022-09-16 14:26:27

I have the following SQL to delete duplicate values from a table:

DELETE p1 
FROM `ProgramsList` p1, `ProgramsList` p2  
WHERE p1.CustId = p2.CustId 
    AND p1.CustId = 1 
    AND p1.`Id`>p2.`Id` 
    AND p1.`ProgramName` = p2.`ProgramName`;

Id is auto-incrementing.
For a given CustId, ProgramName must be unique (currently it is not).
The above SQL takes about 4 to 5 hours to complete on about 1,000,000 records.

Could anyone suggest a quicker way of deleting duplicates from a table?

2 solutions

#1


First, you might try adding indexes on the ProgramName and CustId columns if you don't already have them.
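
A minimal sketch of that indexing step, assuming ProgramName is a VARCHAR short enough to index in full (a TEXT column would need a prefix length). The index name idx_cust_program is made up for illustration.

```sql
-- List the indexes the table already has before adding another one.
SHOW INDEX FROM ProgramsList;

-- Hypothetical composite index covering both columns used to match duplicates.
-- The name idx_cust_program is an assumption, not something from the question.
ALTER TABLE ProgramsList
    ADD INDEX idx_cust_program (CustId, ProgramName);
```

A single composite index is used here instead of two single-column indexes because the duplicate check compares both columns together.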

De-Duping

You can group your records to identify duplicates and, while doing so, grab the minimum Id for each group. Then delete every record whose Id is not one of those minimum Ids.
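
Before deleting anything, it may help to preview the groups this approach would touch. The query below is a read-only sketch using the column names from the question; the aliases Copies and MinID are just labels chosen for illustration.

```sql
-- Show every (CustId, ProgramName) pair that occurs more than once,
-- how many copies exist, and the Id that would be kept.
SELECT CustId,
       ProgramName,
       COUNT(*) AS Copies,
       MIN(Id)  AS MinID
FROM ProgramsList
GROUP BY CustId, ProgramName
HAVING COUNT(*) > 1;
```

If this returns no rows, there is nothing left to de-duplicate.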

In-Clause Method

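DELETE FROM ProgramsList
WHERE Id NOT IN (
    -- MySQL usually rejects a subquery that reads the table being deleted from
    -- (error 1093), so the grouped result is wrapped in a derived table.
    SELECT MinID FROM (
        SELECT MIN(Id) AS MinID
        FROM ProgramsList
        GROUP BY ProgramName, CustId
    ) AS keepers
);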

Join-Method

Each run removes only the row with the highest Id from every group that still has duplicates, so you may have to run this more than once if a group has more than two members.

DELETE P
FROM ProgramsList as P
INNER JOIN 
    (select count(*) as Count, max(id) as MaxID
     from ProgramsList
     group by ProgramName, CustID) as A on A.MaxID = P.id
WHERE A.Count >= 2
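
If repeating the statement by hand is inconvenient, one option is to wrap it in a stored procedure that loops until a pass deletes nothing. This is only a sketch: the procedure name DedupeProgramsList and the variable deleted_rows are invented here, the alias Cnt replaces Count purely for readability, and DELIMITER is a mysql command-line client directive that other clients may not need.

```sql
DELIMITER //

-- Hypothetical helper: repeats the join-based delete until no rows are removed.
CREATE PROCEDURE DedupeProgramsList()
BEGIN
    DECLARE deleted_rows INT DEFAULT 1;

    WHILE deleted_rows > 0 DO
        -- Remove the highest Id from every group that still has 2+ rows.
        DELETE P
        FROM ProgramsList AS P
        INNER JOIN
            (SELECT COUNT(*) AS Cnt, MAX(Id) AS MaxID
             FROM ProgramsList
             GROUP BY ProgramName, CustId) AS A ON A.MaxID = P.Id
        WHERE A.Cnt >= 2;

        -- ROW_COUNT() reports how many rows the DELETE just removed.
        SET deleted_rows = ROW_COUNT();
    END WHILE;
END //

DELIMITER ;

CALL DedupeProgramsList();
```

The loop runs one more pass than strictly needed, because the final pass is the one that confirms nothing is left to delete.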

Some people have performance issues with the In-Clause, some don't. It depends a lot on your indexes and such. If one is too slow, try the other.
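
One way to see how your indexes affect either method is to run EXPLAIN on the grouping query they both rely on. This is a sketch only; what the plan shows depends on your MySQL version and on which indexes exist.

```sql
-- Inspect the plan for the grouping step shared by both methods.
-- In the output, the "key" column shows which index (if any) is used,
-- and the "Extra" column shows whether a temporary table or sort is needed.
EXPLAIN
SELECT MIN(Id) AS MinID
FROM ProgramsList
GROUP BY CustId, ProgramName;
```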

Related: https://*.com/a/4192849/127880

#2


This will remove all the duplicates in one go.

这将一次性删除所有重复项。

The inner query finds the Id to keep (the highest one) for each program of each customer; every other row in that group is deleted.

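DELETE p
FROM ProgramsList AS p
INNER JOIN (SELECT ProgramName, CustId, MAX(Id) AS MaxID
     FROM ProgramsList
     GROUP BY ProgramName, CustId ORDER BY NULL) AS A
    -- match on both grouping columns so one customer's rows are never compared with another's
    ON A.ProgramName = p.ProgramName AND A.CustId = p.CustId
WHERE A.MaxID != p.Id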
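
Since the question states that ProgramName must be unique for a given CustId, a unique index added once the table is clean could keep new duplicates from being inserted. This is a sketch: the index name uq_cust_program is invented, and it assumes ProgramName can be indexed in full.

```sql
-- Confirm no duplicate pairs remain; this should return zero rows.
SELECT CustId, ProgramName
FROM ProgramsList
GROUP BY CustId, ProgramName
HAVING COUNT(*) > 1;

-- Hypothetical unique index. The ALTER fails with a duplicate-entry error
-- if any duplicate (CustId, ProgramName) pairs are still present.
ALTER TABLE ProgramsList
    ADD UNIQUE INDEX uq_cust_program (CustId, ProgramName);
```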
