How do I delete duplicate records in a MySQL database?

Date: 2021-05-28 22:12:09

What's the best way to delete duplicate records in a MySQL database using Rails or MySQL queries?

15 answers

#1


10  

What you can do is copy the distinct records into a new table by:

 CREATE TABLE NewTable AS SELECT DISTINCT * FROM MyTable;

#2


8  

Here's another idea in no particular language:

rs = `select a, b, count(*) as c from entries group by 1, 2 having c > 1`
rs.each do |a, b, c|
  `delete from entries where a=#{a} and b=#{b} limit #{c - 1}`
end

Edit:

Kudos to Olaf for that "having" hint :)
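
The loop above can be tried out without a database. What follows is a minimal Ruby sketch, assuming an in-memory array of hashes standing in for the `entries` table. The sample `rows` and the `delete_limited` helper are hypothetical; the helper only mimics what `DELETE ... LIMIT n` does. It needs Ruby 2.7 or later for `tally`.

```ruby
# In-memory stand-in for the `entries` table (hypothetical sample data).
rows = [
  { a: 1, b: 'x' },
  { a: 1, b: 'x' },
  { a: 1, b: 'x' },
  { a: 2, b: 'y' },
  { a: 3, b: 'z' },
  { a: 3, b: 'z' }
]

# Mimics "DELETE FROM entries WHERE a = ? AND b = ? LIMIT n":
# removes at most `limit` rows that match the given pair.
def delete_limited(rows, a, b, limit)
  removed = 0
  rows.reject do |row|
    match = row[:a] == a && row[:b] == b && removed < limit
    removed += 1 if match
    match
  end
end

# Equivalent of "GROUP BY a, b HAVING c > 1": count each (a, b) pair
# and keep only the pairs that occur more than once.
duplicate_counts = rows.map { |row| [row[:a], row[:b]] }
                       .tally
                       .select { |_pair, count| count > 1 }

# For every duplicated pair, delete all but one row.
deduped = duplicate_counts.reduce(rows) do |remaining, ((a, b), count)|
  delete_limited(remaining, a, b, count - 1)
end

puts deduped.inspect
```

Deleting `count - 1` rows per group is what guarantees that exactly one row of each group survives, which is why the count has to come from the same query that finds the duplicates.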

#3


7  

Well, if it's a small table, you can do this from the Rails console:

class ActiveRecord::Base
  def non_id_attributes
    atts = self.attributes
    atts.delete('id')
    atts
  end
end

# group rows by everything except the id, keeping only groups that have duplicates
duplicate_groups = YourClass.all.group_by(&:non_id_attributes).values.select { |group| group.size > 1 }
# keep the first row of each group, collect the rest
redundant_elements = duplicate_groups.flat_map { |group| group.drop(1) }
redundant_elements.each(&:destroy)
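
The same grouping idea can be checked in plain Ruby before running it against real models. This is only a sketch: the `records` array of hashes is a hypothetical stand-in for ActiveRecord objects, and `non_id_attributes` is written as a free function here instead of a method on the model.

```ruby
# Hypothetical records; in Rails these would be model instances.
records = [
  { id: 1, name: 'Ann', city: 'Oslo' },
  { id: 2, name: 'Bob', city: 'Rome' },
  { id: 3, name: 'Ann', city: 'Oslo' },
  { id: 4, name: 'Ann', city: 'Oslo' }
]

# Everything except the primary key decides whether two rows are duplicates.
def non_id_attributes(record)
  record.reject { |key, _value| key == :id }
end

# Group rows by their non-id attributes and keep only groups with duplicates.
duplicate_groups = records.group_by { |record| non_id_attributes(record) }
                          .values
                          .select { |group| group.size > 1 }

# Keep the first row of each group; the rest are redundant.
redundant = duplicate_groups.flat_map { |group| group.drop(1) }

puts redundant.map { |record| record[:id] }.inspect
```

Because every row is loaded into memory and compared there, this approach is only reasonable for small tables, as the answer says.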

#4


7  

Check for duplicate entries:

SELECT DISTINCT(req_field) AS field, COUNT(req_field) AS fieldCount FROM 
table_name GROUP BY req_field HAVING fieldCount > 1


Remove duplicate entries:

DELETE FROM table_name 
USING table_name, table_name AS vtable 
WHERE 
    (table_name.id > vtable.id) 
AND (table_name.req_field = vtable.req_field)

Replace req_field and table_name - should work without any issues.

#5


4  

New to SQL :-) This is a classic question, often asked in interviews :-) I don't know whether it will work in MySQL, but it works in most databases:

> create table t(
>     a char(2),
>     b char(2),
>     c smallint )

> select a,b,c,count(*) from t
> group by a,b,c
> having count(*) > 1
a  b  c
-- -- ------ -----------
(0 rows affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bc",1)
(1 row affected)

> select a,b,c,count(*) from t group by a,b,c having count(*) > 1
a  b  c 
-- -- ------ -----------
aa bb      1           2
(1 row affected)
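
For comparison, the `GROUP BY ... HAVING COUNT(*) > 1` check in that session can be mirrored in a few lines of Ruby. This is a small sketch over the same three rows, assuming Ruby 2.7 or later for `tally`; it only detects duplicates and deletes nothing.

```ruby
# The three rows inserted in the session above.
rows = [
  ['aa', 'bb', 1],
  ['aa', 'bb', 1],
  ['aa', 'bc', 1]
]

# Count identical rows, then keep only those seen more than once.
duplicates = rows.tally.select { |_row, count| count > 1 }

duplicates.each do |(a, b, c), count|
  puts "#{a} #{b} #{c} #{count}"
end
```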

#6


1  

If the table (EMP) has a primary key (id) and you want to delete duplicate records based on the name column, keeping the oldest row (lowest id) of each group, the following query may be a good approach for large data sets.

DELETE t3
FROM (
        SELECT t1.name, t1.id
        FROM (
                SELECT name
                FROM EMP
                GROUP BY name
                HAVING COUNT(name) > 1
        ) AS t0 INNER JOIN EMP t1 ON t0.name = t1.name
) AS t2 INNER JOIN EMP t3 ON t3.name = t2.name
WHERE t2.id < t3.id;

#7


1  

Suppose we have a table named tbl_product with duplicate values in the fields p_pi_code and p_nats_id. First create a new table and insert the data from the existing table into it, grouped by the first field (from tbl_product to newtable1). If another field also contains duplicates, repeat the step from newtable1 to newtable2. Only the definition of newtable2 is shown; newtable1 is assumed to have the same structure.

CREATE TABLE `newtable2` (                                  
            `p_id` int(10) unsigned NOT NULL auto_increment,         
            `p_status` varchar(45) NOT NULL,                         
            `p_pi_code` varchar(45) NOT NULL,                        
            `p_nats_id` mediumint(8) unsigned NOT NULL,              
            `p_is_special` tinyint(4) NOT NULL,                      
             PRIMARY KEY (`p_id`)                                   
      ) ENGINE=InnoDB;

INSERT INTO newtable1 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT 
    p_status, p_pi_code, p_nats_id, p_is_special FROM tbl_product group by p_pi_code;

INSERT INTO newtable2 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT 
    p_status, p_pi_code, p_nats_id, p_is_special FROM newtable1 group by p_nats_id;

After that, all the duplicate values in those fields have been removed.
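
To see what the two passes do to the data, here is a rough Ruby sketch. The `products` rows are hypothetical and carry only the two relevant columns. `uniq` keeps the first row of each group, whereas MySQL may pick any row of a group when the other columns are not aggregated, so treat the surviving rows as illustrative.

```ruby
# Hypothetical rows standing in for tbl_product (only the two relevant columns).
products = [
  { p_pi_code: 'A1', p_nats_id: 10 },
  { p_pi_code: 'A1', p_nats_id: 11 },
  { p_pi_code: 'B2', p_nats_id: 10 },
  { p_pi_code: 'C3', p_nats_id: 12 }
]

# Pass 1 (tbl_product -> newtable1): one row per p_pi_code.
newtable1 = products.uniq { |row| row[:p_pi_code] }

# Pass 2 (newtable1 -> newtable2): one row per p_nats_id.
newtable2 = newtable1.uniq { |row| row[:p_nats_id] }

puts newtable2.map { |row| row[:p_pi_code] }.inspect
```

In this sample the B2 row disappears in the second pass even though its p_pi_code was unique, because it shares a p_nats_id with another row. It is worth checking that this is the behaviour you want before applying the two-pass approach.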

#8


0  

I had to do this recently on Oracle, but the steps would have been the same on MySQL. It was a lot of data, at least compared to what I'm used to working with, so my process to de-dup was comparatively heavyweight. I'm including it here in case someone else comes along with a similar problem.

My duplicate records had different IDs, different updated_at times, possibly different updated_by IDs, but all other columns the same. I wanted to keep the most recently updated of any duplicate set.

I used a combination of Rails logic and SQL to get it done.

Step one: run a rake script to identify the IDs of the duplicate records, using model logic. IDs go in a text file.

Step two: create a temporary table with one column, the IDs to delete, loaded from the text file.

Step three: create another temporary table with all the records I'm going to delete (just in case!).

CREATE TABLE temp_duplicate_models 
  AS (SELECT * FROM models 
  WHERE id IN (SELECT * FROM temp_duplicate_ids));

Step four: actual deleting.

DELETE FROM models WHERE id IN (SELECT * FROM temp_duplicate_ids);
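
The answer does not show the rake script from step one. The following is a hedged sketch of what its core logic might look like, not the author's actual code. It assumes rows are available as hashes, that `id`, `updated_at` and `updated_by` are the only columns allowed to differ between duplicates, and it uses a hypothetical `name` column as the remaining data.

```ruby
# Hypothetical rows standing in for the `models` table.
rows = [
  { id: 1, name: 'widget', updated_at: Time.utc(2021, 1, 1), updated_by: 7 },
  { id: 2, name: 'widget', updated_at: Time.utc(2021, 3, 1), updated_by: 9 },
  { id: 3, name: 'gadget', updated_at: Time.utc(2021, 2, 1), updated_by: 7 },
  { id: 4, name: 'widget', updated_at: Time.utc(2021, 2, 1), updated_by: 7 }
]

# Returns the IDs to delete: every row in a duplicate set except the
# most recently updated one. `ignored_keys` lists the columns that may
# differ between duplicates.
def ids_to_delete(rows, ignored_keys = %i[id updated_at updated_by])
  rows.group_by { |row| row.reject { |key, _value| ignored_keys.include?(key) } }
      .values
      .flat_map do |group|
        keeper = group.max_by { |row| row[:updated_at] }
        group.reject { |row| row.equal?(keeper) }.map { |row| row[:id] }
      end
end

# One ID per line, ready to be loaded into a one-column temporary table.
puts ids_to_delete(rows).sort
```

Writing the IDs out first, instead of deleting directly, is what makes the backup in step three possible: the same ID list drives both the copy and the delete.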

#9


0  

You can use:

http://lenniedevilliers.blogspot.com/2008/10/weekly-code-find-duplicates-in-sql.html

to get the duplicates and then just delete them via Ruby code or SQL code (I would do it in SQL code, but that's up to you :-)

#10


0  

If your table has a PK (or you can easily give it one), you can specify any number of columns in the table that must be equal for a row to qualify as a duplicate, using the following query (it may look a bit messy, but it works):

DELETE FROM table WHERE pk_id IN(
   SELECT DISTINCT t3.pk_id FROM (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t3, (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t4
   WHERE t3.col1 = t4.col1 AND t3.col2 = t4.col2 AND t3.col3 = t4.col3 AND
   t3.col4 = t4.col4 AND t3.pk_id > t4.pk_id
)

This will leave the first record entered into the database, deleting the 'newest' duplicates. If you want to keep the last record, switch the > to <.

#11


0  

In MySQL, when I ran something like

delete from A where IDA in (select IDA from A )

MySQL said something like "you can't use the same table in the select part of the delete operation."

I just had to delete some duplicate records, and I succeeded with a PHP program like this:

<?php
...
$res = hacer_sql("SELECT MIN(IDESTUDIANTE) AS IDTODELETE 
FROM `estudiante` GROUP BY `LASTNAME`,`FIRSTNAME`,`CI`,`PHONE`
HAVING COUNT(*) > 1");
while ( $reg = mysql_fetch_assoc($res) ) {
   hacer_sql("delete from estudiante where IDESTUDIANTE = {$reg['IDTODELETE']}");
}
?>
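
One thing to keep in mind with this approach is that each run removes only one row (the one with the smallest ID) per duplicated group, so a group of three needs two runs. The Ruby sketch below illustrates this on hypothetical in-memory rows. The `students` data and the `delete_min_id_per_group` helper are made up for the illustration, and only two of the four grouping columns are modelled.

```ruby
# Hypothetical rows standing in for the `estudiante` table.
students = [
  { id: 1, lastname: 'Perez', firstname: 'Ana' },
  { id: 2, lastname: 'Perez', firstname: 'Ana' },
  { id: 3, lastname: 'Perez', firstname: 'Ana' },
  { id: 4, lastname: 'Lopez', firstname: 'Luis' }
]

# One pass of the query-then-delete loop: for each duplicated group,
# remove the row with the smallest id.
def delete_min_id_per_group(rows)
  ids = rows.group_by { |row| [row[:lastname], row[:firstname]] }
            .values
            .select { |group| group.size > 1 }
            .map { |group| group.map { |row| row[:id] }.min }
  rows.reject { |row| ids.include?(row[:id]) }
end

# Repeat the pass until it no longer deletes anything.
passes = 0
loop do
  remaining = delete_min_id_per_group(students)
  break if remaining.size == students.size # nothing left to delete
  students = remaining
  passes += 1
end

puts "passes: #{passes}, remaining ids: #{students.map { |row| row[:id] }.inspect}"
```

Note also that deleting `MIN(id)` keeps the newest row of each group, which is the opposite of the self-join answers that keep the lowest ID.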

#12


0  

I am using ALTER TABLE (note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older versions):

ALTER IGNORE TABLE jos_city ADD UNIQUE INDEX(`city`);

#13


0  

I used @krukid's answer above to do the following on a table with around 70,000 entries:

rs = 'select a, b, count(*) as c from table_name group by 1, 2 having c > 1'

# select_all returns a result set whose rows are hashes keyed by column name
dups = MyModel.connection.select_all(rs)

# convert to an array of [a, b, c] triples
dupsarr = dups.map { |row| [row['a'], row['b'], row['c'].to_i] }

# delete all but one row of each duplicate group
conn = MyModel.connection
dupsarr.each do |a, b, c|
    conn.execute("delete from table_name where a=#{conn.quote(a)} and b=#{conn.quote(b)} limit #{c - 1}")
end

#14


0  

Here is the Rails solution I came up with. It may not be the most efficient, but that is not a big deal if it's a one-time migration.

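# lowest id of each group of rows that share the uniqueness columns
distinct_records = MyTable.group(:distinct_column_1, :distinct_column_2).minimum(:id).values
# reject (not reject!) so an array is returned even when nothing is removed
duplicates = MyTable.all.to_a.reject { |mt| distinct_records.include?(mt.id) }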
duplicates.each(&:destroy)

First, it groups by all the columns that determine uniqueness; the example shows two, but you could have more or fewer.

Second, it selects the inverse of that group: all the other records.

Third, it deletes all those records.

#15


0  

You could first group by the column on which you want to delete duplicates, but I am not using GROUP BY here; I am writing a self join.

You don't need to create the temporary table.

Delete all duplicates except one record. The table should have an auto-increment column. A possible solution that I've just come across:

DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name

if you want to keep the row with the lowest auto-increment id value, or

DELETE n1 FROM names n1, names n2 WHERE n1.id < n2.id AND n1.name = n2.name

if you want to keep the row with the highest auto increment id value.

You can cross-check your solution by looking for duplicates again:

SELECT name, COUNT(*) FROM `names` GROUP BY name HAVING COUNT(*) > 1;

If it returns 0 rows, then your query was successful.
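
The self-join condition can be read as "delete a row if some other row has the same name and a lower (or higher) id". The Ruby sketch below spells that out on hypothetical in-memory rows. The `names` data and the `duplicates?` helper are assumptions for illustration only; no SQL is executed.

```ruby
# Hypothetical rows standing in for the `names` table.
names = [
  { id: 1, name: 'alice' },
  { id: 2, name: 'bob' },
  { id: 3, name: 'alice' },
  { id: 4, name: 'alice' }
]

# Mirrors "DELETE n1 FROM names n1, names n2
#          WHERE n1.id > n2.id AND n1.name = n2.name":
# a row n1 is deleted when some other row n2 has the same name and a lower id.
keep_lowest = names.reject do |n1|
  names.any? { |n2| n1[:id] > n2[:id] && n1[:name] == n2[:name] }
end

# Flipping the comparison keeps the row with the highest id instead.
keep_highest = names.reject do |n1|
  names.any? { |n2| n1[:id] < n2[:id] && n1[:name] == n2[:name] }
end

# Cross-check: no name should appear more than once afterwards.
def duplicates?(rows)
  rows.map { |row| row[:name] }.uniq.size != rows.size
end

puts keep_lowest.map { |row| row[:id] }.inspect
puts keep_highest.map { |row| row[:id] }.inspect
puts duplicates?(keep_lowest)
```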
