How do I remove duplicates from a MySQL database?

Posted: 2021-10-20 00:14:37

I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over).

How do I remove all duplicates, except one, so I can add a UNIQUE key to the title column after?

8 answers

#1


78  

This command adds a unique key and drops every row that would violate it, which removes the duplicates. (table is a reserved word in MySQL, so it has to be quoted with backticks.)

ALTER IGNORE TABLE `table` ADD UNIQUE KEY idx1 (title);

Edit: Note that this command may not work for InnoDB tables for some versions of MySQL. See this post for a workaround. (Thanks to "an anonymous user" for this information.)

#2


8  

Create a new table with just the distinct rows of the original table. There may be other ways but I find this the cleanest.

CREATE TABLE tmp_table AS SELECT DISTINCT [....] FROM main_table

More specifically: the faster way is to insert the distinct rows into a temporary table. Using DELETE, it took me a few hours to remove duplicates from a table of 8 million rows. Using INSERT and DISTINCT, it took just 13 minutes.

CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
-- Empty the original table rather than dropping it, so the copy-back below still has a target
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;

#3


0  

This shows how to do it in SQL Server 2000. I'm not completely familiar with MySQL syntax, but I'm sure there's something comparable.

create table #titles (iid int identity (1, 1), title varchar(200))

-- Repeat this step many times to create duplicates
insert into #titles(title) values ('bob')
insert into #titles(title) values ('bob1')
insert into #titles(title) values ('bob2')
insert into #titles(title) values ('bob3')
insert into #titles(title) values ('bob4')


DELETE T  FROM 
#titles T left join 
(
  select title, min(iid) as minid from #titles group by title
) D on T.title = D.title and T.iid = D.minid
WHERE D.minid is null

Select * FROM #titles
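
For readers on MySQL, the script above might translate roughly as follows. This is only a sketch under a few assumptions: titles is a hypothetical regular table (as far as I know, MySQL cannot refer to a TEMPORARY table twice in one statement, so the #titles temp table is not carried over), AUTO_INCREMENT stands in for identity (1, 1), and the sample rows are made up so that duplicates actually exist.

```sql
-- Hypothetical demo table standing in for #titles
CREATE TABLE titles (
    iid   INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200)
);

-- Made-up sample rows, including duplicates
INSERT INTO titles (title) VALUES ('bob'), ('bob'), ('bob1'), ('bob1'), ('bob2');

-- Keep the lowest iid per title and delete every other row.
-- The grouped derived table D is materialized, so the statement
-- does not read from the table it is deleting from directly.
DELETE T
FROM titles AS T
LEFT JOIN (
    SELECT title, MIN(iid) AS minid
    FROM titles
    GROUP BY title
) AS D ON T.title = D.title AND T.iid = D.minid
WHERE D.minid IS NULL;

-- One row per title should remain (the one with the lowest iid)
SELECT * FROM titles;
```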

#4


0  

DELETE FROM student WHERE student_id IN (
    SELECT DISTINCT s1.`student_id`
    FROM student AS s1 INNER JOIN student AS s2
    WHERE s1.`sex` = s2.`sex`
      AND s1.`student_id` > s2.`student_id`
      AND s1.`sex` = 'M'
    ORDER BY s1.`student_id` ASC
)

#5


0  

The solution posted by Nitin seems to be the most elegant and logical one.

However, it has one issue:

ERROR 1093 (HY000): You can't specify target table 'student' for update in FROM clause

This can be resolved by using (SELECT * FROM student) instead of student:

DELETE FROM student WHERE student_id IN (
    SELECT DISTINCT s1.`student_id`
    FROM (SELECT * FROM student) AS s1 INNER JOIN (SELECT * FROM student) AS s2
    WHERE s1.`sex` = s2.`sex`
      AND s1.`student_id` > s2.`student_id`
      AND s1.`sex` = 'M'
    ORDER BY s1.`student_id` ASC
)

Give your +1's to Nitin for coming up with the original solution.

#6


0  

Since MySQL's ALTER IGNORE TABLE has been deprecated (and removed as of MySQL 5.7.4), you need to actually delete the duplicate data before adding an index.

First, write a query that finds all the duplicates. Here I'm assuming that email is the field that contains duplicates.

SELECT
    s1.email,
    s1.id,
    s1.created,
    s2.id,
    s2.created
FROM 
    student AS s1 
INNER JOIN 
    student AS s2 
WHERE 
    /* Emails are the same */
    s1.email = s2.email AND
    /* DON'T select both accounts,
       only select the one created later.
       The serial id could also be used here */
    s2.created > s1.created 
;

Next, select only the distinct ids of the duplicate rows:

SELECT 
    DISTINCT s2.id
FROM 
    student AS s1 
INNER JOIN 
    student AS s2 
WHERE 
    s1.email = s2.email AND
    s2.created > s1.created 
;

Once you are sure the result contains only the duplicate ids you want to delete, run the delete. You have to wrap the table in (SELECT * FROM tblname) so that MySQL doesn't complain about reading from the table it is deleting from.

DELETE FROM
    student 
WHERE
    id
IN (
    SELECT 
        DISTINCT s2.id
    FROM 
        (SELECT * FROM student) AS s1 
    INNER JOIN 
        (SELECT * FROM student) AS s2 
    WHERE 
        s1.email = s2.email AND
        s2.created > s1.created 
);

Then create the unique index:

ALTER TABLE
    student
ADD UNIQUE INDEX
    idx_student_unique_email(email)
;

#7


0  

The query below deletes all the duplicates except the row with the lowest "id" value:

DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id > t2.id AND t1.name = t2.name;

Similarly, we can keep the row with the highest "id" value instead:

DELETE t1 FROM table_name t1, table_name t2 WHERE t1.id < t2.id AND t1.name = t2.name;
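
To try the "keep the lowest id" variant on a small scale before touching real data, a self-contained sketch could look like this. The table titles_demo, its columns, the index name idx_title, and the sample rows are all made up for illustration; the column is called title here to match the original question rather than name.

```sql
-- Hypothetical demo table; names and rows are invented for this example
CREATE TABLE titles_demo (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(100) NOT NULL
);

INSERT INTO titles_demo (title) VALUES ('a'), ('b'), ('a'), ('c'), ('b'), ('a');

-- Delete every row for which another row with the same title and a lower id exists
DELETE t1
FROM titles_demo AS t1
INNER JOIN titles_demo AS t2
    ON t1.title = t2.title
   AND t1.id > t2.id;

-- With the duplicates gone, the unique key can be added
ALTER TABLE titles_demo ADD UNIQUE KEY idx_title (title);

-- One row per title should remain
SELECT id, title FROM titles_demo ORDER BY id;
```

On a table with hundreds of thousands of rows, the self-join will probably be slow unless the compared column already has an ordinary (non-unique) index, so adding one first may be worth testing.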

#8


0  

Deleting duplicates from MySQL tables is a common problem that usually comes with specific needs. In case anyone is interested, here (Remove duplicate rows in MySQL) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases).

In this case, something like this should work:

-- create a new temporary table
CREATE TABLE tmp_table1 LIKE table1;

-- add a unique constraint on the column that must not repeat
ALTER TABLE tmp_table1 ADD UNIQUE(title);

-- scan over the table to insert entries; rows whose title was already inserted are skipped
INSERT IGNORE INTO tmp_table1 SELECT * FROM table1 ORDER BY id;

-- rename tables
RENAME TABLE table1 TO backup_table1, tmp_table1 TO table1;
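
Before dropping backup_table1, it may be worth checking the result. The two queries below are a sketch that assumes the statements above have already been run, so that table1 and backup_table1 both exist; the aliases copies, rows_before, and rows_after are made up.

```sql
-- Returns no rows if every title now appears exactly once
SELECT title, COUNT(*) AS copies
FROM table1
GROUP BY title
HAVING COUNT(*) > 1;

-- Compare row counts to see how many rows were left behind in the backup
SELECT
    (SELECT COUNT(*) FROM backup_table1) AS rows_before,
    (SELECT COUNT(*) FROM table1)        AS rows_after;
```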
