How can I select and/or delete all but one row of each set of duplicates in a table?

Time: 2021-10-22 12:54:25

Let's say I have a MySQL table with four columns:

ID DRIVER_ID CAR_ID NOTES (NULL for most rows)

I have a bunch of duplicate rows where DRIVER_ID and CAR_ID are the same. For each pair of DRIVER_ID and CAR_ID, I want one row. If one of the rows in the set has non-NULL NOTES, I want that one, but otherwise it doesn't matter.

So if I have:

ID  |  DRIVER_ID  |  CAR_ID  |  NOTES
1      1             1          NULL
2      1             1          NULL
3      1             2          NULL
4      1             2          NULL
5      2             3          NULL
6      2             3          NULL
7      2             3          NULL
8      2             3          hi
9      3             5          NULL

I want to keep the following IDs: 9, 8, and then one each of [3,4] and [1,2].

It's a huge table, and the clunky methods I've tried are insanely slow, to the point where I'm sure I'm going about it all wrong. How can I efficiently a) select the list of IDs to delete? b) delete them in the same query?

(And yes, I know the deal with composite keys. That's not an issue here.)

EDIT: Sorry, forgot to specify that this was MySQL.

Some of the stuff I've tried so far:

select ID, COUNT(DRIVER_ID) rowcount from CARS_DRIVERS group by CAR_ID,DRIVER_ID HAVING rowcount > 1;

This gets me one ID per duplicate group. It doesn't necessarily leave the row with NOTES if there is one, though. It also returns only one ID per group, and some groups have 20+ duplicates, so I would need to run it over and over to whittle each group down to a single row.

select distinct t1.ID from CARS_DRIVERS t1 where exists (select * from CARS_DRIVERS t2 where t2.CAR_ID = t1.CAR_ID and t2.DRIVER_ID = t1.DRIVER_ID and t2.id > t1.id);

This is much slower, and still doesn't really address the NOTES issue. It does have the advantage of getting the oldest row for each group, which, if I can't isolate on the NOTES field easily, could be a proxy for that. If a row in a set has NOTES, I believe it's always the oldest one (one with the lowest ID), but I'm not certain.

Some additional context: DRIVER_ID and CAR_ID are not the real column names, and there are other columns in the table. I was trying to distill down the info to get at the root of the problem, but I see from W4M's comment that this makes it look like a homework assignment. The real deal is that I'm looking at a very unoptimized database (not my purview normally) and when trying to get rid of these dupes before adding a key, the operation is taking forever. As in, hours. The table is big but certainly doesn't justify that. I'm trying to pitch in with my limited SQL expertise and figure out a way to get this done. Doesn't matter if it's pretty, I can sit at the command line and brute-force a bunch of queries if necessary. But I noticed that SELECTing IDs that are candidates for deletion only takes a few seconds, and although the table is huge, the total number of rows to delete is less than 10k so there must be a way to make this happen without some script that takes a whole weekend to finish.

2 solutions

#1


7  

Here's one solution. I tested this on MySQL 5.5.8.

SELECT MAX(COALESCE(c2.id, c1.id)) AS id,
 c1.driver_id, c1.car_id,
 c2.notes AS notes
FROM cars_drivers AS c1
LEFT OUTER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c2.notes IS NOT NULL
GROUP BY c1.driver_id, c1.car_id, c2.notes;

I include c2.notes as a GROUP BY key because you might have more than one row with non-null notes per values of driver_id,car_id.

Result using your example data:

+------+-----------+--------+-------+
| id   | driver_id | car_id | notes |
+------+-----------+--------+-------+
|    2 |         1 |      1 | NULL  |
|    4 |         2 |      1 | NULL  |
|    8 |         3 |      2 | hi    |
|    9 |         5 |      3 | NULL  |
+------+-----------+--------+-------+
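
To try the keep-list query without touching a real table, here is a minimal sketch in Python using the standard `sqlite3` module. It is a sandbox under assumptions, not the MySQL setup from the answer: the table is created in SQLite with the question's column order (id, driver_id, car_id, notes), the row-value comparison is spelled out as two plain equality tests, the result column is renamed `keep_id`, and the names `SAMPLE_ROWS`, `KEEP_QUERY` and `rows_to_keep` are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (id, driver_id, car_id, notes).
SAMPLE_ROWS = [
    (1, 1, 1, None), (2, 1, 1, None), (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None),
]

# Same shape as the answer's SELECT. The tuple comparison is written as two
# equality tests so it does not depend on row-value support in SQLite.
KEEP_QUERY = """
SELECT MAX(COALESCE(c2.id, c1.id)) AS keep_id,
       c1.driver_id, c1.car_id, c2.notes
FROM cars_drivers AS c1
LEFT OUTER JOIN cars_drivers AS c2
  ON c1.driver_id = c2.driver_id
 AND c1.car_id = c2.car_id
 AND c2.notes IS NOT NULL
GROUP BY c1.driver_id, c1.car_id, c2.notes
ORDER BY keep_id
"""


def rows_to_keep(rows):
    """Load rows into an in-memory table and return one row per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars_drivers ("
        "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
    )
    conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(KEEP_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rows_to_keep(SAMPLE_ROWS):
        print(row)
```

With the question's sample rows this returns ids 2, 4, 8 and 9, one per (driver_id, car_id) pair, with the 'hi' row chosen for its group.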

Regarding deleting: in your example data, the row you want to keep is always the one with the highest id value per driver_id & car_id. If you can depend on that, you can do a multi-table delete that deletes every row for which another row exists with the same driver_id & car_id and a higher id value:

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

This naturally skips any cases where only one row exists with a given pair of driver_id & car_id values, because the conditions of the inner join require two rows with different id values.
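
The effect of that condition can be checked on the sample rows with a small Python sketch using the standard `sqlite3` module. This is a stand-in under assumptions rather than the MySQL statement itself: SQLite has no multi-table DELETE, so the same condition is expressed as a correlated EXISTS subquery, and the names `DELETE_LOWER_IDS` and `surviving_ids` are hypothetical. MySQL rejects a subquery on the delete target written this way, so the JOIN form is still the one to run there.

```python
import sqlite3

# Sample rows from the question: (id, driver_id, car_id, notes).
SAMPLE_ROWS = [
    (1, 1, 1, None), (2, 1, 1, None), (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None),
]

# SQLite stand-in for the multi-table DELETE: remove a row when another row
# with the same driver_id and car_id has a higher id.
DELETE_LOWER_IDS = """
DELETE FROM cars_drivers
WHERE EXISTS (
    SELECT 1
    FROM cars_drivers AS c2
    WHERE c2.driver_id = cars_drivers.driver_id
      AND c2.car_id = cars_drivers.car_id
      AND c2.id > cars_drivers.id
)
"""


def surviving_ids(rows):
    """Apply the delete to an in-memory copy and return the ids left over."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars_drivers ("
        "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
    )
    conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)
    conn.execute(DELETE_LOWER_IDS)
    ids = [r[0] for r in conn.execute("SELECT id FROM cars_drivers ORDER BY id")]
    conn.close()
    return ids


if __name__ == "__main__":
    print(surviving_ids(SAMPLE_ROWS))
```

On the sample rows the survivors are ids 2, 4, 8 and 9, and a group with a single row (id 9) is left alone because no higher id exists for it.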

But if you can't depend on the latest id per group being the one you want to keep, the solution is more complex. It's probably more complex than it's worth to solve in one statement, so do it in two statements.

I tested this too, after adding a couple more rows for testing:

INSERT INTO cars_drivers VALUES (10,2,3,NULL), (11,2,3,'bye');

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  5 |      2 |         3 | NULL  |
|  6 |      2 |         3 | NULL  |
|  7 |      2 |         3 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 10 |      2 |         3 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

First delete rows with null notes, where a row with non-null notes exists.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id)
WHERE c1.notes IS NULL AND c2.notes IS NOT NULL;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  1 |      1 |         1 | NULL  |
|  2 |      1 |         1 | NULL  |
|  3 |      1 |         2 | NULL  |
|  4 |      1 |         2 | NULL  |
|  8 |      2 |         3 | hi    |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+

Second, delete all but the highest-id row from each group of duplicates.

DELETE c1 FROM cars_drivers AS c1 INNER JOIN cars_drivers AS c2
 ON (c1.driver_id,c1.car_id) = (c2.driver_id,c2.car_id) AND c1.id < c2.id;

+----+--------+-----------+-------+
| id | car_id | driver_id | notes |
+----+--------+-----------+-------+
|  2 |      1 |         1 | NULL  |
|  4 |      1 |         2 | NULL  |
|  9 |      3 |         5 | NULL  |
| 11 |      2 |         3 | bye   |
+----+--------+-----------+-------+
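
Since the question also asks for the list of IDs to delete, the two rules (prefer a row with notes, then prefer the highest id) can be folded into one selection and the delete driven from that list. The following is a minimal sketch in Python with the standard `sqlite3` module, not something tested on MySQL: the temporary table `ids_to_delete` and the function name are assumptions, and the rows use the question's column order (id, driver_id, car_id, notes) plus the two extra test rows.

```python
import sqlite3

# Question's sample rows plus the two extra test rows (ids 10 and 11),
# as (id, driver_id, car_id, notes).
TEST_ROWS = [
    (1, 1, 1, None), (2, 1, 1, None), (3, 1, 2, None), (4, 1, 2, None),
    (5, 2, 3, None), (6, 2, 3, None), (7, 2, 3, None), (8, 2, 3, "hi"),
    (9, 3, 5, None), (10, 2, 3, None), (11, 2, 3, "bye"),
]

# A row is a delete candidate when its group holds a "better" row: one with
# notes when this row has none, or one with a higher id when both rows are
# alike in having or lacking notes.
COLLECT_IDS = """
CREATE TEMP TABLE ids_to_delete AS
SELECT DISTINCT c1.id
FROM cars_drivers AS c1
JOIN cars_drivers AS c2
  ON c1.driver_id = c2.driver_id
 AND c1.car_id = c2.car_id
WHERE (c1.notes IS NULL AND c2.notes IS NOT NULL)
   OR ((c1.notes IS NULL) = (c2.notes IS NULL) AND c1.id < c2.id)
"""


def dedupe(rows):
    """Return (ids selected for deletion, ids remaining after the delete)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars_drivers ("
        "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
    )
    conn.executemany("INSERT INTO cars_drivers VALUES (?, ?, ?, ?)", rows)
    conn.execute(COLLECT_IDS)
    doomed = [r[0] for r in conn.execute("SELECT id FROM ids_to_delete ORDER BY id")]
    conn.execute("DELETE FROM cars_drivers WHERE id IN (SELECT id FROM ids_to_delete)")
    left = [r[0] for r in conn.execute("SELECT id FROM cars_drivers ORDER BY id")]
    conn.close()
    return doomed, left


if __name__ == "__main__":
    print(dedupe(TEST_ROWS))
```

On the eleven test rows this selects ids 1, 3, 5, 6, 7, 8 and 10 for deletion and leaves 2, 4, 9 and 11, the same rows the two-statement approach ends with. Keeping the candidate list in its own table also gives a chance to inspect it before anything is deleted.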

#2


-1  

Since this is very similar to homework I will not give the answer. You want to do a left join and/or issue a distinct query.

http://dev.mysql.com/doc/refman/5.0/en/distinct-optimization.html

EDIT: Completely untested:

select distinct(t1.car_id) from cars_drivers t1 where t1.car_id = t1.driver_id and notes IS NOT NULL;

This handles the case where you want notes. If that list comes back empty, run this instead:

select distinct(t1.car_id) from cars_drivers t1 where t1.car_id = t1.driver_id;
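
A point that is easy to trip over when filtering on notes: in SQL, comparing a column to NULL with `=` or `!=` evaluates to NULL rather than true, so such a filter matches no rows, while `IS NOT NULL` matches the rows that have notes. The minimal sketch below shows the difference on two rows using Python's standard `sqlite3` module; the function name is hypothetical and SQLite stands in for MySQL here.

```python
import sqlite3


def notes_filter_counts():
    """Count rows matched by `notes != NULL` versus `notes IS NOT NULL`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cars_drivers ("
        "id INTEGER PRIMARY KEY, driver_id INTEGER, car_id INTEGER, notes TEXT)"
    )
    # One row without notes and one row with notes.
    conn.executemany(
        "INSERT INTO cars_drivers VALUES (?, ?, ?, ?)",
        [(7, 2, 3, None), (8, 2, 3, "hi")],
    )
    # The comparison yields NULL for every row, so nothing passes the filter.
    not_equal = conn.execute(
        "SELECT COUNT(*) FROM cars_drivers WHERE notes != NULL"
    ).fetchone()[0]
    # IS NOT NULL is the predicate that finds the row with notes.
    is_not_null = conn.execute(
        "SELECT COUNT(*) FROM cars_drivers WHERE notes IS NOT NULL"
    ).fetchone()[0]
    conn.close()
    return not_equal, is_not_null


if __name__ == "__main__":
    print(notes_filter_counts())
```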
