How do I find duplicates based on two fields?

I have rows in an Oracle database table which should be unique for a combination of two fields, but the unique constraint is not set up on the table, so I need to find all rows which violate the constraint myself using SQL. Unfortunately my meager SQL skills aren't up to the task.

My table has three columns which are relevant: entity_id, station_id, and obs_year. For each row the combination of station_id and obs_year should be unique, and I want to find out if there are rows which violate this by flushing them out with an SQL query.

I have tried the following SQL (suggested by this previous question) but it doesn't work for me (I get ORA-00918 column ambiguously defined):

SELECT
entity_id, station_id, obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

Can someone suggest what I'm doing wrong, and/or how to solve this?

8 Solutions

#1

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

#2

SELECT entity_id, station_id, obs_year
FROM mytable t1
WHERE EXISTS (SELECT 1 FROM mytable t2 WHERE
       t1.station_id = t2.station_id
       AND t1.obs_year = t2.obs_year
       AND t1.RowId <> t2.RowId)

#3

Change the three fields in the initial SELECT so that each is qualified with the table alias:

SELECT
t1.entity_id, t1.station_id, t1.obs_year

#4

Re-write of your query

SELECT
t1.entity_id, t1.station_id, t1.obs_year
FROM
mytable t1
INNER JOIN (
SELECT entity_id, station_id, obs_year FROM mytable 
GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year

I think the ambiguous column error (ORA-00918) was because you were selecting columns whose names appear in both the table and the subquery, but you did not specify whether you wanted them from dupes or from mytable (aliased as t1).
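
To see the effect outside Oracle, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The sample rows are made up, and SQLite reports its own error text rather than ORA-00918, but the cause is the same: an unqualified column name that exists on both sides of the join.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (entity_id INTEGER, station_id INTEGER, obs_year INTEGER)"
)
# Hypothetical sample data: entities 1 and 2 share (station_id, obs_year).
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, 2001), (2, 10, 2001), (3, 11, 2001)],
)

JOIN_CLAUSE = """
FROM mytable t1
INNER JOIN (
    SELECT station_id, obs_year FROM mytable
    GROUP BY station_id, obs_year HAVING COUNT(*) > 1
) dupes
ON t1.station_id = dupes.station_id AND t1.obs_year = dupes.obs_year
"""


def run(select_list):
    """Run the join with the given select list; return rows or the error text."""
    try:
        sql = "SELECT " + select_list + JOIN_CLAUSE + " ORDER BY 1"
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return "error: " + str(exc)


# station_id exists in both t1 and dupes, so the unqualified name is rejected.
unqualified = run("station_id, obs_year")
# Prefixing every column with t1 removes the ambiguity.
qualified = run("t1.entity_id, t1.station_id, t1.obs_year")

print(unqualified)
print(qualified)
```

Only the select list differs between the two calls, which is why qualifying the columns is enough to make the original query parse.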

#5

Could you not create a new table that includes the unique constraint, and then copy across the data row by row, ignoring failures?
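
A rough sketch of that idea, again using Python's sqlite3 module in place of Oracle. The table name mytable_clean and the sample rows are hypothetical. Note that this approach only surfaces the second and later rows of each duplicate group, not the first one that was copied successfully.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (entity_id INTEGER, station_id INTEGER, obs_year INTEGER)"
)
# Hypothetical sample data: entities 1 and 2 share (station_id, obs_year).
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, 2001), (2, 10, 2001), (3, 11, 2001), (4, 11, 2002)],
)

# New table (hypothetical name) carrying the constraint the original lacks.
conn.execute(
    """
    CREATE TABLE mytable_clean (
        entity_id  INTEGER,
        station_id INTEGER,
        obs_year   INTEGER,
        UNIQUE (station_id, obs_year)
    )
    """
)

# Copy row by row; rows that violate the constraint are collected, not copied.
rejected = []
source_rows = conn.execute(
    "SELECT entity_id, station_id, obs_year FROM mytable ORDER BY entity_id"
).fetchall()
for row in source_rows:
    try:
        conn.execute("INSERT INTO mytable_clean VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

kept = conn.execute(
    "SELECT entity_id FROM mytable_clean ORDER BY entity_id"
).fetchall()

print("rejected:", rejected)
print("kept:", kept)
```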

#6

You need to specify the table for the columns in the main select. Also, assuming entity_id is the unique key for mytable and is irrelevant to finding duplicates, you should not be grouping on it in the dupes subquery.

Try:

SELECT t1.entity_id, t1.station_id, t1.obs_year
FROM mytable t1
INNER JOIN (
SELECT station_id, obs_year FROM mytable 
GROUP BY station_id, obs_year HAVING COUNT(*) > 1) dupes 
ON 
t1.station_id = dupes.station_id AND
t1.obs_year = dupes.obs_year
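
To illustrate why the grouping columns matter, here is a small sketch with Python's sqlite3 module standing in for Oracle. The sample rows are invented, and entity_id is assumed to be unique per row, as described above.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle (assumption: illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (entity_id INTEGER, station_id INTEGER, obs_year INTEGER)"
)
# Hypothetical sample data: entity_id is unique, but entities 1 and 2
# share the same (station_id, obs_year) pair.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, 10, 2001), (2, 10, 2001), (3, 11, 2001)],
)

# Grouping on entity_id as well: every group holds exactly one row,
# so HAVING COUNT(*) > 1 filters everything out.
with_entity = conn.execute(
    """
    SELECT entity_id, station_id, obs_year FROM mytable
    GROUP BY entity_id, station_id, obs_year HAVING COUNT(*) > 1
    """
).fetchall()

# Grouping only on the columns that should be unique together.
without_entity = conn.execute(
    """
    SELECT station_id, obs_year, COUNT(*) FROM mytable
    GROUP BY station_id, obs_year HAVING COUNT(*) > 1
    """
).fetchall()

print(with_entity)
print(without_entity)
```

With a unique entity_id in the GROUP BY, the subquery can never report a duplicate, so the join in the original query would return nothing even once the ambiguity error is fixed.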

#7

SELECT  *
FROM    (
        SELECT  t.*, ROW_NUMBER() OVER (PARTITION BY station_id, obs_year ORDER BY entity_id) AS rn
        FROM    mytable t
        )
WHERE   rn > 1

The query above, by Quassnoi, is the most efficient for large tables. Here is the cost analysis I ran:

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
WHERE EXISTS (SELECT 1 FROM trn_refil_book b WHERE
       a.dist_code = b.dist_code and a.book_date = b.book_date and a.book_no = b.book_no
       AND a.RowId <> b.RowId)
       ;

gave a cost of 1322341

SELECT a.dist_code, a.book_date, a.book_no
FROM trn_refil_book a
INNER JOIN (
SELECT b.dist_code, b.book_date, b.book_no FROM trn_refil_book b 
GROUP BY b.dist_code, b.book_date, b.book_no HAVING COUNT(*) > 1) c 
ON 
 a.dist_code = c.dist_code and a.book_date = c.book_date and a.book_no = c.book_no
;

gave a cost of 1271699

while

SELECT  dist_code, book_date, book_no
FROM    (
        SELECT  t.dist_code, t.book_date, t.book_no, ROW_NUMBER() OVER (PARTITION BY t.book_date, t.book_no
          ORDER BY t.dist_code) AS rn
        FROM    trn_refil_book t
        ) p
WHERE   p.rn > 1
;

gave a cost of 1021984

The table was not indexed.

#8

  SELECT entity_id, station_id, obs_year
    FROM mytable
GROUP BY entity_id, station_id, obs_year
HAVING COUNT(*) > 1

List the fields you want to check for duplicates in both the SELECT and the GROUP BY.

It works by using GROUP BY to collect rows that share the same values in the specified columns into one group. The HAVING COUNT(*) > 1 clause says that we are only interested in groups that occur more than once (and are therefore duplicates).
