I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
- Finds duplicates based on SOME_COLUMN, giving its repetition count.
- Sorts in desc order of repetition, which is useful to quickly scan major dups.
- Chooses a random value for all remaining columns, giving me an idea of values in those columns.
Similar query in Postgres greets me with an error:
column "MY_TABLE.SOME_COLUMN" must appear in the GROUP BY clause or be used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
4 Solutions
#1
12
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too, with the ANSI_QUOTES SQL mode enabled). That is, if your table is in fact named "MY_TABLE" (all upper case). If you (more wisely) named it my_table (all lower case), then you can remove the double quotes or use lower case.
Also, I use ct instead of count as the alias, because it is bad practice to use function names as identifiers.
Simple case
This would work with PostgreSQL 9.1:
SELECT *, count(id) ct
FROM my_table
GROUP BY primary_key_column(s)
ORDER BY ct DESC;
It requires the primary key column(s) in the GROUP BY clause. Starting with version 9.1, PostgreSQL accepts ungrouped columns in the SELECT list when they are functionally dependent on a grouped primary key. The results are identical to those of the MySQL query, but ct would always be 1 (or 0 if id IS NULL), which is useless for finding duplicates.
Group by other than primary key columns
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query, and you can still use *.
SELECT DISTINCT ON (1, some_column)
count(*) OVER (PARTITION BY some_column) AS ct
,*
FROM my_table
ORDER BY 1 DESC, some_column, id, col1;
This works because DISTINCT ON (PostgreSQL specific), like DISTINCT (SQL standard), is applied after the window function count(*) OVER (...). Window functions (with the OVER clause) require PostgreSQL 8.4 or later. They were not available in MySQL when this was written; MySQL added them in version 8.0.
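To make the order of evaluation explicit, the same result can be sketched without DISTINCT ON by computing the window functions in a subquery and filtering afterwards. This is only a sketch: it assumes the my_table(some_column, id, col1) layout used in this thread, and the names sub and rn are made up for illustration.

```sql
-- Sketch only: assumes the my_table(some_column, id, col1) layout used in this thread.
SELECT some_column, id, col1, ct
FROM  (
   SELECT *
        , count(*)     OVER (PARTITION BY some_column) AS ct  -- rows per some_column value
        , row_number() OVER (PARTITION BY some_column
                             ORDER BY id, col1)        AS rn  -- position within the group
   FROM   my_table
   ) sub
WHERE  rn = 1            -- keep one representative row per group
ORDER  BY ct DESC, some_column;
```

The window functions run first inside the subquery, and the outer WHERE then keeps one row per some_column value, which is the role DISTINCT ON plays in the query above. Since it uses only standard window functions, this form should also port to other databases that support them.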
Works with any table, regardless of primary or unique constraints.
The 1 in DISTINCT ON and ORDER BY is just shorthand for the ordinal number of the item in the SELECT list.
SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
- Select first row in each GROUP BY group?
count(*) vs. count(id)
If you are looking for duplicates, you are better off with count(*) than with count(id). There is a subtle difference if id can be NULL, because count(id) does not count NULL values, while count(*) counts all rows. If id is defined NOT NULL, results are the same, but count(*) is generally more appropriate (and slightly faster, too).
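A small illustration of the difference, using an inline VALUES list instead of a real table. The aliases t, all_rows and non_null_ids are made up for this sketch.

```sql
-- Three rows, one of which has a NULL id.
SELECT count(*)  AS all_rows      -- counts every row
     , count(id) AS non_null_ids  -- skips rows where id IS NULL
FROM  (VALUES (1), (NULL), (3)) AS t(id);
-- all_rows = 3, non_null_ids = 2
```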
#2
3
Here's another approach, uses DISTINCT ON:
select
distinct on(ct, some_column)
*,
count(id) over(PARTITION BY some_column) as ct
from my_table x
order by ct desc, some_column, id
Data source:
CREATE TABLE my_table (some_column int, id int, col1 int);
INSERT INTO my_table VALUES
(1, 3, 4)
,(2, 4, 1)
,(2, 5, 1)
,(3, 6, 4)
,(3, 7, 3)
,(4, 8, 3)
,(4, 9, 4)
,(5, 10, 1)
,(5, 11, 2)
,(5, 11, 3);
Output:
SOME_COLUMN  ID  COL1  CT
5            10  1     3
2            4   1     2
3            6   4     2
4            8   3     2
1            3   4     1
Live test: http://www.sqlfiddle.com/#!1/e2509/1
DISTINCT ON documentation: http://www.postgresonline.com/journal/archives/4-Using-Distinct-ON-to-return-newest-order-for-each-customer.html
#3
1
MySQL allows non-aggregated selected columns to be omitted from the GROUP BY list, which it executes by returning the first row found for each unique combination of the grouped columns. This is non-standard SQL behaviour.
Postgres, on the other hand, is SQL standard compliant.
There is no equivalent query in Postgres.
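For comparison, a standard-compliant duplicate count lists only the grouped column and aggregates in the SELECT list. A minimal sketch, assuming the my_table, some_column and id names from the question; sample_id is a hypothetical alias. It does not return the remaining columns the way SELECT * does in MySQL.

```sql
-- Sketch: assumes the names used in the question (my_table, some_column, id).
SELECT some_column
     , count(*) AS ct         -- how many rows share this value
     , min(id)  AS sample_id  -- one representative id per group (hypothetical alias)
FROM   my_table
GROUP  BY some_column
HAVING count(*) > 1           -- drop this line to list non-duplicated values as well
ORDER  BY ct DESC;
```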
#4
1
Here is a self-joined CTE, which allows you to use select *. key0 is the intended unique key; {key1,key2} are the additional key elements needed to address the currently non-unique rows. Use at your own risk, YMMV.
WITH zcte AS (
SELECT DISTINCT tt.key0
, MIN(tt.key1) AS key1
, MIN(tt.key2) AS key2
, COUNT(*) AS cnt
FROM ztable tt
GROUP BY tt.key0
HAVING COUNT(*) > 1
)
SELECT zt.*
, zc.cnt AS cnt
FROM ztable zt
JOIN zcte zc ON zc.key0 = zt.key0 AND zc.key1 = zt.key1 AND zc.key2 = zt.key2
ORDER BY zt.key0, zt.key1,zt.key2
;
BTW: to get the intended behaviour for the OP, the HAVING COUNT(*) > 1 clause should be omitted.