I'm sure this is commonplace, but Google is not helping. I am trying to write a simple stored procedure in PostgreSQL 9.1 that will remove duplicate entries from a parent table, cpt. The parent table cpt is referenced by a child table lab, defined as:
CREATE TABLE lab (
recid serial NOT NULL,
cpt_recid integer,
........
CONSTRAINT cs_cpt FOREIGN KEY (cpt_recid)
REFERENCES cpt (recid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE RESTRICT,
...
);
The biggest problem I'm having is how to obtain the record that failed, so that I can use it in the EXCEPTION clause to move the child rows from lab to one acceptable key, then loop back through and delete the unnecessary records from the cpt table.
Here is the (very wrong) code:
CREATE OR REPLACE FUNCTION h_RemoveDuplicateCPT()
  RETURNS void AS
$BODY$
BEGIN
   LOOP
      BEGIN
         DELETE FROM cpt
         WHERE recid IN (
            SELECT recid
            FROM (
               SELECT recid,
                      row_number() over (partition BY cdesc ORDER BY recid) AS rnum
               FROM cpt) t
            WHERE t.rnum > 1)
         RETURNING recid;
         IF count = 0 THEN
            RETURN;
         END IF;
      EXCEPTION WHEN foreign_key_violation THEN
         RAISE NOTICE 'fixing unique_violation';
         RAISE NOTICE 'recid is %' , recid;
      END;
   END LOOP;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
2 Answers
#1
5
You can do this much more efficiently with a single SQL statement with data-modifying CTEs.
No function required (but possible, of course), no looping, no exception handling:
WITH plan AS (
   SELECT recid, cdesc, min(recid) OVER (PARTITION BY cdesc) AS master_recid
   FROM   cpt
   )
, upd_lab AS (
   UPDATE lab l
   SET    cpt_recid = p.master_recid   -- link to master recid ...
   FROM   plan p
   WHERE  l.cpt_recid = p.recid
   AND    p.recid <> p.master_recid    -- ... only if not linked to master
   )
DELETE FROM cpt c
USING  plan p
WHERE  c.recid = p.recid
AND    p.recid <> p.master_recid       -- ... only if not master
RETURNING c.recid;                     -- optionally return all deleted (dupe) IDs
SQL Fiddle.
This should be much faster and cleaner. Looping is comparatively expensive; exception handling is even more expensive.
More importantly, references in lab are redirected to the respective master row in cpt automatically, which your original code did not do yet. So you can delete all dupes at once.
You can wrap this in a plpgsql or SQL function if you like.
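As a minimal sketch of such a wrapper, the statement can be placed in a plain SQL function that returns the deleted IDs. The function name f_dedupe_cpt is an assumption made for illustration, not something from the question:

```sql
-- Hypothetical wrapper: the name f_dedupe_cpt is an assumption.
-- Returns the recid of every duplicate row removed from cpt.
CREATE OR REPLACE FUNCTION f_dedupe_cpt()
  RETURNS SETOF integer AS
$func$
WITH plan AS (
   SELECT recid, min(recid) OVER (PARTITION BY cdesc) AS master_recid
   FROM   cpt
   )
, upd_lab AS (
   UPDATE lab l
   SET    cpt_recid = p.master_recid   -- redirect children to the master row
   FROM   plan p
   WHERE  l.cpt_recid = p.recid
   AND    p.recid <> p.master_recid
   )
DELETE FROM cpt c
USING  plan p
WHERE  c.recid = p.recid
AND    p.recid <> p.master_recid       -- keep the master row itself
RETURNING c.recid;
$func$ LANGUAGE sql VOLATILE;

-- Usage: lists the deleted duplicate IDs, one per row.
SELECT * FROM f_dedupe_cpt();
```

A plain SQL function is enough here because the body is a single statement; plpgsql would only be needed for extra procedural logic around it.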
Explanation
- In the first CTE, plan, identify the master row per group of dupes. In your case, that is the row with the minimum recid per cdesc.
- In the second CTE, upd_lab, redirect all rows referencing a dupe to the master row in cpt.
- Finally, delete the dupes. This does not raise exceptions because depending rows are linked to the remaining master row virtually at the same time.
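To check what these steps would do before changing anything, a read-only preview along the following lines may help. It reuses the same master-row logic; the output column names dupe_recid and lab_rows_to_redirect are made up for this sketch:

```sql
-- Dry run: lists each duplicate, the master row it would collapse into,
-- and how many lab rows would be redirected. Modifies nothing.
SELECT p.recid        AS dupe_recid
     , p.master_recid
     , p.cdesc
     , (SELECT count(*)
        FROM   lab l
        WHERE  l.cpt_recid = p.recid) AS lab_rows_to_redirect
FROM  (
   SELECT recid, cdesc, min(recid) OVER (PARTITION BY cdesc) AS master_recid
   FROM   cpt
   ) p
WHERE  p.recid <> p.master_recid       -- only the dupes, not the masters
ORDER  BY p.master_recid, p.recid;
```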
ON DELETE RESTRICT
All CTEs and the main query of a statement operate on the same snapshot of underlying tables, virtually concurrently. They don't see each other's effects on underlying tables:
- PostgreSQL: using foreign keys, delete parent if it's not referenced by any other child
One might expect an FK constraint with ON DELETE RESTRICT to raise exceptions because, per documentation:
Referential actions other than the NO ACTION check cannot be deferred, even if the constraint is declared deferrable.
However, the above statement is a single command and, per documentation:
A constraint that is not deferrable will be checked immediately after every command.
Bold emphasis mine. You only need to be aware of concurrent transactions writing to the same tables, but that's a general consideration, not specific to this task.
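If concurrent writers are a real concern, one possible approach (an assumption on my part, not something the question asks for) is to lock both tables for the duration of the cleanup:

```sql
BEGIN;

-- SHARE ROW EXCLUSIVE blocks concurrent INSERT / UPDATE / DELETE on both
-- tables until COMMIT, while plain SELECTs can still proceed.
LOCK TABLE cpt, lab IN SHARE ROW EXCLUSIVE MODE;

-- Run the WITH plan ... DELETE statement from this answer here.

COMMIT;
```

Whether the extra locking is worth it depends on how busy the tables are; for a one-off cleanup in a quiet period it is likely unnecessary.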
An exception applies to UNIQUE and PRIMARY KEY constraints, but that does not concern this case:
- Constraint defined DEFERRABLE INITIALLY IMMEDIATE is still DEFERRED?
#2
1
You can select all duplicates once and loop over the result with a record variable. You'll have access to the whole current record. The function below may serve as an example:
create or replace function show_remove_duplicates_in_cpt ()
returns setof text language plpgsql
as $$
declare
    rec record;
begin
    for rec in
        select * from (
            select
                recid, cdesc,
                row_number() over (partition by cdesc order by recid) as rnum
            from cpt
        ) alias
        where rnum > 1
    loop
        return next format ('fixing foreign key for %s %s %s', rec.recid, rec.cdesc, rec.rnum);
        return next format ('deleting from cpt where recid = %s', rec.recid);
    end loop;
end $$;
select * from show_remove_duplicates_in_cpt ();
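The function above only reports what it would do. A sketch of a variant that actually performs the two steps per duplicate could look like the following. The function name remove_duplicates_in_cpt and the choice of the lowest recid per cdesc as the surviving key are assumptions for illustration:

```sql
-- Hypothetical variant: the name and the "lowest recid wins" rule are assumptions.
-- Returns the number of duplicate cpt rows removed.
create or replace function remove_duplicates_in_cpt ()
returns integer language plpgsql
as $$
declare
    rec     record;
    deleted integer := 0;
begin
    for rec in
        select recid, master_recid
        from (
            select
                recid,
                min(recid) over (partition by cdesc) as master_recid
            from cpt
        ) sub
        where recid <> master_recid
    loop
        -- move child rows to the surviving key first ...
        update lab set cpt_recid = rec.master_recid where cpt_recid = rec.recid;
        -- ... so the delete no longer violates the foreign key
        delete from cpt where recid = rec.recid;
        deleted := deleted + 1;
    end loop;
    return deleted;
end $$;

select remove_duplicates_in_cpt ();
```

Because the children are redirected before each delete, no exception handling is needed in the loop.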