Continuing a transaction after a primary key violation error

I am doing a bulk insert of records into a database from a log file. Occasionally (~1 row out of every thousand) one of the rows violates the primary key and causes the transaction to fail. Currently, the user has to manually go through the file that caused the failure and remove the offending row before attempting to re-import. Given that there are hundreds of these files to import it is impractical.

My question: How can I skip the insertion of records that will violate the primary key constraint, without having to do a SELECT statement before each row to see if it already exists?

Note: I am aware of the very similar question #1054695, but it appears to be a SQL Server specific answer and I am using PostgreSQL (importing via Python/psycopg2).

4 Solutions

#1

You can also use SAVEPOINTs in a transaction.

Python-like pseudocode illustrating this from the application side:

error_count = 0
database.execute("BEGIN")
for data_row in input_data_dictionary:
    database.execute("SAVEPOINT bulk_savepoint")
    try:
        database.execute("INSERT", table, data_row)
    except Exception:
        # Undo only the failed insert; earlier rows stay in the transaction
        database.execute("ROLLBACK TO SAVEPOINT bulk_savepoint")
        log_error(data_row)
        error_count = error_count + 1
    else:
        database.execute("RELEASE SAVEPOINT bulk_savepoint")

if error_count > error_threshold:
    database.execute("ROLLBACK")
else:
    database.execute("COMMIT")

Edit: Here's an actual example of this in action in psql based on a slight variation of the example in the documentation (SQL statements prefixed by ">"):

> CREATE TABLE table1 (test_field INTEGER NOT NULL PRIMARY KEY);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "table1_pkey" for table "table1"
CREATE TABLE

> BEGIN;
BEGIN
> INSERT INTO table1 VALUES (1);
INSERT 0 1
> SAVEPOINT my_savepoint;
SAVEPOINT
> INSERT INTO table1 VALUES (1);
ERROR:  duplicate key value violates unique constraint "table1_pkey"
> ROLLBACK TO SAVEPOINT my_savepoint;
ROLLBACK
> INSERT INTO table1 VALUES (3);
INSERT 0 1
> COMMIT;
COMMIT
> SELECT * FROM table1;  
 test_field 
------------
          1
          3
(2 rows)

Note that the value 3 was inserted after the error, but still inside the same transaction!

The documentation for SAVEPOINT is at http://www.postgresql.org/docs/8.4/static/sql-savepoint.html.
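
The pseudocode above maps fairly directly onto a DB-API driver such as psycopg2. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the function name insert_skipping_duplicates and the savepoint name bulk_row are made up for illustration, and the caller passes in the exception class the driver raises on a key violation (with psycopg2 that would be psycopg2.IntegrityError).

```python
def insert_skipping_duplicates(conn, insert_sql, rows, duplicate_error):
    """Insert rows in one transaction, skipping those that violate a key.

    conn            -- an open DB-API connection supplied by the caller
    insert_sql      -- parameterized INSERT statement for the target table
    rows            -- iterable of parameter tuples, one per row
    duplicate_error -- exception class the driver raises on a key violation
    Returns (inserted_count, skipped_rows).
    """
    inserted = 0
    skipped = []
    cur = conn.cursor()
    for row in rows:
        # Mark a point we can return to if this one row fails.
        cur.execute("SAVEPOINT bulk_row")
        try:
            cur.execute(insert_sql, row)
        except duplicate_error:
            # Undo only the failed INSERT; earlier rows are kept.
            cur.execute("ROLLBACK TO SAVEPOINT bulk_row")
            skipped.append(row)
        else:
            inserted += 1
        # The savepoint still exists after ROLLBACK TO, so release it either way.
        cur.execute("RELEASE SAVEPOINT bulk_row")
    conn.commit()
    return inserted, skipped
```

With psycopg2, a call might look like insert_skipping_duplicates(conn, "INSERT INTO log_entries (id, message) VALUES (%s, %s)", rows, psycopg2.IntegrityError), where log_entries is a hypothetical table. The returned list of skipped rows can be logged instead of editing the input file by hand.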

#2

I would use a stored procedure to catch the exceptions on your unique violations. Example:

CREATE OR REPLACE FUNCTION my_insert(i_foo text, i_bar text)
  RETURNS boolean LANGUAGE plpgsql AS
$BODY$
BEGIN
    INSERT INTO foo(x, y) VALUES (i_foo, i_bar);
    RETURN true;       -- row inserted
EXCEPTION
    WHEN unique_violation THEN
        RETURN false;  -- duplicate key: skip this row
END;
$BODY$;

SELECT my_insert('value 1','another value');
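
From the Python side, the function can be called once per row. The sketch below is illustrative only: import_rows is a hypothetical helper, cursor is assumed to be a psycopg2 cursor, and my_insert is assumed to return true when the row was inserted and false when a duplicate was skipped.

```python
def import_rows(cursor, rows):
    """Call my_insert once per row and count the outcomes.

    Assumption: my_insert(text, text) returns true when the row was
    inserted and false when a duplicate key was skipped.
    """
    inserted = 0
    skipped = 0
    for foo_value, bar_value in rows:
        cursor.execute("SELECT my_insert(%s, %s)", (foo_value, bar_value))
        (ok,) = cursor.fetchone()  # single boolean column
        if ok:
            inserted += 1
        else:
            skipped += 1
    return inserted, skipped
```

Because the error is caught inside the function, the surrounding transaction is not aborted and the caller needs no savepoint handling of its own.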

#3

You can roll back the whole transaction, or roll back to a savepoint set just before the code that raises the exception (cr is the cursor):

import uuid

name = uuid.uuid1().hex
cr.execute('SAVEPOINT "%s"' % name)
try:
    pass  # your failing query goes here
except Exception:
    cr.execute('ROLLBACK TO SAVEPOINT "%s"' % name)
    # your alternative code goes here
else:
    cr.execute('RELEASE SAVEPOINT "%s"' % name)

This code assumes there is a running transaction; otherwise you would not receive that error message.

The Django PostgreSQL backend creates cursors directly from psycopg. Maybe in the future they will add a proxy class for the Django cursor, similar to Odoo's cursor. Odoo extends the cursor with the following code (self is the cursor):

@contextmanager
@check
def savepoint(self):
    """context manager entering in a new savepoint"""
    name = uuid.uuid1().hex
    self.execute('SAVEPOINT "%s"' % name)
    try:
        yield
    except Exception:
        self.execute('ROLLBACK TO SAVEPOINT "%s"' % name)
        raise
    else:
        self.execute('RELEASE SAVEPOINT "%s"' % name)

With that context manager your code becomes simpler:

try:
    with cr.savepoint():
        pass  # your failing query goes here
except Exception:
    pass  # your alternative code goes here

The code is also more readable, because the savepoint handling is no longer written inline.
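
If your cursor class has no such method, the same idea works as a standalone helper. This is a minimal sketch modeled on the Odoo snippet, not code taken from Django or psycopg: the free function savepoint(cursor) is hypothetical and only assumes a DB-API cursor with an execute method.

```python
import uuid
from contextlib import contextmanager


@contextmanager
def savepoint(cursor):
    """Run the enclosed block inside a uniquely named savepoint."""
    name = uuid.uuid1().hex  # hex digits only, so safe to embed in the SQL
    cursor.execute('SAVEPOINT "%s"' % name)
    try:
        yield
    except Exception:
        # Undo the block's work, then let the caller decide what to do.
        cursor.execute('ROLLBACK TO SAVEPOINT "%s"' % name)
        raise
    else:
        cursor.execute('RELEASE SAVEPOINT "%s"' % name)
```

For the bulk import, each INSERT would go inside "with savepoint(cr):", wrapped in a try/except that catches the driver's integrity error (psycopg2.IntegrityError with psycopg2) and records the skipped row.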

#4

Or you can use SSIS and have the failed rows take a different path than the successful ones.

Since you are using a different database, can you bulk insert the files into a staging table and then use SQL to select only those records that do not have an existing id?
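
The staging-table idea could look roughly like the following. It is a sketch under assumptions: the tables log_entries (target, with primary key id) and log_staging (same columns, no constraints) and the helper merge_from_staging are all hypothetical. The GROUP BY also collapses ids that appear more than once within the staging data itself.

```python
# Copy only rows whose id is not already present in the target table.
MERGE_SQL = (
    "INSERT INTO log_entries (id, message) "
    "SELECT s.id, MIN(s.message) "
    "FROM log_staging AS s "
    "WHERE NOT EXISTS (SELECT 1 FROM log_entries AS t WHERE t.id = s.id) "
    "GROUP BY s.id"
)


def merge_from_staging(conn):
    """Move new rows from log_staging into log_entries.

    Returns the number of rows inserted; the staging table is emptied
    afterwards so it can be reused for the next file.
    """
    cur = conn.cursor()
    cur.execute(MERGE_SQL)
    inserted = cur.rowcount
    cur.execute("DELETE FROM log_staging")
    conn.commit()
    return inserted
```

Loading each file into log_staging is left out here; with psycopg2 that step would typically use COPY, for example through cursor.copy_from. Since the staging table has no primary key, the load itself cannot fail on duplicates.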
