Redshift: Serializable isolation violation on table

Date: 2021-08-16 23:07:06

I have a very large Redshift database that contains billions of rows of HTTP request data.

I have a table called requests which has a few important fields:

  • ip_address
  • city
  • state
  • country

I have a Python process running once per day, which grabs all distinct rows which have not yet been geocoded (do not have any city / state / country information), and then attempts to geocode each IP address via Google's Geocoding API.

This process (pseudocode) looks like this:

for ip_address in ips_to_geocode:
    country, state, city = geocode_ip_address(ip_address)
    execute_transaction('''
        UPDATE requests
        SET ip_country = %s, ip_state = %s, ip_city = %s
        WHERE ip_address = %s
    ''', (country, state, city, ip_address))

When running this code, I often receive errors like the following:

psycopg2.InternalError: 1023
DETAIL:  Serializable isolation violation on table - 108263, transactions forming the cycle are: 647671, 647682 (pid:23880)

I'm assuming this is because I have other processes constantly logging HTTP requests into my table, so when I attempt to execute my UPDATE statement, it is unable to select all rows with the IP address I'd like to update.

My question is this: what can I do to update these records in a sane way that will stop failing regularly?

3 Answers

#1


4  

Your code is violating the serializable isolation level of Redshift. You need to make sure that your code is not trying to open multiple transactions on the same table before closing all open transactions.

You can achieve this by locking the table in each transaction so that no other transaction can access the table for updates until the open transaction is closed. I'm not sure how your code is architected (synchronous or asynchronous), but this will increase the run time, as each lock forces the others to wait until the transaction completes.

Refer: http://docs.aws.amazon.com/redshift/latest/dg/r_LOCK.html
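As a sketch of how this could look from Python with psycopg2 (the `requests` table and its columns come from the question; `conn` is assumed to be an open connection to your cluster, and `update_geocode` is a hypothetical helper name):

```python
# Sketch: run each geocode UPDATE in its own LOCK-guarded transaction.
# `conn` is assumed to be an open psycopg2 connection to the cluster.

UPDATE_SQL = """
    UPDATE requests
    SET ip_country = %s, ip_state = %s, ip_city = %s
    WHERE ip_address = %s
"""

def update_geocode(conn, ip_address, country, state, city):
    # LOCK blocks concurrent writers to `requests` until this
    # transaction commits, so the UPDATE cannot form a serialization
    # cycle with the logging inserts; the cost is reduced throughput.
    with conn:                      # commit on success, rollback on error
        with conn.cursor() as cur:
            cur.execute("LOCK requests;")
            cur.execute(UPDATE_SQL, (country, state, city, ip_address))
```

Because the lock is only released at COMMIT or ROLLBACK, keeping each transaction short (a single UPDATE, as above) limits how long the logging processes are blocked.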

#2


1  

Just got the same issue on my code, and this is how I fixed it:

First things first, it is good to know that this error code means you are trying to do concurrent operations in Redshift. Issuing a second query against a table before an earlier query on it has finished, for example, is one case where you would get this kind of error (that was my case).

Good news is: there is a simple way to serialize Redshift operations! You just need to use the LOCK command. Here is the Amazon documentation for the Redshift LOCK command. It basically works by making the next operation wait until the previous one is finished. Note that using this command will naturally make your script a little slower.

In the end, the practical solution for me was to prepend the LOCK command to the query (in the same string, separated by a ';'). Something like this:

LOCK table_name; SELECT * from ...
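A small helper for this pattern might look like the following (a sketch; `with_table_lock` is a hypothetical name, not part of any library). Note that interpolating the table name into the string is only reasonable here because it is a fixed value from your own code, never user input:

```python
def with_table_lock(table, sql):
    # Prefix the query with a LOCK on the same table, in one statement
    # string, so the lock is taken in the same transaction as the query.
    return "LOCK {}; {}".format(table, sql)
```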

And you should be good to go! I hope it helps you.

#3


0  

Either start a new session when you do the second update on the same table, or 'commit' once your transaction is complete.

You can write set autocommit=on before you start updating.
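With psycopg2, for example, the same effect can be achieved by committing after every statement so that no transaction stays open between updates (a sketch; `update_all` is a hypothetical helper, and `conn` is assumed to be an open connection):

```python
def update_all(conn, rows):
    # Commit after every UPDATE so no long-lived transaction stays open
    # on `requests` while other processes keep inserting into it.
    with conn.cursor() as cur:
        for ip_address, country, state, city in rows:
            cur.execute(
                "UPDATE requests SET ip_country = %s, ip_state = %s, "
                "ip_city = %s WHERE ip_address = %s",
                (country, state, city, ip_address),
            )
            conn.commit()
```

Alternatively, setting `conn.autocommit = True` before updating makes every statement commit on its own.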
