So I have a Python script that goes through roughly 350,000 data objects and, depending on some tests, needs to update the row that represents each of those objects in a MySQL db. I'm using pymysql, as I've had the least trouble with it, especially when sending over large SELECT queries (SELECT statements with a WHERE column IN (....) clause that can contain 100,000+ values).
Since each update for each row can be different, each update statement is different. For example, for one row we might want to update first_name, but for another row we want to leave first_name untouched and update last_name.
This is why I don't want to use the cursor.executemany() method: it takes one generic update statement and you then feed it the values, but as I mentioned, each update is different, so one generic statement doesn't really work for my case. I also don't want to send 350,000 update statements individually over the wire. Is there any way I can package all of my update statements together and send them at once?
I tried putting them all in one query and passing it to the cursor.execute() method, but it doesn't seem to update all the rows.
2 Answers
#1
4
SQL #1: CREATE TABLE t with whatever columns you might need to change. Make all of them NULL (as opposed to NOT NULL).
SQL #2: Do a bulk INSERT (or LOAD DATA) of all the changes needed. E.g., if changing only first_name, fill in id and first_name, but leave the other columns NULL.
SQL #3-14:
UPDATE real_table
JOIN t ON t.id = real_table.id
SET real_table.first_name = t.first_name
WHERE t.first_name IS NOT NULL;
# ditto for each other column.
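From the Python side, the three steps above might be driven roughly as follows. This is a minimal sketch, not something tested against a live server: the column list, the helper names (staging_rows, update_statements, apply_changes) and the shape of `changes` (a dict mapping row id to the columns that changed) are assumptions made for illustration, `conn` is assumed to be an open pymysql connection, and table t is assumed to exist already.

```python
# Hypothetical column list; use whichever columns your tests can change.
COLUMNS = ["first_name", "last_name"]


def staging_rows(changes, columns=COLUMNS):
    """Turn {id: {column: new_value}} into tuples for the staging table t.

    Columns that a row does not change become None, which is sent as NULL.
    """
    return [
        (row_id, *(changed.get(col) for col in columns))
        for row_id, changed in changes.items()
    ]


def update_statements(columns=COLUMNS):
    """Build one UPDATE ... JOIN per column (SQL #3 onward)."""
    # Column names are interpolated, so they must come from a trusted, fixed list.
    return [
        f"UPDATE real_table JOIN t ON t.id = real_table.id "
        f"SET real_table.{col} = t.{col} "
        f"WHERE t.{col} IS NOT NULL"
        for col in columns
    ]


def apply_changes(conn, changes, columns=COLUMNS):
    """Load all changes into t (SQL #2), then run the per-column UPDATEs."""
    placeholders = ", ".join(["%s"] * (len(columns) + 1))
    insert_sql = f"INSERT INTO t (id, {', '.join(columns)}) VALUES ({placeholders})"
    with conn.cursor() as cur:
        cur.executemany(insert_sql, staging_rows(changes, columns))
        for sql in update_statements(columns):
            cur.execute(sql)
    conn.commit()
```

executemany() is usable here even though every update is different, because the INSERT into t has the same shape for every row; the differences are carried by which columns are NULL.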
All SQLs except #1 will be time-consuming. And, since UPDATE needs to build an undo log, it could time out or otherwise be problematic. See a discussion of chunking if necessary.
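One way to chunk, sketched under assumptions: walk the id range in fixed-size slices and commit after each slice, so that no single UPDATE touches more than a bounded number of rows. The chunk size of 10,000, the helper names, and the assumption that id is a reasonably dense integer key are illustrative choices, not part of the answer; `conn` is assumed to be an open pymysql connection.

```python
def id_ranges(min_id, max_id, chunk_size=10_000):
    """Yield inclusive (low, high) id pairs that cover min_id..max_id."""
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


def update_in_chunks(conn, column, min_id, max_id, chunk_size=10_000):
    """Run the per-column UPDATE ... JOIN one id slice at a time."""
    # `column` must come from a trusted, fixed list: it is interpolated into the SQL.
    sql = (
        f"UPDATE real_table JOIN t ON t.id = real_table.id "
        f"SET real_table.{column} = t.{column} "
        f"WHERE t.{column} IS NOT NULL AND t.id BETWEEN %s AND %s"
    )
    with conn.cursor() as cur:
        for low, high in id_ranges(min_id, max_id, chunk_size):
            cur.execute(sql, (low, high))
            conn.commit()  # keep each transaction (and its undo log) small
```

Committing per slice trades atomicity for smaller transactions: if the script stops halfway, some slices are already applied, so the run should be safe to repeat.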
If necessary, use functions such as COALESCE(), GREATEST(), IFNULL(), etc.
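For instance, COALESCE() can fold the per-column statements into a single UPDATE, because COALESCE(t.col, real_table.col) keeps the existing value wherever the row in t holds NULL. The builder below is only a sketch; the function name and the column list are hypothetical, and the trade-off is that this form cannot set a column to NULL on purpose.

```python
def coalesce_update(columns):
    """Build one UPDATE that applies every non-NULL value from t at once."""
    # Column names are interpolated, so they must come from a trusted, fixed list.
    assignments = ", ".join(
        f"real_table.{col} = COALESCE(t.{col}, real_table.{col})" for col in columns
    )
    return f"UPDATE real_table JOIN t ON t.id = real_table.id SET {assignments}"


print(coalesce_update(["first_name", "last_name"]))
```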
Mass UPDATEs usually imply poor schema design.
(If Ryan jumps in with an 'Answer' instead of just a 'Comment', he should probably get the 'bounty'.)
#2
5
You will get the best performance if you can encode your "tests" into the SQL logic itself, so that everything boils down to a handful of UPDATE statements. Or at least get as many as possible done that way, so that fewer rows need to be updated individually.
For example:
UPDATE tablename set firstname = [some logic]
WHERE [logic that identifies which rows need the firstname updated];
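To make the template concrete, here is a runnable sketch of one test pushed into SQL. The test itself (a first name needs filling in when it is missing or empty) and the people table are invented for illustration, and Python's built-in sqlite3 module stands in for MySQL so the snippet runs without a server; the UPDATE uses only IS NULL and an equality check, which should carry over to MySQL unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO people (id, first_name, last_name) VALUES (?, ?, ?)",
    [(1, None, "Lee"), (2, "Bob", "Ray"), (3, "", "Doe")],
)

# The "test" lives in the WHERE clause; the new value is set in SET.
cur = conn.execute(
    "UPDATE people SET first_name = 'Unknown' "
    "WHERE first_name IS NULL OR first_name = ''"
)
conn.commit()

updated = cur.rowcount  # number of rows the single statement changed
names = [row[0] for row in conn.execute("SELECT first_name FROM people ORDER BY id")]
```

One statement replaces a Python loop that would otherwise inspect each object and issue its own UPDATE.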
You don't describe much about your tests, so it's hard to be sure. But you can typically get quite a lot of logic into your WHERE clause with a little bit of work.
Another option would be to put your logic into a stored procedure. You'll still be doing 350,000 updates, but at least they aren't all "going over the wire". I would use this only as a last resort, though; business logic should be kept in the application layer whenever possible, and stored procedures make your application less portable.