I'm trying to design a bulk data import task using Django's ORM on top of MySQL. Normally, I'd simply use LOAD DATA INFILE, but the data I'm bulk importing spans three tables, and some of the records may already exist, so I have to check for pre-existing records, create or retrieve their IDs, and then use those IDs when creating or retrieving the other records.
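Roughly, the per-row lookup/create chain looks like the sketch below; the model names are placeholders, not my real schema:

from myapp.models import Company, Contact, Order

def import_row(row):
    # Reuse the company if one with this name already exists, otherwise create it.
    company, _ = Company.objects.get_or_create(name=row['company'])
    # The contact is keyed on the company's ID plus its own natural key.
    contact, _ = Contact.objects.get_or_create(company=company, email=row['email'])
    # The third table hangs off the contact in the same way.
    Order.objects.get_or_create(contact=contact, number=row['order_number'])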
By default, the import rate is 0.8 rows/sec. Quite horrible. I managed to bump this up to 1.5 rows/sec by running DISABLE KEYS on the affected tables, but as I have a few million rows, this is still way too slow.
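For reference, one way to issue the DISABLE KEYS statements is through Django's raw connection cursor, something like the sketch below (the table names are just examples, and DISABLE KEYS only skips maintenance of non-unique indexes, and only on MyISAM tables):

from django.db import connection

cursor = connection.cursor()
# Example table names; adjust to the actual tables being bulk-loaded.
for table in ('myapp_company', 'myapp_contact', 'myapp_order'):
    cursor.execute('ALTER TABLE %s DISABLE KEYS' % table)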
Is there any general advice for speeding up Django's ORM when bulk importing complex table relationships?
I was considering disabling Django's transaction management, in order to wrap the entire import in a single transaction. However, since the import takes so long, the import process periodically updates a status model to report percent completion. If I wrap the entire import in a single transaction, it won't be able to update this status record until the very end. So is there any way to disable transaction management for a specific set of models only, while still allowing a separate model to commit immediately?
I'd like to do something like:
from django.db import transaction
from myapp.models import Status, Data

transaction.enter_transaction_management()
transaction.managed(True)

status = Status.objects.get(id=123)
try:
    data = magically_get_data_iter()
    for row in data:
        # Look the record up by the row's values, creating it if necessary.
        d, _ = Data.objects.get_or_create(**row)
        d.save()  # not actually visible in admin until the commit below
        if not row.i % 100:
            status.current_row = row.i
            status.total_rows = row.total
            # obviously doesn't work, but this should somehow actually commit
            status.save(commit=True)
finally:
    transaction.commit()
1 Answer
#1
I solved this by placing the bulk-updated models and the model storing the status record in different databases, and then disabling transaction management for the former database.
e.g., a simplified version of my example above:
import django.db.transaction

django.db.transaction.enter_transaction_management(using='primary')
django.db.transaction.managed(True, using='primary')

i = 0
for record in records:
    i += 1
    r = PrimaryDBModel(**record)
    r.save()  # This will not be committed until the end.
    if not i % 100:
        # The status model lives on the other database, so this save
        # commits immediately instead of joining the open transaction.
        status = SecondaryDBModel(id=123)
        status.current_row = i
        status.save()  # This will be committed immediately.

django.db.transaction.commit(using='primary')
django.db.transaction.leave_transaction_management(using='primary')
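For completeness, this assumes both connections are declared in settings.py and that each model is routed to the right alias, either through a database router or an explicit using=... argument. A sketch, with the alias names from the snippet above (database names and credentials are placeholders):

# settings.py (sketch) -- two connections, one per role.
DATABASES = {
    'default': {  # holds the status model; commits as usual
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'status_db',
        'USER': 'import',
        'PASSWORD': '...',
    },
    'primary': {  # holds the bulk-imported models
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'import_db',
        'USER': 'import',
        'PASSWORD': '...',
    },
}

# Either route the models with a DATABASE_ROUTERS entry, or be explicit:
#   r.save(using='primary')
#   status.save(using='default')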