Parallel bulk inserts with SqlBulkCopy and Azure

Date: 2021-09-24 13:54:06

I have an Azure app in the cloud with a SQL Azure database. I have a worker role which needs to do parsing + processing on a file (up to ~30 million rows), so I can't directly use BCP or SSIS.


I'm currently using SqlBulkCopy; however, this seems too slow, as I've seen load times of up to 4-5 minutes for 400k rows.

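For reference, a plain single-threaded SqlBulkCopy load looks roughly like the sketch below. The connection string, target table name, and the DataTable-building step are placeholders, not details from the question.

```csharp
using System.Data;
using System.Data.SqlClient;

class SingleBulkLoad
{
    static void Main()
    {
        // Placeholder: the worker role's parsing/processing step would build this from the file.
        DataTable rows = BuildRows();

        using (var connection = new SqlConnection("<azure-sql-connection-string>")) // placeholder
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.MyTable"; // assumed target table
                bulkCopy.BatchSize = 10000;   // rows per round trip to the server
                bulkCopy.BulkCopyTimeout = 0; // disable the client-side timeout for large loads
                bulkCopy.WriteToServer(rows);
            }
        }
    }

    // Hypothetical stand-in for the real parsing output.
    static DataTable BuildRows()
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Columns.Add("Value", typeof(string));
        for (int i = 0; i < 1000; i++)
            table.Rows.Add(i, "row " + i);
        return table;
    }
}
```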

I want to run my bulk inserts in parallel; however, reading through the articles on importing data in parallel / controlling lock behaviour, they say that SqlBulkCopy requires the table to have no clustered indexes and that a table lock (BU lock) needs to be specified. However, SQL Azure tables must have a clustered index...

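For what it's worth, the table-lock hint those articles refer to is requested through SqlBulkCopyOptions.TableLock; a hedged sketch (connection string and table name are placeholders) follows. Because SQL Azure forces a clustered index, this does not give the heap-only BU-lock behaviour that the parallel-import articles describe.

```csharp
using System.Data;
using System.Data.SqlClient;

static class TableLockLoad
{
    public static void Load(DataTable rows)
    {
        using (var connection = new SqlConnection("<azure-sql-connection-string>")) // placeholder
        {
            connection.Open();
            // TableLock requests the table-level bulk load lock; on a table with a
            // clustered index (mandatory in SQL Azure) this is not the heap BU lock.
            using (var bulkCopy = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
            {
                bulkCopy.DestinationTableName = "dbo.MyTable"; // assumed target table
                bulkCopy.WriteToServer(rows);
            }
        }
    }
}
```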

Is it even possible to use SqlBulkCopy in parallel on the same table in SQL Azure? If not, is there another API (that I can use in code) to do this?


2 Answers

#1 (score: 4)

I don't see how you can run any faster than using SqlBulkCopy. On our project we can import 250K rows in about 3 mins, so your rate seems about right.


I don't think that doing it in parallel would help, even if it was technically possible. We only run one import at a time; otherwise SQL Azure starts timing out our requests.


In fact, sometimes running a large group-by query at the same time as the import isn't possible. SQL Azure does a lot of work to ensure quality of service; this includes timing out requests that take too long, take too many resources, etc.


So doing several large bulk inserts at the same time will probably cause one to time out.


#2 (score: 1)

It is possible to run SqlBulkCopy in parallel against SQL Azure, even if you are loading the same table. You need to prepare your records in batches yourself before sending them to the SqlBulkCopy API. This absolutely helps with performance, and it lets you control retry operations for a smaller batch of records when you get throttled for reasons outside of your control.

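A minimal sketch of that batching approach (my reconstruction, not the author's sample code): split the parsed rows into fixed-size batches, load several batches concurrently, and retry only the batch that gets throttled. The connection string, table name, batch size, degree of parallelism, and retry policy are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

static class ParallelBulkLoad
{
    const string ConnectionString = "<azure-sql-connection-string>"; // placeholder

    public static void Load(DataTable allRows, int batchSize = 50000, int maxParallel = 4)
    {
        // Split the full row set into independent batches up front.
        var batches = new List<DataTable>();
        for (int start = 0; start < allRows.Rows.Count; start += batchSize)
        {
            var batch = allRows.Clone(); // same schema, no rows
            foreach (DataRow row in allRows.Rows.Cast<DataRow>().Skip(start).Take(batchSize))
                batch.ImportRow(row);
            batches.Add(batch);
        }

        // Load batches concurrently; each batch gets its own connection.
        Parallel.ForEach(batches,
            new ParallelOptions { MaxDegreeOfParallelism = maxParallel },
            batch => LoadBatchWithRetry(batch));
    }

    static void LoadBatchWithRetry(DataTable batch, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var connection = new SqlConnection(ConnectionString))
                {
                    connection.Open();
                    using (var bulkCopy = new SqlBulkCopy(connection))
                    {
                        bulkCopy.DestinationTableName = "dbo.MyTable"; // assumed target table
                        bulkCopy.BulkCopyTimeout = 0;
                        bulkCopy.WriteToServer(batch);
                    }
                }
                return; // batch loaded successfully
            }
            catch (SqlException) when (attempt < maxAttempts)
            {
                // Throttled or timed out: back off and retry just this batch.
                Task.Delay(TimeSpan.FromSeconds(5 * attempt)).Wait();
            }
        }
    }
}
```

For safe retries you would also want each batch load to be transactional (for example by running it under an explicit SqlTransaction), so a retried batch does not insert duplicate rows.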

Take a look at my blog post comparing load times of various approaches. There is sample code as well. In separate tests I was able to cut the load time of a table in half.


This is the technique I am using for a couple of tools (Enzo Backup, Enzo Data Copy). It's not a simple thing to do, but when done properly you can optimize load times significantly.

