Fast insert; BulkCopy with relational data

Time: 2022-03-06 16:59:03

I have a large amount of constantly incoming data (roughly 10,000 a minute, and growing) that I want to insert into a database as efficiently as possible. At the moment I'm using prepared insert statements, but am thinking of using the SqlBulkCopy class to import the data in larger chunks.

The problem is that I'm not inserting into a single table - elements of the data item are inserted into numerous tables, and their identity columns are used as foreign keys in other rows that are inserted at the same time. I understand that bulk copies aren't meant to allow for more complex inserts like this, but I wonder if it is worth exchanging my identity columns (bigints in this case) for uniqueidentifier columns. This will allow me to do a couple of bulk copies for each table, and since I can determine the IDs before the insert, I don't need to check for anything like SCOPE_IDENTITY which is preventing me from using bulk copy.

Does this sound like a viable solution, or are there other potential issues I might face? Or, is there another way I can insert data quickly, but retain my use of bigint identity columns?

Thanks.

2 answers

#1

It sounds like you are planning on exchanging a "SQL assigns a [bigint identity() column] surrogate key" methodology for a "data prep routine assigns a GUID surrogate key" methodology. In other words, the key will not be assigned within SQL, but from outside SQL. Given your volumes, if the data-generating process can assign the surrogate key, I'd definitely go with that.

The question then becomes: must you use GUIDs, or can your data-generation process produce auto-incrementing integers? Creating such a process that works consistently and infallibly is hard (one reason why you pay $$$ for SQL Server), but the trade-off for smaller and more human-legible keys within the database might be worth it.
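
To make the "keys assigned outside SQL" idea concrete, here is a minimal Python sketch, not a definitive implementation. It assumes each loader has already reserved a block of integer keys somehow (how that reservation is made safe across processes is the hard part mentioned above and is not shown). The names `KeyBlock`, `build_batches`, and the order/line tables are hypothetical and only serve the illustration.

```python
class KeyBlock:
    """Hands out integer keys from a block reserved ahead of time (assumption)."""

    def __init__(self, first, size):
        self._next = first
        self._end = first + size

    def take(self):
        if self._next >= self._end:
            raise RuntimeError("reserved key block exhausted")
        value = self._next
        self._next += 1
        return value


def build_batches(items, order_keys, line_keys):
    """Split incoming items into per-table row lists with keys already set."""
    order_rows, line_rows = [], []
    for item in items:
        order_id = order_keys.take()
        order_rows.append((order_id, item["customer"]))
        for sku, qty in item["lines"]:
            # The child row carries the parent's key as its foreign key,
            # so no round trip to the database is needed to learn it.
            line_rows.append((line_keys.take(), order_id, sku, qty))
    return order_rows, line_rows


# Hypothetical incoming data: two parent items with three child lines in total.
items = [
    {"customer": "acme", "lines": [("widget", 2), ("gadget", 1)]},
    {"customer": "globex", "lines": [("widget", 5)]},
]
order_rows, line_rows = build_batches(items, KeyBlock(1000, 100), KeyBlock(5000, 100))
```

Each resulting list is shaped for one bulk copy per table, parents first. The same structure would work with GUIDs instead of integers by swapping what `take` returns.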

#2

uniqueidentifier will probably make things worse: more page splits, and wider keys. See this

If your load is/can be batched, one option is to:

  • load a staging table
  • load the real tables in one go as a stored procedure
  • use a uniqueidentifier in the staging table for each batch

We deal with peaks of around 50k rows per second (and increasing) this way. (We actually use a separate staging database to avoid double transaction log writes.)
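
The staging steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the setup described in this answer: it uses Python's built-in `sqlite3` as a stand-in for SQL Server, a Python function in place of the stored procedure, and a hypothetical `order_ref` column (a client-side reference for each parent) to join child rows back to the generated integer keys. All table and column names are made up.

```python
import sqlite3
import uuid

# Hypothetical schema: one staging table plus two "real" tables whose
# integer keys are generated by the database.
SCHEMA = """
CREATE TABLE staging (batch_id TEXT, order_ref TEXT, customer TEXT, sku TEXT, qty INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_ref TEXT UNIQUE, customer TEXT);
CREATE TABLE order_lines (
    line_id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    sku TEXT,
    qty INTEGER
);
"""


def stage_batch(conn, rows):
    """Tag every row of one batch with a fresh unique identifier and stage it."""
    batch_id = str(uuid.uuid4())
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?, ?, ?)",
        [(batch_id, *row) for row in rows],
    )
    return batch_id


def load_batch(conn, batch_id):
    """Move one staged batch into the real tables with set-based statements."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (order_ref, customer) "
            "SELECT DISTINCT order_ref, customer FROM staging WHERE batch_id = ?",
            (batch_id,),
        )
        # Join back on order_ref to pick up the keys the database generated.
        conn.execute(
            "INSERT INTO order_lines (order_id, sku, qty) "
            "SELECT o.order_id, s.sku, s.qty FROM staging AS s "
            "JOIN orders AS o ON o.order_ref = s.order_ref WHERE s.batch_id = ?",
            (batch_id,),
        )
        conn.execute("DELETE FROM staging WHERE batch_id = ?", (batch_id,))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
batch_id = stage_batch(conn, [
    ("A-1", "acme", "widget", 2),
    ("A-1", "acme", "gadget", 1),
    ("B-7", "globex", "widget", 5),
])
load_batch(conn, batch_id)
```

In this sketch the real tables keep their database-generated integer keys, and the unique identifier only lives in the staging table to mark which rows belong to which batch.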
