SQL: Inner joining two huge tables.

Time: 2022-12-20 15:35:05

I have two massive tables with about 100 million records each, and I'm afraid I need to perform an Inner Join between the two. Now, both tables are very simple; here's the description:

BioEntity table:

  • BioEntityId (int)
  • Name (nvarchar 4000, although this is overkill)
  • TypeId (int)

EGM table (an auxiliary table, in fact, resulting from bulk import operations):

  • EMGId (int)
  • PId (int)
  • Name (nvarchar 4000, although this is overkill)
  • TypeId (int)
  • LastModified (date)

I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long, the log file of the database (in simple recovery mode) managed to chew up all the available disk space (just over 200 GB, when the database occupies 18 GB), and the query would fail after waiting for two days, if I'm not mistaken. I managed to keep the log from growing (only 33 MB now), but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.

I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.

Now, here's the query:

 SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
 FROM EGM INNER JOIN BioEntity 
 ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).

It has been running for 6 days now and I'm still not sure what to do... Any ideas guys? How can I make this faster (or, at least, finite)?

Update: - OK, I've canceled the query and rebooted the server to get the OS up and running again. I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on.

Update 2 (1PM GMT, 18/11/09): - The estimated execution plan reveals a 67% cost for table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact just raised my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discarding the crappy suggested ones.

16 Answers

#1


6  

For huge joins, sometimes explicitly choosing a loop join speeds things up:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER LOOP JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

As always, posting your estimated execution plan could help us provide better answers.

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)

#2


15  

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table and computing a hash on the NAME field; that should narrow the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.

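As a minimal sketch of what that could look like on the existing tables, assuming CHECKSUM is an acceptable hash here (the column and index names below are made up for illustration):

-- Add a persisted, computed hash of the name to each table
ALTER TABLE BioEntity ADD NameHash AS CHECKSUM(Name) PERSISTED;
ALTER TABLE EGM ADD NameHash AS CHECKSUM(Name) PERSISTED;

-- Index the hash together with TypeId so the join can seek on integers
CREATE NONCLUSTERED INDEX IX_BioEntity_NameHash ON BioEntity (TypeId, NameHash);
CREATE NONCLUSTERED INDEX IX_EGM_NameHash ON EGM (TypeId, NameHash);

-- Join on the integer hash first, then recheck the real name to weed out collisions
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
    ON EGM.TypeId = BioEntity.TypeId
    AND EGM.NameHash = BioEntity.NameHash
    AND EGM.Name = BioEntity.Name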

#3


6  

Maybe a bit offtopic, but: " I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."

This behavior is characteristic of a cheap RAID5 array (or maybe of a single disk) while copying gigabytes of information (and your query mostly copies data).

More about the problem - can't you partition your query into smaller blocks, like names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transactional/locking overhead.

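For example, something along these lines (just a sketch, assuming a pre-created AUX table and an equality join; how you slice the names would depend on their actual distribution):

-- Process one slice of names at a time and append it to the result table
INSERT INTO AUX (Name, BioEntityId)
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name LIKE 'A%'   -- repeat for 'B%', 'C%', ... in separate batches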

#4


5  

First, 100M-row joins are not at all unreasonable or uncommon.

However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.

One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).

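One rough way to test that, for example, is to run just the join and discard the rows, so you measure the read/join cost without the cost of writing AUX:

-- Same join as in the question, but with no INTO: nothing is written, only read and matched
SELECT COUNT(*)
FROM EGM INNER JOIN BioEntity
    ON EGM.Name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId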

#5


4  

I'd try removing the 'LIKE' operator, as you don't seem to be doing any wildcard matching.

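In other words, a plain equality join (the same query as in the question, with only the operator swapped):

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId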

#6


3  

As recommended, I would hash the name to make the join more reasonable. If it is possible, I would also strongly consider assigning the id during the batch import through a lookup, since this would eliminate the need to do the join later (and potentially having to repeat such an inefficient join).

I see you have this index on the TypeID - this would help immensely if this is at all selective. In addition, add the column with the hash of the name to the same index:

SELECT EGM.Name
       ,BioEntity.BioEntityId
INTO AUX 
FROM EGM 
INNER JOIN BioEntity  
    ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
    AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
    AND EGM.name LIKE BioEntity.Name

#7


2  

Another suggestion I might offer is to tune your query against a subset of the data instead of processing all 100M rows at once. This way you don't have to spend so much time waiting to see when your query is going to finish. Then you could consider inspecting the query execution plan, which may also provide some insight into the problem at hand.

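As a crude sketch, even a TOP-limited run lets you iterate on indexes and inspect the actual execution plan in minutes rather than days (the row count here is arbitrary):

-- Run the same join over a small slice only, just to tune and inspect the plan
SELECT TOP (100000) EGM.Name, BioEntity.BioEntityId
FROM EGM INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId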

#8


1  

100 million records is HUGE. I'd say to work with a database that large you'd require a dedicated test server. Using the same machine to do other work while performing queries like that is not practical.

Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8GB would be a good start. Beyond that you have to make sure your indexes are setup just right.

#9


1  

Do you have any primary keys or indexes? Can you select it in stages? I.e. where name like 'A%', where name like 'B%', etc.

#10


1  

I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).

You said you made a clustered index on TypeId in both tables, although it appears you have a primary key on each table already (BioEntityId & EGMId, respectively). You do not want TypeId to be the clustered index on those tables. You want BioEntityId & EGMId to be clustered (that will physically sort your data in order of the clustered index on disk). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. Try making the primary keys clustered, and adding a non-clustered index on both tables that ONLY CONTAINS TypeId.

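A sketch of that, using the column names from the question (the constraint and index names are made up, and this assumes no conflicting primary keys or clustered indexes already exist):

-- Cluster each table on its surrogate primary key ...
ALTER TABLE BioEntity ADD CONSTRAINT PK_BioEntity PRIMARY KEY CLUSTERED (BioEntityId);
ALTER TABLE EGM ADD CONSTRAINT PK_EGM PRIMARY KEY CLUSTERED (EMGId);

-- ... and give TypeId its own non-clustered index on both tables
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId ON BioEntity (TypeId);
CREATE NONCLUSTERED INDEX IX_EGM_TypeId ON EGM (TypeId);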

In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we are combining two datasets on one or two columns. Adding an index for each foreign key should help out a lot with your performance.

Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.

K. Scott has a pretty good article here which explains some issues more in depth.

#11


1  

Reiterating a few prior posts here (which I'll vote up)...

How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.

I'd suggest creating a column based on CHECKSUM(Name) in both tables. Perhaps make this a persisted computed column:

CREATE TABLE BioEntity
 (
   BioEntityId  int
  ,Name         nvarchar(4000)
  ,TypeId       int
  ,NameLookup  AS checksum(Name) persisted
 )

and then create an index like so (I'd use clustered, but even nonclustered would help):

CREATE clustered INDEX IX_BioEntity__Lookup on BioEntity (NameLookup, TypeId)

(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)

Done on both tables, this should provide a very selective index to support your query if it's revised like this:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
 FROM EGM INNER JOIN BioEntity 
 ON EGM.NameLookup = BioEntity.NameLookup
  and EGM.name = BioEntity.Name
  and EGM.TypeId = BioEntity.TypeId

Depending on many factors it will still run long (not least because you're copying who knows how much data into a new table?), but it should take less than days.

#12


1  

Why an nvarchar? Best practice is, if you don't NEED (or expect to need) the unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (it seems like this database is insert intensive). With that much size, however, and the frequency and random nature of the names, your indexes will become fragmented quickly in most scenarios where you index on a hash (dependent on the hash) or the name.

I would alter the name column as described above and make the clustered index TypeId, EGMId/BioentityId (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it for a very small subset of your TypeIds, and that should give you an estimate of the run time (although it might ignore factors like cache size, memory size, hard disk transfer rates).

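Roughly like this (a sketch only; it assumes no name actually exceeds 255 characters, that any existing clustered index has been dropped first, and the TypeId values in the test query are hypothetical):

-- Shrink the name columns if the data allows it
ALTER TABLE BioEntity ALTER COLUMN Name varchar(255);
ALTER TABLE EGM ALTER COLUMN Name varchar(255);

-- Cluster on TypeId plus the surrogate key of each table
CREATE CLUSTERED INDEX IX_BioEntity_Type_Key ON BioEntity (TypeId, BioEntityId);
CREATE CLUSTERED INDEX IX_EGM_Type_Key ON EGM (TypeId, EMGId);

-- Dry run against a handful of TypeIds to estimate the full run time
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM INNER JOIN BioEntity
    ON EGM.TypeId = BioEntity.TypeId AND EGM.Name = BioEntity.Name
WHERE EGM.TypeId IN (1, 2, 3)   -- hypothetical TypeId values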

Edit: if this is an ongoing process, you should enforce the foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best bet.

#13


1  

I would try to solve the issue outside the box; maybe there is some other algorithm that could do the job much better and faster than the database. Of course it all depends on the nature of the data, but there are some string search algorithms that are pretty fast (Boyer-Moore, ZBox, etc.), as well as other data-mining algorithms (MapReduce?). By carefully crafting the data export it could be possible to bend the problem to fit a more elegant and faster solution. Also, it could be possible to parallelize the problem better and, with a simple client, make use of the idle cycles of the systems around you; there are frameworks that can help with this.

The output of this could be a list of refid tuples that you could use to fetch the complete data from the database much faster.

This does not prevent you from experimenting with indexes, but if you have to wait 6 days for the results, I think that justifies resources spent exploring other possible options.

My 2 cents.

#14


0  

Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).

Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.

Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.

For the script, you'll want to index the larger dataset, then iterate through the smaller dataset and do lookups into the large dataset index. With the index in place that's roughly O(n log m), rather than the O(n*m) of a naive nested loop.

#15


0  

If the hash match consumes too many resources, then do your query in batches of, say, 10000 rows at a time, "walking" the TypeID column. You didn't say the selectivity of TypeID, but presumably it is selective enough to be able to do batches this small and completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.

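A rough sketch of that batching pattern, walking one TypeID value per batch for simplicity (it assumes AUX has already been created; batches could be made finer by also ranging over the surrogate key):

DECLARE @TypeId int;
SELECT @TypeId = MIN(TypeId) FROM EGM;

WHILE @TypeId IS NOT NULL
BEGIN
    -- Each batch commits separately, so log space can be reused between batches (simple recovery)
    INSERT INTO AUX (Name, BioEntityId)
    SELECT EGM.Name, BioEntity.BioEntityId
    FROM EGM
    INNER LOOP JOIN BioEntity
        ON EGM.TypeId = BioEntity.TypeId AND EGM.Name = BioEntity.Name
    WHERE EGM.TypeId = @TypeId;

    SELECT @TypeId = MIN(TypeId) FROM EGM WHERE TypeId > @TypeId;
END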

Using batches will also, in simple recovery mode, keep your tran log from growing very large. Even in simple recovery mode, a huge join like you are doing will consume loads of space because it has to keep the entire transaction open, whereas when doing batches it can reuse the log file for each batch, limiting its size to the largest needed for one batch operation.

If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).

The idea about checksum can be good, too, but I haven't played with that very much, myself.

In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...

#16


0  

I wonder whether the execution time is taken by the join or by the data transfer.

Assuming the average data size in your Name column is 150 chars, you will actually have 300 bytes (nvarchar stores 2 bytes per character) plus the other columns per record. Multiply this by 100 million records and you get about 30GB of data to transfer to your client. Do you run the client remotely or on the server itself? Maybe you are waiting for 30GB of data to be transferred to your client...

EDIT: OK, I see you are inserting into the AUX table. What is the recovery model setting of the database?

To investigate the bottleneck on the hardware side, it might be interesting to find out whether the limiting resource is reading data or writing data. You can start a run of the Windows Performance Monitor and capture, for example, the read and write queue lengths of your disks.

Ideally, you should place the DB log file, the input tables and the output table on separate physical volumes to increase speed.
