SQL Server: searching proper names with a full-text index vs. LIKE + SOUNDEX

I have a database of people's names that currently has 35 million rows, and I need to know the best method for searching these names quickly. The current system (not designed by me) simply has the first and last name columns indexed and uses LIKE queries, with the additional option of using SOUNDEX (though I'm not sure this is actually used much). Performance has always been a problem with this system, so searches are currently limited to 200 results (which still take too long to run). So, I have a few questions:

Does a full-text index work well for proper names?

If so, what is the best way to query proper names? (CONTAINS, FREETEXT, etc)

Is there some other system (like Lucene.net) that would be better?

Just for reference, I'm using Fluent NHibernate for data access, so methods that work well with it will be preferred. I'm currently using SQL Server 2008.

EDIT: I want to add that I'm very interested in solutions that will deal with commonly misspelled names, e.g. 'smythe' vs. 'smith', as well as first names, e.g. 'tomas' vs. 'thomas'.

Query Plan

  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([testdb].[dbo].[Test].[Id], [Expr1004]) OPTIMIZED WITH UNORDERED PREFETCH)
            |--Hash Match(Inner Join, HASH:([testdb].[dbo].[Test].[Id])=([testdb].[dbo].[Test].[Id]))
            |    |--Bitmap(HASH:([testdb].[dbo].[Test].[Id]), DEFINE:([Bitmap1003]))
            |    |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |    |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_LastName]), SEEK:([testdb].[dbo].[Test].[LastName] >= 'WHITDþ' AND [testdb].[dbo].[Test].[LastName] < 'WHITF'),  WHERE:([testdb].[dbo].[Test].[LastName] like 'WHITE%') ORDERED FORWARD)
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_FirstName]), SEEK:([testdb].[dbo].[Test].[FirstName] >= 'THOMARþ' AND [testdb].[dbo].[Test].[FirstName] < 'THOMAT'),  WHERE:([testdb].[dbo].[Test].[FirstName] like 'THOMAS%' AND PROBE([Bitmap1003],[testdb].[dbo].[Test].[Id],N'[IN ROW]')) ORDERED FORWARD)
            |--Clustered Index Seek(OBJECT:([testdb].[dbo].[Test].[PK__TEST__3214EC073B95D2F1]), SEEK:([testdb].[dbo].[Test].[Id]=[testdb].[dbo].[Test].[Id]) LOOKUP ORDERED FORWARD)

SQL for above:

SELECT * FROM testdb.dbo.Test WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%'

Based on advice from Mitch, I created an index like this:

CREATE INDEX IX_Test_Name_DOB
ON Test (LastName ASC, FirstName ASC, BirthDate ASC)
INCLUDE (and here I list the other columns)

My searches are now incredibly fast for my typical search (last, first, and birth date).
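
For illustration, a query of the shape this index can serve might look like the following. This is only a sketch: the literal values are made up, and the column list assumes those columns are in the index key or its INCLUDE list.

```sql
-- Hypothetical "last, first, birth date" search against dbo.Test.
-- LastName is an equality predicate, so the seek can continue into FirstName.
SELECT LastName, FirstName, BirthDate
FROM dbo.Test
WHERE LastName = 'WHITE'
  AND FirstName LIKE 'THOMAS%'   -- prefix match: a range seek on the second key column
  AND BirthDate = '19700101';    -- made-up date; checked against index rows inside that range
```

Because every referenced column lives in the index, a query like this should not need to touch the clustered index at all.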

3 Solutions

#1

Depends what your LIKE queries look like.

If you are searching for LIKE '%abc%', an index seek is not possible (at best the whole index is scanned), whereas LIKE 'abc%' can use an index seek. Also, if the indexes on first and last name do not 'cover' the emitted query, key lookups (bookmark lookups) will be performed for every matching row, which can significantly hurt performance.
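
As a rough illustration using the Test table and IX_Test_LastName index from the question (the search strings are arbitrary), the two pattern shapes look like this:

```sql
-- Leading wildcard: matching rows can sit anywhere in the index order,
-- so the optimizer cannot seek on IX_Test_LastName.
SELECT Id, LastName FROM dbo.Test WHERE LastName LIKE '%HIT%';

-- Prefix pattern: turned into a range seek, as in the posted query plan.
SELECT Id, LastName FROM dbo.Test WHERE LastName LIKE 'WHIT%';
```

Both statements select only Id and LastName, which a nonclustered index on LastName already carries (Id being the clustered key), so neither needs a key lookup and the difference comes down to scan versus seek.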

Are your indexes rebuilt regularly?

Do you have an example query plan?

Update: A covering index for a query is one which can be used to perform the WHERE criteria and also has all of the columns required to satisfy the rest of the query such as the SELECT column list.

Using Covering Indexes to Improve Query Performance

Update: Even if you create a composite index on (LastName, FirstName) (LastName first, since it should be the more selective column), a lookup into the table's clustered index will still be required for all the other columns (the '*' column list).
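
A minimal sketch of the difference, assuming the table has more columns than the three shown. The index name is hypothetical, and BirthDate merely stands in for whatever extra columns the query really needs:

```sql
-- Hypothetical index name; BirthDate stands in for "the other columns".
CREATE NONCLUSTERED INDEX IX_Test_LastName_FirstName
ON dbo.Test (LastName ASC, FirstName ASC)
INCLUDE (BirthDate);

-- Covered: every referenced column is in the index (Id is the clustered key,
-- which nonclustered indexes carry automatically).
SELECT Id, LastName, FirstName, BirthDate
FROM dbo.Test
WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%';

-- Not covered: SELECT * asks for columns the index lacks,
-- so each matching row needs a key lookup.
SELECT *
FROM dbo.Test
WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%';
```

Comparing the actual execution plans of the two SELECT statements should show whether the Clustered Index Seek (LOOKUP) operator from the posted plan disappears for the first one.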

#2

I don't like Soundex much. I think newer iterations of the algorithm are better, but you are hashing every word in the English language down to a fairly small hash. This tends to generate a ton of false matches over time. I've read that Metaphone and its successor, Double Metaphone, are better, but I don't have direct experience with them.
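
One way to gauge how coarse the hash is on your own data is to count how many distinct spellings collapse into each code. The query below is only a sketch against the Test table from the question; it scans the whole table, so it is better run off-hours or on a sample.

```sql
-- How many different last names share one SOUNDEX code?
-- Codes with large counts are where false matches will come from.
SELECT TOP (20)
       SOUNDEX(LastName)        AS SoundexCode,
       COUNT(DISTINCT LastName) AS DistinctNames,
       COUNT(*)                 AS TotalRows
FROM dbo.Test
GROUP BY SOUNDEX(LastName)
ORDER BY DistinctNames DESC;
```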

Mitch's coverage of LIKE is pretty thorough, so I'm not going to repeat it.

#3

If you create an index on the first name and last name columns, then exact match searches and prefix searches using LIKE will become blazingly fast.

(In MySQL, "The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character." I think MS SQL has a similar rule, but check the MS SQL documentation to be sure.)

To speed up SoundEx searches, store the SoundEx version of the first name and last name in new columns, and create indices on those columns.
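
A sketch of that idea on SQL Server, with some assumptions of mine: the column and index names are hypothetical, persisted computed columns are used so the codes stay in sync automatically (plain columns filled by an UPDATE would also work), and a single composite index replaces one index per column.

```sql
-- Column and index names are hypothetical.
-- PERSISTED stores the computed code in the row instead of recomputing it per query.
ALTER TABLE dbo.Test ADD
    LastNameSoundex  AS SOUNDEX(LastName)  PERSISTED,
    FirstNameSoundex AS SOUNDEX(FirstName) PERSISTED;
GO  -- batch separator understood by SSMS/sqlcmd, not a T-SQL statement

CREATE NONCLUSTERED INDEX IX_Test_Soundex
ON dbo.Test (LastNameSoundex, FirstNameSoundex)
INCLUDE (LastName, FirstName);
GO

-- The search terms are encoded once; the stored codes can then be matched by an index seek.
SELECT Id, LastName, FirstName
FROM dbo.Test
WHERE LastNameSoundex  = SOUNDEX('smythe')
  AND FirstNameSoundex = SOUNDEX('tomas');
```

On a 35-million-row table, adding persisted columns and building the index are heavy operations, so this is best tried on a copy first.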
