SQL Server full-text search with fallback to exact match

Date: 2022-09-19 17:53:39

First off, there seems to be no way to get an exact match using a full-text search. This appears to be a widely discussed issue with full-text search, and there are many different solutions for achieving the desired result, but most seem very inefficient. Because the volume of my database forces me to use full-text search, I recently had to implement one of these solutions to get more accurate results.

I could not use the ranking results from the full-text search because of how the ranking works. For instance, if you searched for a movie called Toy Story and there was also a movie called The Story Behind Toy Story, the latter would come up instead of the exact match because it contains the word Story twice as well as Toy.

I do track my own ranking, which I call "Popularity": each time a user accesses a record, the number goes up. I use this data point to weight my results and help determine what the user might be looking for.

I also have the issue that I sometimes need to fall back to a LIKE search and return a result that is not an exact match. For example, searching for Goonies should return The Goonies (the most popular result).

So here is an example of my current stored procedure for achieving this:

DECLARE @Title varchar(255)
SET @Title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @Title2 varchar(255)
SET @Title2 = REPLACE(@Title, '"', '')

--get top 100 results using full-text search and sort them by popularity
SELECT TOP(100) id, title, popularity AS Weight INTO #TempTable FROM movies WHERE CONTAINS(title, @Title) ORDER BY Weight DESC

--check if exact match can be found
IF EXISTS(SELECT * FROM #TempTable WHERE title = @Title2)
    --return exact match (ORDER BY added: rows in a temp table have no guaranteed order)
    SELECT TOP(1) * FROM #TempTable WHERE title = @Title2 ORDER BY Weight DESC
ELSE
    --no exact match found, try using LIKE with wildcards
    SELECT TOP(1) * FROM #TempTable WHERE title LIKE '%' + @Title2 + '%' ORDER BY Weight DESC
DROP TABLE #TempTable

This stored procedure is executed about 5,000 times a minute, and crazily enough it's not bringing my server to its knees. But I really want to know whether there is a more efficient approach. Thanks.

4 solutions

#1


4  

You should use full text search CONTAINSTABLE to find the top 100 (possibly 200) candidate results and then order the results you found using your own criteria.

It sounds like you'd like to ORDER BY

  1. exact match of the phrase (=)
  2. the fully matched phrase (LIKE)
  3. higher value for the Popularity column
  4. the Rank from the CONTAINSTABLE

But you can toy around with the exact order you prefer.

In SQL that looks something like:

DECLARE @title varchar(255)
SET @title = '"Toy Story"'
--need to remove quotes from parameter for LIKE search
DECLARE @title2 varchar(255)
SET @title2 = REPLACE(@title, '"', '')

SELECT
    m.ID,
    m.title,
    m.Popularity,
    k.Rank
FROM Movies m
INNER JOIN CONTAINSTABLE(Movies, title, @title, 100) as [k]
    ON m.ID = k.[Key]
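ORDER BY
  --exact title first
  CASE WHEN m.title = @title2 THEN 0 ELSE 1 END,
  --then titles containing the phrase (LIKE needs wildcards, otherwise it behaves like =)
  CASE WHEN m.title LIKE '%' + @title2 + '%' THEN 0 ELSE 1 END,
  m.popularity DESC,
  --a higher full-text rank means a better match
  k.[Rank] DESC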

See SQLFiddle
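
As a rough sketch only: since the stored procedure in the question returns a single row, the ordering above could probably be folded into one statement with TOP (1), which would avoid the temp table and the second lookup. The procedure name dbo.FindBestMovieMatch and the way the parameter is quoted are assumptions for illustration; the table and column names follow the question, and a full-text index on the title column is assumed to exist.

```sql
-- Sketch only: the procedure name is hypothetical; Movies, title, ID and
-- Popularity follow the question. Requires a full-text index on Movies.title.
CREATE PROCEDURE dbo.FindBestMovieMatch
    @Title varchar(255)          -- plain title, e.g. 'Toy Story'
AS
BEGIN
    SET NOCOUNT ON;

    -- Strip any double quotes, then quote the whole title so that
    -- CONTAINSTABLE treats it as a phrase.
    DECLARE @Plain  varchar(255) = REPLACE(@Title, '"', '');
    DECLARE @Phrase varchar(257) = '"' + @Plain + '"';

    SELECT TOP (1)
        m.ID,
        m.title,
        m.Popularity
    FROM dbo.Movies AS m
    INNER JOIN CONTAINSTABLE(dbo.Movies, title, @Phrase, 100) AS k
        ON m.ID = k.[KEY]
    ORDER BY
        CASE WHEN m.title = @Plain THEN 0 ELSE 1 END,  -- exact title first
        m.Popularity DESC,                             -- then most popular
        k.[RANK] DESC;                                 -- then best full-text rank
END;
```

It would be called as EXEC dbo.FindBestMovieMatch @Title = 'Goonies'. Two caveats: the 100 candidates are chosen by full-text rank before popularity is considered, so a larger number may be needed for very common words, and an empty title should be rejected before the query runs because an empty full-text predicate raises an error.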

#2


2  

This will give you the movies that contain the exact phrase "Toy Story", ordered by their popularity.

SELECT
    m.[ID],
    m.[Popularity],
    k.[Rank]
FROM [dbo].[Movies] m
INNER JOIN CONTAINSTABLE([dbo].[Movies], [Title], N'"Toy Story"') as [k]
    ON m.[ID] = k.[Key]
ORDER BY m.[Popularity] DESC

Note the above would also give you "The Goonies Return" if you searched "The Goonies".

#3


0  

I've got the feeling you don't really like the fuzzy part of full-text search, but you do like the performance part.

Maybe this is a path: if you insist on getting the EXACT match before a weighted match, you could try hashing the value. For example, 'Toy Story' -> convert to lowercase -> 'toy story' -> hash into 4de2gs5sa (with whatever hash you like), then perform a search on the hash.
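
One way this might look in T-SQL, as a sketch under assumptions rather than a tested setup: a persisted computed column holds the hash of the lowercased title, and an index on it serves the exact-match lookup. The column name TitleHash, the index name, and the choice of SHA2_256 are all assumptions for illustration, and the title column is assumed to be varchar (hashing a varchar and an nvarchar of the same text gives different results, so the parameter type has to match the column).

```sql
-- Sketch only: TitleHash and IX_Movies_TitleHash are hypothetical names.
-- Persisted computed column: hash of the lowercased title.
ALTER TABLE dbo.Movies
    ADD TitleHash AS CAST(HASHBYTES('SHA2_256', LOWER(title)) AS binary(32)) PERSISTED;
GO  -- batch separator (SSMS/sqlcmd) so the new column is visible below

-- Index the hash; Popularity is included so the lookup can pick a winner
-- without extra reads if two rows share the same title.
CREATE INDEX IX_Movies_TitleHash
    ON dbo.Movies (TitleHash)
    INCLUDE (Popularity);
GO

-- Exact-match lookup: hash the search term the same way the column does.
DECLARE @Title varchar(255) = 'Toy Story';

SELECT TOP (1) ID, title, Popularity
FROM dbo.Movies
WHERE TitleHash = CAST(HASHBYTES('SHA2_256', LOWER(@Title)) AS binary(32))
  AND title = @Title              -- guards against a hash collision
ORDER BY Popularity DESC;
```

If this returns no row, the search would fall through to the full-text query. Depending on the column's collation, a plain index on title may serve the same exact-match lookup without any hashing, so it is worth comparing the two before adding a column.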

#4


0  

In Oracle I've used UTL_MATCH for similar purposes. (http://docs.oracle.com/cd/E11882_01/appdev.112/e25788/u_match.htm)

Using the Jaro-Winkler algorithm might take a while if you compare the title column from table 1 against table 2, but you can improve performance by partially joining the two tables. In some cases I have compared person names in table 1 with table 2 using Jaro-Winkler, limiting the results not only to those above a certain Jaro-Winkler threshold but also to name pairs where the first letter is the same. For instance, I would compare Albert with Aden, Alfonzo, and Alberto using Jaro-Winkler, but not Albert with Frank (limiting the number of situations where the algorithm needs to be used).

Jaro-Winkler may actually be suitable for movie titles as well. Although you are using SQL Server (and so can't use the UTL_MATCH package), it looks like there is a free library called "SimMetrics" which includes the Jaro-Winkler algorithm among other string comparison metrics. You can find details and instructions here: http://anastasiosyal.com/POST/2009/01/11/18.ASPX?#simmetrics
