How do I create a simple fuzzy search with PostgreSQL?

Date: 2021-09-15 19:25:51

I have a small problem with the search functionality on my RoR-based site. I have many Products, each with a CODE. The code can be any string, such as "AB-123-lHdfj". Currently I use the ILIKE operator to find products:

Product.where("code ILIKE ?", "%" + params[:search] + "%")

It works fine, but it can't find the product when the search term is written like "AB123-lHdfj" or "AB123lHdfj".

What should I do about this? Maybe PostgreSQL has a string normalization function, or some other method that could help me? :)

2 answers

#1


37  

Postgres provides a module (fuzzystrmatch) with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.

Example:

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance result you will have the type of fuzzy matches that you're looking for.
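
To make the idea concrete, here is a minimal Ruby sketch of the same kind of edit-distance calculation, applied to a small in-memory list. It only illustrates what levenshtein() computes and how sorting by it ranks candidates. The method names, the sample codes and the threshold of 3 are assumptions for illustration; in practice you would let Postgres do this work.

```ruby
# Illustrative only: a plain-Ruby Levenshtein distance (insert, delete and
# substitute each cost 1). This is not the fuzzystrmatch implementation.
def levenshtein(a, b)
  prev = (0..b.length).to_a             # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                          # distance from the first i chars of a to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,             # delete ca
               curr[j - 1] + 1,         # insert cb
               prev[j - 1] + cost].min  # substitute, or keep if equal
    end
    prev = curr
  end
  prev.last
end

# Hypothetical sample data standing in for the code column.
CODES = ["AB-123-lHdfj", "AB-124-lHdfk", "ZZ-999-xxxxx"].freeze

# Keep codes within max_distance edits of the term, closest first.
def fuzzy_rank(term, codes, max_distance = 3)
  codes.map { |code| [code, levenshtein(code, term)] }
       .select { |_, distance| distance <= max_distance }
       .sort_by { |_, distance| distance }
       .map(&:first)
end

p levenshtein("GUMBO", "GAMBOL")      # 2, same as the psql output above
p fuzzy_rank("AB123-lHdfj", CODES)    # codes within 3 edits, closest first
```

The nested loop is also why the distance has to be computed once per candidate string: there is nothing to look up, every comparison is calculated from scratch.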

Try this sample query (with your own object names and data, of course):

SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3   -- keep rows within 3 edits of the input
ORDER BY levenshtein(code, 'AB123-lHdfj')     -- closest matches first
LIMIT 10;

This query says:

Give me the top 10 results from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back the rows whose code differs from 'AB123-lHdfj' by no more than 3 single-character edits, closest matches first.

Note: if you get an error like:

function levenshtein(character varying, unknown) does not exist

Install the fuzzystrmatch extension using:

test=# CREATE EXTENSION fuzzystrmatch;

#2


33  

Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row, which is expensive.

First off, if your requirements are as simple as the example indicates, you can still use LIKE. Just replace any - in your search term with % to build the WHERE clause:

WHERE code LIKE '%AB%123%lHdfj%'

instead of

WHERE code LIKE '%AB-123-lHdfj%'
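
As a rough sketch, the search term could be turned into such a pattern in Ruby before it is handed to the query. The helper name like_pattern is made up for illustration. It also escapes %, _ and \ in the user's input so they are matched literally, assuming LIKE's default escape character (backslash).

```ruby
# Hypothetical helper: build a LIKE pattern in which every "-" in the search
# term becomes a "%" wildcard, and the whole term is wrapped in "%...%".
def like_pattern(term)
  # Escape LIKE metacharacters typed by the user so they match literally.
  escaped = term.gsub(/[\\%_]/) { |m| "\\#{m}" }
  # Split on dashes, drop empty pieces (leading or doubled dashes), rejoin.
  "%" + escaped.split("-").reject(&:empty?).join("%") + "%"
end

p like_pattern("AB-123-lHdfj")

# Possible use with the model from the question (not run here):
#   Product.where("code LIKE ?", like_pattern(params[:search]))
```

Note that this only relaxes dashes that appear in the search term. A term typed without dashes still won't match a stored code that contains them.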

If your real problem is more complex and you need something faster, then, depending on your requirements, there are several options.

  • There is full text search, of course. But this may be overkill in your case.

  • A more likely candidate is pg_trgm. Note that as of PostgreSQL 9.1 its trigram indexes can also support LIKE and ILIKE queries. See this blog post by Depesz.
    Also very interesting in this context: the similarity() function or % operator of that module. More:

  • Last but not least, you can implement a hand-knit solution with a function that normalizes the strings to be searched. For instance, you could transform AB1-23-lHdfj -> ab123lhdfj, save it in an additional column, and search that column with search terms that have been transformed the same way.

    Or use an index on an expression instead of the redundant column. (Involved functions must be IMMUTABLE.) And possibly combine that with pg_trgm from above.
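
The hand-knit normalization might look roughly like this. Everything named here is an assumption for illustration: the helper names, the table name products, the index name, and the choice to keep only ASCII letters and digits. The SQL is only held in strings and is not executed by this snippet.

```ruby
# Hypothetical helper: lower-case the code and drop everything that is not an
# ASCII letter or digit, so "AB1-23-lHdfj" and "ab123lhdfj" compare equal.
def normalize_code(code)
  code.downcase.gsub(/[^a-z0-9]/, "")
end

# The same normalization written as a SQL expression. lower() and
# regexp_replace() are IMMUTABLE, so the expression can be indexed.
NORMALIZED_SQL = "lower(regexp_replace(code, '[^a-zA-Z0-9]', '', 'g'))"

# Expression index instead of a redundant column. Table and index names are
# made up; gin_trgm_ops requires CREATE EXTENSION pg_trgm beforehand.
CREATE_INDEX_SQL = <<~SQL
  CREATE INDEX products_code_normalized_idx
  ON products USING gin ((#{NORMALIZED_SQL}) gin_trgm_ops);
SQL

# WHERE fragment that repeats the indexed expression, plus the bind value.
# The normalized term contains no % or _, so it needs no LIKE escaping.
def normalized_where(term)
  ["#{NORMALIZED_SQL} LIKE ?", "%#{normalize_code(term)}%"]
end

p normalize_code("AB1-23-lHdfj")
puts CREATE_INDEX_SQL

# Possible use with the model from the question (not run here):
#   Product.where(*normalized_where(params[:search]))
```

Because both the stored value and the search term go through the same transformation, "AB-123-lHdfj", "AB123-lHdfj" and "AB123lHdfj" all end up as the same string.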

Overview of pattern-matching techniques:
