I have a little problem with the search functionality on my RoR-based site. I have many Products, each with a CODE. This code can be any string like "AB-123-lHdfj". Right now I use the ILIKE operator to find products:
Product.where("code ILIKE ?", "%" + params[:search] + "%")
It works fine, but it can't find products with codes like "AB123-lHdfj" or "AB123lHdfj".
What should I do about this? Maybe PostgreSQL has some string normalization function, or is there some other method that would help me? :)
2 Answers
#1
37
Postgres provides a module with several string comparison functions such as soundex and metaphone. But you will want to use the levenshtein edit distance function.
Example:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)
The 2 is the edit distance between the two words. When you apply this against a number of words and sort by the edit distance, you get the kind of fuzzy matches that you're looking for.
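To see where that 2 comes from, here is a minimal plain-Ruby sketch of the textbook dynamic-programming edit distance, where every insertion, deletion, and substitution costs 1. It is only an illustration of the idea, not PostgreSQL's own implementation, and the method name is simply chosen to mirror the SQL function.

```ruby
# Textbook edit distance: insert, delete, and substitute each cost 1.
# Illustrative sketch only; it is not how PostgreSQL implements levenshtein().
def levenshtein(a, b)
  # prev[j] = distance between the chars of `a` consumed so far and the first j chars of `b`.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning the first i chars of `a` into "" takes i deletions
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,        # delete ca
        curr[j - 1] + 1,    # insert cb
        prev[j - 1] + cost  # substitute ca with cb (free if they are equal)
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("GUMBO", "GAMBOL")            # 2 (U -> A, then append L)
puts levenshtein("AB-123-lHdfj", "AB123-lHdfj") # 1 (one hyphen removed)
puts levenshtein("AB-123-lHdfj", "AB123lHdfj")  # 2 (both hyphens removed)
```

Applied to the codes from the question, both variant spellings are within a distance of 2 of "AB-123-lHdfj", which is why a small threshold is enough to catch them.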
Try this sample query (with your own object names and data, of course):
SELECT *
FROM some_table
WHERE levenshtein(code, 'AB123-lHdfj') <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10
This query says:
Give me the 10 closest rows from some_table where the edit distance between the code value and the input 'AB123-lHdfj' is at most 3. You will get back rows whose code is within 3 single-character edits of 'AB123-lHdfj', nearest first.
Note: if you get an error like:
function levenshtein(character varying, unknown) does not exist
Install the fuzzystrmatch extension using:
test=# CREATE EXTENSION fuzzystrmatch;
#2
33
Paul told you about levenshtein(). That's a very useful tool, but it's also very slow with big tables. It has to calculate the Levenshtein distance from the search term for every single row, which is expensive.
First off, if your requirements are as simple as the example indicates, you can still use LIKE. Just replace any - in your search term with % to create the WHERE clause
WHERE code LIKE '%AB%123%lHdfj%'
instead of
WHERE code LIKE '%AB-123-lHdfj%'
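On the Rails side, that replacement could be sketched roughly as follows. The helper name like_pattern is hypothetical, and the commented-out query assumes the Product model and params[:search] from the question.

```ruby
# Hypothetical helper: turn a user-typed code into a LIKE/ILIKE pattern
# in which every hyphen may match any run of characters (including none).
def like_pattern(term)
  "%" + term.gsub("-", "%") + "%"
end

puts like_pattern("AB-123-lHdfj") # %AB%123%lHdfj%

# In the app from the question this might be used like so (not run here):
#   Product.where("code ILIKE ?", like_pattern(params[:search]))
```

Two caveats with this sketch: a literal % or _ typed by the user still acts as a wildcard, and it only helps in one direction. The pattern built from "AB-123-lHdfj" matches a stored "AB123lHdfj", but a search for "AB123lHdfj" will not match a stored "AB-123-lHdfj".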
If your real problem is more complex and you need something faster, there are several options, depending on your requirements.
- There is full text search, of course. But this may be overkill in your case.
- A more likely candidate is pg_trgm. Note that you can combine that with LIKE in PostgreSQL 9.1 or later. See this blog post by Depesz. Also very interesting in this context: the similarity() function or % operator of that module. More:
  - PostgreSQL LIKE query performance variations
- Last but not least, you can implement a hand-knit solution with a function to normalize the strings to be searched. For instance, you could transform AB1-23-lHdfj -> ab123lhdfj, save it in an additional column, and search it with search terms that have been transformed the same way.
  Or use an index on an expression instead of the redundant column. (Involved functions must be IMMUTABLE.) And possibly combine that with pg_trgm from above.
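The normalization idea from the last option could look something like this on the application side. This is a minimal sketch, assuming codes only need ASCII letters and digits to be significant; normalize_code is a made-up name, not part of Rails or PostgreSQL.

```ruby
# Hypothetical normalizer: lower-case the code and drop every character
# that is not an ASCII letter or digit (hyphens, spaces, dots, ...).
def normalize_code(code)
  code.downcase.gsub(/[^a-z0-9]/, "")
end

puts normalize_code("AB1-23-lHdfj") # ab123lhdfj

# All three spellings from the question collapse to the same key.
spellings = ["AB-123-lHdfj", "AB123-lHdfj", "AB123lHdfj"]
puts spellings.map { |c| normalize_code(c) }.uniq.inspect # ["ab123lhdfj"]
```

The same function would be applied both when saving a product (into the extra column) and to the search term before querying. If you go for the expression index instead, the database needs an equivalent SQL expression, presumably something along the lines of lower(regexp_replace(code, '[^a-zA-Z0-9]', '', 'g')), and the query has to use exactly that expression for the index to be considered.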
Overview of pattern-matching techniques:
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL