I'm looking for a fuzzy search JavaScript library to filter an array. I've tried using fuzzyset.js and fuse.js, but the results are terrible (there are demos you can try on the linked pages).
After doing some reading on Levenshtein distance, it strikes me as a poor approximation of what users are looking for when they type. For those who don't know, it counts how many insertions, deletions, and substitutions are needed to make two strings match.
One obvious flaw, which is fixed in the Damerau-Levenshtein model, is that both blub and boob are considered equally similar to bulb (each requiring two substitutions). It is clear, however, that bulb is more similar to blub than to boob, and the Damerau variant recognizes that by also allowing transpositions.
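To make the mechanics concrete, here is a minimal sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" form) in plain JavaScript. It is my own illustration, not code from any of the libraries mentioned:

function damerauLevenshtein(a, b) {
  // d[i][j] = distance between the first i chars of a and the first j chars of b
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
      }
    }
  }
  return d[a.length][b.length];
}

damerauLevenshtein('bulb', 'blub') // 1 -- one transposition
damerauLevenshtein('bulb', 'boob') // 2 -- two substitutions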
I want to use this in the context of text completion, so if I have an array ['international', 'splint', 'tinder'] and my query is int, I think international ought to rank more highly than splint, even though the former has a score (higher = worse) of 10 versus the latter's 3.
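Checking those numbers with the sketch above:

damerauLevenshtein('int', 'international') // 10 -- ten trailing insertions
damerauLevenshtein('int', 'splint')        // 3  -- three leading insertions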
So what I'm looking for (and will create if it doesn't exist) is a library that does the following:
- Weights the different text manipulations
- Weights each manipulation differently depending on where it appears in the word (early manipulations being more costly than late ones) -- see the sketch after this list
- Returns a list of results sorted by relevance
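To illustrate the second point, here is a hypothetical position-weighted Levenshtein variant in which an edit's cost decays with its character index. Both positionWeight and its 1 / (i + 1) decay schedule are my own inventions for illustration, not part of any existing library:

// Edits near the start of the word cost more than edits near the end.
function positionWeight(i) {
  return 1 / (i + 1); // cost of an edit at character index i (arbitrary schedule)
}

function weightedLevenshtein(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 1; i <= a.length; i++) d[i][0] = d[i - 1][0] + positionWeight(i - 1);
  for (let j = 1; j <= b.length; j++) d[0][j] = d[0][j - 1] + positionWeight(j - 1);
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const w = positionWeight(Math.min(i, j) - 1);
      d[i][j] = Math.min(
        d[i - 1][j] + w,                                   // deletion
        d[i][j - 1] + w,                                   // insertion
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : w)  // substitution
      );
    }
  }
  return d[a.length][b.length];
}

Sorting candidates in ascending order of this score would then give the third point.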
Has anyone come across anything like this? I realize that * isn't the place to be asking for software recommendations, but implicit (not anymore!) in the above is: am I thinking about this the right way?
Edit
I found a good paper (pdf) on the subject. Some notes and excerpts:
Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions
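In other words, a run of k consecutive insertions or deletions is charged an opening cost once, plus a smaller extension cost per additional character, instead of k full unit costs. A one-liner makes the point (the constants are illustrative, not taken from the paper):

// Affine gap penalty: opening a gap is expensive, extending it is cheap.
const GAP_OPEN = 1.0, GAP_EXTEND = 0.1;
const gapCost = (k) => GAP_OPEN + GAP_EXTEND * (k - 1);

gapCost(1)  // 1.0
gapCost(10) // 1.9 -- far less than ten unit-cost edits

This is exactly what the completion case needs: the ten trailing insertions in completing int to international become one cheap gap.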
the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters
For the Smith-Waterman distance (wikipedia), "Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure." In other words, it performs local alignment: it scores the best-matching pair of substrings rather than the whole strings.
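A compact sketch of the Smith-Waterman recurrence with toy scores (match +2, mismatch and gap -1; real implementations make these, and affine gaps, configurable):

// Smith-Waterman local alignment score: a short query embedded anywhere
// inside a long word can still achieve the maximum score.
function smithWaterman(a, b, match = 2, mismatch = -1, gap = -1) {
  const h = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  let best = 0;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      h[i][j] = Math.max(
        0, // restart the alignment rather than go negative
        h[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? match : mismatch),
        h[i - 1][j] + gap,
        h[i][j - 1] + gap
      );
      best = Math.max(best, h[i][j]);
    }
  }
  return best;
}

smithWaterman('int', 'international') // 6 -- 'int' aligns perfectly
smithWaterman('int', 'splint')        // 6 -- also perfect; ties need another signal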
A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings.
A variant of this due to Winkler (1999) also uses the length P of the longest common prefix
(seem to be intended primarily for short strings)
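For reference, a self-contained Jaro-Winkler sketch (the standard formulation; the 0.1 prefix scale and 4-character prefix cap are Winkler's usual defaults):

// Jaro similarity: based on the number of characters that match within a
// sliding window, and how many of those matches are out of order.
function jaro(a, b) {
  if (a === b) return 1;
  if (!a.length || !b.length) return 0;
  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatch = new Array(a.length).fill(false);
  const bMatch = new Array(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - window), hi = Math.min(b.length - 1, i + window);
    for (let j = lo; j <= hi; j++) {
      if (!bMatch[j] && a[i] === b[j]) { aMatch[i] = bMatch[j] = true; matches++; break; }
    }
  }
  if (!matches) return 0;
  let transpositions = 0, k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatch[i]) continue;
    while (!bMatch[k]) k++;
    if (a[i] !== b[k]) transpositions++;
    k++;
  }
  transpositions /= 2;
  return (matches / a.length + matches / b.length + (matches - transpositions) / matches) / 3;
}

// Winkler's modification: boost the score of strings sharing a common prefix.
function jaroWinkler(a, b, p = 0.1) {
  const sim = jaro(a, b);
  let prefix = 0;
  while (prefix < Math.min(4, a.length, b.length) && a[prefix] === b[prefix]) prefix++;
  return sim + prefix * p * (1 - sim);
}

jaroWinkler('int', 'international') // ≈ 0.82, helped by the prefix boost
jaroWinkler('int', 'splint')        // 0 -- 't' falls outside the match window

Note how the second result illustrates the short-strings caveat above.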
For text completion purposes, the Monge-Elkan and Jaro-Winkler approaches seem to make the most sense. Winkler's addition to the Jaro metric effectively weights the beginnings of words more heavily. And the affine aspect of Monge-Elkan means that the necessity to complete a word (which is simply a sequence of additions) won't disfavor it too heavily.
Conclusion:
the TFIDF ranking performed best among several token-based distance metrics, and a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro and later extended by Winkler. This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster. One simple way of combining the TFIDF method and the Jaro-Winkler is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.
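The combination described in that last excerpt is usually called SoftTFIDF. A rough sketch of the idea, reusing the jaroWinkler function above (my own simplification of the paper's definition; corpus is just an array of token arrays used to derive IDF weights):

// SoftTFIDF (sketch): TFIDF cosine similarity over tokens, except that two
// tokens "match" when their Jaro-Winkler similarity clears a threshold.
function softTfidf(sTokens, tTokens, corpus, theta = 0.9) {
  const N = corpus.length;
  const df = (tok) => corpus.filter((doc) => doc.includes(tok)).length || 1;
  const weights = (tokens) => {
    const w = new Map();
    for (const tok of tokens) w.set(tok, (w.get(tok) || 0) + 1);
    let norm = 0;
    for (const [tok, tf] of w) {
      const v = Math.log(tf + 1) * Math.log(N / df(tok)); // TF * IDF
      w.set(tok, v);
      norm += v * v;
    }
    norm = Math.sqrt(norm) || 1;
    for (const [tok, v] of w) w.set(tok, v / norm); // cosine-normalize
    return w;
  };
  const ws = weights(sTokens), wt = weights(tTokens);
  let sim = 0;
  for (const [s, vs] of ws) {
    let best = null, bestSim = theta; // only accept matches at or above theta
    for (const t of wt.keys()) {
      const d = jaroWinkler(s, t);
      if (d >= bestSim) { bestSim = d; best = t; }
    }
    if (best !== null) sim += vs * wt.get(best) * bestSim;
  }
  return sim;
}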
5 Answers
#1
16
Good question! But my thought is that, rather than trying to modify Damerau-Levenshtein, you might do better to try a different algorithm, or to combine/weight the results from two algorithms.
It strikes me that exact or close matches to the "starting prefix" are something Damerau-Levenshtein gives no particular weight to -- but your apparent user expectations would.
I searched for "better than Levenshtein" and, among other things, found this:
http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
This mentions a number of "string distance" measures. Three that look particularly relevant to your requirement would be:
- Longest Common Substring distance: minimum number of symbols that have to be removed from both strings until the resulting substrings are identical.
- q-gram distance: sum of absolute differences between the N-gram vectors of both strings.
- Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams. (A sketch of the two N-gram metrics follows this list.)
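As a minimal sketch of those two N-gram metrics (bigrams by default; the n parameter is a free choice):

// Split a string into its contiguous substrings of length n.
function ngrams(s, n = 2) {
  const out = [];
  for (let i = 0; i <= s.length - n; i++) out.push(s.slice(i, i + n));
  return out;
}

// q-gram distance: sum of absolute differences between N-gram counts.
function qgramDistance(a, b, n = 2) {
  const counts = new Map();
  for (const g of ngrams(a, n)) counts.set(g, (counts.get(g) || 0) + 1);
  for (const g of ngrams(b, n)) counts.set(g, (counts.get(g) || 0) - 1);
  let d = 0;
  for (const v of counts.values()) d += Math.abs(v);
  return d;
}

// Jaccard distance: 1 minus |shared N-grams| / |all observed N-grams|.
function jaccardDistance(a, b, n = 2) {
  const A = new Set(ngrams(a, n)), B = new Set(ngrams(b, n));
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  const union = A.size + B.size - shared;
  return union ? 1 - shared / union : 0;
}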
Maybe you could use a weighted combination (or minimum) of these metrics with Levenshtein -- common substring, common N-gram, or Jaccard will all strongly prefer similar strings -- or perhaps try just using Jaccard?
Depending on the size of your list/database, these algorithms can be moderately expensive. For a fuzzy search I implemented, I used a configurable number of N-grams as "retrieval keys" from the DB, then ran the expensive string-distance measure to sort them in preference order.
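In memory, that two-stage shape might look something like this (the index structure and names are my own illustration, reusing ngrams and qgramDistance from the sketch above):

// Stage 1: cheap candidate retrieval via an inverted index of N-grams.
// Stage 2: an expensive distance metric over the (small) candidate set.
function buildIndex(words, n = 2) {
  const index = new Map(); // N-gram -> Set of words containing it
  for (const w of words) {
    for (const g of ngrams(w.toLowerCase(), n)) {
      if (!index.has(g)) index.set(g, new Set());
      index.get(g).add(w);
    }
  }
  return index;
}

function search(query, index, n = 2) {
  const candidates = new Set();
  for (const g of ngrams(query.toLowerCase(), n)) {
    for (const w of index.get(g) || []) candidates.add(w);
  }
  return [...candidates].sort(
    (a, b) => qgramDistance(query, a, n) - qgramDistance(query, b, n)
  );
}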
I wrote some notes on Fuzzy String Search in SQL. See:
- http://literatejava.com/sql/fuzzy-string-search-sql/
#2
12
Here is a technique I have used a few times... it gives pretty good results, though it doesn't do everything you asked for. Also, it can be expensive if the list is massive.
# Split a string into its overlapping two-character bigrams.
get_bigrams = (string) ->
  s = string.toLowerCase()
  v = new Array(s.length - 1)
  for i in [0...v.length] by 1   # exclusive range; '..' would run one slot past the last bigram
    v[i] = s.slice(i, i + 2)
  return v

# Dice coefficient over bigrams: 2 * shared / (total bigrams in both strings).
string_similarity = (str1, str2) ->
  if str1.length > 0 and str2.length > 0
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = pairs1.length + pairs2.length
    hit_count = 0
    for x in pairs1
      for y in pairs2
        if x is y
          hit_count++
    if hit_count > 0
      return ((2.0 * hit_count) / union)
  return 0.0
Pass two strings to string_similarity and it will return a number between 0 and 1.0, depending on how similar they are. The usage example below uses Lo-Dash for sorting.
Usage example:
query = 'jenny Jackson'
names = ['John Jackson', 'Jack Johnson', 'Jerry Smith', 'Jenny Smith']

results = []
for name in names
  relevance = string_similarity(query, name)
  obj = {name: name, relevance: relevance}
  results.push(obj)

# Sort by relevance (descending) and keep the top 10 -- this is the Lo-Dash part.
results = _.first(_.sortBy(results, 'relevance').reverse(), 10)

console.log results
Also... have a fiddle.
Make sure your console is open or you won't see anything :)
#3
7
I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one that behaves basically like Sublime's search: https://github.com/farzher/fuzzysort
The only typo it allows is a transposition. It's pretty solid (1k stars, 0 issues), very fast, and handles your case easily:
fuzzysort.go('int', ['international', 'splint', 'tinder'])
// [{highlighted: '*int*ernational', score: 10}, {highlighted: 'spl*int*', score: 3003}]
#4
5
You may take a look at Atom's fuzzaldrin lib: https://github.com/atom/fuzzaldrin/
It is available on npm, has a simple API, and worked OK for me.
> fuzzaldrin.filter(['international', 'splint', 'tinder'], 'int');
< ["international", "splint"]
#5
1
// On each input event, scan the array and display the first entry that
// contains the typed text and shares its first character, once exactly
// three characters have been typed.
(function (int) {
  $("input[id=input]").on("input", { sort: int }, function (e) {
    var typed = $(e.target).val();
    $.each(e.data.sort, function (index, value) {
      if (value.indexOf(typed) !== -1
          && value.charAt(0) === typed.charAt(0)
          && typed.length === 3) {
        $("output[for=input]").val(value);
        return false; // stop at the first match
      }
    });
    return false;
  });
}(["international", "splint", "tinder"]));
jsfiddle: http://jsfiddle.net/guest271314/QP7z5/