Java中的模糊字符串搜索库

时间:2022-08-14 19:25:21

I'm looking for a high performance Java library for fuzzy string search.

我正在寻找一个用于模糊字符串搜索的高性能Java库。

There are numerous algorithms to find similar strings, Levenshtein distance, Daitch-Mokotoff Soundex, n-grams etc.

有许多算法可以找到类似的字符串,Levenshtein距离,Daitch-Mokotoff Soundex,n-gram等。

What Java implementations exists? Pros and cons for them? I'm aware of Lucene, any other solution or Lucene is best?

存在哪些Java实现?他们的利弊?我知道Lucene,任何其他解决方案或Lucene是最好的吗?

I found these, does anyone have experience with them?

我找到了这些,有没有人有经验?

8 个解决方案

#1


35  

Commons Lang has an implementation of Levenshtein distance.

Commons Lang实施了Levenshtein距离。

Commons Codec has an implementation of soundex and metaphone.

Commons Codec有soundex和metaphone的实现。

#2


10  

You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.

您可以使用Apache Lucene,但根据用例,这可能是太重了。对于非常简单的模糊搜索,使用起来可能有点复杂(如果我错了,请纠正我)它需要您构建索引。

If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:

如果您需要一个简单的在线(=不维护索引)算法,您可以使用模糊Bitap算法。我在这里找到了Java实现。它的代码适用于一个相对较短的方法,具有几乎自我解释的签名:

public static List<Integer> find(String doc, String pattern, int k)

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals, Bitap is like the fuzzy version of String.indexOf and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.

Apache Commons StringUtils具有用于模糊字符串匹配的Levenshtein算法的实现。它可以看作是String.equals的模糊版本,Bitap就像String.indexOf的模糊版本,仍然使用Levenshtein距离测量。通常使用Levenshtein比较搜索模式与可能匹配的每个子字符串更有效。

Notes:

笔记:

  • The Bitap algorithm seems to be mostly useful for relatively small alphabets, e.g. plain ASCII. In fact the Simon Watiau version I linked to throws an ArrayIndexOutOfBoundsException on non-ASCII characters (>= 128) so you will have to filter these out.
  • Bitap算法似乎主要用于相对较小的字母表,例如纯ASCII。事实上,我链接的Simon Watiau版本会在非ASCII字符(> = 128)上抛出ArrayIndexOutOfBoundsException,因此您必须对这些字符进行过滤。
  • I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2 gives way too many false positives. A Levenhstein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.

    我尝试在应用程序中使用Bimap按名称搜索内存中的人员列表。我发现Levenhstein距离为2时会产生太多误报。 Levenhstein距离为1可以更好地工作,但它无法检测到交换两个字母的拼写错误,例如“威廉”和“威廉姆”。我可以想出几种方法来解决这个问题,例如:

    1. do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
    2. 仅在精确搜索未找到匹配项时进行模糊搜索(并向用户显示有关此内容的消息)
    3. adjust Bitap to use Damerau-Levenshtein distance where a swap has distance 1 instead of 2. According to wikipedia, this is possible, but I could not find an existing implementation in Java.
    4. 调整Bitap使用Damerau-Levenshtein距离,其中交换距离为1而不是2.根据*,这是可能的,但我找不到Java中的现有实现。
    5. instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
    6. 而不是“包含”做一个“startsWith”。模糊搜索工具包含Damerau-Levenshtein的前缀版本,但它给了我一个ArrayIndexOutOfBoundsException
    7. adjust the algorithm to introduce search result ranking where exact matches score higher
    8. 调整算法以引入搜索结果排名,其中精确匹配得分更高

    If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.

    如果您要做2或4,最好使用像Lucene这样的正确全文搜索库。

  • More information on fuzzy search can be found on this blog. It's author also created an implementation in Java called BitapOnlineSearcher, but requires you to use java.io.Reader together with an Alphabet class. It's Javadoc is written in Russian.
  • 有关模糊搜索的更多信息,请访问此博客。它的作者还在Java中创建了一个名为BitapOnlineSearcher的实现,但是要求您将java.io.Reader与Alphabet类一起使用。它的Javadoc是用俄语写的。

#3


9  

If you are mostly comparing short strings and want something portable and lightweight you can use the well known python algorithm fuzzywuzzy ported to Java.

如果你主要是比较短字符串并想要一些便携和轻量级的东西,你可以使用众所周知的python算法fuzzywuzzy移植到Java。

You can read more about it here

你可以在这里读更多关于它的内容

#4


8  

SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/

SimMetrics可能就是您所需要的:http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavours of edit-distance.

它有几种算法用于计算编辑距离的各种风格。

Lucene is a very powerful full-text search engine, but FT search isn't exactly the same thing as fuzzy string matching (eg. given a list of strings find me the one that is most similar to some candidate string).

Lucene是一个非常强大的全文搜索引擎,但FT搜索与模糊字符串匹配并不完全相同(例如,给定一个字符串列表找到我与某个候选字符串最相似的字符串)。

#5


2  

To Lucene I would add SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

到Lucene我会添加SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

#6


1  

You can try bitap. I was playing with bitap written in ANSI C and it was pretty fast there is java implementation in http://www.crosswire.org.

你可以试试bitap。我正在使用ANSI C编写的bitap,并且很快就有http://www.crosswire.org中的java实现。

#7


1  

You can try the Completely library, it relies on text preprocessing to create an in-memory index for efficiently answering (fuzzy) searches in large data sets. Unlike Lucene and other full featured text search libraries, the API is small and easy to get started.

您可以尝试使用Completely库,它依赖于文本预处理来创建内存索引,以便在大型数据集中有效地回答(模糊)搜索。与Lucene和其他全功能文本搜索库不同,API很小且易于上手。

#8


0  

Apache Lucene is the only way, I think. I don't know any better search lib.

我认为Apache Lucene是唯一的方法。我不知道任何更好的搜索库。

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Apache Lucene(TM)是一个完全用Java编写的高性能,功能齐全的文本搜索引擎库。它是一种适用于几乎所有需要全文搜索的应用程序的技术,尤其是跨平台搜索。

#1


35  

Commons Lang has an implementation of Levenshtein distance.

Commons Lang实施了Levenshtein距离。

Commons Codec has an implementation of soundex and metaphone.

Commons Codec有soundex和metaphone的实现。

#2


10  

You can use Apache Lucene, but depending on the use case this may be too heavy weight. For very simple fuzzy searches it may be a bit complex to use and (correct me if I'm wrong) it requires you to build an index.

您可以使用Apache Lucene,但根据用例,这可能是太重了。对于非常简单的模糊搜索,使用起来可能有点复杂(如果我错了,请纠正我)它需要您构建索引。

If you need a simple online (= not maintaining an index) algorithm you can use the fuzzy Bitap algorithm. I found an implementation in Java here. It's code fits in a single relatively short method with an almost self-explaining signature:

如果您需要一个简单的在线(=不维护索引)算法,您可以使用模糊Bitap算法。我在这里找到了Java实现。它的代码适用于一个相对较短的方法,具有几乎自我解释的签名:

public static List<Integer> find(String doc, String pattern, int k)

Apache Commons StringUtils has an implementation of the Levenshtein algorithm for fuzzy String matching. It can be seen as the fuzzy version of String.equals, Bitap is like the fuzzy version of String.indexOf and still uses the Levenshtein distance measure. It is generally more efficient than naively using Levenshtein to compare the search pattern with each substring that could possibly match.

Apache Commons StringUtils具有用于模糊字符串匹配的Levenshtein算法的实现。它可以看作是String.equals的模糊版本,Bitap就像String.indexOf的模糊版本,仍然使用Levenshtein距离测量。通常使用Levenshtein比较搜索模式与可能匹配的每个子字符串更有效。

Notes:

笔记:

  • The Bitap algorithm seems to be mostly useful for relatively small alphabets, e.g. plain ASCII. In fact the Simon Watiau version I linked to throws an ArrayIndexOutOfBoundsException on non-ASCII characters (>= 128) so you will have to filter these out.
  • Bitap算法似乎主要用于相对较小的字母表,例如纯ASCII。事实上,我链接的Simon Watiau版本会在非ASCII字符(> = 128)上抛出ArrayIndexOutOfBoundsException,因此您必须对这些字符进行过滤。
  • I tried using Bimap in an application to search an in-memory list of persons by name. I found that a Levenhstein distance of 2 gives way too many false positives. A Levenhstein distance of 1 works better, but it cannot detect a typo where you swap two letters, e.g. "William" and "Willaim". I can think of a few ways to solve this, e.g.

    我尝试在应用程序中使用Bimap按名称搜索内存中的人员列表。我发现Levenhstein距离为2时会产生太多误报。 Levenhstein距离为1可以更好地工作,但它无法检测到交换两个字母的拼写错误,例如“威廉”和“威廉姆”。我可以想出几种方法来解决这个问题,例如:

    1. do a fuzzy search only when an exact search finds no matches (and show a message to the user about this)
    2. 仅在精确搜索未找到匹配项时进行模糊搜索(并向用户显示有关此内容的消息)
    3. adjust Bitap to use Damerau-Levenshtein distance where a swap has distance 1 instead of 2. According to wikipedia, this is possible, but I could not find an existing implementation in Java.
    4. 调整Bitap使用Damerau-Levenshtein距离,其中交换距离为1而不是2.根据*,这是可能的,但我找不到Java中的现有实现。
    5. instead of "contains" do a "startsWith". The fuzzy search tools contains a prefix version of Damerau-Levenshtein, but it gave me an ArrayIndexOutOfBoundsException
    6. 而不是“包含”做一个“startsWith”。模糊搜索工具包含Damerau-Levenshtein的前缀版本,但它给了我一个ArrayIndexOutOfBoundsException
    7. adjust the algorithm to introduce search result ranking where exact matches score higher
    8. 调整算法以引入搜索结果排名,其中精确匹配得分更高

    If you are going to do 2 or 4, it may be better to use a proper full-text search library like Lucene anyway.

    如果您要做2或4,最好使用像Lucene这样的正确全文搜索库。

  • More information on fuzzy search can be found on this blog. It's author also created an implementation in Java called BitapOnlineSearcher, but requires you to use java.io.Reader together with an Alphabet class. It's Javadoc is written in Russian.
  • 有关模糊搜索的更多信息,请访问此博客。它的作者还在Java中创建了一个名为BitapOnlineSearcher的实现,但是要求您将java.io.Reader与Alphabet类一起使用。它的Javadoc是用俄语写的。

#3


9  

If you are mostly comparing short strings and want something portable and lightweight you can use the well known python algorithm fuzzywuzzy ported to Java.

如果你主要是比较短字符串并想要一些便携和轻量级的东西,你可以使用众所周知的python算法fuzzywuzzy移植到Java。

You can read more about it here

你可以在这里读更多关于它的内容

#4


8  

SimMetrics is probably what you need: http://sourceforge.net/projects/simmetrics/

SimMetrics可能就是您所需要的:http://sourceforge.net/projects/simmetrics/

It has several algorithms for calculating various flavours of edit-distance.

它有几种算法用于计算编辑距离的各种风格。

Lucene is a very powerful full-text search engine, but FT search isn't exactly the same thing as fuzzy string matching (eg. given a list of strings find me the one that is most similar to some candidate string).

Lucene是一个非常强大的全文搜索引擎,但FT搜索与模糊字符串匹配并不完全相同(例如,给定一个字符串列表找到我与某个候选字符串最相似的字符串)。

#5


2  

To Lucene I would add SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

到Lucene我会添加SOLR http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

#6


1  

You can try bitap. I was playing with bitap written in ANSI C and it was pretty fast there is java implementation in http://www.crosswire.org.

你可以试试bitap。我正在使用ANSI C编写的bitap,并且很快就有http://www.crosswire.org中的java实现。

#7


1  

You can try the Completely library, it relies on text preprocessing to create an in-memory index for efficiently answering (fuzzy) searches in large data sets. Unlike Lucene and other full featured text search libraries, the API is small and easy to get started.

您可以尝试使用Completely库,它依赖于文本预处理来创建内存索引,以便在大型数据集中有效地回答(模糊)搜索。与Lucene和其他全功能文本搜索库不同,API很小且易于上手。

#8


0  

Apache Lucene is the only way, I think. I don't know any better search lib.

我认为Apache Lucene是唯一的方法。我不知道任何更好的搜索库。

Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Apache Lucene(TM)是一个完全用Java编写的高性能,功能齐全的文本搜索引擎库。它是一种适用于几乎所有需要全文搜索的应用程序的技术,尤其是跨平台搜索。