How to find the best fuzzy match for a string in a large string database

I have a database of strings (arbitrary length) which holds more than one million items (potentially more).

I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.

My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length.
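
A minimal sketch of this idea in Python, assuming similarity is defined as `1 - distance / max(len(a), len(b))` (the question does not say how the 60% is measured, so that definition is an assumption). Because the edit distance can never be smaller than the difference in length, any candidate whose length differs too much from the query can be skipped without computing the distance at all. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_matches(query, candidates, threshold=0.6):
    """Return (score, string) pairs with score >= threshold, best first."""
    results = []
    for s in candidates:
        longest = max(len(query), len(s))
        if longest == 0:
            results.append((1.0, s))
            continue
        # Largest distance that still satisfies the similarity threshold.
        max_distance = (1 - threshold) * longest
        # Length filter: distance >= length difference, so skip early.
        if abs(len(query) - len(s)) > max_distance:
            continue
        distance = levenshtein(query, s)
        if distance <= max_distance:
            results.append((1 - distance / longest, s))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results


if __name__ == "__main__":
    db = ["kitten", "sitting", "mitten", "a", "kitchen"]
    for score, s in best_matches("kitten", db):
        print(f"{score:.2f} {s}")
```

This still scans every string that passes the length filter, so on its own it is unlikely to meet a one-second budget for millions of long strings; it mainly shows how the length bound follows from the threshold.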

However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.

Any ideas on how to approach this problem differently or how to build the in-memory index?

7 Solutions

#1

This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

#2

You didn't mention your database system, but for PostgreSQL you could use the following contrib module: pg_trgm, trigram matching for PostgreSQL.

The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
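
To illustrate the idea behind trigram matching outside the database, here is a rough in-memory sketch in Python. It is not pg_trgm's implementation: the padding, the lowercasing, and the score (shared trigrams divided by the total number of distinct trigrams) are simplifying assumptions, and the class and method names are made up for the example. An inverted index maps each trigram to the strings containing it, so a query only touches strings that share at least one trigram with it.

```python
from collections import defaultdict


def trigrams(text):
    """Set of 3-character windows over the padded, lowercased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Hypothetical in-memory inverted index: trigram -> ids of strings."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._strings = []
        self._grams = []

    def add(self, text):
        string_id = len(self._strings)
        grams = trigrams(text)
        self._strings.append(text)
        self._grams.append(grams)
        for gram in grams:
            self._postings[gram].add(string_id)

    def search(self, query, threshold=0.6):
        """Return (score, string) pairs with score >= threshold, best first."""
        query_grams = trigrams(query)
        shared = defaultdict(int)
        # Count shared trigrams only for strings that share at least one.
        for gram in query_grams:
            for string_id in self._postings.get(gram, ()):
                shared[string_id] += 1
        results = []
        for string_id, common in shared.items():
            union = len(query_grams) + len(self._grams[string_id]) - common
            score = common / union
            if score >= threshold:
                results.append((score, self._strings[string_id]))
        results.sort(key=lambda pair: (-pair[0], pair[1]))
        return results


if __name__ == "__main__":
    index = TrigramIndex()
    for s in ["hello world", "hello word", "goodbye"]:
        index.add(s)
    for score, s in index.search("hello world"):
        print(f"{score:.2f} {s}")
```

Trigram similarity is not the same measure as edit-distance similarity, so the survivors of this step could still be re-ranked with an edit distance if that is the measure that matters.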

#3

If your database supports it, you should use full-text search. Otherwise, you can use an indexer like Lucene and its various implementations.

#4

Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.

SOUNDEX is a hash based on the sound of the words, so spelling errors of the same word are likely to have the same SOUNDEX hash.

Then find the SOUNDEX hash of the search string, and match on it.
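
As a rough illustration, the following Python sketch implements the common American Soundex rules and groups strings by their code in a dictionary. It encodes the whole input as a single word, which is an assumption made for brevity; Soundex was designed for English surnames, so multi-word or very long strings would probably need to be split into words first. Database engines may differ in details from this version.

```python
from collections import defaultdict

# Letter -> digit table of American Soundex.
_CODES = {}
for _digit, _letters in enumerate(
        ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(word):
    """Four-character Soundex code, or '' if the input has no ASCII letters."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this to "", allowing a repeat
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")


def build_index(strings):
    """Group strings by Soundex code."""
    index = defaultdict(list)
    for s in strings:
        index[soundex(s)].append(s)
    return index


if __name__ == "__main__":
    index = build_index(["Robert", "Rupert", "Rubin"])
    print(soundex("Robbert"), index.get(soundex("Robbert"), []))
```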

#5

Since the amount of data is large, when inserting a record I would compute and store the value of the phonetic algorithm in an indexed column and then constrain (WHERE clause) my select queries within a range on that column.
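
A small sketch of that layout, using Python's built-in `sqlite3` module purely for illustration. The table, column, and function names are made up, and `phonetic_key` is a deliberately crude, hypothetical stand-in for whichever phonetic algorithm is actually chosen. The query uses an equality match on the indexed column; a range or prefix condition on the same column would work the same way.

```python
import sqlite3


def phonetic_key(text):
    """Hypothetical stand-in for a real phonetic algorithm.

    Keeps the first letter, drops later vowels plus y/h/w, and
    collapses immediately repeated letters.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "aeiouyhw":
            continue
        if c != key[-1]:
            key.append(c)
    return "".join(key)


def open_db():
    """Create an in-memory table with an index on the phonetic key column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE strings ("
        " id INTEGER PRIMARY KEY,"
        " value TEXT NOT NULL,"
        " phon_key TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_strings_phon_key ON strings (phon_key)")
    return conn


def insert_string(conn, value):
    """Compute the key once, at insert time, and store it with the row."""
    conn.execute(
        "INSERT INTO strings (value, phon_key) VALUES (?, ?)",
        (value, phonetic_key(value)),
    )


def candidates(conn, query):
    """Fetch only rows whose stored key matches the query's key."""
    rows = conn.execute(
        "SELECT value FROM strings WHERE phon_key = ? ORDER BY value",
        (phonetic_key(query),),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    for name in ["Robert", "Robbert", "Rupert"]:
        insert_string(conn, name)
    print(candidates(conn, "Robertt"))
```

The rows returned this way are only candidates; a stricter comparison such as edit distance can then be run on that much smaller set.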

#6

A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

#7

https://en.wikipedia.org/wiki/Levenshtein_distance

The Levenshtein algorithm has been implemented in some DBMSs

(e.g. PostgreSQL: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html)

