在MYSQL中，我如何用PHP来完成公司名称的模糊匹配?

My users will import through cut and paste a large string that will contain company names.

我的用户将通过剪切和粘贴一个包含公司名称的大字符串来导入。

I have an existing and growing MYSQL database of companies names, each with a unique company_id.

我有一个现有且不断增长的MYSQL数据库，其中包含公司名称，每个名称都有唯一的company_id。

I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.

我希望能够解析字符串并为每个用户输入的公司名称分配一个模糊匹配。

Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **

现在，做一个直接的字符串匹配，也是很慢的。** Soundex会更快吗?如何在用户输入时给他们一些选项?* *

For example, someone writes:

例如,有人写道:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:

我发现了以下与这个问题相似的线索，但是海报没有被批准，我不确定他们的用例是否适用:

How to find best fuzzy match for a string in a large string database

如何在大的字符串数据库中找到最佳的模糊匹配

Matching inexact company names in Java

在Java中匹配不精确的公司名称

7 个解决方案

#1

You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

您可以从使用SOUNDEX()开始，这可能会满足您的需要(我想象一下，对于用户输入的内容，一个已经存在的替代方案的自动建议框)。

The drawbacks of SOUNDEX() are:

SOUNDEX()的缺点是:

its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
它不能区分更长的字符串。只有前几个字符被考虑在内，较长的字符串在最后分叉产生相同的SOUNDEX值
the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
第一个字母必须是相同的，否则你很难找到匹配的。SQL Server有DIFFERENCE()函数来告诉您两个SOUNDEX值之间有多少个值，但我认为MySQL没有这种内置的值。
for MySQL, at least according to the docs, SOUNDEX is broken for unicode input
对于MySQL来说，至少在文档中是这样的，但对于unicode输入，SOUNDEX是被破坏了的

Example:

例子:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

对于更高级的需求，我认为您需要查看两个字符串的Levenshtein距离(也称为“编辑距离”)，并使用阈值。这是更复杂(=更慢)的解决方案，但它允许更大的灵活性。

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

主要的缺点是，您需要两个字符串来计算它们之间的距离。使用SOUNDEX，您可以在您的表中存储一个预先计算好的SOUNDEX，并对其进行比较/排序/组/过滤器。使用Levenshtein距离，您可能会发现“Microsoft”和“Nzcrosoft”之间的区别仅仅是2，但是要达到这个结果需要更多的时间。

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).

在任何情况下，都可以在codejanitor.com上找到Levenshtein距离函数示例(2007年2月10日):Levenshtein距离作为MySQL存储函数。

#2

SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.

SOUNDEX是一个不错的算法，但是最近在这个话题上有了一些进展。另一种算法被创建为变音位，后来被修改为双变音位算法。我个人使用了java apache commons实现的double metaphone，它是可定制的，而且准确。

They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.

他们在*页面上也有很多其他语言的实现。这个问题已经得到了回答，但是如果您在应用程序中发现了与SOUNDEX有关的任何问题，那么很高兴知道有一些选项。有时它可以为两个完全不同的词生成相同的代码。创建双变音是为了帮助解决这个问题。

Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex

偷从*:http://en.wikipedia.org/wiki/Soundex

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

为了应对Soundex算法的不足，劳伦斯·飞利浦(Lawrence Philips)开发了Metaphone算法。飞利浦后来发展到变音变，他称之为双变音位。双变音位包括一个比它的前身更大的编码规则集，处理非拉丁字符的子集，并返回主和次编码，以解释英语中单个单词的不同发音。

At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone

在double metaphone页面的底部，他们实现了各种编程语言:http://en.wikipedia.org/wiki/Double-Metaphone

Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone

Python和MySQL实现:https://github.com/AtomBoy/double-metaphone

#3

Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.

首先，我想补充一点，在使用任何形式的语音/模糊匹配算法时都要非常小心，因为这种逻辑就是模糊的，或者更简单地说;可能不准确。当用于匹配公司名称时尤其如此。

A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.

一个好的方法是从其他数据中寻找证据，如地址信息、邮政编码、电话号码、地理坐标等。这将有助于确认您的数据被准确匹配的可能性。

There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matching in my blog, but in summary the key issues are:

B2B数据匹配有很多问题需要解决，我在博客中写了更多关于公司名称匹配的内容，但总的来说，关键问题是:

Looking at the whole string is unhelpful as the most important part of a Company Name is not necessarily at the beginning of the Company Name. i.e. ‘The Proctor and Gamble Company’ or ‘United States Federal Reserve ‘
查看整个字符串是没有帮助的，因为公司名称最重要的部分不一定在公司名称的开头。即“宝洁公司”或“美国联邦储备委员会”
Abbreviations are common place in Company Names i.e. HP, GM, GE, P&G, D&B etc..
缩写在公司名称中很常见，例如HP, GM, GE, P&G, D&B等等。
Some companies deliberately spell their names incorrectly as part of their branding and to differentiate themselves from other companies.
一些公司故意把自己的名字拼错，作为品牌的一部分，并将自己与其他公司区分开来。

Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.

匹配精确的数据是容易的，但是匹配不精确的数据可能会花费更多的时间，我建议您应该考虑如何验证不精确的匹配，以确保这些匹配的质量是可接受的。

Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful Visualisation tool enabling us to review non-exact matches, this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.

在我们创建Match2Lists.com之前，我们常常花不健康的时间验证模糊匹配。在Match2Lists我们引入了一个强大的可视化工具，使我们可以查看不精确的匹配，这证明了在匹配验证方面是一个真正的游戏改变者，降低了我们的成本，并使我们能够更快地交付结果。

Best of Luck!!

好运! !

#4

Here's a link to the php discussion of the soundex functions in mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.

下面是对mysql和php中soundex函数的php讨论的链接。我将从这里开始，然后扩展到其他定义不明确的需求。

Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.

您的参考文献引用Levenshtein方法进行匹配。两个问题。1。它更适合测量两个已知单词之间的差异，而不是搜索。2。它讨论了一个解决方案，设计了更多的方法来检测类似打样的错误(使用“Levenshtien”代替“Levenshtein”)，而不是拼写错误(用户不知道如何拼写，比如“Levenshtein”和“Levinstein”中的类型)。我通常把它与在书中查找短语联系起来，而不是在数据库中查找键值。

EDIT: In response to comment--

编辑:回应评论——

Can you at least get the users to put the company names into multiple text boxes; 2. or use an unambigous name delimiter (say backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. Squoosh the spaces out and match for that also (so Micro Soft => microsoft, Bare Essentials => bareessentials); 5. Filter out punctuation; 6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
你能不能至少让用户把公司名称放在多个文本框中;2。或者使用一个unambigous名称分隔符(比如反斜杠);3所示。省略冠词(“The”)和通用缩写(或者你可以过滤它们);4所示。把这些空间清理干净，并与之匹配(所以Micro Soft => microsoft, Bare Essentials => bareessentials);5。过滤标点符号;6。做“或者”搜索单词(“裸”或“必需品”)——人们有时不可避免地会遗漏一个或另一个。

Test like mad and use the feedback loop from users.

像mad一样测试并使用用户的反馈循环。

#5

the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/

模糊匹配的最佳函数是levenshtein。它通常被拼写检查器使用，所以这可能是正确的方法。这里有一个UDF: http://joshdrew.com/

the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.

使用levenshtein的缺点是不能很好地扩展。更好的方法可能是将整个表转储到一个拼写检查器自定义字典文件中，并从应用程序层而不是数据库层执行建议。

#6

This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.

这个答案导致使用2或3个字符或更多字符的输入对几乎所有实体进行索引查找。

Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.

基本上，创建一个包含两个列的新表，word和key。在包含要模糊搜索的列的原始表上运行一个进程。此过程将从原始列中提取每个单词，并将这些单词连同原始键一起写入word表。在这个过程中，经常出现的单词如“the”、“and”等应该被丢弃。

We then create several indices on the word table, as follows...

然后我们在单词表上创建几个索引，如下所示……

A normal, lowercase index on word + key
一个普通的，小写的字+键索引
An index on the 2nd through 5th character + key
第二个到第五个字符的索引+键
An index on the 3rd through 6th character + key

第3到第6个字符的索引+键

Alternately, create a SOUNDEX() index on the word column.

或者，在单词列上创建一个SOUNDEX()索引。

Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.

一旦设置好，我们将使用普通的word = input或类似的input%进行任何用户输入和搜索。我们从不做LIKE %的输入，因为我们总是在前3个字符中寻找匹配，这些字符都被索引。

If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.

如果原始表很大，可以将word表按字母表的块进行分区，以确保用户的输入被立即缩小到候选行。

#7

-1

May be its too late but it might help others. Check this link out.It uses the levenshtein distance metrics but is much faster. http://narenonit.blogspot.com/2012/07/fuzzy-matching-autocomplete-library.html

可能为时已晚，但它可能会帮助他人。看看这个链接。它使用levenshtein距离度量，但要快得多。http://narenonit.blogspot.com/2012/07/fuzzy-matching-autocomplete-library.html

#1