Levenshtein distance based methods vs Soundex

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.

4 answers

#1

Soundex is rather primitive - it was originally designed to be calculated by hand. It produces a short key, and two names are treated as a match when their keys are equal.

Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison.

Levenshtein distance compares two strings and returns the minimum number of single-character insertions, deletions, or substitutions needed to turn one into the other. It's looking for missing, extra, or substituted letters.

Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.
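
To make the "Schmidt"/"Smith" case concrete, here is a minimal sketch of American Soundex in Python. It assumes plain ASCII names and follows the commonly documented rules (keep the first letter, encode the following consonants as digits, collapse repeated codes, pad to four characters); the function name `soundex` is just an illustrative choice, not a library API.

```python
def soundex(name: str) -> str:
    """American Soundex key (one letter + three digits) for an ASCII name."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    key = letters[0]                      # the first letter is kept as-is
    prev = codes.get(letters[0], "")      # its code still suppresses repeats
    for ch in letters[1:]:
        code = codes.get(ch, "")          # vowels, H, W and Y have no code
        if code and code != prev:
            key += code
        if ch not in "HW":                # H and W do not separate equal codes
            prev = code
    return (key + "000")[:4]              # pad or truncate to four characters


print(soundex("Schmidt"), soundex("Smith"))  # both keys are S530
```

Because both names collapse to the same key, a lookup by Soundex key treats them as candidates for the same surname.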

Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
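
For comparison, a minimal sketch of Levenshtein distance using the standard dynamic-programming recurrence (two rows at a time). The function name `levenshtein` is an illustrative choice; real projects would more likely use an existing library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))    # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                     # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levnshtein", "Levenshtein"))  # 1: one missing letter
```

A distance of 1 against an 11-letter word is a strong hint that the input is a typo of that word, regardless of how either string sounds.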

#2

I would suggest using Metaphone, not Soundex. As noted, Soundex is old: it was patented in 1918 and designed around American names. Metaphone will give you some results when checking the work of poor spellers who are "sounding it out" and spelling phonetically.

Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key.
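
One detail worth illustrating: plain Levenshtein distance counts two swapped adjacent letters as two edits. The sketch below uses the optimal string alignment variant (a restricted form of Damerau-Levenshtein, which this answer does not name explicitly), so that each of the three typo kinds costs a single edit. The function name `osa_distance` is an assumption for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with adjacent transpositions (cost 1)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                       # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j                       # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(osa_distance("lettter", "letter"))   # repeated letter
print(osa_distance("recieve", "receive"))  # transposed letters
print(osa_distance("hwllo", "hello"))      # wrong key
```

Each of the three calls returns 1, so a threshold of one or two edits catches all of these typo kinds.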

Consider the application to decide which will fit your users best—or use both together, with Metaphone complementing the suggestions produced by Levenshtein.

With regard to the original question, I've used n-grams successfully in information retrieval applications.
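
The answer does not say how the n-grams were used, so the following is only one common approach, sketched under that assumption: split each string into character trigrams and compare the two sets with the Jaccard coefficient. The names `ngrams` and `ngram_similarity` and the space padding are illustrative choices.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0                        # two empty inputs count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_similarity("Levnshtein", "Levenshtein"))
print(ngram_similarity("Soundex", "Levenshtein"))
```

The misspelling shares most of its trigrams with the correct word and scores well above the unrelated word, which shares none.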

#3

I agree with you on Daitch-Mokotoff. Soundex is biased because the original US census takers wanted 'Americanized' names.

Maybe an example of the difference would help:

Soundex puts additional weight on the start of a word: it keeps the first letter unchanged and encodes only the next three consonant sounds, giving a four-character key. So while "Schmidt" and "Smith" will match, "Smith" and "Wmith" won't.

Levenshtein's algorithm would be better for finding typos - one or two missing or replaced letters still give a small distance (a close match), and the phonetic impact of those letters doesn't matter.

I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.
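
As a rough sketch of using both together, the snippet below accepts a candidate if it is either within a small edit distance of the input or shares its Soundex key, then ranks by edit distance. Everything here is an assumption for illustration: the function names, the threshold of 2, the tiny vocabulary, and the choice of Soundex as the phonetic algorithm (Metaphone or Daitch-Mokotoff could be swapped in).

```python
def soundex(name: str) -> str:
    """Compact American Soundex for ASCII names (illustrative helper)."""
    codes = {ch: d for group, d in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                    ("L", "4"), ("MN", "5"), ("R", "6"))
             for ch in group}
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key, prev = letters[0], codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":                # H and W do not reset the previous code
            prev = code
    return (key + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def suggest(word: str, vocabulary: list[str], max_distance: int = 2) -> list[str]:
    """Candidates that are close in spelling OR sound alike, nearest first."""
    word_key = soundex(word)
    scored = []
    for candidate in vocabulary:
        distance = levenshtein(word.lower(), candidate.lower())
        if distance <= max_distance or soundex(candidate) == word_key:
            scored.append((distance, candidate))
    return [candidate for _, candidate in sorted(scored)]


names = ["Schmidt", "Smith", "Jones"]     # hypothetical vocabulary
print(suggest("Smiht", names))            # ['Smith', 'Schmidt']
print(suggest("Wmith", names))            # ['Smith']
```

"Smiht" reaches "Schmidt" only through the phonetic key, while "Wmith" reaches "Smith" only through edit distance, which is the case for keeping both checks.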

#4

@Keith:

As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and, I'd argue, for the US too).

I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.
