Levenshtein distance based methods vs. Soundex

Date: 2021-05-10 19:23:25

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.


4 answers



Soundex is rather primitive - it was originally developed to be hand calculated. It results in a key that can be compared.


Soundex works well with western names, as it was originally developed for US census data. It's intended for phonetic comparison.


Levenshtein distance compares two strings and produces a number based on how similar they are: the count of inserted, deleted, or substituted characters needed to turn one into the other.


Basically Soundex is better for finding that "Schmidt" and "Smith" might be the same surname.


Levenshtein distance is better for spotting that the user has mistyped "Levnshtein" ;-)
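To make the typo case concrete, here is a minimal sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein` is my own choice for illustration; real projects would more likely use a library implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Levnshtein", "Levenshtein"))  # 1: one missing letter
print(levenshtein("Schmidt", "Smith"))           # 4: far apart as spellings
```

The mistyped "Levnshtein" is a single edit away from the correct spelling, while "Schmidt" and "Smith" are four edits apart even though they sound alike, which is exactly the split between the two approaches described above.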




I would suggest using Metaphone, not Soundex. As noted, Soundex was developed in the early 20th century for American names. Metaphone gives useful matches when checking the work of poor spellers who are "sounding it out" and spelling phonetically.


Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key.
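One caveat on transposed letters: plain Levenshtein distance charges two substitutions for a swap of neighbouring characters. A common variant, the optimal string alignment distance (a restricted form of Damerau-Levenshtein), counts an adjacent swap as one edit. The sketch below is an illustration under that assumption, not something the posts above prescribe, and `osa_distance` is a name I made up.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and swaps of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ie" typed as "ei", for example.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("recieve", "receive"))  # 1: one transposition
print(osa_distance("leter", "letter"))     # 1: one dropped repeat
print(osa_distance("cat", "cay"))          # 1: one wrong key
```

All three typo classes mentioned above come out at distance 1, so a threshold of 1 or 2 is a reasonable starting point for "did you mean" suggestions on short words.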


Consider the application to decide which will fit your users best—or use both together, with Metaphone complementing the suggestions produced by Levenshtein.


With regard to the original question, I've used n-grams successfully in information retrieval applications.
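The post does not say which kind of n-grams were used, so the following is only a sketch of one common reading: character bigrams compared with the Jaccard coefficient. The function names, the `#` padding character, and the choice of Jaccard are all assumptions made for illustration.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Return the set of character n-grams of text, lower-cased and
    padded with '#' so the first and last letters get their own n-grams."""
    pad = "#" * (n - 1)
    padded = pad + text.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two n-gram sets: 1.0 means identical
    sets, 0.0 means no n-gram in common."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty inputs are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_similarity("Levnshtein", "Levenshtein"))
print(ngram_similarity("Levnshtein", "Soundex"))
```

The appeal for retrieval is that n-gram sets can be indexed ahead of time, so candidate matches are found by set overlap rather than by computing an edit distance against every stored term.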




I agree with you on Daitch-Mokotoff; Soundex is biased because the original US census takers wanted 'Americanized' names.


Maybe an example on the difference would help:


Soundex puts additional weight on the start of a word: the key keeps the first letter as-is and then encodes at most three further consonant sounds. So while "Schmidt" and "Smith" will match, "Smith" and "Wmith" won't.
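For anyone who wants to check this example, here is a minimal sketch of American Soundex as it is commonly described (first letter kept, consonants mapped to digits, adjacent equal codes collapsed, h and w transparent, padded to four characters). It assumes plain ASCII names; the function name and table layout are mine.

```python
# Consonant groups and their Soundex digits; vowels, y, h, w have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of name ('' if it has no letters)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()                 # first letter is kept verbatim
    last_code = _CODES.get(letters[0], "")   # but still blocks a repeat code
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal codes
        code = _CODES.get(ch, "")  # vowels and y give "" and reset last_code
        if code and code != last_code:
            key += code
            if len(key) == 4:
                break
        last_code = code
    return key.ljust(4, "0")  # pad short keys with zeros


print(soundex("Schmidt"))  # S530
print(soundex("Smith"))    # S530
print(soundex("Wmith"))    # W530
```

"Schmidt" and "Smith" share the key S530, while the one-key typo "Wmith" lands on W530 and is missed, because a different first letter always produces a different key.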


Levenshtein's algorithm would be better for finding typos: one or two missing or replaced letters still give a small edit distance (that is, a high similarity), while the phonetic impact of those missing letters is less important.


I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.





As I posted on the other question, Daitch-Mokotoff is better for us Europeans (and, I'd argue, for the US too).


I've also read the Wiki on Levenshtein. But I don't see why (in real life) it's better for the user than Soundex.



