I want to filter out duplicate customer names from a database. A single customer may have more than one entry in the system under the same name, with slight differences in spelling. For example, a customer named * may have three entries in the system with these variations:
- * Berta
- Bruck Berta
- Biruk Berta
Let's assume we store this name in a single database column. I would like to know the different mechanisms for identifying such duplicates among, say, 100,000 records. We could use regular expressions in C# to iterate through all the records, use some other pattern-matching technique, or export the records to whatever tool best fits such queries (for example, SQL with regular expression capabilities).
This is what I thought of as a solution:
- Write C# code to iterate through each record
- Get only the consonant letters in order (in the above case: BrKBrt)
- Search the other records for the same consonant pattern, treating similar-sounding letters such as (C, K), (C, S), (F, PH) as equivalent
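The steps above can be sketched roughly as follows. This is only an illustration in Python rather than C#, and the names `REPLACEMENTS`, `consonant_key` and `group_by_key` are made up for the example. The replacement rules are an assumption too: a single key cannot treat C as both K and S, so this sketch always maps C to K.

```python
import re
from collections import defaultdict

# Hypothetical equivalence rules: similar-sounding spellings collapse to one letter.
# Order matters: multi-letter patterns are replaced before single letters.
REPLACEMENTS = [("PH", "F"), ("CK", "K"), ("C", "K")]


def consonant_key(name):
    """Return the ordered consonants of a name after normalising similar sounds."""
    text = re.sub(r"[^A-Z]", "", name.upper())  # keep ASCII letters only
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return re.sub(r"[AEIOU]", "", text)  # drop vowels, keep consonant order


def group_by_key(names):
    """Group names sharing a consonant key; return only groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[consonant_key(name)].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(group_by_key(["Bruck Berta", "Biruk Berta", "Anna Smith"]))
# {'BRKBRT': ['Bruck Berta', 'Biruk Berta']}
```

Grouping by key is a single pass over the records, so 100,000 names do not need to be compared pairwise.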
So please forward any ideas.
8 solutions
#1
8
The Double Metaphone algorithm, published in 2000, is a newer, improved alternative to the Soundex algorithm, which was patented in 1918.
The article has links to Double Metaphone implementations in many languages.
#2
2
Have a look at Soundex
There is a Soundex function in Transact-SQL (see http://msdn.microsoft.com/en-us/library/ms187384.aspx):
SELECT
SOUNDEX('* berta'),
SOUNDEX('Bruck Berta'),
SOUNDEX('Biruk Berta')
This returns the same value, B620, for each of the example values.
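For anyone who wants the same kind of code outside the database, here is a minimal sketch of classic American Soundex in Python. The function name `soundex` and the table `CODES` are just names for this example, and this is not the T-SQL implementation. It encodes one word at a time; the B620 reported above matches what this sketch gives for the first word of each name.

```python
# Letter groups of classic Soundex; vowels, H, W and Y carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of a single word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same digit
            prev = digit
    return (code + "000")[:4]  # pad with zeros, truncate to 4 characters


for name in ("Bruck Berta", "Biruk Berta"):
    print(name, [soundex(part) for part in name.split()])
# Bruck Berta ['B620', 'B630']
# Biruk Berta ['B620', 'B630']
```

Encoding each word separately, as in the usage loop, also lets the surname take part in the comparison.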
#3
2
The obvious, established (and well-documented) algorithms for finding string similarity are:
- Levenshtein distance
- Soundex
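To make the first item concrete, here is a minimal sketch of Levenshtein distance using the usual dynamic-programming table, kept to two rows at a time. The function name `levenshtein` is just a name for this example. What distance counts as "the same customer" is not something the algorithm decides for you; that threshold is a judgment call.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("Bruck Berta", "Biruk Berta"))  # 2
```

Comparing every pair among 100,000 records this way means billions of comparisons, so it is usually combined with a cheaper first pass (for example a phonetic key) that narrows down the candidates.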
#4
1
I would consider writing something like the "famous" Python spell checker.
http://norvig.com/spell-correct.html
This takes a word and generates all candidate alternatives by deleting letters, inserting letters, replacing letters, swapping adjacent letters, etc.
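The candidate-generation idea can be sketched like this. It is written in the spirit of the linked spell checker, not copied from it, and `known_variants` is a hypothetical helper for this example. The spelling "bruk" in the usage line is also made up for illustration.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """Return every string that is one edit away from word: a deletion,
    an adjacent transposition, a replacement, or an insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known_variants(name, known_names):
    """Return the known names that are exactly one edit away from name."""
    return edits1(name) & set(known_names)


print(sorted(known_variants("bruk", {"bruck", "biruk", "berta"})))
# ['biruk', 'bruck']
```

Because the candidates are looked up in a set, the cost per name depends on the name's length rather than on how many records there are.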
#5
1
You might want to search for "phonetic similarity algorithm"; you'll find plenty of information about this, including this article on CodeProject about implementing a solution in C#.
#6
1
Look into Soundex. Implementations are readily available for most languages, and it does what you require, i.e. it algorithmically identifies phonetic similarity. http://en.wikipedia.org/wiki/Soundex
#7
1
There is a very nice R (just search for "R" in Google) package for Record Linkage. The standard examples target exactly your problem: R RecordLinkage
The C-Code for Soundex etc. is taken directly from PostgreSQL!
#8
0
I would recommend Soundex and derived algorithms over Levenshtein distance for this problem. Levenshtein distance is more appropriate for spell-checking solutions, IMHO.