I am looking for a reference database that can be used to test for possible name typos in a contact database. This is for a batch process, so performance isn't a real issue. Ideally I'd like a comprehensive database, but even something like "top 5000" would go a long way.
Thanks!
6 answers
#1
18
I don't know of a ready-made database, but populating one yourself from a resource such as http://www.census.gov/genealogy/names/dist.all.last should work fine :)
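As a rough sketch of what "populating one yourself" could look like: the snippet below assumes the census file is whitespace-separated with the surname in the first column, and it uses Python's standard difflib to suggest close spellings. The function names, the 0.8 cutoff, and the sample rows are illustrative assumptions, not part of the original answer. A name that is missing from the list is only a candidate for review, not a confirmed typo.

```python
import difflib

def load_surnames(lines):
    """Build a set of upper-cased surnames from census-style rows."""
    names = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            names.add(fields[0].upper())
    return names

def suspicious(name, known, cutoff=0.8):
    """Return close known spellings if `name` is not in the list, else []."""
    key = name.strip().upper()
    if key in known:
        return []
    return difflib.get_close_matches(key, known, n=3, cutoff=cutoff)

# Sample rows in the assumed layout (name, then numeric columns);
# the numbers are illustrative only.
sample = [
    "SMITH          1.006  1.006      1",
    "JOHNSON        0.810  1.816      2",
    "WILLIAMS       0.699  2.515      3",
]
known = load_surnames(sample)
print(suspicious("Smith", known))    # exact match, nothing to report
print(suspicious("Jonhson", known))  # near miss, close spellings suggested
```

For the real file, pass an open file handle instead of the sample list, for example `load_surnames(open("dist.all.last"))`.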
#2
13
I don't understand how you can find typos in names. I mean, my first name is Philippe (French), but it can be Philip, Philips, Felipe, Fèlipe, or anything else. Likewise, there is a traditional French name, Sandrine, but a trend is to write it Cendrine, even more so since the law was relaxed recently in France. And so on.
OK, perhaps a Jhon smells like a typo (a common inversion of two letters), but you can't tell for sure.
Typos in last names are even harder to detect... unless you check against a limited, known list (employees of a company, for example).
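The "Jhon" case can be sketched as a check for adjacent-letter swaps against a known list. This is a minimal illustration: the tiny name set and the function names are made up for the example, and a hit only means "worth a human look", in line with the caveat that you can't tell for sure.

```python
def transposition_candidates(name, known):
    """Return known names reachable from `name` by swapping two adjacent letters."""
    key = name.strip().lower()
    found = []
    for i in range(len(key) - 1):
        if key[i] == key[i + 1]:
            continue  # swapping identical letters changes nothing
        swapped = key[:i] + key[i + 1] + key[i] + key[i + 2:]
        if swapped in known and swapped not in found:
            found.append(swapped)
    return found

# Hypothetical reference list; a real one would come from a name database.
known_first_names = {"john", "philippe", "sandrine", "cendrine"}

def review(name):
    """Classify a name as 'ok', 'check: ...' (possible swap), or 'unknown'."""
    key = name.strip().lower()
    if key in known_first_names:
        return "ok"
    hits = transposition_candidates(name, known_first_names)
    return "check: " + ", ".join(hits) if hits else "unknown"

for candidate in ("Jhon", "Cendrine", "Felipe"):
    print(candidate, "->", review(candidate))
```

Note that a legitimate variant missing from the list (Felipe here) comes back as "unknown" rather than as a typo, which is about as much as this kind of check can honestly claim.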
#3
13
I know a first name database, http://www.lexique.org/public/Prenoms100.zip, which covers Phil, Phile, Philip, Philipp, Phillip, Felipe, Philippe (around 12000 first names).
I think you won't find anything useful for last names, as they are far more numerous than first names. This is a known problem in computational linguistics.
#4
2
If there is no additional language information involved, this can be pretty useless. I would not spend effort on this, as it probably works for only a small percentage of the population.
PS: Don't forget the Chinese, Russian and Indian names (millions)
#5
2
I personally know people who have unique names (names their parents deliberately made up to be unique), and I also know people whose names appear to be misspelled but are actually what their parents named them. I would not even attempt to fix name typos. What we do instead is import the names (and we require a unique identifier to come from our clients). Then, the next time we import, we match on the unique identifier. If the name was changed on our side (because we contacted the person and he or she told us what to change it to), then the name is not updated. But if the name was not changed on our side and it is different in the file (usually because of a marriage or divorce), then the name is updated. You'll need some kind of flag on the data record to tell that it was updated manually. We populate this through a trigger.
Far more important when importing name data is to avoid creating duplicates (hence our requirement for a unique identifier from our data sources) and to avoid matching data incorrectly (you can't just consider the name when matching to see if the record already exists).
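The import rule described in this answer can be sketched roughly as follows. This is an in-memory illustration only: the `Contact` class, the `manually_edited` field, and the helper names are hypothetical, and `edit_manually` stands in for the database trigger mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    external_id: str          # unique identifier supplied by the data source
    name: str
    manually_edited: bool = False

def apply_import(store, rows):
    """Merge (external_id, name) rows into `store`, a dict keyed by external_id."""
    for external_id, name in rows:
        existing = store.get(external_id)
        if existing is None:
            # New identifier: create the record, never match on name alone.
            store[external_id] = Contact(external_id, name)
        elif existing.manually_edited:
            continue  # keep the name the person confirmed
        elif existing.name != name:
            existing.name = name  # e.g. marriage or divorce reported by the source
    return store

def edit_manually(store, external_id, new_name):
    """Record a correction confirmed by the person and protect it from imports."""
    contact = store[external_id]
    contact.name = new_name
    contact.manually_edited = True

store = {}
apply_import(store, [("c1", "Jane Smith"), ("c2", "Jhon Doe")])
edit_manually(store, "c2", "John Doe")
apply_import(store, [("c1", "Jane Jones"), ("c2", "Jhon Doe")])
print(store["c1"].name, "|", store["c2"].name)
```

Keying on the identifier means a re-import cannot create a duplicate for an existing contact, and the flag keeps a stale source file from overwriting a confirmed correction.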
#6
1
I found some databases that aren't used for the purpose of checking spelling, but here's one that lists common first names: Name Genders Database, and another that lists common last names: Name Ethnicities Database
Hope that helps!