First name variations in a database

I am trying to determine the best way to find variations of a first name in a database. For example, I search for Bill Smith. I would like it to return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.

Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone, and then searching on the first name using this name hierarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.

As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.

Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!

Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.

The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?

Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain which variations exist, either in a database table, or as a thesaurus. Neither has an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
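
For illustration, the registration-time split could look roughly like the Python sketch below. This is not the asker's actual code: the function name, the tiny suffix list, and the assumption of Western "forename ... surname" order are all hypothetical, which is exactly why letting the user edit the parts afterwards matters.

```python
# Hypothetical, deliberately incomplete list of suffixes to recognize.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def split_full_name(full_name: str) -> dict:
    """Naively split a full name into forename, middle, surname and suffix.

    Assumes Western name order; multi-word surnames and many other
    cases will be split incorrectly and need manual correction.
    """
    tokens = full_name.replace(",", " ").split()

    # Peel off a trailing suffix such as "Jr." if one is present.
    suffix = ""
    if len(tokens) > 1 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()

    forename = tokens[0] if tokens else ""
    surname = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])  # everything between first and last token

    return {
        "forename": forename,
        "middle": middle,
        "surname": surname,
        "suffix": suffix,
    }


if __name__ == "__main__":
    print(split_full_name("William Henry Smith Jr."))
```

Storing the parts separately keeps the option of addressing the user by forename while still allowing a different matching strategy per part.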

9 Answers

#1

No, Full Text searches will not help to solve your problem.

I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)

SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms

Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
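
To make the idea concrete, here is a minimal Python sketch of the common American Soundex rules, which would also suit pre-seeding phonetic keys in application code. It is an illustration only and may not match the output of SQL Server's built-in SOUNDEX function in every edge case.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    # Map each consonant to its Soundex digit; vowels, H, W and Y have none.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in group:
            codes[letter] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(c, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets this, so a repeated code counts again

    return (result + "000")[:4]  # pad with zeros, truncate to four characters


if __name__ == "__main__":
    for n in ("Smith", "Smyth", "Bill", "William"):
        print(n, soundex(n))
```

Note that "Smith" and "Smyth" share a code while "Bill" and "William" do not, so a phonetic key helps with spelling variants but not with nicknames.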

As a side issue, instead of returning similar results, it might prove more intuitive to the user to use an AJAX-based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.

#2

In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).

In the case of human names, a computer will get it wrong most of the time; doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.

#3

I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have a large amount of the data you want.

#4

Are you using SQL Server 2005 Express with Advanced Services? It sounds to me like you would benefit from Full Text indexing, and more specifically from CONTAINS and CONTAINSTABLE, which you can use with specific instructions. Here is a link on the uses of CONTAINSTABLE:

http://msdn.microsoft.com/en-us/library/ms189760.aspx

and here is the download link for SQL Server 2005 With Advanced Services:

http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en

Hope this helps,

Andrew

#5

You can use the SQL Server Full Text Search and do an inflectional search.

Basically like:

SELECT ProductId, ProductName FROM ProductModel WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')

Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491

#6

Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names; women change these all the time, which makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search).
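
A rough sketch of that layout, using Python's built-in sqlite3 module only so the example is self-contained. The table names, column names, and the find_users helper are made up for illustration; on SQL Server the same two-table shape would apply with its own types and collation settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The main name: used for display and everything except search.
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        forename TEXT NOT NULL,
        surname  TEXT NOT NULL
    );
    -- Other names the user has chosen to be findable under.
    CREATE TABLE user_aliases (
        user_id  INTEGER NOT NULL REFERENCES users(id),
        forename TEXT NOT NULL,
        surname  TEXT NOT NULL,
        PRIMARY KEY (user_id, forename, surname)
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Joseph', 'Smith')")
conn.execute("INSERT INTO user_aliases VALUES (1, 'Joe', 'Smith')")
conn.execute("INSERT INTO users VALUES (2, 'Mary', 'Jones')")
conn.execute("INSERT INTO user_aliases VALUES (2, 'Mary', 'Brown')")  # maiden name


def find_users(conn, forename, surname):
    """Match either the main name or any alias; always return the main name."""
    rows = conn.execute("""
        SELECT id, forename, surname FROM users
        WHERE forename = :f COLLATE NOCASE AND surname = :s COLLATE NOCASE
        UNION
        SELECT u.id, u.forename, u.surname
        FROM user_aliases a JOIN users u ON u.id = a.user_id
        WHERE a.forename = :f COLLATE NOCASE AND a.surname = :s COLLATE NOCASE
    """, {"f": forename, "s": surname})
    return rows.fetchall()


if __name__ == "__main__":
    print(find_users(conn, "joe", "smith"))
    print(find_users(conn, "Mary", "Brown"))
```

Because the aliases are supplied by the users themselves, this sidesteps guessing at intent, at the cost of only covering names people bothered to enter.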

#7

You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.

Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.

It's a known problem, and there are open source stemmers available.

-Adam

#8

The term you are looking for is Hypocorism:

http://en.wikipedia.org/wiki/Hypocorism

And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.

I would go with a structure like this:

create table given_names (
  id int primary key,
  name text not null unique
);

create table hypocorisms (
  id int references given_names(id),
  name text not null,

  primary key (id, name)
);

insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');

Then you could write a function/sproc to normalize a name:

normalize_given_name('Bill'); --returns William

One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
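
One way to live with that ambiguity is to have the lookup return every formal name a nickname may stand for, rather than a single value. The sketch below uses the two tables from above in an in-memory SQLite database via Python's sqlite3 module; expand_given_name and the extra sample rows are assumptions added for illustration, and the set-returning normalize_given_name is a variation on the single-value function suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE given_names (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE hypocorisms (
        id   INTEGER REFERENCES given_names(id),
        name TEXT NOT NULL,
        PRIMARY KEY (id, name)
    );
    INSERT INTO given_names VALUES (1, 'William'), (2, 'Albert'), (3, 'Alan');
    INSERT INTO hypocorisms VALUES (1, 'Bill'), (1, 'Billy'), (1, 'Willy'),
                                   (2, 'Al'), (3, 'Al');
""")


def normalize_given_name(conn, name):
    """Return every formal name the input may stand for (possibly several)."""
    rows = conn.execute("""
        SELECT name FROM given_names WHERE name = :n COLLATE NOCASE
        UNION
        SELECT g.name
        FROM hypocorisms h JOIN given_names g ON g.id = h.id
        WHERE h.name = :n COLLATE NOCASE
    """, {"n": name})
    return sorted(row[0] for row in rows)


def expand_given_name(conn, name):
    """Return the formal names for the input plus all of their hypocorisms."""
    variants = set()
    for formal in normalize_given_name(conn, name):
        variants.add(formal)
        rows = conn.execute("""
            SELECT h.name
            FROM hypocorisms h JOIN given_names g ON g.id = h.id
            WHERE g.name = ?
        """, (formal,))
        variants.update(row[0] for row in rows)
    return sorted(variants)


if __name__ == "__main__":
    print(normalize_given_name(conn, "Al"))
    print(expand_given_name(conn, "Bill"))
```

A search for "Bill Smith" could then match the surname against any forename in the expanded list. Names missing from the tables come back as an empty list, so the caller would fall back to the name as typed.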

#9

Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.

Here are some pointers.
