How to intelligently parse out last names

Date: 2022-10-31 07:32:14

Assuming the Western naming convention of FirstName MiddleName(s) LastName, what would be the best way to correctly parse out the last name from a full name?

For example:

John Smith --> 'Smith'
John Maxwell Smith --> 'Smith'
John Smith Jr --> 'Smith Jr'
John van Damme --> 'van Damme'
John Smith, IV --> 'Smith, IV'
John Mark Del La Hoya --> 'Del La Hoya'

...and the countless other permutations from this.

3 answers

#1


17  

Probably the best answer here is not to try. Names are individual and idiosyncratic and, even if you limit yourself to the Western tradition, you can never be sure that you've thought of all the edge cases. A friend of mine legally changed his name to a single word, and he's had a hell of a time dealing with various institutions whose procedures can't cope with this. You're in the unique position of being the one creating the software that implements a procedure, so you have an opportunity to design something that isn't going to annoy the crap out of people with unconventional names. Think about why you need to parse out the last name to begin with, and see if there's something else you could do.

That being said, as a purely technical matter the best way would probably be to trim specific strings such as " Jr", ", Jr", ", Jr.", "III", ", III", etc. from the end of the string containing the name, and then take everything from the last space in the (now shortened) string to its end. This wouldn't get, say, "Del La Hoya" from your example, but you can't really count on a human to get that either. I'm making an educated guess that John Mark Del La Hoya's last name is "Del La Hoya" and not "Mark Del La Hoya" because I'm a native English speaker and I have some intuition about what Spanish last names look like. If the name were, say, "Gauthip Yeidze Ka Illunyepsi", I would have absolutely no idea whether to count that "Ka" as part of the last name, because I have no idea what language that's from.
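For what it's worth, the trim-then-split idea might look roughly like the sketch below. The suffix list is my own guess at what "etc." should cover, and the particle set ("van", "del", "la", ...) is an extra assumption that goes beyond what the answer proposes, so treat both as placeholders rather than a complete solution.

```python
import re

# Assumed suffix list: the answer only names "Jr" and "III" and says "etc."
SUFFIX_RE = re.compile(r",?\s+(Jr\.?|Sr\.?|II|III|IV)$", re.IGNORECASE)

# Assumed particle list (not from the answer); real names need far more care.
PARTICLES = {"van", "von", "de", "del", "la", "der"}


def strip_suffix(full_name):
    """Remove one trailing generational suffix such as ', Jr.' or ' III'."""
    return SUFFIX_RE.sub("", full_name.strip())


def naive_last_name(full_name):
    """The approach described above: drop the suffix, keep the last word."""
    tokens = strip_suffix(full_name).split()
    return tokens[-1] if tokens else ""


def last_name_with_particles(full_name):
    """Variant that also absorbs known particles preceding the last word."""
    tokens = strip_suffix(full_name).split()
    if not tokens:
        return ""
    start = len(tokens) - 1
    # Walk backwards over particles, but never swallow the first token.
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(tokens[start:])


if __name__ == "__main__":
    for n in ["John Smith, IV", "John van Damme", "John Mark Del La Hoya"]:
        print(n, "|", naive_last_name(n), "|", last_name_with_particles(n))
```

Note that both functions drop the suffix entirely, so "John Smith Jr" comes back as "Smith" rather than the "Smith Jr" the question asks for. A hard-coded particle list will also misfire on names from languages it doesn't cover, which is exactly the problem described above.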

#2


0  

I'm seconding Tnekutippa here, but you should check out named entity recognition (NER). It might help automate some of the process. As noted, however, this is quite difficult. I'm not sure whether the Stanford NER can extract first and last names out of the box, but a machine learning approach could prove very useful for this task. The Stanford NER could be a nice starting point, or you could try to build your own classifiers and training corpora.

#3


0  

Came across a lib called "nameparser" at https://pypi.python.org/pypi/nameparser. It handles four out of the six cases above:

#!/usr/bin/env python
from nameparser import HumanName

def get_lname(somename):
    """Return the last-name component that nameparser extracts."""
    name = HumanName(somename)
    return name.last

# (full name, expected last name) pairs from the question.
# The two commented-out cases are the ones that do not pass.
people_names = [
    ('John Smith', 'Smith'),
    ('John Maxwell Smith', 'Smith'),
    # ('John Smith Jr', 'Smith Jr'),
    ('John van Damme', 'van Damme'),
    # ('John Smith, IV', 'Smith, IV'),
    ('John Mark Del La Hoya', 'Del La Hoya')
]

for name, target in people_names:
    parsed = get_lname(name)  # parse once, reuse for printing and checking
    print('{} --> {} <-- {}'.format(name, parsed, target))
    assert parsed == target
