So one of my major pain points is name comprehension and piecing together household names & titles. I have an 80% solution with a pretty massive regex I put together this morning that I probably shouldn't be proud of (but am anyway, in a kind of sick way) that matches the following examples correctly:
John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA
The regex matcher looks like this:
(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?
(wtf right?)
For convenience: http://www.pyregex.com/
So, for the example:
'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
the regex results in a group dict that looks like:
connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
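For reference, a minimal sketch of how that group dict can be reproduced with Python's re module. The names PATTERN and groups are illustrative and not part of the original post. The raw captures include surrounding whitespace (for example, first_name is captured as "John " with a trailing space), so each value is stripped here:

```python
import re

# The pattern from the post, split across several string literals only for readability.
PATTERN = re.compile(
    r"(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)"
    r"(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?"
    r"(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)"
    r"(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?"
    r"(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?"
)

text = "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD"
match = PATTERN.match(text)

# Optional groups that did not participate come back as None, so fall back to "".
groups = {key: (value or "").strip() for key, value in match.groupdict().items()}

for key in sorted(groups):
    print(f"{key}: {groups[key]}")
```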
I need help with the final step, which has been tripping me up: comprehending possible middle names.
Examples include:
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
Is this possible and is there a better way to do this without machine learning? Maybe I can use nameparser (discovered after I went down the regex rabbit hole) instead with some way to determine whether or not there are multiple names? The above matches 99.9% of my cases so I feel like it's worth finishing.
TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
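One way to express that constraint, as a rough sketch rather than a drop-in fix: make the middle name an optional group guarded by a lookahead that requires another name-like word after it. The pattern below is a simplified, hypothetical single-person pattern (NAME and its group names are not taken from the regex above), and it assumes titles are separated from the name by a comma:

```python
import re

# The middle name is optional and is only taken when another word follows it,
# which is what the lookahead (?=\s+[A-Za-z-]+) checks without consuming text.
NAME = re.compile(
    r"(?P<first>[A-Za-z-]+)"
    r"(?:\s+(?P<middle>[A-Za-z-]+)(?=\s+[A-Za-z-]+))?"
    r"\s+(?P<last>[A-Za-z-]+)"
)

with_middle = NAME.match("John Jimmy Jeffries, C.P.A., MD").groupdict()
without_middle = NAME.match("John Jeffries, MD").groupdict()

print(with_middle)
print(without_middle)
```

Without the comma, as in 'John Jeffries MD', this sketch would take Jeffries as the middle name and MD as the last name, so it still needs some way of telling titles apart from names.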
Note: I don't need to parse titles like Mr., Mrs., Ms., etc., but I suppose that can be added in the same manner as middle names.
Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.
1 Answer
Regular expressions like this are the work of the Dark One.
Who, looking at your code later, will be able to understand what is going on? Will you even?
How will you test all of the possible edge cases?
Why have you chosen to use a regular expression at all? If the tool you are using is so difficult to work with, it suggests that maybe another tool would be better.
Try this:
import re

examples = [
    "John Jeffries",
    "John Jeffries, M.D.",
    "John Jeffries, MD",
    "John Jeffries and Jim Smith",
    "John and Jim Jeffries",
    "John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries M.D. and Jennifer Holmes CPA",
    "John Jeffries M.D. & Jennifer Holmes CPA",
    'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
    'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]

def IsTitle(inp):
    #Matches all-caps titles such as MD, M.D. or C.P.A
    #Mixed-case titles such as PhD are not matched
    return re.match(r'^([A-Z]\.?)+$', inp.strip())

def ParseName(name):
    #Titles are separated from each other and from names with ","
    #We don't need these, so we remove them
    name = name.replace(',', ' ')
    #Split name and titles on spaces, combining adjacent spaces
    name = name.split()
    #Build an output object
    ret_name = {"first": None, "middle": None, "last": None, "titles": []}
    #First string is always a first name
    ret_name['first'] = name[0]
    if len(name) > 2:  #John Johnson Smith/MD
        if IsTitle(name[2]):  #John Smith MD
            ret_name['last'] = name[1]
            ret_name['titles'] = name[2:]
        else:  #John Johnson Smith, MD, CPA
            ret_name['middle'] = name[1]
            ret_name['last'] = name[2]
            ret_name['titles'] = name[3:]
    elif len(name) == 2:  #John Johnson
        ret_name['last'] = name[1]
    return ret_name

def CombineNames(inp):
    #"John and Jim Jeffries": the first person borrows the second person's last name
    #The length check avoids an IndexError when only one name was given
    if len(inp) > 1 and not inp[0]['last']:
        inp[0]['last'] = inp[1]['last']

def ParseString(inp):
    inp = inp.replace("&", "and")  #Names are combined with "&" or "and"
    inp = re.split(r"\s+and\s+", inp)  #Split names apart
    inp = list(map(ParseName, inp))  #list() so the result can be indexed on Python 3 as well
    CombineNames(inp)
    return inp

#The key order of the printed dicts depends on the Python version
for e in examples:
    print(e)
    print(ParseString(e))
Output:
John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
This took less than fifteen minutes and, at each stage, the logic is clear and the program can be debugged in pieces. While one-liners are cute, clarity and testability should take precedence.
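To illustrate the point about debugging in pieces, here is a small sketch that exercises the title check on its own. is_title is a standalone copy of the IsTitle pattern above so that the snippet runs by itself; the PhD case is an assumed input, not one of the original examples, and it shows a limitation worth knowing about:

```python
import re

def is_title(token):
    # Same pattern as IsTitle above: capital letters, each optionally followed by a dot.
    return re.match(r'^([A-Z]\.?)+$', token.strip()) is not None

# Expected classification for a handful of tokens (hypothetical test table).
cases = {
    "MD": True,
    "M.D.": True,
    "C.P.A": True,
    "Jeffries": False,
    "Wilkes-Smith": False,
    "PhD": False,  # mixed-case titles are not recognised by this pattern
}

results = {token: is_title(token) for token in cases}

for token, expected in cases.items():
    status = "ok" if results[token] == expected else "MISMATCH"
    print(f"{token!r}: {results[token]} ({status})")
```

Because PhD is not recognised as a title, an input such as 'John Smith PhD' would be parsed with Smith as a middle name; extending the pattern or checking against a list of known titles are two possible ways to handle that.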