I want to split Brazilian names into parts. However, in names like the ones below, "de", "da" (and others) are not separate parts: they always go with the following word, so a normal split doesn't work.
test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split
My expected output would be:
[Francisco, da Sousa, Rodrigues] #1
[Emiliano, Rodrigo, Carrasco] #2
[Alberto, de Francia] #3
[Bruno, Rezende] #4
For the special cases I tried this pattern:
PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])")
re.split(PATTERN, test1) (...)
but the output is not what I expected:
['Francisco', 'da Sousa Rodrigues'] #1
['Alberto', 'de Francia'] #3
Any idea how to fix it? Is there a way to use just one pattern for both the "normal" and the "special" cases?
9 answers
#1
9
Will the names always be written in the "canonical" way, i.e. with every part capitalised except for da, de, do, ...?
In that case, you can use that fact:
>>> import re
>>> for t in (test1, test2, test3, test4):
...     print(re.findall(r"(?:[a-z]+ )?[A-Z]\w+", t, re.UNICODE))
...
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
>>>
The "right" way to do what you want (apart from not doing it at all) would be a negative lookbehind: split on a space that isn't preceded by any of da, de, do, ... . Sadly, this is (AFAIK) impossible in the general case, because re requires lookbehinds to be of fixed width, so every alternative has to be equally long. If no names end in the syllables "dos" or "das" (the two alternatives written without a leading space), which you really can't assume, you could do this:
PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")
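As a quick check of the fixed-width pattern above, the following sketch splits a few names with it. The names "Maria dos Santos" and "Leonidas Silva" are made-up test inputs, not taken from the question, and split_name is a hypothetical helper; the last call shows the caveat about names that happen to end in "das".

```python
import re

# Every alternative is exactly three characters wide, which is what
# the standard-library re module needs inside a lookbehind.
PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")

def split_name(name):
    """Split on whitespace that does not directly follow a particle."""
    return PATTERN.split(name)

print(split_name("Francisco da Sousa Rodrigues"))
print(split_name("Maria dos Santos"))
# "Leonidas" ends in "das", so the space after it is wrongly protected
# and the name comes back as a single part.
print(split_name("Leonidas Silva"))
```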
You may occasionally stumble over cases where the findall approach above doesn't work: if the first letter of a name is an accented character (or if the article, hypothetically, contained one), it will match incorrectly. To fix this, you won't get around using an external library: regex.
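To see that failure mode concretely, here is a small sketch that runs the ASCII-only findall pattern on a name with an accented capital letter. It uses only the standard library; the variable names are illustrative.

```python
import re

# [A-Z] and [a-z] cover ASCII letters only, so "Â" cannot start a match.
ASCII_PATTERN = re.compile(r"(?:[a-z]+ )?[A-Z]\w+")

parts = ASCII_PATTERN.findall("Luiz Ângelo de Urzêda")
print(parts)  # "Ângelo" is missing from the result
```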
Your new findall will look like this then:
regex.findall(r"(?:\p{Ll}+ )?\p{Lu}\w+", "Luiz Ângelo de Urzêda")
The \p{Ll} refers to any lowercase letter, and \p{Lu} to any uppercase letter.
#2
2
With the regex.split() function from Python's regex library, which offers additional functionality:
installation:
pip install regex
usage:
import regex as re
test_names = ["Francisco da Sousa Rodrigues", "Emiliano Rodrigo Carrasco",
"Alberto de Francia", "Bruno Rezende"]
for n in test_names:
    print(re.split(r'(?<!das?|de|dos?)\s+', n))
The output:
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
(?<!das?|de|dos?)\s+ - the negative lookbehind assertion (?<!...) ensures that the whitespace \s+ is not preceded by one of the special cases da|das|de|do|dos
https://pypi.python.org/pypi/regex/
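If installing a third-party package is not an option, a similar effect can be sketched with the standard-library re module by chaining one fixed-width negative lookbehind per particle. This is an alternative to the pattern above, not a copy of it: the added \b anchors are an assumption about the desired behaviour, so that a name merely ending in "da" (such as "Fernanda") is still split off. The names PARTICLES, SPLIT_RE and split_name are hypothetical.

```python
import re

PARTICLES = ("da", "das", "de", "do", "dos")

# One lookbehind per particle keeps each of them fixed-width, which is
# all the standard re module requires. \b makes sure the particle is a
# whole word and not just the tail of a longer name.
lookbehinds = "".join(rf"(?<!\b{p})" for p in PARTICLES)
SPLIT_RE = re.compile(lookbehinds + r"\s+")

def split_name(name):
    """Split on whitespace unless it directly follows a whole-word particle."""
    return SPLIT_RE.split(name)

for n in ("Francisco da Sousa Rodrigues", "Fernanda Rezende", "Alberto de Francia"):
    print(split_name(n))
```

Because each lookbehind is checked independently, particles of different lengths can be mixed freely without the equal-width restriction that applies to alternatives inside a single lookbehind.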
#3
2
You may use this regex in findall with an optional group:
(?:(?:da|de|do|dos|das)\s+)?\S+
Here we make (?:da|de|do|dos|das) and the 1+ whitespace following it optional.
Code Example:
>>> import re
>>> test1 = "Francisco da Sousa Rodrigues"  # special split
>>> test2 = "Emiliano Rodrigo Carrasco"  # normal split
>>> test3 = "Alberto de Francia"  # special split
>>> test4 = "Bruno Rezende"  # normal split
>>> PATTERN = re.compile(r'(?:(?:da|de|do|dos|das)\s+)?\S+')
>>> print(re.findall(PATTERN, test1))
['Francisco', 'da Sousa', 'Rodrigues']
>>> print(re.findall(PATTERN, test2))
['Emiliano', 'Rodrigo', 'Carrasco']
>>> print(re.findall(PATTERN, test3))
['Alberto', 'de Francia']
>>> print(re.findall(PATTERN, test4))
['Bruno', 'Rezende']
#4
1
One can achieve this stepwise after replacing da with da_ and de with de_:
lst = ["Francisco da Sousa Rodrigues" ,
"Emiliano Rodrigo Carrasco" ,
"Alberto de Francia" ,
"Bruno Rezende" ]
# replace da with da_ and de with de_
lst = list(map(lambda x: x.replace(" da ", " da_"), lst) )
lst = list(map(lambda x: x.replace(" de ", " de_"), lst) )
# now split names and then convert back _ to space:
lst = [ [k.replace("_", " ")
for k in l.split()]
for l in lst ]
print(lst)
Output:
[['Francisco', 'da Sousa', 'Rodrigues'],
['Emiliano', 'Rodrigo', 'Carrasco'],
['Alberto', 'de Francia'],
['Bruno', 'Rezende']]
Edit: in response to the comment, if names of the "Fernanda Rezende" type are present, one can replace " da " with " da_" (the code above was changed to this from the earlier "da " to "da_").
One can also define a simple function for making changes in all strings of a list, and then use it:
def strlist_replace(slist, oristr, newstr):
    return [s.replace(oristr, newstr)
            for s in slist]
lst = strlist_replace(lst, " da ", " da_")
lst = strlist_replace(lst, " de ", " de_")
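The same replace-then-split idea can be sketched for all five particles at once. The helper name split_name and the choice of "\x00" as a placeholder are assumptions for illustration; a NUL character is used instead of "_" only because it is unlikely to appear in a real name.

```python
PARTICLES = ("da", "de", "do", "dos", "das")
PLACEHOLDER = "\x00"  # assumed never to occur inside a name

def split_name(name):
    """Protect the space after each particle, split, then restore it."""
    for p in PARTICLES:
        # Only a particle surrounded by spaces is touched, so a name
        # like "Fernanda" is left alone.
        name = name.replace(f" {p} ", f" {p}{PLACEHOLDER}")
    return [part.replace(PLACEHOLDER, " ") for part in name.split()]

print(split_name("Francisco da Sousa de Rodrigues"))
print(split_name("Fernanda Rezende"))
```

Note that a particle at the very start of the string has no leading space and is therefore not joined by this sketch.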
#5
0
This happens because you split the string only at your special pattern, which indeed splits the string into just two parts.
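A short sketch of what the pattern from the question actually does may help here. In a regular expression, square brackets define a character class, so [da, de, do, dos, das] matches a single character out of d, a, e, o, s, comma or space rather than one of the listed words. The variable names below are illustrative.

```python
import re

name = "Francisco da Sousa Rodrigues"

# The question's pattern: the lookahead only checks that the next
# character is one of "d", "a", "e", "o", "s", "," or " ".
char_class = re.compile(r"\s(?=[da, de, do, dos, das])")

# Spelling the particles out as an alternation is stricter, but it
# still splits only in front of a particle, never after the name
# that follows it.
alternation = re.compile(r"\s(?=(?:da|de|do|dos|das)\s)")

print(char_class.split(name))
print(alternation.split(name))
```

Both calls return the same two parts for this name, which is why fixing the character class alone is not enough.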
You could try splitting the second part further, using " " as a delimiter once more. Note that this doesn't work if there are multiple instances of special delimiters.
Another approach would be to keep splitting using " " as the delimiter, and then join each special delimiter with the name that follows it. For example:
[Francisco, da, Sousa, Rodrigues] # becomes...
[Francisco, da Sousa, Rodrigues]
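The second approach could look roughly like the sketch below: split on whitespace, then glue each particle onto the word that follows it. The particle set and the function name join_particles are assumptions for illustration.

```python
PARTICLES = {"da", "de", "do", "dos", "das"}

def join_particles(name):
    """Split on whitespace and attach each particle to the next word."""
    parts = []
    prefix = []  # particles waiting for the word they belong to
    for word in name.split():
        if word in PARTICLES:
            prefix.append(word)
        else:
            parts.append(" ".join(prefix + [word]))
            prefix = []
    # A trailing particle has no following word; keep it as its own part.
    if prefix:
        parts.append(" ".join(prefix))
    return parts

print(join_particles("Francisco da Sousa Rodrigues"))
print(join_particles("Emiliano Rodrigo Carrasco"))
```

Since every particle is handled as it is encountered, this also copes with names that contain several of them.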
#6
0
Maybe you can try something like this:
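b_o_g = ['da', 'de', 'do', 'dos', 'das']
test1 = "Francisco da Sousa Rodrigues"
test3 = "Alberto de Francia"

def _custom_split(bag_of_words, string_t):
    s_o_s = string_t.split()
    # Deleting during iteration is deliberate: the word after a particle
    # is merged into it and must not be visited again.
    for i, word in enumerate(s_o_s):
        if word in bag_of_words:
            try:
                # Glue the particle to the word that follows it.
                s_o_s[i] = "{} {}".format(s_o_s[i], s_o_s[i + 1])
                del s_o_s[i + 1]
            except IndexError:
                # A trailing particle has nothing to attach to.
                pass
    return s_o_s

print(_custom_split(b_o_g, test1))
print(_custom_split(b_o_g, test3))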
output:
['Francisco', 'da Sousa', 'Rodrigues']
['Alberto', 'de Francia']
#7
0
Maybe not the best or most elegant way, but this will work. I also added test5 just to be sure.
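special_chars = ['da', 'de', 'do', 'dos', 'das']
test1 = "Francisco da Sousa Rodrigues"  # special split
test2 = "Emiliano Rodrigo Carrasco"  # normal split
test3 = "Alberto de Francia"  # special split
test4 = "Bruno Rezende"  # normal split
test5 = 'Francisco da Sousa de Rodrigues'

def cut(test):
    t1 = test.split()
    # Prefix each word that follows a particle with that particle.
    # Stopping one short of the end avoids an IndexError when the
    # last word is itself a particle (it is then dropped below).
    for i in range(len(t1) - 1):
        if t1[i] in special_chars:
            t1[i+1] = t1[i] + ' ' + t1[i+1]
    # Drop the now-redundant standalone particles; build a new list
    # rather than removing items from the list being iterated.
    t1 = [w for w in t1 if w not in special_chars]
    print(t1)

cut(test1)
cut(test2)
cut(test3)
cut(test4)
cut(test5)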
The results are:
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
['Francisco', 'da Sousa', 'de Rodrigues']
#8
0
It should be pointed out that we are talking about titles here, not names.
These pretty much all translate to something like "from" or "of", and the part after them typically refers to a place; they originated as titles for nobility.
You are trying to fit a non-name into a name context, which makes everything difficult.
It's weird to try to just remove all this as if it didn't exist. It's like taking a name such as "Steve From New York" and trying to drop the "From" and make "New York" the last name.
These were never intended to be last names, or to act like what most people would consider a last name. Things just kind of drifted in that direction over time, in an attempt to make round pegs fit into square holes.
As a more elegant solution, you might add a title field to your signup page or similar, and direct people with titles to use it.
#9
-2
Your regular expression should be changed to:
PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])(\S+\s*\s\s*\S+)")
import re
test1 = "Francisco da Sousa Rodrigues" #special split
test3 = "Alberto de Francia" #special split
PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])(\S+\s*\s\s*\S+)")
print(re.split(PATTERN, test1))
print(re.split(PATTERN, test3))
This works for me, giving the following outputs:
['Francisco', 'da Sousa', ' Rodrigues']
['Alberto', 'de Francia', '']