In this question, which was closed, the OP asked how to extract rank, first, middle, and last names from the following strings:
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell Cook")
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell Cook "High Sheriff" "John" "Caldwell" "Cook"
I came up with the following, which only works if the middle name includes a period; otherwise, the pattern for rank captures as much as it can from the beginning of the line.
pat <- '(?i)(?<rank>[a-z ]+)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
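f <- function(x, pattern) {
  # f is called on one string at a time; [[1]] takes the match data for that string
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  s <- attr(m, "capture.start")   # start position of each capture group
  l <- attr(m, "capture.length")  # length of each capture group
  n <- attr(m, "capture.names")   # group names: rank, first, middle, last
  setNames(mapply('substr', x, s, s + l - 1L), n)
}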
do.call('rbind', Map(f, x, pat))
# rank first middle last
# Marshall Robert Forsyth "Marshall" "Robert" "" "Forsyth"
# Deputy Sheriff John A. Gooch "Deputy Sheriff" "John" "A." "Gooch"
# Constable Darius Quimby "Constable" "Darius" "" "Quimby"
# High Sheriff John Caldwell Cook "High Sheriff John" "Caldwell" "" "Cook"
So this works if the middle name is either not given or includes a period:
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
"Constable Darius Quimby", "High Sheriff John Caldwell. Cook")
do.call('rbind', Map(f, x, pat))
So my question is: is there a way to prioritize matching from the end of the string, so that this pattern matches last, middle, and first, and then leaves everything else for rank?
Can I do this without reversing the string or something hacky like that? Also, maybe there is a better pattern since I am not great with regex.
Related: [1] [2]. I don't think these help, since in both cases a different pattern was suggested rather than the question being answered. Also, in this example the number of words in the rank is arbitrary, and the pattern matching the rank would also match the first name.
2 solutions
#1
2
We cannot start matching from the end; no regex system I know of has a modifier for that. But we can check how many words remain until the end of the string and restrain the greediness accordingly :). The regex below does this.
This one will do what you want:
^(?<rank>(?:(?:[ \t]|^)[a-z]+)+?)(?!(?:[ \t][a-z.]+){4,}$)[ \t](?<first>[a-z]+)[ \t](?:(?<middle>[a-z.]+)[ \t])?(?<last>[a-z]+)$
Live preview on regex101.com
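As a rough sketch of how this regex could be applied in R (the language of the question), the block below runs it with `regexpr(..., perl = TRUE)`. The helper name `extract_groups` and the variable `pat_end` are made up for illustration, and the leading `(?i)` is an assumption carried over from the question's own pattern so that capitalized words match; neither is part of the regex as posted.

```r
# Input strings from the question.
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

# The regex from this answer, split for readability.
# Assumption: the leading (?i) is added here; it is not in the posted regex.
pat_end <- paste0(
  "(?i)^(?<rank>(?:(?:[ \\t]|^)[a-z]+)+?)",  # lazy: as few rank words as possible
  "(?!(?:[ \\t][a-z.]+){4,}$)",              # but not while 4+ words still remain
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: returns one row per input, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  # substring() recycles x down the columns of s, so rows line up with x;
  # groups that did not take part in the match come out as "".
  out <- substring(x, s, s + l - 1L)
  matrix(out, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

res <- extract_groups(x, pat_end)
print(res)
```

The negative lookahead is what stands in for "matching from the end": the rank is allowed to stop growing only once at most three words are left for first, middle, and last.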
There's also one exception:
when the input has only a first and a last name (no middle name) and the rank is more than one word long, part of the rank will be captured as the first name.
To solve this, you have to define a list of rank prefixes that are always followed by another word, and capture that following word greedily as part of the rank.
E.g.: Deputy, High.
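One way the prefix idea might look in R is sketched below. It is a simplified variant rather than the regex above: it assumes a rank is zero or more listed prefixes followed by exactly one more word. The `prefixes` vector, `pat_prefix`, and the helper `extract_groups` are illustrative names, and the prefix list itself is a placeholder that would need to be filled in for real data.

```r
# Hypothetical, illustrative list of rank prefixes that always take another word.
prefixes <- c("Deputy", "High")

# Assumption: rank = any number of listed prefixes, then exactly one more word.
pat_prefix <- paste0(
  "(?i)^(?<rank>(?:(?:", paste(prefixes, collapse = "|"), ")[ \\t])*[a-z]+)",
  "[ \\t](?<first>[a-z]+)",
  "[ \\t](?:(?<middle>[a-z.]+)[ \\t])?",
  "(?<last>[a-z]+)$"
)

# Hypothetical helper: one row per input, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  out <- substring(x, s, s + l - 1L)  # unmatched groups come out as ""
  matrix(out, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

# The first string is the problem case: two-word rank, no middle name.
y <- c("Deputy Sheriff John Gooch", "High Sheriff John Caldwell Cook",
       "Constable Darius Quimby")
res_prefix <- extract_groups(y, pat_prefix)
print(res_prefix)
```

Because the prefix group is greedy, "Deputy" pulls "Sheriff" into the rank before the name parts are tried. A multi-word rank whose first word is not in the list would still be split incorrectly, so the list has to cover every such prefix.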
#2
0
My R is rusty, but placing a ? after a quantifier makes it non-greedy instead of greedy in all regex engines that I am aware of. So to answer your main question:
Is there a way to prioritize matching from the end of the string such that this pattern matches last, middle, first, then leaving everything else for rank?
You should be able to do this by making the rank section of the pattern non-greedy, adding a ? after the +:
(?<rank>[a-z ]+?)
Full pattern:
pat <- '(?i)(?<rank>[a-z ]+?)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)'
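One caveat that seems worth checking: the full pattern has no end-of-string anchor, so with a lazy rank the engine can accept a match that stops before the end of the input. The sketch below compares the pattern as given with a variant that adds `^` and `$`. The anchors are an addition made here, not part of the answer, and `pat_lazy`, `pat_anchored`, and `extract_groups` are illustrative names.

```r
x <- c("Marshall Robert Forsyth", "Deputy Sheriff John A. Gooch",
       "Constable Darius Quimby", "High Sheriff John Caldwell Cook")

# The lazy pattern as given in this answer.
pat_lazy <- "(?i)(?<rank>[a-z ]+?)\\s(?<first>[a-z]+)\\s(?:(?<middle>[a-z.]+)\\s)?(?<last>[a-z]+)"

# Assumption: same pattern with ^ and $ added so the match must reach the end.
pat_anchored <- paste0("(?i)^", sub("(?i)", "", pat_lazy, fixed = TRUE), "$")

# Hypothetical helper: one row per input, one column per named group.
extract_groups <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  s <- attr(m, "capture.start")
  l <- attr(m, "capture.length")
  out <- substring(x, s, s + l - 1L)  # unmatched groups come out as ""
  matrix(out, nrow = length(x),
         dimnames = list(x, attr(m, "capture.names")))
}

res_lazy     <- extract_groups(x, pat_lazy)
res_anchored <- extract_groups(x, pat_anchored)

print(res_lazy)      # unanchored: the match may end early on longer inputs
print(res_anchored)  # anchored: the lazy rank grows until the rest reaches $
```

With the `$` anchor, the lazy rank is forced to expand until first, middle, and last can consume the remainder of the string. Inputs with a multi-word rank and no middle name are still ambiguous, as noted in the other answer.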