
Time: 2021-02-21 08:56:22

I'm working on an assignment where I need to clean a lot of messy string data.
I've worked my way through most of the problems but got stuck on two:


  1. Ties when using multiple grepl statements

  2. Lots of code that I feel could be simplified, but I can't figure out how

Let's consider this minimal example:


names is a character vector storing names of 3 distinct persons, written in various ways


names should be simplified (recoded) so that multiple occurrences of a person name are stored the same way


Let's assume Johnatan is First John;
Johnnie and johnnie are both Second John;
and John, John D., and John Doe are all Third John.

With my limited R knowledge I came up with this solution:


names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")

names[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names[grepl("john d*", names, ignore.case = TRUE)] <- "Third John"

At this point the plain John is left, and I have no idea how to recode it into Third John, because


names[grepl("john", names, ignore.case = TRUE)]

will pick up every john in names.



How can I approach this kind of tie, hopefully in a way more elegant than what I wrote so far?


Thank you for any hints and suggestions.


2 solutions



You can use a word boundary (\\b) for "john":


names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names2 = names

names2[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names2[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names2[grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE)] <- "Third John"

or with case_when from dplyr:


library(dplyr)

# case_when() checks the conditions in order and uses the first one that is TRUE
names = case_when(grepl("johna", names, ignore.case = TRUE) ~ "First John",
                  grepl("johnn", names, ignore.case = TRUE) ~ "Second John",
                  grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE) ~ "Third John")


  • \\b matches a word boundary: the position between a word character and a non-word character (such as a space or punctuation), or the start or end of the string. For example, johnatan would not be matched, since john is followed by another letter (a), not a word boundary.

  • \\s matches a whitespace character.

  • d.* matches d followed by any character (.) zero or more times.

  • ( | ) is a capture group that matches either the left-hand side or the right-hand side of |.

  • john(\\b|\\sd.*) matches john followed by either a word boundary or a space followed by a d and any characters. Hence it matches "john", "john d.", and "john doe" (ignore.case = TRUE takes care of the case differences).


> names2
[1] "Third John"  "First John"  "Second John" "Third John"  "Third John"  "Second John"
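
To see the tie and the effect of the word boundary in isolation, here is a minimal sketch on the same names vector. The variable names all_john and bounded are only for illustration. On this particular vector, john\\b on its own already appears to select the three Third John entries, since the space after john also counts as a word boundary.

```r
names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")

# Plain "john" matches every element: this is the tie from the question.
all_john <- grepl("john", names, ignore.case = TRUE)

# With \\b, "john" must be followed by a non-word character
# or by the end of the string, so Johnatan and Johnnie drop out.
bounded <- grepl("john\\b", names, ignore.case = TRUE)

print(names[all_john])
print(names[bounded])
```

Whether the extra \\sd.* alternative is needed depends on the real data, so it is worth checking the pattern against a sample of the messy strings before relying on it.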



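# Lookup table: the names are the patterns, the values are the recoded labels.
# Order matters: "John" comes last because it matches every element.
temp = c(Johnatan = "First John", johnnie = "Second John", John = "Third John")
# Build a logical matrix (one row per element of names, one column per pattern),
# then pick the first matching pattern in each row.
temp[apply(X = sapply(names(temp),
                 function(x) grepl(pattern = x,
                                   x = names,
                                   ignore.case = TRUE)),
      MARGIN = 1,
      FUN = function(x) head(which(x), 1))]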
#         John      Johnatan       johnnie          John          John       johnnie 
# "Third John"  "First John" "Second John"  "Third John"  "Third John" "Second John" 
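
The same first-match-wins idea can also be written as a small loop over an ordered set of rules, which may help with the "lots of code" part of the question. This is only a sketch: recode_by_pattern, rules, and raw_names are hypothetical names introduced here, and it assumes the most specific patterns are listed first.

```r
# Hypothetical helper: apply regex rules in order; the first match wins.
# `rules` is a named character vector: names are patterns, values are labels.
recode_by_pattern <- function(x, rules, default = NA_character_) {
  out  <- rep(default, length(x))
  done <- rep(FALSE, length(x))
  for (pattern in names(rules)) {
    # Only consider elements that no earlier rule has claimed.
    hit <- !done & grepl(pattern, x, ignore.case = TRUE)
    out[hit]  <- rules[[pattern]]
    done[hit] <- TRUE
  }
  out
}

# Most specific patterns first, the catch-all "john" last.
rules <- c("johna" = "First John",
           "johnn" = "Second John",
           "john"  = "Third John")

raw_names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
recoded <- recode_by_pattern(raw_names, rules)
print(recoded)
```

Because every element is matched against the original strings and claimed only once, the catch-all pattern cannot overwrite an earlier result, and adding a new person is a one-line change to rules.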


