I'm working on an assignment where I need to clean a lot of messy string data.
I've worked my way through most of the issues but got stuck on two:

Ties when using multiple grepl statements

Lots of code that, I feel, could be simplified, but I can't figure out how

Let's consider this minimal example:

names is a character vector storing names of 3 distinct persons, written in various ways

names should be simplified (recoded) so that multiple occurrences of a person name are stored the same way

Let's assume Johnatan is First John,
Johnnie and johnnie are both Second John,
and John, John D., and John Doe are all Third John.

With my limited R knowledge I came up with this solution:

names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")

names[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names[grepl("john d*", names, ignore.case = TRUE)] <- "Third John"

At this point the plain John is still left, and I have no idea how to recode it into Third John, as

names[grepl("john", names, ignore.case = TRUE)]

will pick up every john in names.

Question:

How can I handle this kind of tie, ideally in a more elegant way than what I have written so far?

Thank you for any hints and suggestions.

2 solutions

#1

You can use a word boundary (\\b) for "john":

names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names2 = names

names2[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names2[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names2[grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE)] <- "Third John"

or with case_when from dplyr:

library(dplyr)
names = case_when(grepl("johna", names, ignore.case = TRUE) ~ "First John",
                  grepl("johnn", names, ignore.case = TRUE) ~ "Second John",
                  grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE) ~ "Third John")

Note:

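\\b matches a word boundary, i.e. the position between a word character and a non-word character (such as a space or punctuation), or the start or end of the string. For example, johnatan is not matched because john is followed by another letter, a, rather than by a word boundary.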

\\s matches a whitespace character.

d.* matches d followed by any character (.) zero or more times (*).

( | ) is a capture group that matches either the left-hand side or the right-hand side of |.

john(\\b|\\sd.*) matches john followed by either a word boundary or a space followed by a d and anything zero or more times. Hence matching "john", "john d.", and "john doe" (ignore.case = TRUE takes care of the cases).
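
A quick way to see what the word boundary changes is to run the patterns side by side on the same input. This is only an illustrative sketch; the variable names x, plain, bounded and full are made up here so that names stays untouched:

```r
# Hypothetical copy of the vector from the question
x <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")

# Plain "john" matches every element
plain <- grepl("john", x, ignore.case = TRUE)
# TRUE TRUE TRUE TRUE TRUE TRUE

# "john\\b" matches only where "john" ends a word
bounded <- grepl("john\\b", x, ignore.case = TRUE)
# TRUE FALSE FALSE TRUE TRUE FALSE

# The full pattern from the answer selects the same elements
full <- grepl("john(\\b|\\sd.*)", x, ignore.case = TRUE)
# TRUE FALSE FALSE TRUE TRUE FALSE
```

On this particular input, john\\b alone selects the same three elements as the longer pattern, because in "John D." and "John Doe" john is directly followed by a space, which already counts as a word boundary.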

Result:

> names2
[1] "Third John"  "First John"  "Second John" "Third John"  "Third John"  "Second John"
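
If the number of patterns grows, the repeated assignments could be folded into a small helper that walks through an ordered set of rules and lets the first matching rule win. The following is a sketch under assumptions, not part of the answer above: recode_first_match, rules and x are hypothetical names, and "Mary" is added only to show what happens to a value no rule matches.

```r
# Hypothetical helper: apply regex rules in order, first match wins
recode_first_match <- function(x, rules, default = NA_character_) {
  out  <- rep(default, length(x))
  done <- rep(FALSE, length(x))
  for (i in seq_along(rules)) {
    # Only consider elements that no earlier rule has claimed
    hit <- !done & grepl(names(rules)[i], x, ignore.case = TRUE)
    out[hit] <- rules[[i]]
    done <- done | hit
  }
  out
}

# Pattern = replacement, listed from most to least specific
rules <- c("johna"   = "First John",
           "johnn"   = "Second John",
           "john\\b" = "Third John")

x <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie", "Mary")
recoded <- recode_first_match(x, rules)
# "Third John" "First John" "Second John" "Third John" "Third John" "Second John" NA
```

Because matching is always done against the original x and never against already recoded values, a replacement such as "First John" cannot be picked up again by a later john pattern.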

#2

temp = c(Johnatan = "First John", johnnie = "Second John", John = "Third John")
temp[apply(X = sapply(names(temp),
                 function(x) grepl(pattern = x,
                                   x = names,
                                   ignore.case = TRUE)),
      MARGIN = 1,
      FUN = function(x) head(which(x), 1))]
#         John      Johnatan       johnnie          John          John       johnnie 
# "Third John"  "First John" "Second John"  "Third John"  "Third John" "Second John"
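
The one-liner above may be easier to follow when split into steps. The sketch below is an unpacked variant under assumptions: x, lookup, hits, first_hit and result are hypothetical names, "Mary" is added to show the no-match case, and which(row)[1] is swapped in for head(which(x), 1) so that unmatched values become NA.

```r
x <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie", "Mary")

# Patterns as names, replacements as values; order decides which match wins
lookup <- c(Johnatan = "First John", johnnie = "Second John", John = "Third John")

# Logical matrix: one row per input string, one column per pattern
hits <- sapply(names(lookup), function(p) grepl(p, x, ignore.case = TRUE))

# Column index of the first matching pattern in each row, NA if none matches
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Look up the replacement for each row and drop the pattern names
result <- unname(lookup[first_hit])
# "Third John" "First John" "Second John" "Third John" "Third John" "Second John" NA
```

The pattern John matches every variant, so it has to come last in lookup; the more specific Johnatan and johnnie are checked first and take precedence.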
