Merge two data frames on a column with a specific pattern in the strings

(I have been stuck on this problem for the past two days, so if it already has an answer on SO, please bear with me.)

I have two data frames, A and B, and I want to merge them on the Name column. Suppose A has two columns, Name and Numbers. The Name column of A has values ".tony.x.rds", ".tom.x.rds" and so on.

Name         Numbers
.tony.x.rds  15.6
.tom.x.rds   14.5

The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.

Name  ChaR
tony.x   ENG
tom.x   US

The main element in the Name column of both data frames is "tony", "tom" and so on.

So, ".tony.x.rds" corresponds to "tony.x" and ".tom.x.rds" corresponds to "tom.x".

I have tried gsub with various options, leaving me with "tony", "tom", and so on in the Name column of both A and B. But when I use

StoRe<-merge(A,B, all=T)

I get all the rows of A and B rather than a single row per name. That is, there are two rows for each of "tony", "tom" and so on, each with its respective value in either the Numbers or the ChaR column. For example:

Name Numbers ChaR
tony    15.6    NA
tony    NULL    ENG
tom    14.5    NA
tom    NULL    US

It has been giving me a splitting headache. Any help would be appreciated.

1 solution

#1

One possible solution. I am not completely sure what you want to do with the 'x' in the strings, so I have kept it in the linkage key; by changing the replacement \\1\\2 to \\1 you keep only the first captured part (the name).

a <- data.frame(
  Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
  Numbers = rnorm(5)
)

b <- data.frame(
  Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
  ChaR = LETTERS[11:15]
)

# String has the form '.<letters1>.<letters2>.rds'; replace it by
# '<letters1><letters2>'
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)

# String has the form '<letters1>.<letters2>'; replace it by '<letters1><letters2>'
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)

result <- merge(a, b, all = TRUE, by = "Name_stand")

Output:

#> result
#  Name_stand        Name.x     Numbers    Name.y ChaR
#1       barx    .bar.x.rds  1.38072696     bar.x    M
#2   companyx          <NA>          NA company.x    O
#3    foobarx .foobar.x.rds -1.53076596  foobar.x    N
#4       foox    .foo.x.rds  1.40829287      <NA> <NA>
#5       tomx    .tom.x.rds -0.01204651     tom.x    L
#6      tonyx   .tony.x.rds  0.34159406    tony.x    K
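
To illustrate the remark about the replacement string, here is a small sketch on two of the names from the question. The vector nm and the variable names are only for illustration and are not part of the code above.

```r
# Illustration only: `nm`, `pattern`, `key_both` and `key_first` are made-up names.
nm <- c(".tony.x.rds", ".tom.x.rds")
pattern <- "^\\.([a-z]+)\\.([a-z]+)\\.rds$"

# Keep both captured groups: the name and the trailing letters
key_both <- gsub(pattern, "\\1\\2", nm)   # "tonyx" "tomx"

# Keep only the first captured group: the name
key_first <- gsub(pattern, "\\1", nm)     # "tony" "tom"
```

Whichever replacement you choose, use the same one for both data frames, otherwise the keys will not match and merge(..., all = TRUE) will again return separate rows for each side.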

Another approach, perhaps somewhat more robust to variations in the strings: names such as 'tom.rds' and 'tom' will still be linked (this can of course also be a disadvantage):

# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non-alphanumeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)

result2 <- merge(a, b, all = TRUE, by = "Name_stand")
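
As a rough check of the claim that variants such as 'tom.rds' and 'tom' are still linked, the same two gsub steps can be tried on a tiny example. The helpers normalise_a and normalise_b and the data frames a2 and b2 are hypothetical, introduced only for this sketch.

```r
# Illustration only: hypothetical helpers wrapping the gsub steps shown above.
normalise_a <- function(x) {
  x <- gsub("rds$", "", x)          # drop a trailing "rds"
  gsub("[^[:alnum:]]", "", x)       # drop everything that is not a letter or digit
}
normalise_b <- function(x) gsub("[^[:alnum:]]", "", x)

# Made-up data: one name in the original format, one in a variant format
a2 <- data.frame(Name = c(".tony.x.rds", "tom.rds"), Numbers = c(15.6, 14.5))
b2 <- data.frame(Name = c("tony.x", "tom"), ChaR = c("ENG", "US"))

a2$Name_stand <- normalise_a(a2$Name)   # "tonyx" "tom"
b2$Name_stand <- normalise_b(b2$Name)   # "tonyx" "tom"

# Full outer join on the standardised key: one row per name
res <- merge(a2, b2, all = TRUE, by = "Name_stand")
```

The downside mentioned above shows up when a name in A happens to end in "rds" without being an extension: a hypothetical name like "cards" would be shortened to "ca" by the first step and would no longer match its counterpart in B.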
