How do I merge multiple data frames that share the same column names?

Time: 2022-08-15 22:59:26

What I have:

I have a "master" dataframe that has the following columns:

userid, condition

Since there are four experiment conditions, I also have four dataframes that carry answer information, with the following columns:

userid, condition, answer1, answer2

Now I'd like to join these so that every combination of user ID and condition is matched with its answers. Each row should carry only the answers for its own condition, in the appropriate columns.


Short, self-contained example:

master = data.frame(userid=c("foo","foo","foo","foo","bar","bar","bar","bar"), condition=c("A","B","C","D","A","B","C","D"))
cond_a = data.frame(userid=c("foo","bar"), condition="A", answer1=c("1","1"), answer2=c("2","2"))
cond_b = data.frame(userid=c("foo","bar"), condition="B", answer1=c("3","3"), answer2=c("4","4"))
cond_c = data.frame(userid=c("foo","bar"), condition="C", answer1=c("5","5"), answer2=c("6","6"))
cond_d = data.frame(userid=c("foo","bar"), condition="D", answer1=c("7","7"), answer2=c("8","8"))

How do I merge all conditions into the master, so that the master table looks as follows?

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B       3       4
3    bar         C       5       6
4    bar         D       7       8
5    foo         A       1       2
6    foo         B       3       4
7    foo         C       5       6
8    foo         D       7       8

I've tried the following:

temp = merge(master, cond_a, all.x=TRUE)

Which gives me:

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B    <NA>    <NA>
3    bar         C    <NA>    <NA>
4    bar         D    <NA>    <NA>
5    foo         A       1       2
6    foo         B    <NA>    <NA>
7    foo         C    <NA>    <NA>
8    foo         D    <NA>    <NA>

But as soon as I do this…

merge(temp, cond_b, all.x=TRUE)

There are no values for condition B. How come?

  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B    <NA>    <NA>
3    bar         C    <NA>    <NA>
4    bar         D    <NA>    <NA>
5    foo         A       1       2
6    foo         B    <NA>    <NA>
7    foo         C    <NA>    <NA>
8    foo         D    <NA>    <NA>

3 solutions

#1


11  

You can use Reduce() and complete.cases() as follows:

merged <- Reduce(function(x, y) merge(x, y, all=TRUE), 
                 list(master, cond_a, cond_b, cond_c, cond_d))
merged[complete.cases(merged), ]
#    userid condition answer1 answer2
# 1     bar         A       1       2
# 2     bar         B       3       4
# 4     bar         C       5       6
# 6     bar         D       7       8
# 8     foo         A       1       2
# 9     foo         B       3       4
# 11    foo         C       5       6
# 13    foo         D       7       8

Reduce() might take some getting accustomed to. You define your function, and then provide a list of objects to repeatedly apply the function to. Thus, that statement is like doing:

temp1 <- merge(master, cond_a, all=TRUE)
temp2 <- merge(temp1, cond_b, all=TRUE)
temp3 <- merge(temp2, ....)

Or something like:

merge(merge(merge(master, cond_a, all=TRUE), cond_b, all=TRUE), cond_c, all=TRUE)
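
As a rough illustration of that left-to-right nesting, here is a minimal sketch that swaps merge() for a toy string-building function so the call order becomes visible. The names `show_call` and `nested` are made up for this example and are not part of the original answer.

```r
# Hypothetical helper: instead of merging, record how the two arguments
# were combined, so the nesting produced by Reduce() can be read off.
show_call <- function(x, y) paste0("(", x, "+", y, ")")

# Reduce() folds from the left: f(f(f(a, b), c), d).
nested <- Reduce(show_call, c("a", "b", "c", "d"))
print(nested)  # "(((a+b)+c)+d)"

# The same fold written out by hand gives the same string.
by_hand <- show_call(show_call(show_call("a", "b"), "c"), "d")
print(identical(nested, by_hand))  # TRUE
```

Replacing `show_call` with a function that calls merge() gives the chained merges shown above.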

complete.cases() returns a logical vector indicating which rows are "complete", that is, contain no NA values; this vector can then be used to subset the merged data.frame.
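
To see complete.cases() in isolation, here is a small sketch on a made-up three-row data frame (the data and the names `df`, `keep` and `complete_rows` are illustrative only):

```r
# Toy data: the second row has missing answers.
df <- data.frame(
  userid  = c("foo", "foo", "bar"),
  answer1 = c("1", NA, "3"),
  answer2 = c("2", NA, "4")
)

# One TRUE/FALSE per row; TRUE means the row contains no NA.
keep <- complete.cases(df)
print(keep)  # TRUE FALSE TRUE

# Use the logical vector as a row index to drop incomplete rows.
complete_rows <- df[keep, ]
print(complete_rows)
```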

#2


2  

Given that, as the OP describes it, there is no explicit relationship with the master data frame, one option is this:

temp <- rbind(cond_a, cond_b, cond_c, cond_d)
temp[order(temp$userid), ]

If some relationship with the master data frame were known, a less simplistic solution might be possible.
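
One possible shape for such a solution, sketched under the assumption that the relationship is simply the (userid, condition) pair: stack the condition frames, then join them to the master on those two keys only. By default merge() joins on every column name the two frames share, so once answer1 and answer2 exist on both sides they become part of the key as well; naming the keys with `by` avoids that. The `group` column below is a hypothetical extra master column added for illustration and does not appear in the original post.

```r
# Data from the question, plus one made-up extra column on the master.
master <- data.frame(
  userid    = rep(c("foo", "bar"), each = 4),
  condition = rep(c("A", "B", "C", "D"), times = 2),
  group     = "pilot"  # hypothetical extra column
)
cond_a <- data.frame(userid = c("foo", "bar"), condition = "A", answer1 = c("1", "1"), answer2 = c("2", "2"))
cond_b <- data.frame(userid = c("foo", "bar"), condition = "B", answer1 = c("3", "3"), answer2 = c("4", "4"))
cond_c <- data.frame(userid = c("foo", "bar"), condition = "C", answer1 = c("5", "5"), answer2 = c("6", "6"))
cond_d <- data.frame(userid = c("foo", "bar"), condition = "D", answer1 = c("7", "7"), answer2 = c("8", "8"))

# Stack the four condition frames; they all have identical columns.
cond_all <- rbind(cond_a, cond_b, cond_c, cond_d)

# Join on the two key columns only, keeping every master row.
result <- merge(master, cond_all, by = c("userid", "condition"), all.x = TRUE)
print(result)
```

Because the join happens in a single step, no rows are left with NA answers here, and any additional master columns are carried through unchanged.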

#3


1  

You can express this join as a SQL statement and then use the sqldf package to execute it.

library(sqldf)
cond_all = rbind(cond_a, cond_b, cond_c, cond_d)

sqldf('select p.userid as userid, p.condition as condition, answer1, answer2
       from master as p join cond_all as q on p.userid=q.userid and p.condition=q.condition
       order by userid, condition')
  userid condition answer1 answer2
1    bar         A       1       2
2    bar         B       3       4
3    bar         C       5       6
4    bar         D       7       8
5    foo         A       1       2
6    foo         B       3       4
7    foo         C       5       6
8    foo         D       7       8

You mentioned in a comment that the master dataframe has extra columns that do not exist in the cond dataframes. You should be able to modify this SQL query to still work for this case.
