Merging data.frames and summing the values of identical columns in R

Time: 2021-10-08 15:46:10

I have three data frames of species abundances within sites (rows: sites, columns: species names). The number of rows is identical, but the number of columns differs because not all species occur in all three data frames. I would like to merge them into one data frame in which the abundances of identical species are summed. For example:

data.frame1

       Sp1  Sp2  Sp3  Sp4
site1   1    2    3    1
site2   0    2    0    1
site3   1    1    1    1

data.frame2

       Sp1  Sp2  Sp4
 site1  0    1    2
 site2  1    2    0
 site3  1    1    1

data.frame3

       Sp1  Sp2  Sp5  Sp6
 site1  0    1    1    1     
 site2  1    1    1    5
 site3  2    0    0    0

What I would like to end up with is something like:

       Sp1  Sp2  Sp3  Sp4  Sp5  Sp6
 site1  1    4    3    3    1    1
 site2  2    5    0    1    1    5
 site3  4    2    1    2    0    0

I guess I'd have to work with merge, but so far my attempts have failed to produce what I want.

Any help is appreciated.

4 solutions

#1


19  

I'd use plyr's rbind.fill like this:

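library(plyr)  # provides rbind.fill and ddply
# Stack the data frames row-wise; species missing from a data frame become NA
pp <- cbind(names = c(rownames(df1), rownames(df2), rownames(df3)),
            rbind.fill(list(df1, df2, df3)))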

#   names Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# 1 site1   1   2   3   1  NA  NA
# 2 site2   0   2   0   1  NA  NA
# 3 site3   1   1   1   1  NA  NA
# 4 site1   0   1  NA   2  NA  NA
# 5 site2   1   2  NA   0  NA  NA
# 6 site3   1   1  NA   1  NA  NA
# 7 site1   0   1  NA  NA   1   1
# 8 site2   1   1  NA  NA   1   5
# 9 site3   2   0  NA  NA   0   0

Then, aggregate with plyr's ddply as follows:

ddply(pp, .(names), function(x) colSums(x[,-1], na.rm = TRUE))
#   names Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# 1 site1   1   4   3   3   1   1
# 2 site2   2   5   0   1   1   5
# 3 site3   4   2   1   2   0   0

#2


7  

Another alternative is to use melt/cast from reshape2. Here is an unsophisticated example:

df1 <- read.table(header=T, text="
    Sp1  Sp2  Sp3  Sp4
    site1   1    2    3    1
    site2   0    2    0    1
    site3   1    1    1    1")

df2 <- read.table(header=T, text="
       Sp1  Sp2  Sp4
 site1  0    1    2
 site2  1    2    0
 site3  1    1    1")

df3 <- read.table(header=T, text="
       Sp1  Sp2  Sp5  Sp6
 site1  0    1    1    1     
 site2  1    1    1    5
 site3  2    0    0    0")

df1$site <- rownames(df1)
df2$site <- rownames(df2)
df3$site <- rownames(df3)

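library(reshape2)  # provides melt and dcast

# Melt each data frame to long format, stack them, then cast back to wide, summing duplicates
DF <- rbind(melt(df1, id = "site"), melt(df2, id = "site"), melt(df3, id = "site"))
dcast(data = DF, formula = site ~ variable, fun.aggregate = sum)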

   site Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
1 site1   1   4   3   3   1   1
2 site2   2   5   0   1   1   5
3 site3   4   2   1   2   0   0

In short, we use site designation as an additional variable, and convert each dataframe to long format, subsequently joining them into a single dataframe. The latter contains all the values in the long format. With dcast we create the dataframe you require, sites being in rows (left side of the formula), variables being in columns (right side of the formula). The sum function is used on the variables for which multiple cells are produced.

Of course, the code can be extended to the more general case with loops or *apply functions.
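
As a rough illustration of that generalisation, here is a minimal sketch that applies the same long-then-wide idea to a list of any number of data frames. It uses base R only (lapply and tapply stand in for melt and dcast), and the helper name sum_abundances is hypothetical, not something from reshape2.

```r
# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)

# Hypothetical helper: sum any number of site-by-species data frames
sum_abundances <- function(dfs) {
  # Long format: one row per (site, species) cell of every data frame
  long <- do.call(rbind, lapply(dfs, function(d) {
    data.frame(site    = rep(rownames(d), times = ncol(d)),
               species = rep(colnames(d), each = nrow(d)),
               value   = unlist(d, use.names = FALSE),
               stringsAsFactors = FALSE)
  }))
  # Back to wide: sum the values for each site/species combination
  wide <- tapply(long$value, list(long$site, long$species), sum)
  wide[is.na(wide)] <- 0  # combinations that never occur count as 0
  as.data.frame(wide)
}

result <- sum_abundances(list(df1, df2, df3))
print(result)
```

Because the sites are matched by row name rather than by position, this sketch should also cope with data frames whose rows are in a different order.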

#3


5  

Adding to the options available, here are two more that stick with base R.

First option: Wide aggregation (sort of)

temp <- cbind(df1, df2, df3)
temp
#       Sp1 Sp2 Sp3 Sp4 Sp1 Sp2 Sp4 Sp1 Sp2 Sp5 Sp6
# site1   1   2   3   1   0   1   2   0   1   1   1
# site2   0   2   0   1   1   2   0   1   1   1   5
# site3   1   1   1   1   1   1   1   2   0   0   0
sapply(unique(colnames(temp)), 
       function(x) rowSums(temp[, colnames(temp) == x, drop = FALSE]))
#       Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# site1   1   4   3   3   1   1
# site2   2   5   0   1   1   5
# site3   4   2   1   2   0   0
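
A possible variation on this first option, offered only as a sketch: base R's rowsum() sums the rows of a matrix by group, so the column-wise grouping can be done by transposing, grouping on the species names, and transposing back. The data are re-created from the question so that the snippet runs on its own, and it assumes the sites are in the same order in every data frame.

```r
# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)

# Bind as matrices so that duplicated species names are kept as column names
m <- cbind(as.matrix(df1), as.matrix(df2), as.matrix(df3))

# Transpose so species become rows, sum rows sharing a name, transpose back
summed <- t(rowsum(t(m), group = colnames(m)))
print(summed)
```

Note that rowsum() sorts the groups by default, so the species columns come out in alphabetical order rather than in order of first appearance.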

Second option: semi-wide to long to wide

Conceptually, this is similar to Maxim.K's answer: once the data are in long form, they are much easier to manipulate:

> temp1 <- t(cbind(df1, df2, df3))
> # You'll get a warning in the next step
> # Safe to ignore though...
> temp2 <- data.frame(var = rownames(temp1), stack(data.frame(temp1)))
Warning message:
In data.row.names(row.names, rowsi, i) :
  some row.names duplicated: 5,6,7,8,9 --> row.names NOT used
> xtabs(values ~ ind + var, temp2)
       var
ind     Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
  site1   1   4   3   3   1   1
  site2   2   5   0   1   1   5
  site3   4   2   1   2   0   0

#4


2  

An alternative to Arun's answer: create a 'template' data frame with all the columns you'll need:

Rgames> bbar<-data.frame('one'=rep(0,3),'two'=rep(0,3),'three'=rep(0,3))
Rgames> bbar
  one two three
1   0   0     0
2   0   0     0
3   0   0     0

Then, given each of your data frames, such as

Rgames> bar1<-data.frame('one'=c(1,2,3),'two'=c(4,5,6))
Rgames> bar1
  one two
1   1   4
2   2   5
3   3   6

Create an expanded dataframe:

Rgames> newbar1<-bbar
Rgames> for (jj in names(bar1) )  newbar1[[jj]]<-bar1[[jj]]
Rgames> newbar1
  one two three
1   1   4     0
2   2   5     0
3   3   6     0

Then sum all such expanded data frames. Clumsy but simple.
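
To make the last step concrete, here is a minimal sketch of the template idea applied to the data from the question. The helper name expand_to_template is hypothetical, and the sketch assumes that the sites appear in the same row order in every data frame, as the question states.

```r
# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)

dfs <- list(df1, df2, df3)

# The template needs every species that occurs in any data frame
all_species <- sort(unique(unlist(lapply(dfs, colnames))))

# Hypothetical helper: copy one data frame into an all-zero template
expand_to_template <- function(d, species) {
  out <- as.data.frame(matrix(0, nrow = nrow(d), ncol = length(species),
                              dimnames = list(rownames(d), species)))
  out[colnames(d)] <- d  # overwrite the columns this data frame has
  out
}

expanded <- lapply(dfs, expand_to_template, species = all_species)

# Element-wise sum of all expanded data frames
total <- Reduce(`+`, expanded)
print(total)
```

Reduce with `+` adds the data frames cell by cell, which is why all of them must first share the same columns and row order.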
