
Time: 2021-10-08 15:46:10

I have 3 data frames (rows: sites, columns: species names) of species abundances within sites. The row numbers are identical, but the column numbers differ, as not all species are in all three data frames. I would like to merge them into one data frame, with the abundances of identical species summed. For example:




       Sp1  Sp2  Sp3  Sp4
site1   1    2    3    1
site2   0    2    0    1
site3   1    1    1    1



       Sp1  Sp2  Sp4
 site1  0    1    2
 site2  1    2    0
 site3  1    1    1



       Sp1  Sp2  Sp5  Sp6
 site1  0    1    1    1     
 site2  1    1    1    5
 site3  2    0    0    0

What I want to have is something like:


       Sp1  Sp2  Sp3  Sp4  Sp5  Sp6
 site1  1    4    3    3    1    1
 site2  2    5    0    1    1    5
 site3  4    2    1    2    0    0

I guess I'd have to work with merge, but so far my attempts have failed to get what I want.


Any help is appreciated.


4 solutions



I'd use plyr's rbind.fill like this:


library(plyr)  # provides rbind.fill() and ddply()

pp <- cbind(names=c(rownames(df1), rownames(df2), rownames(df3)), 
            rbind.fill(list(df1, df2, df3)))

#   names Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# 1 site1   1   2   3   1  NA  NA
# 2 site2   0   2   0   1  NA  NA
# 3 site3   1   1   1   1  NA  NA
# 4 site1   0   1  NA   2  NA  NA
# 5 site2   1   2  NA   0  NA  NA
# 6 site3   1   1  NA   1  NA  NA
# 7 site1   0   1  NA  NA   1   1
# 8 site2   1   1  NA  NA   1   5
# 9 site3   2   0  NA  NA   0   0

Then, aggregate with plyr's ddply as follows:


ddply(pp, .(names), function(x) colSums(x[,-1], na.rm = TRUE))
#   names Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# 1 site1   1   4   3   3   1   1
# 2 site2   2   5   0   1   1   5
# 3 site3   4   2   1   2   0   0



Another alternative is to use melt/cast from reshape2. Here is an unsophisticated example:


df1 <- read.table(header=T, text="
    Sp1  Sp2  Sp3  Sp4
    site1   1    2    3    1
    site2   0    2    0    1
    site3   1    1    1    1")

df2 <- read.table(header=T, text="
       Sp1  Sp2  Sp4
 site1  0    1    2
 site2  1    2    0
 site3  1    1    1")

df3 <- read.table(header=T, text="
       Sp1  Sp2  Sp5  Sp6
 site1  0    1    1    1     
 site2  1    1    1    5
 site3  2    0    0    0")

df1$site <- rownames(df1)
df2$site <- rownames(df2)
df3$site <- rownames(df3)

library(reshape2)  # provides melt() and dcast()

DF <- rbind(melt(df1, id="site"), melt(df2, id="site"), melt(df3, id="site"))
dcast(data=DF, formula=site ~ variable, fun.aggregate=sum)

   site Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
1 site1   1   4   3   3   1   1
2 site2   2   5   0   1   1   5
3 site3   4   2   1   2   0   0

In short, we use site designation as an additional variable, and convert each dataframe to long format, subsequently joining them into a single dataframe. The latter contains all the values in the long format. With dcast we create the dataframe you require, sites being in rows (left side of the formula), variables being in columns (right side of the formula). The sum function is used on the variables for which multiple cells are produced.


Of course, the code can be extended to the more general case with loops or *apply functions.
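As a rough sketch of that generalisation, the data frames can be kept in a list and melted with lapply, so the same code works for any number of them. The names sites, dfs, long and result are only illustrative, and reshape2 is assumed to be installed:

```r
library(reshape2)

# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)

dfs <- list(df1, df2, df3)

# Melt every data frame to long format, carrying the site name as an id column,
# then stack the long pieces on top of each other
long <- do.call(rbind, lapply(dfs, function(d) {
  d$site <- rownames(d)
  melt(d, id.vars = "site")
}))

# Back to wide format, summing wherever a site/species pair occurs more than once
result <- dcast(long, site ~ variable, value.var = "value", fun.aggregate = sum)
result
```

Adding a fourth data frame then only means adding it to dfs.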




Adding to the options available, here are two more that stick with base R.


First option: Wide aggregation (sort of)


temp <- cbind(df1, df2, df3)
#       Sp1 Sp2 Sp3 Sp4 Sp1 Sp2 Sp4 Sp1 Sp2 Sp5 Sp6
# site1   1   2   3   1   0   1   2   0   1   1   1
# site2   0   2   0   1   1   2   0   1   1   1   5
# site3   1   1   1   1   1   1   1   2   0   0   0
sapply(unique(colnames(temp)),
       function(x) rowSums(temp[, colnames(temp) == x, drop = FALSE]))
#       Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
# site1   1   4   3   3   1   1
# site2   2   5   0   1   1   5
# site3   4   2   1   2   0   0
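A variation on the same wide aggregation, offered only as a sketch: base R's rowsum() adds up the rows of a matrix that share a group label, so transposing the combined table lets it sum the duplicated species columns in one call. The name summed is illustrative, and the data frames are rebuilt here so the snippet runs on its own:

```r
# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)

# Side-by-side table with repeated species columns
temp <- cbind(df1, df2, df3)

# Transpose so species become rows, sum rows with the same species name,
# then transpose back so sites are rows again
summed <- t(rowsum(t(as.matrix(temp)), group = colnames(temp)))
summed
```

The result is a numeric matrix rather than a data frame; wrap it in as.data.frame() if a data frame is needed.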

Second option: semi-wide to long to wide


Conceptually, this is similar to Maxim.K's answer: get the data into long form, which makes it much easier to manipulate:


> temp1 <- t(cbind(df1, df2, df3))
> # You'll get a warning in the next step
> # Safe to ignore though...
> temp2 <- data.frame(var = rownames(temp1), stack(data.frame(temp1)))
Warning message:
In data.row.names(row.names, rowsi, i) :
  some row.names duplicated: 5,6,7,8,9 --> row.names NOT used
> xtabs(values ~ ind + var, temp2)
ind     Sp1 Sp2 Sp3 Sp4 Sp5 Sp6
  site1   1   4   3   3   1   1
  site2   2   5   0   1   1   5
  site3   4   2   1   2   0   0



An alternative to Arun's answer: create a 'template' data frame with all the columns you'll need:


Rgames> bbar<-data.frame('one'=rep(0,3),'two'=rep(0,3),'three'=rep(0,3))
Rgames> bbar
  one two three
1   0   0     0
2   0   0     0
3   0   0     0

Then, given each of your data frames, such as:


Rgames> bar1<-data.frame('one'=c(1,2,3),'two'=c(4,5,6))
Rgames> bar1
  one two
1   1   4
2   2   5
3   3   6

Create an expanded dataframe:


Rgames> newbar1<-bbar
Rgames> for (jj in names(bar1) )  newbar1[[jj]]<-bar1[[jj]]
Rgames> newbar1
  one two three
1   1   4     0
2   2   5     0
3   3   6     0

Then sum all such expanded data frames. Clumsy but simple.
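Applied to the data in the question, the template idea might look like the sketch below. The helper expand_to and the names all_cols and total are hypothetical, and the code assumes every data frame lists the sites in the same row order:

```r
# Example data from the question
sites <- c("site1", "site2", "site3")
df1 <- data.frame(Sp1 = c(1, 0, 1), Sp2 = c(2, 2, 1), Sp3 = c(3, 0, 1),
                  Sp4 = c(1, 1, 1), row.names = sites)
df2 <- data.frame(Sp1 = c(0, 1, 1), Sp2 = c(1, 2, 1), Sp4 = c(2, 0, 1),
                  row.names = sites)
df3 <- data.frame(Sp1 = c(0, 1, 2), Sp2 = c(1, 1, 0), Sp5 = c(1, 1, 0),
                  Sp6 = c(1, 5, 0), row.names = sites)
dfs <- list(df1, df2, df3)

# Every species name that appears in any of the data frames
all_cols <- sort(unique(unlist(lapply(dfs, names))))

# Hypothetical helper: copy one data frame into an all-zero template
# that has the full set of columns
expand_to <- function(d, all_cols) {
  out <- as.data.frame(matrix(0, nrow = nrow(d), ncol = length(all_cols),
                              dimnames = list(rownames(d), all_cols)))
  for (jj in names(d)) out[[jj]] <- d[[jj]]
  out
}

# Expand each data frame, then add them together element by element
total <- Reduce(`+`, lapply(dfs, expand_to, all_cols = all_cols))
total
```

Because the expanded data frames all share the same shape, plain `+` is enough to sum them.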




