R - Add columns together based on column name

I have a dataframe (df1) that includes abundances of different species in each sample:

> SampleID   Sp1   Sp2   Sp3   Sp4   ... Spn
> asb-001      3     0     0    23         9
> asb-002      4    15    10    56        98
> asb-003      8    45     8   453         0
> asb-004      0     5     0     3         6
> asb-005    120    56     0     0         0
...

Each column represents a different species.

I have another dataframe (df2) that maps each species to its family:

Sp     Fam
Sp1   Fam1
Sp2   Fam2
Sp3   Fam1
Sp4   Fam3
Sp5   Fam2
Sp6   Fam1
...

There are fewer Family names than there are Species names.

I would like to add columns together depending on the species name to get a total for each family (e.g. Sp1 + Sp3 + Sp6 = Fam1). I don't want to keep the original species columns.

If all goes according to plan, my new dataframe (df3) will look something like this:

> SampleID  Fam1  Fam2  Fam3 
> asb-001     12     0     9 
> asb-002     14    18   112 
> asb-003     28    58    18
> asb-004     10    12    10
> asb-005    142    65     0
...

I could manually go through and add individual species together, but this seems tedious for a large dataset, and likely to produce errors. I have the feeling that I should melt df1, but I'm not confident in the details. Any advice would be appreciated!

2 Solutions

#1

Convert your original data to long format and join it against the species-family mapping. You can then compute the sum for each family using group_by and summarize. Finally, spread the data back out to wide format.

library( tidyverse )

df1 %>% gather( Sp, Value, -SampleID ) %>%     # Convert to long format
  inner_join( df2, by = "Sp" ) %>%             # Combine with family mapping
  group_by( SampleID, Fam ) %>%                # Work on each sample/family pair
  summarize( ValSum = sum(Value) ) %>%         # Compute the sum across species
  ungroup() %>% spread( Fam, ValSum, fill=0 )  # Convert back to wide format

If a sample/family combination has no rows after the join, spread leaves NA in that cell when converting back to wide format. The optional parameter fill=0 replaces these NA with 0. Note that inner_join drops any species in df1 that has no entry in df2.
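
For readers on a recent tidyr, where gather and spread are superseded, the same pipeline can be sketched with pivot_longer and pivot_wider. This is only an illustration: the toy values below are made up, and it assumes tidyr 1.1 or later and dplyr 1.0 or later (for values_fill and .groups).

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values, not the asker's real dataset)
df1 <- data.frame(
  SampleID = c("asb-001", "asb-002"),
  Sp1 = c(3, 4), Sp2 = c(0, 15), Sp3 = c(0, 10), Sp4 = c(23, 56)
)
df2 <- data.frame(
  Sp  = c("Sp1", "Sp2", "Sp3", "Sp4"),
  Fam = c("Fam1", "Fam2", "Fam1", "Fam3")
)

df3 <- df1 %>%
  pivot_longer(-SampleID, names_to = "Sp", values_to = "Value") %>%  # long format
  inner_join(df2, by = "Sp") %>%                                      # attach family
  group_by(SampleID, Fam) %>%                                         # sample/family pairs
  summarize(ValSum = sum(Value), .groups = "drop") %>%                # sum across species
  pivot_wider(names_from = Fam, values_from = ValSum, values_fill = 0) # wide format

df3
```

Here values_fill = 0 plays the same role as fill=0 in spread.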

#2

Another way, using library(data.table):

library(data.table)
setDT(df1); setDT(df2) # Convert data.frames into data.tables (by reference)

# Melt df1 to long format, then join it to df2 on species
x = df2[melt(df1, id.vars = 'SampleID', variable.name = 'Sp'), on = 'Sp']
# Cast back to wide format, summing the values within each family
df3 = dcast(x, SampleID ~ Fam, value.var = 'value', fun.aggregate = sum)
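
Since the question worries about errors creeping in, here is a small self-contained sketch of the data.table approach with two sanity checks added. The toy values and the names long and unmatched are made up for illustration, and the totals check only holds if every species in df1 has a family in df2.

```r
library(data.table)

# Toy data (hypothetical values, not the asker's real dataset)
df1 <- data.table(
  SampleID = c("asb-001", "asb-002"),
  Sp1 = c(3, 4), Sp2 = c(0, 15), Sp3 = c(0, 10), Sp4 = c(23, 56)
)
df2 <- data.table(
  Sp  = c("Sp1", "Sp2", "Sp3", "Sp4"),
  Fam = c("Fam1", "Fam2", "Fam1", "Fam3")
)

# Long format; keep Sp as character so it matches df2$Sp
long <- melt(df1, id.vars = "SampleID", variable.name = "Sp",
             value.name = "Value", variable.factor = FALSE)
x <- df2[long, on = "Sp"]  # every row of long, with Fam attached (NA if no match)
df3 <- dcast(x, SampleID ~ Fam, value.var = "Value", fun.aggregate = sum)

# Check 1: species in df1 that have no family in df2
unmatched <- x[is.na(Fam), unique(Sp)]

# Check 2: per-sample totals are unchanged by the regrouping
totals_match <- isTRUE(all.equal(rowSums(df1[, !"SampleID"]),
                                 rowSums(df3[, !"SampleID"])))
```

If unmatched is non-empty, those species end up in an NA column of df3 rather than in any family, so it is worth inspecting before trusting the totals.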
