I have a dataframe (df1) that includes abundances of different species in each sample:
> SampleID Sp1 Sp2 Sp3 Sp4 ... Spn
> asb-001 3 0 0 23 9
> asb-002 4 15 10 56 98
> asb-003 8 45 8 453 0
> asb-004 0 5 0 3 6
> asb-005 120 56 0 0 0
...
Each column represents a different species.
I have another dataframe (df2) that maps each species to its family:
Sp Fam
Sp1 Fam1
Sp2 Fam2
Sp3 Fam1
Sp4 Fam3
Sp5 Fam2
Sp6 Fam1
...
There are fewer Family names than there are Species names.
I would like to add columns together depending on the species name to get a total for the family (e.g. Sp1 + Sp3 + Sp6 = Fam1). I don't want to keep the original column with the species name.
If all goes according to plan, my new dataframe (df3) will look something like this:
> SampleID Fam1 Fam2 Fam3
> asb-001 12 0 9
> asb-002 14 18 112
> asb-003 28 58 18
> asb-004 10 12 10
> asb-005 142 65 0
...
I could manually go through and add individual species together, but this seems tedious for a large dataset, and likely to produce errors. I have the feeling that I should melt df1, but I'm not confident in the details. Any advice would be appreciated!
2 Answers
#1
4
Convert your original data to long format and join it against the species-family mapping. You can then compute the sum for each family using group_by. Finally, spread the data back out to wide format.
library( tidyverse )
df1 %>% gather( Sp, Value, -SampleID ) %>%     # Convert to long format
    inner_join( df2, by = "Sp" ) %>%            # Combine with family mapping
    group_by( SampleID, Fam ) %>%               # Work on each sample/family pair
    summarize( ValSum = sum(Value) ) %>%        # Compute the sum across species
    ungroup %>% spread( Fam, ValSum, fill=0 )   # Convert back to wide format
Depending on whether every sample has species in every family, some sample/family combinations may be missing, which yields NA after converting back to wide format. The optional parameter fill=0 converts these NA values to 0.
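For newer versions of tidyr, gather() and spread() are documented as superseded by pivot_longer() and pivot_wider(). Here is a minimal sketch of the same pipeline with those functions. The toy df1 and df2 are made up for illustration and are not the asker's data, and the sketch assumes dplyr >= 1.0.0 and tidyr >= 1.1.0 (for the .groups and scalar values_fill arguments).

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration only
df1 <- tibble(
  SampleID = c("asb-001", "asb-002"),
  Sp1 = c(3, 4), Sp2 = c(0, 15), Sp3 = c(0, 10), Sp4 = c(23, 56)
)
df2 <- tibble(
  Sp  = c("Sp1", "Sp2", "Sp3", "Sp4"),
  Fam = c("Fam1", "Fam2", "Fam1", "Fam3")
)

df3 <- df1 %>%
  pivot_longer(-SampleID, names_to = "Sp", values_to = "Value") %>%  # long format
  inner_join(df2, by = "Sp") %>%                                     # attach family
  group_by(SampleID, Fam) %>%                                        # sample/family pairs
  summarize(ValSum = sum(Value), .groups = "drop") %>%               # sum across species
  pivot_wider(names_from = Fam, values_from = ValSum,
              values_fill = 0)                                       # back to wide format

print(df3)
```

Because inner_join() is used, any species column in df1 that has no row in df2 is silently dropped from the totals, so it may be worth checking setdiff(names(df1)[-1], df2$Sp) first.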
#2
0
Another way, using library(data.table):
library(data.table)
setDT(df1); setDT(df2) # Convert data.frames into data.tables (by reference)
x = df2[melt(df1, id.vars = 'SampleID', variable.name = 'Sp'), on = 'Sp'] # Join melted df1 to df2 on species
df3 = dcast(x, SampleID ~ Fam, value.var = 'value', fun.aggregate = sum) # Cast to wide format, summing values per family
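One difference from the inner_join approach is worth noting: a join of the form df2[long, on = 'Sp'] keeps every row of the melted table, so a species with no entry in df2 gets an NA family. The sketch below, on made-up toy data in which Sp4 is deliberately missing from the mapping, uses nomatch = NULL (available in data.table >= 1.12.0, as far as I know) to drop such rows instead. The names long and Value are my own choices, not from the answer.

```r
library(data.table)

# Toy data, invented for illustration; Sp4 has no family in df2
df1 <- data.table(
  SampleID = c("asb-001", "asb-002"),
  Sp1 = c(3, 4), Sp2 = c(0, 15), Sp3 = c(0, 10), Sp4 = c(23, 56)
)
df2 <- data.table(
  Sp  = c("Sp1", "Sp2", "Sp3"),
  Fam = c("Fam1", "Fam2", "Fam1")
)

# Long format: one row per sample/species pair
long <- melt(df1, id.vars = "SampleID", variable.name = "Sp",
             value.name = "Value", variable.factor = FALSE)

# nomatch = NULL drops rows of `long` without a match in df2 (inner join)
x <- df2[long, on = "Sp", nomatch = NULL]

# Wide format: one column per family, summing species values
df3 <- dcast(x, SampleID ~ Fam, value.var = "Value", fun.aggregate = sum)

print(df3)
```

Whether unmapped species should be dropped or reported is a judgment call; dropping them silently can hide typos in species names.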