Combine rows with the same name in each column using bash

Time: 2021-02-21 01:06:34

I have a file like the following (but with 52 columns and 4,000 rows):

                   1NA2  1NB2  2RA2  2RB2
Vibrionaceae       0.22  0.25  0.36  1.02
Bacillaceae        2.0   1.76  0.55  0.23
Enterobacteriaceae 0.55  0.52  2.40  1.23
Vibrionaceae       0.22  0.25  0.36  1.02
Bacillaceae        2.0   1.76  0.55  0.23
Enterobacteriaceae 0.55  0.52  2.40  1.23

And I want it to look like this:

                   1NA2  1NB2  2RA2  2RB2
Vibrionaceae       0.44  0.50  0.72  2.04
Bacillaceae        4.0   3.52  1.10  0.46
Enterobacteriaceae 1.10  1.04  4.80  2.46

Edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want each name to appear only once, with the total in every column. I have tried the following:

awk '{a[$1]+=$2}END{for(i in a) print i,a[i]}' file

but it only sums the first data column ($2), and I want it to work for all 52 columns.

1 solution

#1


4  

With GNU awk and a 2D array:

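awk 'NR==1                          # print the header line unchanged
     NR>1{
       for(i=2; i<=NF; i++){
         a[$1][i]+=$i               # add column i to the total for the row name in $1
       }
     }
     END{
       for(i in a){                 # note: "for (i in a)" visits names in unspecified order
         printf("%-19s", i)
         for(j=2; j<=NF; j++){      # NF still holds the field count of the last line read
           printf("%.2f  ", a[i][j])
         }
         print ""
       }
     }' file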

Or as a one-liner:

awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j=2; j<=NF; j++){printf("%.2f  ", a[i][j])} print ""}}' file

Output:

                   1NA2  1NB2  2RA2  2RB2
Bacillaceae        4.00  3.52  1.10  0.46  
Vibrionaceae       0.44  0.50  0.72  2.04  
Enterobacteriaceae 1.10  1.04  4.80  2.46

NR is the line number

NF is the number of fields in a row
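
If GNU awk is not available, the same idea can probably be expressed with an ordinary one-dimensional array and a composite key, which any POSIX awk should accept. The following is only a sketch under a few assumptions: fields are separated by whitespace, row names contain no spaces, and the first line is the header. The names sum_rows, seen, order and cols are made up for illustration, and the temporary file stands in for the real file. It also records the order in which names first appear, so rows come out in input order rather than in the unspecified order of "for (i in a)".

```shell
# Sample input from the question, written to a temporary file (stand-in for "file").
input=$(mktemp)
cat > "$input" <<'EOF'
                   1NA2  1NB2  2RA2  2RB2
Vibrionaceae       0.22  0.25  0.36  1.02
Bacillaceae        2.0   1.76  0.55  0.23
Enterobacteriaceae 0.55  0.52  2.40  1.23
Vibrionaceae       0.22  0.25  0.36  1.02
Bacillaceae        2.0   1.76  0.55  0.23
Enterobacteriaceae 0.55  0.52  2.40  1.23
EOF

# sum_rows: hypothetical helper that totals every numeric column per row name.
sum_rows() {
  awk '
    NR == 1 { print; next }            # pass the header through unchanged
    {
      if (!($1 in seen)) {             # remember each name in first-seen order
        seen[$1] = 1
        order[++n] = $1
      }
      if (NF > cols) { cols = NF }     # track the widest row seen so far
      for (i = 2; i <= NF; i++) {
        sum[$1, i] += $i               # composite key: name, SUBSEP, column index
      }
    }
    END {
      for (k = 1; k <= n; k++) {
        printf "%-19s", order[k]
        for (j = 2; j <= cols; j++) {
          printf "%.2f  ", sum[order[k], j]
        }
        print ""
      }
    }
  ' "$1"
}

result=$(sum_rows "$input")
printf '%s\n' "$result"
rm -f "$input"
```

On the sample data this should give the same totals as the output above, with the rows in the order Vibrionaceae, Bacillaceae, Enterobacteriaceae. For the real file, calling sum_rows file would be the equivalent usage.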
