Summing the non-unique lines of multiple files

I would like to combine (sum) the values for all lines that are not unique in each file. I have 96 of these files. This is what I tried:

for f in file*
do
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' "$f" > "out${f#merge}"
done

file1:

rsRNA-8458-n    3
rsRNA-849-n 0
rsRNA-8617-n    0
rsRNA-946-n 0
rsRNA-9538-n    1
rsRNA-9811-n    1
rsRNA-9811-n    3
rsRNA-9815-n    0

file2:

rsRNA-552-n 25
rsRNA-552-n 29
rsRNA-5722-n    0
rsRNA-6330-n    2
rsRNA-6330-n    0
rsRNA-6382-n    2
rsRNA-6382-n    8
rsRNA-6382-n    0
rsRNA-6382-n    0
rsRNA-6382-n    5
rsRNA-6430-n    0

2 Answers

#1

Your script currently writes the per-key sums for each input file to its own output file, named like outfile1. Since you're asking a question about it, I'm going to assume you want to sum across all files. Here's a GNU awk script that sums the entries either per file (the default) or across all the files, and in either case sorts the output by the index strings used in array a:

#!/usr/bin/gawk -f

BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }

lf != FILENAME {
  if( !merge ) {
    output()
    delete( a )
  }
  lf = FILENAME
}

{ a[$1]+=$2 }

END { output() }

function output() {
  fname = "out" (!merge ? lf : "")
  for(k in a) {
    print k, a[k] > fname
  }
}

If you put that into a file called merge.awk and make it executable, you can run it like:

./merge.awk file*

which will create the same kind of outfile1, outfile2 files you get now (though sorted). If instead you initialize merge with a truthy value using the -v flag, like:

./merge.awk -v merge=true file*

all the input files are read into the same array a, and all the output goes into a single file simply named out.

Here's an annotated version:

#!/usr/bin/gawk -f

BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" } # GNU array sorting

lf != FILENAME {          # when the FILENAME changes
  if( !merge ) {          # output array a when merge variable is unset
    output()              # (which is the default for awk variables)
    delete( a )           # delete the array after output() to reset
  }
  lf = FILENAME           # track the last filename in lf
}

{ a[$1]+=$2 }             # sum values of the same key in array a

END { output() }          # output the contents of a

function output() {                  # define function output()
  fname = "out" (!merge ? lf : "")   # adjust the fname when merging
  for(k in a) {                      # sorted in gawk via PROCINFO
    print k, a[k] > fname            # write the contents of array a
  }
}

If you only ever want all files merged, you could more simply do:

 awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' file* > out

and insert | sort before the > out redirection to sort the output.
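As a small worked example of the merge-everything approach, here is a sketch run in a scratch directory. Most input lines are taken from the question's samples; the rsRNA-9811-n line in file2 is made up for illustration so that one key appears in both files. The LC_ALL=C setting is my own addition to make the sort order reproducible, not something the command needs.

```shell
# Work in a scratch directory so that only our two inputs match file*
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Sample inputs (the last line of file2 is hypothetical, added to
# show a key being summed across files)
printf '%s\n' 'rsRNA-9538-n 1' 'rsRNA-9811-n 1' 'rsRNA-9811-n 3' > file1
printf '%s\n' 'rsRNA-552-n 25' 'rsRNA-552-n 29' 'rsRNA-9811-n 2' > file2

# Sum column 2 per key over all files, then sort before redirecting
awk '{a[$1]+=$2} END{for(i in a) print i, a[i]}' file* | LC_ALL=C sort > out

cat out
# rsRNA-552-n 54
# rsRNA-9538-n 1
# rsRNA-9811-n 6
```

The output file is named out, which does not match file*, so running the command again does not feed the result back in as input.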

#2

It's not at all clear what "all lines that are not unique in each file" means, but assuming your awk script does what you want for one file: again, you do not need a shell loop; just let awk process all the files at once.

Using GNU awk for ENDFILE:

awk '{a[$1]+=$2} ENDFILE{for(i in a) print i, a[i] > (FILENAME".out"); delete a}' *

If that's not what you wanted, then edit your question to clarify and provide the expected output for the 2 input files you have posted.
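If GNU awk is not available, the same per-file behaviour can be approximated in POSIX awk by treating FNR == 1 as the start of a new file. The following is only a sketch under that assumption, not part of the original answer: the function name write_sums and the variable out are names I made up, and the inputs are a few lines from the question's samples.

```shell
# Scratch directory with two small inputs from the question's samples
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 'rsRNA-9811-n 1' 'rsRNA-9811-n 3' 'rsRNA-9815-n 0' > file1
printf '%s\n' 'rsRNA-552-n 25' 'rsRNA-552-n 29' 'rsRNA-5722-n 0' > file2

awk '
# Write the sums collected so far to the current output file, then reset
function write_sums(   k) {
    for (k in a)
        print k, a[k] > out
    close(out)        # avoid running out of file handles with many inputs
    split("", a)      # portable way to empty the array
}
FNR == 1 && NR > 1 { write_sums() }           # a new file began: flush the previous one
FNR == 1           { out = FILENAME ".out" }  # output name for the current file
{ a[$1] += $2 }                               # sum column 2 per key
END { if (NR > 0) write_sums() }              # flush the last file
' file1 file2
```

This leaves file1.out and file2.out next to the inputs, with lines in unspecified order since plain awk does not sort for (k in a). The inputs are listed explicitly here because a bare * would also match the .out files on a second run. Unlike ENDFILE, this sketch writes nothing for an empty input file, since FNR == 1 never fires for it.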
