I have a file that looks like:
chr1 mireap precursor 6405246 6405544 . - . ID=xxx-m0444;Count=3;mfe=-61.00
chr1 mireap mature-5p 6405511 6405534 . - . ID=xxx-m0444-5p;Parent=xxx-m044
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . ID=xxx-m0417-3p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
When fields 1, 4, and 5 are duplicated on a second line, I want to keep the duplicate line containing "Name" information at the beginning of field 9. Field 9 always begins with either "ID" or "Name". I want to remove the duplicate line where field 9 begins with "ID".
For example, the desired output would look like this:
chr1 mireap precursor 6405246 6405544 . - . ID=xxx-m0444;Count=3;mfe=-61.00
chr1 mireap mature-5p 6405511 6405534 . - . ID=xxx-m0444-5p;Parent=xxx-m044
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
According to 'man sort', -u outputs only the first line of "an equal run". I took that to mean that if I simply sort in reverse and then use -u, the line containing "Name" will be kept.
sort -k1,1 -k4,4n -rk5,5n file # Correctly sorts the file and the name line appears first relative to its duplicate.
sort -u -k1,1 -k4,4n -k5,5n -rk9,9 file # Runs, but still eliminates the "Name"-containing line anyway.
I've also thought of doing something like this:
sort -k1,1 -k4,4n -rk5,5n file | awk '!x[$1,$4,$5]++' FS="\t" # but I haven't gotten it to work yet, and this still wouldn't retain the desired duplicate line...
3 solutions
$ cat tst.awk
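# Key each line on fields 1, 4, and 5; flag lines whose field 9 starts with "Name=".
{ key = $1 FS $4 FS $5; isNameLine = ($9~/^Name=/ ? 1 : 0) }
# First pass over the file: record which keys have a "Name=" line, print nothing.
NR==FNR { if (isNameLine) hasNameLine[key]; next }
# Second pass: print "Name=" lines, plus lines whose key has no "Name=" line at all.
isNameLine || !(key in hasNameLine)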
$ awk -f tst.awk file file
chr1 mireap precursor 6405246 6405544 . - . ID=xxx-m0444;Count=3;mfe=-61.00
chr1 mireap mature-5p 6405511 6405534 . - . ID=xxx-m0444-5p;Parent=xxx-m044
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
Your requirements are not entirely clear to me, but here is a short script which will hopefully suggest a suitable implementation. It has been written with clarity rather than succinctness in mind.
First let's define "family" to mean a set of lines having the same [$1,$4,$5] value. Assuming you always want to keep at least one of the "Name=" lines in a family, a global sort does make sense, as otherwise the memory requirements could be prohibitive.
So let's start with the sort you proposed, followed by an awk program, which you might want to tweak further depending on the details of your requirements and additional details about the conventions followed in the construction of the input file:
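sort -k1,1 -k4,4n -k5,5n -rk9,9 file |
awk '{ seen[$1,$4,$5]++ }              # count the lines seen so far in this family
$9 ~ /^Name=/ {print; next}            # always keep "Name=" lines
seen[$1,$4,$5] > 1 { next; }           # drop later non-Name lines of the family
{ print }'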
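Before tweaking the filter, it may help to see which families actually contain more than one line. The following is only a sketch: family_sizes is a made-up helper name, and it assumes whitespace-separated fields as in the sample from the question.

```shell
# Hypothetical helper: print "<count> <chr> <start> <end>" for every
# [$1,$4,$5] family that has more than one line.
family_sizes() {
    awk '{ n[$1 FS $4 FS $5]++ }
         END { for (k in n) if (n[k] > 1) print n[k], k }' "$1"
}

# A few lines copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . ID=xxx-m0417-3p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
EOF

family_sizes "$sample"
```

Only the mature-3p pair shares fields 1, 4, and 5, so that is the only family reported. The order of families in the output is unspecified when there are several, because awk's for-in loop makes no ordering promise.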
Using the sort-and-pick-first awk idiom, and depending on the lexical ordering of "Name" > "ID":
$ sort -k1,1 -k4,5 -k9,9r file | awk '!a[$1 FS $4 FS $5]++'
chr1 mireap precursor 6405246 6405544 . - . ID=xxx-m0444;Count=3;mfe=-61.00
chr1 mireap mature-5p 6405511 6405534 . - . ID=xxx-m0444-5p;Parent=xxx-m044
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
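If depending on "Name" sorting after "ID" feels fragile, one possible variation is to decorate each line with an explicit rank, sort on that, and strip it again. This is a sketch of a decorate-sort-undecorate approach, not what the answer above does; prefer_name is a made-up function name, and the field numbers in the sort and the second awk are shifted by one because of the added rank column.

```shell
# Hypothetical helper: keep one line per [$1,$4,$5] family, preferring
# lines whose field 9 starts with "Name=".
prefer_name() {
    # Decorate: prepend rank 0 for "Name=" lines, 1 for everything else.
    awk '{ print ($9 ~ /^Name=/ ? 0 : 1), $0 }' "$1" |
        # Sort by chromosome, start, end, then rank (fields shifted by one).
        LC_ALL=C sort -k2,2 -k5,5n -k6,6n -k1,1n |
        # Keep only the first line of each family.
        awk '!seen[$2 FS $5 FS $6]++' |
        # Undecorate: drop the rank column.
        cut -d' ' -f2-
}

# The sample from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr1 mireap precursor 6405246 6405544 . - . ID=xxx-m0444;Count=3;mfe=-61.00
chr1 mireap mature-5p 6405511 6405534 . - . ID=xxx-m0444-5p;Parent=xxx-m044
chr1 mireap precursor 6482110 6482198 . + . ID=xxx-m0417;Count=105;mfe=-45.
chr1 mireap mature-5p 6482123 6482143 . + . ID=xxx-m0417-5p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . ID=xxx-m0417-3p;Parent=xxx-m041
chr1 mireap mature-3p 6482168 6482188 . + . Name=vvi-miR395g;ID=xxx-m0417-3
EOF

prefer_name "$sample"
```

On this sample it should print the same five lines as above. The trade-off is an extra awk and cut in the pipeline in exchange for not relying on how the two prefixes happen to collate.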
UPDATE: based on the comments, it looks like the ID part of $9 should be part of the key too. Since there is no test data, please verify:
$ sort -k1,1 -k4,5 -k9,9r file |
  awk '{match($9,/(ID=[^;]+;)/,m)}   # three-argument match() requires GNU awk
       !a[$1 FS $4 FS $5 FS m[1]]++'
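The three-argument form of match() is a GNU awk extension. For other awks, a roughly equivalent sketch can use the standard RSTART and RLENGTH variables instead. The function name dedupe_by_id and the test lines below are invented for illustration (the sample in the question is truncated, so it cannot be used to check the ID part), and like the original regexp this assumes the ID value is followed by a semicolon.

```shell
# Hypothetical helper: keep the first line per [$1,$4,$5,ID] key after
# sorting "Name=" lines ahead of "ID=" lines within each family.
dedupe_by_id() {
    LC_ALL=C sort -k1,1 -k4,5 -k9,9r "$1" |
        awk '{
                 # Extract "ID=...;" from field 9, or "" if it is absent.
                 id = match($9, /ID=[^;]+;/) ? substr($9, RSTART, RLENGTH) : ""
             }
             !a[$1 FS $4 FS $5 FS id]++'
}

# Made-up test data: two lines share ID=m1-3p, a third has a different ID.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr1 src mature-3p 100 120 . + . ID=m1-3p;Parent=m1
chr1 src mature-3p 100 120 . + . Name=miR-x;ID=m1-3p;Parent=m1
chr1 src mature-3p 100 120 . + . ID=m2-3p;Parent=m2
EOF

dedupe_by_id "$sample"
```

With this data the plain ID=m1-3p line should be dropped in favour of its "Name=" twin, while the ID=m2-3p line survives because its ID differs even though fields 1, 4, and 5 are the same.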