Removing duplicate lines by specifying multiple fields, keeping the line that sorts second

I have a file that looks like:

chr1          mireap  precursor  6405246   6405544   .  -  .  ID=xxx-m0444;Count=3;mfe=-61.00
chr1          mireap  mature-5p  6405511   6405534   .  -  .  ID=xxx-m0444-5p;Parent=xxx-m044
chr1          mireap  precursor  6482110   6482198   .  +  .  ID=xxx-m0417;Count=105;mfe=-45.
chr1          mireap  mature-5p  6482123   6482143   .  +  .  ID=xxx-m0417-5p;Parent=xxx-m041
chr1          mireap  mature-3p  6482168   6482188   .  +  .  ID=xxx-m0417-3p;Parent=xxx-m041
chr1          mireap  mature-3p  6482168   6482188   .  +  .  Name=vvi-miR395g;ID=xxx-m0417-3

HEAVILY EDITED FOR CLARIFICATION

When fields 1, 4, and 5 are duplicated on a second line, I want to keep the duplicate line containing "Name" information at the beginning of field 9. Field 9 always begins with either "ID" or "Name". I want to remove the duplicate line where field 9 begins with "ID".

For example, the desired output would look like this:

chr1          mireap  precursor  6405246   6405544   .  -  .  ID=xxx-m0444;Count=3;mfe=-61.00
chr1          mireap  mature-5p  6405511   6405534   .  -  .  ID=xxx-m0444-5p;Parent=xxx-m044
chr1          mireap  precursor  6482110   6482198   .  +  .  ID=xxx-m0417;Count=105;mfe=-45.
chr1          mireap  mature-5p  6482123   6482143   .  +  .  ID=xxx-m0417-5p;Parent=xxx-m041
chr1          mireap  mature-3p  6482168   6482188   .  +  .  Name=vvi-miR395g;ID=xxx-m0417-3

According to 'man sort', -u outputs only the first line of "an equal run". I interpreted that as... well, if I simply sort in reverse and then use -u, the "Name"-containing line will be kept.

sort -k1,1 -k4,4n -rk5,5n file # Correctly sorts the file and the name line appears first relative to its duplicate.

sort -u -k1,1 -k4,4n -k5,5n -rk9,9 file # Runs, but still eliminates the "Name"-containing line anyway.

I've also thought of doing something like this:

sort -k1,1 -k4,4n -rk5,5n file | awk '!x[$1,$4,$5]++' FS="\t" # but haven't gotten it to work quite yet and this still wouldn't retain the desired duplicate line...

Ideas?

3 solutions

#1

$ cat tst.awk
{ key = $1 FS $4 FS $5; isNameLine = ($9~/^Name=/ ? 1 : 0) }
NR==FNR { if (isNameLine) hasNameLine[key]; next }
isNameLine || !(key in hasNameLine)

$ awk -f tst.awk file file
chr1          mireap  precursor  6405246   6405544   .  -  .  ID=xxx-m0444;Count=3;mfe=-61.00
chr1          mireap  mature-5p  6405511   6405534   .  -  .  ID=xxx-m0444-5p;Parent=xxx-m044
chr1          mireap  precursor  6482110   6482198   .  +  .  ID=xxx-m0417;Count=105;mfe=-45.
chr1          mireap  mature-5p  6482123   6482143   .  +  .  ID=xxx-m0417-5p;Parent=xxx-m041
chr1          mireap  mature-3p  6482168   6482188   .  +  .  Name=vvi-miR395g;ID=xxx-m0417-3
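
For readers who want to see why the file is passed twice, here is a commented sketch of the same two-pass idea run on a tiny made-up sample. The sample lines (chrA, demo1, demo-miR) and the temporary file are hypothetical and only for illustration; the awk logic is the same as in tst.awk above.

```shell
# Hypothetical tab-separated sample (not the asker's data).
tmp=$(mktemp)
printf 'chrA\tsrc\tprecursor\t100\t200\t.\t+\t.\tID=demo1\n'                   > "$tmp"
printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tID=demo1-3p\n'               >> "$tmp"
printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tName=demo-miR;ID=demo1-3p\n' >> "$tmp"

result=$(awk '
    # Runs on every line of both passes: build the key and the Name flag.
    { key = $1 FS $4 FS $5; isNameLine = ($9 ~ /^Name=/) }
    # Pass 1 (first copy of the file): remember keys that own a Name line.
    NR == FNR { if (isNameLine) hasNameLine[key]; next }
    # Pass 2: print Name lines, plus ID lines whose key has no Name line.
    isNameLine || !(key in hasNameLine)
' "$tmp" "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

With this made-up input the precursor line and the Name line are printed, in their original order, and the plain ID=demo1-3p line is dropped. No sorting is needed, so the input order is preserved.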

#2

Your requirements are not entirely clear to me, but here is a short script which will hopefully suggest a suitable implementation. It has been written with clarity rather than succinctness in mind.

First let's define "family" to mean a set of lines having the same [$1,$4,$5] value. Assuming you always want to keep at least one of the "Name=" lines in a family, a global sort does make sense, as otherwise the memory requirements could be prohibitive.

So let's start with the sort you proposed, followed by an awk program, which you might want to tweak further depending on the details of your requirements and additional details about the conventions followed in the construction of the input file:

sort -k1,1 -k4,4n -k5,5n -rk9,9 |\
  awk '{ seen[$1,$4,$5]++ }
       $9 ~ /^Name=/ {print; next}
       seen[$1,$4,$5] > 1 { next; }
       { print }'
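
The pipeline above reads from standard input and shows no output, so here is a small self-contained run as a sketch. The sample lines are made up (chrA, demo1, demo-miR are hypothetical), LC_ALL=C is an added assumption to make the ordering of "Name" after "ID" predictable, and the reverse flag is written as -k9,9r so that it applies to field 9 only.

```shell
# Hypothetical tab-separated sample; the ID line deliberately comes
# before its Name twin so that the sort has something to reorder.
sample=$(
  printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tID=demo1-3p\n'
  printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tName=demo-miR;ID=demo1-3p\n'
  printf 'chrA\tsrc\tprecursor\t100\t200\t.\t+\t.\tID=demo1\n'
)

result=$(printf '%s\n' "$sample" |
  LC_ALL=C sort -k1,1 -k4,4n -k5,5n -k9,9r |
  awk '{ seen[$1,$4,$5]++ }                 # count the lines of each family
       $9 ~ /^Name=/      { print; next }   # always keep Name lines
       seen[$1,$4,$5] > 1 { next }          # drop later non-Name members
       { print }')                          # first line of a family

printf '%s\n' "$result"
```

On this input the result is the precursor line followed by the Name line. Note that a family with several ID lines and no Name line keeps only its first line under this logic, which may or may not be what you want.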

#3

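Use sort followed by the awk "print only the first line per key" idiom. This relies on "Name" sorting after "ID" lexically, so a reverse sort on field 9 puts the "Name" line first within each group of duplicates.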

$ sort -k1,1 -k4,5 -k9,9r file | awk '!a[$1 FS $4 FS $5]++'

chr1          mireap  precursor  6405246   6405544   .  -  .  ID=xxx-m0444;Count=3;mfe=-61.00
chr1          mireap  mature-5p  6405511   6405534   .  -  .  ID=xxx-m0444-5p;Parent=xxx-m044
chr1          mireap  precursor  6482110   6482198   .  +  .  ID=xxx-m0417;Count=105;mfe=-45.
chr1          mireap  mature-5p  6482123   6482143   .  +  .  ID=xxx-m0417-5p;Parent=xxx-m041
chr1          mireap  mature-3p  6482168   6482188   .  +  .  Name=vvi-miR395g;ID=xxx-m0417-3

UPDATE: based on the comments, it looks like the ID part of $9 should be part of the key too. Since there is no test data, please verify. Note that the three-argument form of match() requires GNU awk.

$ sort -k1,1 -k4,5 -k9,9r file |
       awk '{match($9,/(ID=[^;]+;)/,m)}
            !a[$1 FS $4 FS $5 FS m[1]]++'
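
Where gawk is not available, the same key can be built with the standard two-argument match() plus RSTART and RLENGTH. The following is only a sketch on made-up data (chrA, demo1, demo2 and demo-miR are hypothetical), with LC_ALL=C added as an assumption to pin down the sort order; please check it against real input.

```shell
# Hypothetical sample: three lines share fields 1, 4 and 5, but only
# the first two share the same ID in field 9.
sample=$(
  printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tID=demo1-3p;Parent=demo1\n'
  printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tName=demo-miR;ID=demo1-3p;Parent=demo1\n'
  printf 'chrA\tsrc\tmature-3p\t150\t170\t.\t+\t.\tID=demo2-3p;Parent=demo2\n'
)

result=$(printf '%s\n' "$sample" |
  LC_ALL=C sort -k1,1 -k4,5 -k9,9r |
  awk '# Pull "ID=...;" out of field 9 using POSIX awk features only.
       { id = match($9, /ID=[^;]+;/) ? substr($9, RSTART, RLENGTH) : "" }
       # Keep the first line seen for each (field 1, field 4, field 5, ID) key.
       !a[$1 FS $4 FS $5 FS id]++')

printf '%s\n' "$result"
```

Here the Name line wins over its ID=demo1-3p twin, while the ID=demo2-3p line survives because its ID differs even though the coordinates match.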
