Grep multiple patterns from a file and output the first 5 matches for each pattern

Time: 2022-03-31 03:15:35

I have a question about pattern matching. I have a file containing multiple patterns, say pattern.txt:

Locus3039v1rpkm6.85
Locus3041v1rpkm6.84
Locus3042v1rpkm6.84

And the file to search is file.txt:

Locus3039v1rpkm6.85 gi|350401309|ref|XM_003486067.1|    0   10  85  328 253 8e-12   78.8
Locus3039v1rpkm6.85 gi|350401301|ref|XM_003486066.1|    0   10  85  566 491 8e-12   78.8
Locus3039v1rpkm6.85 gi|350401298|ref|XM_003486065.1|    0   10  85  500 425 8e-12   78.8
Locus3039v1rpkm6.85 gi|340723355|ref|XM_003400008.1|    0   10  106 566 470 3e-11   77.0
Locus3039v1rpkm6.85 gi|340723353|ref|XM_003400007.1|    0   10  106 496 400 3e-11   77.0
Locus3039v1rpkm6.85 gi|359323056|ref|XM_003639939.1|    0   27  104 322 245 9e-05   55.4
Locus3039v1rpkm6.85 gi|359323055|ref|XM_543849.4|   0   27  104 241 164 9e-05   55.4
Locus3039v1rpkm6.85 gi|354503991|ref|XM_003514015.1|    0   27  103 335 259 0.004   50.0
Locus3039v1rpkm6.85 gi|341599927|emb|AM412059.2|    1   63  100 1645525 1645489 6.8 39.2
Locus3039v1rpkm6.85 gi|340003223|emb|HE572590.1|    1   63  100 1671652 1671616 6.8 39.2
Locus3041v1rpkm6.84 gi|337757426|emb|FQ859181.1|    1   61  114 2772617 2772667 0.60    42.8
Locus3041v1rpkm6.84 gi|159889572|gb|CP000875.1|     0   5   40  1185295 1185330 0.60    42.8
Locus3041v1rpkm6.84 gi|158107272|gb|CP000820.1|     0   2   34  5594193 5594161 0.60    42.8
Locus3041v1rpkm6.84 gi|156844486|ref|XM_001645256.1|    83  140 793 850 0.60    42.8
Locus3041v1rpkm6.84 gi|339305108|gb|CP001503.2|     0   58  94  3006529 3006565 2.1 41.0
Locus3041v1rpkm6.84 gi|247533203|gb|CP001607.1|     0   1   40  1268073 1268034 2.1 41.0
Locus3041v1rpkm6.84 gi|367050653|ref|XM_003655658.1|    0   75  103 843 871 7.3 39.2
Locus3041v1rpkm6.84 gi|347002178|gb|CP003012.1|     0   75  103 2986236 2986208 7.3 39.2
Locus3043v1rpkm6.84 gi|332015867|gb|HQ658110.1|     0   9   31  4151    4129    0.49    42.8
Locus3043v1rpkm6.84 gi|254946573|gb|CP001619.1|     1   9   43  4243052 4243019 0.49    42.8
Locus3043v1rpkm6.84 gi|329755665|gb|JF715057.1|     0   11  42  110968  110937  1.7 41.0
Locus3043v1rpkm6.84 gi|9937515|gb|AF294752.1|   0   48  79  2081    2050    1.7 41.0

For each pattern, I want the first 5 matching lines, then move on to the next pattern and get its first 5, and so on.

I tried

 grep -i -m 5 -f pattern.txt file.txt > out.txt
 grep -i -f pattern.txt -m 5 file.txt > out.txt

But I only get the first 5 matches for the first pattern, and then grep stops. Where am I going wrong? Is there an option that does what I need?

3 solutions

#1


4  

Try this:

for pat in $(cat pattern.txt); do grep -i -m 5 -- "$pat" file.txt; done > out.txt

Which means:

  1. For each pattern in pattern.txt, grep the first 5 matching records.
  2. Collect the results in out.txt.

EDIT

As @dogbane mentioned in his comment, this is a UUOC (useless use of cat). Here is my improved answer:

for pat in $(< pattern.txt); do grep -i -m 5 -- "$pat" file.txt; done > out.txt

Also look at this answer.
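
A note on why the original command stops early: `-m 5` tells grep to stop reading a file after 5 matching lines in total, no matter how many patterns `-f` supplies, so one grep call per pattern is needed. The loop above can also be written with `while read`, which avoids word splitting and globbing on the patterns. The sketch below is one such variant, not the answerer's code. The `grep_demo` directory and its sample data are made up for illustration, `-F` (fixed-string matching) is an added assumption that the patterns are literal identifiers, and `-m` is a GNU/BSD grep option rather than a POSIX one.

```shell
#!/bin/sh
# Hypothetical sample data standing in for pattern.txt and file.txt.
d=grep_demo
mkdir -p "$d"
printf '%s\n' LocusA LocusB > "$d/pattern.txt"
{
  for i in 1 2 3 4 5 6 7; do printf 'LocusA hit%s\n' "$i"; done
  for i in 1 2 3; do printf 'LocusB hit%s\n' "$i"; done
} > "$d/file.txt"

# Read one pattern per line and run a separate grep for each.
# -F: treat the pattern as a fixed string; -m 5: stop after 5 matching lines.
while IFS= read -r pat; do
  [ -n "$pat" ] || continue   # skip empty lines, which would match every line
  grep -i -F -m 5 -- "$pat" "$d/file.txt"
done < "$d/pattern.txt" > "$d/out.txt"

cat "$d/out.txt"
```

With real data, the same loop would read pattern.txt and file.txt instead of the demo files.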

#2


2  

Each time you use > you overwrite the file with the new output. Use >> instead to append to the file:

$ yourcommand >> file
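
Where the redirection sits matters as much as which operator is used. The sketch below, with a made-up `redirect_demo` directory and placeholder words instead of grep output, contrasts `>` inside a loop, `>>` inside a loop, and `>` placed after `done`, which opens the file once for the whole loop.

```shell
#!/bin/sh
# Hypothetical demo directory; the words stand in for real command output.
d=redirect_demo
mkdir -p "$d"

# '>' inside the loop truncates the file on every iteration: only the last write survives.
for w in one two three; do echo "$w" > "$d/overwrite.txt"; done

# '>>' inside the loop appends; empty the file first so reruns start fresh.
: > "$d/append.txt"
for w in one two three; do echo "$w" >> "$d/append.txt"; done

# '>' after 'done' opens the file once for the whole loop, so nothing is lost.
for w in one two three; do echo "$w"; done > "$d/whole_loop.txt"

wc -l "$d"/*.txt
```

Here overwrite.txt ends up with a single line, while the other two files keep all three.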

#3


1  

Here's one way using awk. It should be quite quick too, because file.txt is read only once:

awk 'BEGIN { IGNORECASE=1 } FNR==NR { a[$0]++; next } { for (i in a) if ($0 ~ i && a[i] <= 5) { print; a[i]++ } }' pattern.txt file.txt

Results:

Locus3039v1rpkm6.85 gi|350401309|ref|XM_003486067.1|    0   10  85  328 253 8e-12   78.8
Locus3039v1rpkm6.85 gi|350401301|ref|XM_003486066.1|    0   10  85  566 491 8e-12   78.8
Locus3039v1rpkm6.85 gi|350401298|ref|XM_003486065.1|    0   10  85  500 425 8e-12   78.8
Locus3039v1rpkm6.85 gi|340723355|ref|XM_003400008.1|    0   10  106 566 470 3e-11   77.0
Locus3039v1rpkm6.85 gi|340723353|ref|XM_003400007.1|    0   10  106 496 400 3e-11   77.0
Locus3041v1rpkm6.84 gi|337757426|emb|FQ859181.1|    1   61  114 2772617 2772667 0.60    42.8
Locus3041v1rpkm6.84 gi|159889572|gb|CP000875.1|     0   5   40  1185295 1185330 0.60    42.8
Locus3041v1rpkm6.84 gi|158107272|gb|CP000820.1|     0   2   34  5594193 5594161 0.60    42.8
Locus3041v1rpkm6.84 gi|156844486|ref|XM_001645256.1|    83  140 793 850 0.60    42.8
Locus3041v1rpkm6.84 gi|339305108|gb|CP001503.2|     0   58  94  3006529 3006565 2.1 41.0
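
A few caveats on the one-liner above: `IGNORECASE` is specific to GNU awk, each pattern is used as a regular expression (so the `.` in `rpkm6.85` matches any character), and a line that matches several patterns would be printed once per pattern. The sketch below is a possible portable variant, not the answerer's code. It assumes the identifier is always the first whitespace-separated field, as in the sample data, and compares it exactly after lower-casing. The `awk_demo` directory and its contents are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data standing in for pattern.txt and file.txt.
d=awk_demo
mkdir -p "$d"
printf '%s\n' LocusA LocusB > "$d/pattern.txt"
{
  for i in 1 2 3 4 5 6 7; do printf 'LocusA hit%s\n' "$i"; done
  printf 'LocusC hit1\n'
  for i in 1 2 3; do printf 'locusb hit%s\n' "$i"; done
} > "$d/file.txt"

awk '
  # First file: remember each pattern, lower-cased.
  FNR == NR { want[tolower($0)] = 1; next }
  # Second file: compare on the first field only, and print
  # at most 5 lines per pattern.
  {
    key = tolower($1)
    if ((key in want) && seen[key]++ < 5) print
  }
' "$d/pattern.txt" "$d/file.txt" > "$d/out.txt"

cat "$d/out.txt"
```

As with the original one-liner, output follows the order of lines in file.txt, not the order of patterns in pattern.txt.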
