Grep a specific column, count, and write to an output file

Time: 2021-12-25 18:26:34

I'm trying to summarize my data and count specific items.

These are human sequencing data, so the files are very large.

#CHROM  POS   ID    REF  ALT    QUAL    FILTER      INFO          FORMAT                            NORMAL                                          PRIMARY
  1    12867  .     C    A       5  q40;bldp;blq    SS=1;VT=SNP;  GT:DP:AD:BQ:MQ:SB:FA:SS:SSC:MQA   1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0   1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0

To simplify, the data look something like this:

column1 column2 column3 column4 column5 column6 column7  column8   column9 column10                                         column11
   x      x      x        x       x        x      x       SS=1       x     1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0    1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
   x      x      x        x       x        x      x       SS=2       x     1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0    1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0

First, I need to count how many times each SS value occurs in column 8. There are 5 different types of SS, i.e. SS=1 ... SS=5. This can be done with grep, and I tried:

grep SS=1 file1.vcf | wc -l
grep SS=2 file1.vcf | wc -l

Then I want to count how many "0", "1", and "2" values appear in columns 10 and 11 at the position after the 7th colon (:).

This is the part that I'm not sure how to do. I was thinking about using awk, but I'm not sure how to tell it to look at a specific position (after the 7th colon) within a column.

awk -F ':' '$11==1' # this only selects a colon-separated field of the whole line, not a specific position within a column

I have 246 files that I want to process in exactly the same way. How can I apply this to all my files and write the counts to a txt file? I only know how to do it one file at a time, and I could probably cat the count files together at the end.

for f in *.vcf; do grep SS=1 "$f" | wc -l > "${f}SS1.txt"; done

1 solution

#1



To count how many times each distinct value appears in column 8, you can use the typical approach of an array indexed by the value:

$ awk -F"\t" 'NR>1{a[$8]++} END{for (i in a) print i,a[i]}' file
SS=1 1
SS=2 1
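
In the real row shown in the question, INFO (column 8) holds several `;`-separated entries such as `SS=1;VT=SNP;`, so counting whole `$8` values would group by the full string. Here is a minimal sketch of one way to tally only the `SS=` entry, assuming tab-separated input and that every header line starts with `#`. The names `sample.vcf` and `ss_counts.txt`, and the second data row, are hypothetical and used only for illustration:

```shell
# sample.vcf is a hypothetical two-row, tab-separated stand-in for the real data.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > sample.vcf
printf '1\t12867\t.\tC\tA\t5\tq40\tSS=1;VT=SNP;\n' >> sample.vcf
printf '1\t12900\t.\tG\tT\t7\tq40\tSS=2;VT=SNP;\n' >> sample.vcf

awk -F'\t' '
  /^#/ { next }                      # skip every header line, not just the first
  {
    n = split($8, kv, ";")           # assumption: INFO is ";"-separated key=value pairs
    for (i = 1; i <= n; i++)
      if (kv[i] ~ /^SS=/) count[kv[i]]++
  }
  END { for (k in count) print k, count[k] }
' sample.vcf | sort > ss_counts.txt  # sort, because "for (k in ...)" order is unspecified

cat ss_counts.txt
```

If column 8 really contains nothing but `SS=n`, as in the simplified sample, the shorter `a[$8]++` form above is enough.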

The value after the 7th colon is the 8th element of the :-separated string. To count how often each value appears in that position of the 10th and 11th fields, you can use split() to break each field into an array and then apply the same counting approach as above.

$ awk -F"\t" 'NR>1{split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' file
0 2
1 2

You can put it all together to get something like:

$ awk -F"\t" 'NR>1{count8[$8]++; split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count8) print i, count8[i]; for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' file
SS=1 1
SS=2 1
0 2
1 2

If you want to do this for many files, you can either use a shell loop or, better, let awk handle all of them in one run by using FILENAME and ENDFILE (a GNU awk extension) to print and reset the stored counts at the end of each file. Try it out and let us know if you run into any problems.
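
The multi-file idea can also be sketched without ENDFILE, which should keep it portable to awks other than GNU awk. Here, counts are keyed by FILENAME and everything is printed once at the end. This is only a sketch under assumptions: the input is tab-separated, header lines start with `#`, and `make_row`, `a.vcf`, `b.vcf`, and `counts.txt` are hypothetical names standing in for the real 246 files:

```shell
# make_row is a hypothetical helper that builds one sample row:
# $1 = SS value in column 8, $2 / $3 = 8th ":"-field of columns 10 / 11.
make_row() {
  printf 'x\tx\tx\tx\tx\tx\tx\tSS=%s\tx\t1/0:8:7,1:36,39:0:0.0,0.0:0.125:%s:5:14.9,16.0\t1/0:2:2,0:33,0:0:0.0,0:0.0:%s:5:16.0,0\n' "$1" "$2" "$3"
}
{ printf '#header\n'; make_row 1 0 1; make_row 2 0 1; } > a.vcf
{ printf '#header\n'; make_row 1 0 2; } > b.vcf

awk -F'\t' '
  /^#/ { next }                              # skip header lines in every file
  {
    split($10, a, ":"); split($11, b, ":")
    count[FILENAME, "col8",  $8]++           # keys are joined with SUBSEP
    count[FILENAME, "col10", a[8]]++
    count[FILENAME, "col11", b[8]]++
  }
  END {
    for (k in count) {
      split(k, p, SUBSEP)                    # recover file, column label, value
      print p[1], p[2], p[3], count[k]
    }
  }
' a.vcf b.vcf | sort > counts.txt

cat counts.txt
```

Each output line reads `file column value count`, so one txt file holds the summary for all inputs. For the real data the last line of the awk call would take `*.vcf` instead of the two sample names.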
