Please help me parse a VCF file. I am pasting a real example.
Input:
1 1014143 rs786201005 C T . . RS=786201005;RSPOS=1014143;dbSNPBuildID=144;SSR=0;SAO=1;VP=0x050068000605000002110100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;NSN;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014143C>T;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0003;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000162196.3
1 1014228 rs1921 G A,C . . RS=1921;RSPOS=1014228;dbSNPBuildID=36;SSR=0;SAO=0;VP=0x050328000a0517053f000100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPhase3;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014228G>A;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=2;CLNDSDB=MedGen;CLNDSDBID=CN169374;CLNDBN=not_specified;CLNREVSTAT=single;CLNACC=RCV000455759.1;CAF=0.6611,0.3389,.;COMMON=1
1 1014316 rs672601345 C CG . . RS=672601345;RSPOS=1014319;dbSNPBuildID=142;SSR=0;SAO=1;VP=0x050068001205000002110200;GENEINFO=ISG15:9636;WGT=1;VC=DIV;PM;PMC;NSF;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014319dupG;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0002;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000148989.5
1 1014359 rs672601312 G T . . RS=672601312;RSPOS=1014359;dbSNPBuildID=142;SSR=0;SAO=1;VP=0x050068000605000002110100;GENEINFO=ISG15:9636;WGT=1;VC=SNV;PM;PMC;NSN;REF;ASP;LSD;OM;CLNALLE=1;CLNHGVS=NC_000001.11:g.1014359G>T;CLNSRC=OMIM_Allelic_Variant;CLNORIGIN=1;CLNSRCID=147571.0001;CLNSIG=5;CLNDSDB=MedGen:OMIM:Orphanet;CLNDSDBID=C4015293:616126:ORPHA319563;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNREVSTAT=no_criteria;CLNACC=RCV000148988.5
1 1020183 rs539283387 G C . . RS=539283387;RSPOS=1020183;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000a05040026000100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;NSM;REF;ASP;VLD;KGPhase3;CLNALLE=1;CLNHGVS=NC_000001.11:g.1020183G>C;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=3;CLNDSDB=MedGen;CLNDSDBID=CN169374;CLNDBN=not_specified;CLNREVSTAT=single;CLNACC=RCV000424799.1;CAF=0.9904,0.009585;COMMON=1
1 1020216 rs764659938 C G . . RS=764659938;RSPOS=1020216;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000000a05040002000100;GENEINFO=AGRN:375790;WGT=1;VC=SNV;NSM;REF;ASP;VLD;CLNALLE=1;CLNHGVS=NC_000001.11:g.1020216C>G;CLNSRC=.;CLNORIGIN=1;CLNSRCID=.;CLNSIG=0;CLNDSDB=MedGen;CLNDSDBID=CN221809;CLNDBN=cancer;CLNREVSTAT=single;CLNACC=RCV000422793.1
And I need this output:
1014143 rs786201005 C T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014228 rs1921 G A,C CLNSIG=2 CLNDBN=not_specified
1014316 rs672601345 C CG CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014359 rs672601312 G T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1020183 rs539283387 G C CLNSIG=3 CLNDBN=not_specified
1020216 rs764659938 C G CLNSIG=0 CLNDBN=cancer
That means printing columns 2, 3, 4, and 5, then parsing the last column and printing only CLNSIG and CLNDBN. The problem is that those values are not always in the same position.
My attempt was:
awk -v OFS="\t" '{print $2,$3,$4,$5,$8}' input
...and then I have no clue how to get CLNSIG and CLNDBN.
Thank you for any ideas.
3 Solutions
#1
2
With perl
$ perl -lane 'print join "\t",(@F[1..4], /(?:CLNSIG|CLNDBN)=[^;]+/g)' ip.txt
1014143 rs786201005 C T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014228 rs1921 G A,C CLNSIG=2 CLNDBN=not_specified
1014316 rs672601345 C CG CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014359 rs672601312 G T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1020183 rs539283387 G C CLNSIG=3 CLNDBN=not_specified
1020216 rs764659938 C G CLNSIG=0 CLNDBN=cancer
- The -a option splits the input on whitespace and saves the fields in the @F array.
- /(?:CLNSIG|CLNDBN)=[^;]+/g returns the CLNSIG and CLNDBN fields.
- @F[1..4] gives the 2nd to 5th fields (indexing starts from 0).
- See http://perldoc.perl.org/perlrun.html#Command-Switches for details on the -lane options.
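To see what the pattern picks out, the same alternation can be tried on a single INFO string. This is only a minimal sketch, assuming a grep that supports -o and -E (GNU and BSD grep do); the helper name show_cln and the shortened INFO string are made up for illustration. Perl's non-capturing (?:...) group becomes a plain group in an extended regular expression.

```shell
#!/bin/sh
# show_cln (hypothetical helper): print each CLNSIG=/CLNDBN= entry
# found in the INFO string passed as $1, one per line.
show_cln() {
    printf '%s\n' "$1" | grep -oE '(CLNSIG|CLNDBN)=[^;]+'
}

# Shortened, made-up INFO string; CLNDBN deliberately comes before CLNSIG.
show_cln 'RS=1921;CLNDBN=not_specified;VC=SNV;CLNSIG=2;COMMON=1'
```

Matches come back in the order they appear in the input, not in the order of the alternation, so here CLNDBN is printed before CLNSIG. The same holds for the list returned by the perl /g match, which matters if the two keys are not always in the same order.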
#2
3
- Pure bash: works by using bash to parse the remaining variables in $h, then prints them with parameter transformation (${var@A}, which needs bash 4.4 or later):
while read a b c d e f g h ; do
    # turn "K1=V1;K2=V2;FLAG" into shell variables K1, K2, FLAG
    declare ${h//;/ }
    printf "%s\t%-10s\t%s\t%s\t%s\t%s\n" $b $c $d $e ${CLNSIG@A} ${CLNDBN@A}
done < input
Output:
1014143 rs786201005 C T CLNSIG='5' CLNDBN='Immunodeficiency_38_with_basal_ganglia_calcification'
1014228 rs1921 G A,C CLNSIG='2' CLNDBN='not_specified'
1014316 rs672601345 C CG CLNSIG='5' CLNDBN='Immunodeficiency_38_with_basal_ganglia_calcification'
1014359 rs672601312 G T CLNSIG='5' CLNDBN='Immunodeficiency_38_with_basal_ganglia_calcification'
1020183 rs539283387 G C CLNSIG='3' CLNDBN='not_specified'
1020216 rs764659938 C G CLNSIG='0' CLNDBN='cancer'
- POSIX shell, grep and printf method (grep -o and the \| alternation are not part of POSIX grep, so this needs a grep such as GNU grep):
while read a b c d e f g h ; do
    printf "%s\t%-10s\t%s\t%s\t%s\t%s\n" $b $c $d $e \
        $( echo "$h" | grep -o 'CLN\(SIG\|DBN\)=[^;]*' )
done < input
Output:
1014143 rs786201005 C T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014228 rs1921 G A,C CLNSIG=2 CLNDBN=not_specified
1014316 rs672601345 C CG CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1014359 rs672601312 G T CLNSIG=5 CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification
1020183 rs539283387 G C CLNSIG=3 CLNDBN=not_specified
1020216 rs764659938 C G CLNSIG=0 CLNDBN=cancer
#3
2
It can be done using awk:
script.awk
BEGIN { OFS="\t" }
{
    # reset for every line so values never carry over from the previous record
    clnsig = clndbn = ""
    if( match( $8, /CLNSIG=[^;]+/ ) ) {
        clnsig = substr( $8, RSTART, RLENGTH )
    }
    if( match( $8, /CLNDBN=[^;]+/ ) ) {
        clndbn = substr( $8, RSTART, RLENGTH )
    }
    print $2, $3, $4, $5, clnsig, clndbn
}
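The script can be saved as script.awk and run with awk -f script.awk input. For a quick check without creating files, the same program can be inlined in a shell function. This is only a sketch: extract_cln is a made-up name, and the two sample rows are shortened from the question's input (the INFO column is trimmed to a few keys).

```shell
#!/bin/sh
# extract_cln (hypothetical name): read VCF-like rows on stdin and print
# columns 2-5 plus the CLNSIG and CLNDBN entries found in column 8.
extract_cln() {
    awk 'BEGIN { OFS = "\t" }
    {
        clnsig = clndbn = ""
        if (match($8, /CLNSIG=[^;]+/)) clnsig = substr($8, RSTART, RLENGTH)
        if (match($8, /CLNDBN=[^;]+/)) clndbn = substr($8, RSTART, RLENGTH)
        print $2, $3, $4, $5, clnsig, clndbn
    }'
}

# Two rows shortened from the question; INFO is trimmed for readability.
extract_cln <<'EOF'
1 1014228 rs1921 G A,C . . RS=1921;VC=SNV;CLNSIG=2;CLNDBN=not_specified;COMMON=1
1 1020216 rs764659938 C G . . RS=764659938;CLNSIG=0;CLNDBN=cancer
EOF
```

Because each key is matched separately, the output columns stay in the CLNSIG, CLNDBN order even when the INFO column lists them the other way round, and a missing key simply leaves its column empty.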
Or more compact, in case CLNDBN always comes after CLNSIG (this uses the three-argument form of match, which is a GNU awk extension):
script.awk
BEGIN { OFS="\t" }
{
    match($8,/(CLNSIG=[^;]+).*(CLNDBN=[^;]+)/, tmp)
    print $2,$3,$4,$5, tmp[1], tmp[2]
}
The function match matches a regular expression. The first form sets the variables RSTART and RLENGTH so that you can extract the text with substr.
The second form puts the first subexpression (first parentheses) in the array tmp at position 1, the second subexpression at position 2, and so on.
The regular expression CLNSIG=[^;]+ matches a literal CLNSIG= followed by a substring up to (but not including) the next ;.
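One caveat with a pattern such as CLNSIG=[^;]+ is that it would also match inside a longer key that merely ends in CLNSIG. If that is a concern, an alternative (not one of the approaches given above) is to split column 8 on ; and compare key names exactly. The following is a minimal sketch using only POSIX awk features; cln_by_key is a made-up name, and the sample row is modified from the question so that CLNDBN comes first.

```shell
#!/bin/sh
# cln_by_key (hypothetical name): split column 8 on ";" and look the two
# keys up by exact name, so their order in the INFO column does not matter.
cln_by_key() {
    awk 'BEGIN { OFS = "\t" }
    {
        clnsig = clndbn = ""
        n = split($8, entry, ";")
        for (i = 1; i <= n; i++) {
            eq = index(entry[i], "=")      # 0 for flags such as PM or REF
            if (eq == 0) continue
            key = substr(entry[i], 1, eq - 1)
            if (key == "CLNSIG") clnsig = entry[i]
            else if (key == "CLNDBN") clndbn = entry[i]
        }
        print $2, $3, $4, $5, clnsig, clndbn
    }'
}

# Row modified from the question: INFO trimmed and the two keys reordered.
cln_by_key <<'EOF'
1 1014143 rs786201005 C T . . RS=786201005;PM;CLNDBN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNSIG=5
EOF
```

The loop costs a little more per line than a single match call, but it avoids relying on key order and on GNU-specific features.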