GFF format

Postscript:

************************************************************************

When running cufflinks and cuffmerge I used GFF3 files throughout. Haibao said it is better to use GTF; either way, GTF is certain to work.

To convert GFF3 to GTF, use gffread:

Command:

gffread Osativa_204_gene.gff3 -T -o Osativa_204_gene.gtf

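After conversion, the transcript structure recorded in the GFF3 file (the exon and CDS records) is preserved. The GTF has no UTR features, but they can be worked out as exon minus CDS,

which is why running snpEff with the GFF3 file and with the GTF file gives the same result.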
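As a rough illustration of the "exon minus CDS" idea, here is a minimal Python sketch. The function name `subtract_intervals` and the coordinates are made up for the example (they are not from the rice annotation), and this is not how gffread or snpEff are implemented; it only shows the interval arithmetic. Coordinates are 1-based and inclusive, as in GFF.

```python
def subtract_intervals(exons, cds):
    """Return the parts of the exons that are not covered by CDS.

    Both arguments are lists of (start, end) pairs, 1-based and inclusive.
    The leftover pieces are the untranslated regions (UTRs).
    """
    utr = []
    for ex_start, ex_end in sorted(exons):
        pos = ex_start  # first exon base not yet accounted for
        for c_start, c_end in sorted(cds):
            if c_end < pos or c_start > ex_end:
                continue  # this CDS segment does not touch the rest of the exon
            if c_start > pos:
                utr.append((pos, c_start - 1))  # exon bases before the CDS
            pos = max(pos, c_end + 1)
        if pos <= ex_end:
            utr.append((pos, ex_end))  # exon bases after the last CDS
    return utr


# Hypothetical transcript with three exons; the CDS starts inside the
# first exon and stops inside the last one.
exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]
utr = subtract_intervals(exons, cds)
print(utr)  # [(100, 149), (551, 600)]
```

Whether a leftover piece is the 5' UTR or the 3' UTR depends on the strand: on the + strand the piece with the smaller coordinates is the 5' UTR, on the - strand it is the 3' UTR.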

************************************************************************

My ASE analysis has finally reached the point where I need to use a GTF file, so I had to learn the format. Here is a brief introduction to it.

GFF: General Feature Format

GTF: Gene Transfer Format

These files contain gene annotations or other transcript data.

GFF has many versions, but the two most popular are GTF2 and GFF3.

The proposed GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.

GTF2 is a member of the GFF family of formats. In a GTF2 file the attributes include transcript_id, gene_id, and gene_name:

Chr1 phytozome9_0      exon 10274 10430 .     +     .     transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01
Chr1 phytozome9_0      exon 10504 10817 .     +     .     transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01
Chr1 phytozome9_0      CDS   3449 3616 .     +     0     transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01
Chr1 phytozome9_0      CDS   4357 4455 .     +     0     transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01
Chr1 phytozome9_0      CDS   5457 5560 .     +     0     transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010"; gene_name "LOC_Os01

GFF file format:
Each line has 9 columns, and the columns are tab-delimited.
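To make the nine-column layout concrete, here is a small Python sketch that splits one line on tabs and names the columns. `GffRecord` and `parse_gff_line` are names invented for this example, and the sample line is a shortened version of the first exon row above (gene_name is left out because it is cut off in the table).

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int       # 1-based
    end: int         # inclusive
    score: str       # "." when there is no score
    strand: str      # "+", "-" or "."
    frame: str       # "0", "1", "2" or "."
    attributes: str  # raw text of column 9


def parse_gff_line(line):
    """Split one GFF/GTF data line into its 9 tab-delimited columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], fields[8],
    )


line = ('Chr1\tphytozome9_0\texon\t10274\t10430\t.\t+\t.\t'
        'transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010";')
rec = parse_gff_line(line)
print(rec.feature, rec.start, rec.end, rec.strand)  # exon 10274 10430 +
```

Splitting on the tab character only (not on any whitespace) matters, because the attributes column itself contains spaces.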

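Column 1: seqname, the sequence the feature is located on.

Column 2: source, where the annotation came from. In the table above it is phytozome.

Column 3: feature, for example exon, CDS, or *UTR. It is the counterpart of the name column in BED format.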

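In a GFF3 file the features are nested: gene is the outermost level, then mRNA (transcript), then exon, then CDS and *UTR. If there are no exon records, the exons can be computed from the CDS and *UTR records. One gene can have several mRNAs.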
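Computing exons from CDS and UTR records amounts to merging pieces that touch or overlap. A minimal Python sketch, assuming all pieces belong to one transcript; the function name `merge_into_exons` and the coordinates are hypothetical.

```python
def merge_into_exons(parts):
    """Merge CDS and UTR pieces of one transcript into exons.

    parts: list of (start, end) pairs, 1-based and inclusive.
    Pieces that overlap or are directly adjacent (end + 1 == next start)
    are joined into a single exon.
    """
    merged = []
    for start, end in sorted(parts):
        if merged and start <= merged[-1][1] + 1:
            # touches the previous piece: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical transcript: a UTR piece, three CDS pieces, and a final UTR piece.
parts = [(100, 149), (150, 200), (300, 400), (500, 550), (551, 600)]
exon_list = merge_into_exons(parts)
print(exon_list)  # [(100, 200), (300, 400), (500, 600)]
```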

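Column 4: start, the base where the feature begins. Coordinates are 1-based, meaning the first base of the reference sequence is 1. In the first row of the table above, for example, the first exon starts at base 10274 and ends at base 10430. Coordinates are always given relative to the forward strand.

Column 5: end, the base where the feature ends. Note that the end coordinate is inclusive.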
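Because start is 1-based and end is inclusive, the length of a feature is end - start + 1, and converting to BED (which is 0-based with an exclusive end) only shifts the start. A short Python sketch using the first exon row from the table; the function names are made up for the example.

```python
def feature_length(start, end):
    """Length in bases of a GFF feature (1-based, inclusive end)."""
    return end - start + 1


def gff_to_bed(start, end):
    """Convert 1-based inclusive GFF coordinates to 0-based half-open BED."""
    return start - 1, end


print(feature_length(10274, 10430))  # 157
print(gff_to_bed(10274, 10430))      # (10273, 10430)
```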

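A note on coordinates. In RNA-seq, if a read maps to the forward strand, the forward strand was expressed, and the features run from small to large coordinates: start codon, CDS1, CDS2, CDS3, stop codon. If the read maps to the reverse strand, then for that gene the reverse strand was transcribed. Its direction is exactly opposite to the forward strand: where the forward strand reads 5' to 3', this one reads 3' to 5'. The coordinates are still given on the forward strand, however, so in order of increasing coordinate the features appear as: stop codon, CDS3, CDS2, CDS1, start codon.

To see why, you need to understand transcription. In double-stranded DNA, transcription is not fixed to one particular strand, and it is not true that if the forward strand is transcribed the reverse strand is not. It is decided gene by gene. One double-stranded DNA molecule may contain many genes; gene1 may be transcribed from the forward strand and gene2 from the reverse strand. That is why you see both + and - in GTF and GFF files. If only one strand were ever transcribed, there would be no reason to have both.

Thanks to Dr. G for the patient explanation, thank you very much. Otherwise I would be a graduate student who still had not figured this out, which would be really embarrassing...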

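Column 6: score. It does not seem to be used for much.

Column 7: strand, that is, forward (+) or reverse (-). Coordinates run from small to large on either strand, so on the forward strand the first CDS has smaller coordinates than the second, while on the reverse strand the first CDS has larger coordinates than the second.

A dot means the feature has no strand, or the strand is not specified.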
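The strand rule can be written down in a few lines of Python: to list the CDS segments of one transcript in the order they are transcribed, sort ascending for + and descending for -. The function name is invented for this example; the coordinates are the three CDS rows from the table above, and the "-" call is only a what-if, since those rows are on the + strand.

```python
def in_transcription_order(segments, strand):
    """Order (start, end) segments of one transcript 5' to 3'.

    Coordinates are always on the forward strand, so a gene on the
    reverse strand is read from its largest coordinates to its smallest.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return sorted(segments, reverse=(strand == "-"))


cds = [(3449, 3616), (4357, 4455), (5457, 5560)]
print(in_transcription_order(cds, "+"))
# [(3449, 3616), (4357, 4455), (5457, 5560)]
print(in_transcription_order(cds, "-"))
# [(5457, 5560), (4357, 4455), (3449, 3616)]
```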

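Column 8: frame. If the first whole codon begins at the first base of the feature, the frame is 0; if it begins at the second base, the frame is 1; if it begins at the third base, the frame is 2. The first base here means the 5' end of the feature, so for a feature on the reverse strand it is counted from the end coordinate.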
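The frame of each CDS segment follows from the lengths of the segments before it. Below is a minimal Python sketch, assuming a complete CDS that begins at the start codon and segments already listed in transcription order; `cds_frames` is a name made up for the example. With the three CDS rows from the table it gives 0 for every segment, matching column 8 there.

```python
def cds_frames(segments):
    """Frame (0, 1 or 2) of each CDS segment of one transcript.

    segments: (start, end) pairs, 1-based inclusive, in transcription
    order (5' to 3'). Assumes the first segment begins with a whole codon.
    """
    frames = []
    frame = 0
    for start, end in segments:
        frames.append(frame)
        length = end - start + 1
        # Bases left over after the last whole codon of this segment
        # decide how many bases the next segment must skip.
        frame = (3 - (length - frame) % 3) % 3
    return frames


print(cds_frames([(3449, 3616), (4357, 4455), (5457, 5560)]))  # [0, 0, 0]

# Hypothetical segments of 100 bases each: 100 = 33 codons + 1 base,
# so the next segment starts with 2 bases that finish the split codon.
print(cds_frames([(1, 100), (201, 300), (401, 500)]))  # [0, 2, 1]
```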

Column 9: attributes. Textual attribute values must be enclosed in double quotes, and different attributes are separated by semicolons. An attribute name and its value are separated by a single space, not a tab.

gene_id, a globally unique identifier for the genomic source of the transcript. One gene_id can correspond to several transcript_ids because of alternative splicing.

transcript_id, a globally unique identifier for the predicted transcript.

These attributes are designed for handling multiple transcripts from the same genomic region.

Any other attributes must come after these two.
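The attribute syntax described above (name, one space, value, semicolon) is simple enough to parse with a regular expression. This is only a sketch for well-formed GTF lines, not a full parser; `parse_attributes` is a name invented for the example, and the sample string is taken from the table with the truncated gene_name left out.

```python
import re

# name, a space, then either a double-quoted value or a bare token,
# followed by a semicolon or the end of the string
_ATTR = re.compile(r'(\S+) (?:"([^"]*)"|([^";\s]+))\s*(?:;|$)')


def parse_attributes(text):
    """Parse GTF column 9 into a dict of attribute name -> value."""
    attrs = {}
    for match in _ATTR.finditer(text):
        name = match.group(1)
        quoted, bare = match.group(2), match.group(3)
        attrs[name] = quoted if quoted is not None else bare
    return attrs


text = 'transcript_id "PAC:24118181"; gene_id "LOC_Os01g01010";'
attrs = parse_attributes(text)
print(attrs["gene_id"])        # LOC_Os01g01010
print(attrs["transcript_id"])  # PAC:24118181
```

Grouping records by attrs["gene_id"] and then by attrs["transcript_id"] is the usual way to collect all transcripts of one gene.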

by freemao

FAFU

free_mao@qq.com
