Exploring the GRCh38 genome and annotation files

Date: 2023-01-24 21:56:59

ensembl/release91:

grep -v '^#' Homo_sapiens.GRCh38.91.gtf | cut -f9 | cut -d';' -f1,3,6,8 | grep gene_biotype | sed -e 's/"//g' -e 's/;//g' | cut -d' ' -f2,6,8 | sort -u > GRCh38.feature.info

The output has three columns (gene ID, gene name, gene biotype) and begins:

ENSG00000000003 TSPAN6 protein_coding
ENSG00000000005 TNMD protein_coding
ENSG00000000419 DPM1 protein_coding
ENSG00000000457 SCYL3 protein_coding
ENSG00000000460 C1orf112 protein_coding
ENSG00000000938 FGR protein_coding
ENSG00000000971 CFH protein_coding
ENSG00000001036 FUCA2 protein_coding
ENSG00000001084 GCLC protein_coding
ENSG00000001167 NFYA protein_coding
ENSG00000001460 STPG1 protein_coding
ENSG00000001461 NIPAL3 protein_coding
ENSG00000001497 LAS1L protein_coding
ENSG00000001561 ENPP4 protein_coding
ENSG00000001617 SEMA3F protein_coding
ENSG00000001626 CFTR protein_coding
ENSG00000001629 ANKIB1 protein_coding
ENSG00000001630 CYP51A1 protein_coding
ENSG00000001631 KRIT1 protein_coding
ENSG00000002016 RAD52 protein_coding
ENSG00000002079 MYH16 transcribed_unitary_pseudogene
ENSG00000002330 BAD protein_coding
ENSG00000002549 LAP3 protein_coding
ENSG00000002586 CD99 protein_coding
ENSG00000002587 HS3ST1 protein_coding
ENSG00000002726 AOC1 protein_coding
ENSG00000002745 WNT16 protein_coding
ENSG00000002746 HECW1 protein_coding
ENSG00000002822 MAD1L1 protein_coding
ENSG00000002834 LASP1 protein_coding
ENSG00000002919 SNX11 protein_coding
ENSG00000002933 TMEM176A protein_coding
ENSG00000003056 M6PR protein_coding
ENSG00000003096 KLHL13 protein_coding
ENSG00000003137 CYP26B1 protein_coding
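
The pipeline above selects attributes by their position inside column 9, so it only matches records whose attribute order happens to place gene_biotype in one of the selected fields. A minimal alternative sketch looks the attributes up by key instead. It assumes the GTF contains records whose third column is `gene` and which carry gene_id, gene_name and gene_biotype attributes. The function name extract_gene_info and the NA placeholder for a missing gene_name are illustrative choices, not part of the original notes.

```shell
# extract_gene_info GTF
# Print "gene_id gene_name gene_biotype" for every gene record, sorted and unique.
# Hypothetical helper: attributes are found by key, not by position.
extract_gene_info() {
    awk -F'\t' '
        $0 !~ /^#/ && $3 == "gene" {
            id = ""; name = ""; biotype = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^ +/, "", a)                 # trim leading spaces
                if (a == "") continue             # skip the empty piece after the last ";"
                key = a; sub(/ .*/, "", key)      # text before the first space
                val = a; sub(/^[^ ]+ +/, "", val) # text after the key
                gsub(/"/, "", val)                # drop the surrounding quotes
                if (key == "gene_id") id = val
                else if (key == "gene_name") name = val
                else if (key == "gene_biotype") biotype = val
            }
            if (name == "") name = "NA"           # placeholder chosen for this sketch
            print id, name, biotype
        }' "$1" | LC_ALL=C sort -u
}
```

It could be called as `extract_gene_info Homo_sapiens.GRCh38.91.gtf > GRCh38.feature.info`. Whether the result is identical to the file produced by the original pipeline has not been checked here.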

  

58302 ENSG IDs

56655 gene names (1647 fewer than the IDs; why are so many names duplicated?)
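
The two counts can be reproduced from the three-column file, and listing the names that belong to more than one ID is a first step toward answering the duplication question. The sketch below assumes a space-separated file laid out like GRCh38.feature.info (ID, name, biotype). The function names count_ids_and_names and duplicated_names are hypothetical.

```shell
# count_ids_and_names FILE
# Print "<distinct gene IDs> <distinct gene names>" on one line.
count_ids_and_names() {
    awk '
        !($1 in ids)   { ids[$1];   n_ids++ }
        !($2 in names) { names[$2]; n_names++ }
        END { print n_ids + 0, n_names + 0 }' "$1"
}

# duplicated_names FILE
# Print "NAME: ID1 ID2 ..." for every gene name carried by more than one gene ID.
duplicated_names() {
    awk '
        !seen[$1 FS $2]++ { count[$2]++; ids[$2] = ids[$2] " " $1 }
        END { for (n in count) if (count[n] > 1) print n ":" ids[n] }' "$1" |
        LC_ALL=C sort
}
```

For example, `count_ids_and_names GRCh38.feature.info` prints the two totals, and `duplicated_names GRCh38.feature.info | head` shows the first few shared names together with their IDs.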

46 gene biotypes:

3prime_overlapping_ncRNA
antisense_RNA
bidirectional_promoter_lncRNA
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_J_gene
IG_J_pseudogene
IG_pseudogene
IG_V_gene
IG_V_pseudogene
lincRNA
macro_lncRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
non_coding
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
pseudogene
ribozyme
rRNA
scaRNA
scRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
sRNA
TEC
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
translated_processed_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
unitary_pseudogene
unprocessed_pseudogene
vaultRNA
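
The list above gives only the biotype names. A small sketch for counting how many gene IDs fall under each biotype follows, again assuming a space-separated file laid out like GRCh38.feature.info. The function name biotype_counts is hypothetical.

```shell
# biotype_counts FILE
# Print "<number of distinct gene IDs> <biotype>", largest count first.
biotype_counts() {
    awk '
        !seen[$1]++ { n[$3]++ }                  # count each gene ID once
        END { for (b in n) print n[b], b }' "$1" |
        LC_ALL=C sort -k1,1nr -k2,2
}
```

Running `biotype_counts GRCh38.feature.info | wc -l` should give the number of biotypes, which can be compared with the 46 reported above.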

  

  

Questions:

1. Why does expression quantification run into problems when the GENCODE annotation file is used?

 

2. What differs between releases?

 

3. How do annotations from different sources differ?

 

4.