文献编号:19Mar - 11
2019年04月23日三读,会其精髓;
相信这种方法的话,那么它的精髓是什么,如何整合出这个core gene set。
首先要考虑样本的选择,样本里是否存在明显的分层?
2019年04月01日再读;精读;
已经发现我的data没法在PCA里有明显的规律;应该可以直接从bulk RNA-seq里获取有价值的信息,那么single cell到底有什么优势呢?回答:单细胞的数据是必须的,它可以把core genes锚定到case-control pseudotime,从而可以得到早期的致病基因。
比较:
- 他们研究的是PD,这个疾病更重要,更流行,在老年人中发病率高,生理表征是失去了多巴胺神经元,仅有10%的遗传性;我们研究的是HSCR,没有那么流行,大概万分之几,基本是出生缺陷病,表征类似,没有正常function的神经元,遗传性高;
- GBA基因发现与PD有很强的相关性,5%-10%的PD携带GBA杂合突变;类似我们HSCR里的RET基因的突变;
- GBA和LRRK2的研究已经发现蛋白的稳态缺陷,通过内质网、自噬和溶酶体通路;我们发现了什么?需要看一下RET的paper;
- PD是一个神经退性疾病,所以他们要找上游的信号;而我们已经测到ENCC,我们鉴定出来的就直接是上游的信号了,我们只需要看上游的影响了哪些pathway;
short-segment Hirschsprung disease (S-HSCR: 80% of the patients), longsegment Hirschsprung (15% of the patients), and total colonic aganglionosis (TCA: 5% of the patients).1
highlight1:因为iPSC-derived的样本包含了细胞的异质性,会混淆基因表达的分析,所以本文做了单细胞(然而即使FACS后还是有异质性存在)。
highlight2:样本分层,发现某个样本中某些pathway被高度激活,确认后,这个样本就从下游分析中移除了;我们的更复杂,两个对照细胞系,更多的疾病种类,如何处理?
highlight3:60 genes,capture an axis of variation between control and case;如何确认的,我们的数据如何处理?
highlight4:HDAC4地位的确定,早期的许多差异基因被HDAC4调控,所以顺理成章,后面围绕着HSAC4故事讲得很完整,但他们用的软件貌似是收费的;我们如何发现这个明星分子?
核心1:60 genes是如何选出来的?
- DEG bulk RNA-seq
- DEG PC2 switch
- markers SC3
最终得到了60个genes,其中52个下调,8个上调;
核心2:如何验证这60个genes的?
- phenotypic linkage network;确定这些基因确实是和表型相关的;
- A Bayesian approach that learns transcriptomic trajectories,就是把每一个基因锚定到pseudotime上,确定它们是在哪个点产生影响的;
- ingenuity pathway analysis (IPA) (QIAGEN),收费的,伤不起啊;
核心3:HDAC4是如何筛选出来的。
Although total levels of HDAC4 protein were unchanged between controls and PD GBA-N370S patients (Figure S5B), the downregulation of four of the HDAC4-regulated genes (TSPAN7, ATP1A3, RTN1, and PRKCB) in PD GBA-N370S patient-derived neurons was experimentally confirmed. 可以看到,HDAC4的表达水平是不变的,但是有四个早期的基因都是被HDAC4调控的(IPA显示的),所以才有了后面的mislocalization的推断和验证。
其实整个过程是可以控制的,他们的搞法不是特别地严谨,也就只能在这套数据上使用吧。我们是不是可以取齐精髓,改进一下这个方法;50-100个core gene set比较好讲故事,200个以上就不好讲了,他们的阈值卡的不错;
这篇文章的方法写得比较容易理解,比正文写得明确些,值得多看看。
方法篇:
RNA-seq read alignment and expression quantification
Single-cell and bulk RNA-seq data was processed identically. FASTQ files were trimmed using TrimGalore v0.4.1 on default settings. Transcript expression levels were quantified using Kallisto v0.42.5 (Bray et al., 2016) against GRCh38 reference human transcriptome. Additional quality control (QC) metrics were compiled by Picard 2.0.1 (https://broadinstitute.github.io/picard/) on BAM files aligned to the human genome (GRCh38) using Hisat2 (Kim et al., 2015). Transcript level abundances were then summarized to gene level estimates using tximport 1.4.0 (https://bioconductor.org/packages/release/bioc/html/tximport.html) (Soneson et al., 2015).
Quality control of single-cell RNA-seq
Quality-control, visualization, and handling of single-cell data was performed using Scater (McCarthy et al., 2017). Cells belonging to plates 3-6 were removed due to distinct clustering on reduced-dimensionality representations and low expression of otherwise constitutively expressed genes. Further outliers were removed using Cellity (Ilicic et al., 2016) and subsequently any cell expression GAPDH at a level below the maximum GAPDH expression in blank wells was further removed, leaving a total of 146 cells for analysis.
Differential expression analysis
Differential expression analysis on bulk RNA-seq was performed using DESeq2 (Love et al., 2014), including a covariate to account for technical replicates. GO enrichment was performed using goseq (Young et al., 2010) and over-represented p values were multiple test corrected using the Benjamini-Hochberg procedure.
Single-cell pseudotime analysis
A PCA representation of the cells was computed using the prcomp function in the stats package in R using the 500 most variable genes (in log expression space), the default in Scater. Single-cell differential expression analysis along PC2 was performed using the R package switchde (Campbell and Yau, 2017). A further refined trajectory using the combined gene list along was computed using Ouija.
Identification of pathway activation in GBA 3
Over-dispersion analysis was performed in the method identical to Brennecke et al. (2013) using the R package scran (Lun et al., 2016) using ERCC spike-ins. A gene was designated as over-dispersed if the reported q-value < 0.05. GO analysis was performed using the R package GOSeq (Young et al., 2010). Genes from the SRP-dependent co-translational protein targeting to membrane pathway were selected for validation by performing a Wilcoxon rank-sum test for log2(TPM+1) expression in GBA3 cells compared to controls and prioritized based on p value.
Phenotypic linkage network construction
To assess functional similarity and convergence of the core gene set we constructed a phenotypic linkage network (Honti et al., 2014). We wished to assess the functional similarity of genes within the core set compared to a randomly sampled background distribution. The genes selected for the background distribution should match the overall expression pattern of the core set in these iPSCs in order to account for the increased likelihood of functional similarities between genes randomly selected from the same cell type. We first noted that the core set of genes exhibited high mean expression than average among the whole transcriptome. We then fitted a gamma distribution to the mean log2(TPM+1) values of both the core gene set and all other genes in the transcriptome (the core distribution and background distribution). Then for each gene not in the core gene set we calculated the probability of observing the mean expression level under both models, and formed an unnormalized inclusion probability of the ratio of the density of the observed expression given the core gene set distribution to the density of the observed expression given the background distribution. This can loosely be thought of as ‘‘how many times more likely is it that a gene fits the core set distribution compared to the background distribution.’’ To choose the background set of genes we then sampled 1000 genes from the transcriptome excluding the core gene set, where the probability of a gene being selected was proportional to the un-normalized inclusion probability. The resulting empirical distribution fitted well with the fitted core gene set distribution (Figure S4D). We subsequently constructed a phenotypic linkage network as per Honti et al. (2014) using 1) the core set of genes, 2) the 1000 sampled background genes, and 3) the genes SNCA, PARK2, PARK7, LRRK2, UCHL1, GBA, PINK1, ATP13A2, HTRA2, PLA2G6, VPS35, and EIF4G1. Links were compared between different classes of genes using a one-sided Wilcoxon rank-sum test.
这篇文章的核心就是构建了disease pseudotime,case-control axis ,这跟普通的pseudotime完全不同;
从core gene set中选出了关键的基因HDAC4,并做了非常完善的功能验证;药物验证;
目前唯一一篇把单细胞测序应用到疾病发育上的高分paper,发表在了CSC上。
参考阅读:
首先明确一下概念:
帕金森疾病有一定的遗传因素,本文就是研究可遗传的这部分,GBA基因里的N370S突变会导致帕金森。The majority of PD cases are idiopathic, with only about 10% attributed to inherited PD cases
早期发现:increased ER stress, autophagic and lysosomal perturbations, and elevated a-synuclein release;内质网应激增加,自噬和溶酶体微扰增加,a-synuclein释放增加
基本的思路:
找出了一个核心的DEG:HDAC4
HDAC4靶向药物可以纠正帕金森症状
把核心DEG按照作用时间进行了分类。
根据临床信息把不同的疾病样本分层了stratify
HDAC4错误定位了 mislocalized
问题:
1. 为什么找出的是这个基因?如何找出HDAC4的?
因为这个基因是早期基因的一个抑制因子,它更有可能是disorder的起源。
Analysis of the core set of 60 DE genes using ingenuity pathway analysis (IPA) (QIAGEN) identified histone deacetylase 4 (HDAC4) as a repressor of a set of genes downregulated early in the pseudo temporal profile
这并不是一个转录因子
Pseudotime analysis of genes differentially expressed (DE) along this axis identified the transcriptional repressor histone deacetylase 4 (HDAC4) as an upstream regulator of disease progression.
HDAC4 was mislocalized to the nucleus in PD iPSC-derived dopamine neurons and repressed genes early in the disease axis, leading to late deficits in protein homeostasis.
HDAC4错误地进入了细胞核,抑制了那些早起基因,导致了蛋白质稳态缺陷。
2. 怎么找到靶向药物的?
3. 如何构建pseudotime的?
we identified a progressive axis of gene expression variation leading to endoplasmic reticulum stress.
Combining bulk and single-cell expression profiles, we identified a robust set of 60 genes whose expression captured an axis of variation between cells from controls and the remaining two PD GBA-N370S patients.
4. core gene set是怎么找出来的?
We next sought to identify a core set of genes consistently perturbed across both bulk RNA-seq and the single-cell transcriptomic signature. This core set was identified as the intersection of two gene sets from the analysis: (1) those DE in bulk RNAseq using DESeq2 after the removal of GBA3 (at 5% FDR) and (2) those DE across PC2 using switchde (at 5% FDR; Campbell and Yau, 2017; Figure S4B). We further refined this set to include only those additionally identified as discriminating marker genes after clustering the single-cell RNA-seq data using SC3 (Kiselev et al., 2017). By combining genes found through both bulk and single-cell DE as well as single-cell clustering (STAR Methods), we identified a core set of 60 genes, 52 of which were consistently downregulated and 8 of which were consistently upregulated in PD GBA-N370S iPSC-derived dopamine neurons (Figure S4B).
去文中一一找答案。。。
图1:
A:提纯purify iPSC-derived dopamine neurons,然后做bulk RNA-seq 和单细胞
B:increased expression of dopamine neuron marker genes多巴胺神经元的marker表达提升。感觉图表很业余,看得费力,大致就是想展现一些marker的表达情况,证明细胞是对的;
C:常规DEG的火山图,对比的bulk RNA-seq,我们不能用,因为样本不够纯;
D:常规的GO 注释图,down and up分类
图二:
A:单细胞常规的聚类分析,只不过用得是PC2和PC3,应该用了scater包,发现了分层现象,GBA3的表达变异被PC3捕获了;
B:常规的Over dispersion分析,鉴定出具有生物学差异的基因;才143个gene,有没有搞错;
C:an enrichment in the endoplasmic reticulum (ER) signal recognition particle (SRP) pathway in GBA3;ER通路在GBA3中高度富集;
D:挑了几个基因出来证明;
E:qRTPCR验证,靶向膜途径的SRP依赖性共翻译蛋白的升高对GBA3具有特异性,提出了本研究中使用的患者的分子分层,分子分层的依据;
The stratification of PSP from PD among these GBAN370S carriers by single-cell profiling of their iPSC-derived dopamine neurons is therefore consistent with the clinical stratification and reveals a potential disease-relevant pathway for PSP.
我们的这个项目适合分层吗?肯定可以
图三:
A:首先想办法构建了一个pseudo diease axis,然后判断core gene set在哪个位置发挥作用
B:统计学验证了core gene set的合理性;
C:挑了几个基因做了验证,DIV 22就是分化的天数,可以直观的看出基因到底是在哪一个区段开始发挥作用