10X空间转录组数据分析之空间注释(解卷积,STdeconvolve)

时间:2024-12-14 07:08:14

hello,大家好,今天我们来分享一个新的空间注释的方法,当然主要思想还是解卷积,关于空间转录组的注释方法,我已经分享了很多了,在这里列举出来,供大家参考。

MIA用于单细胞和空间的联合分析
10X单细胞和空间联合分析的方法---cell2location
10X空间转录组和10X单细胞数据联合分析方法汇总
10X单细胞空间联合分析之四----DSTG
10X单细胞空间联合分析之三----Spotlight
10X单细胞空间联合分析之五----spatialDWLS
10X单细胞空间联合分析之六(依据每个spot的细胞数量进行单细胞空间联合分析)----Tangram
10X单细胞-10X空间转录组联合分析之七----CellDART
当然还有依据marker注释空间转录组的方法,10X空间转录组数据分析之思路总结(针对肿瘤样本)
10X单细胞-10X空间转录组联合分析之八----STRIDE(三维重构)

当然了,方法很多,不见得都是适合自己的样本分析,每种方法都有其优势和不足,我们要根据自己的实际情况进行选择。好了,开始我们今天的分享,文章在Reference-free cell-type deconvolution of pixel-resolution spatially resolved transcriptomics data,用到的方法是STdeconvolve,下面放几张分析效果图。

当然了,空间转录组可分析的内容是在太多了,如果大家有兴趣,可以在文章的下面评论一下,如果人员比较多的话,我就给大家讲一节公开课,大家一起交流交流,好了,开始我们今天的内容分享。

Abstract

最近的技术进步使空间解析转录组分析成为可能,但在多细胞像素分辨率下,从而阻碍了细胞类型空间共定位模式的识别。 作者开发了 STdeconvolve 作为一种无监督方法来对包含这种多细胞像素分辨率空间分辨转录组学数据集的底层细胞类型进行去卷积。 实际运用表明 STdeconvolve 有效地恢复了细胞类型的假定转录组学特征及其在空间分辨像素内的比例表示,而无需依赖外部单细胞转录组学参考(如果真的不以来单细胞数据,那真的是很完美,我们往下看看)。

Main

描绘组织内转录不同细胞类型的空间组织对于理解组织功能的细胞基础至关重要。最近的技术已经能够以多细胞像素分辨率在组织内进行空间分辨转录组 (ST) 分析。因此,这些 ST 测量代表可能包含多种细胞类型的细胞混合物。这种缺乏单细胞分辨率阻碍了细胞类型特定空间组织的表征。为了应对这一挑战,最近开发了 SPOTlight 和 RCTD 等监督解卷积方法来预测 ST 像素内细胞类型的比例。然而,这些有监督的反卷积方法依赖于合适的单细胞参考的可用性,如果由于预算、技术或生物学限制而没有这样的参考,则可能会出现限制。在这里,作者开发了 STdeconvolve作为一种用于解卷积多细胞像素分辨率 ST 数据的无监督、无参考的分析方法(下图)。

给定 ST 数据的计数矩阵,STdeconvolve 推断不同细胞类型的推定转录组特征及其在每个多细胞空间分辨 ST 像素内的比例表示。 简而言之,STdeconvolve 的第一个特征选择可能提供转录不同细胞类型信息的基因。 STdeconvolve 然后基于Latent Dirichlet Allocation (LDA) 来估计转录不同细胞类型 K 的数量,并在 ST 像素中对这些 K 细胞类型进行解卷积。 STdeconvolve 利用 ST 数据的几个独特功能,使 LDA 的此应用程序特别适合。
首先评估了 STdeconvolve 在使用模拟 ST 数据恢复细胞类型及其转录组学特征的比例表示方面的性能。 这是通过将小鼠内侧视前区 (MPOA) 的单细胞分辨率多重错误鲁棒荧光原位杂交 (MERFISH) 数据聚合到 100 μm2 像素中来完成的。

  • 注:Ground truth single-cell resolution MERFISH data of one section of the MPOA partitioned into 100 μm2 pixels (black dashed squares). Each dot is a single cell colored by its ground truth cell-type label.
应用 STdeconvolve,确定了 K=9 转录不同的细胞类型,并在每个模拟像素中对它们的转录组谱和比例表示进行了反卷积(下图)。

  • 注:B. Proportions of deconvolved cell-types from STdeconvolve, represented as pie charts for each simulated pixel。
为了推断去卷积细胞类型的身份以进行基准测试,将它们的去卷积转录谱与真实细胞类型的转录谱进行了匹配(下图)。

  • 注:(B)Pearson’s correlation between the transcriptional profiles of the 9 ground truth cell-types in the MERFISH MPOA data and the 9 deconvolved cell-types by STdeconvolve.(C). Distributions of Pearson’s correlations between the transcriptional profiles or pixel proportions of matched deconvolved and ground truth cell-types. Matched cell-types are indicated as highlighted boxes in B.
观察到每个去卷积细胞类型的转录组谱与跨基因匹配的真实细胞类型之间的强相关性,同样地,在模拟像素中每个去卷积细胞类型和匹配的真实细胞类型的比例之间存在强相关性。

  • 注:(C)The ranking of each gene based on its expression level in the transcriptional profiles of the deconvolved cell-types, compared to its gene rank in the transcriptional profile of the matched ground truth cell-type. (D). Heatmap of Pearson’s correlations between the proportions of the deconvolved cell-types and ground truth cell-types across simulated pixels. Ground truth cell-types are ordered by their frequencies in the ground truth dataset. Matched deconvolved and ground truth cell-types are boxed.
一些去卷积的细胞类型,如细胞类型 X2 和 X8 都与兴奋性神经元匹配,而细胞类型 X4 和 X7 都与抑制性神经元匹配。 根据先前的注释,将真实的兴奋性和抑制性细胞类型进一步划分为其他亚型(共 76 种),发现这些去卷积的细胞类型与神经元亚型的特定组合相关

当进一步将去卷积细胞类型的数量扩大到 K=76 时,能够识别在转录谱和像素比例与更精细的神经元亚型以及稀有细胞类型方面高度相关的去卷积细胞类型such as pericytes and microglia。此外,由于当前的 ST 技术允许以不同的分辨率进行空间转录组分析,因此进一步模拟了另一个 20 μm2 分辨率的 ST 数据集,并观察到解卷积的细胞类型转录组谱与 STdeconvolve 的基本事实的比例之间类似的强相关性(下图).

  • 注:Comparison of STdeconvolve cell-types to ground truth cell-types of a 20 µm2 197 pixel MERFISH MPOA ST dataset。
使用模拟的 MPOA 的 100 μm2 分辨率 ST 数据,还将 STdeconvolve 与现有的监督去卷积方法 SPOTlight 和 RCTD 进行了比较。 对于理想的单细胞转录组学参考,使用了原始单细胞 MERFISH 数据。 使用去卷积的细胞类型比例的均方根误差 (RMSE) 与模拟像素上的真实情况(方法)来评估每种方法的性能。 总的来说,发现 STdeconvolve 的性能与 SPOTlight和 RCTD 相当.此类现有监督解卷积方法的一个潜在限制是它们依赖于合适的单细胞参考。 因此,试图通过从 MERFISH 单细胞参考中去除神经元细胞类型来评估它们在没有合适的单细胞参考时的性能。 重新评估性能导致 SPOTlight和 RCTD 的 RMSE 增加.
接下来,通过分析小鼠主嗅球(MOB)的 100 μm2 分辨率 ST 数据来评估 STdeconvolve 的性能, 由于topographically organized的感官输入,MOB 由多个双边对称和转录不同的细胞层组成。 虽然之前对 MOB ST 数据的聚类分析揭示了粗细胞层的粗略空间组织,但无法轻易观察到更精细的结构,例如喙迁移流 (RMS) .We applied STdeconvolve to identify K=12 transcriptionally distinct cell-types that either overlapped with or further split coarse cell layers previously identified from clustering analysis。

  • 注:Deconvolved cell-type proportions for ST data of the MOB from STdeconvolve, represented as pie charts for each ST pixel. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to MOB coarse cell layers.
特别是,虽然去卷积的细胞类型 X7 与先前通过聚类分析确定的颗粒细胞层重叠,但它在空间上位于预期 RMS 的位置。

已知在其去卷积转录谱中上调的基因,包括 Nrep、Sox11 和 Dcx,与神经元分化相关或在 RMS 内的神经元前体细胞中上调,并根据 ISH 染色标记预期的空间组织。

这表明去卷积的细胞类型 X7 对应于未从聚类分析中识别的 RMS 内的神经元前体细胞类型。 同样,使用适当的 MOB scRNA-seq 参考将 STdeconvolve 与 SPOTlight和 RCTD 进行了比较,并发现所有评估方法之间具有高度的对应性。
通过从 MOB scRNA-seq 参考中去除嗅鞘细胞 (OEC),再次比较了在缺乏合适的单细胞参考时这种监督解卷积方法的性能。 通过所有评估方法,最初预测 OECs 在嗅觉神经层中富集

然而,鉴于这个没有 OEC 的新参考, SPOTlight和 RCTD 错误地预测 N2 细胞在嗅觉神经层中富集并且高度丰富,尽管最初所有方法都预测 N2 细胞是罕见的。 此外,在来自小鼠皮层的 scRNA-seq 参考上训练了 SPOTlight和 RCTD,导致皮层参考的血管软脑膜 (VLMC) 细胞簇被错误地预测为在嗅觉神经层中高度富集 . 因此,监督解卷积方法可能对使用的单细胞转录组学参考敏感。

最后,为了证明无监督、无参考方法的潜力,将 STdeconvolve 应用于 4 个乳腺癌切片的 100 μm2 分辨率 ST 数据。 在这里,匹配的 scRNA-seq 参考不可用,并且由于潜在的肿瘤间异质性,使用来自另一个乳腺癌样本的 scRNA-seq 参考可能是不合适的。 ST 像素的转录聚类先前确定了 3 个转录不同的cluster,对应于 3 组织的组织学区域:原位导管癌、浸润性癌和非恶性.

然而,肿瘤微环境是许多其他细胞类型的复杂环境。 应用 STdeconvolve 来识别额外的细胞类型并询问它们的空间组织,导致 K=15 识别的细胞类型.

  • 注:(A)Deconvolved cell-type pixel proportions for ST data of a breast cancer tissue section, represented as pie charts. Pixels are outlined with colors based on the pixel transcriptional cluster assignment corresponding to 3 pathological annotations. (B). Highlight of deconvolved cell-type X15. Pixel proportion of deconvolved cell-type X15 are indicated as black slices in pie charts. Pixels are outlined with colors as in A). An H&E-stained image of the breast cancer tissue is shown in the background.
其中,去卷积的细胞类型 X3、X13 和 X15 在空间上对应,并且去卷积的转录谱分别与非恶性、导管原位癌和浸润性癌注释一致。然而,去卷积细胞型X15在组织的癌性和非恶性区域的界面处空间富集。去卷积细胞型X15的高表达基因包括免疫基因如CD74和CXCL10,基因集富集分析表明显着富集 在免疫过程中表明去卷积的细胞类型 15 可能对应于免疫细胞。 这种沿肿瘤边界的免疫细胞空间组织先前已被认为在乳腺癌预后中发挥作用。 因此 STdeconvolve 可能有助于解卷积异质组织中转录不同的细胞类型,以发现潜在的临床相关空间组织。
总之,作者开发了 STdeconvolve 作为分析 ST 数据的工具,以在不依赖单细胞转录组学参考的情况下恢复细胞类型比例和转录谱。 分析表明,当合适的单细胞参考可用时,STdeconvolve 可以概括预期的生物学并提供与现有监督方法相比具有竞争力的性能,以及当合适的单细胞参考不可用时,具有潜在的优越性能。 一般来说,预计 STdeconvolve 将有助于研究复杂异质组织中转录不同细胞类型之间的空间关系。

Method(令人头疼的算法又要来了)

LDA Modeling(这个模型,是我们应该关注的重点)

Selection of genes for LDA model

Selection of LDA model with optimal number of cell-types

方法间的比较

最后,我们来看看示例代码

加载
  1. require(remotes)
  2. remotes::install_github('JEFworks-Lab/STdeconvolve')
  3. library(STdeconvolve)
  4. data(mOB)
  5. pos <- mOB$pos
  6. cd <- mOB$counts
  7. annot <- mOB$annot
STdeconvolve 第一个特征通过寻找跨 ST 像素高度过度分散的基因,选择最有可能与区分细胞类型相关的基因。 基因太少或读取太少的基因的像素也可以被去除。
  1. ## remove pixels with too few genes
  2. counts <- cleanCounts(counts = cd,
  3. .size = 100,
  4. = 1,
  5. = 1)

  1. ## feature select for genes
  2. corpus <- restrictCorpus(counts,
  3. removeAbove=1.0,
  4. removeBelow = 0.05,
  5. alpha = 0.05,
  6. plot = TRUE,
  7. verbose = TRUE)

STdeconvolve 然后应用 Latent Dirichlet allocation (LDA)(一种常用于自然语言处理的生成统计模型)来发现 K 种潜在细胞类型。 STdeconvolve 适合一系列 LDA 模型,以告知最佳K 的选择。
  1. ## Note: the input corpus needs to be an integer count matrix of pixels x genes
  2. ldas <- fitLDA(t(as.matrix(corpus)), Ks = seq(2, 9, by = 1), plot=TRUE, verbose=TRUE)

In this example, we will use the model with the lowest model perplexity.
The shaded region indicates where a fitted model for a given K had an alpha > 1. alpha is an LDA parameter that is solved for during model fitting and corresponds to the shape parameter of a symmetric Dirichlet distribution. In the model, this Dirichlet distribution describes the cell-type proportions in the pixels. A symmetric Dirichlet with alpha > 1 would lead to more uniform cell-type distributions in the pixels and difficulty identifying distinct cell-types. Instead, we want models with alphas < 1, resulting in sparse distributions where only a few cell-types are represented in a given pixel.
The resulting theta matrix can be interpreted as the proportion of each deconvolved cell-type across each spatially resolved pixel. The resulting beta matrix can be interpreted as the putative gene expression profile for each deconvolved cell-type normalized to a library size of 1. This beta matrix can be scaled by a depth factor (ex. 1000) for interpretability.
  1. ## select model with minimum perplexity
  2. optLDA <- optimalModel(models = ldas, opt = "min")
  3. ## extract pixel cell-type proportions (theta) and cell-type gene expression profiles (beta) for the given dataset
  4. results <- getBetaTheta(optLDA, corpus = t(as.matrix(corpus)))
  5. deconProp <- results$theta
  6. deconGexp <- results$beta*1000
  7. vizAllTopics(deconProp, pos,
  8. groups = annot,
  9. group_cols = rainbow(length(levels(annot))),
  10. r=0.4)

还可以可视化每种去卷积细胞类型的*标记基因。 我们将在这里使用解卷积的细胞类型 5 和 1 作为示例。 我们将这里的*标记基因定义为在去卷积细胞类型(计数 > 5)中高度表达的基因,当将去卷积细胞类型的表达谱与所有其他细胞类型的平均值进行比较时,这些基因也具有前 4 个最高 log2(倍数变化) 去卷积细胞类型的表达谱。
  1. celltype <- 5
  2. ## highly expressed in cell-type of interest
  3. highgexp <- names(which(deconGexp[celltype,] > 5))
  4. ## high log2(fold-change) compared to other deconvolved cell-types
  5. log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
  6. markers <- names(log2fc)[1:4]
  7. ## visualize
  8. df <- merge(as.data.frame(pos),
  9. as.data.frame(t(as.matrix(counts[markers,]))),
  10. by = 0)
  11. ps <- lapply(markers, function(marker) {
  12. vizGeneCounts(df = df,
  13. gene = marker,
  14. size = 2, stroke = 0.1,
  15. plotTitle = marker,
  16. winsorize = 0.05,
  17. showLegend = FALSE)
  18. })
  19. gridExtra::(
  20. grobs = ps,
  21. layout_matrix = rbind(c(1, 2),
  22. c(3, 4))
  23. )

  1. celltype <- 1
  2. ## highly expressed in cell-type of interest
  3. highgexp <- names(which(deconGexp[celltype,] > 5))
  4. ## high log2(fold-change) compared to other deconvolved cell-types
  5. log2fc <- sort(log2(deconGexp[celltype,highgexp]/colMeans(deconGexp[-celltype,highgexp])), decreasing=TRUE)
  6. markers <- names(log2fc)[1:4]
  7. ## visualize
  8. df <- merge(as.data.frame(pos),
  9. as.data.frame(t(as.matrix(counts[markers,]))),
  10. by = 0)
  11. ps <- lapply(markers, function(marker) {
  12. vizGeneCounts(df = df,
  13. gene = marker,
  14. size = 2, stroke = 0.1,
  15. plotTitle = marker,
  16. winsorize = 0.05,
  17. showLegend = FALSE)
  18. })
  19. gridExtra::(
  20. grobs = ps,
  21. layout_matrix = rbind(c(1, 2),
  22. c(3, 4))
  23. )

当然了,软件还有其他的一些功能

In this tutorial, we will describe additional functionalities and advanced features of STdeconvolve.

Preprocessing

For LDA model fitting, STdeconvolve requires a “corpus of documents”, which is represented as a pixel (rows) x feature (columns) matrix of non-negative integer gene counts. To effectively deconvolve latent cell-types, the features in the corpus should be limited to genes that are variable across cell-types. Without prior knowledge of cell-types and their marker genes, one could look for over dispersed genes across the pixels, assuming that the underlying cell-types are present at variable proportions.
Additionally, it is useful to reduce the number of features in the corpus to those which are the most informative, to not only improve deconvolution but also increase speed of model fitting. To this end, one may wish to remove genes that are present in most or all pixels, or occur in only a small fraction of the pixels. This is because LDA identifies latent topics, or cell-types, by looking for sets of genes that occur frequently together in pixels. Therefore, removing genes that are present in most or all pixels will help restrict to gene that are cell-type specific, assuming that cell-types are not present across all pixels. Conversely, genes that are present in only a few pixels could represent noise and may not actually represent robust cell-type specific signatures.
As previously mentioned, a set of marker genes may be known a priori and a user may want to include these in the final corpus.
We include 2 different functions with STdeconvolve.
The first is restrictCorpus(), which is highlighted in the getting_started tutorial(就是上面的内容). This function first feature selects for over dispersed genes in the corpus and then allows a user to restrict the over dispersed genes to those present in less than or more than specified fractions of pixels in the dataset. This function does not filter for poor pixels or genes itself.
The second is preprocess(), which is a larger wrapper function and allows for a much greater and specified range of filtered and feature
  1. library(STdeconvolve)
  2. data(mOB)
  3. pos <- mOB$pos
  4. cd <- mOB$counts
  5. annot <- mOB$annot
In general, preprocess() includes a step to remove poor pixels and genes, allows one to select for specific genes to include or remove, allows an option to select for over dispersed genes, and options to remove top expressed genes, or genes present in less than or more than specified fractions of pixels in the dataset. Further, it returns an organized list that contains the filtered corpus, and the positions of the pixels retained in the filtered corpus if the information is present in the pixel names originally (for example, if the name of a pixel is in the format “XxY”). Otherwise this can be appended after.
Lastly, preprocess() can take as input a pixel (row) x gene (columns) matrix or a path to the file.
Order of filtering options in which they occur: 1. Selection to use specific genes only 2. cleanCounts to remove poor pixels and genes 3. Remove top expressed genes in matrix 4. Remove specific genes based on grepl pattern matching 5. Remove genes that appear in more/less than a percentage of pixels 6. Use the over dispersed genes computed from the remaining genes after filtering steps 1-5 (if selected) 7. Choice to use the top over dispersed genes based on -log10()
  1. mobCorpus1 <- preprocess(t(cd),
  2. alignFile = NA, # if there is a file to adjust pixel coordinates this can be included.
  3. extractPos = FALSE, # optional argument
  4. = NA, #
  5. nTopGenes = 3, # remove the top 3 expressed genes (genes with most counts) in dataset
  6. genes.to.remove = c("^Trmt"), # ex: remove tRNA methyltransferase genes (gene names that begin with "Trmt")
  7. removeAbove = 0.95, # remove genes present in 95% or more of pixels
  8. removeBelow = 0.05, # remove genes present in 5% or less of pixels
  9. = 10, # minimum number of reads a gene must have across pixels
  10. .size = 100, # minimum number of reads a pixel must have to keep (before gene filtering)
  11. = 1, # minimum number of pixels a gene needs to have been detected in
  12. ODgenes = TRUE, # feature select for over dispersed genes
  13. nTopOD = 100, # number of top over dispersed genes to use, otherwise use all that pass filters if `NA`
  14. = 0.05, # alpha param for over dispersed genes
  15. = 5, # gam param for over dispersed genes
  16. verbose = TRUE)

  1. mobCorpus1$pos <- pos[rownames(mobCorpus1$corpus), ] # because positions were not available in the counts matrix itself, append after.
  2. mobCorpus1$slm
  3. ## A 260x100 simple triplet matrix.
  4. print(mobCorpus1$corpus[1:10,1:10])
  5. ## Bpifb9a Bpifb9b Col1a1 Dcn Cyp2a5 Sox11 Omp Ogn Prokr2 Ptn
  6. ## ACAACTATGGGTTGGCGG 0 0 1 0 0 1 0 0 0 2
  7. ## ACACAGATCCTGTTCTGA 1 1 0 0 0 0 6 0 0 22
  8. ## ACATCACCTGCGCGCTCT 0 0 1 2 0 6 0 0 0 0
  9. ## ACATTTAAGGCGCATGAT 0 0 0 1 0 0 1 0 1 1
  10. ## ACCACTGTAATCTCCCAT 0 0 0 1 1 0 1 0 0 1
  11. ## ACCAGAGCCGTTGAGCAA 0 0 0 0 0 1 3 0 0 2
  12. ## ACCCGGCGTAACTAGATA 0 1 0 1 0 3 0 0 1 2
  13. ## ACCGGAGTAAATTAGCGG 0 0 0 0 0 2 2 0 0 2
  14. ## ACCTGACAGCGGAAACTT 0 0 1 1 0 0 8 0 0 7
  15. ## ACGGAAATCAGTGGTATT 0 1 0 1 0 4 3 0 1 0
  16. print(mobCorpus1$pos[1:10,])
  17. ## x y
  18. ## ACAACTATGGGTTGGCGG 16.001 16.036
  19. ## ACACAGATCCTGTTCTGA 26.073 15.042
  20. ## ACATCACCTGCGCGCTCT 13.048 21.079
  21. ## ACATTTAAGGCGCATGAT 13.963 18.117
  22. ## ACCACTGTAATCTCCCAT 24.104 13.074
  23. ## ACCAGAGCCGTTGAGCAA 9.040 12.046
  24. ## ACCCGGCGTAACTAGATA 15.941 12.112
  25. ## ACCGGAGTAAATTAGCGG 7.949 16.058
  26. ## ACCTGACAGCGGAAACTT 9.039 13.047
  27. ## ACGGAAATCAGTGGTATT 20.959 15.073
preprocess can also be used to build a corpus using a specific list of genes:
  1. ## get list of genes of interest, for an example.
  2. counts <- cleanCounts(counts = cd,
  3. .size = 100,
  4. = 1,
  5. = 1)

  1. odGenes <- getOverdispersedGenes(as.matrix(counts),
  2. =5,
  3. alpha=0.05,
  4. plot=FALSE,
  5. use.=FALSE,
  6. =TRUE,
  7. =1e3,
  8. =1e-3,
  9. verbose=FALSE, details=TRUE)
  10. genes <- odGenes$ods
  11. length(genes)
  12. ## build corpus using just the selected genes
  13. mobCorpus2 <- preprocess(t(cd),
  14. = genes,
  15. # can then proceed to filter this list, if desired
  16. # = 1,
  17. .size = 1, # can still filter pixels
  18. = 1, # can still filter to make sure the selected genes are present in at least 1 pixel
  19. ODgenes = FALSE, # don't select the over dispersed genes
  20. verbose = TRUE)

Selecting Optimal K

One limitation of LDA is that one must select the number of predicted cell-types (K) to be returned, a priori. Thus, one must either have knowledge of the number of cell-types present in the dataset of interest, or a way to select the model with the “optimal K”, or the model that best describes the dataset and captures the latent cell-types.
To do this, STdeconvolve fits a number of different LDA models with different K’s to the dataset and computes several different metrics to help guide users in the choice of K.
First, the perplexity of each fitted model is computed with respect to it’s K. This can be done using the entire real dataset the model was fitted to, or, users can chose a certain fraction of randomly selected pixels to be held out and used as a test set to compute model perplexity. The optimal K can either be the model with the K that returns the lowest perplexity (“min”), or we compute a “knee” metric (similar to choosing the number of principle components in PCA), which is the maximum second derivative, a reasonable choice for the inflection point (the elbow or knee).
Second, as K increases, the additional cell-types that are predicted tend to be represented present at small proportions in the pixels and thus contribute little to the predicted pixel cell-type profiles. To help put an upper limit on K, we also measure the number of predicted cell-types with mean pixel proportion less than 5%. After a certain K, the number of “rare” predicted cell-types, or those with low proportions across the pixels, steadily increases, suggesting that increasing K is no longer returning informative topics, or cell-types.
  1. ## fit LDA models to the corpus
  2. ks <- seq(from = 2, to = 18, by = 1) # range of K's to fit LDA models with given the input corpus
  3. ldas <- fitLDA((mobCorpus2$corpus),
  4. Ks = ks,
  5. ncores = parallel::detectCores(logical = TRUE) - 1, # number of cores to fit LDA models in parallel
  6. testSize = 0.2, # fraction of pixels to set aside for test corpus when computing perplexity
  7. plot=TRUE, verbose=FALSE)

While technically the lowest perplexity computed is when K=8, perplexity appears to be relatively stable between K=7 and K=18. Additionally, we expect there to be more than 4 cell-types and thus K should be greater than 4 (based on “knee” metric).
However, the number of cell-types with mean proportion <5% doesn’t start steadily increasing until around K=15, suggesting that the number of predicted cell-types are likely informative until this chosen K.
Once the LDA models are fitted, beta and theta matrices can be extracted for a given model. The simplest way to do this is with optimalModel() to get the specific model of interest:
  1. ## `optimalModel()` can take as arguments:
  2. optimalModel(models = ldas, opt = "min") # "min" = K that resulted in minimum perplexity
  3. ## A LDA_VEM topic model with 8 topics.
  4. optimalModel(models = ldas, opt = "kneed") # "kneed" = K that resulted in inflection perplexity
  5. ## A LDA_VEM topic model with 4 topics.
  6. optimalModel(models = ldas, opt = 15) # or extract the model for any K that was used
  7. ## A LDA_VEM topic model with 15 topics.
Then, getBetaTheta() can be used to get the beta (cell-type gene expression profiles) and theta (pixel cell-type proportions) matrices.
  1. results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
  2. corpus = mobCorpus2$corpus)
  3. print(names(results))
  4. ## [1] "beta" "theta"
或者, buildLDAobject() 是 getBetaTheta()、clusterTopics() 和 combineTopics() 的包装器。 具有相似基因表达谱的细胞类型被聚类,并提供了一组替代的“细胞类型cluster”,其中细胞类型cluster是给定cluster内细胞类型的聚合β和θ矩阵。
  1. results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
  2. corpus = mobCorpus2$corpus,
  3. deepSplit = 4,
  4. colorScheme = "rainbow")

此处,结果是一个列表,其中包含单个预测细胞类型的 beta 和 theta 矩阵、细胞类型cluster的组合 beta 和 theta 矩阵。 “cluster”是指示每个预测细胞类型的指定cluster的因素,“树状”是聚集的细胞类型相对于它们预测的基因表达谱的树状图。 “cols”和“clustCols”是指示细胞类型或cluster的颜色标签的因素,可选择性地用于可视化目的。

Clustering Cell-types

如上所述,预测的细胞类型可以基于它们预测的基因表达谱(β矩阵)被聚类为“细胞类型cluster”。 一组预测的细胞类型可能代表组织中较大细胞层的组成部分,因此将细胞类型聚集在一起以将该层可视化和评估为单个特征可能很有用。
如前面所示,最简单的方法是使用 buildLDAobject(),它不仅返回单个细胞类型的 beta 和 theta 矩阵,还返回细胞类型cluster及其关联的 beta 和 theta 矩阵。
然而,有了细胞类型的 beta 矩阵,可以通过直接调用 clusterTopics() 来对细胞类型进行聚类,这允许指定执行的聚类类型。 使用 R 包 dynamicTreeCut 使用动态树拆分完成将细胞类型分配给特定cluster。
  1. results <- getBetaTheta(lda = optimalModel(models = ldas, opt = "15"),
  2. corpus = mobCorpus2$corpus)
  3. clust <- clusterTopics(beta = results$beta,
  4. clustering = "", # type of clustering
  5. dynamic = "hybrid", # method to assign cell-types to clusters (see `dynamicTreeCut` options)
  6. deepSplit = 4, # dynamic tree cutting sensitivity parameter
  7. plot = TRUE)

To get the beta and theta matrices of the cell-type clusters, combineTopics() is used to aggregate the beta or theta matrices of the cell-types within assigned clusters:
  1. betaCombn <- combineTopics(results$beta, clusters = clust$clusters, type = "b")
  2. ## [1] "cell-types combined."
  3. thetaCombn <- combineTopics(results$theta, clusters = clust$clusters, type = "t")
  4. ## [1] "cell-types combined."
  5. clust$clusters
  6. ## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  7. ## 2 1 5 1 3 4 3 5 2 3 4 2 1 1 1
  8. ## Levels: 1 2 3 4 5
  9. dim(betaCombn)
  10. ## [1] 5 345
  11. dim(thetaCombn)
  12. ## [1] 260 5

可视化

  1. results <- buildLDAobject(LDAmodel = optimalModel(models = ldas, opt = "15"),
  2. corpus = mobCorpus2$corpus,
  3. deepSplit = 4,
  4. colorScheme = "rainbow",
  5. plot = FALSE)
  6. m <- results$theta
  7. p <- pos
  8. plt <- vizAllTopics(theta = m,
  9. pos = p,
  10. topicOrder=seq(ncol(m)),
  11. topicCols=rainbow(ncol(m)),
  12. groups = NA,
  13. group_cols = NA,
  14. r = 0.4, # size of scatterpies; adjust depending on the coordinates of the pixels
  15. lwd = 0.1,
  16. showLegend = TRUE,
  17. plotTitle = "K=15")
  18. plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
  19. plt

Scatterpie borders can be colored, either custom, or based on group membership
  1. m <- results$theta
  2. p <- pos
  3. plt <- vizAllTopics(theta = m,
  4. pos = p,
  5. topicOrder=seq(ncol(m)),
  6. topicCols=rainbow(ncol(m)),
  7. groups = rep("0", dim(m)[1]),
  8. group_cols = c("0" = "black"),
  9. r = 0.4,
  10. lwd = 0.4, # adjust thickness of the scatterpie borders
  11. showLegend = TRUE,
  12. plotTitle = "K=15")
  13. plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
  14. plt

Based on group membership (let’s use the coarse cell layers of the MOB)
  1. m <- results$theta
  2. p <- pos
  3. plt <- vizAllTopics(theta = m,
  4. pos = p,
  5. topicOrder=seq(ncol(m)),
  6. topicCols=rainbow(ncol(m)),
  7. groups = annot,
  8. group_cols = rainbow(length(levels(annot))),
  9. r = 0.4,
  10. lwd = 0.4, # adjust thickness of the scatterpie borders
  11. showLegend = TRUE,
  12. plotTitle = "K=15")
  13. plt <- plt + ggplot2::guides(fill=ggplot2::guide_legend(ncol=2))
  14. plt

Now let’s visualize the cell-type clusters:
  1. m <- results$thetaCombn
  2. p <- pos
  3. plt <- vizAllTopics(theta = m,
  4. pos = p,
  5. topicOrder=seq(ncol(m)),
  6. topicCols=rainbow(ncol(m)),
  7. groups = rep("0", dim(m)[1]),
  8. group_cols = c("0" = "black"),
  9. r = 0.4,
  10. lwd = 0.2,
  11. showLegend = TRUE,
  12. plotTitle = "K=15 cell-type clusters")
  13. # plt <- plt + ggplot2::guides(fill=guide_legend(ncol=2))
  14. plt

Individual Cell-types and Cell-type clusters
  1. m <- results$theta
  2. p <- pos[rownames(results$theta),]
  3. vizTopicClusters(theta = m,
  4. pos = p,
  5. clusters = results$cols,
  6. sharedCol = TRUE, # cell-types in a cluster share the same color
  7. groups = rep("0", dim(m)[1]),
  8. group_cols = c("0" = "black"),
  9. r = 0.4,
  10. lwd = 0.3,
  11. showLegend = TRUE)

  1. m <- results$theta
  2. p <- pos[rownames(results$theta),]
  3. vizTopicClusters(theta = m,
  4. pos = p,
  5. clusters = results$cols,
  6. sharedCol = FALSE, # cell-types in a cluster on a color gradient
  7. groups = rep("0", dim(m)[1]),
  8. group_cols = c("0" = "black"),
  9. r = 0.4,
  10. lwd = 0.3,
  11. showLegend = TRUE)

  1. m <- results$theta[,c("14", "12")]
  2. p <- pos
  3. other <- 1 - rowSums(m)
  4. m <- cbind(m, other)
  5. colnames(m) <- c("14", "12", "Other")
  6. vizAllTopics(theta = m,
  7. pos = p,
  8. topicOrder=seq(ncol(m)),
  9. topicCols=c(transparentCol("red", percent = 50), # BONUS: can make colors transparent if overlaying on top of an image
  10. "black",
  11. "white"),
  12. groups = rep("0", dim(m)[1]),
  13. group_cols = c("0" = "white"), # make scatterpie borders white to focus directly on the cell-type.
  14. r = 0.4,
  15. lwd = 0.1,
  16. showLegend = TRUE,
  17. overlay = NA) # BONUS: plot the scatterpies on top of a raster image of the H&E tissue, if set equal to the rgb matrix

非常好的方法,很值得大家借鉴

生活很好,有你更好