单细胞层次给基因集打分除了和GSVA等分析一样可以做富集以外,还可以通过cell marker gene set做亚群注释(需要注意细胞比例的问题,数量少的群体不行,可作辅助)。前面写过的scMetabolism包是给代谢通路打分,思路一样。
AUCell 是ranking-based,所以具体哪种标准化方法不重要,只要不影响每个细胞内基因表达量的排序就好。总的来讲,分为三步:
1,build the rankings
2,calculate the AUC
3,set the assignment threshold
安装R包
-
if (!requireNamespace("BiocManager", quietly=TRUE))
-
("BiocManager")
-
-
Biocmanager::install("AUCell")
-
-
# To support paralell execution:
-
BiocManager::install(c("doMC", "doRNG","doSNOW"))
-
# For the main example:
-
BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))
-
# For the examples in the follow-up section of the tutorial:
-
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
-
"dynamicTreeCut","R2HTML","Rtsne", "zoo"))
- load the scRNA expression matrix and gene set
-
# . Reading from a text file
-
#exprMatrix <- ("")
-
#exprMatrix <- (exprMatrix)
-
-
# or using an expression set
-
#exprMatrix <- exprs(myExpressionSet)
-
-
-
-
-
# (This may take a few minutes)
-
library(GEOquery)
-
attemptsLeft <- 20
-
while(attemptsLeft>0)
-
{
-
geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity)
-
if(methods::is(geoFile,"error"))
-
{
-
attemptsLeft <- attemptsLeft-1
-
(5)
-
}
-
else
-
attemptsLeft <- 0
-
}
-
-
gzFile <- grep(".", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
-
message(gzFile)
-
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
-
message(txtFile)
-
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)
-
-
library()
-
geoData <- fread(txtFile, sep="\t")
-
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
-
exprMatrix <- (geoData[,-1, with=FALSE])
-
rm(geoData)
-
dim(exprMatrix)
-
rownames(exprMatrix) <- geneNames
-
exprMatrix[1:5,1:4]
-
-
# Remove file
-
(txtFile)
-
-
# Save for future use
-
mouseBrainExprMatrix <- exprMatrix
-
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")
pick random 5000 genes to release computational burden
-
# load("exprMatrix_AUCellVignette_MouseBrain.RData")
-
(333)
-
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]
-
-
dim(exprMatrix)
the next thing we need to prepare is the gene set. use ?AUCell_calcAUC to check accepted forms.
-
library(GSEABase)
-
-
#gene list version
-
genes <- c("gene1", "gene2", "gene3")
-
geneSets <- GeneSet(genes, setName="geneSet1")
-
geneSets
-
-
#gmt version
-
library(AUCell)
-
library(GSEABase)
-
gmtFile <- paste((('examples', package='AUCell')), "", sep="/")
-
geneSets <- getGmt(gmtFile)
-
-
#check how many of these genes are in the expression matrix
-
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
-
cbind(nGenes(geneSets))
-
-
#add the gene set size to its name
-
geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))
-
-
#let’s also add a few sets of random genes and 100 genes expressed in many cells (. housekeeping-like):
-
-
# Random
-
(321)
-
extraGeneSets <- c(
-
GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
-
GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))
-
-
countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
-
# Housekeeping-like
-
extraGeneSets <- c(extraGeneSets,
-
GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))
-
-
geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
-
names(geneSets)
-
-
- build gene expression rankings for each cell
-
cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
-
-
cells_rankings
-
#The “rankings” can be seen as a new representation of the original dataset. Once #they are calculated, they can be saved for future analyses.
-
-
save(cells_rankings, file="cells_rankings.RData")
- calculate enrichment for the signature (AUC)
To determine whether the gene set is enriched at the top of the gene-ranking for each cell, AUCell uses the “Area Under the Curve” (AUC) of the recovery curve.
-
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
-
save(cells_AUC, file="cells_AUC.RData")
col is the cell id, row is gene set, AUC in each cell.
- determine the cells with given gene signatures or active gene sets
The AUC estimates the proportion of genes in the gene-set that are highly expressed in each cell.
The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a clearly higher value
注意,如果gene set是细胞marker,就必须考虑到细胞亚群比例的问题。如果亚群比例很高,那么可能双峰中的低峰部分被掩盖,表现出来更像house keeping genes(单峰甚至常量表达);同理,如果比例很小,高峰也可能微不足道,看不出来。
另外,基因集基因数不能过小,不然有可能出现gene set里的基因一个都match不到matrix前5%的情况,此时AUC=0。所以,gene set长度在100-2k比较合适。
To ease the exploration of the distributions, the function AUCell_exploreThresholds()
automatically plots all the histograms and calculates several thresholds that could be used to consider a gene-set ‘active’ (returned in $aucThr
). The distributions are plotted as dotted lines over the histogram and the corresponding thresholds as vertical bars in the matching color. The thicker vertical line indicates the threshold selected by default ($aucThr$selected
): the highest value to reduce the false positives.
-
(123)
-
par(mfrow=c(3,3))
-
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)
-
#看一下储存在结果里的提醒
-
warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)
-
warningMsg[which(warningMsg!="")]
-
-
-
-
## Microglia_lavin (165g)
-
## "The AUC might follow a normal distribution (random gene-set?). "
-
## Random (500g)
-
## "The AUC might follow a normal distribution (random gene-set?). The global distribution overlaps the partial distributions. "
The thresholds calcuated for each gene set are stored in the $aucThr
slot. For example, the thresholds suggested for the oligodendrocyte gene-set:
cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
-
## threshold nCells
-
## Global_k1 0.2419241 739
-
## L_k2 0.2437017 733
-
## R_k3 0.1766730 1008
-
## minimumDens 0.2417192 740
To obtain the threshold selected automatically for a given gene set(. Oligodendrocytes):
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
Cells assigned at this threshold (AUC over the threshold)
-
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
-
length(oligodencrocytesAssigned)
-
head(oligodencrocytesAssigned)
Plotting the AUC histogram of a specific gene set, and setting a new threshold:
-
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
-
AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)
-
abline(v=0.25)
Assigning cells to this new threshold:
-
newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))
-
length(newSelectedCells)
head(newSelectedCells)
现在,有了gene set在每个细胞中的得分,探究常见的可视化思路,参见官方文档。
AUCell: Identifying cells with active gene sets
(1)热图(on-off二分类或者连续值)
(2)feature plot