【AUCell】单细胞基因集打分

单细胞层次给基因集打分除了和GSVA等分析一样可以做富集以外，还可以通过cell marker gene set做亚群注释（需要注意细胞比例的问题，数量少的群体不行，可作辅助）。前面写过的scMetabolism包是给代谢通路打分，思路一样。

AUCell 是ranking-based，所以具体哪种标准化方法不重要，只要不影响每个细胞内基因表达量的排序就好。总的来讲，分为三步：

1，build the rankings

2，calculate the AUC

3，set the assignment threshold

安装R包





if (!requireNamespace("BiocManager", quietly=TRUE))




    ("BiocManager")




 



Biocmanager::install("AUCell")




 



# To support paralell execution:



BiocManager::install(c("doMC", "doRNG","doSNOW"))




# For the main example:



BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))




# For the examples in the follow-up section of the tutorial:



BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",




                       "dynamicTreeCut","R2HTML","Rtsne", "zoo"))

load the scRNA expression matrix and gene set




# . Reading from a text file



#exprMatrix <- ("")



#exprMatrix <- (exprMatrix)



 



# or using an expression set



#exprMatrix <- exprs(myExpressionSet)



 



 



 



 



# (This may take a few minutes)



library(GEOquery)




attemptsLeft <- 20




while(attemptsLeft>0)




{



  geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity) 



  if(methods::is(geoFile,"error")) 



  {




    attemptsLeft <- attemptsLeft-1




    (5)




  }




  else




    attemptsLeft <- 0



}



 



gzFile <- grep(".", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)




message(gzFile)




txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)




message(txtFile)




gunzip(filename=gzFile, destname=txtFile, remove=TRUE)




 



library()




geoData <- fread(txtFile, sep="\t")




geneNames <- unname(unlist(geoData[,1, with=FALSE]))




exprMatrix <- (geoData[,-1, with=FALSE])




rm(geoData)





dim(exprMatrix)




rownames(exprMatrix) <- geneNames



exprMatrix[1:5,1:4]




 



# Remove file



(txtFile)




 



# Save for future use



mouseBrainExprMatrix <- exprMatrix



save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")

pick random 5000 genes to release computational burden




# load("exprMatrix_AUCellVignette_MouseBrain.RData")



(333)




exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]




 




dim(exprMatrix)

the next thing we need to prepare is the gene set. use ?AUCell_calcAUC to check accepted forms.




library(GSEABase)




 



#gene list version



genes <- c("gene1", "gene2", "gene3")




geneSets <- GeneSet(genes, setName="geneSet1")




geneSets



 



#gmt version



library(AUCell)




library(GSEABase)




gmtFile <- paste((('examples', package='AUCell')), "", sep="/")




geneSets <- getGmt(gmtFile)




 



#check how many of these genes are in the expression matrix



geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix)) 



cbind(nGenes(geneSets))




 



#add the gene set size to its name



geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))




 



#let’s also add a few sets of random genes and 100 genes expressed in many cells (. housekeeping-like):



 



# Random



(321)




extraGeneSets <- c(




  GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),




  GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))




 



countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))




# Housekeeping-like



extraGeneSets <- c(extraGeneSets,




                   GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))




 



geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))





names(geneSets)

build gene expression rankings for each cell




cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)




 



cells_rankings




#The “rankings” can be seen as a new representation of the original dataset. Once #they are calculated, they can be saved for future analyses.



 



save(cells_rankings, file="cells_rankings.RData")

calculate enrichment for the signature (AUC)

To determine whether the gene set is enriched at the top of the gene-ranking for each cell, AUCell uses the “Area Under the Curve” (AUC) of the recovery curve.




cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)




save(cells_AUC, file="cells_AUC.RData")

col is the cell id, row is gene set, AUC in each cell.

determine the cells with given gene signatures or active gene sets

The AUC estimates the proportion of genes in the gene-set that are highly expressed in each cell.

The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a clearly higher value

注意，如果gene set是细胞marker,就必须考虑到细胞亚群比例的问题。如果亚群比例很高，那么可能双峰中的低峰部分被掩盖，表现出来更像house keeping genes(单峰甚至常量表达)；同理，如果比例很小，高峰也可能微不足道，看不出来。

另外，基因集基因数不能过小，不然有可能出现gene set里的基因一个都match不到matrix前5%的情况，此时AUC=0。所以，gene set长度在100-2k比较合适。

To ease the exploration of the distributions, the function AUCell_exploreThresholds() automatically plots all the histograms and calculates several thresholds that could be used to consider a gene-set ‘active’ (returned in $aucThr). The distributions are plotted as dotted lines over the histogram and the corresponding thresholds as vertical bars in the matching color. The thicker vertical line indicates the threshold selected by default ($aucThr$selected): the highest value to reduce the false positives.




(123)




par(mfrow=c(3,3)) 



cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)




#看一下储存在结果里的提醒



warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)




warningMsg[which(warningMsg!="")]




 



 



 



##                                                                                                         Microglia_lavin (165g) 



##                                                              "The AUC might follow a normal distribution (random gene-set?). " 



##                                                                                                                  Random (500g) 



## "The AUC might follow a normal distribution (random gene-set?).  The global distribution overlaps the partial distributions. "

The thresholds calcuated for each gene set are stored in the $aucThr slot. For example, the thresholds suggested for the oligodendrocyte gene-set:

cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds




##             threshold nCells



## Global_k1   0.2419241    739




## L_k2        0.2437017    733




## R_k3        0.1766730   1008




## minimumDens 0.2417192    740

To obtain the threshold selected automatically for a given gene set(. Oligodendrocytes):

cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected

Cells assigned at this threshold （AUC over the threshold）




oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment




length(oligodencrocytesAssigned)




head(oligodencrocytesAssigned)

Plotting the AUC histogram of a specific gene set, and setting a new threshold:




geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]





AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)




abline(v=0.25)

Assigning cells to this new threshold:




newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))




length(newSelectedCells)

head(newSelectedCells)

现在，有了gene set在每个细胞中的得分，探究常见的可视化思路，参见官方文档。

AUCell: Identifying cells with active gene sets

（1）热图(on-off二分类或者连续值)

（2）feature plot

秒客网

【AUCell】单细胞基因集打分

相关文章