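[Clustering] The ConsensusClusterPlus package

ConsensusClusterPlus is an R package (distributed through Bioconductor) that implements consensus clustering.

The workflow has three main steps: 1, prepare the input data; 2, run the consensus clustering; 3, compute cluster consensus and item consensus.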

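1-Input data

The input has no special requirements: a normalized expression matrix with samples in columns and genes in rows.

Note that the package tutorial keeps only the 5,000 most variable genes, ranked by median absolute deviation (MAD), so that the clusters separate more cleanly (much like highly variable gene selection in single-cell analysis). Both the number of genes and the selection method are up to you, because this step uses ordinary R statistics functions rather than any command built into the package.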




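library(ALL)
data(ALL)
# expression matrix: genes in rows, samples in columns
d=exprs(ALL)
d[1:5,1:5]
# MAD of each gene, then keep the 5000 most variable genes
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:5000],]
# median-center each gene
d = sweep(d,1, apply(d,1,median,na.rm=T))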
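The filtering step can also be tried without the ALL dataset. The following is a minimal sketch on simulated data; the matrix `expr`, the cutoff `n_top`, and all other names are illustrative assumptions, not part of the package.

```r
# Simulated stand-in for a normalized expression matrix (genes x samples).
set.seed(1)
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))

n_top <- 50                                  # assumed cutoff; tune for real data
mads <- apply(expr, 1, mad)                  # per-gene median absolute deviation
top <- names(sort(mads, decreasing = TRUE))[1:n_top]
expr_top <- expr[top, ]

# Median-center each gene, as in the tutorial code above.
expr_centered <- sweep(expr_top, 1, apply(expr_top, 1, median, na.rm = TRUE))
dim(expr_centered)
```

Replacing `mad` with `var` or `sd` in the `apply` call switches the ranking criterion; the rest of the sketch stays the same.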

2-Clustering

Several important parameters:

pItem: proportion of items (columns) drawn in each resampling

pFeature: proportion of features (rows) drawn in each resampling

maxK: maximum number of clusters to evaluate

reps: number of resamplings

clusterAlg: clustering algorithm; "hc" is agglomerative hierarchical clustering

distance: distance measure; "pearson" is 1 - Pearson correlation

Note: in practice maxK and reps can be set higher, for example 20 and 1000.




library(ConsensusClusterPlus)
# output directory for the plots
title=tempdir()
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
    title=title,clusterAlg="hc",distance="pearson",seed=1262118388.71279,plot="png")

The result is a list whose k-th element holds the results for k clusters; for example, results[[2]] is the result for k=2.




### Inspect the key results
#consensusMatrix - the consensus matrix.
#For example, the top five rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:5,1:5]

#consensusTree - hclust object
results[[2]][["consensusTree"]]

#consensusClass - the sample classifications
results[[2]][["consensusClass"]][1:5]

#ml - consensus matrix result
#clrs - colors for cluster
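To make the parameters and the consensus matrix more concrete, here is a minimal base R sketch of the resampling idea for a single k. It is an illustration under stated assumptions (simulated data, average linkage, hypothetical variable names such as `together` and `sampled`), not a reimplementation of the package.

```r
set.seed(42)
# Two groups of 10 samples with opposite patterns across 50 features (rows).
n_feat <- 50
grp <- rep(1:2, each = 10)
x <- matrix(rnorm(n_feat * 20), nrow = n_feat)
x[26:50, grp == 1] <- x[26:50, grp == 1] + 4
x[1:25,  grp == 2] <- x[1:25,  grp == 2] + 4
colnames(x) <- paste0("s", 1:20)

k <- 2; reps <- 50; p_item <- 0.8            # play the roles of k, reps, pItem
n <- ncol(x)
together <- matrix(0, n, n)   # times two items fell in the same cluster
sampled  <- matrix(0, n, n)   # times two items were drawn in the same subsample

for (r in seq_len(reps)) {
  idx <- sort(sample(n, size = floor(p_item * n)))
  dmat <- as.dist(1 - cor(x[, idx]))         # 1 - Pearson correlation
  cl <- cutree(hclust(dmat, method = "average"), k = k)  # linkage is an assumption
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==") * 1
}

# Consensus = co-clustered count / co-sampled count; pmax avoids 0/0.
consensus <- together / pmax(sampled, 1)
dimnames(consensus) <- list(colnames(x), colnames(x))
round(consensus[c(1, 2, 11, 12), c(1, 2, 11, 12)], 2)
```

Pairs that always cluster together across subsamples get values near 1, pairs that never do get values near 0, and intermediate values indicate unstable assignments.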

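3-Computing cluster consensus and item consensus

Cluster consensus is the average pairwise consensus value among the members of a cluster, and item consensus is the average consensus value between one item and the members of a cluster. The two concepts are roughly analogous to within-cluster heterogeneity and to module membership in WGCNA, respectively.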




icl = calcICL(results,title=title,plot="png")

icl[["clusterConsensus"]]
#     k cluster clusterConsensus
#[1,] 2       1        0.7681668
#[2,] 2       2        0.9788274
#[3,] 3       1        0.6176820
#[4,] 3       2        0.9190744
#[5,] 3       3        1.0000000
#[6,] 4       1        0.8446083

icl[["itemConsensus"]][1:5,]
#  k cluster  item itemConsensus
#1 2       1 28031     0.6173782
#2 2       1 28023     0.5797202
#3 2       1 43012     0.5961974
#4 2       1 28042     0.5644619
#5 2       1 28047     0.6259350
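The two quantities can be worked out by hand on a toy example. The sketch below assumes the usual definitions (mean over distinct pairs inside a cluster, and mean between one item and a cluster's members excluding the item itself); the matrix values, the labels, and the variable names are hypothetical, and this is not the code of calcICL.

```r
# Toy consensus matrix for 5 items (hypothetical values) and their classes.
items <- paste0("s", 1:5)
cm <- matrix(c(1.0, 0.9, 0.8, 0.1, 0.2,
               0.9, 1.0, 0.7, 0.2, 0.1,
               0.8, 0.7, 1.0, 0.3, 0.2,
               0.1, 0.2, 0.3, 1.0, 0.9,
               0.2, 0.1, 0.2, 0.9, 1.0),
             nrow = 5, byrow = TRUE, dimnames = list(items, items))
cls <- c(s1 = 1, s2 = 1, s3 = 1, s4 = 2, s5 = 2)
ks <- sort(unique(cls))

# Cluster consensus: mean consensus over all distinct pairs inside a cluster.
cluster_consensus <- sapply(ks, function(k) {
  sub <- cm[cls == k, cls == k, drop = FALSE]
  mean(sub[upper.tri(sub)])
})

# Item consensus: mean consensus between one item and the members of a
# cluster, leaving out the item itself.
item_consensus <- sapply(ks, function(k) {
  sapply(items, function(i) {
    members <- setdiff(items[cls == k], i)
    mean(cm[i, members])
  })
})
colnames(item_consensus) <- paste0("cluster", ks)

cluster_consensus
item_consensus
```

An item whose consensus with its own cluster is much higher than with any other cluster is a stable member; items with similar values across clusters are ambiguous.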

4-Visualization

See Bioconductor - ConsensusClusterPlus for details.

R | Consensus clustering with the ConsensusClusterPlus package

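In practice, consensus clustering (CC) is often tied to a specific biological process, such as angiogenesis or hypoxia. First find the relevant gene set, subset the expression matrix to those genes, then cluster the samples into subtypes. Once the subtypes are defined, the downstream analysis can go in many directions: digging into differences in molecular mechanism and co-expression networks, or exploring the diagnostic or prognostic value of the subtypes.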
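The subsetting step might look like the following sketch. The expression matrix and the gene set `hypoxia_set` are made up for illustration; a real gene set would come from a curated source.

```r
set.seed(7)
genes <- paste0("g", 1:100)
expr <- matrix(rnorm(100 * 8), nrow = 100,
               dimnames = list(genes, paste0("s", 1:8)))

# Hypothetical gene set; some members may be absent from the matrix.
hypoxia_set <- c("g3", "g10", "g42", "g77", "not_measured")

# Keep only the set members that were actually measured.
keep <- intersect(hypoxia_set, rownames(expr))
expr_sub <- expr[keep, , drop = FALSE]
dim(expr_sub)
```

The subset `expr_sub` would then take the place of `d` as the first argument to ConsensusClusterPlus.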

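Alternatively, first compute an immune infiltration matrix with an algorithm such as CIBERSORT, then cluster the samples on the infiltration results.

This approach is not very different from scoring a gene set with ssGSEA or GSVA and splitting the samples into high and low groups at the median score. GSVA is more intuitive and tends to separate prognosis or diagnosis more clearly, whereas distance-based clustering may have an edge for co-expression network analysis.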
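For comparison, a median split on a gene-set score can be sketched as below. The score used here is a plain mean of per-gene z-scores, a simplified stand-in chosen for illustration and not the GSVA or ssGSEA algorithm; the data and the gene set are simulated.

```r
set.seed(11)
expr <- matrix(rnorm(60 * 10), nrow = 60,
               dimnames = list(paste0("g", 1:60), paste0("s", 1:10)))
gene_set <- paste0("g", 1:15)                # hypothetical gene set

# Stand-in score: mean z-score of the set's genes per sample (not GSVA itself).
z <- t(scale(t(expr[gene_set, ])))
score <- colMeans(z)

# Split the samples into high and low groups at the median score.
group <- ifelse(score > median(score), "high", "low")
table(group)
```

Unlike clustering, this always yields exactly two groups along a single axis, which is part of why it is easier to interpret.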
