Cluster analysis in R: determining the optimal number of clusters.

Date: 2021-05-28 16:56:12

Being a newbie in R, I'm not very sure how to choose the best number of clusters for a k-means analysis. After plotting a subset of the data below, how many clusters would be appropriate? How can I perform a cluster dendrogram analysis?

n = 1000
kk = 10
x1 = runif(kk)
y1 = runif(kk)
z1 = runif(kk)
x4 = sample(x1, length(x1))
y4 = sample(y1, length(y1))
randObs <- function()
{
  ix = sample(1:length(x4), 1)
  iy = sample(1:length(y4), 1)
  rx = rnorm(1, x4[ix], runif(1)/8)
  ry = rnorm(1, y4[iy], runif(1)/8)  # use the independently drawn iy index here
  return(c(rx, ry))
}
x = c()
y = c()
for (k in 1:n)
{
  rPair = randObs()
  x = c(x, rPair[1])
  y = c(y, rPair[2])
}
z <- rnorm(n)
d <- data.frame(x, y, z)

6 Answers

#1


905  

If your question is how can I determine how many clusters are appropriate for a k-means analysis of my data?, then here are some options. The Wikipedia article on determining the number of clusters has a good review of some of these methods.

First, some reproducible data (the data in the Q are... unclear to me):

n = 100
g = 6 
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), 
                y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
plot(d)

One. Look for a bend or elbow in the sum of squared error (SSE) scree plot. See http://www.statmethods.net/advstats/cluster.html & http://www.mattpeeples.net/kmeans.html for more. The location of the elbow in the resulting plot suggests a suitable number of clusters for the kmeans:

mydata <- d
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

We might conclude that 4 clusters would be indicated by this method.

Two. You can do partitioning around medoids to estimate the number of clusters using the pamk function in the fpc package.

library(fpc)
pamk.best <- pamk(d)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(d, pamk.best$nc))

# we could also do:
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[k] <- pam(d, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
# still 4

Three. Calinski criterion: another approach to diagnosing how many clusters suit the data. In this case we try 1 to 10 groups.

require(vegan)
fit <- cascadeKM(scale(d, center = TRUE,  scale = TRUE), 1, 10, iter = 1000)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)
calinski.best <- as.numeric(which.max(fit$results[2,]))
cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
# 5 clusters!

Four. Determine the optimal model and number of clusters according to the Bayesian Information Criterion for expectation-maximization, initialized by hierarchical clustering for parameterized Gaussian mixture models.

# See http://www.jstatsoft.org/v18/i06/paper
# http://www.stat.washington.edu/research/reports/2006/tr504.pdf
#
library(mclust)
# Run the function to see how many clusters
# it finds to be optimal, setting it to search
# from 1 up to 20 components.
d_clust <- Mclust(as.matrix(d), G=1:20)
m.best <- dim(d_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
# 4 clusters
plot(d_clust)
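
If you want more than just the number of components, summary() on the fitted Mclust object reports the selected model, i.e. the covariance structure and G; a small sketch:

summary(d_clust)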

Five. Affinity propagation (AP) clustering, see http://dx.doi.org/10.1126/science.1136800

library(apcluster)
d.apclus <- apcluster(negDistMat(r=2), d)
cat("affinity propagation optimal number of clusters:", length(d.apclus@clusters), "\n")
# 4
heatmap(d.apclus)
plot(d.apclus, d)

Six. Gap Statistic for Estimating the Number of Clusters. See also some code for a nice graphical output. Trying 2-10 clusters here:

library(cluster)
clusGap(d, kmeans, K.max = 10, B = 100, verbose = interactive())

Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100 
Clustering Gap statistic ["clusGap"].
B=100 simulated reference sets, k = 1..10
 --> Number of clusters (method 'firstSEmax', SE.factor=1): 4
          logW   E.logW        gap     SE.sim
 [1,] 5.991701 5.970454 -0.0212471 0.04388506
 [2,] 5.152666 5.367256  0.2145907 0.04057451
 [3,] 4.557779 5.069601  0.5118225 0.03215540
 [4,] 3.928959 4.880453  0.9514943 0.04630399
 [5,] 3.789319 4.766903  0.9775842 0.04826191
 [6,] 3.747539 4.670100  0.9225607 0.03898850
 [7,] 3.582373 4.590136  1.0077628 0.04892236
 [8,] 3.528791 4.509247  0.9804556 0.04701930
 [9,] 3.442481 4.433200  0.9907197 0.04935647
[10,] 3.445291 4.369232  0.9239414 0.05055486
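
For the "nice graphical output" mentioned above, the cluster package also provides a plot method for clusGap objects; a minimal sketch that stores the result of the call above and plots it:

gs <- clusGap(d, kmeans, K.max = 10, B = 100)
plot(gs, main = "Gap statistic")   # gap curve with standard-error bars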

Here's the output from Edwin Chen's implementation of the gap statistic (figure not reproduced here).

Seven. You may also find it useful to explore your data with clustergrams to visualize cluster assignment, see http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/ for more details.

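A minimal sketch of that, assuming you have saved the clustergram() function from the linked post into a local file clustergram.r (the function name, its k.range argument, and the file name all come from that post, not from a CRAN package):

source("clustergram.r")   # clustergram() as defined in the linked post
clustergram(scale(d), k.range = 2:8)
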
Eight. The NbClust package provides 30 indices to determine the number of clusters in a dataset.

library(NbClust)
nb <- NbClust(d, diss = NULL, distance = "euclidean",
              min.nc = 2, max.nc = 15, method = "kmeans",
              index = "alllong", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
# Looks like 3 is the most frequently determined number of clusters
# and curiously, four clusters is not in the output at all!

If your question is how can I produce a dendrogram to visualize the results of my cluster analysis, then you should start with these:
http://www.statmethods.net/advstats/cluster.html
http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
http://gastonsanchez.wordpress.com/2012/10/03/7-ways-to-plot-dendrograms-in-r/
And see here for more exotic methods: http://cran.r-project.org/web/views/Cluster.html

Here are a few examples:

d_dist <- dist(as.matrix(d))   # find distance matrix
plot(hclust(d_dist))           # apply hierarchical clustering and plot

# a Bayesian clustering method, good for high-dimension data, more details:
# http://vahid.probstat.ca/paper/2012-bclust.pdf
install.packages("bclust")
library(bclust)
x <- as.matrix(d)
d.bclus <- bclust(x, transformed.par = c(0, -50, log(16), 0, 0, 0))
viplot(imp(d.bclus)$var); plot(d.bclus); ditplot(d.bclus)
dptplot(d.bclus, scale = 20, horizbar.plot = TRUE, varimp = imp(d.bclus)$var,
        horizbar.distance = 0, dendrogram.lwd = 2)
# I just include the dendrogram here

Also for high-dimensional data there is the pvclust package, which calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Here's the example from the documentation (it won't work on such low-dimensional data as in my example):

library(pvclust)
library(MASS)
data(Boston)
boston.pv <- pvclust(Boston)
plot(boston.pv)

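pvclust can also highlight the clusters that the bootstrap strongly supports (approximately unbiased p-value >= 0.95) with pvrect; a small sketch:

pvrect(boston.pv, alpha = 0.95)   # draw rectangles around well-supported clusters
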
Does any of that help?

#2


15  

It's hard to add something to such an elaborate answer. Though I feel we should mention identify here, particularly because @Ben shows a lot of dendrogram examples.

d_dist <- dist(as.matrix(d))   # find distance matrix
hc <- hclust(d_dist)           # cluster once, reuse for plotting and identify()
plot(hc)
clusters <- identify(hc)

identify lets you interactively choose clusters from a dendrogram and stores your choices in a list. Hit Esc to leave interactive mode and return to the R console. Note that the list contains the indices, not the rownames (as opposed to cutree).

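For instance, assuming you selected at least one cluster before hitting Esc, the stored indices can be used to subset the data (a hypothetical usage sketch):

d[clusters[[1]], ]   # rows of d belonging to the first cluster you picked
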
#3


6  

To determine the optimal number of clusters for a clustering method, I usually use the Elbow method together with parallel processing to avoid it becoming too time-consuming. A sample of the code looks like this:

Elbow method

library(GMD)   # provides css.hclust() and elbow.batch()

elbow.k <- function(mydata){
  dist.obj <- dist(mydata)
  hclust.obj <- hclust(dist.obj)
  css.obj <- css.hclust(dist.obj, hclust.obj)
  elbow.obj <- elbow.batch(css.obj)
  k <- elbow.obj$k
  return(k)
}

Running Elbow in parallel

library(parallel)

no_cores <- detectCores()
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(GMD))
# data.clustering, data.convert and clustering.kmeans are this answer's own objects
clusterExport(cl, list("data.clustering", "data.convert", "elbow.k", "clustering.kmeans"))
start.time <- Sys.time()
k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering))
end.time <- Sys.time()
stopCluster(cl)
cat('Time to find k using Elbow method is', (end.time - start.time),
    'seconds with k value:', k.clusters)

It works well.

#4


4  

Splendid answer from Ben. However, I'm surprised that the Affinity Propagation (AP) method has been suggested here just to find the number of clusters for the k-means method, when in general AP does a better job of clustering the data. Please see the scientific paper supporting this method in Science here:

Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." science 315.5814 (2007): 972-976.

So if you are not set on k-means, I suggest using AP directly; it will cluster the data without requiring the number of clusters to be known in advance:

library(apcluster)
apclus = apcluster(negDistMat(r=2), data)   # 'data' here is your own data matrix
show(apclus)

If negative Euclidean distances are not appropriate, then you can use other similarity measures provided in the same package. For example, for similarities based on Spearman correlations, this is what you need:

sim = corSimMat(data, method="spearman")
apclus = apcluster(s=sim)

Please note that those similarity functions in the AP package are just provided for convenience. In fact, the apcluster() function will accept any matrix of correlations. The same as before with corSimMat() can be done with this:

sim = cor(data, method="spearman")

or

sim = cor(t(data), method="spearman")

depending on what you want to cluster in your matrix (rows or columns).

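As a self-contained illustration, here is a sketch using the built-in mtcars data, whose rows get clustered (with the two-column d from earlier, Spearman correlations between rows would be degenerate):

library(apcluster)
sim = corSimMat(as.matrix(mtcars), method="spearman")   # Spearman similarities between rows
apclus = apcluster(s=sim)
show(apclus)
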
#5


1  

These methods are great, but when trying to find k for much larger data sets, they can be crazy slow in R.

A good solution I have found is the "RWeka" package, which has an efficient implementation of the X-Means algorithm - an extended version of K-Means that scales better and will determine the optimum number of clusters for you.

First you'll want to make sure that Weka is installed on your system and that XMeans is installed through Weka's package manager tool.

library(RWeka)

# Print a list of available options for the X-Means algorithm
WOW("XMeans")

# Create a Weka_control object which will specify our parameters
weka_ctrl <- Weka_control(
    I = 1000,                          # max no. of overall iterations
    M = 1000,                          # max no. of iterations in the kMeans loop
    L = 20,                            # min no. of clusters
    H = 150,                           # max no. of clusters
    D = "weka.core.EuclideanDistance", # distance metric Euclidean
    C = 0.4,                           # cutoff factor ???
    S = 12                             # random number seed (for reproducibility)
)

# Run the algorithm on your data, d
x_means <- XMeans(d, control = weka_ctrl)

# Assign cluster IDs to original data set
d$xmeans.cluster <- x_means$class_ids
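
A quick sanity check (sketch): tabulating the assigned IDs shows how many clusters X-Means settled on and how large each one is.

table(d$xmeans.cluster)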

#6


0  

The answers are great. If you want to give another clustering method a chance, you can use hierarchical clustering and see how the data split.

> set.seed(2)
> x=matrix(rnorm(50*2), ncol=2)
> hc.complete = hclust(dist(x), method="complete")
> plot(hc.complete)

Depending on how many classes you need, you can cut your dendrogram as follows:

> cutree(hc.complete,k = 2)
 [1] 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1
[26] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2

If you type ?cutree you will see the definitions. If your data set has three classes it would simply be cutree(hc.complete, k = 3). The equivalent of cutree(hc.complete, k = 2) is cutree(hc.complete, h = 4.9).

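You can check that claimed equivalence directly; if the cut height matches, this should return TRUE (a quick sketch):

> identical(cutree(hc.complete, k = 2), cutree(hc.complete, h = 4.9))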
