After finishing a cluster analysis, when I input some new data, how do I know which cluster the data belongs to?
data(freeny)
library(RSNNS)
options(digits = 2)
year <- as.integer(rownames(freeny))
freeny <- cbind(freeny, year)
# shuffle the rows
freeny <- freeny[sample(nrow(freeny)), ]
freenyValues <- freeny[, 1:5]
freenyTargets <- decodeClassLabels(freeny[, 6])
freeny <- splitForTrainingAndTest(freenyValues, freenyTargets, ratio = 0.15)
km <- kmeans(freeny$inputsTrain, 10, iter.max = 100)
kclust <- km$cluster
1 solution
#1
kmeans returns an object containing the coordinates of the cluster centers in $centers. You want to find the cluster to which the new object is closest (in terms of the sum of squares of distances):
v <- freeny$inputsTrain[1,] # just an example
which.min( sapply( 1:10, function( x ) sum( ( v - km$centers[x,])^2 ) ) )
The above returns 8, the same cluster to which the first row of freeny$inputsTrain was assigned.
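The same idea generalizes to assigning many new rows at once. Below is a minimal self-contained sketch (using synthetic data rather than the freeny split above, and a hypothetical helper name assign_cluster) that maps each row of a new matrix to its nearest k-means center:

```r
# Sketch: assign new observations to the nearest k-means center.
# Synthetic data keeps the example self-contained.
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)       # 100 points, 2 features
km <- kmeans(X, centers = 3, iter.max = 100)

# For each row of newdata, return the index of the closest center
# (smallest sum of squared differences).
assign_cluster <- function(newdata, centers) {
  apply(newdata, 1, function(v) {
    which.min(colSums((t(centers) - v)^2))
  })
}

Xnew <- matrix(rnorm(10), ncol = 2)     # 5 new points
assign_cluster(Xnew, km$centers)
```

Note that `t(centers) - v` recycles `v` down each column, so every center is compared against the same new point in one vectorized step.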
As an alternative approach, you can first create a clustering and then use supervised machine learning to train a model, which you then use for prediction. However, the quality of the model will depend on how well the clustering really represents the data structure and how much data you have. I have inspected your data with PCA (my favorite tool):
pca <- prcomp(freeny$inputsTrain, scale. = TRUE)
library(pca3d)
pca3d(pca)
My impression is that you have at most 6-7 clear classes to work with.
However, one should run more kmeans diagnostics (elbow plots etc.) to determine the optimal number of clusters:
# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(x) {
  km <- kmeans(freeny$inputsTrain, x, iter.max = 100)
  km$tot.withinss
})
plot(1:10, wss)
This plot suggests 3-4 classes as the optimum. For a more complex and informative approach, consult clustergrams: http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
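The supervised follow-up mentioned above can be sketched as: fit k-means, treat the cluster labels as a training target, and train a classifier that predicts the cluster of new data. This is a minimal sketch on synthetic data using class::knn as one possible classifier (the class package ships with R); any supervised learner would do:

```r
# Sketch: cluster first, then train a classifier on the cluster
# labels and use it to predict clusters for new observations.
library(class)  # for knn(); a recommended package bundled with R

set.seed(42)
X <- matrix(rnorm(300), ncol = 3)            # 100 points, 3 features
km <- kmeans(X, centers = 4, iter.max = 100)

Xnew <- matrix(rnorm(15), ncol = 3)          # 5 new points
pred <- knn(train = X, test = Xnew,
            cl = factor(km$cluster), k = 5)  # 5-nearest-neighbour vote
pred
```

Unlike the nearest-center rule, a classifier trained this way can pick up non-spherical cluster boundaries, but its quality still hinges on the clustering itself being a good description of the data.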