《机器学习实战》之二分K-均值聚类算法的python实现
上面博文介绍了K-均值聚类算法及其用python实现,上篇博文中的两张截图,我们可以看到,由于K-均值聚类算法中由于初始质心的选取,会造成聚类的局部最优,并不是全局最优,因此,会造成聚类的效果并不理想,为克服K-均值算法收敛于局部最小值的问题,就有了二分K-均值算法。
二分K-均值聚类算法
二分K均值算法是基本K均值算法的直接扩充,其基本思想是:为了得到K个簇,首先将所有点的集合分裂成两个簇,然后从这些簇中选取一个继续分裂,迭代直到产生K个簇;二分K均值的关键是如何选择待分裂簇,选择策略主要有两种,一是选择包含矢量个数最多的簇进行分裂,二是选择具有最大SSE(误差的平方和)的簇分裂。
二分k均值算法的伪代码如下:
将所有数据点看成一个簇
当簇数目小于k时
对每一个簇
计算总误差
在给定的簇上面进行k-均值聚类(k=2)
计算将该簇一分为二后的总误差
选择使得误差最小的那个簇进行划分操作
另一种做法是选择SSE最大的簇进行划分,直到簇数目达到用户指定的数目为止。
python实现如下
环境:python 3.4 使用的库:numpy、matplotlib
biKmeans.py:此文件中的kmeans算法与上篇博文中的代码一致,在上一篇博文中注释写的比较详细,这里就没有在详细注释,若需要看注释,可以看上一篇博文
# 二分k均值算法的伪代码如下:
#***************************************************************
#将所有数据点看成一个簇
#
#当簇数目小于k时
#
# 对每一个簇
#
# 计算总误差
#
# 在给定的簇上面进行k-均值聚类(k=2)
#
# 计算将该簇一分为二后的总误差
#
# 选择使得误差最小的那个簇进行划分操作
#
#***************************************************************
from numpy import *
import time
import matplotlib.pyplot as plt
# calculate Euclidean distance
def euclDistance(vector1, vector2):
return sqrt(sum(power(vector2 - vector1, 2)))
# init centroids with random samples
def initCentroids(dataSet, k):
numSamples, dim = dataSet.shape
centroids = zeros((k, dim))
for i in range(k):
index = int(random.uniform(0, numSamples))
print(index)
centroids[i, :] = dataSet[index, :]
return centroids
# k-means cluster
def kmeans(dataSet, k):
numSamples = dataSet.shape[0]
# first column stores which cluster this sample belongs to,
# second column stores the error between this sample and its centroid
clusterAssment = mat(zeros((numSamples, 2)))
clusterChanged = True
## step 1: init centroids
centroids = initCentroids(dataSet, k)
while clusterChanged:
clusterChanged = False
## for each sample
for i in range(numSamples):
minDist = 100000.0
minIndex = 0
## for each centroid
## step 2: find the centroid who is closest
for j in range(k):
distance = euclDistance(centroids[j, :], dataSet[i, :])
if distance < minDist:
minDist = distance
minIndex = j
## step 3: update its cluster
if clusterAssment[i, 0] != minIndex:
clusterChanged = True
clusterAssment[i, :] = minIndex, minDist**2
## step 4: update centroids
for j in range(k):
pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
centroids[j, :] = mean(pointsInCluster, axis = 0)
print ('Congratulations, cluster using k-means complete!' )
return centroids, clusterAssment
# bisecting k-means cluster
def biKmeans(dataSet, k):
numSamples = dataSet.shape[0]
# first column stores which cluster this sample belongs to,
# second column stores the error between this sample and its centroid
clusterAssment = mat(zeros((numSamples, 2)))
# step 1: the init cluster is the whole data set
#mean(dataSet,axis=0)返回的是:matrix([[XXX,XXX,XXX]])
#mean(dataSet, axis = 0).tolist()返回的是:[[XXX,XXX,XXX]],
#mean(dataSet, axis = 0).tolist()[0],返回的是:[XXX,XXX,XXX]
centroid0 = mean(dataSet, axis = 0).tolist()[0] #列表中的第一个列表元素:即全部数据每个属性对应的均值
centList = [centroid0] #centList是[[xxx,xxx,xxx]]
for j in range(numSamples):
clusterAssment[j, 1] = (euclDistance(mat(centroid0), dataSet[j, :]))**2 #计算每个样本点与质心之间的距离
while (len(centList) < k):
# min sum of square error
minSSE = inf
#numCurrCluster = len(centList) #当前的簇的个数
# for each cluster
#找出numCurrCluster个簇中哪个簇分解得到的误差平方和最小的两个簇
for i in range(len(centList)):
# step 2: get samples in cluster i
#选取第i个簇的所有数据,然后将其分成两个两个簇
pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
# step 3: cluster it to 2 sub-clusters using k-means
#centroids的元素为每个簇的质心
#splitClusterAssment第一列为样本所属的类别号,第二列为样本到其所属簇的质心的距离的平方
#每个2均值聚类时循环N次,去除kmeans的初始值随机选取不当
N=20
minSSE_kmeans=100000.0;
for j in range(N):
centroids_kmeans, splitClusterAssment_kmeans = kmeans(pointsInCurrCluster, 2)
if (sum(splitClusterAssment_kmeans[:, 1]))<minSSE_kmeans:
minSSE_kmeans=sum(splitClusterAssment_kmeans[:, 1])
splitClusterAssment=splitClusterAssment_kmeans
centroids=centroids_kmeans
# step 4: calculate the sum of square error after split this cluster
#下面的代码是求误差平方和
#splitSSE=sum(power(splitClusterAssment[:,1],2))
splitSSE = sum(splitClusterAssment[:, 1])
#不是标号为第i个簇的误差平方和
notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A!=i)[0], 1])
currSplitSSE = splitSSE + notSplitSSE #当前所有簇的平方和
print('num=%d,%s,%s,%s' %(i,splitSSE,notSplitSSE,currSplitSSE))
# step 5: find the best split cluster which has the min sum of square error
if currSplitSSE < minSSE: #
bestCentroidToSplit = i
bestNewCentroids = centroids
bestClusterAssment = splitClusterAssment.copy()
minSSE = currSplitSSE
print('minSSE=%s' %minSSE)
# step 6: modify the cluster index for adding new cluster
#将新分出来的两个簇的标号一个沿用它父亲的标号,一个用簇的总数来标号。
bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = len(centList)
bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit
# step 7: update and append the centroids of the new 2 sub-cluster
centList[bestCentroidToSplit] = bestNewCentroids[0, :] #将第一个子簇的质心放在父亲质心的原位置
centList.append(bestNewCentroids[1, :]) #将第二个子簇的质心添加在末尾
# step 8: update the index and error of the samples whose cluster have been changed
#由第i个簇分解为j、k两个簇所得到的数据将分解之前的数据替换掉
clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment
print ('Congratulations, cluster using bi-kmeans complete!' )
return mat(centList), clusterAssment
# show your cluster only available with 2-D data
def showCluster(dataSet, k, centroids, clusterAssment):
numSamples, dim = dataSet.shape
if dim != 2:
print ("Sorry! I can not draw because the dimension of your data is not 2!" )
return 1
mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
if k > len(mark):
print ("Sorry! Your k is too large!")
return 1
# draw all samples
for i in range(numSamples):
markIndex = int(clusterAssment[i, 0])
plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
# draw the centroids
for i in range(k):
plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)
plt.show()
测试文件test.py:与K均值聚类这篇博文中的代码相同,这里也就没有再注释
#################################################
# kmeans: k-means cluster
# Author :wojiushimogui
# Date :2015年7月28日11:12:44
# HomePage :http://write.blog.csdn.net/postlist
# Email :
#################################################
from numpy import *
import time
import matplotlib.pyplot as plt
import biKMeans
from biKMeans import *
## step 1: load data
print ("step 1: load data..." )
dataSet = []
fileIn = open('D:/xuepython/BiKmeans/testSet.txt')
for line in fileIn.readlines():
lineArr = line.strip().split('\t')
dataSet.append([float(lineArr[0]), float(lineArr[1])])
## step 2: clustering...
print ("step 2: clustering...")
dataSet = mat(dataSet)
k = 4
centroids, clusterAssment = biKmeans(dataSet, k)
## step 3: show the result
print ("step 3: show the result..." )
showCluster(dataSet, k, centroids, clusterAssment)
运行结果截图如下:
是乃遗憾,运行的结果也不是我们想象的那样,聚类的结果也是随机的,不是那么的理想,但是,按照《机器学习实战》这本书上面提供的二分K-均值聚类算法的思想:找到分解使得总的SSE最小(即找到那个簇使得分解后能够最大的降低SSE))来看,应该是不会出现上面的问题的,但是今天无论是我参考《机器学习实战》这本书上面的源代码,还是这篇博客中的源码,都不能解决这个问题。
由于在参考《数据挖掘导论》这本书上面的二分K均值聚类时看到如下的伪代码。
初始化簇表,使之包含由所有点组成的簇
repeat
从簇表中取出一个簇
{对选定的簇进行多次二分“实验”}
for i= 1 to 实验次数 do
使用基本K均值,二分选定的簇。
end for
从二分试验中选择具有最小总SSE的两个簇
将这两个簇添加到簇表中。
until 簇表中包含K个簇
因此,按照上面的思路,我在《机器学习实战》上面的源码中添加了如下代码。
#每个2均值聚类时循环N次,去除kmeans的初始值随机选取不当 N=20 minSSE_kmeans=100000.0; for j in range(N): centroids_kmeans, splitClusterAssment_kmeans = kmeans(pointsInCurrCluster, 2) if (sum(splitClusterAssment_kmeans[:, 1]))<minSSE_kmeans: minSSE_kmeans=sum(splitClusterAssment_kmeans[:, 1]) splitClusterAssment=splitClusterAssment_kmeans centroids=centroids_kmeans
发现,依然还是没有办法解决这个问题。
马上就要去跑步了,今天就思考到这里,明天我将会换一种方法来解决这个问题:通过找到最大的SSE的那个簇,然后将其分解;按照这样一个思路我相信应该可以解决这个问题。
源码和测试数据在这里可以获取
身体是自己的,科研是老板的,哈哈
跑步去了