Bisecting k-means algorithm:
Algorithm idea:
First treat all points as a single cluster, then split that cluster in two. After that, repeatedly choose the cluster whose split most reduces the clustering cost function (i.e., the sum of squared errors, SSE) and divide it into two clusters. Continue this way until the number of clusters equals the user-specified number k.
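The cost function being minimized is easy to make concrete. Below is a minimal sketch of the SSE computation in plain NumPy; the names `sse`, `points`, and `centroid` are illustrative and not part of the implementation that follows:

```python
import numpy as np

def sse(points, centroid):
    """Sum of squared Euclidean distances from each point to the centroid."""
    return float(((points - centroid) ** 2).sum())

# Two nearby points and one far away: the far point dominates the SSE,
# which is why a cluster containing it is an attractive candidate to split.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroid = points.mean(axis=0)
```

Splitting a cluster can only lower (or keep) its SSE, since each point moves to a centroid at least as close as before; the algorithm exploits this by comparing the total SSE across candidate splits.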
Algorithm pseudocode:

    Treat all the data points as one cluster
    While the number of clusters is less than k:
        For each cluster:
            Run k-means (k = 2) on that cluster
            Compute the total error after this trial split
        Split the cluster whose split yields the smallest total error
Python implementation:
from numpy import *
import matplotlib.pyplot as plt

def createCenter(dataSet, k):
    """Pick k random samples from dataSet as the initial centroids."""
    n, d = shape(dataSet)
    centroids = zeros((k, d))
    for i in range(k):
        c = int(random.uniform(0, n))  # random sample index
        centroids[i, :] = dataSet[c, :]
    return centroids

def getDist(vec1, vec2):
    """Euclidean distance between two vectors."""
    return sqrt(sum(power(vec1 - vec2, 2)))

def kmeans(dataSet, k):
    n = shape(dataSet)[0]
    # first column: cluster index; second column: squared error to the centroid
    clusterAssment = mat(zeros((n, 2)))
    centroids = createCenter(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # assign every sample to its nearest centroid
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # convergence: assignments no longer change
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        # update each centroid to the mean of its members
        for i in range(k):
            ptsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0]]
            if len(ptsInCluster) > 0:  # skip empty clusters to avoid a NaN mean
                centroids[i, :] = mean(ptsInCluster, axis=0)
    return centroids, clusterAssment

def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry, your k is too large")
        return 1
    # draw the samples, colored by cluster
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the squared error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(mat(centroid), dataSet[i, :])**2
    while len(centList) < k:
        minSSE = inf  # minimum total sum of squared errors found so far
        numCurrCluster = len(centList)
        # try splitting each current cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # step 3: split cluster i into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)
            # step 4: total sum of squared errors if this cluster were split
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE
            # step 5: keep the split with the smallest total error
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()
        # step 6: renumber the two sub-clusters: one takes the next free index,
        # the other keeps the index of the cluster that was split
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit
        # step 7: replace the split centroid and append the new one
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())
        # step 8: update index and error of the samples whose cluster changed
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment
    plt.figure()
    print_result(dataSet, len(centList), mat(centList), clusterAssment)
    print('Congratulations, clustering with bi-kmeans is complete!')
    return mat(centList), clusterAssment
Here, biKmeans(dataSet, k) is the main body of the bisecting algorithm. The process is roughly:
1. Initialize the centroid of the single starting cluster and set up the required data structures
2. Trial-split each cluster in two and keep the best split
3. Update the cluster assignments and the centroid list
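The core of step 2 can be demonstrated on its own. The sketch below is standalone (it does not reuse the kmeans above): a deterministic 2-means whose centroids start at the two most distant points, showing that splitting one cluster in two lowers the SSE, which is exactly the quantity step 5 compares across candidate splits. All names here are illustrative:

```python
import numpy as np

def two_means(points, iters=10):
    """Deterministic 2-means: init the centroids at the two most distant points."""
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    cents = points[[i, j]].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute the means
        labels = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in (0, 1):
            if (labels == c).any():
                cents[c] = points[labels == c].mean(0)
    sse = float(((points - cents[labels]) ** 2).sum())
    return cents, labels, sse

# Two well-separated blobs currently treated as one cluster.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
sse_before = float(((pts - pts.mean(0)) ** 2).sum())
cents, labels, sse_after = two_means(pts)
```

In biKmeans, `sse_after` plays the role of splitSSE for the candidate cluster, and the split is kept only if it gives the smallest total error among all clusters tried.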
Result of the bisecting split: (figure omitted)
Advantages of bisecting:
- Bisecting k-means can speed up the execution of k-means, since each 2-means run only compares the samples of one cluster against two centroids, reducing the number of similarity computations
- It is far less affected by the initialization problem, since fewer random starting points are picked and every step keeps the split with the smallest error
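The second bullet can be illustrated with a toy case (hypothetical data, not from this post): plain Lloyd iterations started from a bad initialization converge to a poor local optimum, which is the failure mode bisecting k-means mitigates by always keeping the lowest-SSE split:

```python
import numpy as np

def lloyd(points, cents, iters=20):
    """Plain Lloyd iterations from a given centroid initialization; returns the SSE."""
    cents = cents.astype(float).copy()
    for _ in range(iters):
        labels = ((points[:, None] - cents[None, :]) ** 2).sum(-1).argmin(1)
        for c in range(len(cents)):
            if (labels == c).any():
                cents[c] = points[labels == c].mean(0)
    return float(((points - cents[labels]) ** 2).sum())

# Four corners of a wide rectangle; the optimal 2-clustering pairs left with
# left and right with right.
pts = np.array([[0., 0.], [0., 1.], [4., 0.], [4., 1.]])
bad = lloyd(pts, np.array([[2., 0.], [2., 1.]]))   # splits top/bottom: stuck
good = lloyd(pts, np.array([[0., .5], [4., .5]]))  # splits left/right: optimal
```

From the top/bottom initialization Lloyd's update is already at a fixed point with SSE 16, while the left/right partition reaches SSE 1; plain k-means has no mechanism to escape the first, whereas the bisecting scheme compares the SSE of every candidate split before committing.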
Result of plain k-means: (figure omitted)