Bisecting k-means algorithm:
Algorithm idea:
First treat all points as a single cluster, then split that cluster in two. After that, repeatedly choose the cluster whose split most reduces the clustering cost function (i.e., the sum of squared errors, SSE) and divide it into two clusters. Continue this way until the number of clusters equals the user-specified number k.
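The cost function being minimized is easy to make concrete. Below is a minimal sketch of the SSE computation in plain NumPy; the names `sse`, `points`, and `centroid` are illustrative and not part of the implementation that follows:

```python
import numpy as np

def sse(points, centroid):
    """Sum of squared Euclidean distances from each point to the centroid."""
    return float(((points - centroid) ** 2).sum())

# Two nearby points and one far away: the far point dominates the SSE,
# which is why a cluster containing it is an attractive candidate to split.
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroid = points.mean(axis=0)
```

Splitting a cluster can only lower (or keep) its SSE, since each point moves to a centroid at least as close as before; the algorithm exploits this by comparing the total SSE across candidate splits.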
Algorithm pseudocode:

    Treat all the data points as one cluster
    While the number of clusters is less than k:
        For each cluster:
            Run k-means (k = 2) on that cluster
            Compute the total error after this trial split
        Split the cluster whose split yields the smallest total error
Python implementation:
from numpy import *
import matplotlib.pyplot as plt

def createCenter(dataSet, k):
    """Pick k random samples from dataSet as the initial centroids."""
    n, d = shape(dataSet)
    centroids = zeros((k, d))
    for i in range(k):
        c = int(random.uniform(0, n))  # random sample index
        centroids[i, :] = dataSet[c, :]
    return centroids

def getDist(vec1, vec2):
    """Euclidean distance between two vectors."""
    return sqrt(sum(power(vec1 - vec2, 2)))

def kmeans(dataSet, k):
    n = shape(dataSet)[0]
    # first column: cluster index; second column: squared error to the centroid
    clusterAssment = mat(zeros((n, 2)))
    centroids = createCenter(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # assign every sample to its nearest centroid
        for i in range(n):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = getDist(dataSet[i, :], centroids[j, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # convergence: assignments no longer change
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        # update each centroid to the mean of its members
        for i in range(k):
            ptsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0]]
            if len(ptsInCluster) > 0:  # skip empty clusters to avoid a NaN mean
                centroids[i, :] = mean(ptsInCluster, axis=0)
    return centroids, clusterAssment

def print_result(dataSet, k, centroids, clusterAssment):
    n, d = dataSet.shape
    if d != 2:
        print("Cannot draw!")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry, your k is too large")
        return 1
    # draw the samples, colored by cluster
    for i in range(n):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    # draw the centroids
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the squared error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = getDist(mat(centroid), dataSet[i, :])**2
    while len(centList) < k:
        minSSE = inf  # minimum total sum of squared errors found so far
        numCurrCluster = len(centList)
        # try splitting each current cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # step 3: split cluster i into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)
            # step 4: total sum of squared errors if this cluster were split
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE
            # step 5: keep the split with the smallest total error
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()
        # step 6: renumber the two sub-clusters: one takes the next free index,
        # the other keeps the index of the cluster that was split
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit
        # step 7: replace the split centroid and append the new one
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()
        centList.append(bestNewCentroids[1, :].tolist())
        # step 8: update index and error of the samples whose cluster changed
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment
    plt.figure()
    print_result(dataSet, len(centList), mat(centList), clusterAssment)
    print('Congratulations, clustering with bi-kmeans is complete!')
    return mat(centList), clusterAssment
Here, biKmeans(dataSet, k) is the main body of the bisecting algorithm. The process is roughly:
1. Initialize the centroid of the single starting cluster and set up the required data structures
2. Trial-split each cluster in two and keep the best split
3. Update the cluster assignments and the centroid list
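The core of step 2 can be demonstrated on its own. The sketch below is standalone (it does not reuse the kmeans above): a deterministic 2-means whose centroids start at the two most distant points, showing that splitting one cluster in two lowers the SSE, which is exactly the quantity step 5 compares across candidate splits. All names here are illustrative:

```python
import numpy as np

def two_means(points, iters=10):
    """Deterministic 2-means: init the centroids at the two most distant points."""
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    cents = points[[i, j]].astype(float)
    for _ in range(iters):
        # assign each point to its nearest centroid, then recompute the means
        labels = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in (0, 1):
            if (labels == c).any():
                cents[c] = points[labels == c].mean(0)
    sse = float(((points - cents[labels]) ** 2).sum())
    return cents, labels, sse

# Two well-separated blobs currently treated as one cluster.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
sse_before = float(((pts - pts.mean(0)) ** 2).sum())
cents, labels, sse_after = two_means(pts)
```

In biKmeans, `sse_after` plays the role of splitSSE for the candidate cluster, and the split is kept only if it gives the smallest total error among all clusters tried.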
Result of the bisecting split: (figure omitted)
Advantages of bisecting:
- Bisecting k-means can speed up the execution of k-means, since each 2-means run only compares the samples of one cluster against two centroids, reducing the number of similarity computations
- It is far less affected by the initialization problem, since fewer random starting points are picked and every step keeps the split with the smallest error
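The second bullet can be illustrated with a toy case (hypothetical data, not from this post): plain Lloyd iterations started from a bad initialization converge to a poor local optimum, which is the failure mode bisecting k-means mitigates by always keeping the lowest-SSE split:

```python
import numpy as np

def lloyd(points, cents, iters=20):
    """Plain Lloyd iterations from a given centroid initialization; returns the SSE."""
    cents = cents.astype(float).copy()
    for _ in range(iters):
        labels = ((points[:, None] - cents[None, :]) ** 2).sum(-1).argmin(1)
        for c in range(len(cents)):
            if (labels == c).any():
                cents[c] = points[labels == c].mean(0)
    return float(((points - cents[labels]) ** 2).sum())

# Four corners of a wide rectangle; the optimal 2-clustering pairs left with
# left and right with right.
pts = np.array([[0., 0.], [0., 1.], [4., 0.], [4., 1.]])
bad = lloyd(pts, np.array([[2., 0.], [2., 1.]]))   # splits top/bottom: stuck
good = lloyd(pts, np.array([[0., .5], [4., .5]]))  # splits left/right: optimal
```

From the top/bottom initialization Lloyd's update is already at a fixed point with SSE 16, while the left/right partition reaches SSE 1; plain k-means has no mechanism to escape the first, whereas the bisecting scheme compares the SSE of every candidate split before committing.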
Result of plain k-means: (figure omitted)