1、聚类定义
聚类是一种无监督学习,它将相似的对象归为一类,簇内的对象越相似,聚类的效果越好。k-均值首先发现k个不同的簇,且每个簇的中心采用簇中所含值的均值计算而成。
2、开发机器学习应用程序的步骤
(1)收集数据:收集各种样本数据,为了节省时间,可以使用公开的可用数据源
(2)准备输入数据:确保数据格式符合要求,本书采用的格式是Python语言的List。
(3)数据分析:人工分析以前得到的数据,确保数据集中没有垃圾数据。
(4)训练算法,这一步才是机器学习真正的开始,对于监督学习,我们把之前的格式化数据输入到算法中,得到分类模型,方便后续处理。对于无监督学习,不存在目标变量,因此也不需要训练算法!!!
(5)测试算法:这一步使用第4步机器学习的模型。这一步主要测试算法的工作效率。
(6)使用算法:将机器学习算法转换为应用程序,执行实际任务。
3、K-均值聚类
首先,随机确定k个初始点作为质心。然后为每个点找距离其最近的质心,将其分配到该质心所对应的簇中。完成这一步之后,更新每个簇的质心为该簇所有点的平均值。迭代以上过程,直到簇分配结果不再改变,完成K-均值聚类。
4、k-均值伪代码
创建k个点作为起始质心(随机选择)
当任意一个点的簇分配结果发生改变时
对数据集中的每个数据点
对每个质心
计算质心与数据点之间的距离
将数据点分配到距其最近的簇
对每一个簇,计算簇中所有点的均值并将均值作为质心
5、算法实现
k-均值聚类支持函数
#encoding:utf-8
from numpy import *
from math import *
#加载数据集
def loadDataSet(fileName): #将一个文本文件导入到列表中
dataMat = [] #创建一个dataMat空列表
fr = open(fileName)
for line in fr.readlines(): #一行一行的读取文本文件
curLine = line.strip().split('\t')
fltLine = map(float,curLine) #将文本文件转化为字符型
dataMat.append(fltLine)
return dataMat
#计算两个向量的欧式距离
def distEclud(vecA, vecB):
return sqrt(sum(power(vecA - vecB, 2)))
#随机构建初始质心
def randCent(dataSet, k): #构建一个包含k个随机质心的集合
#数据列数
n = shape(dataSet)[1]
#初始化质心
centroids = mat(zeros((k,n)))
for j in range(n):
#数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
minJ = min(dataSet[:,j])
rangeJ = float(max(dataSet[:,j]) - minJ)
centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
return centroids
#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data
print "\n"
#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat
print "\n"
#矩阵中的最大值和最小值
print "矩阵中的最大值和最小值:"
min1 = min(dataMat[:,0])
print "min(dataMat[:,0]):"
print min1
min2 = min(dataMat[:,1])
print "min(dataMat[:,1]):"
print min2
max1 = max(dataMat[:,0])
print "max(dataMat[:,0]):"
print max1
max2 = max(dataMat[:,1])
print "max(dataMat[:,1]):"
print max2
print "\n"
#k个随机质心的集合,随机质心必须在整个数据集的边界之内,也就是在最小值和最大值之间
centroids = randCent(dataMat, 2)
print "质心:"
print centroids
print "\n"
#计算距离
distance = distEclud(dataMat[0],dataMat[1])
print "distance:"
print distance
Output:
data(列表形式):
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]
dataMat(矩阵形式):
[[ 1.658985 4.285136]
[-3.453687 3.424321]
[ 4.838138 -1.151539]
[-5.379713 -3.362104]
[ 0.972564 2.924086]
[-3.567919 1.531611]
[ 0.450614 -3.302219]
[-3.487105 -1.724432]
[ 2.668759 1.594842]
[-3.156485 3.191137]
[ 3.165506 -3.999838]
[-2.786837 -3.099354]
[ 4.208187 2.984927]
[-2.123337 2.943366]
[ 0.704199 -0.479481]
[-0.39237 -3.963704]
[ 2.831667 1.574018]
[-0.790153 3.343144]
[ 2.943496 -3.357075]
[-3.195883 -2.283926]
[ 2.336445 2.875106]
[-1.786345 2.554248]
[ 2.190101 -1.90602 ]
[-3.403367 -2.778288]
[ 1.778124 3.880832]
[-1.688346 2.230267]
[ 2.592976 -2.054368]
[-4.007257 -3.207066]
[ 2.257734 3.387564]
[-2.679011 0.785119]
[ 0.939512 -4.023563]
[-3.674424 -2.261084]
[ 2.046259 2.735279]
[-3.18947 1.780269]
[ 4.372646 -0.822248]
[-2.579316 -3.497576]
[ 1.889034 5.1904 ]
[-0.798747 2.185588]
[ 2.83652 -2.658556]
[-3.837877 -3.253815]
[ 2.096701 3.886007]
[-2.709034 2.923887]
[ 3.367037 -3.184789]
[-2.121479 -4.232586]
[ 2.329546 3.179764]
[-3.284816 3.273099]
[ 3.091414 -3.815232]
[-3.762093 -2.432191]
[ 3.542056 2.778832]
[-1.736822 4.241041]
[ 2.127073 -2.98368 ]
[-4.323818 -3.938116]
[ 3.792121 5.135768]
[-4.786473 3.358547]
[ 2.624081 -3.260715]
[-4.009299 -2.978115]
[ 2.493525 1.96371 ]
[-2.513661 2.642162]
[ 1.864375 -3.176309]
[-3.171184 -3.572452]
[ 2.89422 2.489128]
[-2.562539 2.884438]
[ 3.491078 -3.947487]
[-2.565729 -2.012114]
[ 3.332948 3.983102]
[-1.616805 3.573188]
[ 2.280615 -2.559444]
[-2.651229 -3.103198]
[ 2.321395 3.154987]
[-1.685703 2.939697]
[ 3.031012 -3.620252]
[-4.599622 -2.185829]
[ 4.196223 1.126677]
[-2.133863 3.093686]
[ 4.668892 -2.562705]
[-2.793241 -2.149706]
[ 2.884105 3.043438]
[-2.967647 2.848696]
[ 4.479332 -1.764772]
[-4.905566 -2.91107 ]]
矩阵中的最大值和最小值:
min(dataMat[:,0]):
[[-5.379713]]
min(dataMat[:,1]):
[[-4.232586]]
max(dataMat[:,0]):
[[ 4.838138]]
max(dataMat[:,1]):
[[ 5.1904]]
质心:
[[-0.63033973 3.67867827]
[ 4.73957277 -1.20849738]]
distance:
5.18463281668
所有支持函数正常运行之后,就可以准备实现完整的k-均值聚类算法了。
k-均值聚类算法:
#encoding:utf-8
from numpy import *
from math import *
#加载数据集
def loadDataSet(fileName): #将一个文本文件导入到列表中
dataMat = [] #创建一个dataMat空列表
fr = open(fileName)
for line in fr.readlines(): #一行一行的读取文本文件
curLine = line.strip().split('\t')
fltLine = map(float,curLine) #将文本文件转化为字符型
dataMat.append(fltLine)
return dataMat
#计算两个向量的欧式距离
def distEclud(vecA, vecB):
return sqrt(sum(power(vecA - vecB, 2)))
#随机构建初始质心
def randCent(dataSet, k): #构建一个包含k个随机质心的集合
#数据列数
n = shape(dataSet)[1]
#初始化质心
centroids = mat(zeros((k,n)))
for j in range(n):
#数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
minJ = min(dataSet[:,j])
rangeJ = float(max(dataSet[:,j]) - minJ)
centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
return centroids
#k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
#数据点的行数
m = shape(dataSet)[0]
#用于记录数据点到质心的距离平方
clusterAssment = mat(zeros((m,2)))
#中心点
centroids = createCent(dataSet, k)
#聚类结束标志
clusterChanged = True
while clusterChanged:
clusterChanged = False
#遍历每条数据
for i in range(m):
#设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心
minDist = inf; minIndex = -1
#遍历每个质心
for j in range(k):
distJI = distMeas(centroids[j,:],dataSet[i,:])
if distJI < minDist:
#将数据归为最近的质心
minDist = distJI; minIndex = j
#簇分配结果发生改变,更新标志
if clusterAssment[i,0] != minIndex: clusterChanged = True
clusterAssment[i,:] = minIndex,minDist**2
print centroids
#更新质心
for cent in range(k):#recalculate centroids
ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean
return centroids, clusterAssment
#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data
print "\n"
#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat
print "\n"
centroids, clusterAssment = kMeans(dataMat,4)
print centroids,clusterAssment
Output:
data:
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]
dataMat:
[[ 1.658985 4.285136]
[-3.453687 3.424321]
[ 4.838138 -1.151539]
[-5.379713 -3.362104]
[ 0.972564 2.924086]
[-3.567919 1.531611]
[ 0.450614 -3.302219]
[-3.487105 -1.724432]
[ 2.668759 1.594842]
[-3.156485 3.191137]
[ 3.165506 -3.999838]
[-2.786837 -3.099354]
[ 4.208187 2.984927]
[-2.123337 2.943366]
[ 0.704199 -0.479481]
[-0.39237 -3.963704]
[ 2.831667 1.574018]
[-0.790153 3.343144]
[ 2.943496 -3.357075]
[-3.195883 -2.283926]
[ 2.336445 2.875106]
[-1.786345 2.554248]
[ 2.190101 -1.90602 ]
[-3.403367 -2.778288]
[ 1.778124 3.880832]
[-1.688346 2.230267]
[ 2.592976 -2.054368]
[-4.007257 -3.207066]
[ 2.257734 3.387564]
[-2.679011 0.785119]
[ 0.939512 -4.023563]
[-3.674424 -2.261084]
[ 2.046259 2.735279]
[-3.18947 1.780269]
[ 4.372646 -0.822248]
[-2.579316 -3.497576]
[ 1.889034 5.1904 ]
[-0.798747 2.185588]
[ 2.83652 -2.658556]
[-3.837877 -3.253815]
[ 2.096701 3.886007]
[-2.709034 2.923887]
[ 3.367037 -3.184789]
[-2.121479 -4.232586]
[ 2.329546 3.179764]
[-3.284816 3.273099]
[ 3.091414 -3.815232]
[-3.762093 -2.432191]
[ 3.542056 2.778832]
[-1.736822 4.241041]
[ 2.127073 -2.98368 ]
[-4.323818 -3.938116]
[ 3.792121 5.135768]
[-4.786473 3.358547]
[ 2.624081 -3.260715]
[-4.009299 -2.978115]
[ 2.493525 1.96371 ]
[-2.513661 2.642162]
[ 1.864375 -3.176309]
[-3.171184 -3.572452]
[ 2.89422 2.489128]
[-2.562539 2.884438]
[ 3.491078 -3.947487]
[-2.565729 -2.012114]
[ 3.332948 3.983102]
[-1.616805 3.573188]
[ 2.280615 -2.559444]
[-2.651229 -3.103198]
[ 2.321395 3.154987]
[-1.685703 2.939697]
[ 3.031012 -3.620252]
[-4.599622 -2.185829]
[ 4.196223 1.126677]
[-2.133863 3.093686]
[ 4.668892 -2.562705]
[-2.793241 -2.149706]
[ 2.884105 3.043438]
[-2.967647 2.848696]
[ 4.479332 -1.764772]
[-4.905566 -2.91107 ]]
[[ 3.23128394 0.95068522]
[ 1.44392461 -0.39902729]
[-1.62230418 -2.52007168]
[-3.51187771 3.79667864]]
[[ 2.95373358 2.32801413]
[ 2.5935345 -2.92880329]
[-3.01169468 -3.01238673]
[-2.46154315 2.78737555]]
[[ 2.6265299 3.10868015]
[ 2.80293085 -2.7315146 ]
[-3.38237045 -2.9473363 ]
[-2.46154315 2.78737555]]
[[ 2.6265299 3.10868015]
[ 2.80293085 -2.7315146 ]
[-3.38237045 -2.9473363 ]
[-2.46154315 2.78737555]] [[ 0.00000000e+00 2.32019150e+00]
[ 3.00000000e+00 1.39004893e+00]
[ 1.00000000e+00 6.63839104e+00]
[ 2.00000000e+00 4.16140951e+00]
[ 0.00000000e+00 2.76967820e+00]
[ 3.00000000e+00 2.80101213e+00]
[ 1.00000000e+00 5.85909807e+00]
[ 2.00000000e+00 1.50646425e+00]
[ 0.00000000e+00 2.29348924e+00]
[ 3.00000000e+00 6.45967483e-01]
[ 1.00000000e+00 1.74010499e+00]
[ 2.00000000e+00 3.77769471e-01]
[ 0.00000000e+00 2.51695402e+00]
[ 3.00000000e+00 1.38716420e-01]
[ 1.00000000e+00 9.47633071e+00]
[ 2.00000000e+00 9.97310599e+00]
[ 0.00000000e+00 2.39726914e+00]
[ 3.00000000e+00 3.10242360e+00]
[ 1.00000000e+00 4.11084375e-01]
[ 2.00000000e+00 4.74890795e-01]
[ 0.00000000e+00 1.38706133e-01]
[ 3.00000000e+00 5.10240996e-01]
[ 1.00000000e+00 1.05700176e+00]
[ 2.00000000e+00 2.90181828e-02]
[ 0.00000000e+00 1.31601105e+00]
[ 3.00000000e+00 9.08203769e-01]
[ 1.00000000e+00 5.02608557e-01]
[ 2.00000000e+00 4.57942717e-01]
[ 0.00000000e+00 2.13786618e-01]
[ 3.00000000e+00 4.05632356e+00]
[ 1.00000000e+00 5.14171888e+00]
[ 2.00000000e+00 5.56237495e-01]
[ 0.00000000e+00 4.76142736e-01]
[ 3.00000000e+00 1.54414110e+00]
[ 1.00000000e+00 6.10930460e+00]
[ 2.00000000e+00 9.47660177e-01]
[ 0.00000000e+00 4.87745774e+00]
[ 3.00000000e+00 3.12703929e+00]
[ 1.00000000e+00 6.45118831e-03]
[ 2.00000000e+00 3.01415411e-01]
[ 0.00000000e+00 8.84955695e-01]
[ 3.00000000e+00 7.98870968e-02]
[ 1.00000000e+00 5.23673430e-01]
[ 2.00000000e+00 3.24171404e+00]
[ 0.00000000e+00 9.32523506e-02]
[ 3.00000000e+00 9.13705455e-01]
[ 1.00000000e+00 1.25766593e+00]
[ 2.00000000e+00 4.09563895e-01]
[ 0.00000000e+00 9.46987842e-01]
[ 3.00000000e+00 2.63836399e+00]
[ 1.00000000e+00 5.20371222e-01]
[ 2.00000000e+00 1.86796790e+00]
[ 0.00000000e+00 5.46768776e+00]
[ 3.00000000e+00 5.73153563e+00]
[ 1.00000000e+00 3.12040332e-01]
[ 2.00000000e+00 3.93986735e-01]
[ 0.00000000e+00 1.32864695e+00]
[ 3.00000000e+00 2.38032454e-02]
[ 1.00000000e+00 1.07872914e+00]
[ 2.00000000e+00 4.35369355e-01]
[ 0.00000000e+00 4.55502856e-01]
[ 3.00000000e+00 1.96212809e-02]
[ 1.00000000e+00 1.95213538e+00]
[ 2.00000000e+00 1.54154401e+00]
[ 0.00000000e+00 1.26364010e+00]
[ 3.00000000e+00 1.33108375e+00]
[ 1.00000000e+00 3.02422139e-01]
[ 2.00000000e+00 5.58860689e-01]
[ 0.00000000e+00 9.52516316e-02]
[ 3.00000000e+00 6.25129762e-01]
[ 1.00000000e+00 8.41875177e-01]
[ 2.00000000e+00 2.06159470e+00]
[ 0.00000000e+00 6.39227291e+00]
[ 3.00000000e+00 2.01200372e-01]
[ 1.00000000e+00 3.51030769e+00]
[ 2.00000000e+00 9.83287604e-01]
[ 0.00000000e+00 7.06014703e-02]
[ 3.00000000e+00 2.59901305e-01]
[ 1.00000000e+00 3.74491207e+00]
[ 2.00000000e+00 2.32143993e+00]]
6、二分k-均值聚类
二分-K均值是为了解决k-均值的用户自定义输入簇值k所延伸出来的自己判断k数目,其基本思路是:
为了得到k个簇,将所有点的集合分裂成两个簇,从这些簇中选取一个继续分裂,如此下去,直到产生k个簇。
#encoding :utf-8
from numpy import *
from math import *
#加载数据集
def loadDataSet(fileName): #将一个文本文件导入到列表中
dataMat = [] #创建一个dataMat空列表
fr = open(fileName)
for line in fr.readlines(): #一行一行的读取文本文件
curLine = line.strip().split('\t')
fltLine = map(float,curLine) #将文本文件转化为字符型
dataMat.append(fltLine)
return dataMat
#计算两个向量的欧式距离
def distEclud(vecA, vecB):
return sqrt(sum(power(vecA - vecB, 2)))
#随机构建初始质心
def randCent(dataSet, k): #构建一个包含k个随机质心的集合
#数据列数
n = shape(dataSet)[1]
#初始化质心
centroids = mat(zeros((k,n)))
for j in range(n):
#数据集中每一维的最小和最大值,保证随机选取的质心在边界之内
minJ = min(dataSet[:,j])
rangeJ = float(max(dataSet[:,j]) - minJ)
centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
return centroids
#k-均值算法:四个参数(数据集、簇的数目、距离、创建初始质心)
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
#数据点的行数
m = shape(dataSet)[0]
#用于记录数据点到质心的距离平方
clusterAssment = mat(zeros((m,2)))
#中心点
centroids = createCent(dataSet, k)
#聚类结束标志
clusterChanged = True
while clusterChanged:
clusterChanged = False
#遍历每条数据
for i in range(m):
#设置两个变量,分别存放数据点到质心的距离,及数据点属于哪个质心
minDist = inf; minIndex = -1
#遍历每个质心
for j in range(k):
distJI = distMeas(centroids[j,:],dataSet[i,:])
if distJI < minDist:
#将数据归为最近的质心
minDist = distJI; minIndex = j
#簇分配结果发生改变,更新标志
if clusterAssment[i,0] != minIndex: clusterChanged = True
clusterAssment[i,:] = minIndex,minDist**2
print centroids
#更新质心
for cent in range(k):#recalculate centroids
ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean
return centroids, clusterAssment
def biKmeans(dataSet, k, distMeas=distEclud):
m = shape(dataSet)[0]
clusterAssment = mat(zeros((m,2)))
#创建一个初始簇
centroid0 = mean(dataSet, axis=0).tolist()[0]
centList =[centroid0] #create a list with one centroid
#计算初始误差
for j in range(m):#calc initial Error
clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
while (len(centList) < k):
lowestSSE = inf
#遍历每个簇,尝试划分每个簇
for i in range(len(centList)):
ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#get the data points currently in cluster i
centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
#划分后的误差平方和
sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum
#剩余的误差之和
sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
print "sseSplit, and notSplit: ",sseSplit,sseNotSplit
if (sseSplit + sseNotSplit) < lowestSSE:
bestCentToSplit = i
bestNewCents = centroidMat
bestClustAss = splitClustAss.copy()
lowestSSE = sseSplit + sseNotSplit
#更新分配结果
bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #change 1 to 3,4, or whatever
bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
print 'the bestCentToSplit is: ',bestCentToSplit
print 'the len of bestClustAss is: ', len(bestClustAss)
centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]#replace a centroid with two best centroids
centList.append(bestNewCents[1,:].tolist()[0])
clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE
return mat(centList), clusterAssment
#数据导入
data = loadDataSet('testSet.txt')
print "data:"
print data
print "\n"
#将数据按照矩阵方式输出
dataMat = mat(data)
print "dataMat:"
print dataMat
print "\n"
mat, clusterAssment = biKmeans(dataMat, 4)
print mat, clusterAssment
Output:
data:
[[1.658985, 4.285136], [-3.453687, 3.424321], [4.838138, -1.151539], [-5.379713, -3.362104], [0.972564, 2.924086], [-3.567919, 1.531611], [0.450614, -3.302219], [-3.487105, -1.724432], [2.668759, 1.594842], [-3.156485, 3.191137], [3.165506, -3.999838], [-2.786837, -3.099354], [4.208187, 2.984927], [-2.123337, 2.943366], [0.704199, -0.479481], [-0.39237, -3.963704], [2.831667, 1.574018], [-0.790153, 3.343144], [2.943496, -3.357075], [-3.195883, -2.283926], [2.336445, 2.875106], [-1.786345, 2.554248], [2.190101, -1.90602], [-3.403367, -2.778288], [1.778124, 3.880832], [-1.688346, 2.230267], [2.592976, -2.054368], [-4.007257, -3.207066], [2.257734, 3.387564], [-2.679011, 0.785119], [0.939512, -4.023563], [-3.674424, -2.261084], [2.046259, 2.735279], [-3.18947, 1.780269], [4.372646, -0.822248], [-2.579316, -3.497576], [1.889034, 5.1904], [-0.798747, 2.185588], [2.83652, -2.658556], [-3.837877, -3.253815], [2.096701, 3.886007], [-2.709034, 2.923887], [3.367037, -3.184789], [-2.121479, -4.232586], [2.329546, 3.179764], [-3.284816, 3.273099], [3.091414, -3.815232], [-3.762093, -2.432191], [3.542056, 2.778832], [-1.736822, 4.241041], [2.127073, -2.98368], [-4.323818, -3.938116], [3.792121, 5.135768], [-4.786473, 3.358547], [2.624081, -3.260715], [-4.009299, -2.978115], [2.493525, 1.96371], [-2.513661, 2.642162], [1.864375, -3.176309], [-3.171184, -3.572452], [2.89422, 2.489128], [-2.562539, 2.884438], [3.491078, -3.947487], [-2.565729, -2.012114], [3.332948, 3.983102], [-1.616805, 3.573188], [2.280615, -2.559444], [-2.651229, -3.103198], [2.321395, 3.154987], [-1.685703, 2.939697], [3.031012, -3.620252], [-4.599622, -2.185829], [4.196223, 1.126677], [-2.133863, 3.093686], [4.668892, -2.562705], [-2.793241, -2.149706], [2.884105, 3.043438], [-2.967647, 2.848696], [4.479332, -1.764772], [-4.905566, -2.91107]]
dataMat:
[[ 1.658985 4.285136]
[-3.453687 3.424321]
[ 4.838138 -1.151539]
[-5.379713 -3.362104]
[ 0.972564 2.924086]
[-3.567919 1.531611]
[ 0.450614 -3.302219]
[-3.487105 -1.724432]
[ 2.668759 1.594842]
[-3.156485 3.191137]
[ 3.165506 -3.999838]
[-2.786837 -3.099354]
[ 4.208187 2.984927]
[-2.123337 2.943366]
[ 0.704199 -0.479481]
[-0.39237 -3.963704]
[ 2.831667 1.574018]
[-0.790153 3.343144]
[ 2.943496 -3.357075]
[-3.195883 -2.283926]
[ 2.336445 2.875106]
[-1.786345 2.554248]
[ 2.190101 -1.90602 ]
[-3.403367 -2.778288]
[ 1.778124 3.880832]
[-1.688346 2.230267]
[ 2.592976 -2.054368]
[-4.007257 -3.207066]
[ 2.257734 3.387564]
[-2.679011 0.785119]
[ 0.939512 -4.023563]
[-3.674424 -2.261084]
[ 2.046259 2.735279]
[-3.18947 1.780269]
[ 4.372646 -0.822248]
[-2.579316 -3.497576]
[ 1.889034 5.1904 ]
[-0.798747 2.185588]
[ 2.83652 -2.658556]
[-3.837877 -3.253815]
[ 2.096701 3.886007]
[-2.709034 2.923887]
[ 3.367037 -3.184789]
[-2.121479 -4.232586]
[ 2.329546 3.179764]
[-3.284816 3.273099]
[ 3.091414 -3.815232]
[-3.762093 -2.432191]
[ 3.542056 2.778832]
[-1.736822 4.241041]
[ 2.127073 -2.98368 ]
[-4.323818 -3.938116]
[ 3.792121 5.135768]
[-4.786473 3.358547]
[ 2.624081 -3.260715]
[-4.009299 -2.978115]
[ 2.493525 1.96371 ]
[-2.513661 2.642162]
[ 1.864375 -3.176309]
[-3.171184 -3.572452]
[ 2.89422 2.489128]
[-2.562539 2.884438]
[ 3.491078 -3.947487]
[-2.565729 -2.012114]
[ 3.332948 3.983102]
[-1.616805 3.573188]
[ 2.280615 -2.559444]
[-2.651229 -3.103198]
[ 2.321395 3.154987]
[-1.685703 2.939697]
[ 3.031012 -3.620252]
[-4.599622 -2.185829]
[ 4.196223 1.126677]
[-2.133863 3.093686]
[ 4.668892 -2.562705]
[-2.793241 -2.149706]
[ 2.884105 3.043438]
[-2.967647 2.848696]
[ 4.479332 -1.764772]
[-4.905566 -2.91107 ]]
[[-0.61154259 -0.99840509]
[ 0.21655089 0.84736297]]
[[-0.77465184 -2.93442862]
[ 0.47379212 2.62599895]]
[[-0.54735726 -2.93692713]
[ 0.2978695 2.76065064]]
[[-0.2897198 -2.83942545]
[ 0.08249337 2.94802785]]
sseSplit, and notSplit: 792.916856537 0.0
the bestCentToSplit is: 0
the len of bestClustAss is: 80
[[-1.675842 -3.5709562 ]
[-0.44613055 -1.66674358]]
[[-3.17656652 -2.99858519]
[ 2.90100553 -2.66351205]]
[[-3.38237045 -2.9473363 ]
[ 2.80293085 -2.7315146 ]]
sseSplit, and notSplit: 83.5874695564 326.284075201
[[ 0.17931125 3.21151684]
[-2.49413705 4.77613273]]
[[ 1.6577805 2.93121792]
[-2.84303986 2.97924629]]
[[ 2.6265299 3.10868015]
[-2.46154315 2.78737555]]
sseSplit, and notSplit: 66.36683512 466.632781336
the bestCentToSplit is: 0
the len of bestClustAss is: 40
[[-1.61809661 -4.16618929]
[-2.19748766 -3.31337473]]
[[-1.2569245 -4.098145 ]
[-3.61853111 -2.81946867]]
sseSplit, and notSplit: 19.619284753 377.270301893
[[-4.25497296 3.22644812]
[-3.07524324 4.4511578 ]]
[[-3.28879656 2.53721789]
[ 1.06125497 3.06729526]]
[[-2.64677572 2.78993217]
[ 2.31553173 3.07737886]]
[[-2.46154315 2.78737555]
[ 2.6265299 3.10868015]]
sseSplit, and notSplit: 66.36683512 83.5874695564
[[ 1.60206889 -3.5436413 ]
[ 3.03830692 -1.88925585]]
[[ 2.3728161 -3.548637 ]
[ 3.2330456 -1.9143922]]
[[ 2.44798442 -3.43588358]
[ 3.3353505 -1.67496113]]
[[ 2.47787177 -3.37608915]
[ 3.406612 -1.53444757]]
sseSplit, and notSplit: 31.6296069802 358.885318066
the bestCentToSplit is: 1
the len of bestClustAss is: 40
[[-3.38237045 -2.9473363 ]
[-2.46154315 2.78737555]
[ 2.80293085 -2.7315146 ]
[ 2.6265299 3.10868015]] [[ 3.00000000e+00 2.32019150e+00]
[ 1.00000000e+00 1.39004893e+00]
[ 2.00000000e+00 6.63839104e+00]
[ 0.00000000e+00 4.16140951e+00]
[ 3.00000000e+00 2.76967820e+00]
[ 1.00000000e+00 2.80101213e+00]
[ 2.00000000e+00 5.85909807e+00]
[ 0.00000000e+00 1.50646425e+00]
[ 3.00000000e+00 2.29348924e+00]
[ 1.00000000e+00 6.45967483e-01]
[ 2.00000000e+00 1.74010499e+00]
[ 0.00000000e+00 3.77769471e-01]
[ 3.00000000e+00 2.51695402e+00]
[ 1.00000000e+00 1.38716420e-01]
[ 2.00000000e+00 9.47633071e+00]
[ 0.00000000e+00 9.97310599e+00]
[ 3.00000000e+00 2.39726914e+00]
[ 1.00000000e+00 3.10242360e+00]
[ 2.00000000e+00 4.11084375e-01]
[ 0.00000000e+00 4.74890795e-01]
[ 3.00000000e+00 1.38706133e-01]
[ 1.00000000e+00 5.10240996e-01]
[ 2.00000000e+00 1.05700176e+00]
[ 0.00000000e+00 2.90181828e-02]
[ 3.00000000e+00 1.31601105e+00]
[ 1.00000000e+00 9.08203769e-01]
[ 2.00000000e+00 5.02608557e-01]
[ 0.00000000e+00 4.57942717e-01]
[ 3.00000000e+00 2.13786618e-01]
[ 1.00000000e+00 4.05632356e+00]
[ 2.00000000e+00 5.14171888e+00]
[ 0.00000000e+00 5.56237495e-01]
[ 3.00000000e+00 4.76142736e-01]
[ 1.00000000e+00 1.54414110e+00]
[ 2.00000000e+00 6.10930460e+00]
[ 0.00000000e+00 9.47660177e-01]
[ 3.00000000e+00 4.87745774e+00]
[ 1.00000000e+00 3.12703929e+00]
[ 2.00000000e+00 6.45118831e-03]
[ 0.00000000e+00 3.01415411e-01]
[ 3.00000000e+00 8.84955695e-01]
[ 1.00000000e+00 7.98870968e-02]
[ 2.00000000e+00 5.23673430e-01]
[ 0.00000000e+00 3.24171404e+00]
[ 3.00000000e+00 9.32523506e-02]
[ 1.00000000e+00 9.13705455e-01]
[ 2.00000000e+00 1.25766593e+00]
[ 0.00000000e+00 4.09563895e-01]
[ 3.00000000e+00 9.46987842e-01]
[ 1.00000000e+00 2.63836399e+00]
[ 2.00000000e+00 5.20371222e-01]
[ 0.00000000e+00 1.86796790e+00]
[ 3.00000000e+00 5.46768776e+00]
[ 1.00000000e+00 5.73153563e+00]
[ 2.00000000e+00 3.12040332e-01]
[ 0.00000000e+00 3.93986735e-01]
[ 3.00000000e+00 1.32864695e+00]
[ 1.00000000e+00 2.38032454e-02]
[ 2.00000000e+00 1.07872914e+00]
[ 0.00000000e+00 4.35369355e-01]
[ 3.00000000e+00 4.55502856e-01]
[ 1.00000000e+00 1.96212809e-02]
[ 2.00000000e+00 1.95213538e+00]
[ 0.00000000e+00 1.54154401e+00]
[ 3.00000000e+00 1.26364010e+00]
[ 1.00000000e+00 1.33108375e+00]
[ 2.00000000e+00 3.02422139e-01]
[ 0.00000000e+00 5.58860689e-01]
[ 3.00000000e+00 9.52516316e-02]
[ 1.00000000e+00 6.25129762e-01]
[ 2.00000000e+00 8.41875177e-01]
[ 0.00000000e+00 2.06159470e+00]
[ 3.00000000e+00 6.39227291e+00]
[ 1.00000000e+00 2.01200372e-01]
[ 2.00000000e+00 3.51030769e+00]
[ 0.00000000e+00 9.83287604e-01]
[ 3.00000000e+00 7.06014703e-02]
[ 1.00000000e+00 2.59901305e-01]
[ 2.00000000e+00 3.74491207e+00]
[ 0.00000000e+00 2.32143993e+00]]