Cluster Analysis: Hierarchical Clustering and the k-means Algorithm

Date: 2021-03-15 16:47:48

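References:

[1] Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. Mining of Massive Datasets (2nd edition; Chinese translation: 大数据:互联网大规模数据挖掘与分布式处理) [M]. Beijing: Posts & Telecom Press, 2015: 190-199.

[2] Jiang Shengyi, Li Xia, Zheng Qi. Principles and Practice of Data Mining (数据挖掘原理与实践) [M]. Beijing: Publishing House of Electronics Industry, 2015.1: 107-114, 121.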

Contents:

1. Test case

2. Principles

3. Source code examples

4. Run results


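1. Test case:

The test uses the FISHERIRIS data set from the internationally used UCI database. Its meas set contains 150 samples, each with 4 attributes of an iris plant: sepal length, sepal width, petal length, and petal width, all in cm. The samples belong to the three flower classes of the species set: setosa, versicolor, and virginica.
Run the following on this data set:
(1) a hierarchical clustering algorithm
(2) the k-means clustering algorithm
Compare the resulting clusters with the labels in the species set, and report the accuracy and running time of both algorithms.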

Figure 1.1 Screenshot 1 of part of the test case in Excel

Figure 1.2 Screenshot 2 of part of the test case in Excel

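2. Principles:

(1) Definition of clustering:

Clustering partitions a data set into several groups, or clusters, of similar objects, so that the similarity between objects in the same group is maximized and the similarity between objects in different groups is minimized.

Clustering is an unsupervised machine learning method: nothing is known in advance about how the data set is distributed. It is the process of organizing a collection of physical or abstract objects into groups of similar objects.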

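(2) Steps of a cluster analysis task:

① Pattern representation (including feature extraction and selection)

② Definition of a pattern similarity measure suited to the data domain

③ Clustering or partitioning algorithm

④ Data abstraction

⑤ Evaluation of the output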

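(3) The k-means algorithm:

First, randomly select k objects, each representing the initial mean, or center, of one cluster. Assign every remaining object to the nearest (most similar) cluster according to its distance from each cluster center. Then compute the new mean of each cluster to obtain the updated cluster centers. Repeat the assignment and update steps until the criterion function (usually the sum of squared distances from each object to its cluster center) converges.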
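The assign-then-update loop described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the class and method names are hypothetical, the first k points are assumed as initial centers instead of a random choice, and the loop stops when no assignment changes rather than by testing a criterion function.

```java
import java.util.Arrays;

/** Illustrative k-means sketch; all names here are hypothetical. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns, for every point, the index of the cluster it ends up in. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        // Assumption: the first k points serve as initial centers (deterministic, not random).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = squaredDistance(points[i], centers[0]);
                for (int c = 1; c < k; c++) {
                    double dist = squaredDistance(points[i], centers[c]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centers would not move either
            }
            // Update step: move each center to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep the old center of an empty cluster
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}, {11, 10} };
        // prints [0, 1, 0, 1, 0, 1]
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

Comparing squared distances avoids the square root, which does not change which center is nearest.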

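(4) Bottom-up hierarchical clustering (agglomerative hierarchical clustering):

Start with every object as its own cluster, then repeatedly merge clusters into larger and larger ones until all objects are in a single cluster or a termination condition is met.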
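A minimal sketch of this bottom-up process follows. It assumes that the distance between two clusters is the distance between their centroids and that merging stops once k clusters remain; other linkage rules and stopping conditions are equally valid. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative agglomerative clustering sketch using centroid distance. */
final class AgglomerativeSketch {

    /** Mean of the points whose indices are listed in members. */
    static double[] centroid(double[][] points, List<Integer> members) {
        double[] mean = new double[points[0].length];
        for (int index : members) {
            for (int d = 0; d < mean.length; d++) {
                mean[d] += points[index][d];
            }
        }
        for (int d = 0; d < mean.length; d++) {
            mean[d] /= members.size();
        }
        return mean;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    /** Merges the two closest clusters until k remain; each cluster is a list of point indices. */
    static List<List<Integer>> cluster(double[][] points, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Every point starts as its own cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bestA = 0;
            int bestB = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                double[] centerA = centroid(points, clusters.get(a));
                for (int b = a + 1; b < clusters.size(); b++) {
                    double dist = squaredDistance(centerA, centroid(points, clusters.get(b)));
                    if (dist < bestDist) {
                        bestDist = dist;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            clusters.get(bestA).addAll(clusters.get(bestB));
            clusters.remove(bestB); // bestB > bestA, so the index of bestA stays valid
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] points = { {0}, {1}, {10}, {11}, {20} };
        // prints [[0, 1], [2, 3], [4]]
        System.out.println(cluster(points, 3));
    }
}
```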

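(5) Drawbacks of the k-means algorithm:

① The number of clusters k must be given in advance.

② The result depends heavily on the choice of initial centers, and the algorithm often gets stuck in a local optimum.

③ The algorithm repeatedly reassigns samples and recomputes the cluster centers, so its running time becomes very large when the data set is very large.

④ Because the centroid (mean) of each cluster is used as the center for the next round, outliers and noise points far from the dense regions pull the centers away from the true dense regions, so k-means is sensitive to noise and outliers.

⑤ k-means cannot find clusters with non-convex shapes, or clusters of widely differing size or density; that is, it has difficulty detecting "natural" clusters.

⑥ It can only handle data sets with numeric attributes and cannot handle data sets that contain categorical attributes.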
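Drawback ④ is easy to see on a one-dimensional toy cluster. The sketch below uses made-up values and hypothetical names; it compares the mean, which k-means uses as the center, with the median, shown only as a contrast.

```java
import java.util.Arrays;

/** Shows how a single outlier drags the mean of a cluster but not its median. */
final class OutlierEffectSketch {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] clean = {1.0, 2.0, 3.0, 4.0, 5.0};
        double[] withOutlier = {1.0, 2.0, 3.0, 4.0, 100.0}; // one value replaced by an outlier
        System.out.println(mean(clean));          // prints 3.0
        System.out.println(mean(withOutlier));    // prints 22.0
        System.out.println(median(withOutlier));  // prints 3.0
    }
}
```

With the outlier, the mean moves from 3.0 to 22.0, far from where four of the five values lie, which is exactly how a k-means center drifts away from the dense region.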


3. Source code examples:

(1) Project directory:

Figure 3.1 Screenshot of the project directory

(2)KMeans.java

package com.remoa.experiment4.service;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

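/**
* K-Means algorithm
* @author Remoa
*
*/
public class KMeans {
//Upper bound on the Euclidean distance, fixed at 5000; a cluster is only chosen if its distance is below this value
public static final double MAXLENGTH = 5000.0;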

/**
* Recompute the center of every cluster
* @param dataVO DataVO entity
* @return the updated DataVO entity
*/
public static DataVO countClusterCenter(DataVO dataVO){
List<ClusterVO> clusterList = dataVO.getClusterList();
List<ClusterVO> newClusterList = new ArrayList<ClusterVO>();
int i, j, p;
for(i = 0; i < clusterList.size(); i++){
ClusterVO cluster = clusterList.get(i);
List<PointVO> pointList = cluster.getPointList();
Double[] countArray = new Double[clusterList.get(0).getPointList().get(0).getPoint().length];
for(j = 0; j < countArray.length; j++){
countArray[j] = 0.0;
}
for(j = 0; j < pointList.size(); j++){
PointVO point = pointList.get(j);
Double[] pointValue = point.getPoint();
for(p = 0; p < pointValue.length; p++){
countArray[p] = pointValue[p] + countArray[p];
}
}
for(j = 0; j < countArray.length; j++){
countArray[j] /= pointList.size();
}
cluster.setClusterCenter(countArray);
newClusterList.add(cluster);
}
dataVO.setClusterList(newClusterList);
return dataVO;
}

/**
* Assign a data point to the cluster whose center is nearest
* @param dataVO DataVO entity
* @param point the data point
* @return the modified DataVO entity
*/
public static DataVO distributeIntoCluster(DataVO dataVO, PointVO point){
double sum = 0.0, minDistance = MAXLENGTH;
//loca is the index of the point in dataVO's point list; locaRecord is the index of the nearest cluster
int locaRecord = 0, loca = 0;
int i, j, count, n, m;
List<ClusterVO> clusterList = dataVO.getClusterList();
List<PointVO> pointList = dataVO.getPointList();
List<PointVO> clusterPointList = null;
Double[] distanceArray = new Double[clusterList.size()];
//Coordinates of the data point
Double[] pointValueArray = point.getPoint();
Double[] tempArray = new Double[pointValueArray.length];
//Visit every cluster
for(i = 0; i < clusterList.size(); i++){
sum = 0.0;
//Center of this cluster
Double[] clusterCenter = clusterList.get(i).getClusterCenter();
//Store the squared per-dimension differences in a temporary array
for(j = 0; j < pointValueArray.length; j++){
tempArray[j] = Math.pow(clusterCenter[j] - pointValueArray[j], 2);
}
//Sum them to get the squared Euclidean distance
for(j = 0; j < tempArray.length; j++){
sum += tempArray[j];
}
//Store the Euclidean distance in the distance array
distanceArray[i] = Math.sqrt(sum);
}
//Scan the distance array for the nearest cluster
for(i = 0; i < distanceArray.length; i++){
if(distanceArray[i] < minDistance){
minDistance = distanceArray[i];
locaRecord = i;
}
}
//The nearest cluster
ClusterVO cluster = clusterList.get(locaRecord);
//Find the index of the point in dataVO's point list
for(i = 0; i < pointList.size(); i++){
if(pointList.get(i).equals(point)){
loca = i;
break;
}
}
//The point is already in the nearest cluster: nothing to do
if(cluster.getClusterid().equals(point.getClusterid())){
return dataVO;
}
//The point is not in any cluster yet: add it
else if(point.getClusterid() == null){
clusterPointList = cluster.getPointList();
}
//The point is currently in a different cluster
else{
clusterPointList = cluster.getPointList();
//Search the clusters for a point with the same values and remove it from its old cluster
for(i = 0; i < clusterList.size(); i++){
boolean flag = false;
//Visit every point in the cluster
for(m = 0; m < clusterList.get(i).getPointList().size(); m++){
PointVO everypoint = clusterList.get(i).getPointList().get(m);
Double[] everypointValue = everypoint.getPoint();
count = 0;
for(n = 0; n < everypointValue.length; n++){
if(pointValueArray[n].doubleValue() == everypointValue[n].doubleValue()){
count++;
}
}
if(count == everypointValue.length){
clusterList.get(i).getPointList().remove(m);
flag = true;
break;
}
}
if(flag){
break;
}
}
}
//Record the id of the cluster the point now belongs to
point.setClusterid(cluster.getClusterid());
//Update the point in dataVO's point list
pointList.set(loca, point);
//Add the point to the cluster's point list
clusterPointList.add(point);
//Store the point list back in the cluster
cluster.setPointList(clusterPointList);
//Update the cluster in dataVO's cluster list
clusterList.set(locaRecord, cluster);
//Store the cluster list back in dataVO
dataVO.setClusterList(clusterList);
//Store the point list back in dataVO
dataVO.setPointList(pointList);
return dataVO;
}

/**
* Initialize the DataVO
* @param cellList list with one Cell[] per Excel row
* @param k the k of the k-means algorithm
* @return the initialized DataVO entity
*/
public static DataVO initDataVO(List<Cell[]> cellList, int k){
int i, j;
DataVO dataVO = new DataVO();
List<PointVO> pointList = new ArrayList<PointVO>();
List<ClusterVO> clusterList = new ArrayList<ClusterVO>();
List<ClusterVO> newClusterList = new ArrayList<ClusterVO>();
Cell[] cell = new Cell[cellList.get(0).length];
//Wrap every row in a PointVO and add it to the DataVO
for(i = 0; i < cellList.size(); i++){
cell = cellList.get(i);
Double[] point = new Double[cellList.get(0).length];
for(j = 0; j < cell.length; j++){
point[j] = Double.valueOf(cell[j].getContents());
}
PointVO pointVO = new PointVO();
pointVO.setPoint(point);
pointVO.setPointName(null);
if(i < k){
String clusterid = UUID.randomUUID().toString();
pointVO.setClusterid(clusterid);
ClusterVO cluster = new ClusterVO();
cluster.setClusterid(clusterid);
clusterList.add(cluster);
}else{
pointVO.setClusterid(null);
}
pointList.add(pointVO);
}
dataVO.setPointList(pointList);
//Use the first k points as the initial k clusters (a fixed choice, not a random one)
for(i = 0; i < k; i++){
cell = cellList.get(i);
Double[] point = new Double[cellList.get(0).length];
for(j = 0; j < cell.length; j++){
point[j] = Double.valueOf(cell[j].getContents());
}
ClusterVO cluster = clusterList.get(i);
cluster.setClusterCenter(point);
List<PointVO> clusterPointList = new ArrayList<PointVO>();
PointVO pointVO = new PointVO();
pointVO.setPoint(point);
clusterPointList.add(pointVO);
cluster.setPointList(clusterPointList);
newClusterList.add(cluster);
}
dataVO.setClusterList(newClusterList);
return dataVO;
}

}
(3)HierarchicalAlgorithm.java

package com.remoa.experiment4.service;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

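/**
* Hierarchical clustering algorithm
* @author Remoa
*
*/
public class HierarchicalAlgorithm {
//Upper bound, fixed at 5000, on the squared Euclidean distance compared in mergeCluster
public static final double MAXLENGTH = 5000.0;

/**
* Initialize the DataVO entity for hierarchical clustering: every point starts as its own cluster
* @param cellList list with one Cell[] per Excel row
* @return the initialized DataVO entity
*/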
public static DataVO initDataVO(List<Cell[]> cellList){
int i, j;
DataVO dataVO = new DataVO();
List<ClusterVO> clusterList = new ArrayList<ClusterVO>();
List<PointVO> pointList = new ArrayList<PointVO>();
Cell[] cell = new Cell[cellList.get(0).length];
for(i = 0; i < cellList.size(); i++){
cell = cellList.get(i);
Double[] point = new Double[cellList.get(0).length];
for(j = 0; j < cell.length; j++){
point[j] = Double.valueOf(cell[j].getContents());
}
List<PointVO> clusterPointList = new ArrayList<PointVO>();
ClusterVO cluster = new ClusterVO();
PointVO pointVO = new PointVO();
pointVO.setPoint(point);
pointVO.setPointName(null);
String clusterId = UUID.randomUUID().toString();
pointVO.setClusterid(clusterId);
clusterPointList.add(pointVO);
cluster.setClusterCenter(point);
cluster.setClusterid(clusterId);
cluster.setPointList(clusterPointList);
clusterList.add(cluster);
pointList.add(pointVO);
}
dataVO.setClusterList(clusterList);
dataVO.setPointList(pointList);
return dataVO;
}

/**
* Merge the two clusters whose centers are closest
* @param dataVO DataVO entity
* @return the modified DataVO entity
*/
public static DataVO mergeCluster(DataVO dataVO){
double minDistance = MAXLENGTH;
//Temporary array for the squared per-dimension differences
Double[] tempArray = new Double[dataVO.getClusterList().get(0).getClusterCenter().length];
//Indices of the two clusters to merge
int clusterLoca1 = 0, clusterLoca2 = 0;
int j, m, count, n, p;
double sum;
//Visit every pair of clusters
for(int i = 0; i < dataVO.getClusterList().size(); i++){
//Center of the first cluster
Double[] clusterCenter1 = dataVO.getClusterList().get(i).getClusterCenter();
for(int k = i + 1; k < dataVO.getClusterList().size(); k++){
sum = 0.0;
//Center of the second cluster
Double[] clusterCenter2 = dataVO.getClusterList().get(k).getClusterCenter();
//Sum the squared differences: the squared Euclidean distance (no square root is needed for the comparison)
for(j = 0; j < tempArray.length; j++){
tempArray[j] = Math.pow(clusterCenter1[j] - clusterCenter2[j], 2);
sum += tempArray[j].doubleValue();
}
if(sum < minDistance){
minDistance = sum;
clusterLoca1 = i;//index of the first cluster
clusterLoca2 = k;//index of the second cluster
}
}
}
//Merge the two clusters
String clusterid = UUID.randomUUID().toString();
ClusterVO cluster1 = dataVO.getClusterList().get(clusterLoca1);
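//For every point of the first cluster, update the cluster id of the matching point in dataVO's point list
for(m = 0; m < cluster1.getPointList().size(); m++){
Double[] pointValueArray = cluster1.getPointList().get(m).getPoint();
List<PointVO> everypoint = dataVO.getPointList();
for(n = 0; n < everypoint.size(); n++){
//Reset the match counter for each candidate point; otherwise partial matches accumulate across points
count = 0;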
Double[] everypointValue = everypoint.get(n).getPoint();
for(p = 0; p < everypointValue.length; p++){
if(pointValueArray[p].doubleValue() == everypointValue[p].doubleValue()){
count++;
}
}
if(count == everypointValue.length){
PointVO newpoint1 = everypoint.get(n);
newpoint1.setClusterid(clusterid);
dataVO.getPointList().set(n, newpoint1);
break;
}
}
}
//Update the cluster id of the points held in the cluster
for(m = 0; m < cluster1.getPointList().size(); m++ ){
PointVO point = cluster1.getPointList().get(m);
point.setClusterid(clusterid);
cluster1.getPointList().set(m, point);
}
ClusterVO cluster2 = dataVO.getClusterList().get(clusterLoca2);
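//For every point of the second cluster, update the cluster id of the matching point in dataVO's point list
for(m = 0; m < cluster2.getPointList().size(); m++){
Double[] pointValueArray = cluster2.getPointList().get(m).getPoint();
List<PointVO> everypoint = dataVO.getPointList();
for(n = 0; n < everypoint.size(); n++){
//Reset the match counter for each candidate point; otherwise partial matches accumulate across points
count = 0;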
Double[] everypointValue = everypoint.get(n).getPoint();
for(p = 0; p < everypointValue.length; p++){
if(pointValueArray[p].doubleValue() == everypointValue[p].doubleValue()){
count++;
}
}
if(count == everypointValue.length){
PointVO newpoint2 = everypoint.get(n);
newpoint2.setClusterid(clusterid);
dataVO.getPointList().set(n, newpoint2);
break;
}
}
}
//Update the cluster id of the points held in the cluster
for(m = 0; m < cluster2.getPointList().size(); m++ ){
PointVO point = cluster2.getPointList().get(m);
point.setClusterid(clusterid);
cluster2.getPointList().set(m, point);
}
ClusterVO newCluster = new ClusterVO();
List<PointVO> newPointList = new ArrayList<PointVO>();
newPointList.addAll(cluster1.getPointList());
newPointList.addAll(cluster2.getPointList());
Double[] clusterCenter1 = cluster1.getClusterCenter();
Double[] clusterCenter2 = cluster2.getClusterCenter();
Double[] newCenter = new Double[clusterCenter1.length];
for(int i = 0; i < clusterCenter1.length; i++){
newCenter[i] = (clusterCenter1[i] * cluster1.getPointList().size() + clusterCenter2[i] * cluster2.getPointList().size()) / (cluster1.getPointList().size() + cluster2.getPointList().size());
}
newCluster.setClusterCenter(newCenter);
newCluster.setClusterid(clusterid);
newCluster.setPointList(newPointList);
dataVO.getClusterList().set(clusterLoca1, newCluster);
dataVO.getClusterList().remove(clusterLoca2);
return dataVO;
}

}
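
The new center computed at the end of mergeCluster is the size-weighted mean of the two old centers, which equals the mean of all points in the merged cluster. A standalone sketch of just that step, with hypothetical names and plain double arrays:

```java
/** Size-weighted merge of two cluster centers; names are illustrative only. */
final class CentroidMergeSketch {

    /** c1 and c2 are the centers of clusters holding n1 and n2 points. */
    static double[] merge(double[] c1, int n1, double[] c2, int n2) {
        double[] merged = new double[c1.length];
        for (int d = 0; d < c1.length; d++) {
            merged[d] = (c1[d] * n1 + c2[d] * n2) / (n1 + n2);
        }
        return merged;
    }

    public static void main(String[] args) {
        // One point at (0, 0) merged with a two-point cluster centered at (3, 6).
        double[] merged = merge(new double[] {0, 0}, 1, new double[] {3, 6}, 2);
        System.out.println(merged[0] + ", " + merged[1]); // prints 2.0, 4.0
    }
}
```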

(4)CorrectRate.java

package com.remoa.experiment4.service;

import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import com.remoa.experiment4.common.ImportData;
import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

public class CorrectRate {
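/**
* Build the mapping from each measurement row to its true label, using the two Excel sheets
* @return map from attribute values to species name
*/
public static Map<Double[], String> getMap(){
List<Cell[]> resultCellList = ImportData.importResultData();
List<Cell[]> testCellList = ImportData.importData();
//LinkedHashMap keeps the rows in sheet order, so the iteration index used in getCorrectRate matches the row index
Map<Double[], String> map = new java.util.LinkedHashMap<Double[], String>();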
for(int j = 0; j < testCellList.size(); j++){
Cell[] testCell = testCellList.get(j);
Cell[] resultCell = resultCellList.get(j);
String name = resultCell[0].getContents();
Double[] cellValue = new Double[testCell.length];
for(int i = 0; i < testCell.length; i++){
cellValue[i] = Double.valueOf(testCell[i].getContents());
}
map.put(cellValue, name);
}
return map;
}

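/**
* Compute and print the accuracy
* @param dataVO DataVO entity holding the final clusters
*/
public static void getCorrectRate(DataVO dataVO){
int maxLoca = 0;//index of the most frequent label
int maxSize = 0;//count of the most frequent label
int sum = 0;//total number of correct items
Map<Double[], String> map = getMap();
//Label assigned to each cluster
String[] clusterNameReal = new String[dataVO.getClusterList().size()];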
Set<String> set = new HashSet<String>();
for(Iterator<Entry<Double[], String>> iter = map.entrySet().iterator(); iter.hasNext();){
Map.Entry<Double[], String> entry = iter.next();
String value = entry.getValue();
set.add(value);
}
//Collect the distinct label names
String[] clusterNameArray = set.toArray(new String[dataVO.getClusterList().size()]);
int[] countArray = new int[clusterNameArray.length];
//Number of correct items in each cluster
int[] correctArray = new int[clusterNameArray.length];
for(int j = 0; j < countArray.length; j++){
correctArray[j] = 0;
}
List<ClusterVO> clusterList = dataVO.getClusterList();
//For each cluster, take the most frequent true label among its points as the cluster's label
for(int i = 0; i < clusterList.size(); i++){
//Reset the counters
for(int j = 0; j < countArray.length; j++){
countArray[j] = 0;
}
//Reset the count of the most frequent label
maxSize = 0;
//Points in this cluster
List<PointVO> pointList = clusterList.get(i).getPointList();
//Look up the true label of every point in the cluster
for(int j = 0; j < pointList.size(); j++){
String valueStr = "";
Double[] testDoubleArray = dataVO.getClusterList().get(i).getPointList().get(j).getPoint();
Set<Double[]> valueSet = map.keySet();
int temp = 0;
for(Iterator<Double[]> iter = valueSet.iterator(); iter.hasNext(); ){
int countSame = 0;
Double[] valueArray = iter.next();
for(int m = 0; m < valueArray.length; m++){
if(valueArray[m].doubleValue() == testDoubleArray[m].doubleValue()){
countSame++;
}
}
if(countSame == valueArray.length){
valueStr = map.get(valueArray);
dataVO.getPointList().get(temp).setPointName(valueStr);
dataVO.getClusterList().get(i).getPointList().get(j).setPointName(valueStr);
break;
}
temp++;
}
for(int m = 0; m < clusterNameArray.length; m++){
if(clusterNameArray[m].equals(valueStr)){
countArray[m]++;
}
}
}
for(int z = 0; z < countArray.length; z++){
if(countArray[z] >= maxSize){
maxSize = countArray[z];
maxLoca = z;
}
}
clusterNameReal[i] = clusterNameArray[maxLoca];
correctArray[i] = maxSize;
}
System.out.println("###############################");
for(int i = 0; i < correctArray.length; i++){
sum += correctArray[i];
System.out.println("簇" + clusterNameReal[i] + "共有" + dataVO.getClusterList().get(i).getPointList().size() + "项,其中正确项有" + correctArray[i] + "项;");
}
System.out.println("项的总数为:" + dataVO.getPointList().size() + "项");
double result = sum * 1.0 / dataVO.getPointList().size() * 100;
DecimalFormat df = new DecimalFormat("0.00");
System.out.println("正确率为:" + df.format(result) + "%");
}

}
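
The accuracy that getCorrectRate reports labels each cluster with its most frequent true label and counts the points that carry that label, a measure often called purity. A compact sketch of the same idea follows; it assumes the true labels of each cluster's points are already available as lists of strings, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Majority-label accuracy (purity) of a clustering; names are illustrative only. */
final class PuritySketch {

    /** Each inner list holds the true labels of the points in one cluster. */
    static double purity(List<List<String>> clusters) {
        int total = 0;
        int correct = 0;
        for (List<String> labels : clusters) {
            Map<String, Integer> counts = new HashMap<>();
            int majority = 0;
            for (String label : labels) {
                int seen = counts.merge(label, 1, Integer::sum);
                majority = Math.max(majority, seen);
            }
            correct += majority; // points that agree with the cluster's majority label
            total += labels.size();
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        List<List<String>> clusters = Arrays.asList(
                Arrays.asList("setosa", "setosa", "versicolor"),
                Arrays.asList("versicolor", "versicolor", "versicolor", "setosa"));
        System.out.println(purity(clusters)); // (2 + 3) / 7
    }
}
```

Nothing in this measure stops two clusters from receiving the same majority label, so it can look better than a strict one-to-one matching of clusters to classes would.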

(5)ImportData.java

package com.remoa.experiment4.common;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;

import com.remoa.experiment4.common.util.ExcelUtil;

import jxl.Cell;
import jxl.Workbook;

/**
* Utility class that loads the required data
* @author Remoa
*
*/
public class ImportData {
/**
* Import the test data
* @return list holding the test data
*/
public static List<Cell[]> importData(){
Properties prop = null;
try {
prop = new Properties();
InputStream is = new FileInputStream("DataLoadIn.properties");
prop.load(is);
is.close();
} catch (Exception e) {
System.out.println("未能读取到Excel文件,修改配置文件路径后重试!");
e.printStackTrace();
}
String absolutePath = prop.getProperty("absolutePath");
int sheetLoca = Integer.valueOf(prop.getProperty("sheetLoca"));
int initRowLoca = Integer.valueOf(prop.getProperty("initRowLoca"));
Workbook workbook = ExcelUtil.readExcel(absolutePath);
List<Cell[]> list = ExcelUtil.sheetEncapsulation(workbook, sheetLoca, initRowLoca);
return list;
}

/**
* Get the number of clusters
* @return the number of clusters
*/
public static int getclusterNumber(){
Properties prop = null;
try {
prop = new Properties();
InputStream is = new FileInputStream("DataLoadIn.properties");
prop.load(is);
is.close();
} catch (Exception e) {
System.out.println("未能读取到Excel文件,修改配置文件路径后重试!");
e.printStackTrace();
}
int clusterNumber = Integer.valueOf(prop.getProperty("clusterNumber"));
return clusterNumber;
}

/**
* Import the correct classification labels
* @return list holding the label data
*/
public static List<Cell[]> importResultData(){
Properties prop = null;
try {
prop = new Properties();
InputStream is = new FileInputStream("ResultLoadIn.properties");
prop.load(is);
is.close();
} catch (Exception e) {
System.out.println("未能读取到Excel文件,修改配置文件路径后重试!");
e.printStackTrace();
}
String absolutePath = prop.getProperty("absolutePath");
int sheetLoca = Integer.valueOf(prop.getProperty("sheetLoca"));
int initRowLoca = Integer.valueOf(prop.getProperty("initRowLoca"));
Workbook workbook = ExcelUtil.readExcel(absolutePath);
List<Cell[]> list = ExcelUtil.sheetEncapsulation(workbook, sheetLoca, initRowLoca);
return list;
}

}
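
The three loaders above open the properties file by hand and carry on after a failed read, in which case the later getProperty calls return null. One possible tightening, shown as a sketch with hypothetical class and method names, uses try-with-resources and fails early on a missing key:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

/** Illustrative configuration helper; not part of the original project. */
final class ConfigSketch {

    /** Loads a properties file; the stream is closed even if loading fails. */
    static Properties loadFile(String fileName) throws IOException {
        Properties prop = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
            prop.load(in);
        }
        return prop;
    }

    /** Parses properties from text, which is handy for trying the sketch without a file. */
    static Properties parse(String text) {
        Properties prop = new Properties();
        try (StringReader reader = new StringReader(text)) {
            prop.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prop;
    }

    /** Reads a required integer property and fails with a clear message if it is missing. */
    static int requireInt(Properties prop, String key) {
        String value = prop.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties prop = parse("sheetLoca=0\ninitRowLoca=1\nclusterNumber=3\n");
        System.out.println(requireInt(prop, "clusterNumber")); // prints 3
    }
}
```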
(6)ExcelUtil.java

package com.remoa.experiment4.common.util;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.read.biff.BiffException;

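/**
* Excel utility class; requires the jxl jar. Commonly used jxl calls:
* (1) Workbook is the Excel file, Sheet is a worksheet, and Cell is a single cell
* (2) sheet.getCell(x, y): get the cell at column x, row y
* (3) Workbook.getWorkbook(File): open the workbook stored in a file
* (4) workbook.getSheet(0): get worksheet 0 (the first one)
* (5) cell.getContents(): get the contents of the cell
* (6) Cell[] cells = sheet.getColumn(column): get all cells in a column
* (7) Cell[] cells = sheet.getRow(row): get all cells in a row
* @author Remoa
*
*/
public class ExcelUtil {
/**
* Read an Excel file
* @param filePath absolute path of the Excel file
* @return the Workbook
*/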
public static Workbook readExcel(String filePath){
File file = null;
Workbook workbook = null;
file = new File(filePath);
try {
workbook = Workbook.getWorkbook(file);
} catch (BiffException e) {
System.out.println("输入流读入为空,java读取Excel异常");
e.printStackTrace();
} catch (IOException e) {
System.out.println("IO异常");
e.printStackTrace();
}
return workbook;
}

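/**
* Wrap the contents of a worksheet
* @param workbook the Excel file
* @param sheetLoca index of the worksheet
* @param initRowLoca first data row, i.e. the row number where the non-header records start
* @return list with one Cell[] per row
*/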
public static List<Cell[]> sheetEncapsulation(Workbook workbook, int sheetLoca, int initRowLoca){
Sheet sheet = workbook.getSheet(sheetLoca);
List<Cell[]> list = new ArrayList<Cell[]>();
Cell[] cells = null;
int i = initRowLoca - 1, length = sheet.getRows() - initRowLoca + 1;
while(length-- != 0){
cells = sheet.getRow(i);
list.add(cells);
i++;
}
return list;
}

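/**
* When the header spans several rows, fetch one specific header row as a Cell array
* @param workbook the Excel file
* @param sheetLoca index of the worksheet
* @param wantLoca position of the header row wanted
* @return the header row as a Cell[] array
*/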
public static Cell[] getHeadInfo(Workbook workbook, int sheetLoca, int wantLoca){
if(wantLoca == -1){
return null;
}else{
Sheet sheet = workbook.getSheet(sheetLoca);
Cell[] cells = sheet.getRow(wantLoca - 1);
return cells;
}
}

}
(7)PrintUtil.java

package com.remoa.experiment4.common.util;

import java.util.Iterator;
import java.util.List;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

/**
* Printing utility class
* @author Remoa
*
*/
public class PrintUtil {
/**
* Print the contents of every cluster
* @param dataVO DataVO entity
*/
public static void printClusterContents(DataVO dataVO){
List<ClusterVO> clusterList = dataVO.getClusterList();
//Visit every cluster
for(int i = 0; i < clusterList.size(); i++){
System.out.println("第" + (i+1) + "个簇共有" + clusterList.get(i).getPointList().size() + "项,内容如下:");
ClusterVO cluster = clusterList.get(i);//the current cluster
List<PointVO> pointList = cluster.getPointList();//points in the cluster
//Visit every point in the cluster
for(Iterator<PointVO> iter = pointList.iterator(); iter.hasNext(); ){
PointVO pointVO = iter.next();
Double[] valueArray = pointVO.getPoint();
for(int j = 0; j < valueArray.length - 1; j++){
System.out.print(valueArray[j] + ", ");
}
System.out.println(valueArray[valueArray.length - 1]);
}
}
}

}

(8)DataVO.java

package com.remoa.experiment4.domain;

import java.util.List;

/**
* DataVO entity: holds the list of clusters and the list of data points
* @author Remoa
*
*/
public class DataVO {
private List<ClusterVO> clusterList;//clusters
private List<PointVO> pointList;//data points

public List<ClusterVO> getClusterList() {
return clusterList;
}

public void setClusterList(List<ClusterVO> clusterList) {
this.clusterList = clusterList;
}

public List<PointVO> getPointList() {
return pointList;
}

public void setPointList(List<PointVO> pointList) {
this.pointList = pointList;
}

@Override
public String toString() {
return "DataVO [clusterList=" + clusterList + ", pointList=" + pointList + "]";
}

}
(9)ClusterVO.java
package com.remoa.experiment4.domain;

import java.util.Arrays;
import java.util.List;

/**
* Cluster entity: holds the cluster center and the points in the cluster
* @author Remoa
*
*/
public class ClusterVO{
private String clusterid;
private Double[] clusterCenter;//cluster center
private List<PointVO> pointList;//points in the cluster

public String getClusterid() {
return clusterid;
}

public void setClusterid(String clusterid) {
this.clusterid = clusterid;
}

public Double[] getClusterCenter() {
return clusterCenter;
}

public void setClusterCenter(Double[] clusterCenter) {
this.clusterCenter = clusterCenter;
}

public List<PointVO> getPointList() {
return pointList;
}

public void setPointList(List<PointVO> pointList) {
this.pointList = pointList;
}

@Override
public String toString() {
return "ClusterVO [clusterid=" + clusterid + ", clusterCenter=" + Arrays.toString(clusterCenter)
+ ", pointList=" + pointList + "]";
}

}
(10)PointVO.java
package com.remoa.experiment4.domain;

import java.util.Arrays;

/**
 * Data point entity: holds the double values of one row of cells, the id of the cluster the point belongs to, and the name of that cluster
 * @author Remoa
 *
 */
public class PointVO {
    private Double[] point;//values of the data point
    private String clusterid;//id of the cluster the point belongs to
    private String pointName;//cluster name assigned to the point

    public Double[] getPoint() {
        return point;
    }

    public void setPoint(Double[] point) {
        this.point = point;
    }

    public String getClusterid() {
        return clusterid;
    }

    public void setClusterid(String clusterid) {
        this.clusterid = clusterid;
    }

    public String getPointName() {
        return pointName;
    }

    public void setPointName(String pointName) {
        this.pointName = pointName;
    }

    @Override
    public String toString() {
        return "PointVO [point=" + Arrays.toString(point) + ", clusterid=" + clusterid + ", pointName=" + pointName
                + "]";
    }

}
(11)ChooserFactory.java
package com.remoa.experiment4.common.factory;

import com.remoa.experiment4.common.strategy.Clustering;
import com.remoa.experiment4.common.strategy.HierarchicalClustering;
import com.remoa.experiment4.common.strategy.KMeansClustering;

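/**
* Factory class that lets the caller choose which algorithm to run
* @author Remoa
*
*/
public class ChooserFactory {
/**
* Run a clustering algorithm
* @param algorithmName name of the algorithm, case-insensitive: "kmeans" or "hierarchical clustering"
*/
public void runAlgorithm(String algorithmName){
long startTime = System.currentTimeMillis();
Clustering clustering = null;
algorithmName = algorithmName.toLowerCase();
switch(algorithmName){
case "kmeans":
clustering = new KMeansClustering();
break;
case "hierarchical clustering":
clustering = new HierarchicalClustering();
break;
default:
//Fail with a clear message instead of a NullPointerException on the call below
throw new IllegalArgumentException("Unknown algorithm: " + algorithmName);
}
clustering.clusterAlgorithm();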
long endTime = System.currentTimeMillis();
System.out.println(algorithmName + "算法共耗时:" + (endTime - startTime)*1.0 / 1000 + "s");
System.out.println("------------------------------------------------");
}

}
(12)Clustering.java

package com.remoa.experiment4.common.strategy;
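/**
* Strategy interface that encapsulates the different cluster analysis algorithms.
* @author Remoa
*
*/
public interface Clustering {
/**
* Run the clustering algorithm (the abstract strategy method)
*/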
public void clusterAlgorithm();

}
(13)KMeansClustering.java
package com.remoa.experiment4.common.strategy;

import java.util.ArrayList;
import java.util.List;

import com.remoa.experiment4.common.ImportData;
import com.remoa.experiment4.common.util.PrintUtil;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;
import com.remoa.experiment4.service.CorrectRate;
import com.remoa.experiment4.service.KMeans;

import jxl.Cell;

/**
 * Runs the K-Means algorithm
 * @author Remoa
 *
 */
public class KMeansClustering implements Clustering{
    /**
     * Invoke the K-Means algorithm
     */
    @Override
    public void clusterAlgorithm() {
        List<Cell[]> cellList = ImportData.importData();
        int clusterNumber = ImportData.getclusterNumber();
        DataVO dataVO = KMeans.initDataVO(cellList, clusterNumber);
        List<PointVO> pointList = dataVO.getPointList();
        int count = clusterNumber;
        List<Double[]> centerValueList = new ArrayList<Double[]>();
        for(int i = 0; i < clusterNumber; i++){
            Double[] center = dataVO.getClusterList().get(i).getClusterCenter();
            centerValueList.add(center);
        }
        while(count != 0){
            count = 0;
            for(int i = 0; i < pointList.size(); i++){
                dataVO = KMeans.distributeIntoCluster(dataVO, dataVO.getPointList().get(i));
            }
            dataVO = KMeans.countClusterCenter(dataVO);
            List<Double[]> newCenterValueList = new ArrayList<Double[]>();
            for(int i = 0; i < clusterNumber; i++){
                Double[] center = dataVO.getClusterList().get(i).getClusterCenter();
                newCenterValueList.add(center);
            }
            for(int i = 0; i < clusterNumber; i++){
                Double[] oldCenter = centerValueList.get(i);
                Double[] newCenter = newCenterValueList.get(i);
                for(int j = 0; j < oldCenter.length; j++){
                    //A center counts as moved if any coordinate changed by at least 0.0001 (an absolute tolerance)
                    if(Math.abs(oldCenter[j] - newCenter[j]) >= 0.0001){
                        count++;
                        break;
                    }
                }
            }
            for(int i = 0; i < clusterNumber; i++){
                centerValueList.remove(0);
            }
            centerValueList.addAll(newCenterValueList);
        }
        PrintUtil.printClusterContents(dataVO);
        CorrectRate.getCorrectRate(dataVO);
    }
}
(14)HierarchicalClustering.java
package com.remoa.experiment4.common.strategy;

import java.util.List;

import com.remoa.experiment4.common.ImportData;
import com.remoa.experiment4.common.util.PrintUtil;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.service.CorrectRate;
import com.remoa.experiment4.service.HierarchicalAlgorithm;

import jxl.Cell;

/**
* Runs the hierarchical clustering algorithm
* @author Remoa
*
*/
public class HierarchicalClustering implements Clustering{
/**
* Invoke the hierarchical clustering algorithm
*/
@Override
public void clusterAlgorithm() {
int k = ImportData.getclusterNumber();
List<Cell[]> cellList = ImportData.importData();
DataVO dataVO = HierarchicalAlgorithm.initDataVO(cellList);
while(dataVO.getClusterList().size() != k){
dataVO = HierarchicalAlgorithm.mergeCluster(dataVO);
}
PrintUtil.printClusterContents(dataVO);
CorrectRate.getCorrectRate(dataVO);
}

}
(15)Main.java

package com.remoa.experiment4.action;

import com.remoa.experiment4.common.factory.ChooserFactory;

/**
* User-facing entry point of the program
* @author Remoa
*
*/
public class Main {
public static void main(String[] args){
new ChooserFactory().runAlgorithm("KMeans");
new ChooserFactory().runAlgorithm("Hierarchical Clustering");
}

}
(16)DataLoadIn.properties

absolutePath=C:/Users/\u9093\u5C0F\u827A/Desktop/fisheriris_meas.xls
sheetLoca=0
wantLoca=-1
initRowLoca=1
columnInitLoca=1
clusterNumber=3
(17)ResultLoadIn.properties

absolutePath=C:/Users/\u9093\u5C0F\u827A/Desktop/fisheriris_species.xls
sheetLoca=0
wantLoca=-1
initRowLoca=1
columnInitLoca=1

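4. Run results:

(1) Partial screenshots of the run results:

Figure 4.1 Screenshot 1 of the run results

Figure 4.2 Screenshot 2 of the run results

(2) Complete run results:

The program output is reproduced verbatim. In it, "第N个簇共有M项" means "cluster N contains M items", "正确项" are the correct items, "正确率" is the accuracy, and "共耗时" is the elapsed time.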

第1个簇共有39项,内容如下:
7.0, 3.2, 4.7, 1.4
6.9, 3.1, 4.9, 1.5
6.7, 3.0, 5.0, 1.7
6.3, 3.3, 6.0, 2.5
7.1, 3.0, 5.9, 2.1
6.3, 2.9, 5.6, 1.8
6.5, 3.0, 5.8, 2.2
7.6, 3.0, 6.6, 2.1
7.3, 2.9, 6.3, 1.8
7.2, 3.6, 6.1, 2.5
6.5, 3.2, 5.1, 2.0
6.4, 2.7, 5.3, 1.9
6.8, 3.0, 5.5, 2.1
6.4, 3.2, 5.3, 2.3
6.5, 3.0, 5.5, 1.8
7.7, 3.8, 6.7, 2.2
7.7, 2.6, 6.9, 2.3
6.9, 3.2, 5.7, 2.3
7.7, 2.8, 6.7, 2.0
6.7, 3.3, 5.7, 2.1
7.2, 3.2, 6.0, 1.8
6.4, 2.8, 5.6, 2.1
7.2, 3.0, 5.8, 1.6
7.4, 2.8, 6.1, 1.9
7.9, 3.8, 6.4, 2.0
6.4, 2.8, 5.6, 2.2
7.7, 3.0, 6.1, 2.3
6.3, 3.4, 5.6, 2.4
6.4, 3.1, 5.5, 1.8
6.9, 3.1, 5.4, 2.1
6.7, 3.1, 5.6, 2.4
6.9, 3.1, 5.1, 2.3
6.8, 3.2, 5.9, 2.3
6.7, 3.3, 5.7, 2.5
6.7, 3.0, 5.2, 2.3
6.5, 3.0, 5.2, 2.0
6.2, 3.4, 5.4, 2.3
6.7, 2.5, 5.8, 1.8
6.1, 2.6, 5.6, 1.4
第2个簇共有61项,内容如下:
5.5, 2.3, 4.0, 1.3
5.7, 2.8, 4.5, 1.3
4.9, 2.4, 3.3, 1.0
5.2, 2.7, 3.9, 1.4
5.0, 2.0, 3.5, 1.0
6.0, 2.2, 4.0, 1.0
5.6, 2.9, 3.6, 1.3
5.8, 2.7, 4.1, 1.0
6.2, 2.2, 4.5, 1.5
5.6, 2.5, 3.9, 1.1
5.7, 2.6, 3.5, 1.0
5.5, 2.4, 3.8, 1.1
5.5, 2.4, 3.7, 1.0
5.8, 2.7, 3.9, 1.2
5.4, 3.0, 4.5, 1.5
6.3, 2.3, 4.4, 1.3
5.6, 3.0, 4.1, 1.3
5.5, 2.5, 4.0, 1.3
5.5, 2.6, 4.4, 1.2
5.8, 2.6, 4.0, 1.2
5.0, 2.3, 3.3, 1.0
5.6, 2.7, 4.2, 1.3
5.7, 2.9, 4.2, 1.3
5.1, 2.5, 3.0, 1.1
5.7, 2.8, 4.1, 1.3
4.9, 2.5, 4.5, 1.7
5.7, 3.0, 4.2, 1.2
5.9, 3.0, 4.2, 1.5
5.6, 3.0, 4.5, 1.5
6.1, 2.8, 4.0, 1.3
6.1, 2.8, 4.7, 1.2
6.4, 2.9, 4.3, 1.3
6.0, 2.9, 4.5, 1.5
6.1, 3.0, 4.6, 1.4
6.2, 2.9, 4.3, 1.3
6.1, 2.9, 4.7, 1.4
6.6, 3.0, 4.4, 1.4
6.0, 3.4, 4.5, 1.6
6.0, 2.2, 5.0, 1.5
5.6, 2.8, 4.9, 2.0
6.4, 3.2, 4.5, 1.5
6.5, 2.8, 4.6, 1.5
6.6, 2.9, 4.6, 1.3
6.7, 3.1, 4.4, 1.4
5.9, 3.2, 4.8, 1.8
5.7, 2.5, 5.0, 2.0
6.3, 3.3, 4.7, 1.6
6.3, 2.5, 4.9, 1.5
6.2, 2.8, 4.8, 1.8
6.0, 3.0, 4.8, 1.8
6.0, 2.7, 5.1, 1.6
5.8, 2.7, 5.1, 1.9
6.1, 3.0, 4.9, 1.8
5.8, 2.7, 5.1, 1.9
6.7, 3.1, 4.7, 1.5
6.3, 2.7, 4.9, 1.8
5.9, 3.0, 5.1, 1.8
6.8, 2.8, 4.8, 1.4
6.3, 2.8, 5.1, 1.5
6.3, 2.5, 5.0, 1.9
5.8, 2.8, 5.1, 2.4
第3个簇共有50项,内容如下:
4.7, 3.2, 1.3, 0.2
4.6, 3.1, 1.5, 0.2
4.6, 3.4, 1.4, 0.3
4.4, 2.9, 1.4, 0.2
4.8, 3.4, 1.6, 0.2
4.3, 3.0, 1.1, 0.1
4.6, 3.6, 1.0, 0.2
4.7, 3.2, 1.6, 0.2
4.4, 3.0, 1.3, 0.2
4.4, 3.2, 1.3, 0.2
4.6, 3.2, 1.4, 0.2
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
5.0, 3.6, 1.4, 0.2
5.4, 3.9, 1.7, 0.4
5.0, 3.4, 1.5, 0.2
4.9, 3.1, 1.5, 0.1
5.4, 3.7, 1.5, 0.2
4.8, 3.0, 1.4, 0.1
5.8, 4.0, 1.2, 0.2
5.7, 4.4, 1.5, 0.4
5.4, 3.9, 1.3, 0.4
5.1, 3.5, 1.4, 0.3
5.7, 3.8, 1.7, 0.3
5.1, 3.8, 1.5, 0.3
5.4, 3.4, 1.7, 0.2
5.1, 3.7, 1.5, 0.4
5.1, 3.3, 1.7, 0.5
4.8, 3.4, 1.9, 0.2
5.0, 3.0, 1.6, 0.2
5.0, 3.4, 1.6, 0.4
5.2, 3.5, 1.5, 0.2
5.2, 3.4, 1.4, 0.2
4.8, 3.1, 1.6, 0.2
5.4, 3.4, 1.5, 0.4
5.2, 4.1, 1.5, 0.1
5.5, 4.2, 1.4, 0.2
4.9, 3.1, 1.5, 0.2
5.0, 3.2, 1.2, 0.2
5.5, 3.5, 1.3, 0.2
4.9, 3.6, 1.4, 0.1
5.1, 3.4, 1.5, 0.2
5.0, 3.5, 1.3, 0.3
4.5, 2.3, 1.3, 0.3
5.0, 3.5, 1.6, 0.6
5.1, 3.8, 1.9, 0.4
4.8, 3.0, 1.4, 0.3
5.1, 3.8, 1.6, 0.2
5.3, 3.7, 1.5, 0.2
5.0, 3.3, 1.4, 0.2
###############################
簇virginica共有39项,其中正确项有36项;
簇versicolor共有61项,其中正确项有47项;
簇setosa共有50项,其中正确项有50项;
项的总数为:150项
正确率为:88.67%
kmeans算法共耗时:0.39s
------------------------------------------------
第1个簇共有50项,内容如下:
5.1, 3.5, 1.4, 0.2
5.1, 3.5, 1.4, 0.3
5.2, 3.5, 1.5, 0.2
5.2, 3.4, 1.4, 0.2
5.0, 3.4, 1.5, 0.2
5.1, 3.4, 1.5, 0.2
5.0, 3.3, 1.4, 0.2
5.0, 3.5, 1.3, 0.3
5.0, 3.6, 1.4, 0.2
4.9, 3.6, 1.4, 0.1
5.4, 3.7, 1.5, 0.2
5.3, 3.7, 1.5, 0.2
5.1, 3.8, 1.5, 0.3
5.1, 3.7, 1.5, 0.4
5.1, 3.8, 1.6, 0.2
5.4, 3.4, 1.7, 0.2
5.4, 3.4, 1.5, 0.4
5.5, 3.5, 1.3, 0.2
5.1, 3.3, 1.7, 0.5
5.0, 3.4, 1.6, 0.4
5.0, 3.5, 1.6, 0.6
5.1, 3.8, 1.9, 0.4
4.9, 3.0, 1.4, 0.2
4.8, 3.0, 1.4, 0.3
4.8, 3.0, 1.4, 0.1
4.9, 3.1, 1.5, 0.1
4.9, 3.1, 1.5, 0.2
5.0, 3.0, 1.6, 0.2
4.7, 3.2, 1.6, 0.2
4.8, 3.1, 1.6, 0.2
4.7, 3.2, 1.3, 0.2
4.6, 3.1, 1.5, 0.2
4.6, 3.2, 1.4, 0.2
4.6, 3.4, 1.4, 0.3
5.0, 3.2, 1.2, 0.2
4.8, 3.4, 1.6, 0.2
4.8, 3.4, 1.9, 0.2
4.4, 2.9, 1.4, 0.2
4.4, 3.0, 1.3, 0.2
4.4, 3.2, 1.3, 0.2
4.3, 3.0, 1.1, 0.1
4.6, 3.6, 1.0, 0.2
5.4, 3.9, 1.7, 0.4
5.7, 3.8, 1.7, 0.3
5.4, 3.9, 1.3, 0.4
5.2, 4.1, 1.5, 0.1
5.5, 4.2, 1.4, 0.2
5.8, 4.0, 1.2, 0.2
5.7, 4.4, 1.5, 0.4
4.5, 2.3, 1.3, 0.3
第2个簇共有64项,内容如下:
7.0, 3.2, 4.7, 1.4
6.9, 3.1, 4.9, 1.5
6.7, 3.1, 4.7, 1.5
6.8, 2.8, 4.8, 1.4
6.7, 3.0, 5.0, 1.7
6.5, 2.8, 4.6, 1.5
6.6, 2.9, 4.6, 1.3
6.7, 3.1, 4.4, 1.4
6.6, 3.0, 4.4, 1.4
6.4, 3.2, 4.5, 1.5
6.3, 3.3, 4.7, 1.6
6.0, 3.4, 4.5, 1.6
6.1, 2.9, 4.7, 1.4
6.1, 3.0, 4.6, 1.4
6.0, 2.9, 4.5, 1.5
6.1, 2.8, 4.7, 1.2
6.1, 2.8, 4.0, 1.3
6.4, 2.9, 4.3, 1.3
6.2, 2.9, 4.3, 1.3
5.9, 3.2, 4.8, 1.8
6.1, 3.0, 4.9, 1.8
6.0, 3.0, 4.8, 1.8
5.9, 3.0, 5.1, 1.8
6.3, 2.5, 4.9, 1.5
6.0, 2.7, 5.1, 1.6
6.3, 2.8, 5.1, 1.5
6.3, 2.7, 4.9, 1.8
6.2, 2.8, 4.8, 1.8
6.3, 2.5, 5.0, 1.9
5.8, 2.7, 5.1, 1.9
5.8, 2.7, 5.1, 1.9
5.7, 2.5, 5.0, 2.0
5.6, 2.8, 4.9, 2.0
5.8, 2.8, 5.1, 2.4
6.2, 2.2, 4.5, 1.5
6.3, 2.3, 4.4, 1.3
6.0, 2.2, 5.0, 1.5
5.5, 2.3, 4.0, 1.3
5.5, 2.5, 4.0, 1.3
5.6, 2.5, 3.9, 1.1
5.5, 2.4, 3.8, 1.1
5.5, 2.4, 3.7, 1.0
5.6, 2.9, 3.6, 1.3
5.7, 2.6, 3.5, 1.0
5.2, 2.7, 3.9, 1.4
5.7, 2.8, 4.5, 1.3
5.5, 2.6, 4.4, 1.2
5.8, 2.7, 4.1, 1.0
5.8, 2.7, 3.9, 1.2
5.8, 2.6, 4.0, 1.2
5.6, 3.0, 4.1, 1.3
5.7, 3.0, 4.2, 1.2
5.7, 2.9, 4.2, 1.3
5.6, 2.7, 4.2, 1.3
5.7, 2.8, 4.1, 1.3
5.9, 3.0, 4.2, 1.5
5.6, 3.0, 4.5, 1.5
5.4, 3.0, 4.5, 1.5
6.0, 2.2, 4.0, 1.0
4.9, 2.5, 4.5, 1.7
4.9, 2.4, 3.3, 1.0
5.0, 2.3, 3.3, 1.0
5.1, 2.5, 3.0, 1.1
5.0, 2.0, 3.5, 1.0
第3个簇共有36项,内容如下:
6.3, 3.3, 6.0, 2.5
6.3, 2.9, 5.6, 1.8
6.5, 3.0, 5.5, 1.8
6.4, 3.1, 5.5, 1.8
6.4, 2.7, 5.3, 1.9
6.5, 3.0, 5.8, 2.2
6.4, 2.8, 5.6, 2.1
6.4, 2.8, 5.6, 2.2
6.5, 3.2, 5.1, 2.0
6.5, 3.0, 5.2, 2.0
6.8, 3.0, 5.5, 2.1
6.9, 3.1, 5.4, 2.1
6.9, 3.1, 5.1, 2.3
6.7, 3.0, 5.2, 2.3
6.9, 3.2, 5.7, 2.3
6.8, 3.2, 5.9, 2.3
6.7, 3.1, 5.6, 2.4
6.7, 3.3, 5.7, 2.5
6.7, 3.3, 5.7, 2.1
6.4, 3.2, 5.3, 2.3
6.3, 3.4, 5.6, 2.4
6.2, 3.4, 5.4, 2.3
6.7, 2.5, 5.8, 1.8
6.1, 2.6, 5.6, 1.4
7.1, 3.0, 5.9, 2.1
7.2, 3.2, 6.0, 1.8
7.2, 3.0, 5.8, 1.6
7.3, 2.9, 6.3, 1.8
7.4, 2.8, 6.1, 1.9
7.7, 3.0, 6.1, 2.3
7.6, 3.0, 6.6, 2.1
7.7, 2.8, 6.7, 2.0
7.7, 2.6, 6.9, 2.3
7.2, 3.6, 6.1, 2.5
7.7, 3.8, 6.7, 2.2
7.9, 3.8, 6.4, 2.0
###############################
簇setosa共有50项,其中正确项有50项;
簇versicolor共有64项,其中正确项有50项;
簇virginica共有36项,其中正确项有36项;
项的总数为:150项
正确率为:90.67%
hierarchical clustering算法共耗时:0.59s
------------------------------------------------
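
The two reported accuracies can be rechecked from the per-cluster counts printed above: 36 + 47 + 50 correct items out of 150 for k-means, and 50 + 50 + 36 out of 150 for hierarchical clustering. The small check below uses hypothetical names and formats with String.format rather than the DecimalFormat used in the program.

```java
import java.util.Locale;

/** Recomputes the accuracy percentages from the per-cluster correct counts. */
final class AccuracyCheck {

    static String accuracy(int[] correctPerCluster, int total) {
        int correct = 0;
        for (int c : correctPerCluster) {
            correct += c;
        }
        return String.format(Locale.ROOT, "%.2f%%", 100.0 * correct / total);
    }

    public static void main(String[] args) {
        System.out.println(accuracy(new int[] {36, 47, 50}, 150)); // prints 88.67%
        System.out.println(accuracy(new int[] {50, 50, 36}, 150)); // prints 90.67%
    }
}
```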