An Overview of Mainstream Clustering Evaluation Metrics, with a Java Implementation of Clustering Accuracy

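This post summarizes several mainstream evaluation metrics for clustering algorithms. The main reference is Machine Learning (《机器学习》) by Zhou Zhihua.
We focus in particular on the principle and implementation of clustering accuracy (AC).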

Broadly, evaluation metrics for clustering algorithms fall into two categories:
0) external metrics
1) internal metrics

External metrics measure how well the clustering result agrees with the ground-truth labels when those labels are known. The common ones are:
0) Jaccard Coefficient (JC);
1) Fowlkes and Mallows Index (FMI);
2) Rand Index (RI);
3) Purity;
4) Accuracy (AC);
5) Normalized Mutual Information (NMI).

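Internal metrics measure the quality of the clustering result itself (for example, cohesion within clusters and separation between clusters) when ground-truth labels are not available. Two are in common use:
6) Davies-Bouldin Index (DBI);
7) Dunn Index (DI).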

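Each is introduced below.
Let the dataset be $D=\{x_1,\dots,x_n\}$, let the labels produced by clustering be $p=[p_1,\dots,p_n]$, and let the true labels be $r=[r_1,\dots,r_n]$. Considering the samples in pairs, define

$SS=\{(x_i,x_j) \mid p_i=p_j,\ r_i=r_j,\ i<j\}$,
$SD=\{(x_i,x_j) \mid p_i=p_j,\ r_i\neq r_j,\ i<j\}$,
$DS=\{(x_i,x_j) \mid p_i\neq p_j,\ r_i=r_j,\ i<j\}$,
$DD=\{(x_i,x_j) \mid p_i\neq p_j,\ r_i\neq r_j,\ i<j\}$,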

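where SS contains the sample pairs that are placed in the same predicted cluster and also share the same true label,
SD contains the pairs that are placed in the same predicted cluster but have different true labels,
DS contains the pairs that are placed in different predicted clusters but share the same true label,
and DD contains the pairs that are placed in different predicted clusters and also have different true labels.
Every sample pair belongs to exactly one of these four sets, so $|SS|+|SD|+|DS|+|DD|=n(n-1)/2$.
From these sets, the following external metrics can be derived: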

0) JC

$$JC=\frac{|SS|}{|SS|+|SD|+|DS|}$$

1) FMI

$$FMI=\frac{|SS|}{\sqrt{(|SS|+|SD|)(|SS|+|DS|)}}$$

2) RI

$$RI=\frac{|SS|+|DD|}{|SS|+|SD|+|DS|+|DD|}=\frac{2(|SS|+|DD|)}{n(n-1)}$$

Clearly, all of the above metrics take values in [0, 1], and larger is better.
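The pair counts and the three indices above can be sketched in a few lines of Java. This is only an illustrative sketch: the class `PairCountingIndices` and its method names are hypothetical and do not come from the original post, and the O(n²) double loop is chosen for clarity rather than speed.

```java
/** Pair-counting external indices (JC, FMI, RI). Hypothetical helper for illustration. */
final class PairCountingIndices {

    /** Returns {|SS|, |SD|, |DS|, |DD|} for predicted labels p and true labels r. */
    static long[] countPairs(int[] p, int[] r) {
        long ss = 0, sd = 0, ds = 0, dd = 0;
        for (int i = 0; i < p.length; i++) {
            for (int j = i + 1; j < p.length; j++) {
                boolean samePredicted = p[i] == p[j];
                boolean sameTrue = r[i] == r[j];
                if (samePredicted && sameTrue) {
                    ss++;
                } else if (samePredicted) {
                    sd++;
                } else if (sameTrue) {
                    ds++;
                } else {
                    dd++;
                }
            }
        }
        return new long[] { ss, sd, ds, dd };
    }

    /** JC = |SS| / (|SS| + |SD| + |DS|); NaN if the denominator is zero. */
    static double jaccard(long[] c) {
        return (double) c[0] / (c[0] + c[1] + c[2]);
    }

    /** FMI = |SS| / sqrt((|SS| + |SD|)(|SS| + |DS|)). */
    static double fmi(long[] c) {
        return c[0] / Math.sqrt((double) (c[0] + c[1]) * (c[0] + c[2]));
    }

    /** RI = (|SS| + |DD|) / (number of pairs). */
    static double rand(long[] c) {
        return (double) (c[0] + c[3]) / (c[0] + c[1] + c[2] + c[3]);
    }

    public static void main(String[] args) {
        int[] p = { 0, 0, 0, 1, 1, 1 }; // predicted labels (made-up data)
        int[] r = { 0, 0, 1, 1, 1, 1 }; // true labels (made-up data)
        long[] c = countPairs(p, r);    // |SS|=4, |SD|=2, |DS|=3, |DD|=6
        System.out.println("JC  = " + jaccard(c)); // 4/9
        System.out.println("FMI = " + fmi(c));     // 4/sqrt(42)
        System.out.println("RI  = " + rand(c));    // 10/15
    }
}
```

For the six made-up samples in `main`, there are 15 pairs in total, which matches $n(n-1)/2$ with $n=6$.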

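Suppose the clustering produces the partition $C=\{C_i\}_{i=1}^{k}$ and the true partition is $C'=\{C'_j\}_{j=1}^{s}$. We build a matrix $W=\{w_{ij}=|C_i\cap C'_j|\}_{k\times s}$, which stores the number of samples shared by each predicted cluster and each true cluster.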

As shown in Table 1:
[Table 1: an example matrix W of overlaps between predicted clusters and true clusters]
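As a rough sketch of how the matrix W can be built from two label arrays: the class name `ContingencyMatrix` is hypothetical, and the code assumes that predicted labels are integers in 0..k-1 and true labels are integers in 0..s-1.

```java
import java.util.Arrays;

/** Builds W with w[i][j] = |C_i ∩ C'_j|. Hypothetical helper for illustration. */
final class ContingencyMatrix {

    /**
     * @param p predicted labels, assumed to lie in 0..k-1
     * @param r true labels, assumed to lie in 0..s-1
     */
    static int[][] build(int[] p, int[] r, int k, int s) {
        int[][] w = new int[k][s];
        for (int i = 0; i < p.length; i++) {
            w[p[i]][r[i]]++; // sample i falls in predicted cluster p[i] and true cluster r[i]
        }
        return w;
    }

    public static void main(String[] args) {
        int[] p = { 0, 0, 0, 1, 1, 1 }; // made-up predicted labels
        int[] r = { 0, 0, 1, 1, 1, 1 }; // made-up true labels
        System.out.println(Arrays.deepToString(build(p, r, 2, 2)));
    }
}
```

The entries of W always sum to the number of samples n, which is the quantity written as $\mathbf{1}^{T}W\mathbf{1}$ in the formulas that follow.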

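3) Purity
As the name suggests, Purity measures how pure the predicted clusters are. It can be obtained from the following optimization problem: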

$$
\begin{aligned}
\mathrm{Purity}=\max\ & \frac{\sum_{i=1}^{k}\sum_{j=1}^{s} w_{ij}x_{ij}}{\mathbf{1}^{T}W\mathbf{1}} \\
\text{s.t.}\ & \sum_{j=1}^{s} x_{ij}=1,\quad i=1,\dots,k \\
& x_{ij}=0 \text{ or } 1,\quad i=1,\dots,k,\ j=1,\dots,s
\end{aligned}
$$

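Clearly, $\mathbf{1}^{T}W\mathbf{1}=n$ is the number of samples.
In effect, Purity is the sum of the row maxima of $W$ divided by the total number of samples.
For Table 1, Purity $=(10+20+8+15)/102=0.5196$.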
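The "sum of row maxima" rule can be sketched as follows. The class `PurityExample` is hypothetical, and the matrix in `main` is made-up data, not the Table 1 from the post.

```java
/** Purity from the overlap matrix W. Hypothetical helper for illustration. */
final class PurityExample {

    /** Sum of the row maxima of w divided by the sum of all entries. */
    static double purity(int[][] w) {
        long total = 0;
        long sumOfRowMaxima = 0;
        for (int[] row : w) {
            int rowMax = 0;
            for (int v : row) {
                total += v;
                rowMax = Math.max(rowMax, v);
            }
            sumOfRowMaxima += rowMax;
        }
        return (double) sumOfRowMaxima / total;
    }

    public static void main(String[] args) {
        // Made-up 3x3 matrix with 20 samples; row maxima are 5, 4 and 6.
        int[][] w = { { 5, 1, 0 }, { 4, 1, 1 }, { 0, 2, 6 } };
        System.out.println(purity(w)); // (5 + 4 + 6) / 20
    }
}
```

Note that rows 0 and 1 of this made-up matrix both pick column 0, which Purity allows.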

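4) AC
AC is one of the most widely used clustering evaluation metrics, and many papers report it as their measure of clustering quality. AC is defined as follows: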

$$AC(p,r)=\frac{\sum_{i=1}^{n}\delta(r_i,\mathrm{map}(p_i))}{n},$$

where

$$\delta(a,b)=\begin{cases}1, & \text{if } a=b;\\ 0, & \text{otherwise},\end{cases}$$

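and $\mathrm{map}(p_i)$ is a permutation mapping function that maps each cluster label to its equivalent true label. The mapping between cluster labels and true labels is one-to-one (though not necessarily onto).
Many papers state that an optimal $\mathrm{map}(\cdot)$ can be found with the Kuhn-Munkres algorithm [Matching Theory]. In fact, AC can be obtained from the following optimization problem: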

$$
\begin{aligned}
AC=\max\ & \frac{\sum_{i=1}^{k}\sum_{j=1}^{s} w_{ij}x_{ij}}{\mathbf{1}^{T}W\mathbf{1}} \\
\text{s.t.}\ & \sum_{j=1}^{s} x_{ij}=1,\quad i=1,\dots,k \\
& \sum_{i=1}^{k} x_{ij}=1,\quad j=1,\dots,s \\
& x_{ij}=0 \text{ or } 1,\quad i=1,\dots,k,\ j=1,\dots,s
\end{aligned}
$$

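As can be seen, the optimization problem for AC has just one more constraint than the one for Purity. Purity requires that exactly one entry be selected in each row. AC requires in addition that exactly one entry be selected in each column, so a predicted cluster can correspond to only one true cluster, and a true cluster to only one predicted cluster. Consequently, when $k=s$ the optimal solution $X=\{x_{ij}\}_{k\times s}$ is a permutation matrix and therefore orthogonal. This optimization problem is known as the assignment problem, and there is a dedicated algorithm for it: the Hungarian algorithm, which is the same method that the literature also calls the Kuhn-Munkres algorithm. So computing AC requires nothing more than the Hungarian algorithm.
The principle and procedure of the Hungarian algorithm are covered in many optimization textbooks. This blog post
http://blog.csdn.net/zhanghaor/article/details/52344766
gives a Java implementation. When I actually used that implementation, though, I found that it fails to converge in some cases. In a fit of annoyance I wrote my own, which I find more dependable. The Java code is as follows: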

import java.util.Arrays;
import org.ujmp.core.Matrix;
import org.ujmp.core.calculation.Calculation.Ret;

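/**
 * Hungarian method for solving the assignment problem (minimization form).
 * @author Yanxue
 *
 */
public class Hungary {

    // Working copy of the cost matrix, padded to a square matrix if necessary.
    Matrix graph;

    // Number of rows and columns of the (square) cost matrix.
    int n, m;

    // 0/1 assignment matrix produced by findMinMatch().
    Matrix mapMatrix;

    // mapIndices[i] is the column assigned to row i, or -1 if none.
    int[] mapIndices;

    // Upper bound on the number of update iterations.
    public static final int MAX_ITE_NUM = 1000;

    public Hungary(Matrix pGraph) {
        // Ret.NEW returns a new matrix, so the algorithm works on a copy of the input.
        graph = pGraph.plus(Ret.NEW, false, 0);
        n = (int) pGraph.getRowCount();
        m = (int) pGraph.getColumnCount();
        if (n != m) {
            graphSqureChange();
        }
    }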

    // Pad the cost matrix with zero rows or zero columns until it is square.
    private void graphSqureChange() {
        if (n < m) {
            graph = graph.appendVertically(Ret.LINK,
                    Matrix.Factory.zeros(m - n, m));
        } else {
            graph = graph.appendHorizontally(Ret.LINK,
                    Matrix.Factory.zeros(n, n - m));
        }
        n = (int) graph.getRowCount();
        m = n;
    }

    public void findMinMatch() {
        // Compute C': subtract each row's minimum from that row.
        Matrix rowMinValue = graph.min(Ret.NEW, 1);
        Matrix tC = Matrix.Factory.emptyMatrix();

        for (int i = 0; i < n; i++) {
            tC = tC.appendVertically(Ret.LINK, graph.selectRows(Ret.LINK, i)
                    .minus(rowMinValue.getAsInt(i, 0)));
        }

        // Then subtract each column's minimum from that column.
        Matrix columnMinValue = tC.min(Ret.NEW, 0);
        Matrix _tC = Matrix.Factory.emptyMatrix();
        for (int i = 0; i < m; i++) {
            _tC = _tC.appendHorizontally(
                    Ret.LINK,
                    tC.selectColumns(Ret.LINK, i).minus(
                            columnMinValue.getAsInt(0, i)));
        }
        //System.out.println("C(1) computed");
        Matrix tMapMatrix = constructMapAndUpdate(_tC)[0];
        int tCount = 0;
        // Keep updating the reduced matrix until a full assignment is found
        // or the iteration limit is reached.
        while (!isOptimal(tMapMatrix) && tCount++ < MAX_ITE_NUM) {
            Matrix[] tMatrix = constructMapAndUpdate(_tC);
            tMapMatrix = tMatrix[0];
            _tC = tMatrix[1];
        }

        mapMatrix = tMapMatrix;
        // Convert the 0/1 assignment matrix into a row -> column index array.
        mapIndices = new int[n];
        Arrays.fill(mapIndices, -1);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                if (mapMatrix.getAsInt(i, j) == 1) {
                    mapIndices[i] = j;
                    break;
                }
            }
        }
    }

    private Matrix[] constructMapAndUpdate(Matrix c) {
        Matrix tMap = Matrix.Factory.zeros(n, m);
        Matrix updateC = c.plus(Ret.NEW, false, 0);

        int[][] rowZeroIndices = getRowZeroIndices(c);

        // Visit rows in order of increasing zero count and give each row the
        // first zero whose column is still free.
        int[] indexSequence = findMinToMaxRowZeroCountIndexSequence(rowZeroIndices);
        boolean[] rowComputed = new boolean[n];
        boolean[] columnComputed = new boolean[m];
        for (int i = 0; i < n; i++) {
            int currentRow = indexSequence[i];
            for (int j = 0; j < rowZeroIndices[currentRow].length; j++) {
                if (!columnComputed[rowZeroIndices[currentRow][j]]) {
                    tMap.setAsInt(1, currentRow, rowZeroIndices[currentRow][j]);
                    columnComputed[rowZeroIndices[currentRow][j]] = true;
                    // 1) Flag for having bracket.
                    rowComputed[currentRow] = true;
                    break;
                }
            }
        }
        //System.out.println("C(1)\r\n" + tMap);

        if (isOptimal(tMap)) {
            return new Matrix[] { tMap, updateC };
        }
        // C' --> C''
        boolean[] rowFlag = new boolean[n];
        // 1) Flag every row that has no bracketed (assigned) zero.
        for (int i = 0; i < n; i++) {
            rowFlag[i] = !rowComputed[i];
        }
        //System.out.println("C(1): " + Arrays.toString(rowFlag));

        boolean[] columnFlag = new boolean[m];

        boolean[] _rowFlag = new boolean[n];
        boolean[] _columnFlag = new boolean[m];

        // Repeat steps 2) and 3) until the flags stop changing.
        while (!Arrays.equals(_rowFlag, rowFlag)
                || !Arrays.equals(_columnFlag, columnFlag)) {

            // Snapshot the flags by value. Assigning the array references
            // would alias them, so the loop would always stop after one pass.
            _rowFlag = rowFlag.clone();
            _columnFlag = columnFlag.clone();

            // 2) Flag column indices for all the zero elements in those
            // bracket-flagged rows.
            for (int i = 0; i < n; i++) {
                // flagged row
                if (rowFlag[i]) {
                    for (int j = 0; j < rowZeroIndices[i].length; j++) {
                        columnFlag[rowZeroIndices[i][j]] = true;
                    }
                }
            }
            //System.out.println("C(1)" + Arrays.toString(columnFlag));

            // 3) Flag row indices for those bracket-flagged elements in flagged
            // columns.
            for (int i = 0; i < m; i++) {
                if (columnFlag[i]) {
                    for (int j = 0; j < n; j++) {
                        if (tMap.getAsInt(j, i) == 1) {
                            rowFlag[j] = true;
                            break;
                        }
                    }
                }
            }
        }

        // 5) Find the minimum element among the locations not covered by lines.
        // Lines run through the unflagged rows and the flagged columns.
        int tMinValue = Integer.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            // skip rows covered by a line
            if (!rowFlag[i]) {
                continue;
            }

            for (int j = 0; j < m; j++) {
                if (!columnFlag[j]) {
                    if (c.getAsInt(i, j) < tMinValue) {
                        tMinValue = c.getAsInt(i, j);
                    }
                }
            }
        }

        // 6) Subtract the minimum value from the flagged rows.
        for (int i = 0; i < n; i++) {
            if (rowFlag[i]) {
                for (int j = 0; j < m; j++) {
                    updateC.setAsInt(updateC.getAsInt(i, j) - tMinValue, i, j);
                }
            }
        }
        // 7) Add the minimum value to the flagged columns.
        for (int i = 0; i < m; i++) {
            if (columnFlag[i]) {
                for (int j = 0; j < n; j++) {
                    updateC.setAsInt(updateC.getAsInt(j, i) + tMinValue, j, i);
                }
            }
        }

        return new Matrix[] { tMap, updateC };
    }

    // Returns the row indices sorted by increasing number of zeros
    // (a simple selection sort).
    private int[] findMinToMaxRowZeroCountIndexSequence(int[][] rowZeroIndices) {
        int[] tSequence = new int[n];
        int tIndex = 0;
        boolean[] rowComputed = new boolean[n];
        while (tIndex < n) {
            int minZeroCountIndex = 0;
            int minZeroCount = Integer.MAX_VALUE;

            for (int i = 0; i < n; i++) {
                if (rowComputed[i]) {
                    continue;
                }

                if (rowZeroIndices[i].length < minZeroCount) {
                    minZeroCount = rowZeroIndices[i].length;
                    minZeroCountIndex = i;
                }

            }
            tSequence[tIndex++] = minZeroCountIndex;
            rowComputed[minZeroCountIndex] = true;
        }
        return tSequence;
    }

    // For each row, collects the column indices of its zero elements.
    private int[][] getRowZeroIndices(Matrix c) {

        int[][] tRowZeroIndices = new int[n][];
        int[] tRowZeroCounts = new int[n];

        // First pass: count the zeros in each row.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                if (c.getAsInt(i, j) == 0) {
                    tRowZeroCounts[i]++;
                }
            }
        }

        // Second pass: record the column index of each zero.
        for (int i = 0; i < n; i++) {
            tRowZeroIndices[i] = new int[tRowZeroCounts[i]];
            tRowZeroCounts[i] = 0;
            for (int j = 0; j < m; j++) {
                if (c.getAsInt(i, j) == 0) {
                    tRowZeroIndices[i][tRowZeroCounts[i]++] = j;
                }
            }
        }

        return tRowZeroIndices;
    }

    /**
     * Judge if the map matrix is optimal.
     *
     * @param mapC the 0/1 assignment matrix
     * @return true if every one of the n rows has an assigned column
     */
    private boolean isOptimal(Matrix mapC) {
        return mapC.sum(Ret.NEW, Matrix.ALL, false).getAsInt(0, 0) == n;
    }

    public int[] getMapIndices() {
        return mapIndices;
    }
    /**
     * Testing method.
     */
    public static void main(String[] args) {
        int[][] m = null;
        // First test case. It is overwritten by the second assignment below;
        // comment that assignment out to run this one instead.
        m = new int[][]{
                { 12, 7, 9, 7, 9 },
                { 8, 9, 6, 6, 6 },
                { 7, 17, 12, 14, 9 },
                { 15, 14, 6, 6, 10 },
                { 4, 10, 7, 10, 9 }
        };
        // Second test case (the one that actually runs).
        m = new int[][]{
                {2, 15, 13, 4},
                {10, 4, 14, 15},
                {9, 14, 16, 13},
                {7, 8, 11, 9},
        };
        Matrix mMatrix = Matrix.Factory.zeros(m.length, m[0].length);

        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                mMatrix.setAsInt(m[i][j], i, j);
            }
        }

        Hungary h = new Hungary(mMatrix);
        h.findMinMatch();
        System.out.println(h.mapMatrix);
        System.out.println(Arrays.toString(h.mapIndices));
    }
}

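When using this code, note the following two points:
1. The third-party UJMP library is required, since the code relies on it for matrix operations. Download link: https://ujmp.org/;
2. The code solves the minimization form of the assignment problem. To solve a maximization problem (AC is one), convert $W$ into
$W'=\{w'_{ij}\}_{k\times s}$ with $w'_{ij}=\max(W)-w_{ij}$, where $\max(W)$ is the largest entry of $W$. The optimal assignment of the converted minimization problem is also the optimal assignment of the original maximization problem.
To compute AC, take this assignment, add up the corresponding entries of $W$, and divide by the total number of samples.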
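The conversion in point 2 and the final AC computation can be checked on small inputs with a brute-force search. The sketch below is not the Hungarian algorithm and does not use the `Hungary` class or UJMP: it simply tries every assignment, so it is only practical for a handful of clusters. The class `AccuracyBruteForce` is hypothetical, and the code assumes a square W (k = s).

```java
/** AC by exhaustive search over assignments. Hypothetical sanity check, square W only. */
final class AccuracyBruteForce {

    /** Minimum total cost of assigning rows row..k-1 to the columns not yet used. */
    static int minCost(int[][] cost, int row, boolean[] usedColumn) {
        if (row == cost.length) {
            return 0;
        }
        int best = Integer.MAX_VALUE;
        for (int j = 0; j < cost.length; j++) {
            if (usedColumn[j]) {
                continue;
            }
            usedColumn[j] = true;
            best = Math.min(best, cost[row][j] + minCost(cost, row + 1, usedColumn));
            usedColumn[j] = false;
        }
        return best;
    }

    /** AC for a square overlap matrix w. */
    static double accuracy(int[][] w) {
        int k = w.length;
        int max = 0;
        int n = 0;
        for (int[] row : w) {
            for (int v : row) {
                max = Math.max(max, v);
                n += v;
            }
        }
        // w'_ij = max(W) - w_ij turns the maximization into a minimization.
        int[][] cost = new int[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                cost[i][j] = max - w[i][j];
            }
        }
        // An assignment picks k entries, so (matched samples) = k * max - (its cost).
        int matched = k * max - minCost(cost, 0, new boolean[k]);
        return (double) matched / n;
    }

    public static void main(String[] args) {
        // Made-up matrix with 20 samples; the best assignment is the diagonal (5 + 1 + 6).
        int[][] w = { { 5, 1, 0 }, { 4, 1, 1 }, { 0, 2, 6 } };
        System.out.println(accuracy(w)); // 12 / 20
    }
}
```

For this made-up matrix the row maxima sum to 15, so Purity would be 15/20, while AC is only 12/20 because rows 0 and 1 cannot both be matched to column 0.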

A Matlab implementation of this algorithm is also available; see
http://www.cad.zju.edu.cn/home/dengcai/Data/code/hungarian.m

5) NMI
NMI is the normalized mutual information. Given two random variables $P$ and $Q$, the NMI between them is given by

$$NMI(P,Q)=\frac{I(P,Q)}{\sqrt{H(P)H(Q)}},$$

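where $I(P,Q)$ is the mutual information of $P$ and $Q$, and $H(\cdot)$ is the information entropy. Some papers use $\max(H(P),H(Q))$ as the denominator instead, which makes little difference.
From this formula, the NMI between the predicted partition $C$ and the true partition $C'$ is given by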

$$NMI(C,C')=\frac{\sum_{i=1}^{k}\sum_{j=1}^{s}|C_i\cap C'_j|\log\dfrac{n\,|C_i\cap C'_j|}{|C_i|\,|C'_j|}}{\sqrt{\left(\sum_{i=1}^{k}|C_i|\log\dfrac{|C_i|}{n}\right)\left(\sum_{j=1}^{s}|C'_j|\log\dfrac{|C'_j|}{n}\right)}}$$
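The NMI formula for two partitions can be evaluated directly from the overlap matrix W, because $|C_i|$ is the i-th row sum and $|C'_j|$ is the j-th column sum. The following is a minimal sketch with a hypothetical class name `NmiExample`; empty cells are skipped using the usual convention that $0\log 0=0$.

```java
/** NMI from the overlap matrix W. Hypothetical helper for illustration. */
final class NmiExample {

    /** Returns NMI(C, C'); the result is NaN when either partition has a single cluster. */
    static double nmi(int[][] w) {
        int k = w.length;
        int s = w[0].length;
        double[] rowSum = new double[k]; // |C_i|
        double[] colSum = new double[s]; // |C'_j|
        double n = 0;
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < s; j++) {
                rowSum[i] += w[i][j];
                colSum[j] += w[i][j];
                n += w[i][j];
            }
        }
        // Numerator: sum of |C_i ∩ C'_j| * log(n |C_i ∩ C'_j| / (|C_i| |C'_j|)).
        double numerator = 0;
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < s; j++) {
                if (w[i][j] > 0) {
                    numerator += w[i][j] * Math.log(n * w[i][j] / (rowSum[i] * colSum[j]));
                }
            }
        }
        // Both sums below are negative, so their product is positive.
        double sumP = 0;
        for (double a : rowSum) {
            if (a > 0) {
                sumP += a * Math.log(a / n);
            }
        }
        double sumQ = 0;
        for (double b : colSum) {
            if (b > 0) {
                sumQ += b * Math.log(b / n);
            }
        }
        return numerator / Math.sqrt(sumP * sumQ);
    }

    public static void main(String[] args) {
        System.out.println(nmi(new int[][] { { 3, 0 }, { 0, 3 } })); // partitions agree
        System.out.println(nmi(new int[][] { { 2, 2 }, { 2, 2 } })); // partitions are independent
    }
}
```

The two made-up matrices in `main` are the extreme cases: identical partitions give a value of 1 and statistically independent partitions give 0, up to floating-point rounding.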

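Next we turn to the two internal metrics. Internal metrics do not use the true labels; instead, they reflect the cohesion of the predicted clusters themselves or the separation between clusters. Given the partition $C=\{C_i\}_{i=1}^{k}$ produced by clustering, define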

$$
\begin{aligned}
\mathrm{avg}(C_i) &= \frac{2}{|C_i|(|C_i|-1)}\sum_{x_l,x_j\in C_i,\ l<j}\mathrm{dist}(x_l,x_j),\\
\mathrm{diam}(C_i) &= \max_{x_l,x_j\in C_i,\ l<j}\mathrm{dist}(x_l,x_j),\\
d_{\min}(C_i,C_j) &= \min_{x_l\in C_i,\ x_m\in C_j}\mathrm{dist}(x_l,x_m),\\
d_{cen}(C_i,C_j) &= \mathrm{dist}(u_i,u_j),
\end{aligned}
$$

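where $\mathrm{dist}(\cdot,\cdot)$ is the distance between two samples and $u_i$ is the center of cluster $C_i$. From these definitions, the following internal metrics can be derived.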

6) DBI

$$DBI=\frac{1}{k}\sum_{i=1}^{k}\max_{j\neq i}\left(\frac{\mathrm{avg}(C_i)+\mathrm{avg}(C_j)}{d_{cen}(C_i,C_j)}\right)$$

Note that DBI reflects both the separation between clusters and the cohesion within clusters; smaller is better.
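A minimal sketch of DBI follows. The post does not say which distance to use, so Euclidean distance and the arithmetic mean as the cluster center are assumptions here, as is the hypothetical class name `DaviesBouldinExample`. The code also assumes labels in 0..k-1 and at least two points per cluster, since $\mathrm{avg}(C_i)$ is undefined for a singleton.

```java
/** Davies-Bouldin Index. Hypothetical helper; Euclidean distance is an assumption. */
final class DaviesBouldinExample {

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            sum += (a[t] - b[t]) * (a[t] - b[t]);
        }
        return Math.sqrt(sum);
    }

    /** DBI for points x with labels in 0..k-1; every cluster needs at least two points. */
    static double dbi(double[][] x, int[] label, int k) {
        int dim = x[0].length;
        double[][] center = new double[k][dim];
        int[] size = new int[k];
        for (int i = 0; i < x.length; i++) {
            size[label[i]]++;
            for (int t = 0; t < dim; t++) {
                center[label[i]][t] += x[i][t];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int t = 0; t < dim; t++) {
                center[c][t] /= size[c]; // u_c: mean of the cluster's points
            }
        }
        // avg(C_c): mean pairwise distance inside cluster c.
        double[] avg = new double[k];
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                if (label[i] == label[j]) {
                    avg[label[i]] += dist(x[i], x[j]);
                }
            }
        }
        for (int c = 0; c < k; c++) {
            avg[c] *= 2.0 / (size[c] * (size[c] - 1.0));
        }
        // For each cluster take its worst ratio against any other cluster, then average.
        double total = 0;
        for (int i = 0; i < k; i++) {
            double worst = 0;
            for (int j = 0; j < k; j++) {
                if (j != i) {
                    worst = Math.max(worst, (avg[i] + avg[j]) / dist(center[i], center[j]));
                }
            }
            total += worst;
        }
        return total / k;
    }

    public static void main(String[] args) {
        double[][] x = { { 0 }, { 2 }, { 10 }, { 12 } }; // made-up 1-D points
        int[] label = { 0, 0, 1, 1 };
        System.out.println(dbi(x, label, 2)); // (2 + 2) / 10 for both clusters
    }
}
```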

7) DI

$$DI=\min_{1\leq i\leq k}\left\{\min_{j\neq i}\left(\frac{d_{\min}(C_i,C_j)}{\max_{1\leq l\leq k}\mathrm{diam}(C_l)}\right)\right\}$$

For DI, larger is better.
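Because the denominator of DI is one global maximum diameter, the index reduces to "smallest between-cluster distance divided by largest within-cluster distance", which a single pass over all pairs can compute. The sketch below uses a hypothetical class `DunnIndexExample` and again assumes Euclidean distance, which the post does not specify.

```java
/** Dunn Index. Hypothetical helper; Euclidean distance is an assumption. */
final class DunnIndexExample {

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int t = 0; t < a.length; t++) {
            sum += (a[t] - b[t]) * (a[t] - b[t]);
        }
        return Math.sqrt(sum);
    }

    /** DI for points x with cluster labels; needs at least two clusters and one cluster with two points. */
    static double dunn(double[][] x, int[] label) {
        double minBetween = Double.POSITIVE_INFINITY; // min over i != j of d_min(C_i, C_j)
        double maxDiameter = 0;                        // max over l of diam(C_l)
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                double d = dist(x[i], x[j]);
                if (label[i] == label[j]) {
                    maxDiameter = Math.max(maxDiameter, d);
                } else {
                    minBetween = Math.min(minBetween, d);
                }
            }
        }
        return minBetween / maxDiameter;
    }

    public static void main(String[] args) {
        double[][] x = { { 0 }, { 2 }, { 10 }, { 12 } }; // made-up 1-D points
        int[] label = { 0, 0, 1, 1 };
        System.out.println(dunn(x, label)); // closest gap 8, largest diameter 2
    }
}
```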
