4.5 健壮的链接型聚类（ROCK）-码农场

4.5.1 ROCK简介
4.5.2 为什么ROCK这么强大？

原书译为“鲁棒的链接型聚类”，这个“鲁棒”真的太绅（henn）士(tai)了。我查了下，“鲁棒”来源于英文“Robust”，意为“强健的，稳固的，耐用的，粗野的，浓的”等。“Robustness”的一般含义是“强度, 坚固性”。我还是接受维基百科的翻译，译作“健壮”的比较好。

与前两种聚类算法不同，这个算法适用于类别型数据，比如关键字、布尔值和枚举值等，另外，它还能很好地处理大型数据集。

这一节的测试数据：

4.5.1 ROCK简介

主程序：

package com.hankcs;


import iweb2.ch4.clustering.hierarchical.Dendrogram;
import iweb2.ch4.clustering.rock.ROCKAlgorithm;
import iweb2.ch4.data.MyDiggSpaceData;
import iweb2.ch4.data.MyDiggSpaceDataset;
import iweb2.ch4.model.DataPoint;

public class ch4_4_Rock
{
    public static void main(String[] args) throws Exception
    {
        // 加载Digg新闻，只是用前15条
        MyDiggSpaceDataset ds = MyDiggSpaceData.createDataset(15);

        DataPoint[] dps = ds.getData();

        // 初始化ROCK，寻找5个聚类
        ROCKAlgorithm rock = new ROCKAlgorithm(dps, 5, 0.2);

        Dendrogram dnd = rock.cluster();

        dnd.print(21);
    }
}

输出：

From file: C:/iWeb2/data/ch04/ch4_digg_stories.csv
Loaded 10 users.
Loaded 48 stories (items).

Number of clusters: 23
Clusters for: level=21, Goodness=1.044451296741812
____________________________________________________________

{5619782:Lack Of Democracy on Digg and Wikipedia?,
5611165:Lack Of Democracy on Digg and Wikipedia?}
____________________________________________________________


____________________________________________________________

{5598008:The Confederacy's Special Agent,
5613383:The Confederacy's Special Agent,
5613380:The Confederacy's Special Agent,
5142233:The Confederacy's Special Agent,
5620839:The Confederacy's Special Agent,
5586183:The Confederacy's Special Agent,
5610584:The Confederacy's Special Agent}
____________________________________________________________


____________________________________________________________

{5571841:Lack Of Democracy on Digg and Wikipedia?,
5543643:Lack Of Democracy on Digg and Wikipedia?}
____________________________________________________________


____________________________________________________________

{5585930:Microsoft, The Jekyll And Hyde Of Companies,
5524441:Microsoft, The Jekyll And Hyde Of Companies,
5609070:Microsoft, The Jekyll And Hyde Of Companies,
5618201:Microsoft, The Jekyll And Hyde Of Companies,
5620878:Microsoft, The Jekyll And Hyde Of Companies,
5609797:Microsoft, The Jekyll And Hyde Of Companies}
____________________________________________________________


____________________________________________________________

{5608052:Contract Free on AT&T,
5620493:Contract Free on AT&T,
5621623:Contract Free on AT&T,
4955184:Contract Free on AT&T,
5594161:Contract Free on AT&T}
____________________________________________________________


____________________________________________________________

{5607788:Recycle or go to Hell, warns Vatican -- part I,
5592940:Recycle or go to Hell, warns Vatican -- part II,
5618262:Recycle or go to Hell, warns Vatican -- part III,
5595841:Recycle or go to Hell, warns Vatican --- part IV}
____________________________________________________________

可见聚类是成功的，类似part I等等的相似文章被归入同一类。

4.5.2 为什么ROCK这么强大？

ROCK的核心思想是利用链接作为相似性的度量，而不是仅仅依赖于距离。

package iweb2.ch4.clustering.rock;

import iweb2.ch4.clustering.hierarchical.Dendrogram;
import iweb2.ch4.model.Cluster;
import iweb2.ch4.model.DataPoint;
import iweb2.ch4.similarity.JaccardCoefficient;
import iweb2.ch4.similarity.SimilarityMeasure;

import java.util.ArrayList;
import java.util.List;

public class ROCKAlgorithm
{

    private DataPoint[] points;
    private int k;
    private double th;

    /**
     * 相似度矩阵
     */
    private SimilarityMeasure similarityMeasure;

    /**
     * 链接矩阵
     */
    private LinkMatrix linkMatrix;

    /**
     *
     * @param points 要聚类的数据点
     * @param k 聚类的最小数量
     * @param th 创建链接的阀值
     */
    public ROCKAlgorithm(DataPoint[] points, int k, double th)
    {
        this.points = points;
        this.k = k;
        this.th = th;
        this.similarityMeasure = new JaccardCoefficient();
        //this.similarityMeasure = new CosineSimilarity();
        this.linkMatrix = new LinkMatrix(points, similarityMeasure, th);
    }

    /**
     * 聚类
     * @return 树状图
     */
    public Dendrogram cluster()
    {

        //  Create a new cluster out of every point.
        // 初始化，为每个数据点创建一个新的聚类
        List<Cluster> initialClusters = new ArrayList<Cluster>();
        for (int i = 0, n = points.length; i < n; i++)
        {
            Cluster cluster = new Cluster(points[i]);
            initialClusters.add(cluster);
        }
        double g = Double.POSITIVE_INFINITY;
        Dendrogram dnd = new Dendrogram("Goodness");
        dnd.addLevel(String.valueOf(g), initialClusters);

        // 创建一个度量值，判断两个聚类是否应该合并，度量值越大就越应该合并
        MergeGoodnessMeasure goodnessMeasure = new MergeGoodnessMeasure(th);

        // 封装了所有根据度量值识别最佳聚类所需的数据和算法
        ROCKClusters allClusters = new ROCKClusters(initialClusters,
                                                    linkMatrix,
                                                    goodnessMeasure);

        int nClusters = allClusters.size();
        while (nClusters > k)   // 如果聚类数量等于所需的数量则终止
        {
            int nClustersBeforeMerge = nClusters;
            g = allClusters.mergeBestCandidates();
            nClusters = allClusters.size();
            if (nClusters == nClustersBeforeMerge)
            {
                // there are no linked clusters to merge
                // 前后两次迭代中聚类的数量没有变化，终止
                break;
            }
            dnd.addLevel(String.valueOf(g), allClusters.getAllClusters());
        }

        System.out.println("Number of clusters: " + allClusters.getAllClusters().size());
        return dnd;
    }

    public static void main(String[] args)
    {
        //Define data
        DataPoint[] elements = new DataPoint[4];
        elements[0] = new DataPoint("Doc1", new String[]{"book"});
        elements[1] = new DataPoint("Doc2", new String[]{"water", "sun", "sand", "swim"});
        elements[2] = new DataPoint("Doc3", new String[]{"water", "sun", "swim", "read"});
        elements[3] = new DataPoint("Doc4", new String[]{"read", "sand"});

        int k = 1;
        double th = 0.2;
        ROCKAlgorithm rock = new ROCKAlgorithm(elements, k, th);
        Dendrogram dnd = rock.cluster();
        dnd.printAll();
    }


    public int getK()
    {
        return k;
    }


    public double getTh()
    {
        return th;
    }


    public SimilarityMeasure getSimilarityMeasure()
    {
        return similarityMeasure;
    }


    public LinkMatrix getLinkMatrix()
    {
        return linkMatrix;
    }

}

MergeGoodnessMeasure封装了评判聚类优劣的依据，它要防止的前车之鉴是“连锁效应”（两个很近的点导致两个聚类合并）。为了实现这个目标，ROCK算法使用了链接。链接指的是，两个点具有共同邻居则它们之间有链接。MergeGoodnessMeasure评判是否应当聚类的依据是，两个聚类中数据点之间的链接数量，具体说来，聚合的好坏程度，等于X、Y之间的链接数量 / X、Y期望的链接数量的积。为何使用这么复杂的公式的原因是，要消除“某个别的点之间的链接数量很大导致XY之间链接数量很大”的连锁效应。

package iweb2.ch4.clustering.rock;


/**
 * Goodness measure for merging two clusters.
 * 评判最优聚类的依据
 */
public class MergeGoodnessMeasure
{

    /**
     * Threshold value that was used to identify neighbors among points.
      * 识别邻居的阀值
     */
    private double linkThreshold;

    /**
     * Intermediate value that is used in calculation of goodness measure 
     * and stays the same for different clusters.
     * 在最优聚类计算过程中保持不变的中间值
     */
    private double p;

    public MergeGoodnessMeasure(double th)
    {
        this.linkThreshold = th;
        this.p = 1.0 + 2.0 * f(th);
    }

    /**
     *
     * @param nLinks 聚类X、Y之间的链接数量
     * @param nX X中数据点的数量
     * @param nY Y中数据点的数量
     * @return 聚合的好坏程度，等于X、Y之间的链接数量 / X、Y期望的链接数量的积
     */
    public double g(int nLinks, int nX, int nY)
    {
        double a = Math.pow(nX + nY, p);
        // 估计X和Y中的链接总数
        double b = Math.pow(nX, p);
        double c = Math.pow(nY, p);

        return nLinks / (a - b - c);
    }


    /**
     * 计算中间值
     * @param th 阀值
     * @return 中间值
     */
    private double f(double th)
    {

        /*
         * This implementation assumes that linkThreshold was a threshold for
         * similarity measure (as opposed to dissimilarity/distance). 
         */
        return (1.0 - th) / (1.0 + th);
    }

    /**
     * @return the linkThreshold
     */
    public double getTh()
    {
        return linkThreshold;
    }

    public void setTh(double th)
    {
        this.linkThreshold = th;
    }
}

ROCKClusters 从所有度量值中找出最大值，并且合并它对应的两个聚类，ROCKAlgorithm不断重复这一操作直至聚类的数量满足要求。

知识共享署名-非商业性使用-相同方式共享：码农场 » 4.5 健壮的链接型聚类（ROCK）