
4.6 DBSCAN


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm designed for data that contains noise.

4.6.1 Introduction to Density-Based Algorithms

From everyday experience, circling the points where density concentrates yields a cluster, while the scattered points left over are called noise. The goal of the DBSCAN algorithm is to discover both the clusters and the noise in a data set.

Main program:

package com.hankcs;


import iweb2.ch4.clustering.dbscan.DBSCANAlgorithm;
import iweb2.ch4.data.MyDiggSpaceData;
import iweb2.ch4.data.MyDiggSpaceDataset;
import iweb2.ch4.model.DataPoint;
import iweb2.ch4.similarity.CosineDistance;

public class ch4_4_DBSCAN
{
    public static void main(String[] args) throws Exception
    {
        // Load the top 15 Digg stories
        MyDiggSpaceDataset ds = MyDiggSpaceData.createDataset(15);

        DataPoint[] dps = ds.getData();

        CosineDistance cosD = new CosineDistance();

        // Initialize the DBSCAN algorithm
        DBSCANAlgorithm dbscan = new DBSCANAlgorithm(dps, cosD, 0.8, 2, true);

        dbscan.cluster();
    }
}

Output:

Clusters:
____________________________________________________________

1: 
{5608576:A Facebook Application To Find Blood Donors Fast,
5611687:A Facebook Application To Find Blood Donors Fast,
5605887:A Facebook Application To Find Blood Donors Fast}
____________________________________________________________


____________________________________________________________

2: 
{5598008:The Confederacy's Special Agent,
5142233:The Confederacy's Special Agent,
5613380:The Confederacy's Special Agent,
5610584:The Confederacy's Special Agent,
5613383:The Confederacy's Special Agent,
5620839:The Confederacy's Special Agent,
5586183:The Confederacy's Special Agent}
____________________________________________________________


____________________________________________________________

3: 
{5620878:Microsoft, The Jekyll And Hyde Of Companies,
5618201:Microsoft, The Jekyll And Hyde Of Companies,
5585930:Microsoft, The Jekyll And Hyde Of Companies,
5609797:Microsoft, The Jekyll And Hyde Of Companies,
5609070:Microsoft, The Jekyll And Hyde Of Companies,
5524441:Microsoft, The Jekyll And Hyde Of Companies}
____________________________________________________________


____________________________________________________________

4: 
{5594161:Contract Free on AT&T,
4955184:Contract Free on AT&T,
5608052:Contract Free on AT&T,
5621623:Contract Free on AT&T,
5579109:Contract Free on AT&T,
5620493:Contract Free on AT&T}
____________________________________________________________


____________________________________________________________

5: 
{5607863:Lack Of Democracy on Digg and Wikipedia?,
5571841:Lack Of Democracy on Digg and Wikipedia?,
5619782:Lack Of Democracy on Digg and Wikipedia?,
5611165:Lack Of Democracy on Digg and Wikipedia?,
5543643:Lack Of Democracy on Digg and Wikipedia?}
____________________________________________________________


____________________________________________________________

6: 
{5613023:How Traffic Jams Occur : Simulation,
5481876:How Traffic Jams Occur : Simulation}
____________________________________________________________


____________________________________________________________

7: 
{5617459:Robotic drumstick keeps novices on the beat,
5619693:Robotic drumstick keeps novices on the beat}
____________________________________________________________


____________________________________________________________

8: 
{5625315:Obama Accuses Clinton of Using ""Republican Tactics"",
5617998:Obama: ""I Am NOT Running for Vice President""}
____________________________________________________________


____________________________________________________________

9: 
{5618262:Recycle or go to Hell, warns Vatican -- part III,
5592940:Recycle or go to Hell, warns Vatican -- part II,
5607788:Recycle or go to Hell, warns Vatican -- part I,
5595841:Recycle or go to Hell, warns Vatican --- part IV}
____________________________________________________________


Noise Elements:
 {5619818:A Facebook Application To Find Blood Donors Fast,
5522983:Smoking Monkey[flash],
5612810:Super Mario Bros Inspired Wii with USB base [ Pics ],
5609833:NSA's Domestic Spying Grows As Agency Sweeps Up Data,
5592473:Maryland police officers refuse to pay speeding tickets,
5625339:Lawmaker's Attempt to Criminalize Anonymous Posting Doomed,
5610213:Senate panel critiques prewar claims by White House,
5610081:Digg's Algo Change Cut Promotions by 38%,
5604438:Archaeologists Unveil Finds in Rome Digs,
5614085:House Files Contempt Lawsuit Against Bush Officials,
5622802:House Democrats Defy White House on Spying Program}

Process finished with exit code 0

Notably, the eighth cluster is identified very successfully, whereas the robust link-based clustering algorithm (ROCK) of the previous section, 4.5, failed to find it at all. The result is not flawless, though: 5619818:A Facebook Application To Find Blood Donors Fast should have belonged to the first cluster instead of being labeled noise.
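The misassignment is easy to reproduce in miniature. The distances below are made up for illustration (they are not the actual cosine distances from the Digg data, and all names here are hypothetical): a story whose distance to every member of a cluster barely exceeds eps has no neighbor besides itself, so it can neither be a core point nor be reached from one, and DBSCAN labels it noise.

```java
public class BorderlineDemo {
    // Counts how many entries of a distance row fall within eps
    // (including the zero distance to the point itself).
    static int countNeighbors(double[] distRow, double eps) {
        int count = 0;
        for (double d : distRow) if (d <= eps) count++;
        return count;
    }

    public static void main(String[] args) {
        double eps = 0.8;
        int minPoints = 2;
        // Hypothetical distances from a borderline story to three
        // near-duplicate stories; the last entry is the point itself.
        double[] toClusterMembers = {0.85, 0.83, 0.9, 0.0};
        int neighbors = countNeighbors(toClusterMembers, eps);
        // Only the point itself lies within eps, so it cannot be a core
        // point and no core point reaches it: DBSCAN calls it noise.
        System.out.println(neighbors < minPoints ? "noise" : "clustered");
        // prints "noise"
    }
}
```

Nudging eps up slightly (or cleaning the text so the cosine distances shrink) would pull such a point into the cluster, at the risk of merging unrelated stories.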

4.6.2 How DBSCAN Works

If the eps-neighborhood of a point p contains at least minPoints points, p is called a core point. If p lies within the eps-neighborhood of q and q is a core point, then p is said to be directly density-reachable from q. These are the basic concepts needed to understand the code.

The algorithm gathers each core point, together with the core points among its neighbors, into a cluster, and assigns the points reachable from no core point to noise, continuing until every point has been classified.
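As a bridge to the full listing, the loop just described can be sketched in a compact, self-contained form. Everything here (the class name, the NOISE/UNVISITED label encoding) is illustrative rather than part of the iweb2 library; like the library, it works on a precomputed distance matrix.

```java
import java.util.*;

public class MiniDbscan {
    static final int NOISE = -1, UNVISITED = 0;

    // Returns a cluster label per point: -1 = noise, >0 = cluster id.
    static int[] cluster(double[][] dist, double eps, int minPts) {
        int n = dist.length;
        int[] label = new int[n];  // all start as UNVISITED (0)
        int nextId = 0;
        for (int p = 0; p < n; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> seeds = neighbors(dist, p, eps);
            if (seeds.size() < minPts) { label[p] = NOISE; continue; }  // not core
            int id = ++nextId;
            label[p] = id;
            // Expand the cluster outward from the core point's neighborhood
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = id;   // border point: adopt it
                if (label[q] != UNVISITED) continue;    // already handled
                label[q] = id;
                List<Integer> qn = neighbors(dist, q, eps);
                if (qn.size() >= minPts) queue.addAll(qn);  // q is core: expand
            }
        }
        return label;
    }

    static List<Integer> neighbors(double[][] dist, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < dist.length; j++)
            if (dist[i][j] <= eps) out.add(j);  // includes i itself
        return out;
    }

    public static void main(String[] args) {
        // Same toy distance matrix as the main() at the end of the listing
        double[][] a = {
                {0, 1, 2, 2, 3},
                {1, 0, 2, 4, 3},
                {2, 2, 0, 1, 5},
                {2, 4, 1, 0, 3},
                {3, 3, 5, 3, 0}
        };
        System.out.println(Arrays.toString(cluster(a, 2.0, 2)));
        // prints "[1, 1, 1, 1, -1]": A-D form one cluster, E is noise
    }
}
```

The full implementation below follows the same structure, with the extra bookkeeping needed to manage named clusters, an object-to-index mapping, and term-frequency distance computation.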

package iweb2.ch4.clustering.dbscan;

import iweb2.ch4.model.Cluster;
import iweb2.ch4.model.DataPoint;
import iweb2.ch4.similarity.Distance;
import iweb2.ch4.utils.ObjectToIndexMapping;
import iweb2.ch4.utils.TermFrequencyBuilder;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Implementation of DBSCAN clustering algorithm.
 * <p>
 * Algorithm parameters:
 * <ol>
 * <li>Eps - threshold value to determine point neighbors. Two points are
 * neighbors if the distance between them does not exceed this threshold value.
 * </li>
 * <li>MinPts - minimum number of neighboring points required for a point to
 * be considered a core point.</li>
 * </ol>
 * Choice of parameter values depends on the data.
 * </p>
 * <p/>
 * Point types:
 * <ol>
 * <li>Core point - point that belongs to the core of the cluster. It has at least
 * MinPts neighboring points.</li>
 * <li>Border point - is a neighbor to at least one core point but it doesn't
 * have enough neighbors to be a core point.</li>
 * <li>Noise point - is a point that doesn't belong to any cluster because it is
 * not close to any of the core points.</li>
 * </ol>
 */
public class DBSCANAlgorithm
{

    /*
     * Data points for clustering.
     */
    private DataPoint[] points;

    /*
     * Adjacency matrix. Contains distances between points.
     */
    private double[][] adjacencyMatrix;

    /*
     * Threshold value. Determines which points will be considered as
     * neighbors. Two points are neighbors if the distance between them does 
     * not exceed threshold value.
     */
    private double eps;

    /*
     * Identifies a set of Noise points.
     */
    private int CLUSTER_ID_NOISE = -1;

    /*
     * Identifies a set of Unclassified points.
     */
    private int CLUSTER_ID_UNCLASSIFIED = 0;

    /*
     * Sequence that is used to generate next cluster id.
     */
    private int nextClusterId = 1;

    /*
     * Sets of points. Initially all points will be assigned into 
     * Unclassified points set.
     */
    private Map<Integer, Set<DataPoint>> clusters =
            new LinkedHashMap<Integer, Set<DataPoint>>();


    /*
     * Number of points that should exist in the neighborhood for a point
     * to be a core point.
     * 
     * Best value for this parameter depends on the data set.
     */
    private int minPoints;


    private ObjectToIndexMapping<DataPoint> idxMapping =
            new ObjectToIndexMapping<DataPoint>();


    private boolean verbose = true;

    /**
     * Initializes algorithm with all data that it needs.
     *
     * @param points          points to cluster
     * @param adjacencyMatrix adjacency matrix with distances
     * @param eps             distance threshold value used as the neighborhood radius
     * @param minPoints       minimum number of neighbors required for a point to be considered a core point
     */
    public DBSCANAlgorithm(DataPoint[] points, double[][] adjacencyMatrix, double eps, int minPoints)
    {
        init(points, eps, minPoints);
        this.adjacencyMatrix = adjacencyMatrix;
    }

    /**
     * Initializes algorithm with all data that it needs.
     *
     * @param points    all points to cluster
     * @param distance  metric distance function
     * @param eps       threshold value used to calculate point neighborhood.
     * @param minPoints number of neighbors for point to be considered a core point.
     */
    public DBSCANAlgorithm(DataPoint[] points,
                           Distance distance,
                           double eps,
                           int minPoints, boolean useTermFrequencies)
    {

        init(points, eps, minPoints);
        this.adjacencyMatrix =
                calculateAdjacencyMatrix(distance, points, useTermFrequencies);
    }


    private static double[][] calculateAdjacencyMatrix(Distance distance,
                                                       DataPoint[] points,
                                                       boolean useTermFrequencies)
    {
        int n = points.length;
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++)
        {
            double[] x = points[i].getNumericAttrValues();
            for (int j = i + 1; j < n; j++)
            {
                double[] y;
                if (useTermFrequencies)
                {
                    double[][] tfVectors =
                            TermFrequencyBuilder.buildTermFrequencyVectors(
                                    points[i].getTextAttrValues(),
                                    points[j].getTextAttrValues());
                    x = tfVectors[0];
                    y = tfVectors[1];
                }
                else
                {
                    y = points[j].getNumericAttrValues();
                }
                a[i][j] = distance.getDistance(x, y);
                a[j][i] = a[i][j];
            }
            a[i][i] = 0.0;
        }
        return a;
    }

    private void init(DataPoint[] points, double neighborThreshold, int minPoints)
    {

        this.points = points;
        this.eps = neighborThreshold;
        this.minPoints = minPoints;

        for (DataPoint p : points)
        {
            // Creating a Point <-> Index mapping for all points
            idxMapping.getIndex(p);
            // Assign all points into "Unclassified" group
            assignPointToCluster(p, CLUSTER_ID_UNCLASSIFIED);
        }
    }

    public List<Cluster> cluster()
    {
        int clusterId = getNextClusterId();

        for (DataPoint p : points)
        {
            if (isUnclassified(p))
            {

                // For every point that is still unclassified, try to create a cluster around it
                boolean isClusterCreated = createCluster(p, clusterId);

                if (isClusterCreated)
                {
                    // Generate id for the next cluster                    
                    clusterId = getNextClusterId();
                }
            }
        }

        // Convert sets of points into clusters...
        List<Cluster> allClusters = new ArrayList<Cluster>();

        for (Map.Entry<Integer, Set<DataPoint>> e : clusters.entrySet())
        {

            String label = String.valueOf(e.getKey());

            Set<DataPoint> points = e.getValue();

            if (points != null && !points.isEmpty())
            {

                Cluster cluster = new Cluster(label, e.getValue());

                allClusters.add(cluster);
            }
        }

        if (verbose)
        {
            printResults(allClusters);
        }

        // Group with Noise elements returned as well
        return allClusters;
    }

    /**
     * Tries to create a cluster around point p.
     * @param p candidate core point
     * @param clusterId id to assign to the new cluster
     * @return true if a cluster was created; false if p turned out to be a noise or border point
     */
    private boolean createCluster(DataPoint p, Integer clusterId)
    {

        boolean isClusterCreated = false;

        // Find the neighbors within the eps neighborhood
        Set<DataPoint> nPoints = findNeighbors(p, eps);

        if (nPoints.size() < minPoints)
        {
            // Assign point into "Noise" group. 
            // It will have a chance to become a border point later on.
            assignPointToCluster(p, CLUSTER_ID_NOISE);

            // return false to indicate that we didn't create any cluster
            isClusterCreated = false;

        }
        else
        {

            // All points in the neighborhood are reachable from the
            // core point, so assign them all to the cluster.
            assignPointToCluster(nPoints, clusterId);

            // Remove the point itself from the set of neighbors
            // that still need to be examined.
            nPoints.remove(p);

            // Process the rest of the neighbors...
            while (nPoints.size() > 0)
            {

                // pick the first neighbor
                DataPoint nPoint = nPoints.iterator().next();

                // Find the neighbor's own eps-neighborhood to decide
                // whether other points are directly density-reachable from it
                Set<DataPoint> nnPoints = findNeighbors(nPoint, eps);

                if (nnPoints.size() >= minPoints)
                {

                    // nPoint is another core point.
                    for (DataPoint nnPoint : nnPoints)
                    {

                        if (isNoise(nnPoint))
                        {
                            /* It's a border point. We know that it doesn't have
                             * enough neighbors to be a core point. Just add it
                             * to the cluster.
                             */
                            assignPointToCluster(nnPoint, clusterId);

                        }
                        else if (isUnclassified(nnPoint))
                        {

                            /*
                             * We don't know if this point has enough neighbors
                             * to be a core point... add it to the list of points
                             * to be checked.
                             */
                            nPoints.add(nnPoint);
                            
                            /*
                             * And assign it to the cluster
                             */
                            assignPointToCluster(nnPoint, clusterId);
                        }
                    }
                }
                else
                {
                    // do nothing. The neighbor is just a border point.
                }

                // Remove the examined point before moving on to the
                // next point in p's neighborhood.
                nPoints.remove(nPoint);
            }

            // return true to indicate that we did create a cluster
            isClusterCreated = true;
        }

        return isClusterCreated;
    }

    private Set<DataPoint> findNeighbors(DataPoint p, double threshold)
    {
        Set<DataPoint> neighbors = new HashSet<DataPoint>();
        int i = idxMapping.getIndex(p);
        for (int j = 0, n = idxMapping.getSize(); j < n; j++)
        {
            if (adjacencyMatrix[i][j] <= threshold)
            {
                neighbors.add(idxMapping.getObject(j));
            }
        }
        return neighbors;
    }

    private boolean isUnclassified(DataPoint p)
    {
        return isPointInCluster(p, CLUSTER_ID_UNCLASSIFIED);

    }

    private boolean isNoise(DataPoint p)
    {
        return isPointInCluster(p, CLUSTER_ID_NOISE);
    }

    private boolean isPointInCluster(DataPoint p, int clusterId)
    {
        boolean inCluster = false;
        Set<DataPoint> points = clusters.get(clusterId);
        if (points != null)
        {
            inCluster = points.contains(p);
        }
        return inCluster;
    }

    private void assignPointToCluster(Set<DataPoint> points, int clusterId)
    {
        for (DataPoint p : points)
        {
            assignPointToCluster(p, clusterId);
        }
    }

    private void assignPointToCluster(DataPoint p, int clusterId)
    {

        // Remove point from the group that it currently belongs to...
        if (isNoise(p))
        {
            removePointFromCluster(p, CLUSTER_ID_NOISE);
        }
        else if (isUnclassified(p))
        {
            removePointFromCluster(p, CLUSTER_ID_UNCLASSIFIED);
        }
        else
        {
            if (clusterId != CLUSTER_ID_UNCLASSIFIED)
            {
                throw new RuntimeException(
                        "Trying to move point that has already been " +
                                "assigned to some other cluster. Point: " + p + ", clusterId=" + clusterId);
            }
            else
            {
                // do nothing. we are registering a brand new point in UNCLASSIFIED set.
            }
        }

        Set<DataPoint> points = clusters.get(clusterId);
        if (points == null)
        {
            points = new HashSet<DataPoint>();
            clusters.put(clusterId, points);
        }
        points.add(p);
    }

    private boolean removePointFromCluster(DataPoint p, int clusterId)
    {
        boolean removed = false;
        Set<DataPoint> points = clusters.get(clusterId);
        if (points != null)
        {
            removed = points.remove(p);
        }
        return removed;
    }

    private int getNextClusterId()
    {
        return nextClusterId++;
    }


    public void printDistances()
    {
        System.out.println("Point Similarity matrix:");
        for (int i = 0; i < adjacencyMatrix.length; i++)
        {
            System.out.println(Arrays.toString(adjacencyMatrix[i]));
        }
    }

    private void printResults(List<Cluster> allClusters)
    {
        System.out.println("DBSCAN Clustering with NeighborThreshold=" +
                                   this.eps + " minPoints=" + this.minPoints);
        System.out.println("Clusters:");
        String noiseElements = "no noise elements";
        for (Cluster c : allClusters)
        {
            if (String.valueOf(CLUSTER_ID_NOISE).equals(c.getLabel()))
            {
                // print noise data at the end
                noiseElements = c.getElementsAsString();
            }
            else
            {
                System.out.println("____________________________________________________________\n");
                System.out.println(c.getLabel() + ": \n" + c.getElementsAsString());
                System.out.println("____________________________________________________________\n\n");
            }
        }
        System.out.println("Noise Elements:\n " + noiseElements);
    }

    public static void main(String[] args)
    {

        DataPoint[] elements = new DataPoint[5];
        elements[0] = new DataPoint("A", new double[]{});
        elements[1] = new DataPoint("B", new double[]{});
        elements[2] = new DataPoint("C", new double[]{});
        elements[3] = new DataPoint("D", new double[]{});
        elements[4] = new DataPoint("E", new double[]{});

        double[][] a = new double[][]{
                {0, 1, 2, 2, 3},
                {1, 0, 2, 4, 3},
                {2, 2, 0, 1, 5},
                {2, 4, 1, 0, 3},
                {3, 3, 5, 3, 0}
        };

        double threshold = 2.0;
        int minPoints = 2;

        DBSCANAlgorithm dbscan = new DBSCANAlgorithm(elements, a, threshold, minPoints);
        dbscan.cluster();
    }

}

Creative Commons Attribution-NonCommercial-ShareAlike License · 码农场 » 4.6 DBSCAN
