DBSCAN是一种基于密度的空间聚类算法,适用于含噪声的数据。
4.6.1 基于密度的算法简介
生活经验,将密度集中点圈起来就成为一个聚类,而那些分散的点,则被称为噪声。DBSCAN算法的目的是发现数据集中的聚类和噪声。
主程序:
package com.hankcs;
import iweb2.ch4.clustering.dbscan.DBSCANAlgorithm;
import iweb2.ch4.data.MyDiggSpaceData;
import iweb2.ch4.data.MyDiggSpaceDataset;
import iweb2.ch4.model.DataPoint;
import iweb2.ch4.similarity.CosineDistance;
public class ch4_4_DBSCAN
{
public static void main(String[] args) throws Exception
{
// 加载前15个Digg的新闻
MyDiggSpaceDataset ds = MyDiggSpaceData.createDataset(15);
DataPoint[] dps = ds.getData();
CosineDistance cosD = new CosineDistance();
// 初始化DBSCAN算法
DBSCANAlgorithm dbscan = new DBSCANAlgorithm(dps, cosD, 0.8, 2, true);
dbscan.cluster();
}
}
输出:
Clusters:
____________________________________________________________
1:
{5608576:A Facebook Application To Find Blood Donors Fast,
5611687:A Facebook Application To Find Blood Donors Fast,
5605887:A Facebook Application To Find Blood Donors Fast}
____________________________________________________________
____________________________________________________________
2:
{5598008:The Confederacy's Special Agent,
5142233:The Confederacy's Special Agent,
5613380:The Confederacy's Special Agent,
5610584:The Confederacy's Special Agent,
5613383:The Confederacy's Special Agent,
5620839:The Confederacy's Special Agent,
5586183:The Confederacy's Special Agent}
____________________________________________________________
____________________________________________________________
3:
{5620878:Microsoft, The Jekyll And Hyde Of Companies,
5618201:Microsoft, The Jekyll And Hyde Of Companies,
5585930:Microsoft, The Jekyll And Hyde Of Companies,
5609797:Microsoft, The Jekyll And Hyde Of Companies,
5609070:Microsoft, The Jekyll And Hyde Of Companies,
5524441:Microsoft, The Jekyll And Hyde Of Companies}
____________________________________________________________
____________________________________________________________
4:
{5594161:Contract Free on AT&T,
4955184:Contract Free on AT&T,
5608052:Contract Free on AT&T,
5621623:Contract Free on AT&T,
5579109:Contract Free on AT&T,
5620493:Contract Free on AT&T}
____________________________________________________________
____________________________________________________________
5:
{5607863:Lack Of Democracy on Digg and Wikipedia?,
5571841:Lack Of Democracy on Digg and Wikipedia?,
5619782:Lack Of Democracy on Digg and Wikipedia?,
5611165:Lack Of Democracy on Digg and Wikipedia?,
5543643:Lack Of Democracy on Digg and Wikipedia?}
____________________________________________________________
____________________________________________________________
6:
{5613023:How Traffic Jams Occur : Simulation,
5481876:How Traffic Jams Occur : Simulation}
____________________________________________________________
____________________________________________________________
7:
{5617459:Robotic drumstick keeps novices on the beat,
5619693:Robotic drumstick keeps novices on the beat}
____________________________________________________________
____________________________________________________________
8:
{5625315:Obama Accuses Clinton of Using ""Republican Tactics"",
5617998:Obama: ""I Am NOT Running for Vice President""}
____________________________________________________________
____________________________________________________________
9:
{5618262:Recycle or go to Hell, warns Vatican -- part III,
5592940:Recycle or go to Hell, warns Vatican -- part II,
5607788:Recycle or go to Hell, warns Vatican -- part I,
5595841:Recycle or go to Hell, warns Vatican --- part IV}
____________________________________________________________
Noise Elements:
{5619818:A Facebook Application To Find Blood Donors Fast,
5522983:Smoking Monkey[flash],
5612810:Super Mario Bros Inspired Wii with USB base [ Pics ],
5609833:NSA's Domestic Spying Grows As Agency Sweeps Up Data,
5592473:Maryland police officers refuse to pay speeding tickets,
5625339:Lawmaker's Attempt to Criminalize Anonymous Posting Doomed,
5610213:Senate panel critiques prewar claims by White House,
5610081:Digg's Algo Change Cut Promotions by 38%,
5604438:Archaeologists Unveil Finds in Rome Digs,
5614085:House Files Contempt Lawsuit Against Bush Officials,
5622802:House Democrats Defy White House on Spying Program}
Process finished with exit code 0
值得注意的是,第八个聚类的识别非常成功,而前一节的4.5 健壮的链接型聚类(ROCK)则根本没有识别出来。但是它也有瑕疵,5619818:A Facebook Application To Find Blood Donors Fast本应当属于第一个聚类。
4.6.2 DBSCAN的原理

如果点p的eps邻域里至少有minPoints个点,则称其为核心点。如果p在q的eps邻域内,并且,q是核心点,则称数据点q基于密度可直接达到数据点p。这是理解代码所必须的基本概念。
算法将每一个核心点和它邻居里面的核心点归入一个聚类,将其他的点归入噪声,直到所有点都被聚类为止。
package iweb2.ch4.clustering.dbscan;
import iweb2.ch4.model.Cluster;
import iweb2.ch4.model.DataPoint;
import iweb2.ch4.similarity.Distance;
import iweb2.ch4.utils.ObjectToIndexMapping;
import iweb2.ch4.utils.TermFrequencyBuilder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
/**
* Implementation of DBSCAN clustering algorithm.
* <p>
* Algorithm parameters:
* <ol>
* <li>Eps - threshold value to determine point neighbors. Two points are
* neighbors if the distance between them does not exceed this threshold value.
* </li>
* <li>MinPts - minimum number of points in any cluster.</li>
* </ol>
* Choice of parameter values depends on the data.
* </p>
* <p/>
* Point types:
* <ol>
* <li>Core point - point that belongs to the core of the cluster. It has at least
* MinPts neighboring points.</li>
* <li>Border point - is a neighbor to at least one core point but it doesn't
* have enough neighbors to be a core point.</li>
* <li>Noise point - is a point that doesn't belong to any cluster because it is
* not close to any of the core points.</li>
* </ol>
*/
public class DBSCANAlgorithm
{
/*
* Data points for clustering.
*/
private DataPoint[] points;
/*
* Adjacency matrix. Contains distances between points.
*/
private double[][] adjacencyMatrix;
/*
* Threshold value. Determines which points will be considered as
* neighbors. Two points are neighbors if the distance between them does
* not exceed threshold value.
*/
private double eps;
/*
* Identifies a set of Noise points.
*/
private int CLUSTER_ID_NOISE = -1;
/*
* Identifies a set of Unclassified points.
*/
private int CLUSTER_ID_UNCLASSIFIED = 0;
/*
* Sequence that is used to generate next cluster id.
*/
private int nextClusterId = 1;
/*
* Sets of points. Initially all points will be assigned into
* Unclassified points set.
*/
private Map<Integer, Set<DataPoint>> clusters =
new LinkedHashMap<Integer, Set<DataPoint>>();
/*
* Number of points that should exist in the neighborhood for a point
* to be a core point.
*
* Best value for this parameter depends on the data set.
*/
private int minPoints;
private ObjectToIndexMapping<DataPoint> idxMapping =
new ObjectToIndexMapping<DataPoint>();
private boolean verbose = true;
/**
* Initializes algorithm with all data that it needs.
*
* @param points points to cluster 存放数据
* @param adjacencyMatrix adjacency matrix with distances 邻接矩阵
* @param eps distance threshold value 阀值,作为半径
* @param minPoints number of neighbors for point to be considered a core point. 至少邻域内有多少个点才能称得上核心点
*/
public DBSCANAlgorithm(DataPoint[] points, double[][] adjacencyMatrix, double eps, int minPoints)
{
init(points, eps, minPoints);
this.adjacencyMatrix = adjacencyMatrix;
}
/**
* Initializes algorithm with all data that it needs.
*
* @param points all points to cluster
* @param distance metric distance function
* @param eps threshold value used to calculate point neighborhood.
* @param minPoints number of neighbors for point to be considered a core point.
*/
public DBSCANAlgorithm(DataPoint[] points,
Distance distance,
double eps,
int minPoints, boolean useTermFrequencies)
{
init(points, eps, minPoints);
this.adjacencyMatrix =
calculateAdjacencyMatrix(distance, points, useTermFrequencies);
}
private static double[][] calculateAdjacencyMatrix(Distance distance,
DataPoint[] points,
boolean useTermFrequencies)
{
int n = points.length;
double[][] a = new double[n][n];
for (int i = 0; i < n; i++)
{
double[] x = points[i].getNumericAttrValues();
for (int j = i + 1; j < n; j++)
{
double[] y;
if (useTermFrequencies)
{
double[][] tfVectors =
TermFrequencyBuilder.buildTermFrequencyVectors(
points[i].getTextAttrValues(),
points[j].getTextAttrValues());
x = tfVectors[0];
y = tfVectors[1];
}
else
{
y = points[j].getNumericAttrValues();
}
a[i][j] = distance.getDistance(x, y);
a[j][i] = a[i][j];
}
a[i][i] = 0.0;
}
return a;
}
private void init(DataPoint[] points, double neighborThreshold, int minPoints)
{
this.points = points;
this.eps = neighborThreshold;
this.minPoints = minPoints;
for (DataPoint p : points)
{
// Creating a Point <-> Index mappping for all points
idxMapping.getIndex(p);
// Assign all points into "Unclassified" group
assignPointToCluster(p, CLUSTER_ID_UNCLASSIFIED);
}
}
public List<Cluster> cluster()
{
int clusterId = getNextClusterId();
for (DataPoint p : points)
{
if (isUnclassified(p))
{
// 对每一个还未聚类的点,创建一个包含它的聚类
boolean isClusterCreated = createCluster(p, clusterId);
if (isClusterCreated)
{
// Generate id for the next cluster
clusterId = getNextClusterId();
}
}
}
// Convert sets of points into clusters...
List<Cluster> allClusters = new ArrayList<Cluster>();
for (Map.Entry<Integer, Set<DataPoint>> e : clusters.entrySet())
{
String label = String.valueOf(e.getKey());
Set<DataPoint> points = e.getValue();
if (points != null && !points.isEmpty())
{
Cluster cluster = new Cluster(label, e.getValue());
allClusters.add(cluster);
}
}
if (verbose)
{
printResults(allClusters);
}
// Group with Noise elements returned as well
return allClusters;
}
/**
* 为某个点p创建聚类
* @param p
* @param clusterId
* @return 是否是噪点或边界点
*/
private boolean createCluster(DataPoint p, Integer clusterId)
{
boolean isClusterCreated = false;
// 找到eps邻域内的邻居
Set<DataPoint> nPoints = findNeighbors(p, eps);
if (nPoints.size() < minPoints)
{
// Assign point into "Noise" group.
// It will have a chance to become a border point later on.
assignPointToCluster(p, CLUSTER_ID_NOISE);
// return false to indicate that we didn't create any cluster
isClusterCreated = false;
}
else
{
// All points are reachable from the core point...
// 加入聚类
assignPointToCluster(nPoints, clusterId);
// Remove point itself.
// 先从聚类里面删除自身
nPoints.remove(p);
// Process the rest of the neighbors...
// 处理其他邻居
while (nPoints.size() > 0)
{
// pick the first neighbor
DataPoint nPoint = nPoints.iterator().next();
// process neighbor
// 找到eps邻域,判断数据点是否基于密度可直达
Set<DataPoint> nnPoints = findNeighbors(nPoint, eps);
if (nnPoints.size() >= minPoints)
{
// nPoint is another core point.
// nPoint也是一个核心点
for (DataPoint nnPoint : nnPoints)
{
if (isNoise(nnPoint))
{
/* It's a border point. We know that it doesn't have
* enough neighbors to be a core point. Just add it
* to the cluster.
*/
// 如果是边界点,将其分配给相应的聚类
assignPointToCluster(nnPoint, clusterId);
}
else if (isUnclassified(nnPoint))
{
/*
* We don't know if this point has enough neighbors
* to be a core point... add it to the list of points
* to be checked.
*/
// 否则就要判断它是否是核心点,先将它添加到原来的列表中待处理
nPoints.add(nnPoint);
/*
* And assign it to the cluster
*/
assignPointToCluster(nnPoint, clusterId);
}
}
}
else
{
// do nothing. The neighbor is just a border point.
}
// 在继续检查p的邻域中的下一个数据点之前,先将已经检查完的数据点移除
nPoints.remove(nPoint);
}
// return true to indicate that we did create a cluster
isClusterCreated = true;
}
return isClusterCreated;
}
private Set<DataPoint> findNeighbors(DataPoint p, double threshold)
{
Set<DataPoint> neighbors = new HashSet<DataPoint>();
int i = idxMapping.getIndex(p);
for (int j = 0, n = idxMapping.getSize(); j < n; j++)
{
if (adjacencyMatrix[i][j] <= threshold)
{
neighbors.add(idxMapping.getObject(j));
}
}
return neighbors;
}
private boolean isUnclassified(DataPoint p)
{
return isPointInCluster(p, CLUSTER_ID_UNCLASSIFIED);
}
private boolean isNoise(DataPoint p)
{
return isPointInCluster(p, CLUSTER_ID_NOISE);
}
private boolean isPointInCluster(DataPoint p, int clusterId)
{
boolean inCluster = false;
Set<DataPoint> points = clusters.get(clusterId);
if (points != null)
{
inCluster = points.contains(p);
}
return inCluster;
}
private void assignPointToCluster(Set<DataPoint> points, int clusterId)
{
for (DataPoint p : points)
{
assignPointToCluster(p, clusterId);
}
}
private void assignPointToCluster(DataPoint p, int clusterId)
{
// Remove point from the group that it currently belongs to...
if (isNoise(p))
{
removePointFromCluster(p, CLUSTER_ID_NOISE);
}
else if (isUnclassified(p))
{
removePointFromCluster(p, CLUSTER_ID_UNCLASSIFIED);
}
else
{
if (clusterId != CLUSTER_ID_UNCLASSIFIED)
{
throw new RuntimeException(
"Trying to move point that has already been" +
"assigned to some other cluster. Point: " + p + ", clusterId=" + clusterId);
}
else
{
// do nothing. we are registering a brand new point in UNCLASSIFIED set.
}
}
Set<DataPoint> points = clusters.get(clusterId);
if (points == null)
{
points = new HashSet<DataPoint>();
clusters.put(clusterId, points);
}
points.add(p);
}
private boolean removePointFromCluster(DataPoint p, int clusterId)
{
boolean removed = false;
Set<DataPoint> points = clusters.get(clusterId);
if (points != null)
{
removed = points.remove(p);
}
return removed;
}
private int getNextClusterId()
{
return nextClusterId++;
}
public void printDistances()
{
System.out.println("Point Similarity matrix:");
for (int i = 0; i < adjacencyMatrix.length; i++)
{
System.out.println(Arrays.toString(adjacencyMatrix[i]));
}
}
private void printResults(List<Cluster> allClusters)
{
System.out.println("DBSCAN Clustering with NeighborThreshold=" +
this.eps + " minPoints=" + this.minPoints);
System.out.println("Clusters:");
String noiseElements = "no noise elements";
for (Cluster c : allClusters)
{
if (String.valueOf(CLUSTER_ID_NOISE).equals(c.getLabel()))
{
// print noise data at the end
noiseElements = c.getElementsAsString();
}
else
{
System.out.println("____________________________________________________________\n");
System.out.println(c.getLabel() + ": \n" + c.getElementsAsString());
System.out.println("____________________________________________________________\n\n");
}
}
System.out.println("Noise Elements:\n " + noiseElements);
}
public static void main(String[] args)
{
DataPoint[] elements = new DataPoint[5];
elements[0] = new DataPoint("A", new double[]{});
elements[1] = new DataPoint("B", new double[]{});
elements[2] = new DataPoint("C", new double[]{});
elements[3] = new DataPoint("D", new double[]{});
elements[4] = new DataPoint("E", new double[]{});
double[][] a = new double[][]{
{0, 1, 2, 2, 3},
{1, 0, 2, 4, 3},
{2, 2, 0, 1, 5},
{2, 4, 1, 0, 3},
{3, 3, 5, 3, 0}
};
double threshold = 2.0;
int minPoints = 2;
DBSCANAlgorithm dbscan = new DBSCANAlgorithm(elements, a, threshold, minPoints);
dbscan.cluster();
}
}
码农场