
Parameter Reduction for Density-based Clustering on Large Data Sets
Elizabeth Wang
Input Parameters of DBSCAN
1. The neighborhood radius, r.
2. The minimum number, k, of neighbors required to be a core point – the seed for expansion.
The clustering quality depends heavily on the input parameters, especially r, as the sketch below illustrates.
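As a concrete illustration (not from the original slides), here is a minimal Python sketch using scikit-learn's DBSCAN on synthetic data; the dataset and the candidate radii are our own assumptions, chosen only to show how strongly the result depends on r (eps) while k (min_samples) stays fixed at 4:

# Minimal sketch: vary the radius r (eps) with k (min_samples) fixed at 4.
# The synthetic blobs stand in for a real spatial dataset.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

for eps in (0.3, 0.8, 2.0):  # three candidate radii r
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"r = {eps}: {n_clusters} clusters, {n_noise} noise points")

Too small a radius leaves most points as noise; too large a radius merges everything into one cluster.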
Selection of Parameters Matters:
[Figure: DBSCAN results on the same dataset under four parameter settings, one scatter panel each: R = 20, K = 4; R = 10, K = 4; R = 7, K = 4; R = 5, K = 4.]
Content
• Introduction
• Related Work
• Parameter Reduction for Density-based Clustering
  – Experiments and Observations
  – Determination of the neighborhood radii
  – Nonparametric Density-based Clustering
• Performance Analysis
• Conclusion
Introduction
• Density-based clustering is widely used in various spatial applications such as geographical information analysis, medical applications, sky data observatories and satellite image analysis.
• In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noise).
• One problem with density-based clustering is that users typically have minimal domain knowledge with which to determine the input parameters.
Our Approach to Solve the Problem
• We explore an automatic approach to determine the minimum neighborhood radii based on the data distribution of the dataset.
• The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters, based on many experiments and observations.
• We combine MINR with the enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC).
Problem with Input Parameter
• In density-based clustering, the main input is the minimum neighborhood radius.
• A dataset may consist of clusters with the same density or with different densities; the figure shows some possible distributions of a dataset X.
• When clusters have different densities, it is more difficult to determine the minimum neighborhood radii.
Related Work
• Clustering methods
• Attempts to reduce parameters
• Enhanced DBSCAN clustering
Clustering Methods
• There are mainly two kinds of clustering methods:
  – similarity-based partitioning methods
  – density-based clustering methods.
• A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters.
• The major problems with partitioning methods are:
  – k has to be predetermined;
  – it is difficult to identify clusters with different sizes;
  – they only find convex clusters (see the sketch below).
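As a hedged illustration of these limitations (not part of the original slides), the sketch below runs scikit-learn's KMeans on two interleaving half-moons; k must be supplied up front, and because k-means produces convex partitions it cuts across the non-convex clusters:

# Sketch: a partitioning method (k-means) needs k in advance and yields
# convex clusters, so it mislabels points on non-convex shapes.
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons (up to label swap) falls well below 1.0,
# because convex partitions cannot trace the interleaved crescents.
agree = max((labels == y_true).mean(), (labels != y_true).mean())
print(f"best label agreement with the true moons: {agree:.2f}")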
Clustering Methods (Cont.)
• Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1].
• DBSCAN is very sensitive to its input parameters, the neighborhood radius (r) and the minimum number of neighbors (MinPts).
• Another density-based algorithm, DenClue, uses a grid and is very efficient. It generalizes several other clustering approaches, which, however, results in a large number of input parameters.
Attempts To Reduce Parameters
• There have been many efforts to make the clustering process parameter-free, such as OPTICS [8], CHAMELEON [7] and TURN* [2].
• OPTICS computes an augmented cluster ordering that represents the density-based clustering structure of the data. This method is used for interactive cluster analysis.
• CHAMELEON has been found to be very effective at clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter settings to work effectively.
• TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. The radius is then doubled on each pass until the clustering reaches a "turn" (sketched below). Even though it takes big steps, the computation time is not promising for large datasets with varied densities.
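The following is only a rough sketch of the doubling schedule described above, not the actual TURN* implementation; the starting radius, the use of DBSCAN as the inner clustering step, and the "turn" test are all placeholder assumptions:

# Rough TURN*-style sketch: begin with a radius so small that every point
# is noise, then double the radius each pass until a "turn" appears.
import numpy as np
from sklearn.cluster import DBSCAN

def turn_style_search(X, r0=1e-3, max_doublings=30):
    history = []  # (radius, cluster count) per pass
    r = r0
    for _ in range(max_doublings):
        labels = DBSCAN(eps=r, min_samples=4).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # Placeholder "turn" test: stop once the cluster count stops rising.
        if history and 0 < n_clusters <= history[-1][1]:
            return history + [(r, n_clusters)]
        history.append((r, n_clusters))
        r *= 2.0  # double the radius each pass
    return history

Even with the doubling schedule, each pass re-clusters the whole dataset, which is what makes this search costly on large data.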
Parameter Reduction
• DBSCAN has two input parameters:
  – the minimum number of neighbors, k;
  – the neighborhood radius, r.
• k is the size of the smallest cluster. DBSCAN sets k to 4 [1], and TURN* also treats it as a fixed value [2]. A clustering comparison between k = 4 and k = 7 is shown on the next slide.
• Therefore, the only input parameter is the minimum neighborhood radius, r.
• r depends on the data distribution of the dataset, and should differ for clusters of different densities.
“K = 4” vs “K = 7”: (R4)^2 / 4 = (R7)^2 / 7
Two settings describe the same density when k / (π·r^2) is equal, so R7 = R4·√(7/4): R4 = 6 gives R7 ≈ 7.93 and R4 = 8 gives R7 ≈ 10.6, the paired settings in the panels below.
[Figure: four clustering panels comparing density-equivalent settings: K = 4, R = 6 vs K = 7, R = 7.93, and K = 4, R = 8 vs K = 7, R = 10.6.]
Observation 1
• Define R as the distance between each point x and its 4th nearest neighbor.
• The points are then sorted based on R in ascending order.
• DS1 is a dataset used by DBSCAN; its size is 200 points. DS2 is reproduced from a dataset used by CHAMELEON.
[Figure: (a) DS1; (b) DS2; (c) R-x graph of DS1; (d) R-x graph of DS2.]
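A minimal sketch of this computation (the synthetic blobs below stand in for DS1/DS2, which we do not have):

# Sketch: R = distance from each point to its 4th nearest neighbor,
# sorted ascending to give the R-x graph of Observation 1.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# n_neighbors=5 because each query point is returned as its own
# 0-distance neighbor in column 0.
dist, _ = NearestNeighbors(n_neighbors=5).fit(X).kneighbors(X)
R = dist[:, 4]          # distance to the 4th nearest neighbor
R_sorted = np.sort(R)   # the ascending R-x graph
print("Rm = max(R) =", R_sorted[-1])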
Observation 2
• Given a neighborhood radius r, we calculate the number of neighbors of each point within that radius, denoted K.
• Sort the points in descending order of K to get the sorted K-x graph.
• When the neighborhood radius is close to the maximum R, the K-x graph shows “knees” very clearly.
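A minimal sketch of this step, continuing the assumptions of the previous snippet (synthetic data, with r set to the maximum R):

# Sketch: K = number of neighbors of each point within radius Rm,
# sorted descending to give the K-x graph of Observation 2.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
nn = NearestNeighbors(n_neighbors=5).fit(X)
Rm = nn.kneighbors(X)[0][:, 4].max()  # maximum 4th-neighbor distance

neigh = nn.radius_neighbors(X, radius=Rm, return_distance=False)
K = np.array([len(ix) - 1 for ix in neigh])  # exclude the point itself
K_sorted = np.sort(K)[::-1]                  # the descending K-x graph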
Observation 2 (Cont.)
• To find the “knees” in the graph, we calculate the differential of the graph, ΔK.
• The knees lie close to the points with peak differentials.
• The number of “knees” equals the number of cluster densities in the dataset.
• Intuitively, we infer that the point ranges separated by the “knees” belong to different density clusters or to noise.
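A hedged sketch of the knee search; the slides do not specify how a "peak" is detected, so the prominence threshold below is our assumption:

# Sketch: knees of the sorted K-x graph sit near peaks of its differential.
import numpy as np
from scipy.signal import find_peaks

def find_knee_peaks(K_sorted, prominence=5):
    dK = np.abs(np.diff(K_sorted.astype(float)))      # the differential ΔK
    peaks, _ = find_peaks(dK, prominence=prominence)  # peak ΔK positions XPi
    return peaks  # ideally one peak per cluster density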
Observation 3
• Sort the dataset DS2 based on K, and then partition the sorted dataset into three subsets separated by the two “knees.”
• The two “knees” are at positions 10000 and 15500, so the three partitions are 0-10000, 10001-15500, and 15501-17500.
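A minimal sketch of this partitioning, using the knee positions quoted on the slide (the point count and indexing convention are our assumptions):

# Sketch: split the K-sorted point positions of DS2 at the two knees.
import numpy as np

n_points, knees = 17500, (10000, 15500)   # values quoted on the slide
pos = np.arange(n_points)                 # positions sorted by K, densest first
densest = pos[:knees[0] + 1]              # positions 0 .. 10000
middle = pos[knees[0] + 1:knees[1] + 1]   # positions 10001 .. 15500
sparsest = pos[knees[1] + 1:]             # positions 15501 .. end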
Determination of the neighborhood radii
• Based on these experiments, we develop an algorithm, MINR, that automatically determines the minimum neighborhood radii for mining clusters of different densities from the data distribution.
MINR Algorithm
Input: a dataset X
Output: neighborhood radii ri
1. Calculate the distance between each point and its 4th neighbor, R. Get Rm = max(R).
2. Compute the number of neighbors within Rm for each point, K.
3. Sort the points in descending order based on K.
4. Calculate the differential ΔK, and find the peak ΔK position, XPi. Stop if it is at the end of the dataset.
5. For the ith peak ΔK position, find the “knee” point KNi: if x < XPi, ΔKx = 0, and |x - XPi| is the smallest, then KNi = x.
6. ri = Rx, the R value at the knee point. Increase i and go back to step 4.
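A minimal Python sketch of MINR as we read the steps above; the ΔK = 0 knee test and the cap on the number of peaks are our interpretation of steps 4-5, and plain scikit-learn neighbor queries stand in for the paper's implementation:

# Sketch of MINR: each radius ri is the R value of a "knee" point
# in the descending K-x graph.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def minr(X, max_knees=3):
    nn = NearestNeighbors(n_neighbors=5).fit(X)
    R = nn.kneighbors(X)[0][:, 4]                 # step 1: 4th-neighbor distances
    Rm = R.max()
    neigh = nn.radius_neighbors(X, radius=Rm, return_distance=False)
    K = np.array([len(ix) - 1 for ix in neigh])   # step 2: neighbors within Rm
    order = np.argsort(-K)                        # step 3: sort descending by K
    K_sorted, R_sorted = K[order], R[order]
    dK = np.abs(np.diff(K_sorted.astype(float)))  # step 4: differential ΔK
    radii = []
    for xp in np.argsort(-dK)[:max_knees]:        # peak ΔK positions XPi
        flat = np.where(dK[:xp] == 0)[0]          # step 5: nearest x < XPi
        kn = flat[-1] if len(flat) else xp        #   with ΔKx = 0 is the knee
        radii.append(R_sorted[kn])                # step 6: ri = R at the knee
    return sorted(set(radii))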
Nonparametric Density-based Clustering
• Given the series of radii, start clustering with the enhanced DBSCAN algorithm, e-DBSCAN, using the smallest radius r = r1. The densest cluster(s) are formed.
• Then set r = r2 and process only the still-unclustered points. The next sparser cluster(s) are formed.
[Figure: with radius r1 the densest cluster is formed; with radius r2 the remaining unclustered points form a sparser cluster, and the leftover points are noise.]
Nonparametric Density-based Clustering
• Calculate a series of neighborhood radii for the different density clusters using MINR.
• Cluster iteratively with e-DBSCAN using those radii.
• Merge any pair of clusters that share most of the boundary points of either cluster.
Nonparametric Clustering Algorithm
Input: a dataset X
Output: clusters and noise
1. Calculate a series of neighborhood radii r1, r2, …, rm for the different density clusters with MINR().
2. Cluster iteratively with e-DBSCAN.
3. Check the boundaries of each pair of clusters. If two clusters share most of the boundary of either cluster, merge the two clusters into one.
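A hedged Python sketch of this loop; e-DBSCAN is not public, so plain DBSCAN stands in for it here, and the boundary-merging of step 3 is left out:

# Sketch of NPDBC: cluster with the smallest radius first, then re-run
# clustering with each larger radius on the still-unclustered points only.
import numpy as np
from sklearn.cluster import DBSCAN

def npdbc(X, radii, k=4):
    labels = np.full(len(X), -1)     # -1 = unclustered (eventually noise)
    next_id = 0
    for r in sorted(radii):          # r1 (densest clusters) first
        mask = labels == -1          # process only unclustered points
        if mask.sum() < k:
            break
        sub = DBSCAN(eps=r, min_samples=k).fit_predict(X[mask])
        labels[mask] = np.where(sub >= 0, sub + next_id, -1)
        next_id = max(next_id, labels.max() + 1)
        # Step 3 (merging clusters that share most boundary points) omitted.
    return labels

Used together with the MINR sketch above, this would be called as labels = npdbc(X, minr(X)).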
Performance Analysis
• We compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*.
• We show run-time comparisons on the dataset DS2 discussed above.
• To make the data contain clusters of different densities, we artificially insert extra points into some clusters to make them denser than the others.
• The resulting datasets have sizes from 10k to 200k points.
Performance Analysis (Cont.)
• NPDBC is more efficient than TURN* for large datasets.
• The reason is that in NPDBC, the parameters are computed once at the beginning of the clustering process.
• The TURN* algorithm, by contrast, tries different neighborhood radii until the first “turn” is found, even in the case of only two distinct densities.
• We compare NPDBC with TURN* only on datasets with two different densities; as the number of distinct densities grows, NPDBC's advantage over TURN* widens further.
Conclusion and Future Work
• In this paper, we explore an automatic approach to determine this parameter based on the distribution of the dataset.
• The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters.
• We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBSCAN, e-DBSCAN.
• Experiments show that NPDBC is more efficient and scalable than TURN* for clusters of two different densities.
• In future work, we will implement NPDBC on the vertical data structure P-tree, an efficient, data-mining-ready data representation.
Thanks!