CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling

Paper presentation in data mining class
Presenters: 許明壽, 蘇建仲
Date: 2001/12/18
About this paper …

George Karypis, Eui-Hong (Sam) Han, Vipin Kumar
Department of Computer Science and Engineering, University of Minnesota
IEEE Computer, Aug. 1999
Outline

• Problem definition
• Main algorithm
• Key features of CHAMELEON
• Experiments and related work
• Conclusion and discussion
Problem definition

• Clustering
  - Intracluster similarity is maximized
  - Intercluster similarity is minimized
• Problems of existing clustering algorithms
  - Static model constraints
  - Break down when clusters are of diverse shapes, densities, and sizes
  - Susceptible to noise, outliers, and artifacts
Static model constraints

• Data space constraint (K-means, PAM, etc.)
  - Suitable only for data in metric spaces
• Cluster shape constraint (K-means, PAM, CLARANS)
  - Assume clusters are ellipsoidal or globular and of similar sizes
• Cluster density constraint (DBSCAN)
  - Points within a genuine cluster are density-reachable and points across different clusters are not
• Similarity-determination constraint (CURE, ROCK)
  - Use a static model to determine the most similar clusters to merge
Partitioning technique problems

[Figure: (a) clusters of widely different sizes; (b) clusters with convex shapes]
Hierarchical technique problems (1/2)

• The pair {(c), (d)} will be chosen to merge when we consider only closeness
Hierarchical technique problems (2/2)

• The pair {(a), (c)} will be chosen to merge when we consider only interconnectivity
Main algorithm

• Two-phase algorithm
  - PHASE I: use a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters.
  - PHASE II: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Framework

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Key features of CHAMELEON

• Modeling the data
• Modeling the cluster similarity
• Partition algorithm
• Merge schemes
Terms

• Arguments needed
  - K: for the K-nearest-neighbor graph
  - MINSIZE: the minimum size of an initial sub-cluster
  - TRI: threshold of relative interconnectivity
  - TRC: threshold of relative closeness
  - α: coefficient weighting RI against RC
Modeling the data

• K-nearest-neighbor graph approach
• Advantages
  - Data points that are far apart are completely disconnected in Gk
  - Gk captures the concept of neighborhood dynamically
  - The edge weights of dense regions in Gk tend to be large, and the edge weights of sparse regions tend to be small
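To make the construction concrete, here is a minimal Python sketch of building such a graph, assuming Euclidean distance and 1 / (1 + d) as the edge-weight similarity (that similarity choice is an illustrative assumption, not taken from the paper):

import numpy as np

def knn_graph(points, k):
    # Build a symmetric k-nearest-neighbor graph as an adjacency dict
    # {i: {j: weight}}. Edge weights use the similarity 1 / (1 + d),
    # so edges inside dense regions get large weights and edges in
    # sparse regions get small ones, as noted above.
    n = len(points)
    # Brute-force pairwise Euclidean distances, O(n^2); fine for a demo
    # (k-d trees bring this down to O(n log n) in low dimensions).
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    graph = {i: {} for i in range(n)}
    for i in range(n):
        order = np.argsort(dist[i])       # order[0] is i itself
        for j in order[1:k + 1]:          # the k nearest neighbors of i
            w = 1.0 / (1.0 + dist[i, j])
            graph[i][int(j)] = w
            graph[int(j)][i] = w          # keep the graph undirected
    return graph

# Example: a 5-nearest-neighbor graph over 100 random 2-D points.
pts = np.random.rand(100, 2)
g = knn_graph(pts, k=5)

Points far apart never appear in each other's neighbor lists, so they stay disconnected, matching the first advantage above.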
Example of k-nearest-neighbor graph

[Figure omitted]
Modeling the cluster similarity (1/2)

• Relative interconnectivity:

  RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\frac{|EC_{C_i}| + |EC_{C_j}|}{2}}

• Relative closeness:

  RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|} \bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|} \bar{S}_{EC_{C_j}}}

  where EC_{\{C_i, C_j\}} is the set of edges connecting C_i and C_j, EC_{C_i} is the set of edges cut when bisecting C_i, and \bar{S} denotes the average weight of an edge set.
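These definitions translate directly into code. A minimal sketch, assuming the weights of the edges crossing between Ci and Cj, and the weights of each cluster's internal min-cut bisection (which the paper obtains by bisecting each cluster), are already available as lists:

def relative_interconnectivity(cross, bisect_i, bisect_j):
    # RI(Ci, Cj): total crossing-edge weight, normalized by the average
    # of the two clusters' internal min-cut bisection weights.
    return sum(cross) / ((sum(bisect_i) + sum(bisect_j)) / 2.0)

def relative_closeness(cross, bisect_i, bisect_j, n_i, n_j):
    # RC(Ci, Cj): mean crossing-edge weight, normalized by the
    # size-weighted average of the mean internal bisection weights.
    mean = lambda ws: sum(ws) / len(ws)
    frac_i = n_i / (n_i + n_j)
    frac_j = n_j / (n_i + n_j)
    return mean(cross) / (frac_i * mean(bisect_i) + frac_j * mean(bisect_j))

Normalizing by each cluster's own internal figures is what makes the measures relative: a pair is attractive to merge only if the connection between the clusters looks similar to the internal structure of both.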
Modeling the cluster similarity (2/2)

• If relative interconnectivity and relative closeness are considered, {(c), (d)} will be merged
Partition algorithm (PHASE I)

• What
  - Finding the initial sub-clusters
• Why
  - RI and RC can't be accurately calculated for clusters containing only a few data points
• How
  - Utilize a multilevel graph-partitioning algorithm (hMETIS)
    - Coarsening phase
    - Partitioning phase
    - Uncoarsening phase
Partition algorithm (cont.)

• Initial
  - All points belong to the same cluster
• Repeat until every cluster is smaller than MINSIZE (see the sketch below)
  - Select the largest cluster and use hMETIS to bisect it
• Post scriptum: balance constraint
  - Split Ci into CiA and CiB such that each sub-cluster contains at least 25% of the nodes of Ci
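A Python sketch of this loop; min_cut_bisect stands in for a real multilevel partitioner such as hMETIS and is a hypothetical helper here:

def phase_one(graph, minsize):
    # Start with all points in a single cluster, then repeatedly
    # bisect the largest cluster until every cluster has fewer
    # than MINSIZE vertices.
    clusters = [set(graph)]
    while max(len(c) for c in clusters) >= minsize:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        # Hypothetical: min-cut bisection honoring the 25% balance
        # constraint above (each half keeps at least 25% of the nodes).
        part_a, part_b = min_cut_bisect(graph, largest)
        clusters.extend([part_a, part_b])
    return clusters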
Merge schemes (PHASE II)

• What
  - Merging sub-clusters using a dynamic framework
• How
  - Find and merge the pair of sub-clusters that are the most similar
  - Scheme 1: merge pairs with RI(Ci, Cj) ≥ TRI and RC(Ci, Cj) ≥ TRC
  - Scheme 2: merge the pair that maximizes RI(Ci, Cj) * RC(Ci, Cj)^α
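A sketch of both schemes, assuming ri(i, j) and rc(i, j) are callables returning the relative measures for a candidate pair (hypothetical helpers); for scheme 1, preferring the highest-RI pair among those clearing both thresholds is one reasonable tie-break, not necessarily the paper's:

def most_similar_pair(num_clusters, ri, rc, t_ri, t_rc, alpha, scheme=2):
    # Return the (i, j) pair of sub-clusters to merge next, or None.
    best, best_score = None, float("-inf")
    for i in range(num_clusters):
        for j in range(i + 1, num_clusters):
            if scheme == 1:
                # Scheme 1: both relative measures must clear their thresholds.
                if ri(i, j) >= t_ri and rc(i, j) >= t_rc:
                    score = ri(i, j)
                else:
                    continue
            else:
                # Scheme 2: maximize RI * RC^alpha; alpha > 1 weights
                # closeness more heavily, alpha < 1 interconnectivity.
                score = ri(i, j) * rc(i, j) ** alpha
            if score > best_score:
                best, best_score = (i, j), score
    return best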
Experiments and related work

• Introduction of CURE
• Introduction of DBSCAN
• Results of experiments
• Performance analysis
Introduction of CURE (1/n)

Clustering Using REpresentatives

1. Properties:
  - Fit for non-spherical shapes.
  - Shrinking can help to dampen the effects of outliers.
  - Multiple representative points are chosen for non-spherical shapes.
  - In each iteration of the merge procedure, the chosen scattered representative points are shrunk toward the cluster centroid by a fixed ratio.
  - Random sampling of the data set makes it fit for large databases.
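A toy illustration of the shrinking step only (not the full CURE algorithm), assuming the representative points are stored as a NumPy array:

import numpy as np

def shrink_representatives(reps, shrink=0.5):
    # Move each scattered representative point toward the cluster
    # centroid by the shrink factor s; the next slide notes that
    # s in [0.2, 0.7] worked well in CURE's experiments.
    centroid = reps.mean(axis=0)
    return reps + shrink * (centroid - reps)

reps = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(shrink_representatives(reps, shrink=0.3))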
Introduction of CURE (2/n)

2. Drawbacks:
  - The partitioning method cannot guarantee that the chosen data points are good.
  - Clustering accuracy with respect to the parameters below:
    (1) Shrink factor s: CURE always finds the right clusters for s values from 0.2 to 0.7.
    (2) Number of representative points c: CURE always finds the right clusters for values of c greater than 10.
    (3) Number of partitions p: with as many as 50 partitions, CURE always discovers the desired clusters.
    (4) Random sample size r:
      (a) for sample sizes up to 2000, the clusters found were of poor quality
      (b) from 2500 sample points and above (about 2.5% of the data set size), CURE always correctly finds the clusters
3. Clustering algorithm: representative points

[Figure omitted]
• Merge procedure

[Figure omitted]
Introduction of DBSCAN (1/n)

Density-Based Spatial Clustering of Applications with Noise

1. Properties:
  - Can discover clusters of arbitrary shape.
  - Each cluster has a typical density of points which is higher than outside the cluster.
  - The density within the areas of noise is lower than the density in any of the clusters.
  - Requires only the parameter MinPts as input.
  - Easy to implement in C++ using an R*-tree.
  - Runtime grows almost linearly with the number of points; time complexity is O(n log n).
Introduction of DBSCAN (2/n)

2. Drawbacks:
  - Cannot be applied to polygons.
  - Cannot be applied to high-dimensional feature spaces.
  - Cannot process the shape of the k-dist graph with multiple features.
  - Not fit for large databases, because no method is applied to reduce the spatial database.
3. Definitions:
  - Eps-neighborhood of a point p: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
  - Each cluster contains at least MinPts points
Introduction of DBSCAN (3/n)

4. p is directly density-reachable from q iff
  (1) p ∈ N_Eps(q), and
  (2) |N_Eps(q)| ≥ MinPts (core point condition)
  - Directly density-reachable is symmetric when p and q are both core points; it is asymmetric when one is a core point and the other a border point.
5. p is density-reachable from q if there is a chain of points between q and p, each directly density-reachable from its predecessor
  - Density-reachability is transitive but not symmetric.
  - Density-reachability is symmetric for core points.
Introduction of DBSCAN (4/n)

6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s.
  - Density-connectedness is a symmetric and reflexive relation.
  - A cluster is defined to be a maximal set of density-connected points (with respect to density-reachability).
  - Noise is the set of points not belonging to any cluster.
7. How to find a cluster C?
  - Maximality: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
  - Connectivity: ∀ p, q ∈ C: p is density-connected to q
8. How to find noise?
  - ∀ p: if p does not belong to any cluster, then p is a noise point
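A compact Python sketch of DBSCAN following definitions 4–8 above, with brute-force Eps-neighborhood queries (the paper uses an R*-tree to reach O(n log n)); the label bookkeeping is one reasonable implementation, not the paper's exact pseudocode:

def dbscan(points, eps, min_pts, dist):
    # Returns (clusters, noise): a list of sets of point indices,
    # plus the set of noise points.
    n = len(points)
    UNVISITED, NOISE = -2, -1
    label = [UNVISITED] * n
    clusters = []

    def neighborhood(p):
        # Eps-neighborhood: N_Eps(p) = {q | dist(p, q) <= Eps}
        return {q for q in range(n) if dist(points[p], points[q]) <= eps}

    for p in range(n):
        if label[p] != UNVISITED:
            continue
        seeds = neighborhood(p)
        if len(seeds) < min_pts:      # core point condition fails
            label[p] = NOISE          # may later become a border point
            continue
        cid = len(clusters)
        cluster = {p}
        label[p] = cid
        frontier = list(seeds - {p})
        while frontier:
            q = frontier.pop()
            if label[q] == NOISE:     # border point: claim it, don't expand
                label[q] = cid
                cluster.add(q)
            if label[q] != UNVISITED:
                continue
            label[q] = cid
            cluster.add(q)
            q_nbrs = neighborhood(q)
            if len(q_nbrs) >= min_pts:    # q is a core point: expand from it
                frontier.extend(q_nbrs - cluster)
        clusters.append(cluster)

    noise = {p for p in range(n) if label[p] == NOISE}
    return clusters, noise

Every point in a returned cluster is density-connected to every other point in it (connectivity), and expanding from each newly found core point enforces maximality.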
Results of experiments

[Figures omitted]
Performance analysis (1/2)

• The time to construct the k-nearest-neighbor graph
  - Low-dimensional data sets, using k-d trees: overall complexity O(n log n)
  - High-dimensional data sets, where k-d trees are not applicable: overall complexity O(n^2)
• Finding initial sub-clusters
  - Obtaining m clusters by repeatedly partitioning successively smaller graphs has overall computational complexity O(n log(n/m)), which is bounded by O(n log n)
  - A faster approach obtains the initial m clusters in time O(n + m log m) using a multilevel m-way partitioning algorithm
Performance analysis (2/2)

• Merging sub-clusters using a dynamic framework
  - The time to compute the internal interconnectivity and internal closeness for each initial cluster is O(nm)
  - The time to find the most similar pair of clusters to merge is O(m^2 log m), using a heap-based priority queue
• So the overall complexity of CHAMELEON is O(n log n + nm + m^2 log m)
Conclusion and discussion

• Dynamic model with relative interconnectivity and closeness
• This paper ignores the issue of scaling to large data sets
• Other graph representation methodologies?
• Other partitioning algorithms?