CHAMELEON: A Hierarchical Clustering Algorithm Using

Download Report

Transcript CHAMELEON: A Hierarchical Clustering Algorithm Using

CHAMELEON: A Hierarchical
Clustering Algorithm Using
Dynamic Modeling
Author: George et al.
Advisor: Dr. Hsu
Graduate: ZenJohn Huang
IDSL seminar 2001/10/23
Outline
 Motivation
 Objective
 Research restrict
 Literature review


An overview of related clustering algorithms
The limitations of clustering algorithms
 CHAMELEON
 Concluding remarks
 Personal opinion
Motivation
 Existing clustering algorithms can
breakdown
 Choice
of parameters is incorrect
 Model is not adequate to capture the
characteristics of clusters
 Diverse shapes, densities, and sizes
Objective
 Presenting a novel hierarchical
clustering algorithm – CHAMELEON
 Facilitating
discovery of natural and
homogeneous
 Being applicable to all types of data
Research Restrict
 In this paper, authors ignored the issue
of scaling to large data sets that cannot
fit in the main memory
Literature Review
 Clustering
 An overview of related clustering algorithms
 The limitations of the recently proposed state
of the art clustering algorithms
Clustering
 The intracluster similarity is maximized and
the intercluster similarity is minimized [Jain
and Dubes, 1988]
 Serving as the foundation for data mining and
analysis techniques
Clustering(cont’d)
 Applications
 Purchasing patterns
 Categorization of documents on WWW [Boley, et
al., 1999]
 Grouping of genes and proteins that have similar
functionality[Harris, et al., 1992]
 Grouping if spatial locations prone to earth
quakes[Byers and Adrian, 1998]
An Overview of Related
Clustering Algorithms
 Partitional techniques
 Hierarchical techniques
Partitional Techniques
 K means[Jain and Dubes, 1988]
Hierarchical Techniques
 CURE [Guha, Rastogi and Shim, 1998]
 ROCK [Guha, Rastogi and Shim, 1999]
Limitations of Existing
Hierarchical Schemas
 CURE
 Fail
to take into account special
characteristics
Limitations of Existing
Hierarchical Schemas(cont’d)
 ROCK
 Irrespective
of densities and shapes
CHAMELEON
 Overview
 Modeling the data
 Modeling the cluster similarity
 A two-phase clustering algorithm
 Performance analysis
 Experimental Results
Overall Framework
CHAMELEON
Modeling the Data
 K-nearest graphs from an original data
in 2D
Modeling the Cluster Similarity
 Relative inter-connectivity
Modeling the Cluster
Similarity(cont’d)
 Relative closeness
A Two-phase Clustering
Algorithm
 Phase I: Finding initial sub-clusters
A Two-phase Clustering
Algorithm(cont’d)
 Phase I: Finding initial sub-clusters
 Multilevel
paradigm[Karypis & Kumar, 1999]
 hMeT|s [Karypis & Kumar, 1999]
A Two-phase Clustering
Algorithm(cont’d)
 Phase II: Merging sub-clusters using a
dynamic framework
RI (Ci , C j )  TRI and RC(Ci , C j )  TRC
TRI, TRC: user specified threshold
A Two-phase Clustering
Algorithm(cont’d)
 Phase II: Merging sub-clusters using a
dynamic framework

RI (Ci , C j ) * RC(Ci , C j )
α is a user specified parameter
Performance Analysis
 The amount of time required to compute
 K-nearest
neighbor graph
 Two-phase clustering
Performance Analysis(cont’d)
 The amount of time required to compute
 K-nearest
neighbor graph
Low-dimensional data sets = O(n log n)
 High-dimensional data sets = O(n2)

Performance Analysis(cont’d)
 The amount of time required to compute
 Two-phase
clustering
Computing internal inter-connectivity and
closeness for each cluster: O(nm)
 Selecting the most similar pair of cluster:
O(n log n + m2 log m)

 Total
time = O(nm + n log n + m2 log m)
Experimental Results
 Program
 DBSCAN:
a publicly available version
 CURE: a locally implemented version
 Data sets
 Qualitative comparison
Data Sets
•
•
Five clusters
Different size, shape,
and density
Noise point
•
•
•
Six clusters
Different size, shape,
and orientation
Random noise point
Special artifacts
•
•
•
•
•
•
•
•
Eight clusters
Different size, shape, density,
and orientation
Random noise point
Two clusters
Close to each other
Different region, different densities
•
•
•
Eight clusters
Different size, shape,
and orientation
Random noise
and special artifacts
Concluding remarks
 CHAMELEON can discover natural
clusters of different shapes and sizes
 It is possible to use other algorithms
instead of k-nearest neighbor graph
 Different domains may require different
models for capturing closeness and
inter-connectivity
Personal Opinion
 Without further work