CHAMELEON: A Hierarchical Clustering Algorithm Using
Download
Report
Transcript CHAMELEON: A Hierarchical Clustering Algorithm Using
CHAMELEON: A Hierarchical
Clustering Algorithm Using
Dynamic Modeling
Author: George et al.
Advisor: Dr. Hsu
Graduate: ZenJohn Huang
IDSL seminar 2001/10/23
Outline
Motivation
Objective
Research restrict
Literature review
An overview of related clustering algorithms
The limitations of clustering algorithms
CHAMELEON
Concluding remarks
Personal opinion
Motivation
Existing clustering algorithms can
breakdown
Choice
of parameters is incorrect
Model is not adequate to capture the
characteristics of clusters
Diverse shapes, densities, and sizes
Objective
Presenting a novel hierarchical
clustering algorithm – CHAMELEON
Facilitating
discovery of natural and
homogeneous
Being applicable to all types of data
Research Restrict
In this paper, authors ignored the issue
of scaling to large data sets that cannot
fit in the main memory
Literature Review
Clustering
An overview of related clustering algorithms
The limitations of the recently proposed state
of the art clustering algorithms
Clustering
The intracluster similarity is maximized and
the intercluster similarity is minimized [Jain
and Dubes, 1988]
Serving as the foundation for data mining and
analysis techniques
Clustering(cont’d)
Applications
Purchasing patterns
Categorization of documents on WWW [Boley, et
al., 1999]
Grouping of genes and proteins that have similar
functionality[Harris, et al., 1992]
Grouping if spatial locations prone to earth
quakes[Byers and Adrian, 1998]
An Overview of Related
Clustering Algorithms
Partitional techniques
Hierarchical techniques
Partitional Techniques
K means[Jain and Dubes, 1988]
Hierarchical Techniques
CURE [Guha, Rastogi and Shim, 1998]
ROCK [Guha, Rastogi and Shim, 1999]
Limitations of Existing
Hierarchical Schemas
CURE
Fail
to take into account special
characteristics
Limitations of Existing
Hierarchical Schemas(cont’d)
ROCK
Irrespective
of densities and shapes
CHAMELEON
Overview
Modeling the data
Modeling the cluster similarity
A two-phase clustering algorithm
Performance analysis
Experimental Results
Overall Framework
CHAMELEON
Modeling the Data
K-nearest graphs from an original data
in 2D
Modeling the Cluster Similarity
Relative inter-connectivity
Modeling the Cluster
Similarity(cont’d)
Relative closeness
A Two-phase Clustering
Algorithm
Phase I: Finding initial sub-clusters
A Two-phase Clustering
Algorithm(cont’d)
Phase I: Finding initial sub-clusters
Multilevel
paradigm[Karypis & Kumar, 1999]
hMeT|s [Karypis & Kumar, 1999]
A Two-phase Clustering
Algorithm(cont’d)
Phase II: Merging sub-clusters using a
dynamic framework
RI (Ci , C j ) TRI and RC(Ci , C j ) TRC
TRI, TRC: user specified threshold
A Two-phase Clustering
Algorithm(cont’d)
Phase II: Merging sub-clusters using a
dynamic framework
RI (Ci , C j ) * RC(Ci , C j )
α is a user specified parameter
Performance Analysis
The amount of time required to compute
K-nearest
neighbor graph
Two-phase clustering
Performance Analysis(cont’d)
The amount of time required to compute
K-nearest
neighbor graph
Low-dimensional data sets = O(n log n)
High-dimensional data sets = O(n2)
Performance Analysis(cont’d)
The amount of time required to compute
Two-phase
clustering
Computing internal inter-connectivity and
closeness for each cluster: O(nm)
Selecting the most similar pair of cluster:
O(n log n + m2 log m)
Total
time = O(nm + n log n + m2 log m)
Experimental Results
Program
DBSCAN:
a publicly available version
CURE: a locally implemented version
Data sets
Qualitative comparison
Data Sets
•
•
Five clusters
Different size, shape,
and density
Noise point
•
•
•
Six clusters
Different size, shape,
and orientation
Random noise point
Special artifacts
•
•
•
•
•
•
•
•
Eight clusters
Different size, shape, density,
and orientation
Random noise point
Two clusters
Close to each other
Different region, different densities
•
•
•
Eight clusters
Different size, shape,
and orientation
Random noise
and special artifacts
Concluding remarks
CHAMELEON can discover natural
clusters of different shapes and sizes
It is possible to use other algorithms
instead of k-nearest neighbor graph
Different domains may require different
models for capturing closeness and
inter-connectivity
Personal Opinion
Without further work