Clustering
COMP 790-90 Research Seminar
BCB 713 Module
Spring 2011
Wei Wang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Hierarchical Clustering
Group data objects into a tree of clusters
[Figure: objects a, b, c, d, e and the clusters formed over steps 0-4 (a+b → ab, d+e → de, c+de → cde, ab+cde → abcde). Agglomerative clustering (AGNES) reads the steps left to right, step 0 to step 4; divisive clustering (DIANA) reads them right to left, step 4 to step 0.]
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
AGNES (Agglomerative Nesting)
Initially, each object is a cluster
Clusters are merged step by step until all objects form a single cluster
Single-link approach:
Each cluster is represented by all of its objects
The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
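The single-link merging loop described above can be sketched as follows. This is a minimal illustration, not the course's implementation; the function name, the stopping criterion (merge until `k` clusters remain), and the toy data are my own.

```python
# Minimal single-link agglomerative clustering (AGNES-style) sketch.
from math import dist

def single_link(points, k):
    """Merge clusters until k remain, using the minimum pairwise distance."""
    clusters = [[p] for p in points]          # each object starts as its own cluster
    while len(clusters) > k:
        # find the pair of clusters whose closest points are nearest
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge the closest pair of clusters
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_link(pts, 3))
```

With three clusters requested, the two tight pairs merge first and the isolated point (10, 0) stays a singleton, showing how single-link merges the closest pairs first.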
Dendrogram
Shows how clusters are merged hierarchically
Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
A clustering of the data objects is obtained by cutting the dendrogram at the desired level
Each connected component then forms a cluster
DIANA (DIvisive ANAlysis)
Initially, all objects are in one cluster
Clusters are split step by step until each cluster contains only one object
[Figure: three scatter plots (axes 0-10) showing DIANA progressively splitting a single cluster into smaller clusters.]
Distance Measures
Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
m: mean of a cluster
C: a cluster
n: the number of objects in a cluster
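The four measures above can be sketched directly; the function names mirror the slide's notation but are otherwise my own, as is the toy data.

```python
# Sketch of the four inter-cluster distance measures, using Euclidean d(p, q).
from math import dist

def d_min(Ci, Cj):
    # distance between the closest pair of points from different clusters
    return min(dist(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    # distance between the farthest pair of points from different clusters
    return max(dist(p, q) for p in Ci for q in Cj)

def mean(C):
    n = len(C)
    return tuple(sum(x) / n for x in zip(*C))

def d_mean(Ci, Cj):
    # distance between the cluster means m_i and m_j
    return dist(mean(Ci), mean(Cj))

def d_avg(Ci, Cj):
    # average over all n_i * n_j cross-cluster pairs
    return sum(dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))

Ci = [(0, 0), (0, 2)]
Cj = [(3, 0), (5, 0)]
```

Note that $d_{\min} \le d_{avg} \le d_{\max}$ always holds, while $d_{mean}$ can fall outside the pairwise distances.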
Challenges of Hierarchical Clustering Methods
Hard to choose merge/split points
Merging/splitting is never undone, so these decisions are critical
Do not scale well: O(n²) time
What is the bottleneck when the data can't fit in memory?
Integrating hierarchical clustering with other techniques:
BIRCH, CURE, CHAMELEON, ROCK
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
Clustering the objects → clustering the leaf nodes of the CF tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: $\sum_{i=1}^{N} X_i$ (linear sum of the points)
SS: $\sum_{i=1}^{N} X_i^2$ (square sum of the points)
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))
[Figure: scatter plot of the five points on axes 0-10.]
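The slide's example CF can be reproduced in a few lines. The helper name is my own; it computes SS componentwise per dimension, matching the (54, 190) vector on the slide (some BIRCH formulations use a scalar SS instead).

```python
# Compute the clustering feature CF = (N, LS, SS) for a set of points.
def clustering_feature(points):
    dims = range(len(points[0]))
    N = len(points)
    LS = tuple(sum(p[d] for p in points) for d in dims)        # linear sum per dimension
    SS = tuple(sum(p[d] ** 2 for p in points) for d in dims)   # square sum per dimension
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))  # → (5, (16, 30), (54, 190))
```

Because N, LS, and SS are all sums, the CF of a merged subcluster is just the componentwise sum of the two CFs; this additivity is what lets nonleaf nodes store the sums of their children's CFs.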
CF-tree in BIRCH
Clustering feature:
Summarizes the statistics for a subcluster: the 0th, 1st, and 2nd moments of the subcluster
Registers the crucial measurements for computing clusters while using storage efficiently
A CF tree is a height-balanced tree storing the clustering features for a hierarchical clustering
A nonleaf node in the tree has descendants, or "children"
Nonleaf nodes store the sums of the CFs of their children
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 ... CF6, each with a child pointer; a nonleaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold CF entries and are chained together with prev/next pointers.]
Parameters of a CF-tree
Branching factor: the maximum number of children per node
Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
BIRCH Clustering
Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
Linear scalability
Good clustering with a single scan
Quality can be further improved by a few
additional scans
Can handle only numeric data
Sensitive to the order of the data records
Drawbacks of Square Error Based Methods
One representative per cluster
Good only for convex-shaped clusters of similar size and density
The number of clusters, k, is a parameter
Good only if k can be reasonably estimated
CURE: the Ideas
Each cluster has c representatives
Choose c well-scattered points in the cluster
Shrink them toward the mean of the cluster by a fraction α
The representatives capture the physical shape and geometry of the cluster
Merge the closest two clusters
Distance between two clusters: the distance between their two closest representatives
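The representative-selection and shrinking steps can be sketched as below. This is an illustrative reading of the idea, not CURE's actual implementation: the farthest-point heuristic for "well scattered", the function name, and the toy data are my own assumptions.

```python
# Sketch of CURE's representatives: pick c scattered points, shrink by alpha.
from math import dist

def representatives(cluster, c, alpha):
    """Pick c well-scattered points, then shrink each toward the mean by alpha."""
    n = len(cluster)
    m = tuple(sum(x) / n for x in zip(*cluster))               # cluster mean
    reps = [max(cluster, key=lambda p: dist(p, m))]            # farthest from the mean
    while len(reps) < min(c, n):
        # next representative: the point farthest from the chosen representatives
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    # shrink each representative toward the mean by the fraction alpha
    return [tuple(x + alpha * (mx - x) for x, mx in zip(p, m)) for p in reps]
```

Shrinking the scattered points toward the mean dampens the effect of outliers while the multiple representatives still trace the cluster's shape.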