Clustering
CS 685: Special Topics in Data Mining
Jinze Liu
The University of Kentucky
Outline
• What is clustering
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based clustering methods
• Outlier analysis
Hierarchical Clustering
• Group data objects into a tree of clusters
[Figure: steps 0–4 over objects a–e. Agglomerative clustering (AGNES) merges step by step: a, b → ab; d, e → de; c, de → cde; ab, cde → abcde. Divisive clustering (DIANA) traverses the same tree in reverse, from step 4 back to step 0.]
AGNES (Agglomerative Nesting)
• Initially, each object is its own cluster
• Clusters are merged step by step until all objects form one cluster
– Single-link approach
– Each cluster is represented by all of the objects in the cluster
– The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters (see the sketch below)
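A minimal sketch of this single-link merge loop in plain Python, assuming small 2-D point tuples (the function name and the brute-force loop are illustrative, not an optimized implementation):

```python
import math

def single_link_agnes(points, num_clusters):
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Single link: cluster similarity = distance of the closest
        # pair of points drawn from two different clusters.
        best = (None, None, math.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        # Merge the closest pair of clusters and repeat.
        clusters[i].extend(clusters.pop(j))
    return clusters

print(single_link_agnes([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```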
Dendrogram
• Shows how clusters are merged hierarchically
• Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level (see the sketch below)
– Each connected component forms a cluster
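For instance, SciPy can build the dendrogram's merge tree and cut it at a chosen height (a small illustrative sketch; the cut level 1.5 is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
Z = linkage(X, method='single')                    # single-link merge tree
labels = fcluster(Z, t=1.5, criterion='distance')  # cut dendrogram at height 1.5
print(labels)                                      # [1 1 2 2]: two connected components
```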
DIANA (DIvisive ANAlysis)
• Initially, all objects are in one cluster
• Clusters are split step by step until each cluster contains only one object
[Figure: three scatter plots (axes 0–10) illustrating DIANA splitting one cluster into smaller clusters step by step.]
Distance Measures
• Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
• Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
• Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
• Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
where C is a cluster, m is the mean of a cluster, and n is the number of objects in a cluster (all four are computed in the sketch below)
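A small NumPy sketch computing all four measures for two clusters (the function name is illustrative):

```python
import numpy as np

def cluster_distances(Ci, Cj):
    # Pairwise distances d(p, q) for every p in Ci, q in Cj.
    D = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        'min':  D.min(),    # closest pair (single link)
        'max':  D.max(),    # farthest pair (complete link)
        'mean': np.linalg.norm(Ci.mean(0) - Cj.mean(0)),  # distance of means
        'avg':  D.mean(),   # (1 / (ni * nj)) * sum over all pairs
    }

Ci = np.array([[0., 0.], [0., 1.]])
Cj = np.array([[3., 0.], [4., 0.]])
print(cluster_distances(Ci, Cj))
```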
Challenges of Hierarchical Clustering Methods
• Hard to choose merge/split points
– Merging/splitting can never be undone
– Merging/splitting decisions are therefore critical
• Does not scale well: O(n^2)
• What is the bottleneck when the data cannot fit in memory?
• Integrating hierarchical clustering with other techniques
– BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
– Clustering the objects reduces to clustering the leaf nodes of the CF tree
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
• N: number of data points
• LS = $\sum_{i=1}^{N} X_i$ (linear sum of the points)
• SS = $\sum_{i=1}^{N} X_i^2$ (square sum of the points)
Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190)) (verified in the sketch below)
[Figure: the five points plotted on a 10 × 10 grid.]
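The example CF can be checked directly, and CFs are additive, which is what lets a non-leaf node store the sums of its children's CFs (a minimal sketch):

```python
import numpy as np

def cf(points):
    # CF = (N, LS, SS): count, linear sum, and square sum of the points.
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

N, LS, SS = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)        # 5 [16. 30.] [54. 190.]

def merge_cf(a, b):
    # Additivity: the CF of a merged subcluster is the sum of the CFs.
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]
```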
CF-tree in BIRCH
• Clustering feature:
– Summarizes the statistics of a subcluster: the 0th, 1st, and 2nd moments of the subcluster
– Registers the crucial measurements for computing clusters while using storage efficiently
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
– A non-leaf node in the tree has descendants, or “children”
– A non-leaf node stores the sums of the CFs of its children
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1–CF6, each pointing to a child; a non-leaf node holds entries CF1–CF5 with child pointers; leaf nodes hold CF entries and are chained together by prev/next pointers.]
Parameters of a CF-tree
• Branching factor: the maximum number of children per node
• Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
CF Tree Insertion
• Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
• Modifying the leaf: test whether the leaf can absorb the new point without violating the threshold; if there is no room, split the node
• Modifying the path: update the CF information up the path (a sketch follows below)
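In outline, an insertion might look like the following. This is a hypothetical sketch, not the BIRCH reference code: every field and method on `node` (is_leaf, children, closest_entry, and so on) is assumed for illustration.

```python
def insert(node, point, threshold):
    # Hypothetical sketch: node fields and methods are assumed.
    if not node.is_leaf:
        # 1. Identify the appropriate leaf: recursively descend,
        #    following the closest child at each level.
        child = min(node.children, key=lambda c: c.cf.distance_to(point))
        insert(child, point, threshold)
        if child.is_overfull():
            node.split_child(child)   # a split may propagate upward
        node.cf.add(point)            # 3. update CF info up the path
    else:
        # 2. Modify the leaf: absorb the point if the threshold holds.
        entry = node.closest_entry(point)
        if entry.diameter_with(point) <= threshold:
            entry.add(point)
        else:
            node.add_entry(point)     # new subcluster; leaf may overfill
```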
Example of the BIRCH Algorithm
[Figure: a new subcluster sc8 arrives. The root points to leaf nodes LN1, LN2, and LN3, which hold the existing subclusters sc1–sc7; sc8 falls into LN1.]
Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN1 is split
[Figure: LN1 splits into LN1′ and LN1″; the root now points to LN1′, LN1″, LN2, and LN3.]
Merge Operation in BIRCH
If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one
[Figure: the root splits into non-leaf nodes NLN1 (pointing to LN1′ and LN1″) and NLN2 (pointing to LN2 and LN3).]
Merge Operation in BIRCH
Assume that the subclusters are numbered according to the order of formation
[Figure: a CF tree whose root points to leaf nodes LN1 (holding sc1, sc2) and LN2 (holding sc3–sc6).]
Merge Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, then LN2 is split into LN2′ and LN2″
[Figure: the root now points to LN1 (sc1, sc2), LN2′ (sc3, sc4), and LN2″ (sc5, sc6).]
Merge Operation in BIRCH
LN2′ and LN1 will be merged, and the newly formed node will be split immediately
[Figure: the merged node splits into LN3′ and LN3″; the root now points to LN3′, LN3″, and LN2″.]
Birch Clustering Algorithm (1)
• Phase 1: scan all data and build an initial in-memory CF tree
• Phase 2: condense the tree to a desirable size by building a smaller CF tree
• Phase 3: global clustering
• Phase 4: cluster refining – optional; requires more passes over the data to refine the results
Birch Clustering Algorithm (2)
[Figure: overview diagram of the four BIRCH phases.]
Birch – Phase 1
• Start with an initial threshold and insert points into the tree
• If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then the remaining values
• A good initial threshold is important but hard to figure out
• Outlier removal: outliers can be removed while rebuilding the tree
Birch - Phase 2
• Optional
• The clustering algorithm used in phase 3 performs well only within a certain input size, so phase 2 condenses the tree to prepare for phase 3
• BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF tree, which removes sparse clusters as outliers and groups dense clusters into larger ones
Birch – Phase 3
• Problems after phase 1:
– Input order affects results
– Splitting is triggered by node size
• Phase 3:
– Cluster all leaf entries by their CF values using an existing algorithm
– Algorithm used here: agglomerative hierarchical clustering
Birch – Phase 4
• Optional
• Do additional passes over the dataset and reassign data points to the closest centroid from phase 3
• Recalculate the centroids and redistribute the items
• Always converges (no matter how many times phase 4 is repeated); a sketch follows below
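Read as a k-means-style refinement, one phase-4 pass might look like this sketch (NumPy; the function name is illustrative):

```python
import numpy as np

def refine(X, centroids, passes=2):
    # Reassign each point to the closest phase-3 centroid,
    # then recompute the centroids; repeat for a few passes.
    for _ in range(passes):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```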
Conclusions (1)
• BIRCH runs faster than existing algorithms (CLARANS and k-means) on large datasets
• Scans the whole dataset only once
• Handles outliers better
• Superior to other algorithms in stability and scalability
Conclusions (2)
• Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster. Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster
BIRCH Clustering
• Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
Pros & Cons of BIRCH
• Linear scalability
– Good clustering with a single scan
– Quality can be further improved by a few additional scans
• Can handle only numeric data (see the usage example below)
• Sensitive to the order of the data records
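For comparison, scikit-learn ships a BIRCH implementation whose parameters mirror the ones above (threshold and branching factor); a minimal usage example on random numeric data:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(1000, 2)     # numeric data only
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)                  # builds the CF tree in one scan
print(np.bincount(labels))                     # sizes of the three clusters
```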
Drawbacks of Square Error Based Methods
• One representative per cluster
– Good only for convex-shaped clusters of similar size and density
• The number of clusters, k, is a parameter
– Good only if k can be reasonably estimated
Drawback of Distance-based Methods
• Hard to find clusters with irregular shapes
• Hard to specify the number of clusters
• Heuristic: a cluster must be dense
Directly Density Reachable
• Parameters (example: MinPts = 3, Eps = 1 cm)
– Eps: maximum radius of the neighborhood
– MinPts: minimum number of points in an Eps-neighborhood of that point
– $N_{Eps}(p) = \{q \mid dist(p, q) \le Eps\}$
• Core object p: $|N_{Eps}(p)| \ge MinPts$
• Point q is directly density-reachable from p iff $q \in N_{Eps}(p)$ and p is a core object (see the sketch below)
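These definitions translate directly into code (a small sketch; the function names are illustrative):

```python
import math

def eps_neighborhood(points, p, eps):
    # N_Eps(p) = {q | dist(p, q) <= Eps}; note p itself is included.
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    # p is a core object iff |N_Eps(p)| >= MinPts.
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    # q is directly density-reachable from p iff q is in N_Eps(p)
    # and p is a core object.
    return is_core(points, p, eps, min_pts) and math.dist(p, q) <= eps

pts = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(directly_density_reachable(pts, (0, 1), (0, 0), eps=1.5, min_pts=3))  # True
```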
Density-Based Clustering: Background (II)
• Density-reachable
– If each step $p_1 \to p_2, p_2 \to p_3, \ldots, p_{n-1} \to p_n$ is directly density-reachable, then $p_n$ is density-reachable from $p_1$
[Figure: q reaches p through an intermediate point p1.]
• Density-connected
– If points p and q are both density-reachable from some point o, then p and q are density-connected
[Figure: p and q connected through o.]
DBSCAN
• A cluster: a maximal set of density-connected points
– Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.]
DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed (a library usage example follows below)
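In practice this loop is rarely written by hand; scikit-learn's DBSCAN takes exactly these two parameters (a minimal usage example on synthetic data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2),          # one dense blob
               rng.randn(100, 2) + 8])     # a second blob, far away
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(set(labels))                         # cluster ids; -1 marks noise/outliers
```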
Problems of DBSCAN
• Different clusters may have very different densities
• Clusters may form hierarchies