
BIRCH: A New Data Clustering
Algorithm and Its Applications
• Tian Zhang
• Raghu Ramakrishnan
• Miron Livny
• Presented by: Peter Vile
• Data Clustering:
Problem of dividing N data points into K
groups so as to minimize an intra-group
difference metric.
Many algorithms already exist.
Problem:
Due to the abundance of local minima, there is no way to find a globally minimal solution without trying all possible partitions.
Other Methods
• Probability-Based Clustering
• COBWEB
– uses probabilistic measurements (Category Utility) for making decisions
– Hierarchical, Iterative
• Disadvantage
– Category Utility takes time and memory
– tends to overfit
Cobweb’s Limits
-assumes the probability distributions of attributes are independent
-attributes can only take discrete values; continuous data is approximated by discretization
-storing and manipulating large data sets becomes infeasible
The Competition
• Distance-Based Clustering
• KMEANS, CLARANS
• CLARANS
– like KMEANS
– Node: a set of medoids
– starts at a random node and moves to the closest neighbor
BIRCH
• Good
-does not assume attributes are independent
-minimizes memory usage
-scans the data once from disk
-can handle very large data sets (uses the concept of summarization)
-exploits the observation that not every data point is equally important for clustering purposes
Limitations of BIRCH
• Handles only metric data
Definitions
• Centroid
– avg value
• Radius
– std dev
• Diameter
– avg. pairwise distance
within a cluster

– Centroid: $X_0 = \frac{\sum_{i=1}^{N} X_i}{N}$
– Radius: $R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$
– Diameter: $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
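A minimal NumPy sketch of these three statistics, computed straight from the raw points (the function name cluster_stats and the points array are mine, not from the paper):

```python
import numpy as np

def cluster_stats(points):
    """Centroid X0, radius R, and diameter D of one cluster (rows = data points)."""
    N = len(points)
    X0 = points.mean(axis=0)                                  # centroid: average point
    R = np.sqrt(((points - X0) ** 2).sum(axis=1).mean())      # RMS distance to centroid
    diffs = points[:, None, :] - points[None, :, :]           # all pairwise differences
    D = np.sqrt((diffs ** 2).sum(axis=2).sum() / (N * (N - 1)))
    return X0, R, D
```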
How Far Away
• Given the centroids of two clusters:
• centroid Euclidean distance D0
$D_0 = \left( (X_{01} - X_{02})^2 \right)^{1/2}$
• centroid Manhattan distance D1
$D_1 = |X_{01} - X_{02}| = \sum_{i=1}^{d} |X_{01}^{(i)} - X_{02}^{(i)}|$
– d is the dimensionality of the data
More Distances
• Average Inter-cluster Distance (D2)
$D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (X_i - X_j)^2}{N_1 N_2} \right)^{1/2}$
• Average Intra-cluster Distance (D3)
$D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (X_i - X_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}$
• Variance Increase Distance (D4)
$D_4 = \sum_{k=1}^{N_1+N_2} \left( X_k - \frac{\sum_{l=1}^{N_1+N_2} X_l}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( X_i - \frac{\sum_{l=1}^{N_1} X_l}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( X_j - \frac{\sum_{l=N_1+1}^{N_1+N_2} X_l}{N_2} \right)^2$
(here $X_1, \dots, X_{N_1}$ are the points of the first cluster and $X_{N_1+1}, \dots, X_{N_1+N_2}$ those of the second)
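A hedged NumPy sketch of D0–D4 for two clusters given as point arrays c1 and c2 (names are mine; in BIRCH these distances are actually derived from CF entries, as described next):

```python
import numpy as np

def d0_d1(c1, c2):
    """Centroid Euclidean (D0) and centroid Manhattan (D1) distances."""
    x01, x02 = c1.mean(axis=0), c2.mean(axis=0)
    return np.sqrt(((x01 - x02) ** 2).sum()), np.abs(x01 - x02).sum()

def d2(c1, c2):
    """Average inter-cluster distance: RMS over all cross-cluster point pairs."""
    diffs = c1[:, None, :] - c2[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2).mean())

def d3(c1, c2):
    """Average intra-cluster distance (diameter) of the merged cluster."""
    merged = np.vstack([c1, c2])
    n = len(merged)
    diffs = merged[:, None, :] - merged[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2).sum() / (n * (n - 1)))

def d4(c1, c2):
    """Variance increase caused by merging the two clusters."""
    sse = lambda pts: ((pts - pts.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([c1, c2])) - sse(c1) - sse(c2)
```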
Clustering Feature (CF)
• “A Clustering Feature (CF) entry is a
triple summarizing the information that we
maintain about a sub-cluster of data
points.”

• CF Definition: CF = (N, $\vec{LS}$, SS)
– N : number of data points in the cluster
– $\vec{LS}$ : linear sum of the N data points, $\vec{LS} = \sum_{i=1}^{N} X_i$
– SS : square sum of the N data points, $SS = \sum_{i=1}^{N} X_i^2$
• CF Representativity Theorem: Given the CF entries of all sub-clusters, all the measurements defined above ($X_0$, R, D, and D0–D4) can be computed accurately
• CF Additivity Theorem: Assume that CF1
and CF2 are the CF entries of two disjoint
sub-clusters. Then the CF entry of the sub-cluster formed by merging the two disjoint sub-clusters is:
$CF_1 + CF_2 = (N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2)$
where $CF_1 = (N_1, \vec{LS}_1, SS_1)$ and $CF_2 = (N_2, \vec{LS}_2, SS_2)$
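A minimal Python sketch of a CF entry (class and method names are mine; SS is kept as the scalar sum of squared norms). It also hints at the representativity theorem: the centroid and radius fall straight out of (N, LS, SS):

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) for one sub-cluster."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float((x ** 2).sum()))

    def __add__(self, other):
        # Additivity theorem: the CF of the merged sub-cluster is the component-wise sum
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - |X0|^2, which follows from expanding sum((Xi - X0)^2)
        return float(np.sqrt(max(self.ss / self.n - (self.centroid() ** 2).sum(), 0.0)))
```

Merging two sub-clusters in this sketch is simply CF.from_point([1.0, 2.0]) + CF.from_point([3.0, 4.0]).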
CF-tree Features
• Has two parameters
– 1. Branching Factor
• B - non-leaf node ( [CF i, child i], i = 1..B)
– child i - pointer to its i-th child node
• L - leaf node ( [CF j, prev, next], j = 1..L)
– 2. Threshold
• specifies the maximum size of each leaf entry
– diameter (D) of each leaf entry < T
– or radius (R) of each leaf entry < T
CF-tree Features (Continued)
• Tree size is a function of T
– tree size = f(T)
– as T increases, tree size decreases
• Page Size (P)
– A node is required to fit in a page of size P
– P can be varied for performance tuning
• CF tree will be built dynamically as new
data objects are inserted
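A rough sketch of the two node layouts, reusing the hypothetical CF class above; B, L, and the page size P would be enforced by the insertion code rather than by the structures themselves:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LeafNode:
    # at most L entries; each entry is a sub-cluster CF whose diameter (or radius) < T
    entries: List["CF"] = field(default_factory=list)
    prev: Optional["LeafNode"] = None    # leaf nodes are chained for sequential scans
    next: Optional["LeafNode"] = None

@dataclass
class NonLeafNode:
    # at most B (CF, child) pairs; each CF summarizes the entire subtree under child
    entries: List[Tuple["CF", object]] = field(default_factory=list)
```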
CF-tree
• Two Algorithms used to build a CF-tree
– 1. Insertion Algorithm
• Purpose: Building a CF-Tree
– 2. Rebuilding Algorithm
• Purpose:
– Rebuild the whole tree with larger T (smaller size)
– this happens when CF-tree size limit is exceeded
Insertion Algorithm
• 1. Identifying the appropriate leaf
– at each non-leaf node, use the distance metric to choose the closest branch
• 2. Insert the entry into a leaf node
– merge with the closest leaf entry [CF i, prev, next]
– if T is violated, make a new leaf entry [CF i+1, prev, next]
– if L is violated, split into two leaf nodes
• choose the two entries that are farthest apart as seeds
• assign the remaining entries to the leaf node with the closer seed
Insertion Algorithm
• 3. Update tree path
– if B is violated split the node
– CF should be the sum of child CFs
• 4. A Merging Refinement
– try to combine two closest children of the node
that did not split
– might free space
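A simplified sketch of steps 1–2, continuing the hypothetical CF / LeafNode / NonLeafNode classes above (splits and the path CF updates of steps 3–4 are only marked, not implemented):

```python
import numpy as np

def insert_point(root, x, T, L):
    """Sketch of inserting one point into a CF-tree (split handling omitted)."""
    new_cf = CF.from_point(x)
    node = root
    # 1. identify the appropriate leaf: follow the closest CF at each non-leaf node
    while isinstance(node, NonLeafNode):
        _, node = min(node.entries,
                      key=lambda e: np.linalg.norm(e[0].centroid() - new_cf.centroid()))
    # 2. try to merge with the closest leaf entry
    if node.entries:
        closest = min(node.entries,
                      key=lambda cf: np.linalg.norm(cf.centroid() - new_cf.centroid()))
        merged = closest + new_cf
        if merged.radius() < T:                       # threshold test (radius variant)
            node.entries[node.entries.index(closest)] = merged
            return
    node.entries.append(new_cf)                       # otherwise add a new leaf entry
    if len(node.entries) > L:
        pass  # split the leaf: farthest pair as seeds, redistribute the rest (omitted)
    # 3.-4. updating CFs along the path and the merging refinement are omitted here
```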
Rebuilding Algorithm
• When to do it?
– If the CF-tree size limit is exceeded
• What does it do?
– Creates a new tree with a larger T (diameter/radius)
• larger T -> smaller tree size
• How
– Deletes path from old tree and adds path to new
tree.
Rebuilding Algorithm
• Procedure
– ‘OldCurrentPath’ starts at leftmost branch
– 1. Create the ‘NewCurrentPath’ in new tree
• copy nodes from OldCurrentPath into
NewCurrentPath
– 2. Insert leaf entries in ‘OldCurrentPath’ into
the new tree
• if a leaf entry fits into an earlier path of the new tree, insert it there instead of into 'NewCurrentPath'
Rebuilding Algorithm
– 3. Free space in ‘OldCurrentPath’ and
‘NewCurrentPath’
• delete nodes in ‘OldCurrentPath’
• if ‘NewCurrentPath’ is empty, delete its nodes
– 4. Process the next path in the old tree
– only needs enough pages to cover
‘OldCurrentPath’
• usually the height of the tree
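A deliberately naive sketch of the idea, continuing the classes above: it pours the old tree's leaf CFs into a flat list of entries, merging whenever the larger threshold allows, rather than rebuilding a real tree path by path:

```python
import numpy as np

def leftmost_leaf(node):
    """Follow first-child pointers down to the leftmost leaf node."""
    while isinstance(node, NonLeafNode):
        node = node.entries[0][1]
    return node

def rebuild_entries(old_root, new_T):
    """Naive rebuild sketch: re-absorb every old leaf CF under the larger threshold new_T.
    Produces a flat list of leaf entries instead of a real CF-tree, and ignores the
    path-by-path space reuse that keeps the real algorithm within ~height extra pages."""
    new_entries = []
    leaf = leftmost_leaf(old_root)
    while leaf is not None:
        for cf in leaf.entries:
            target = min(new_entries,
                         key=lambda e: np.linalg.norm(e.centroid() - cf.centroid()),
                         default=None)
            if target is not None and (target + cf).radius() < new_T:
                new_entries[new_entries.index(target)] = target + cf   # absorbed
            else:
                new_entries.append(cf)                                 # kept as its own entry
        leaf = leaf.next          # leaf nodes are chained via prev/next pointers
    return new_entries
```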
Potential Problems
• Anomaly 1: Natural cluster is split across
two leaf nodes, or two distant clusters are
placed in the same node
• Anomaly 2: Sub-cluster ends up in the wrong leaf node
Reducibility Theorem: Assume we rebuild CF-tree $t_{i+1}$ of threshold $T_{i+1}$ from CF-tree $t_i$ of threshold $T_i$ by the above algorithm, and let $S_i$ and $S_{i+1}$ be the sizes of $t_i$ and $t_{i+1}$ respectively. If $T_{i+1} \ge T_i$, then $S_{i+1} \le S_i$, and the transformation from $t_i$ to $t_{i+1}$ needs at most $h$ extra pages of memory, where $h$ is the height of $t_i$.
BIRCH Clustering Algorithm
• Four Phases
– 1. Loading
• scan all data and build an initial in-memory CF-tree
– 2. Optional Condensing
• rebuild CF-tree to make it smaller
• faster analysis, but reduced accuracy
– 3. Global Clustering
• run a global clustering algorithm (e.g., KMEANS) on the leaf CFs
• handles anomaly 1
BIRCH Clustering Algorithm
– 4. Optional Refining
• Use the centroids of the clusters as seeds
• Scan data again and assign points to closest seed
• Handles anomaly 2
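For experimentation, scikit-learn's Birch estimator follows the same phase structure (CF-tree loading, then a global clustering step over the leaf sub-clusters). A minimal usage sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))                  # stand-in for a large data set

# threshold plays the role of T, branching_factor of B;
# n_clusters drives the global clustering phase over the leaf sub-clusters
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print(birch.subcluster_centers_.shape)            # centroids of the CF-tree leaf entries
print(np.bincount(labels))                        # sizes of the final clusters
```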
BIRCH Phase 1
• Building CF-tree
– Heuristic Threshold (T default = 0)
• when rebuilding, a new (larger) T is needed (diameter/radius)
• use the average distance between the closest pairs of leaf entries in the same node (see the sketch below)
• should reduce the size of the tree by about half
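A small sketch of that heuristic, assuming the LeafNode/CF classes from earlier (the paper's actual heuristic is more elaborate):

```python
import numpy as np

def next_threshold(leaves):
    """Average, over leaf nodes, of the distance between the closest pair of entries."""
    closest = []
    for leaf in leaves:                     # iterable of LeafNode objects
        cents = [cf.centroid() for cf in leaf.entries]
        if len(cents) < 2:
            continue
        closest.append(min(np.linalg.norm(a - b)
                           for i, a in enumerate(cents) for b in cents[i + 1:]))
    return float(np.mean(closest)) if closest else 0.0
```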
BIRCH Phase 1
– Outlier Handling Option
• a CF with small N (# of data points) is saved to disk
• try to reinsert when
– run out of memory
– finish reading data
• if data is noisy, improves runtime and accuracy
– Delay Split Option
• about to run out of memory
• CFs that would cause the tree to split are saved to disk
How does this compare to other
clustering methods?
Ran against KMEANS and CLARANS
Results
– [comparison figures omitted]
Runtime
– [runtime comparison figure omitted]
Conclusions
• BIRCH vs CLARANS and KMEANS
– runs faster (fewer scans)
– less order sensitive
– less memory
Where can I use this?
• Interactive and Iterative Pixel Classification
– MVI (Multi-band Vegetation Imager)
– BIRCH helps classify pixels through clustering
• Codebook Generation in Image Compression
– compressing visual data to save space
– code book - vector code words for image
blocks
– BIRCH assigns nearest code word to vector
Main limitations of BIRCH?
• Ability to only handle metric data.
Name the two algorithms in BIRCH
clustering?
1. Insertion
2. Rebuilding
What is the purpose of phase 4 in
the BIRCH clustering algorithm?
-ensures all copies of a given data point go to the same cluster
-provides an option to discard outliers
-can converge to a minimum