#### Transcript PP140-141

```Data Mining Technology
Chapter 12
Clustering: Large Databases
Written by Farial Shahnaz
Presented by Zhao Xinyou
5/29/2008, AI Lab @ UEC in Japan
Contents


Introduction
Ideas behind the three major approaches to scalable
clustering
{Divide-and-Conquer, Incremental, Parallel}

Three algorithms for scalable clustering
{BIRCH, DBSCAN, CURE}

Application
Introduction – Common method

PP133
The common method for clustering visits all data
in the database and analyzes it, even when the
number of records is huge (millions).

Time: computational complexity O(n*n).
Memory: all data must be loaded into main
memory.
[Figure: time/memory cost grows with the amount of data]
PP134
Motivation – Clustering for large databases
[Figure: the existing method costs f(x) = O(n*n) in time/memory
as data grows; we want a method ??? that costs f(x) = O(n)]
PP134
Requirements – Clustering for large databases

No more (preferably fewer) than one
scan of the database
Process each record only once
Work with limited memory

During execution, the method should:
Be able to suspend, stop, and resume
Be able to update the results when new
data is inserted or removed
Be able to apply different techniques to
scan the database

[Figure: reduce the cost from f(x) = O(n*n) to f(x) = O(n)]
PP135
Major approaches for scalable clustering

Divide-and-Conquer approach
Parallel clustering approach
Incremental clustering approach

Divide-and-Conquer approach
PP135
[Figure: a 9x9 Sudoku, as an example of solving a problem by sub-grids]
Definition.
Divide-and-conquer is a problem-solving approach in
which we:
divide the problem into sub-problems,
recursively conquer or solve each sub-problem, and
then
combine the sub-problem solutions to obtain a
solution to the original problem.
Key Assumptions
1. Problem solutions can be constructed from sub-problem solutions.
2. Sub-problem solutions are independent of one another.
Parallel clustering approach

PP136-137
Idea: divide the data into small sets and run each
small set on a different machine (this comes from
Divide-and-Conquer).

Given n records and an aim of k clusters:
Divide: split the n records into p partitions
P0, P1, …, Pi, each holding n/p records
(the divide step is itself some algorithm).
Conquer: cluster each partition into k clusters
on its own processor
(the conquer step is itself some algorithm).
Merge: merge the resulting k*p partial clusters
into the final k clusters.
[Figure: n records divided into p partitions of n/p records each;
every partition Pi is clustered into k clusters, and the k*p partial
clusters are merged into the final k clusters]
Application





Sorting: quick-sort and merge sort
Fast Fourier transforms
Tower of Hanoi puzzle
Matrix multiplication
…
PP135
CURE – Divide-and-Conquer

PP140-141
1. Get the size n of the set D and partition D into p groups
(each containing n/p elements).
2. Cluster each group pi into k clusters, using a
heap and a k-d tree.
3. Delete the nodes with no relationship (outliers) from the heap and k-d
tree.
4. Cluster the partial clusters to get the final clusters.
[Figure: heap structure]
PP140-141
k-D Tree

Technically, the letter k refers to the number of
dimensions.
[Figure: a 3-dimensional k-d tree]
PP140-141
CURE – Divide-and-Conquer
[Figure: repeatedly find the nearest pair of clusters and merge them]
Incremental clustering approach

PP135-136
Idea: scan all the data in the database and compare each record with
the existing clusters; if a similar cluster is found, assign the record to
that cluster, or else create a new cluster. Go on till no data remains.
Steps:
1. S = {};                        // set of clusters = NULL
2. do {
3.     read the next record d from the database;
4.     r = find_similarity_cluster(d, S);
5.     if (r exists)
6.         assign d to the cluster r;
7.     else
8.         create a new cluster for d and add it to S;
9. } until (no record left in the database);
Application--Incremental clustering
approach


BIRCH
Balanced Iterative Reducing and Clustering
using Hierarchies
DBSCAN
Density-Based Spatial Clustering of
Applications with Noise
PP137-138
BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies)

Based on a distance measurement, compute
the similarity between a record and the clusters and
produce the clusters.
[Figure: inner-cluster vs. among-cluster distances]
PP137-138
BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies)
[Figure: inner-cluster vs. among-cluster distances]
Related Definitions

Cluster: {xi}, where i = 1, 2, …, N
CF (Clustering Feature): a triple (N, LS, SS),
where N is the number of data points, LS is the
linear sum of the N data points, and SS is the
square sum of the N data points.
Related Definitions

CF tree = (B, T), where
B = (CFi, childi) if it is an internal node in a cluster,
B = (CFi, prev, next) if it is an external (leaf) node
in a cluster, and
T is the threshold for all leaf nodes, which must
satisfy mean distance D < T.
Algorithm for BIRCH
DBSCAN

DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
Ex1:
We want to classify the houses along a river in a
spatial photo.
Ex2:
Definition for DBSCAN

Eps-neighborhood of a point
The Eps-neighborhood of a point p, denoted
by NEps(p), is defined by
NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Minimum number (MinPts)
MinPts is the minimum number of data
points in any cluster.
Definition for DBSCAN

Directly density-reachable
A point p is directly density-reachable from a
point q w.r.t. Eps and MinPts if
1) p ∈ NEps(q);
2) |NEps(q)| ≥ MinPts.
Definition for DBSCAN

Density-reachable
A point p is density-reachable from a point q
w.r.t. Eps and MinPts if there is a chain of points
p1, p2, …, pn with p1 = q and pn = p such that pi+1
is directly density-reachable from pi.
Algorithm of DBSCAN
Input:
D = {t1, t2, …, tn}
MinPts
Eps
Output:
K = {K1, K2, …, Kk}

k = 0;
for i = 1 to n do
    if ti is not yet in a cluster then
        X = {tj | tj is density-reachable from ti};
        if X is a valid cluster then
            k = k + 1;
            Kk = X;
        end if
    end if
end for
```
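The O(n*n) cost of the common method described on PP133 comes from comparing every record with every other record. A minimal sketch in Python for 1-D points (`pairwise_distances` is a hypothetical helper name, not from the slides):

```python
def pairwise_distances(points):
    """Naive all-pairs distance matrix: O(n^2) time and memory."""
    n = len(points)
    return [[abs(points[i] - points[j]) for j in range(n)] for i in range(n)]
```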
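Merge sort, listed among the applications of divide-and-conquer, shows all three steps (divide, conquer, combine) in a few lines; a sketch:

```python
def merge_sort(xs):
    # Divide: split the list in half.
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    # Conquer: recursively sort each half.
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # Combine: merge the two sorted halves into one sorted list.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```

Both key assumptions hold here: the sorted halves combine into a sorted whole, and each half is sorted independently of the other.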
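The parallel clustering idea on PP136-137 (divide the n records into p partitions, cluster each partition into k clusters, then merge the k*p partial clusters into the final k) can be sketched as below. The 1-D `kmeans_1d` helper and the use of a thread pool are illustrative assumptions, not part of the slides:

```python
from concurrent.futures import ThreadPoolExecutor

def kmeans_1d(points, k, iters=20):
    """Tiny deterministic k-means for 1-D data (illustration only)."""
    pts = sorted(points)
    # Seed centroids evenly across the sorted data.
    step = (len(pts) - 1) / max(k - 1, 1)
    centroids = [pts[round(i * step)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in pts:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            groups[j].append(x)
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def parallel_cluster(data, p, k):
    """Divide n records into p partitions, cluster each into k clusters
    in parallel, then merge the k*p partial results into k clusters."""
    chunks = [data[i::p] for i in range(p)]           # divide
    with ThreadPoolExecutor(max_workers=p) as ex:     # conquer in parallel
        partials = list(ex.map(lambda c: kmeans_1d(c, k), chunks))
    merged = [c for part in partials for c in part]   # k*p partial centroids
    return kmeans_1d(merged, k)                       # merge into k clusters
```

A real deployment would run the conquer step on separate machines; the thread pool only stands in for that structure.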
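The k-d tree used by CURE splits on one coordinate axis per level, cycling through the k dimensions; a minimal build sketch (the dict-based node layout is an assumption for illustration):

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree: at each level, sort on one axis and place the
    median point at the node; k = the number of dimensions."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                      # axis to split on at this level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median point becomes the node
    return {
        "point": points[mid],
        "left":  build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }
```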
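The incremental clustering steps on PP135-136 can be sketched as a one-pass "leader"-style procedure; the similarity threshold and the running-mean centroid for 1-D records are illustrative assumptions, not from the slides:

```python
def incremental_cluster(stream, threshold):
    """One pass over the data: assign each record to the closest existing
    cluster within `threshold`, or else create a new cluster.
    Each cluster is [centroid, count] for 1-D records."""
    clusters = []
    for d in stream:
        best = None                      # find_similarity_cluster(d, S)
        for c in clusters:
            if abs(d - c[0]) <= threshold and (
                    best is None or abs(d - c[0]) < abs(d - best[0])):
                best = c
        if best is None:
            clusters.append([d, 1])      # create a new cluster for d
        else:
            best[1] += 1                 # assign d to the cluster
            best[0] += (d - best[0]) / best[1]   # running-mean centroid
    return clusters
```

Each record is processed exactly once and only the cluster summaries stay in memory, matching the requirements slide.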
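The BIRCH Clustering Feature (N, LS, SS) is additive, which is what lets two sub-clusters merge in constant time; a sketch for 1-D data (the function names are illustrative):

```python
import math

def cf(points):
    """Clustering Feature of a set of 1-D points: (N, LS, SS)."""
    return (len(points), sum(points), sum(x * x for x in points))

def cf_merge(a, b):
    """CFs are additive, so two sub-clusters merge in O(1)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def cf_centroid(f):
    return f[1] / f[0]                    # LS / N

def cf_radius(f):
    # sqrt(SS/N - (LS/N)^2): spread of the points around the centroid,
    # computed from the triple alone, without revisiting the data.
    return math.sqrt(max(f[2] / f[0] - (f[1] / f[0]) ** 2, 0.0))
```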
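The DBSCAN algorithm on the last slide can be sketched directly from the definitions (Eps-neighborhood, the core-point test |NEps(q)| ≥ MinPts, and density-reachability); this 2-D version is a simplified O(n*n) illustration, not the paper's implementation:

```python
import math

def region_query(points, i, eps):
    """NEps(p): indices of all points within eps of points[i] (p included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or -1 for noise."""
    labels = [None] * len(points)         # None = not yet visited
    k = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                # noise (may become a border point)
            continue
        k += 1                            # i is a core point: new cluster
        labels[i] = k
        seeds = list(neighbors)
        while seeds:                      # collect all density-reachable points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = k             # border point: label, don't expand
            if labels[j] is not None:
                continue
            labels[j] = k
            nj = region_query(points, j, eps)
            if len(nj) >= min_pts:        # j is also a core point: expand
                seeds.extend(nj)
    return labels
```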