Transcript PP140-141

Data Mining Technology
Chapter 12
Clustering: Large Databases
Written by Farial Shahnaz
Presented by Zhao Xinyou
5/29/2008AI Lab @ UEC in Japan
Contents


Introduction
Idea for there major approaches for scalable
clustering
{Divide-and-Conquer, Incremental, Parallel}

There approaches for scalable clustering
{ BIRCH, DSBCAN, CURE}

Application
Introduction –Common method

PP133
Common method for clustering: visit all data
from database and analyze the data, just like:
 huge, huge number  millions


Time: Computational Complexities: O(n*n).
Memory: Need to load all data to main
memory
Time/
Memory
Data
PP134
Motivation—Clustering for large database
Time/
Memory
Data
Method ???
f(x): O(n*n).
Time/
Memory
Data
f(x):
O(n).
PP134
Requirement—Clustering for large database

Time/
Memory


No more (preferably less) than one
scan of the database.
Process each [record] only once
With limited memory
Data
Method ???
f(x): O(n*n).




Can suspend, stop, and resume
Can update the results when new
data inserted or removed
Can perform different technology to
scan the database
During execution, method should
provide status and ‘best’ answer.
Time/
Memory
Data
f(x):
O(n).
PP135
Major approach for scalable clustering



Divide-and-Conquer approach
Parallel clustering approach
Incremental clustering approach
Divide-and Conquer approach
PP135
9*9 数独




Definition.
Divide-and-conquer is a problem-solving approach in
which we:
divide the problem into sub-problems,
recursively conquer or solve each sub-problem, and
then
combine the sub-problem solutions to obtain a
solution to the original problem.
Key Assumptions
1.Problem solutions can be constructed using subproblem solutions.
2.Subproblem solutions are independent of one another.
Parallel clustering approach

PP136-137
Idea: Divide data into small set and then run
small set on different machine (Come from
Divide-and-Conquer)
Explanation about Divide-and-Conquer
Divide
Pi
P1
n/p
P0
n/p
Divide is some algorithms
Conquer is some algorithms
n/p
Conquer
K
kp
clusters
clusters Merging
K
clusters
Record: n
Aim: k cluster
P0
n/p
K
clusters
n/p
Divide
Conquer
K
clusters
K
Conquer clusters
P1
Merging
Pi
n/p
Conquer
K
clusters
kp
clusters
K
clusters
Application





Sorting: quick-sort and merge sort
Fast Fourier transforms
Tower of Hanoi puzzle
matrix multiplication
…..
PP135
CURE- Divide-and-Conquer




PP140-141
1.Get the size n of set D and partition D into p group
(contain n/p elements)
2.To each group pi, clustered into k groups by using
Heap and k-d tree
3.delete some no relationship node in Heap and k-d
tree
4. Cluster the partial clusters and get the final cluster
Heap
PP140-141
PP140-141
k-D Tree

Technically, the letter k refers to the number of
dimensions
3-dimensional kd-tree
K-D Tree
PP140-141
CURE- Divide-and-Conquer
Nearest
Merge
Merge
Nearest
PP140-141
Incremental clustering approach











PP135-136
Idea: scan all data in database, Compare with the existing clusters,
if find similar cluster, assign it to with cluster, or else, create a new
cluster. Go on till no data
Steps:
1. S={};//set cluster = NULL
2. do{
3. read one record d;
4. r = find_simiarity_cluster(d, S);
5. if (r exists)
6.
assign d to the cluster r
6. else
7.
Add_cluster(d, S);
8. } untill (no record in database);
Application--Incremental clustering
approach


BIRCH
Balanced Iterative Reducing and Clustering
using Hierarchies
DBSCAN
Density-Based Spatial Clustering of
Application with Noise
PP137-138
BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies )

Based on distance measurement, compute
the similarity between record and cluster and
give the clusters.
Inner Cluster

Among Cluster

PP137-138
BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies )
Inner Cluster
Among Cluster
Related Definiation


Cluster: {xi}, where i = 1, 2, …, N
CF(Clustering Feature):is a triple,
(N,LS,SS),N:number of data;LS:linear
sum of N data ;
SS:Square sum
Related Definiation




CF tree = (B,T),
B = (CFi, childi), if is internal node in a cluster
B = (CFi, prev, next) if is external or leaf node
in a cluster.
T: threshold for all leaf node, which should
satisfy mean distance D < T
Algorithm for BIRCH
DBSCAN




DBSCAN: Density-Based Spatial Clustering of
Application with Noise
Ex1:
We want to class house along with river from one
spatial photo
Ex2:
Definition for DBSCAN




Eps-neighborhood of a point
The Eps-neighborhood of a point p, denoted
by NEps(p), is defined by
NEps(p)={q∈D|dist(p,q) ≤ Eps}
Minimum Number (MinPts)
The MinPts is the minimum number of data
points in any cluster.
Definition for DBSCAN




Directly density-reachable
A point p is directly density-reachable from a
point q. Eps and MinPts if
1): p ∈ NEps(q);
2): |NEps(q)|≥MinPts;
Definition for DBSCAN


Density-reachable
A point p is density-reachable from a point q.
Eps and MinPts if there is a chain of points
p1,p2,…,pn,p=p1,q=pn such as pi+1 is directly
desity-reachable from pi;
Definition for DBSCAN


Density-reachable
A point p is density-reachable from a point q.
Eps and MinPts if there is a chain of points
p1,p2,…,pn,p=p1,q=pn such as pi+1 is directly
desity-reachable from pi;
Algorithm of DBSCAN
Input
D={t1,t2,…,tn}
MinPts
Eps
Output
K=K1,K2,…Kk
k = 0;
for i =1 to n do
if ti is not in a cluster then
X={ti| tj is density-reachable from ti}
end if
if X is a valid cluster then
k= k+1;
Kk = X;
end if
end for