Data Mining - Clustering


DATA MINING CLUSTERING
Clustering
 Clustering - unsupervised classification
 Clustering - the process of grouping physical or abstract objects into classes of similar objects
 Clustering helps to construct a meaningful partitioning of a large set of objects
 Data clustering is used in statistics, machine learning, spatial databases and data mining
CLARANS Algorithm
 CLARANS - Clustering Large Applications based on RANdomized Search - presented by Ng and Han
 CLARANS - based on randomized search and on 2 statistical algorithms: PAM and CLARA
 Method of the algorithm - search for a local optimum
 Example of algorithm usage
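As a concrete illustration of the randomized search, here is a minimal Python sketch assuming a precomputed pairwise distance matrix; the parameter names num_local and max_neighbor and the helper cost() are illustrative and not taken from the slides.

import numpy as np

def clarans(dist, k, num_local=5, max_neighbor=50, seed=0):
    """CLARANS-style randomized search for k medoids.
    dist -- (n, n) matrix of pairwise distances, k -- number of clusters."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]

    def cost(medoids):
        # total distance of every object to its closest medoid
        return dist[:, medoids].min(axis=1).sum()

    best_medoids, best_cost = None, np.inf
    for _ in range(num_local):                   # num_local independent restarts
        current = list(rng.choice(n, size=k, replace=False))
        current_cost = cost(current)
        failed = 0
        while failed < max_neighbor:             # examine random neighbours of the current node
            i = int(rng.integers(k))             # medoid to swap out
            others = [o for o in range(n) if o not in current]
            candidate = current.copy()
            candidate[i] = others[int(rng.integers(len(others)))]
            candidate_cost = cost(candidate)
            if candidate_cost < current_cost:    # move to the cheaper neighbour
                current, current_cost = candidate, candidate_cost
                failed = 0
            else:
                failed += 1
        if current_cost < best_cost:             # keep the best local optimum found
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost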
Focusing Methods
 FM - based on the CLARANS algorithm and on efficient spatial access methods, such as the R*-tree
 The focusing on representative objects technique
 The focus on relevant clusters technique
 The focus on a cluster technique
 Examples of usage
Pattern-Based Similarity Search
Searching for similar patterns in a temporal or spatial-temporal database
 Two types of queries encountered in data mining operations:
– object-relative similarity query
– all-pair similarity query
 Various approaches:
– similarity measures chosen
– type of comparison chosen
– subsequence parameters chosen
Similarity Measures (1)
1st Measure - the Euclidean distance between two sequences:
{x_i} - the target sequence of length n
{y_i} - a sequence of length N in the database
{z_i^J} - the J-th subsequence of length n of {y_i}

$$\min_J \sum_{i=1}^{n} \left( x_i - K_J\, z_i^J \right)^2$$
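A small NumPy sketch of this measure, with the scaling constant K_J taken as 1 for simplicity: slide the target over every length-n subsequence of the stored sequence and keep the smallest sum of squared differences.

import numpy as np

def subsequence_distance(x, y):
    """Minimum squared Euclidean distance between the target sequence x (length n)
    and any length-n subsequence z^J of the stored sequence y (length N)."""
    n, N = len(x), len(y)
    best = np.inf
    for J in range(N - n + 1):           # offset of the subsequence z^J
        z = y[J:J + n]
        best = min(best, float(np.sum((x - z) ** 2)))
    return best

# The target matches y best at offset J = 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 0.5, 1.1, 2.1, 2.9, 4.0])
print(subsequence_distance(x, y))        # approximately 0.03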
Similarity Measures (2)
2nd Measure - the linear correlation between two sequences:

$$c_i = \frac{\sum_{j=1}^{n} x_{i+j}\, y_j}{\sqrt{\sum_{j=1}^{n} x_j^2 \;\sum_{j=1}^{N} y_j^2}}$$

3rd Measure - the correlation between two sequences computed with Discrete Fourier Transforms:

$$c_i = \frac{F^{-1}\{ X_j^{*}\, Y_j \}}{\sqrt{\sum_{j=1}^{n} |X_j|^2 \;\sum_{j=1}^{N} |Y_j|^2}}$$

where X and Y denote the DFTs of the two sequences and F^{-1} the inverse transform.
Alternative approaches
 Matching all of the data points of a sequence simultaneously
 Mapping each sequence into a small set of multidimensional rectangles in the feature space:
– Fourier transformation
– SVD - Singular Value Decomposition
– the Karhunen-Loeve transformation
 HierarchyScan - a new approach
Mining Path Traversal Patterns
 Solution of the problem of mining traversal patterns:
– first step: devise an algorithm to convert the original sequence of log data into a set of traversal subsequences (maximal forward references)
– second step: determine the frequent traversal patterns, termed large reference sequences
 Problems with finding large reference sequences
Mining Path Traversal Patterns - Example
[Figure: a tree of web pages A, B, C, D, E, G, H, O, U, V, W with the user's traversal marked]
Traversal path for a user:
{A,B,C,D,C,B,E,G,H,G,W,A,O,U,O,V}
The set of maximal forward references for this user:
{ABCD, ABEGH, ABEGW, AOU, AOV}
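A short Python sketch of the first step on exactly this example; it assumes the log has already been split per user into a sequence of page identifiers, and closes a maximal forward reference whenever a backward move (a revisit of a page on the current forward path) occurs.

def maximal_forward_references(path):
    refs, current, extending = [], [], True
    for node in path:
        if node in current:
            if extending:                      # a backward move ends a forward path
                refs.append("".join(current))
            current = current[:current.index(node) + 1]
            extending = False
        else:
            current.append(node)               # still moving forward
            extending = True
    if extending and current:                  # the path may end on a forward move
        refs.append("".join(current))
    return refs

print(maximal_forward_references(list("ABCDCBEGHGWAOUOV")))
# ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']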
Clustering Features and CF-trees
 Clustering Feature - a triplet summarizing the information about a subcluster of points:

$$CF = (N, LS, SS), \qquad LS = \sum_{i=1}^{N} X_i, \qquad SS = \sum_{i=1}^{N} X_i^2$$

 CF-tree - a balanced tree with 2 parameters:
– branching factor B - max number of children
– threshold T - max diameter of subclusters stored at the leaf nodes
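A minimal Python sketch of a CF triplet (the class and method names are illustrative): it shows the additivity used when CFs are summed up the tree, and how the centroid and the diameter of a subcluster are derived from (N, LS, SS) alone.

import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) of a subcluster of d-dimensional points."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N = 1
        self.LS = p.copy()                  # linear sum of the points
        self.SS = float(p @ p)              # sum of squared norms of the points

    def merge(self, other):
        # CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS = self.LS + other.LS
        self.SS += other.SS
        return self

    def centroid(self):
        return self.LS / self.N

    def diameter(self):
        # average pairwise distance in the subcluster, computable from N, LS, SS only
        if self.N < 2:
            return 0.0
        sq = (2 * self.N * self.SS - 2 * float(self.LS @ self.LS)) / (self.N * (self.N - 1))
        return float(np.sqrt(max(sq, 0.0)))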
Construct of CF-tree
 The non-leaf nodes store the sums of their children's CFs
 The CF-tree is built dynamically as data points are inserted
 A point is inserted into the closest leaf entry (subcluster)
 If the diameter of the subcluster after insertion is larger than the threshold T, the leaf node(s) are split
 This structure is used in the BIRCH algorithm
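A simplified sketch of the insertion rule at a single leaf, reusing the CF class above and deliberately ignoring node splits and the branching factor B: the point joins the closest entry unless that would push the entry's diameter past T, in which case it opens a new entry.

import numpy as np

def insert_into_leaf(leaf_entries, point, T):
    """leaf_entries: list of CF objects, point: array-like, T: diameter threshold."""
    p = np.asarray(point, dtype=float)
    if leaf_entries:
        # entry whose centroid is closest to the new point
        closest = min(leaf_entries, key=lambda cf: np.linalg.norm(cf.centroid() - p))
        trial = CF(p).merge(closest)         # CF of that entry with the point absorbed
        if trial.diameter() <= T:
            closest.merge(CF(p))             # absorb the point into the closest subcluster
            return leaf_entries
    leaf_entries.append(CF(p))               # otherwise start a new subcluster
    return leaf_entries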
BIRCH Algorithm - Balanced Iterative Reducing and Clustering using Hierarchies
BIRCH Algorithm (1)
Data
Phase 1: Load into memory
Phase 2: Condense
Phase 3: Global Clustering
Phase 4: Cluster Refining
PHASE 1
Scan all data and build an initial in-memory CF tree using the given amount of memory and recycling space on disk
BIRCH Algorithm (2)
PHASE 2 (optional)
Scan the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones
BIRCH Algorithm (3)
PHASE 3
Adapt an existing clustering algorithm for a set of data points to work with a set of subclusters, each described by its CF vector.
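One way Phase 3 can be realized, sketched here as a weighted k-means over the centroids of the leaf subclusters (each weighted by the number of points N its CF summarizes); the choice of k-means is only an illustration of "adapting an existing clustering algorithm" and is not prescribed by the slides.

import numpy as np

def global_clustering(cf_leaves, k, iters=20, seed=0):
    """Cluster the leaf subclusters instead of the raw points."""
    rng = np.random.default_rng(seed)
    centroids = np.array([cf.centroid() for cf in cf_leaves])
    weights = np.array([cf.N for cf in cf_leaves], dtype=float)
    centers = centroids[rng.choice(len(centroids), size=k, replace=False)]
    for _ in range(iters):
        # assign each subcluster to its nearest center
        d = np.linalg.norm(centroids[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the weighted mean of its subcluster centroids
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(centroids[mask], axis=0, weights=weights[mask])
    return labels, centers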
BIRCH Algorithm (4)
PHASE 4 (optional)
Pass over the data to correct inaccuracies and refine the clusters further. Phase 4 entails the cost of an additional pass.
CURE Algorithm - Clustering Using Representatives
CURE Algorithm (1)
Data
Draw random sample
Partition sample
Partially cluster partitions
Eliminate outliers
Cluster partial clusters
Label data in disk
CURE begins by drawing a random sample from the database.
CURE Algorithm (2)
In order to further speed up clustering, CURE first partitions the random sample into p partitions, each of size n/p.
CURE Algorithm (3)
Partially cluster each partition until the final number of clusters in each partition reduces to n/(pq) for some constant q > 1.
CURE Algorithm (4)
Outliers do not belong to any of the clusters. In CURE, outliers are eliminated at multiple steps.
CURE Algorithm (5)
Cluster the partial clusters in a final pass to generate the final k clusters.
CURE Algorithm (6)
Each data point is assigned to the cluster containing the representative point closest to it.
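A small sketch of this final labeling pass, assuming each cluster's representative points are available as NumPy arrays (variable names are illustrative):

import numpy as np

def label_points(points, cluster_reps):
    """points: (m, d) array; cluster_reps: list of (r_i, d) arrays of representatives.
    Each point receives the index of the cluster owning its closest representative."""
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(points):
        # distance from p to the nearest representative of each cluster
        dists = [np.linalg.norm(reps - p, axis=1).min() for reps in cluster_reps]
        labels[i] = int(np.argmin(dists))
    return labels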
CURE - cluster procedure
procedure cluster(S, k)
begin
  T := build_kd_tree(S)
  Q := build_heap(S)
  while size(Q) > k do {
    u := extract_min(Q)
    v := u.closest
    delete(Q, v)
    w := merge(u, v)
    delete_rep(T, u)
    delete_rep(T, v)
    insert_rep(T, w)
    w.closest := x   /* x is an arbitrary cluster in Q */
    for each x ∈ Q do {
      if dist(w, x) < dist(w, w.closest)
        w.closest := x
      if x.closest is either u or v {
        if dist(x, x.closest) < dist(x, w)
          x.closest := closest_cluster(T, x, dist(x, w))
        else
          x.closest := w
        relocate(Q, x)
      }
      else if dist(x, x.closest) > dist(x, w) {
        x.closest := w
        relocate(Q, x)
      }
    }
    insert(Q, w)
  }
end
CURE - merge procedure
procedure merge(u, v)
begin
  w := u ∪ v
  w.mean := (|u| u.mean + |v| v.mean) / (|u| + |v|)
  tmpSet := ∅
  for i := 1 to c do {
    maxDist := 0
    for each point p in cluster w do {
      if i = 1
        minDist := dist(p, w.mean)
      else
        minDist := min{ dist(p, q) : q ∈ tmpSet }
      if (minDist ≥ maxDist) {
        maxDist := minDist
        maxPoint := p
      }
    }
    tmpSet := tmpSet ∪ {maxPoint}
  }
  for each point p in tmpSet do
    w.rep := w.rep ∪ { p + α(w.mean − p) }   /* α - shrinking factor toward the mean */
  return w
end
The Intelligent Miner of IBM
Demographic Clustering provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated. Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet's criterion is maximised:
(sum of all record similarities of pairs in the same cluster) - (sum of all record similarities of pairs in different clusters)
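A toy sketch of the criterion exactly as stated above, assuming a user-supplied pairwise similarity function over records (the field-by-field comparison itself is not detailed on the slide):

from itertools import combinations

def condorcet_criterion(records, labels, similarity):
    """(sum of similarities of pairs in the same cluster)
       - (sum of similarities of pairs in different clusters)."""
    same, different = 0.0, 0.0
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        s = similarity(r1, r2)
        if labels[i] == labels[j]:
            same += s
        else:
            different += s
    return same - different

# Crude similarity: fraction of matching fields
records = [("Fri", "card"), ("Fri", "card"), ("Mon", "cash")]
labels = [0, 0, 1]
sim = lambda a, b: sum(u == v for u, v in zip(a, b)) / len(a)
print(condorcet_criterion(records, labels, sim))   # 1.0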
The Intelligent Miner - example
Suppose that you have a database of a
supermarket that includes customer
identification and information about the date
and time of the purchases. The clustering
mining function clusters this data to enable
the identification of different types of
shoppers. For example, this might reveal
that customers buy many articles on
Friday and usually pay by credit card.