Data Mining - Clustering
Clustering
- unsupervised classification
- Clustering - the process of grouping physical or abstract objects into classes of similar objects
- Clustering helps to construct meaningful partitionings of a large set of objects
- Data clustering is studied in statistics, machine learning, spatial databases and data mining
CLARANS Algorithm
CLARANS - Clustering Large Applications Based on Randomized Search - presented by Ng and Han
CLARANS - based on randomized search and two statistical algorithms: PAM and CLARA
Method of the algorithm - search for a local optimum
Example of algorithm usage (a minimal sketch of the randomized search follows)
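The randomized search can be sketched in Python as below. This is an illustrative sketch only, not the authors' implementation; the parameter defaults (numlocal, maxneighbor) and helper names are assumptions.

import random

def clustering_cost(medoids, points, dist):
    # total distance from every point to its nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=100, dist=None):
    # points: list of tuples; returns (best medoid set, its cost)
    if dist is None:
        dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    best, best_cost = None, float("inf")
    for _ in range(numlocal):                       # several random restarts
        current = random.sample(points, k)
        current_cost = clustering_cost(current, points, dist)
        tries = 0
        while tries < maxneighbor:
            # a neighbour differs from the current node in exactly one medoid
            i = random.randrange(k)
            candidate = random.choice([p for p in points if p not in current])
            neighbour = current[:i] + [candidate] + current[i + 1:]
            neighbour_cost = clustering_cost(neighbour, points, dist)
            if neighbour_cost < current_cost:       # move to the better neighbour
                current, current_cost = neighbour, neighbour_cost
                tries = 0
            else:
                tries += 1
        if current_cost < best_cost:                # keep the best local optimum
            best, best_cost = current, current_cost
    return best, best_cost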
Focusing Methods
FM - based on the CLARANS algorithm and an efficient spatial access method, like the R*-tree
The focusing on representative objects technique
The focus on relevant clusters technique
The focus on a cluster technique
Examples of usage
Pattern-Based Similarity Search
Searching for similar patterns in a temporal or spatial-temporal database
Two types of queries encountered in data mining operations:
– object-relative similarity query
– all-pair similarity query
Various approaches:
– similarity measures chosen
– type of comparison chosen
– subsequence parameters chosen
Similarity Measures (1)
1st Measure - the Euclidean distance between two sequences:
{x_i} - the target sequence of length n
{y_i} - a sequence of length N in the database
{z_i^J} - the J-th subsequence of length n of {y_i}
$$ \min_J \sqrt{\sum_{i=1}^{n} \left( x_i - z_i^J \right)^2 } $$
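A brute-force evaluation of this measure is straightforward; the sketch below assumes NumPy, and the function name is illustrative.

import numpy as np

def min_subsequence_distance(x, y):
    # smallest Euclidean distance between the target sequence x (length n)
    # and any length-n subsequence z^J of y (length N)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    best = np.inf
    for j in range(len(y) - n + 1):        # slide over every subsequence z^J
        d = np.sqrt(np.sum((x - y[j:j + n]) ** 2))
        best = min(best, d)
    return best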
Similarity Measures (2)
2nd Measure - the linear correlation between two sequences:
$$ c_i = \frac{\sum_{j=1}^{n} x_j\, y_{i+j}}{\sqrt{\sum_{j=1}^{n} x_j^2 \cdot \sum_{j=1}^{N} y_j^2}} $$
3rd Measure - the correlation (using Discrete Fourier Transforms) between two sequences:
$$ c_i = \frac{F^{-1}\{\, X_j^{*}\, Y_j \,\}}{\sqrt{\sum_{j=1}^{n} X_j^2 \cdot \sum_{j=1}^{N} Y_j^2}} $$
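The DFT-based measure can be computed with a fast Fourier transform; a minimal sketch assuming NumPy, where the normalisation by the two sequence energies follows the formula above.

import numpy as np

def dft_correlation(x, y):
    # circular cross-correlation c = F^{-1}{ X* · Y }, normalised by the
    # energies of the two sequences
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N = len(y)
    X = np.fft.fft(x, N)                   # zero-pad x to length N
    Y = np.fft.fft(y, N)
    c = np.fft.ifft(np.conj(X) * Y).real
    return c / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))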
Alternative approaches
Matching all of the data points of a sequence simultaneously
Mapping each sequence into a small set of multidimensional rectangles in the feature space, using:
– the Fourier transformation
– SVD - Singular Value Decomposition
– the Karhunen-Loeve transformation
HierarchyScan - a new approach
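As an illustration of the feature-space idea, a sequence can be reduced to its first few Fourier coefficients and indexed by that short feature vector; a minimal sketch assuming NumPy, with the number of coefficients k chosen arbitrarily.

import numpy as np

def fourier_features(seq, k=4):
    # map a sequence to the real and imaginary parts of its first k DFT
    # coefficients, giving a 2k-dimensional feature vector
    coeffs = np.fft.fft(np.asarray(seq, dtype=float)) / np.sqrt(len(seq))
    return np.concatenate([coeffs[:k].real, coeffs[:k].imag])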
Mining Path Traversal Patterns
Solution of the problem of mining traversal patterns:
– first step: devise an algorithm to convert the original sequence of log data into a set of traversal subsequences (maximal forward references)
– second step: determine the frequent traversal patterns, termed large reference sequences
Problems with finding large reference sequences
Mining Path Traversal Patterns - Example
Traversal path for a user:
{A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V}
The set of maximal forward references for this user:
{ABCD, ABEGH, ABEGW, AOU, AOV}
[Figure: traversal tree over the nodes A, B, C, D, E, G, H, O, U, V, W]
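The first step (extracting maximal forward references) can be sketched as follows, assuming each node is written as a single letter as in the example; the function name and representation are illustrative.

def maximal_forward_references(path):
    # split a raw traversal path into maximal forward references:
    # a backward move (revisiting a node on the current forward path)
    # closes the current reference and truncates the path to that node
    refs = []
    forward = []                 # current forward path
    last_was_forward = True
    for node in path:
        if node in forward:      # backward move
            if last_was_forward and forward:
                refs.append("".join(forward))
            forward = forward[:forward.index(node) + 1]
            last_was_forward = False
        else:                    # forward move
            forward.append(node)
            last_was_forward = True
    if last_was_forward and forward:
        refs.append("".join(forward))
    return refs

print(maximal_forward_references(list("ABCDCBEGHGWAOUOV")))
# -> ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']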
Clustering Features and CF-trees
Clustering Feature - a triplet summarizing information about a subcluster of points:
$$ CF = (N, \vec{LS}, SS), \qquad \vec{LS} = \sum_{i=1}^{N} \vec{X}_i, \qquad SS = \sum_{i=1}^{N} \vec{X}_i^{\,2} $$
CF-tree - a balanced tree with 2 parameters:
– branching factor B - max number of children
– threshold T - max diameter of subclusters stored at the leaf nodes
Construction of a CF-tree
– The non-leaf nodes store the sums of their children's CFs
– The CF-tree is built dynamically as data points are inserted
– A point is inserted into the closest leaf entry (subcluster)
– If the diameter of the subcluster after insertion is larger than the threshold T, the leaf node(s) are split
– This structure is used in the BIRCH algorithm
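A CF triple is cheap to maintain incrementally, and the subcluster diameter used for the split test can be derived from (N, LS, SS) alone; a minimal Python sketch, where the class and method names are illustrative.

import math

class CF:
    # Clustering Feature (N, LS, SS) of a subcluster of d-dimensional points
    def __init__(self, d):
        self.n = 0
        self.ls = [0.0] * d      # linear sum, per dimension
        self.ss = 0.0            # sum of squared norms

    def add(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def diameter(self):
        # average pairwise distance, computable from (N, LS, SS):
        # D^2 = (2*N*SS - 2*|LS|^2) / (N*(N-1))
        if self.n < 2:
            return 0.0
        ls2 = sum(v * v for v in self.ls)
        return math.sqrt(max(0.0, (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))))

Inserting a point then amounts to adding it to the closest leaf entry and splitting when diameter() exceeds the threshold T.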
BIRCH Algorithm
- Balanced Iterative Reducing and Clustering using Hierarchies
BIRCH Algorithm (1)
Data
Phase 1: Load into memory
Phase 2: Condense
Phase 3: Global Clustering
Phase 4: Cluster Refining
PHASE 1
Scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk.
BIRCH Algorithm (2)
PHASE 2 (optional)
Scan the leaf entries of the initial CF-tree to rebuild a smaller CF-tree, while removing more outliers and grouping crowded subclusters into larger ones.
BIRCH Algorithm (3)
PHASE 3
Adapt an existing
clustering algorithm
for a set of data
points to work with a
set of subclusters,
each described by
its CF vector.
BIRCH Algorithm (4)
PHASE 4 (optional)
Pass over the data to correct inaccuracies and refine the clusters further. Phase 4 entails the cost of an additional pass over the data.
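For experimentation, scikit-learn ships a BIRCH implementation; a minimal usage sketch, where the parameter values and the synthetic data are purely illustrative.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)      # set n_clusters=None to skip the global clustering phase
print(len(model.subcluster_centers_), "subclusters,", labels.max() + 1, "clusters")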
CURE Algorithm
- Clustering Using REpresentatives
CURE Algorithm (1)
Data
Draw random sample
Partition sample
Partially cluster partitions
Eliminate outliers
Cluster partial clusters
Label data in disk
CURE begins by drawing a random sample from the database.
CURE Algorithm (2)
In order to further speed up clustering, CURE first partitions the random sample into p partitions, each of size n/p.
CURE Algorithm (3)
Partially cluster each partition until the final number of clusters in each partition reduces to n/(pq) for some constant q > 1.
CURE Algorithm (4)
Outliers do not belong to any of the clusters. In CURE, outliers are eliminated at multiple steps.
CURE Algorithm (5)
Cluster in a final pass to generate the final k
clusters.
CURE Algorithm (6)
Each data point is assigned to the cluster
containing the representative point closest
to it.
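This final labelling pass can be sketched as follows (NumPy assumed; names are illustrative).

import numpy as np

def label_points(points, reps_per_cluster):
    # assign each point to the cluster whose closest representative
    # point is nearest to it
    labels = []
    for p in np.asarray(points, dtype=float):
        dists = [np.min(np.linalg.norm(np.asarray(reps, dtype=float) - p, axis=1))
                 for reps in reps_per_cluster]
        labels.append(int(np.argmin(dists)))
    return labels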
CURE - cluster procedure
procedure cluster(S, k)
begin
  T := build_kd_tree(S)
  Q := build_heap(S)
  while size(Q) > k do {
    u := extract_min(Q)
    v := u.closest
    delete(Q, v)
    w := merge(u, v)
    delete_rep(T, u)
    delete_rep(T, v)
    insert_rep(T, w)
    w.closest := x            /* x is an arbitrary cluster in Q */
    for each x ∈ Q do {
      if dist(w, x) < dist(w, w.closest)
        w.closest := x
      if x.closest is either u or v {
        if dist(x, x.closest) < dist(x, w)
          x.closest := closest_cluster(T, x, dist(x, w))
        else
          x.closest := w
        relocate(Q, x)
      }
      else if dist(x, x.closest) > dist(x, w) {
        x.closest := w
        relocate(Q, x)
      }
    }
    insert(Q, w)
  }
end
CURE - merge procedure
procedure merge(u, v)
begin
  w := u ∪ v
  w.mean := (|u| u.mean + |v| v.mean) / (|u| + |v|)
  tmpSet := ∅
  for i := 1 to c do {
    maxDist := 0
    foreach point p in cluster w do {
      if i = 1
        minDist := dist(p, w.mean)
      else
        minDist := min{ dist(p, q) : q ∈ tmpSet }
      if (minDist ≥ maxDist) {
        maxDist := minDist
        maxPoint := p
      }
    }
    tmpSet := tmpSet ∪ {maxPoint}
  }
  foreach point p in tmpSet do
    w.rep := w.rep ∪ { p + α·(w.mean − p) }
  return w
end
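The core of merge - choosing c well-scattered points and shrinking them toward the mean by the factor α - can be sketched in Python as follows (NumPy assumed; the values of c and alpha are illustrative).

import numpy as np

def scattered_representatives(points, c=4, alpha=0.2):
    # pick c well-scattered points of a cluster (farthest-point heuristic),
    # then shrink each toward the cluster mean by the factor alpha
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    reps = []
    for i in range(min(c, len(pts))):
        if i == 0:
            d = np.linalg.norm(pts - mean, axis=1)         # farthest from the mean
        else:
            rep_arr = np.asarray(reps)
            d = np.min(np.linalg.norm(pts[:, None, :] - rep_arr[None, :, :], axis=2), axis=1)
        reps.append(pts[np.argmax(d)])
    return [p + alpha * (mean - p) for p in reps]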
The Intelligent Miner of IBM
Demographic Clustering provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated. Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet's criterion is maximised:
(sum of all record similarities of pairs in the same cluster) - (sum of all record similarities of pairs in different clusters)
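Condorcet's criterion for a given partition can be evaluated directly from pairwise record similarities; a minimal sketch, where the similarity function itself is left as an assumption.

def condorcet_score(records, labels, similarity):
    # (sum of similarities of pairs in the same cluster)
    # - (sum of similarities of pairs in different clusters)
    score = 0.0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            s = similarity(records[i], records[j])
            score += s if labels[i] == labels[j] else -s
    return score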
The Intelligent Miner - example
Suppose that you have a database of a
supermarket that includes customer
identification and information about the date
and time of the purchases. The clustering
mining function clusters this data to enable
the identification of different types of
shoppers. For example, this might reveal
that customers buy many articles on
Friday and usually pay by credit card.