Graph-PNN - MikroBitti


UNIVERSITY OF JOENSUU
DEPARTMENT OF COMPUTER SCIENCE
JOENSUU, FINLAND
Pairwise Nearest Neighbor
Method Revisited
Parittainen yhdistelymenetelmä uudistettuna
(Finnish: "the pairwise merge method revisited")
Olli Virmajoki
11.12.2004
Clustering

Important combinatorial optimization problem that must often be solved as a part of more complicated tasks in:
- data analysis
- pattern recognition
- data mining
- other fields of science and engineering

Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups.
Example of data sets

Employment statistics:

ID   Postal zone        Self-employed  Civil servants  Clerks   Manual workers
800  Munchen            56750          57218           300201   242375
801  Munchen-land ost    7684           5790            20279    23491
802  Munchen-land sued   3780           1977            11058     7398
803  Munchen-land west   7226           5623            25571    20380
804  Munchen-land nord   2226           1305             9347    12432
805  Freising            8187           5140            14632    24377
806  Dachau              8165           2763            11638    24489
807  Ingolstadt          5810           5212            15019    30532

RGB data:

R   G   B
26  20  45
28   5  46
28  12  44
23  13  46
31   4  51
Summary of data sets

Data set      Type of data set  Number of data   Number of     Dim of data
                                vectors (N)      clusters (M)  vector
Bridge        Grayscale         4096             256           16
House         RGB               34112            256           3
Miss America  Residual vectors  6480             256           16
S1-S4         Synthetic         5000             15            2
An example of clustering
Clustering

Given a set of N data vectors X = {x_1, x_2, ..., x_N} in K-dimensional space, clustering aims at solving the partition P = {p_1, p_2, ..., p_N}, which defines for each data vector the index of the cluster to which it is assigned.

Cluster: s_a = {x_i | p_i = a}
Clustering: S = {s_1, s_2, ..., s_M}
Codebook: C = {c_1, c_2, ..., c_M}

Cost function:

f(C, P) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2

Combinatorial optimization problem.
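The cost function above can be sketched directly in code. A minimal illustration, assuming vectors as NumPy arrays; the function name `clustering_cost` and the toy data are my own, not from the slides:

```python
import numpy as np

def clustering_cost(X, C, P):
    """f(C, P) = (1/N) * sum_i ||x_i - c_{p_i}||^2.

    X: (N, K) data vectors, C: (M, K) codebook, P: (N,) partition indices.
    """
    diffs = X - C[P]  # pair each data vector with its assigned centroid
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Three 2-D vectors, two clusters: vectors 0 and 1 in cluster 0, vector 2 in cluster 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
C = np.array([[1.0, 0.0], [10.0, 0.0]])
P = np.array([0, 0, 1])
print(clustering_cost(X, C, P))  # average squared distance to the assigned centroid
```

Minimizing this f(C, P) jointly over the codebook and the partition is the combinatorial optimization problem the rest of the talk addresses.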
Clustering algorithms

- Heuristic methods
- Optimization methods
  - K-means
  - Genetic algorithms
- Graph-theoretical methods
- Hierarchical methods
  - Divisive
  - Agglomerative
Agglomerative clustering

N = 22 (number of data points)
M = 3 (number of final clusters)
Ward’s method (PNN in VQ)

Merge cost:

d_{a,b} = \frac{n_a n_b}{n_a + n_b} \lVert c_a - c_b \rVert^2

Local optimization strategy:

(a, b) = \arg\min_{i, j \in [1, N],\, i \neq j} d_{i,j}

Nearest neighbor search is needed for:
(1) finding the cluster pair to be merged
(2) updating the NN pointers
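The merge cost and the local optimization step can be sketched as follows. This is an illustrative toy, assuming 1-D-or-higher centroids stored as arrays; the names `merge_cost` and `best_pair` are mine:

```python
import numpy as np

def merge_cost(ca, na, cb, nb):
    """Ward's merge cost d_{a,b} = (n_a * n_b / (n_a + n_b)) * ||c_a - c_b||^2."""
    ca, cb = np.asarray(ca, float), np.asarray(cb, float)
    return (na * nb) / (na + nb) * float(np.sum((ca - cb) ** 2))

def best_pair(centroids, sizes):
    """Local optimization: exhaustively scan all O(M^2) cluster pairs and
    return (cost, a, b) for the pair with the minimal merge cost."""
    M = len(centroids)
    pairs = ((merge_cost(centroids[i], sizes[i], centroids[j], sizes[j]), i, j)
             for i in range(M) for j in range(i + 1, M))
    return min(pairs)

# Three singleton clusters on a line: the two closest should be picked.
cost, a, b = best_pair([[0.0], [1.0], [10.0]], [1, 1, 1])
print(a, b, cost)
```

The exhaustive scan here is exactly the expensive part that the speed-up methods in the following publications attack.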
The PNN method

Each data vector starts as its own cluster (M = N = 5000); one cluster pair is merged at each step:

M = 5000 -> 4999 -> 4998 -> ... -> 50 -> ... -> 16 -> 15

[Figure: snapshots of the clustering at several values of M]
Nearest neighbor pointers

[Figure: clusters a-g, each storing a pointer to its nearest neighboring cluster]
Fast exact PNN method:
Reduces the number of nearest neighbor searches in each iteration: O(N^3) -> Ω(N^2)
Combining the PNN and k-means

[Figure: codebook size axis from N down to 1. The combined method first reduces the codebook from N to an intermediate size M0 by random selection, runs the standard PNN from M0 down to M, and fine-tunes the result with k-means; the standard PNN alone works all the way from N to M.]
PNN as a crossover method in the genetic algorithm

[Figure: two random codebooks (M = 15 each) are combined by taking their union (M = 30); the PNN then reduces the combined codebook to the final size M = 15.]
Publication 1: Speed-up methods

- Partial distortion search (PDS)
- Mean-distance-ordered search (MPS)
  - uses the component means of the vectors
  - derives a precondition for the distance calculations
- Reduction of the run time to 2-15%
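The idea of the partial distortion search can be sketched in a few lines: accumulate the squared distance one component at a time and abort as soon as the running sum exceeds the best distance found so far. A minimal sketch; the function name and toy data are illustrative, not from the publication:

```python
def pds_nearest(x, codebook):
    """Nearest-neighbor search with partial distortion search (PDS)."""
    best_j, best_d = 0, float("inf")
    for j, c in enumerate(codebook):
        d = 0.0
        for xi, ci in zip(x, c):
            d += (xi - ci) ** 2
            if d >= best_d:     # early termination: this candidate cannot win
                break
        else:                   # loop finished without break: new best candidate
            best_j, best_d = j, d
    return best_j, best_d

j, d = pds_nearest([1.0, 2.0], [[5.0, 5.0], [1.0, 2.5], [0.0, 0.0]])
print(j, d)  # -> 1 0.25
```

The distance to the final answer is still computed in full, so the result is exact; only provably losing candidates are cut short.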

Example of the MPS method

[Figure: the input vector and candidate code vectors A, B, C with their projections A', B', C' onto the component-mean axis; C is the best candidate.]
Publication 2: Graph-based PNN

- Based on the exact PNN method
- NN search is limited to the k clusters that are connected by the graph structure
- Reduces the time complexity of every search from O(N) to O(k)
- Reduction in the run time to 1-4%
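The graph-limited search amounts to scanning only a cluster's k graph neighbors instead of all N clusters. A toy sketch, assuming an adjacency dictionary and a distance callback; the names and data are illustrative:

```python
def nn_in_graph(dist, neighbors, a):
    """Find cluster a's nearest neighbor among its k graph neighbors only,
    instead of scanning all N clusters."""
    return min(neighbors[a], key=lambda b: dist(a, b))

# Toy example: 1-D centroids; the graph keeps k = 2 neighbors per cluster.
centroids = [0.0, 1.0, 5.0, 6.0]
dist = lambda a, b: (centroids[a] - centroids[b]) ** 2
neighbors = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 1]}
print(nn_in_graph(dist, neighbors, 0))  # -> 1
```

The trade-off is that the true nearest neighbor is only found if it is among the k stored graph neighbors, which is what makes maintaining a good neighborhood graph the core of the method.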
Why graph structure?

With the full search, each search needs O(N) distance calculations (N = 4096); with the graph structure, only O(k) (here k = 3).

[Figure: sample graph]
Publication 3: Multilevel thresholding

- Can be considered a special case of vector quantization (VQ), where the vectors are 1-dimensional
- Existing method: Θ(N^2)
- PNN thresholding can be implemented in O(N log N)
- The proposed method works in real time for any number of thresholds
Distances in heap structure

[Figure: the pairwise merge distances are maintained in a binary min-heap; the minimum distance is read from the root in O(1), and each removal or update costs O(log N).]
Publication 4: Iterative shrinking (IS)

- Generates the clustering by a sequence of cluster removal operations
- In the IS method the vectors can be reassigned more freely than in the PNN method
- Can be applied as a crossover method in the genetic algorithm (GAIS)
- GAIS outperforms all the other clustering algorithms compared
Example of the PNN method

[Figure: clusters S1-S5 before and after a cluster merge. Code vectors: vectors to be merged (x), remaining vectors (+). Data vectors: data vectors of the clusters to be merged (x), other data vectors (+).]
Example of the iterative shrinking method

[Figure: clusters S1-S5 before and after a cluster removal. Code vectors: vector to be removed (x), remaining vectors (+). Data vectors: data vectors of the cluster to be removed (x), other data vectors (+).]
The PNN and IS in the search of the number of clusters

[Figure: F-ratio as a function of the number of clusters (from 25 down to 5) for data set S4, with curves for the PNN and IS; the marked minimum of the curve indicates the number of clusters.]
Time-distortion performance

[Figure: MSE as a function of running time (log scale, up to 100000 s) for SAGA, repeated K-means, RLS, GAIS, PNN and IS.]
Publication 5: Optimal clustering

- Can be found by considering all possible merge sequences and selecting the one that minimizes the optimization function
- Can be implemented as a branch-and-bound (BB) technique
- Two suboptimal but polynomial-time variants:
  - piecewise optimization
  - look-ahead optimization
Example of non-redundant search tree

[Figure: non-redundant search tree over the merge sequences of five clusters A-E, with nodes such as AB, ABC, ABCD and BCDE. Branches that do not have any valid clustering have been cut out.]
Illustration of the Piecewise optimization

N clusters --(Z merge steps)--> N - Z clusters --> N - 2Z clusters --> N - 3Z clusters --> ... --> M clusters (final result)
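The piecewise idea can be sketched for tiny inputs: instead of optimizing the whole merge sequence, solve the next Z merges optimally, apply them, and repeat until M clusters remain. The sketch below uses brute-force enumeration in place of the branch-and-bound of the thesis, and 1-D points only; all names and the toy data are mine:

```python
from itertools import combinations

def ward_cost(c1, n1, c2, n2):
    """Ward merge cost for 1-D centroids."""
    return n1 * n2 / (n1 + n2) * (c1 - c2) ** 2

def best_sequence(clusters, steps):
    """Exhaustively find the cheapest sequence of `steps` merges.
    clusters: list of (centroid, size).  Returns (total_cost, result)."""
    if steps == 0:
        return 0.0, clusters
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        (c1, n1), (c2, n2) = clusters[i], clusters[j]
        merged = ((n1 * c1 + n2 * c2) / (n1 + n2), n1 + n2)
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        cost, result = best_sequence(rest, steps - 1)
        cost += ward_cost(c1, n1, c2, n2)
        if best is None or cost < best[0]:
            best = (cost, result)
    return best

def piecewise_pnn(clusters, M, Z):
    """Piecewise optimization: repeatedly solve the next Z merges optimally
    until M clusters remain."""
    while len(clusters) > M:
        steps = min(Z, len(clusters) - M)
        _, clusters = best_sequence(clusters, steps)
    return clusters

points = [(x, 1) for x in [0.0, 0.2, 0.4, 5.0, 5.2, 9.0]]
print(sorted(piecewise_pnn(points, 3, 2)))  # three clusters of sizes 3, 2, 1
```

Each Z-step window is solved optimally, but the concatenation of locally optimal windows is not guaranteed to be the globally optimal merge sequence, which is why the variant is suboptimal yet polynomial.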
Comparative results

[Figure: MSE versus running time (log scale, 1-100000 s) on the Bridge data set for standard k-means, K-means+PDS+MPS+Activity, Graph-PNN, PNN+PDS+MPS+Lazy, PNN, Graph-PNN+K-means, IS, GAIS(short) and GAIS(long).]
Comparative results

[Figure: MSE versus running time (log scale, 1-1000000 s) on the House data set for the same methods.]
Comparative results

[Figure: MSE versus running time (log scale, 1-1000000 s) on the Miss America data set for the same methods.]
Example of clustering

[Figure: clustering results by k-means and by agglomerative clustering]
Conclusions

- Several speed-up methods:
  - projection-based search
  - partial distortion search
  - k nearest neighbor graph
- Efficient O(N log N) time implementation for the 1-dimensional case
- Generalization of the merge phase by a cluster removal philosophy (IS) for better quality
- Optimal clustering based on the PNN method