Chapter 9
UNSUPERVISED LEARNING:
Clustering
Part 2
Cios / Pedrycz / Swiniarski / Kurgan
SOM Clustering
Characteristics of a human associative memory:
• information is retrieved/recalled on the basis of some measure of similarity relating to a key pattern
• memory stores and recalls representations of “what is out there” as structured sequences
• the recall of information from memory is dynamic
Self-Organizing Feature Maps (SOM)
In data analysis it is fundamental to:
• capture the topology and probability distribution of pattern vectors
• map pattern vectors from the original high-D space onto the lower-D new feature space (compressed space)
SOM
Data compression requires selection of features that best represent data for a specific purpose, e.g., for better visual inspection of the data's structure.
Most attractive from the human point of view are visualizations in 2D or 3D space.
The major difficulty is how to faithfully project/map the data to ensure preservation of the topology present in the original data feature space.
SOM
Topology-preserving mapping should have these properties:
• similar patterns in the original feature space must also be similar in the reduced feature space, according to some similarity criteria
• the original and the reduced space should be of a "continuous nature", i.e., the density of patterns in the reduced feature space should correspond to that in the original space.
Topology-preserving Mappings
Several methods were developed for topology-preserving mapping:
• linear projections, such as eigenvector-based projections
• nonlinear projections, such as Sammon's projection
• nonlinear projections, such as SOM neural networks
Sammon’s Projection
A nonlinear projection used to preserve topological relations between patterns in the original and the reduced spaces, by preserving the inter-pattern distances.

Sammon's projection minimizes an error defined as the difference between inter-pattern distances in the original and reduced feature spaces.

{x_k} -- set of L n-dimensional vectors x_k in the original feature space R^n
{y_k} -- set of L corresponding m-dimensional vectors y_k in the reduced low-dimensional space R^m, with m << n

Distortion measure:

$$J_d = \frac{1}{\sum_{i=1, i \neq j}^{L} \sum_{j=1, j \neq i}^{L} d(x_i, x_j)} \; \sum_{i=1, i \neq j}^{L} \sum_{j=1, j \neq i}^{L} \frac{\left( d(x_i, x_j) - d(y_i, y_j) \right)^2}{d(x_i, x_j)}$$
Sammon’s Projection
Performs a non-linear projection, typically onto a 2D plane.
Disadvantages:
- it is computationally heavy
- it cannot be used to project new points (points that were not used during training) onto the output plane
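To make the distortion measure concrete, here is a minimal NumPy sketch that evaluates J_d for a candidate low-dimensional configuration Y; in practice Y is then adjusted, e.g., by gradient descent, to reduce this value. The function name and the `eps` guard against zero distances are my own choices, not part of the slides.

```python
import numpy as np

def sammon_stress(X, Y, eps=1e-12):
    """Sammon's distortion J_d between the original data X (L x n) and a
    candidate low-dimensional configuration Y (L x m)."""
    dX = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # original distances
    dY = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))  # reduced-space distances
    iu = np.triu_indices(len(X), k=1)   # each pair (i, j) with i < j, counted once;
    dx, dy = dX[iu], dY[iu]             # the double counting in the slide's sums cancels out
    return ((dx - dy) ** 2 / (dx + eps)).sum() / (dx.sum() + eps)
```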
SOM: Principle
(Diagram: the SOM maps patterns from a high-dimensional space onto a low-dimensional (2-dim or 3-dim) space.)
Self-Organizing Feature Maps (SOM)
• Developed by Kohonen in 1982
• SOM is an unsupervised learning, topology-preserving, projection algorithm
• It uses a feedforward NN topology
• It is a scaling method that projects data from a high-D input space into a lower-D output space
• Similar vectors in the input space are projected onto nearby neurons on the 2D map
SOM
The feature map is a layer in which the neurons self-organize according to the input patterns.
Each “neuron” of the input layer is connected to each neuron of the 2D map.
The weights associated with the inputs are propagated to the 2D map neurons.
• Often not only the winning neuron updates its weights but also the neurons in a certain area around it
• SOM reflects the ability of biological neurons to perform global ordering based on local interactions
SOM: Topology and Learning Rule
One iteration of the SOM learning:
1. Present a randomly selected input vector x to all neurons
2. Select the winning neuron, i.e., the one whose weight vector is the closest to the input vector, according to the chosen similarity measure
3. Adjust the weights of the jth winning neuron and of the neighboring neurons (defined by some neighborhood function)
SOM: Topology and Learning Rule
The jth winning neuron is selected as the one having the minimal distance value:

$$\text{jth winning neuron:} \quad \| x - w_j \| = \min_k \| x - w_k \|, \quad k = 1, \dots, M$$

The winner-takes-all learning rule is used for adjusting the weights:

$$w_k(t+1) = w_k(t) + h_j(N_j(t), t)\,(x - w_k(t)), \quad k = 1, 2, \dots, M$$

where (x - w_k(t)) expresses the degree of similarity between a neuron and its input.
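A minimal sketch of these two operations, assuming a weight matrix `W` of shape (M, n), an input vector `x`, and a vector `h` of neighborhood kernel values (all names are mine, not from the slides):

```python
import numpy as np

def winner(W, x):
    """Index j of the neuron whose weight vector is closest to the input x (W is M x n)."""
    return int(np.argmin(np.linalg.norm(x - W, axis=1)))

def winner_takes_all_update(W, x, h):
    """w_k <- w_k + h_k * (x - w_k); h is a length-M vector of neighborhood
    kernel values (zero for neurons outside the winner's neighborhood)."""
    W += h[:, None] * (x - W)
```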
SOM: Topology and Learning Rule
Kohonen also proposed a dot-product similarity for selecting the winning neuron:

$$\text{jth winning neuron:} \quad w_j^{T} x = \max_k \left\{ w_k^{T} x \right\}, \quad k = 1, \dots, M$$

and the learning rule:

$$w_i(t+1) = \frac{w_i(t) + \eta(t)\, x}{\| w_i(t) + \eta(t)\, x \|}, \quad i \in N_j(t)$$

where N_j(t) is the winning neuron's neighborhood and η(t) > 0 is the decreasing learning function.
This formula assures automatic weight normalization to the length of one.
Neighborhood Kernel
The neighborhood kernel (of the winning neuron) is a non-increasing function of time and distance.
It defines the region of influence that the input has on the SOM:

$$h_j(N_j(t), t) = h_j(\| r_j - r_i \|, t), \quad \text{where } 0 \le h_j(N_j(t), t) \le 1$$

and r_j, r_i are the grid positions (radii) of the winning neuron j and of neuron i, respectively.
Neighborhood Kernel
The geometric extent of the neighborhood must decrease as the iteration steps (time) increase.
Convergence of the learning process requires that the radius decrease with the learning time/iteration:

$$r(t_1) \ge r(t_2) \ge r(t_3) \ge \dots$$
$$N_j(r(t_1), t_1) \supseteq N_j(r(t_2), t_2) \supseteq N_j(r(t_3), t_3) \supseteq \dots$$

This causes global ordering by local interactions and local weight adjustments.
Neighborhood Kernel
With the neighborhood function fixed, the neighborhood kernel learning function can be defined as:

$$h_j(N_j(t), t) = \begin{cases} \eta(t) & \text{for } i \in N_j(t) \\ 0 & \text{otherwise} \end{cases}$$

and

$$\eta(t) = \eta_{\max}\,(1 - t/T), \quad \text{where } T = \text{number of iterations}$$
Neighborhood Kernel
Another frequently used neighborhood kernel is a Gaussian function with a radius decreasing with time:

$$h_j(N_j(t), t) = h_j(\| r_j - r_i \|, t) = \eta(t)\, \exp\!\left( - \frac{\| r_j - r_i \|^2}{2 \sigma^2(t)} \right)$$

where r_j = position (radius) of the winning neuron, r_i = position (radius) of the ith neuron, and σ(t) = width of the kernel.
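A sketch of this Gaussian kernel for neurons arranged on a rectangular grid; `positions` holds each neuron's (row, column) coordinate r_k, `j` is the winner's index, and `eta`, `sigma` are the current learning rate and kernel width (names are mine):

```python
import numpy as np

def gaussian_kernel(positions, j, eta, sigma):
    """h_k = eta * exp(-||r_j - r_k||^2 / (2 * sigma^2)) for every neuron k."""
    d2 = ((positions - positions[j]) ** 2).sum(axis=1)   # squared grid distances to the winner
    return eta * np.exp(-d2 / (2.0 * sigma ** 2))
```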
SOM: Algorithm
Conditions for successful learning of the SOM:
• The training data set must be large, since self-organization relies on the statistical properties of the data
• Proper selection of the neighborhood kernel function assures that only the weights of the winning neuron and its neighborhood neurons are locally adjusted
• The radius of the winning neighborhood, as well as the learning rate function, must monotonically decrease with time
• The amount of weight adjustment for neurons in a winning neighborhood depends on how close they are to the winner
SOM: Algorithm
Given: the 2D network topology consisting of M neurons; a training data set of L n-D input vectors; the number of iterations T; a neighborhood kernel function; a learning rate function
1. Set the learning rate to its maximum value
2. Set the iteration step t = 0
3. Randomly select initial values of the weights
4. Randomly select an input pattern and present it to the network
5. Compute the current learning rate for step t using the given learning rate function
SOM: Algorithm
6. Compute the Euclidean distances ||x_i - w_k(t)||, k = 1, …, M
7. Select the jth winning neuron: ||x_i - w_j(t)|| = min_k ||x_i - w_k(t)||, k = 1, …, M
8. Define the winning neighborhood around the winning neuron using the neighborhood kernel
9. Adjust the weights of the neurons: w_p(t+1) = w_p(t) + η(t) (x_i - w_p(t)), p ∈ N_j(t)
10. Increase t = t + 1; if t > T stop, otherwise go to step 4
Result: trained SOM network (clusters found)
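Putting steps 1-10 together, here is a compact, self-contained sketch of the whole training loop. The linear decay schedules for the learning rate and the Gaussian kernel width, the random weight initialization in [0, 1), and all names are illustrative assumptions rather than the slides' prescription.

```python
import numpy as np

def train_som(X, rows, cols, T, eta_max=0.5, sigma_max=None, seed=0):
    """Train a rows x cols SOM on data X (L x n). Returns the weight matrix
    W (M x n) and the neurons' grid positions (M x 2), M = rows * cols."""
    rng = np.random.default_rng(seed)
    L, n = X.shape
    M = rows * cols
    positions = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.random((M, n))                           # step 3: random initial weights
    if sigma_max is None:
        sigma_max = max(rows, cols) / 2.0
    for t in range(T):
        frac = t / T
        eta = eta_max * (1.0 - frac)                 # step 5: current learning rate
        sigma = max(sigma_max * (1.0 - frac), 0.5)   # shrinking neighborhood width
        x = X[rng.integers(L)]                       # step 4: random input pattern
        d = np.linalg.norm(x - W, axis=1)            # step 6: Euclidean distances
        j = int(np.argmin(d))                        # step 7: winning neuron
        g2 = ((positions - positions[j]) ** 2).sum(axis=1)
        h = eta * np.exp(-g2 / (2.0 * sigma ** 2))   # step 8: neighborhood kernel
        W += h[:, None] * (x - W)                    # step 9: weight adjustment
    return W, positions
```

For data scaled to [0, 1], a call such as `W, pos = train_som(X, rows=10, cols=10, T=5000)` is a reasonable starting point.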
(Figure from Kohonen.)
SOM: Interpretation
Given: a trained SOM network consisting of the 2D array of M neurons, each neuron receiving the input via its weight vector w; a small training data set consisting of pairs (x_i, c_i), i = 1, 2, …, L, where c is the class label
1. Set i = 1
2. Present input pattern x_i to the network
3. Calculate the distances or the dot products
4. Locate the spatial position of the winning neuron and assign label c_i to that neuron
5. Increase i = i + 1 and repeat from step 2 while i <= L
Result: calibrated SOM network
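A sketch of this calibration step, assuming the trained `W` from the training sketch above and a small labeled set (X_cal, y_cal). Taking the majority label when a neuron wins patterns from several classes is my own choice; the slide simply assigns the label of each presented pattern.

```python
import numpy as np
from collections import Counter, defaultdict

def calibrate_som(W, X_cal, y_cal):
    """Label each neuron with the class of the calibration patterns it wins
    (majority vote); neurons that never win stay unlabeled."""
    hits = defaultdict(list)
    for x, c in zip(X_cal, y_cal):
        j = int(np.argmin(np.linalg.norm(x - W, axis=1)))  # winning neuron for x
        hits[j].append(c)
    return {j: Counter(labels).most_common(1)[0][0] for j, labels in hits.items()}
```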
SOM: Issues
In practice, we do not optimize the SOM design; instead, we pay attention to:
• Initialization of the weights
Often they are normalized and set equal to the first several input vectors.
If not initialized to input vectors, they should be grouped in one quadrant of the unit circle so that they can unfold to map the distribution of the input data.
• The starting value of the learning rate and its decreasing schedule
• The structure of the neighborhood kernel and its decreasing schedule
SOM: Example
(Example figures.)
SOM
Visualizes the structure of high-dimensional data by mapping it onto a low-dimensional (typically two-dimensional) grid of linear neurons.
(Diagram: a p-row grid of neurons indexed (i, j), receiving input vectors x(1), x(2), … from X.)
Fig 1. Classes of mice.
Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0129126
Table 1. Group comparisons and biological relevance.
(Higuera et al., 2015)
SOM result
The number inside each node tells how many training data points are clustered within it.
The sum of all numbers is the total number of the training data points.
Fig 3. Venn diagrams of discriminating proteins.
(Higuera et al., 2015)
Fig 4. SOM clustering with subsets of protein expressions data from control mice.
(Panels: using only the 11 proteins/features vs. using the remaining 66 (77 - 11) proteins.)
(Higuera et al., 2015)
Table 3. Functional associations of discriminant proteins used to generate SOMs.
(Higuera et al., 2015)
Fig 5. SOM clustering with data from the 11 proteins that discriminate between context-shock and shock-context (c-CS and c-SC) plus.
(Higuera et al., 2015)
Fig 6. SOM clustering of trisomic mice data using 77 proteins.
(Higuera et al., 2015)
Clustering and Vector Quantization
(Diagram: input x → encoder → index i0 → decoder; encoder and decoder share a codebook of prototypes.)
Encoding: determine the best representative (prototype) of the codebook and store (transmit) its index i0:

$$i_0 = \arg\min_i \| x - v_i \|, \quad \text{where } v_i \text{ is the } i\text{-th prototype}$$

Decoding: recall the best prototype given the transmitted index i0.
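A minimal encoder/decoder pair over a codebook `V` of c prototypes stored as a (c x n) array (names are mine):

```python
import numpy as np

def vq_encode(x, V):
    """Index i0 of the closest prototype: i0 = argmin_i ||x - v_i||."""
    return int(np.argmin(np.linalg.norm(x - V, axis=1)))

def vq_decode(i0, V):
    """Recall the prototype stored (transmitted) under index i0."""
    return V[i0]
```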
Cluster validity
• Using different clustering methods might result in different partitions of X, at each value of c
• How many clusters exist in the data?
• Which clusterings are valid?
• It is plausible to expect “good” clusters at more than one value of c (2 ≤ c < n)
Aspects of Cluster Validation
1. Determine the clustering tendency for a data set, i.e., distinguish whether a non-random structure actually exists in the data.
2. Compare the results of a cluster analysis to the known class labels (if they exist).
3. Evaluate how well the results fit the data, using only the data.
4. Compare the results of different cluster analyses and determine “the best” one.
5. Determine the “correct” number of clusters.
From: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Cluster Validation Process
Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion. [Jain & Dubes, 1988]
– How to be “quantitative”: employ some measures.
– How to be “objective”: validate the measures.
(Diagram: INPUT DataSet(X) → Clustering Algorithm, run for different numbers of clusters m → Validity Index → m*.)
From: cs.uef.fi/pages/franti/cluster/Clustering-Part3.ppt
Cluster validity
Process X (generate clusters) at c = 2, 3, …, n - 1 and record the optimal values of some criterion as a function of c.
The most valid clustering is taken as an extremum of this function.
Problem:
Often criterion functions have multiple local stationary points at fixed c; also, global extrema do not necessarily correspond to the “best” c-partitions of the data.
Cluster validity
Cluster Error: associated with any U ∈ M_c, it is the number of vectors in X that are mislabeled by U:

$$E(U) = \frac{\sum_{k=1}^{n} \sum_{i=1}^{c} \left( \hat{u}_{ik} - u_{ik} \right)^2}{2n}$$

E(U) is an absolute measure of cluster validity when X is labeled, and is undefined when X is not labeled.
Cluster validity
A more formal approach is to pose the validity question in the framework of statistical hypothesis testing.
• However, the sampling distribution is not known
• Nonetheless, goodness-of-fit statistics such as the chi-square and Kolmogorov-Smirnov tests are being used
Clusters in Random Data
(Figure: four scatter plots of the same random 2D points: the raw “Random Points” and the clusters imposed on them by DBSCAN, K-means, and complete link.)
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity can be classified into:
– External Index: used to measure the extent to which cluster labels match externally supplied class labels
• Entropy
– Internal Index: used to measure the goodness of a clustering structure without using external information
• Sum of Squared Error (SSE)
– Relative Index: used to compare two different clusterings or clusters
• Often SSE or entropy
From: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf
Cluster validity
The global minimum of J_w may suggest the “wrong” 2 clusters.

Example from Bezdek: n = 29 data vectors {x_k} ⊂ R^2; the “correct” 2-partition of X is shown on the left of the original figure, and the geometry of X depends on a parameter s.

For one hard 2-partition $\hat{U}$:  $J_w(\hat{U}, \hat{v}) = 102$
For another hard 2-partition $\tilde{U}$:  $J_w(\tilde{U}, \tilde{v}) = 2.25 s^2 + 5 s + 6.25$

For $s \ge 5.5$, $J_w(\tilde{U}, \tilde{v}) \ge J_w(\hat{U}, \hat{v})$: the global minimum is hardly an attractive solution.
Cluster validity
Basic question: what constitutes a “good” cluster? What do we mean by a “cluster” anyhow?
The difficulty is that the data X, and every partition of X, are separated by the algorithm generating the partition matrix U (and defining “clusters” in the process).
• Many of the algorithms ignore this fundamental difficulty and are thus heuristic in nature
• Some heuristic methods measure the amount of fuzziness in U, and presume that the least fuzzy partitions are the most valid
Cluster Validity
How to assess the quality of clusters? How many clusters should be found in data?
Compactness: expresses how close the elements in a cluster are. Quantified in terms of intra-cluster distances.
Separability: expresses how distinct the clusters are. Quantified through inter-cluster distances.
Goal: high compactness and high separability.
Cluster Distances
From: bioinfo.cipf.es/courses/mda08/_media/clustering_validation.pdf
Davies-Bouldin index
Within-cluster scatter for the i-th cluster:

$$s_i = \frac{1}{\mathrm{card}(\Omega_i)} \sum_{x \in \Omega_i} \| x - v_i \|^2$$

Distance between the prototypes of the clusters:

$$d_{ij} = \| v_i - v_j \|^2$$

Define the ratio

$$r_i = \max_{j,\, j \neq i} \frac{s_i + s_j}{d_{ij}}$$

and then the sum

$$r = \frac{1}{c} \sum_{i=1}^{c} r_i$$

The “optimal”, “correct” number of clusters (c) is the one for which the value of r attains its minimum.
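A direct NumPy transcription of the index, following the slide's definitions (which use squared distances for both the scatter and the prototype separation); the function and argument names are mine. `labels` assigns each row of X to a cluster 0..c-1 and `V` holds the c prototypes.

```python
import numpy as np

def davies_bouldin(X, labels, V):
    """r = (1/c) * sum_i max_{j != i} (s_i + s_j) / d_ij, as defined above."""
    c = len(V)
    s = np.array([np.mean(np.sum((X[labels == i] - V[i]) ** 2, axis=1))
                  for i in range(c)])                       # within-cluster scatter s_i
    r = 0.0
    for i in range(c):
        ratios = [(s[i] + s[j]) / np.sum((V[i] - V[j]) ** 2)
                  for j in range(c) if j != i]              # (s_i + s_j) / d_ij
        r += max(ratios)
    return r / c
```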
Davies-Bouldin index
From: cs.uef.fi/pages/franti/cluster/Clustering-Part3.ppt
From: carme2011.agrocampus-ouest.fr/slides/Roux.pdf
Dunn separation index
Diameter of the cluster:

$$\Delta(\Omega_i) = \max_{x, y \in \Omega_i} \| x - y \|$$

Inter-cluster distance:

$$\delta(\Omega_i, \Omega_j) = \min_{x \in \Omega_i,\, y \in \Omega_j} \| x - y \|$$

$$r = \min_i \min_{j,\, j \neq i} \frac{\delta(\Omega_i, \Omega_j)}{\max_k \Delta(\Omega_k)}$$
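A small brute-force sketch of the Dunn index (larger values indicate compact, well-separated clusters); it computes all pairwise distances, so it is only suitable for small data sets. Names are mine.

```python
import numpy as np

def dunn_index(X, labels):
    """min over cluster pairs of inter-cluster distance / largest cluster diameter."""
    clusters = [X[labels == i] for i in np.unique(labels)]
    def pdist(A, B):
        return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    diam = max(pdist(C, C).max() for C in clusters)          # max_k Delta(Omega_k)
    delta = min(pdist(clusters[i], clusters[j]).min()
                for i in range(len(clusters))
                for j in range(len(clusters)) if i != j)     # min inter-cluster distance
    return delta / diam
```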
Xie-Beni index
Also known as the compactness and separation validity function.
The optimal number of clusters corresponds to the lowest value of r, yet the index shows monotonicity when the number of clusters is close to the number of data points.

$$r = \frac{\sum_{k=1}^{N} \sum_{i=1}^{c} u_{ik}^{m} \| x_k - v_i \|^2}{N \left\{ \min_{i \neq j} \| v_i - v_j \|^2 \right\}}$$

Find the lowest value of r.
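A sketch of the index for a fuzzy partition matrix U (c x N), prototypes V (c x n) and data X (N x n); m is the fuzzifier and the names are mine.

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """r = sum_k sum_i u_ik^m ||x_k - v_i||^2 / (N * min_{i != j} ||v_i - v_j||^2)."""
    N = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)     # d2[i, k] = ||x_k - v_i||^2
    num = (U ** m * d2).sum()
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)    # squared prototype distances
    np.fill_diagonal(sep, np.inf)                           # exclude i == j from the min
    return num / (N * sep.min())
```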
Cluster Validity
• Divisive Coefficient (DC)

$$DC = \frac{1}{n} \sum_{i=1}^{c} l(i)$$

(Dendrogram over objects a, b, c, d, e: the root {a, b, c, d, e} has diameter 10.0 and splits into {a, b} and {c, d, e}; {c, d, e} (diameter 5.0) splits into {c} and {d, e}; {d, e} has diameter 3.0 and {a, b} has diameter 2.0; the leaves are at 0.0.)
Cluster Validity
$$DC = \frac{1}{n} \sum_{i=1}^{c} l(i), \qquad l(i) = \left( 1 - \frac{md(j) - md(i)}{md(0)} \right) n(i)$$

where:
i = current object (either a cluster or a singleton)
j = the cluster from which the current object i was split off (its parent)
md(i) = diameter of object i (maximum distance between any two of its sub-objects)
md(0) = diameter of the whole data set
n(i) = number of sub-objects of i
l(0) = 0
Cluster Validity
• Divisive Coefficient (DC) - example

Distance matrix over the objects a, b, c, d, e:

      a     b     c     d     e
a    0.0   2.0   6.0  10.0   9.0
b    2.0   0.0   5.0   9.0   8.0
c    6.0   5.0   0.0   4.0   5.0
d   10.0   9.0   4.0   0.0   3.0
e    9.0   8.0   5.0   3.0   0.0

(Dendrogram as above, with l-values attached to its nodes: l(0) = 0 at the root, l1 for {c, d, e}, l2 for {a, b}, l3 for {d, e}, l4 for {c}, l5 for {d}, l6 for {e}, l7 for {a}, l8 for {b}.)

Sample computation: l1 = (1 - (10 - 5)/10) * 3 = 1.5

l1 = 1.5, l2 = 0.4, l3 = 1.6, l4 = 0.5, l5 = 0.7, l6 = 0.7, l7 = 0.8, l8 = 0.8

DC1 = (l1 + l2) / 2 = (1.5 + 0.4) / 2 = 0.95
At the next level of the hierarchy, with the three clusters {a, b}, {c}, {d, e} (averaging the l-values of the clusters present at that level):

DC2 = (l2 + l4 + l3) / 3 = (0.4 + 0.5 + 1.6) / 3 = 0.83
With four clusters {a, b}, {c}, {d}, {e}:

DC3 = (l2 + l4 + l5 + l6) / 4 = (0.4 + 0.5 + 0.7 + 0.7) / 4 = 0.575
With five singleton clusters {a}, {b}, {c}, {d}, {e}:

DC4 = (l7 + l8 + l4 + l5 + l6) / 5 = (0.8 + 0.8 + 0.5 + 0.7 + 0.7) / 5 = 0.70
Cluster Validity: Fuzzy Clustering
Partition coefficient:

$$P_1 = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^2$$

Partition entropy:

$$P_2 = -\frac{1}{N} \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik} \log_a u_{ik}$$

Sugeno-Takagi:

$$P_3 = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \left( \| x_k - v_i \|^2 - \| v_i - \bar{v} \|^2 \right)$$
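A short sketch of the first two indices for a fuzzy partition matrix U of shape (c, N); the names are mine. P_3 can be computed analogously once the reference vector v̄ is fixed, which the slide does not define, so it is omitted here.

```python
import numpy as np

def partition_coefficient(U):
    """P1 = sum_i sum_k u_ik^2 (higher values indicate a crisper partition)."""
    return (U ** 2).sum()

def partition_entropy(U, a=np.e, eps=1e-12):
    """P2 = -(1/N) sum_i sum_k u_ik log_a(u_ik), with 0*log(0) taken as 0."""
    N = U.shape[1]
    return -(U * np.log(U + eps) / np.log(a)).sum() / N
```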
Degree of Separation - Bezdek
The degree of separation between two fuzzy clusters u_1 and u_2 is the scalar

$$Z(U; 2) = 1 - \bigvee_{k=1}^{n} \left( u_{1k} \wedge u_{2k} \right)$$

and its generalization to c fuzzy clusters is:

$$Z(U; c) = 1 - \bigvee_{k=1}^{n} \bigwedge_{i=1}^{c} u_{ik}, \qquad U \in M_{fc}$$
Degree of Separation - Bezdek
Example: c = 2 and two different fuzzy 2-partitions U and V of X (U nearly maximally fuzzy, V nearly hard; the full matrices are given in the original slide):

Z(U; 2) = Z(V; 2) = 0.50

U and V are very different, so Z does not distinguish between the two partitions.
Partition Coefficient - Bezdek
The partition coefficient (PC) measures the amount of “overlap” between fuzzy clusters, where U is a fuzzy c-partition of n data points.
The partition coefficient, F, is the scalar:

$$F(U; c) = \frac{\sum_{k=1}^{n} \sum_{i=1}^{c} \left( u_{ik} \right)^2}{n}$$

The value of F(U; c) depends on all (c x n) elements of U, in contrast to Z(U; c), which depends on just one.
Partition Coefficient - Bezdek
Example: values of F on the U and V partitions of X above:

F(U; 2) = 0.510
F(V; 2) = 0.990

The value of F gives an accurate indication of the partition for both partitions (the most uncertain and the most certain).
Partition Coefficient - Bezdek
(Figure from Bezdek.)
Partition Entropy - Bezdek
Let {A_i | 1 ≤ i ≤ c} denote a c-partition of the events of any sample space connected with an experiment, and let
p^T = (p_1, p_2, …, p_c)
denote a probability vector associated with the {A_i}.
The pair ({A_i}, p) is called a finite probability scheme for the experiment; the ith component of p is the probability of event A_i, and c is called the length of the scheme.
Note: here c does NOT indicate the number of clusters.
Partition Entropy - Bezdek
Our aim is to find a measure h(p) of the amount of uncertainty associated with each state.
• h(p) should be maximal for p = (1/c, …, 1/c)
• h(p) should be minimal for p = (0, 1, 0, …) (any statistically certain partition)
The entropy of the scheme is defined as:

$$h(p) = -\sum_{i=1}^{c} p_i \log_a(p_i)$$

with the convention that p_i log_a(p_i) = 0 whenever p_i = 0.
Partition Entropy - Bezdek
The partition entropy of any fuzzy c-partition U ∈ M_fc of X, where |X| = n, is, for 1 ≤ c ≤ n:

$$H(U; c) = -\frac{1}{n} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log_a(u_{ik})$$

Let U ∈ M_fc be a fuzzy c-partition of n data points. Then for 1 ≤ c ≤ n and a ∈ (1, ∞):

0 ≤ H(U; c) ≤ log_a(c)
H(U; c) = 0  ⇔  U ∈ M_co is hard
H(U; c) = log_a(c)  ⇔  U = [1/c]
Partition Entropy - Bezdek
Example: entropy for the partitions U and V above:

H(U; c) = 49 log_e(2)/51 = 0.665
H(V; c) = log_e(2)/51 = 0.013

U is a very uncertain partition.
Partition Entropy - Bezdek
We assume that minimization of H(U; c) corresponds to maximizing the amount of information about structural memberships that an algorithm has extracted from the data X.
H(U; c) is used for cluster validity as follows: for each c = 2, 3, …, n - 1, let Ω_c denote a finite set of “optimal” U's ⊂ M_fc; then choose

$$\min_{c} \left\{ \min_{U \in \Omega_c} \{ H(U; c) \} \right\}$$
From Bezdek, the normalized partition entropy:

$$\bar{H}(U; c) = \frac{H(U; c)}{1 - (c/n)}$$
Prototypes as FEATURE SELECTORS (from Bezdek)

Feature centers and absolute differences:

Symptom   v1j (Hernia)   v2j (Gallstones)   |v1j - v2j|
   1          0.57            0.27              0.30
   2          0.98            0.67              0.31
   3          0.06            0.93              0.87
   4          0.22            0.55              0.33
   5          0.17            0.10              0.07
   6          0.77            0.84              0.07
   7          0.42            0.05              0.37
   8          0.39            0.84              0.45
   9          0.48            0.04              0.44
  10          0.02            0.16              0.14
  11          0.12            0.25              0.13
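The table suggests ranking features by how far apart the two cluster prototypes are along each coordinate. A one-line sketch of that ranking, using the values from the table (the variable names are mine):

```python
import numpy as np

# prototype coordinates per symptom, copied from the table above
v1 = np.array([0.57, 0.98, 0.06, 0.22, 0.17, 0.77, 0.42, 0.39, 0.48, 0.02, 0.12])
v2 = np.array([0.27, 0.67, 0.93, 0.55, 0.10, 0.84, 0.05, 0.84, 0.04, 0.16, 0.25])

# symptoms ordered from most to least discriminating (1-based symptom numbers)
ranking = np.argsort(-np.abs(v1 - v2)) + 1
print(ranking)   # symptom 3 comes first (|0.06 - 0.93| = 0.87)
```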
Cluster Errors indicate best FEATURES (from Bezdek)

Symptoms used    Cluster Error E(U)
{1-11}                  23
{3}                     23
{3, 8}                  23
{3, 9}                  36
{3, 8, 9}               36
Takagi-Sugeno index
It is of particular use in the identification of non-linear interactions between variables.
It deals with a partition matrix and also involves the data:

$$P_3 = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \left( \| x_k - v_i \|^2 - \| v_i - \bar{v} \|^2 \right)$$
Random Sampling
Two-phase clustering:
(a) Random sampling (question: how large should the sample be to be representative of the data and of the structure we are interested in finding?)
(b) Clustering of the prototypes found in the samples
(Diagram: DATA → sampling → clustering → prototypes.)
Cluster Validity
• Cluster validity is very hard to check
• Without a good validation, any clustering result is potentially just random.
References
Anderberg MR. 1973. Cluster Analysis for Applications, Academic Press
Bezdek JC. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press
Devijver PA and Kittler J (eds.). 1987. Pattern Recognition Theory and Applications, Springer-Verlag
Dubes R. 1987. How many clusters are best? - an experiment. Pattern Recognition, 20(6): 645-663
Duda RO, Hart PE and Stork DG. 2001. Pattern Classification, 2nd edition, J. Wiley
Dunn JC. 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, Journal of Cybernetics, 3(3): 32-57
Jain AK, Murty MN and Flynn PJ. 1999. Data clustering: a review, ACM Computing Surveys, 31(3): 264-323
Kaufman L and Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, J. Wiley
Kohonen T. 1995. Self-Organizing Maps, Springer-Verlag
Sammon JW Jr. 1969. A nonlinear mapping for data structure analysis. IEEE Trans. on Computers, C-18(5): 401-409
Xie XL and Beni G. 1991. A validity measure for fuzzy clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 13: 841-847
Webb A. 2002. Statistical Pattern Recognition, 2nd edition, J. Wiley