ON STATISTICAL MODEL OF
CLUSTER STABILITY
Z. Volkovich
Software Engineering Department, ORT Braude
College of Engineering, Karmiel 21982, Israel;
Department of Mathematics and Statistics, the
University of Maryland, Baltimore County, USA;
Z. Barzily
Software Engineering Department, ORT Braude
College of Engineering, Karmiel 21982, Israel;
1
Concept
We propose a general statistical approach to the study of the cluster validation problem.
Our concept suggests that the partitions obtained by rerunning a clustering algorithm can be considered realizations of a random variable, such that the most stable random variable indicates the true number of clusters.
We propose to measure stability by means of probability distances within the observed realizations; our method thus relies on the application of probability metrics between random variables.
2
Motivating works
– T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
– V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
3
Clustering
Goal: partition a set S ⊂ R^d by means of a clustering algorithm CL such that
CL(x) = CL(y) if x and y are “similar”
CL(x) ≠ CL(y) if x and y are “dissimilar”
4
Clustering (cont)
CL(x)  CL(y)
CL(x)= CL(y)
An important question: how many clusters are there?
5
Example: a three-cluster set partitioned into 2
and 4 clusters
6
Implication
It is observed that, when the number of clusters is not chosen correctly, inconsistent clusters can be formed. Such a cluster is a union of several heterogeneous parts having different distributions.
7
Concept
DATA → SAMPLE S → INITIAL PARTITION → CLUSTERING MACHINE CL → CLUSTERED SAMPLE S̃ = CL(S)
8
Concept (cont. 1)
S̃1 = CL(S1)
S̃2 = CL(S2)
…
These clustered samples can be considered as estimates of the sought-for “true partition”; accordingly, our concept views these partitions as instances of a random variable such that the steadiest random variable is associated with the true number of clusters.
9
Concept (cont. 2)
Outliers in the samples and limitations of
clustering algorithms significantly add to the noise
level and increase the model’s volatility. Thus, the
clustering procedure has to be iterated many times to
obtain meaningful results.
The stability of random variables is measured via
probability distances. According to our principle, it is
natural to assume that the true number of clusters
corresponds to the distance distribution having the
minimal inner variation.
10
Some probability metrics
Probability metrics are introduced on a space Λ of real-valued random variables defined on the same probability space.
A functional dis: Λ × Λ → R+ is called a probability metric if it satisfies the following properties:
– Identity: dis(X, X) = 0
– Symmetry: dis(X, Y) = dis(Y, X)
– Triangle inequality: dis(X, Z) ≤ dis(X, Y) + dis(Y, Z)
The last two properties are not always required. If a probability metric identifies the distribution, i.e. dis(X, Y) = 0 if and only if X and Y have the same distribution, then the metric is called simple; otherwise the metric is called compound.
11
Examples
Lp-metric
Ky Fan metric
Hellinger distance
Relative entropy (or Kullback-Leibler divergence)
Kolmogorov (or uniform) metric
Lévy metric
Prokhorov metric
Total variation distance
Wasserstein (or Kantorovich) metric
Chi-squared distance
And so on…
13
Concentration measure index
The Lp and the Ky Fan metrics are compound; the other metrics shown above are simple. A difference between the simple and compound metrics is that, unlike the compound ones, simple metrics equal zero for two independent identically distributed random variables. Moreover, if dis is a compound metric, dis(X, X') = 0, and X, X' are independent realizations of the same random variable X, then X is a constant almost surely.
A compound distance can therefore be used as a measure of uncertainty. In particular, d(X) = dis(X, X') is called the concentration measure index derived from the compound distance dis. The stability of a random variable can be estimated by the average value of this index.
14
Simple and Compound Metrics
Simple metrics → Geometrical Algorithms
Compound metrics → Membership Algorithms
15
Geometrical Algorithm
Several algorithms can be offered here. The first type considers Geometrical Stability. The basic idea is that if one “properly” clusters two independent samples, then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as samples drawn from the same population.
16
Example: Cluster stability via samples
(Figure: the two samples recover the same Cluster 1 and Cluster 2.)
The samples found the same clusters; the partition is stable.
17
General algorithm
Given a probability metric dis(·, ·):
1. For each tested number of clusters k, execute steps 2 to 6 as follows:
2. Draw pairs of disjoint samples (St, St+1), t = 1, 2, …
3. Cluster the united sample (S̃ = CL(St ∪ St+1)) and separate S̃ into the two clustered samples (S̃t, S̃t+1)
4. Calculate the distance values dt = dis(S̃t, S̃t+1)
5. Average each consecutive T distances
6. Normalize the set of averages to obtain the set DISk
7. The number of clusters k which yields the most concentrated normalized distribution DISk is chosen as the estimate of the true number of clusters.
18
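Below is a minimal code sketch of this procedure (in Python), assuming k-means as the clustering machine CL; the function name, the sample size, the normalization by a common maximum, and the signature of the pluggable metric dis (points plus labels of each sample) are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_stability(data, dis, k_values, n_pairs=80, sample_size=70, T=4, seed=0):
    """For each candidate k, build the set DIS_k of averaged, normalized distance
    values between pairs of disjoint, jointly clustered samples (slide 18).
    n_pairs must be divisible by T."""
    rng = np.random.default_rng(seed)
    averaged = {}
    for k in k_values:                                     # step 1: loop over tested k
        d = []
        for _ in range(n_pairs):                           # step 2: disjoint sample pairs
            idx = rng.choice(len(data), 2 * sample_size, replace=False)
            s1, s2 = data[idx[:sample_size]], data[idx[sample_size:]]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.vstack([s1, s2]))
            l1, l2 = labels[:sample_size], labels[sample_size:]  # step 3: split back
            d.append(dis(s1, l1, s2, l2))                  # step 4: distance between halves
        averaged[k] = np.asarray(d).reshape(-1, T).mean(axis=1)  # step 5: average T values
    scale = max(a.max() for a in averaged.values())        # step 6: one plausible common
    dis_k = {k: a / (scale if scale > 0 else 1.0) for k, a in averaged.items()}  # normalization
    # step 7: choose the k whose DIS_k is most concentrated near zero, e.g. by its
    # sample mean (the standard deviation or the lower quartile could be used instead)
    return min(dis_k, key=lambda k: dis_k[k].mean()), dis_k
```

The per-cluster N-distance sketched after slide 27 can be plugged in as dis.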
How is the concentration
measured?
The distances are inherently
positive, and we are interested in
distributions concentrated near zero.
Thus, we can use as concentration
measures the sample mean, the sample
standard deviation and the lower
quartile.
19
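For instance, given one set of normalized averaged distances DIS_k (hypothetical values), the three concentration measures are computed directly:

```python
import numpy as np

dis_k = np.array([0.12, 0.08, 0.15, 0.05, 0.11])   # hypothetical normalized averages
mean_value = dis_k.mean()                           # sample mean
std_value = dis_k.std(ddof=1)                       # sample standard deviation
lower_quartile = np.percentile(dis_k, 25)           # lower quartile
```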

Simple distances
Many distances applicable in two-sample tests are simple metrics. For instance, the well-known Kolmogorov-Smirnov test is based on the maximal distance between the empirical distribution functions in the one-dimensional case. In the multivariate case, the following tests can be mentioned in this context:
• the Friedman-Rafsky test, 1979 (minimal spanning tree based)
• the Nearest Neighbors test of Henze, 1988
• the Energy test of Zech and Aslan, 2005
• the N-distances test of Klebanov, 2005
Such metrics can be used in the Geometrical Algorithm.
20
Simple distances (cont)
Recall that the classical two-sample problem tests the hypothesis
H0 : F(x) = G(x)
against the general alternative
H1 : F(x) ≠ G(x),
when the distributions F and G are unknown.
Here we consider applications of the N-distances test of Klebanov and the Friedman-Rafsky test.
21
Klebanov’s N-distances
N-distances are defined as follows:
dis(X, Y) = 2E(N(X1, Y1)) − E(N(X1, X1')) − E(N(Y1, Y1')),
where X1, X1' and Y1, Y1' are independent realizations of X and Y, respectively. N is a negative definite kernel. We use
N(x, y) = ||x − y||^α, 0 < α ≤ 2.
22
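A direct empirical sketch of this distance for two finite samples, with the expectations replaced by sample averages (the function name and the use of SciPy are my choices, not from the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def n_distance(x, y, alpha=1.0):
    """Empirical N-distance 2 E N(X1, Y1) - E N(X1, X1') - E N(Y1, Y1')
    with the kernel N(x, y) = ||x - y||**alpha, 0 < alpha <= 2."""
    exy = (cdist(x, y) ** alpha).mean()   # cross term over all pairs (x, y)
    exx = (pdist(x) ** alpha).mean()      # within-sample term for X over pairs x != x'
    eyy = (pdist(y) ** alpha).mean()      # within-sample term for Y over pairs y != y'
    return 2.0 * exy - exx - eyy
```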
Graphical illustration
Let us consider two samples S1 and S2 partitioned into clusters.
(Figure: the two partitioned samples S1 and S2.)
23
Graphical illustration (cont. 1)
Distances between points belonging to different samples
(Figure: clusters C1 and C2.)
Σ_{i=1..k} [ 2 dis_i(S1, S2) − dis_i(S1, S1) − dis_i(S2, S2) ]
24
Graphical illustration (cont. 2)
(Figure: clusters C1 and C2.)
Σ_{i=1..k} [ 2 dis_i(S1, S2) − dis_i(S1, S1) − dis_i(S2, S2) ]
25
Remark
Note that the metric is actually the average distance between the samples within the clusters. If this value is close to zero, then it can be concluded that the partition is stable. Namely, the elements of the samples are closely located inside the clusters and can be deemed a consistent set within each of the clusters.
26
Distances calculations
Dis(S1, S2) = Σ_{i=1..k} [ 2 dis_i(S1, S2) − dis_i(S1, S1) − dis_i(S2, S2) ]
dis_i(S1, S2) = ( Σ_{X1 ∈ Ci∩S1} Σ_{X2 ∈ Ci∩S2} ||X1 − X2|| ) / ( |Ci∩S1| · |Ci∩S2| )
dis_i(S1, S1) = ( Σ_{X1, X2 ∈ Ci∩S1, X1 ≠ X2} ||X1 − X2|| ) / ( |Ci∩S1| · (|Ci∩S1| − 1) )
dis_i(S2, S2) = ( Σ_{X1, X2 ∈ Ci∩S2, X1 ≠ X2} ||X1 − X2|| ) / ( |Ci∩S2| · (|Ci∩S2| − 1) )
27
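A compact sketch of these per-cluster sums (the helper names are mine; the kernel ||X1 − X2||**alpha generalizes the formulas above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_dis(s1, l1, s2, l2, k, alpha=1.0):
    """Dis(S1, S2): sum over clusters i of 2*dis_i(S1,S2) - dis_i(S1,S1) - dis_i(S2,S2),
    where s1, s2 are the sample points and l1, l2 their cluster labels in {0,...,k-1}."""
    def avg(a, b, same):
        d = cdist(a, b) ** alpha
        if same:                            # exclude the X1 == X2 pairs within a sample
            n = len(a)
            return (d.sum() - np.trace(d)) / (n * (n - 1)) if n > 1 else 0.0
        return d.mean()
    total = 0.0
    for i in range(k):
        a, b = s1[l1 == i], s2[l2 == i]     # points of each sample falling in cluster C_i
        if len(a) and len(b):
            total += 2 * avg(a, b, False) - avg(a, a, True) - avg(b, b, True)
    return total
```

This function has the signature expected by the stability sketch given after slide 18.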
Histograms of the distances’ values
Two cases are presented: the case when the quantity T of the averaged distance values is big, and the case when T is small, respectively. In the second case, measuring the concentration via the lower quartile appears to be more appropriate.
28
Example: The Iris Flower Dataset
• The kernel: N(x,y)=||x-y||
• Sample size: 70
• Number of samples: 20*80
• Number of averaged samples: 80
29
Graph of the normalized mean value
30
Graph of the normalized quartile
value
31
Euclidean Minimal Spanning Tree
The Euclidean minimum spanning tree, or EMST, is a minimum spanning tree of a set S of points in a Euclidean space R^d, where the weight of the edge between each pair of points is the distance between those two points. In simpler terms, an EMST connects a set of dots using lines such that the total length of all the lines is minimized and any dot can be reached from any other by following the lines (see Wikipedia).
32
Euclidean minimal spanning tree
(cont.)
A tree on S is a graph which has no loops and whose vertices are elements of S.
If all distances between the items of S are distinct, then the set S is called nice and it has a unique EMST.
An EMST can be built in O(n^2) time (including the distance calculations) using Prim's, Kruskal's, Borůvka's, or Dijkstra's algorithm.
33
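A short sketch of building an EMST with SciPy; computing the full pairwise distance matrix keeps it within the O(n^2) regime mentioned above (the function name and the usage example are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_edges(points):
    """Return the (i, j, length) edges of the Euclidean minimum spanning tree of `points`."""
    dist = squareform(pdist(points))              # dense n x n Euclidean distance matrix
    mst = minimum_spanning_tree(dist).tocoo()     # sparse matrix holding the n - 1 tree edges
    return list(zip(mst.row, mst.col, mst.data))

# usage: an EMST of 60 random points in the plane (cf. the next slide)
edges = emst_edges(np.random.default_rng(0).random((60, 2)))
```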
An EMST of 60 random points
34
How can an EMST be used in the cluster
validation problem?
We draw two samples S1 and S2 and determine
S = S1 ∪ S2, the pooled sample, and
S̃ = CL(S), the clustered pooled sample.
We expect that, in the case of a stable clustering, the items of the two samples are closely located inside the clusters.
This is characterized by the number of edges connecting points from different samples.
These edges are marked in red in the diagrams in the examples.
35
Graphical illustration. Stable
clustering
36
Graphical illustration. Non-stable
clustering
37
The two-sample MST-test
The two-sample test tests the hypothesis
H0 : F(x) = G(x)
against the general alternative
H1 : F(x) ≠ G(x),
when the distributions F and G are unknown.
The number of edges connecting points from different samples has been considered in Friedman and Rafsky's MST test.
In particular, let X = {xi}, i = 1,…,n, and Y = {yj}, j = 1,…,m, be two samples of independent random elements distributed according to F and G, respectively.
38
The two-sample MST-test (cont. 1)
Suppose that the set Z = X ∪ Y is nice.
Friedman and Rafsky's test statistic Rmn can be defined as the number of edges of the EMST of Z which connect a point of X to a point of Y.
Friedman and Rafsky actually introduced the statistic 1 + Rmn, which expresses the number of disjoint sub-trees resulting from removing all edges uniting vertices of different samples.
39
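A sketch of computing Rmn from the pooled EMST (again via SciPy's minimum spanning tree; the helper name is mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def friedman_rafsky(x, y):
    """R_mn: the number of EMST edges of the pooled sample Z = X u Y
    that join a point of X to a point of Y."""
    z = np.vstack([x, y])
    from_x = np.arange(len(z)) < len(x)                      # True for rows coming from X
    mst = minimum_spanning_tree(squareform(pdist(z))).tocoo()
    return int(np.sum(from_x[mst.row] != from_x[mst.col]))   # count cross-sample edges
```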
The two-sample MST-test (cont. 2)
Henze and Penrose (1999) considered the asymptotic behavior of Rmn. Suppose that m → ∞ and n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained:
( Rmn − r(m + n) ) / √(m + n) → N(0, σ²),
where the convergence is in distribution and N(0, σ²) denotes the normal distribution with a zero expectation and a variance
σ² = r (r + Cd(1 − 2r))
for some constant Cd depending only on the space's dimension d.
40
The two-sample MST-test (cont. 3)
This result coincides with our intuition, since if two sets are close then there are many edges which unite points from different samples. In the spirit of the Central Limit Theorem, this quantity is expected to be asymptotically normally distributed.
41
Theorem’s application
To use this theorem, for each possible number of clusters k = 2,…,k*, we draw many pairs of disjoint samples S1 and S2 having the same, sufficiently big, size n and calculate S = S1 ∪ S2, St = CL(S).
We consider the sub-samples
S1,j = S1 ∩ Πj(St),
S2,j = S2 ∩ Πj(St),
where Πj(St), j = 1,…,k, is the j-th cluster in the partition of St obtained by means of CL.
42
Theorem’s application (cont. 1)
For each j = 1,…,k we compute a value of the two-sample MST-test statistic Rnn(S1,j, S2,j) and average:
Rn(S1, S2) = (1/k) Σ_{j=1..k} Rnn(S1,j, S2,j)
In the case where all clusters are homogeneous sets, this statistic is asymptotically normally distributed. We collect many such values and look at the distance between their standardized distribution and the standard normal distribution.
The number of clusters k which yields the minimal distance is chosen as the estimate of the true number of clusters.
43
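A sketch of the selection step, assuming the averaged values Rn(S1, S2) have already been collected for every candidate k; standardization by the sample mean and standard deviation and the Kolmogorov-Smirnov distance (slide 46) are one concrete reading of "distance between their standardized distribution and the standard normal distribution":

```python
import numpy as np
from scipy.stats import kstest

def best_k_by_normality(rn_values):
    """rn_values: dict mapping k -> array of averaged MST statistics Rn(S1, S2)
    collected over many sample pairs. Returns the k whose standardized values lie
    closest to N(0, 1), together with all the distances."""
    distances = {}
    for k, r in rn_values.items():
        z = (r - r.mean()) / r.std(ddof=1)          # standardize the empirical distribution
        distances[k] = kstest(z, "norm").statistic  # Kolmogorov-Smirnov distance to N(0, 1)
    return min(distances, key=distances.get), distances
```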
Example: Calculation of Rn(S1,S2)
(Figure: a pooled two-cluster partition; the diagram shows 3 and 2 cross-sample EMST edges in the two clusters.)
n = m = 10, k = 2;
Rn(S1, S2) = (1/2)(3 + 2) = 5/2;
2nm/(n + m) = 10;  1/√(n + m) = 1/√20;
standardized value: ( (3 + 2)/2 − 10 ) / √20.
44
Distances from normality
We can consider the following distances from normality:
1. The Kolmogorov-Smirnov test statistic
2. Friedman's Pursuit Index
3. Entropy-based Pursuit Indexes
4. The BCM function
Generally, any simple metric can be used here.
45
The Kolmogorov-Smirnov Distance
The Kolmogorov-Smirnov distance is based on the empirical distribution function.
Given N ordered data items R1 ≤ R2 ≤ R3 ≤ … ≤ RN, a function FN(i) is defined as the fraction of items less than or equal to Ri.
The distance to the standard normal distribution is given by:
D = max_i ( FN(i) − G(Ri), G(Ri) − FN(i) ),
where G(Ri) is the value of the standard normal cumulative distribution function evaluated at the point Ri.
46
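A direct implementation of this definition (a sketch; with FN(i) = i/N it is close to, though not identical with, the classical two-sided statistic that also compares G(Ri) against (i − 1)/N):

```python
import numpy as np
from scipy.stats import norm

def ks_distance_to_standard_normal(values):
    """Kolmogorov-Smirnov distance between the empirical distribution of the
    (already standardized) `values` and N(0, 1), following the slide's definition."""
    r = np.sort(values)                   # ordered data items R_1 <= ... <= R_N
    n = len(r)
    fn = np.arange(1, n + 1) / n          # F_N(i): fraction of items <= R_i
    g = norm.cdf(r)                       # G(R_i): standard normal CDF at R_i
    return float(np.max(np.maximum(fn - g, g - fn)))
```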
The Kolmogorov-Smirnov Distance
47
Example: synthetic data
Mixture of three Gaussian clusters
48
Example: synthetic data (cont. 1)
49
Another application
Another approach, by Jain, Xu, Ho and Xiao and by Smith and Jain, proposes to carry out a uniformity test for cluster detection.
The main idea here is to locate an “inconsistent” edge whose length is significantly larger than the average length of nearby edges. The distribution of these lengths must be normal under the null hypothesis.
This method is unable to assess the true number of clusters in the data.
50
51
Membership Stability Algorithm
Such an algorithm uses a simulation of random variables obtained by repeated clusterings of the same sample.
The obtained results are compared by means of a compound (non-simple) metric, which is a function of variables defined on the label set Ck = {1,…,k}, where k is the number of clusters considered.
52
Lp-metric
The Lp-metric is the best-known example of a compound metric. For every p the Lp-metric is defined by
dis_p(X, Y) = ( E|X − Y|^p )^min(1, 1/p)
In particular:
dis_∞(X, Y) = inf{ c ≥ 0 : P(|X − Y| > c) = 0 } (the ∞-metric);
dis_0(X, Y) = E( 1(X ≠ Y) ) (the indicator metric).
The last metric has been, de facto, applied in:
T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
53
A family of clustering algorithms
S → CLUSTERING MACHINE 1 (CL1) → S1 = CL1(S)
S → CLUSTERING MACHINE 2 (CL2) → S2 = CL2(S)
…
S → CLUSTERING MACHINE M (CLM) → SM = CLM(S)
In contrast with the geometrical algorithm, we use here a family of clustering algorithms. It can be the same algorithm starting from different initial points.
54
Clusters correspondence problem
Let us consider clustering the same set S twice.
(Figure: the two partitions of S into Cluster 1 and Cluster 2.)
55
The same cluster can be labeled differently.
We need a correspondence between the labels α and β obtained for a sample S.
This task is solved by finding a permutation ψ of the set Ck which achieves the smallest misclassification between α and the permuted β, i.e.
ψ* = argmin_ψ Σ_{x∈S} 1( α(x) ≠ ψ(β(x)) )
The computational complexity of solving this problem by the well-known Hungarian method is O(k^3). This technique has been used in [Lange T. et al., 2004] and is also known in the cluster-combining area (see, for example, [Topchy A. et al., 2003]).
56
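A sketch of this alignment using SciPy's Hungarian solver, assuming integer labels 0,…,k−1; the returned rate is the indicator metric dis_0 between α and the optimally permuted β (the function name is mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(alpha, beta, k):
    """Find the permutation psi of {0,...,k-1} minimizing the number of points x with
    alpha(x) != psi(beta(x)), via the Hungarian method on the k x k confusion matrix."""
    confusion = np.zeros((k, k), dtype=int)
    for a, b in zip(alpha, beta):
        confusion[a, b] += 1                          # points labeled a by alpha, b by beta
    rows, cols = linear_sum_assignment(-confusion)    # maximize the matched counts
    psi = dict(zip(cols, rows))                       # psi: beta-label -> alpha-label
    mismatches = sum(a != psi[b] for a, b in zip(alpha, beta))
    return psi, mismatches / len(alpha)               # minimal misclassification rate (dis_0)
```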