#### Transcript Chapter 9 Part 2

Chapter 9 UNSUPERVISED LEARNING: Clustering, Part 2 (Cios / Pedrycz / Swiniarski / Kurgan)

##### SOM Clustering
Characteristics of the operation of a human associative memory:
• information is retrieved/recalled on the basis of some measure of similarity relating to a key pattern
• memory stores and recalls representations of "what is out there" as structured sequences
• the recall of information from memory is dynamic

##### Self-Organizing Feature Maps (SOM)
In data analysis it is fundamental to:
• capture the topology and probability distribution of pattern vectors
• map pattern vectors from the original high-dimensional space onto a lower-dimensional new feature space (a compressed space)

##### SOM
Data compression requires selection of the features that best represent the data for a specific purpose, e.g., for better visual inspection of the data's structure. Most attractive from the human point of view are visualizations in 2D or 3D space. The major difficulty is how to faithfully project/map the data so that the topology present in the original feature space is preserved.

##### SOM: Topology-Preserving Properties
The mapping should have these properties:
• similar patterns in the original feature space must also be similar in the reduced feature space, according to some similarity criterion
• the original and the reduced space should be of a "continuous nature", i.e., the density of patterns in the reduced feature space should correspond to the density in the original space

##### Topology-Preserving Mappings
Several methods were developed for topology-preserving mapping:
• linear projections, such as projections onto eigenvectors
• nonlinear projections, such as Sammon's projection
• nonlinear projections, such as SOM neural networks

##### Sammon's Projection
A nonlinear projection used to preserve the topological relations between patterns in the original and the reduced spaces by preserving the inter-pattern distances. Sammon's projection minimizes an error defined as the difference between inter-pattern distances in the original and in the reduced feature space.
Let $\{\mathbf{x}_k\}$ be a set of $L$ $n$-dimensional vectors in the original feature space $R^n$, and $\{\mathbf{y}_k\}$ the set of $L$ corresponding $m$-dimensional vectors in the reduced low-dimensional space $R^m$, with $m \ll n$. The distortion measure is

$$J_d = \frac{1}{\sum_{i=1}^{L}\sum_{j=1, j \neq i}^{L} d(\mathbf{x}_i, \mathbf{x}_j)} \sum_{i=1}^{L}\sum_{j=1, j \neq i}^{L} \frac{\left(d(\mathbf{x}_i, \mathbf{x}_j) - d(\mathbf{y}_i, \mathbf{y}_j)\right)^2}{d(\mathbf{x}_i, \mathbf{x}_j)}$$

Sammon's projection performs a nonlinear projection, typically onto a 2D plane. Disadvantages:
• it is computationally heavy
• it cannot be used to project new points (points that were not used during training) onto the output plane

##### SOM: Principle
Mapping from a high-dimensional space onto a low-dimensional (2D or 3D) space.

##### Self-Organizing Feature Maps (SOM)
• Developed by Kohonen in 1982
• SOM is an unsupervised, topology-preserving projection algorithm
• It uses a feedforward NN topology
• It is a scaling method that projects data from a high-dimensional input space into a lower-dimensional output space
• Similar vectors in the input space are projected onto nearby neurons on the 2D map

##### SOM
The feature map is a layer in which the neurons organize themselves according to the input patterns. Each "neuron" of the input layer is connected to each neuron of the 2D map, and the weights associated with the inputs are propagated to the 2D map neurons.
• Often not only the winning neuron updates its weights but also the neurons in a certain area around it
• SOM reflects the ability of biological neurons to perform global ordering based on local interactions

##### SOM: Topology and Learning Rule
One iteration of SOM learning:
1. Present a randomly selected input vector x to all neurons
2. Select the winning neuron, i.e., the one whose weight vector is closest to the input vector according to the chosen similarity measure
3. Adjust the weights of the jth (winning) neuron and the weights of its neighboring neurons (defined by some neighborhood function)

The jth winning neuron is selected as the one having the minimal distance value:

$$\|\mathbf{x} - \mathbf{w}_j\| = \min_{k} \|\mathbf{x} - \mathbf{w}_k\|, \quad k = 1, \dots, M$$

The winner-takes-all learning rule is used for adjusting the weights:

$$\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + h_j(N_j(t), t)\,(\mathbf{x} - \mathbf{w}_k(t)), \quad k = 1, 2, \dots, M$$

where $(\mathbf{x} - \mathbf{w}_k(t))$ expresses the degree of similarity between a neuron and its input.

Kohonen also proposed a dot-product similarity for selecting the winning neuron:

$$\mathbf{w}_j \cdot \mathbf{x} = \max_{k} \{\mathbf{w}_k \cdot \mathbf{x}\}, \quad k = 1, \dots, M$$

with the learning rule

$$\mathbf{w}_i(t+1) = \frac{\mathbf{w}_i(t) + \eta(t)\,\mathbf{x}}{\|\mathbf{w}_i(t) + \eta(t)\,\mathbf{x}\|}, \quad i \in N_j(t)$$

where $N_j(t)$ is the winning neuron's neighborhood and $\eta(t)$ ($0 < \eta(t) < 1$) is a decreasing learning-rate function. This formula automatically normalizes the weights to length one.
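To make the winner-takes-all rule concrete, the following is a minimal NumPy sketch of a single SOM learning step (an illustration, not the authors' code). It uses the simple step-function kernel, i.e., the update factor equals η(t) inside the current neighborhood and 0 outside; the map size, learning rate, and radius in the toy call are arbitrary assumptions.

```python
import numpy as np

def som_iteration(x, W, grid, eta, radius):
    """One SOM learning step with Euclidean winner selection.

    x      : (n,)  input vector
    W      : (M, n) weight vectors, one row per map neuron
    grid   : (M, 2) 2D grid coordinates r_k of the map neurons
    eta    : current learning rate eta(t), 0 < eta < 1
    radius : current neighborhood radius (shrinks over time)
    """
    # 1. winning neuron j: minimal Euclidean distance ||x - w_k||
    j = np.argmin(np.linalg.norm(W - x, axis=1))
    # 2. neighborhood N_j(t): neurons whose grid position lies within
    #    `radius` of the winner's grid position r_j
    grid_dist = np.linalg.norm(grid - grid[j], axis=1)
    in_hood = grid_dist <= radius
    # 3. update: w_k(t+1) = w_k(t) + eta * (x - w_k(t)) inside N_j(t)
    W[in_hood] += eta * (x - W[in_hood])
    return j, W

# toy usage with assumed sizes: a 10x10 map trained on 3-D inputs
rng = np.random.default_rng(0)
grid = np.array([[i, k] for i in range(10) for k in range(10)], dtype=float)
W = rng.random((100, 3))
x = rng.random(3)
winner, W = som_iteration(x, W, grid, eta=0.5, radius=3.0)
```

In a full training loop both `eta` and `radius` would be recomputed at every step from the decreasing schedules described in the next section.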
##### Neighborhood Kernel
The neighborhood kernel (of the winning neuron) is a non-increasing function of time and of distance. It defines the region of influence that the input has on the SOM:

$$h_j(N_j(t), t) = h_j(\|\mathbf{r}_j - \mathbf{r}_i\|, t), \qquad 0 \le h_j(N_j(t), t) \le 1$$

where $\mathbf{r}_j$ and $\mathbf{r}_i$ are the grid positions (radii) of the winning neuron $j$ and of neuron $i$, respectively.

The geometric set of neighborhood neurons must shrink as the iteration steps (time) increase. Convergence of the learning process requires that the radius decrease with learning time/iteration:

$$r(t_1) \ge r(t_2) \ge r(t_3) \ge \dots \qquad N_j(r(t_1), t_1) \supseteq N_j(r(t_2), t_2) \supseteq N_j(r(t_3), t_3) \supseteq \dots$$

This causes global ordering through local interactions and local weight adjustments.

With the neighborhood function fixed, the neighborhood kernel learning function can be defined as:

$$h_j(N_j(t), t) = h_j(\|\mathbf{r}_j - \mathbf{r}_i\|, t) =
\begin{cases}
\eta(t) & \text{for } i \in N_j(t) \\
0 & \text{otherwise}
\end{cases}
\qquad \eta(t) = \eta_{max}\,(1 - t/T)$$

where $T$ is the number of iterations.

Another frequently used neighborhood kernel is a Gaussian function with a radius decreasing with time:

$$h_j(N_j(t), t) = h_j(\|\mathbf{r}_j - \mathbf{r}_i\|, t) = \eta(t)\,\exp\!\left(-\frac{\|\mathbf{r}_j - \mathbf{r}_i\|^2}{2\,\sigma^2(t)}\right)$$

where $\mathbf{r}_j$ is the position of the winning neuron, $\mathbf{r}_i$ the position of the ith neuron, and $\sigma(t)$ the width of the kernel.

##### SOM: Algorithm
Conditions for successful learning of the SOM:
• The training data must be large, since self-organization relies on statistical properties of the data
• Proper selection of the neighborhood kernel function ensures that only the weights of the winning neuron and its neighborhood neurons are locally adjusted
• The radius of the winning neighborhood, as well as the learning-rate function, must monotonically decrease with time
• The amount of weight adjustment for neurons in a winning neighborhood depends on how close they are to the winner

Given: a 2D network topology consisting of M neurons; a training data set of L n-dimensional input vectors; the number of iterations T; a neighborhood kernel function; a learning-rate function.
1. Set the learning rate to the maximum of the learning-rate function
2. Set the iteration step t = 0
3. Randomly select initial values of the weights
4. Randomly select an input pattern and present it to the network
5. Compute the current learning rate for step t using the given learning-rate function
6. Compute the Euclidean distances $\|\mathbf{x}_i - \mathbf{w}_k(t)\|$, k = 1, …, M
7. Select the jth winning neuron: $\|\mathbf{x}_i - \mathbf{w}_j(t)\| = \min_k \|\mathbf{x}_i - \mathbf{w}_k(t)\|$, k = 1, …, M
8. Define the winning neighborhood around the winning neuron using the neighborhood kernel
9. Adjust the weights of the neurons: $\mathbf{w}_p(t+1) = \mathbf{w}_p(t) + \eta(t)\,(\mathbf{x}_i - \mathbf{w}_p(t))$, for $p \in N_j(t)$
10. Increase t = t + 1. If t > T stop; otherwise go to step 4
Result: trained SOM network (clusters found)

[Figure: trained map example, from Kohonen.]

##### SOM: Interpretation
Given: a trained SOM network consisting of a 2D array of M neurons, each neuron receiving the input via its weight vector w, and a small training data set of pairs (xi, ci), i = 1, 2, …, L, where c is the class label.
1. Set i = 1
2. Present the input pattern xi to the network
3. Calculate the distances or the dot products
4. Locate the spatial position of the winning neuron and assign label c to that neuron
5. Increase i = i + 1 and continue while i <= L
Result: calibrated SOM network

##### SOM: Issues
In practice we do not optimize the SOM design; instead, we pay attention to:
• Initialization of the weights. Often they are normalized and set equal to the first several input vectors. If not initialized to input vectors, they should be grouped in one quadrant of the unit circle so that they can unfold to map the distribution of the input data.
• The starting value of the learning rate and its decreasing schedule
• The structure of the neighborhood kernel and its decreasing schedule

##### SOM: Examples
[Figures: SOM example maps.]

##### SOM
SOM visualizes the structure of highly dimensional data X by mapping it onto a low-dimensional (typically two-dimensional) grid of linear neurons.

##### SOM Application: Mouse Model of Down Syndrome
The following figures and tables are from: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. doi:10.1371/journal.pone.0129126, http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0129126

Fig 1. Classes of mice.
Table 1. Group comparisons and biological relevance.
SOM result: the number inside each node tells how many training data points are clustered within it; the sum of all the numbers is the total number of training data points.
Fig 3. Venn diagrams of discriminating proteins.
Fig 4. SOM clustering with subsets of protein expression data from control mice: using only 11 proteins/features vs. using the remaining 66 (77-11) proteins.
Table 3. Functional associations of discriminant proteins used to generate SOMs.
Fig 5. SOM clustering with data from the 11 proteins that discriminate between context-shock and shock-context (c-CS and c-SC) plus …
Fig 6. SOM clustering of trisomic mice data using 77 proteins.

##### Clustering and Vector Quantization
A codebook of prototypes is shared by an encoder/decoder pair.
Encoding: determine the best representative (prototype) of the codebook for an input x and store (transmit) only its index, $i_0 = \arg\min_i \|\mathbf{x} - \mathbf{v}_i\|$, where $\mathbf{v}_i$ is the ith prototype.
Decoding: recall the best prototype given the transmitted index $i_0$.

##### Cluster Validity
• Using different clustering methods might result in different partitions of X at each value of c
• How many clusters exist in the data?
• Which clusterings are valid?
• It is plausible to expect "good" clusters at more than one value of c (2 ≤ c < n)

##### Aspects of Cluster Validation
1. Determine the clustering tendency of a data set, i.e., distinguish whether a non-random structure actually exists in the data.
2. Compare the results of a cluster analysis to externally known class labels (if they exist).
3. Evaluate how well the results fit the data, using only the data.
4. Compare the results of different cluster analyses and determine "the best" one.
5. Determine the "correct" number of clusters.
(From: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf)

##### Cluster Validation Process
Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion [Jain & Dubes, 1988].
• How to be "quantitative": employ some measures.
• How to be "objective": validate the measures.
[Diagram: the data set X is clustered for different numbers of clusters m; a validity index selects the best m*.]
(From: cs.uef.fi/pages/franti/cluster/Clustering-Part3.ppt)

##### Cluster Validity
Process X (generate clusters) at c = 2, 3, …, n - 1 and record the optimal values of some criterion as a function of c. The most valid clustering is taken as an extremum of this function.
Problem: criterion functions often have multiple local stationary points at a fixed c; moreover, global extrema do not necessarily correspond to the "best" c-partitions of the data.

##### Cluster Error
The cluster error associated with any $U \in M_c$ is the number of vectors in X that are mislabeled by U:

$$E(U) = \frac{1}{2}\sum_{k=1}^{n}\sum_{i=1}^{c}\left(u_{ik} - \hat{u}_{ik}\right)^2$$

where $\hat{U}$ is the reference labeling. E(U) is an absolute measure of cluster validity when X is labeled, and is undefined when X is not labeled.
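As an illustration of the cluster error just defined (a minimal sketch, not the authors' code, under the assumption that U and the reference labeling Û are hard c × n membership matrices with aligned cluster order):

```python
import numpy as np

def cluster_error(U, U_true):
    """Number of vectors mislabeled by the hard c-partition U.

    U, U_true : (c, n) hard membership matrices (entries 0 or 1,
                each column sums to 1); columns of U are assumed to be
                aligned with the reference labeling U_true.
    """
    # a mislabeled vector differs from the reference in exactly two
    # entries of its column, so its squared differences sum to 2
    return 0.5 * np.sum((U - U_true) ** 2)

# toy example: 5 objects, 2 clusters, one object mislabeled
U_true = np.array([[1, 1, 1, 0, 0],
                   [0, 0, 0, 1, 1]])
U      = np.array([[1, 1, 0, 0, 0],
                   [0, 0, 1, 1, 1]])
print(cluster_error(U, U_true))  # -> 1.0
```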
##### Cluster Validity
A more formal approach is to pose the validity question within the framework of statistical hypothesis testing.
• However, the sampling distribution is not known
• Nonetheless, goodness-of-fit statistics such as the chi-square and Kolmogorov-Smirnov tests are being used

##### Clusters in Random Data
[Figure: points drawn uniformly at random, and the "clusters" imposed on them by DBSCAN, K-means, and complete-link clustering.]

##### Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity can be classified into:
• External indices: measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
• Internal indices: measure the goodness of a clustering structure without using external information (e.g., sum of squared errors, SSE)
• Relative indices: compare two different clusterings or clusters (often SSE or entropy)
(From: http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf)

##### Cluster Validity
The global minimum of $J_w$ may suggest the "wrong" 2 clusters. Example from Bezdek: n = 29 data vectors $\{\mathbf{x}_k\} \subset R^2$, where s is a distance parameter of the configuration shown in the figure; the visually "correct" 2-partition of X is shown on the left. For the correct hard 2-partition $\tilde{U}$, $J_w(\tilde{U}, \tilde{\mathbf{v}}) \approx 102$, while for the competing hard 2-partition $U^*$, $J_w(U^*, \mathbf{v}^*) = 2.25s^2 + 5s + 6.25$. For $s \le 5.5$, $J_w(U^*, \mathbf{v}^*) \le J_w(\tilde{U}, \tilde{\mathbf{v}})$, so the global minimum is hardly an attractive solution.

##### Cluster Validity
Basic question: what constitutes a "good" cluster? What do we mean by a "cluster" anyhow? The difficulty is that the data X and every partition of X are separated by the algorithm generating the partition matrix U (and defining the "clusters" in the process).
• Many algorithms ignore this fundamental difficulty and are thus heuristic in nature
• Some heuristic methods measure the amount of fuzziness in U and presume that the least fuzzy partitions are the most valid

##### Cluster Validity
How do we assess the quality of clusters? How many clusters should be found in the data?
Compactness expresses how close the elements of a cluster are; it is quantified in terms of intra-cluster distances.
Separability expresses how distinct the clusters are; it is quantified through inter-cluster distances.
Goal: high compactness and high separability.
(Cluster distances: from bioinfo.cipf.es/courses/mda08/_media/clustering_validation.pdf)

##### Davies-Bouldin Index
Within-cluster scatter for the ith cluster:

$$s_i = \frac{1}{card(\Omega_i)}\sum_{\mathbf{x} \in \Omega_i} \|\mathbf{x} - \mathbf{v}_i\|^2$$

Distance between the prototypes of clusters i and j: $d_{ij} = \|\mathbf{v}_i - \mathbf{v}_j\|^2$. Define the ratio

$$r_i = \max_{j,\, j \neq i} \frac{s_i + s_j}{d_{ij}}$$

and then the sum

$$r = \frac{1}{c}\sum_{i=1}^{c} r_i$$

The "optimal" ("correct") number of clusters c is the one for which the value of r attains its minimum.
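A minimal sketch of the Davies-Bouldin computation as defined above (an illustration only; note that this slide's variant uses squared distances, while library implementations such as scikit-learn's `davies_bouldin_score` typically use non-squared ones):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index following the definition above:
    squared within-cluster scatter s_i, squared prototype distance d_ij,
    r_i = max_j (s_i + s_j) / d_ij, and r = mean of the r_i.
    Smaller values indicate a better partition."""
    clusters = np.unique(labels)
    # cluster prototypes (centroids) v_i
    V = np.array([X[labels == lab].mean(axis=0) for lab in clusters])
    # within-cluster scatter s_i = mean squared distance to the prototype
    s = np.array([np.mean(np.sum((X[labels == lab] - v) ** 2, axis=1))
                  for lab, v in zip(clusters, V)])
    c = len(clusters)
    r = 0.0
    for i in range(c):
        ratios = [(s[i] + s[j]) / np.sum((V[i] - V[j]) ** 2)
                  for j in range(c) if j != i]
        r += max(ratios)
    return r / c

# toy usage: two well-separated blobs should give a small index value
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(davies_bouldin(X, labels))
```

In practice the index is evaluated for several candidate values of c and the partition with the smallest value of r is selected, as stated above.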
[Davies-Bouldin index examples; from cs.uef.fi/pages/franti/cluster/Clustering-Part3.ppt and carme2011.agrocampus-ouest.fr/slides/Roux.pdf]

##### Dunn Separation Index
Diameter of a cluster:

$$\Delta(\Omega_i) = \max_{\mathbf{x}, \mathbf{y} \in \Omega_i} \|\mathbf{x} - \mathbf{y}\|$$

Inter-cluster distance:

$$\delta(\Omega_i, \Omega_j) = \min_{\mathbf{x} \in \Omega_i,\, \mathbf{y} \in \Omega_j} \|\mathbf{x} - \mathbf{y}\|$$

$$r = \min_{i}\, \min_{j,\, j \neq i} \frac{\delta(\Omega_i, \Omega_j)}{\max_{k} \Delta(\Omega_k)}$$

##### Xie-Beni Index
Also known as the compactness and separation validity function. The optimal number of clusters corresponds to the lowest value of r, yet the index becomes monotonic when the number of clusters is close to the number of data points.

$$r = \frac{\sum_{k=1}^{N}\sum_{i=1}^{c} u_{ik}^m \,\|\mathbf{x}_k - \mathbf{v}_i\|^2}{N\left\{\min_{i \neq j} \|\mathbf{v}_i - \mathbf{v}_j\|^2\right\}}$$

Find the lowest value of r.

##### Cluster Validity: Divisive Coefficient (DC)
$$DC = \frac{1}{c}\sum_{i=1}^{c} l(i), \qquad l(i) = \left(1 - \frac{md(j) - md(i)}{md(0)}\right) n(i), \qquad l(0) = 0$$

where the sum runs over the c clusters present at the current level of the divisive hierarchy, i is the current object (either a cluster or a singleton), j is the cluster from which the current cluster comes, md(i) is the diameter of object i (the maximum distance between any two of its sub-objects), md(0) is the diameter of the whole data set, and n(i) is the number of sub-objects of i.

Example (from the slides): five objects a-e with the distance matrix

|   | a    | b   | c   | d    | e   |
|---|------|-----|-----|------|-----|
| a | 0.0  | 2.0 | 6.0 | 10.0 | 9.0 |
| b | 2.0  | 0.0 | 5.0 | 9.0  | 8.0 |
| c | 6.0  | 5.0 | 0.0 | 4.0  | 5.0 |
| d | 10.0 | 9.0 | 4.0 | 0.0  | 3.0 |
| e | 9.0  | 8.0 | 5.0 | 3.0  | 0.0 |

The divisive hierarchy splits {a,b,c,d,e} into {a,b} and {c,d,e} (diameter 10.0), then {c,d,e} into {c} and {d,e} (5.0), then {d,e} into {d} and {e} (3.0), and finally {a,b} into {a} and {b} (2.0). For instance, for the cluster {c,d,e}: l(1) = (1 - (10 - 5)/10) · 3 = 1.5; the remaining values are l(2) = 0.4, l(3) = 1.6, l(4) = 0.5, l(5) = l(6) = 0.7, l(7) = l(8) = 0.8, with l(0) = 0. The divisive coefficients at the successive divisions are DC1 = (1.5 + 0.4)/2 = 0.95, DC2 = 0.83, DC3 = 0.575, and DC4 = 0.70.
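Before turning to the fuzzy-partition indices that follow, here is a minimal sketch of the Xie-Beni index defined above (an illustration only; the fuzzifier m = 2 and the toy memberships in the usage example are assumptions):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni (compactness and separation) index.

    X : (N, d) data matrix
    U : (c, N) fuzzy membership matrix (columns sum to 1)
    V : (c, d) cluster prototypes
    m : fuzzifier
    Lower values indicate a better partition.
    """
    N = X.shape[0]
    # compactness: sum_k sum_i u_ik^m * ||x_k - v_i||^2
    sq_dist = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, N)
    compactness = np.sum((U ** m) * sq_dist)
    # separation: N * min_{i != j} ||v_i - v_j||^2
    c = V.shape[0]
    proto_sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    min_sep = np.min(proto_sq[~np.eye(c, dtype=bool)])
    return compactness / (N * min_sep)

# toy usage with an assumed fuzzy 2-partition of 4 points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
V = np.array([[0.05, 0.0], [5.05, 5.0]])
U = np.array([[0.95, 0.9, 0.05, 0.1],
              [0.05, 0.1, 0.95, 0.9]])
print(xie_beni(X, U, V))  # small value: compact, well-separated clusters
```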
##### Cluster Validity: Fuzzy Clustering
Partition coefficient:

$$P_1 = \frac{1}{N}\sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^2$$

Partition entropy:

$$P_2 = -\frac{1}{N}\sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}\log_a u_{ik}$$

Sugeno-Takagi:

$$P_3 = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^m\left(\|\mathbf{x}_k - \mathbf{v}_i\|^2 - \|\mathbf{v}_i - \bar{\mathbf{v}}\|^2\right)$$

##### Degree of Separation - Bezdek
The degree of separation between two fuzzy clusters u1 and u2 is the scalar

$$Z(U; 2) = 1 - \bigvee_{k=1}^{n}\left(u_{1k} \wedge u_{2k}\right)$$

and its generalization to c fuzzy clusters is

$$Z(U; c) = 1 - \bigvee_{k=1}^{n}\left(\bigwedge_{i=1}^{c} u_{ik}\right), \qquad U \in M_{fc}$$

Example: c = 2 and two different fuzzy 2-partitions of X, U (in which most membership values equal 1/2) and V (a nearly hard partition with a single column of 1/2 memberships), give Z(U; 2) = Z(V; 2) = 0.50. U and V are very different, so Z does not distinguish between the two partitions.

##### Partition Coefficient - Bezdek
The partition coefficient (PC) measures the amount of "overlap" between fuzzy clusters, where U is a fuzzy c-partition of n data points. The partition coefficient F is the scalar

$$F(U; c) = \frac{\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}^2}{n}$$

The value of F(U; c) depends on all c × n elements of U, in contrast to Z(U; c), which depends on just one. Example, for the same partitions: F(U; 2) = 0.510 and F(V; 2) = 0.990, so F gives an accurate indication of the character of both partitions (the most uncertain and the nearly certain one).

[Figure from Bezdek.]

##### Partition Entropy - Bezdek
Let {Ai | 1 ≤ i ≤ c} denote a c-partition of the events of any sample space connected with an experiment, and let pT = (p1, p2, …, pc) denote a probability vector associated with the {Ai}. The pair ({Ai}, p) is called a finite probability scheme for the experiment; the ith component of p is the probability of event Ai, and c is called the length of the scheme. Note: here c does NOT indicate the number of clusters.

Our aim is to find a measure h(p) of the amount of uncertainty associated with each state:
• h(p) should be maximal for p = (1/c, …, 1/c)
• h(p) should be minimal for p = (0, 1, 0, …), i.e., for any statistically certain partition

The entropy of the scheme is defined as

$$h(\mathbf{p}) = -\sum_{i=1}^{c} p_i \log_a(p_i)$$

with the convention $p_i \log_a(p_i) = 0$ whenever $p_i = 0$.

The partition entropy of any fuzzy c-partition $U \in M_{fc}$ of X, where |X| = n and 1 ≤ c ≤ n, is

$$H(U; c) = -\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}\log_a(u_{ik})$$

Let $U \in M_{fc}$ be a fuzzy c-partition of n data points. Then for 1 ≤ c ≤ n and a ∈ (1, ∞):
• 0 ≤ H(U; c) ≤ log_a(c)
• H(U; c) = 0 iff U ∈ M_co is hard
• H(U; c) = log_a(c) iff U = [1/c]

Example, for the partitions U and V above: H(U; c) = 49 log_e(2)/51 = 0.665 and H(V; c) = log_e(2)/51 = 0.013, so U is a very uncertain partition.

We assume that minimization of H(U; c) corresponds to maximizing the amount of information about structural memberships that an algorithm extracts from the data X.
H(U; c) is used for cluster validity as follows: let $\Omega_c \subset M_{fc}$ denote a finite set of "optimal" U's at each c; then, for c = 2, 3, …, n - 1, select

$$\min_{c}\left\{\min_{U \in \Omega_c} H(U; c)\right\}$$

The normalized partition entropy (from Bezdek) is

$$\bar{H}(U; c) = \frac{H(U; c)}{1 - (c/n)}$$

##### Prototypes as Feature Selectors (from Bezdek)

| Symptom | Feature center v1j (Hernia) | Feature center v2j (Gallstones) | Absolute difference |
|---------|-----------------------------|---------------------------------|---------------------|
| 1  | 0.57 | 0.27 | 0.30 |
| 2  | 0.98 | 0.67 | 0.31 |
| 3  | 0.06 | 0.93 | 0.87 |
| 4  | 0.22 | 0.55 | 0.33 |
| 5  | 0.17 | 0.10 | 0.07 |
| 6  | 0.77 | 0.84 | 0.07 |
| 7  | 0.42 | 0.05 | 0.37 |
| 8  | 0.39 | 0.84 | 0.45 |
| 9  | 0.48 | 0.04 | 0.44 |
| 10 | 0.02 | 0.16 | 0.14 |
| 11 | 0.12 | 0.25 | 0.13 |

##### Cluster Errors Indicate the Best Features (from Bezdek)

| Symptoms used | Cluster error E(U) |
|---------------|--------------------|
| {1-11}    | 23 |
| {3}       | 23 |
| {3, 8}    | 23 |
| {3, 9}    | 36 |
| {3, 8, 9} | 36 |

##### Takagi-Sugeno Index
It is of particular use in identifying non-linear interactions between variables. It deals with a partition matrix and also involves the data:

$$P_3 = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^m\left(\|\mathbf{x}_k - \mathbf{v}_i\|^2 - \|\mathbf{v}_i - \bar{\mathbf{v}}\|^2\right)$$

##### Random Sampling
Two-phase clustering:
(a) Random sampling of the data (question: how large should the sample be to become representative of the data and of the structure we are interested in finding?)
(b) Clustering of the prototypes found in the samples

##### Cluster Validity
• Cluster validity is very hard to check
• Without good validation, any clustering result is potentially just random

##### References
Anderberg MR. 1973. Cluster Analysis for Applications, Academic Press
Bezdek JC. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press
Devijver PA and Kittler J (eds.). 1987. Pattern Recognition Theory and Applications, Springer-Verlag
Dubes R. 1987. How many clusters are best? An experiment. Pattern Recognition, 20(6), 645-663
Duda RO, Hart PE and Stork DG. 2001. Pattern Classification, 2nd edition, J. Wiley
Dunn JC. 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. of Cybernetics, 3(3), 32-57
Jain AK, Murty MN and Flynn PJ. 1999. Data clustering: a review, ACM Computing Surveys, 31(3), 264-323
Kaufman L and Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis, J. Wiley
Kohonen T. 1995. Self-Organizing Maps, Springer-Verlag
Sammon JW Jr. 1969. A nonlinear mapping for data structure analysis. IEEE Trans. on Computers, C-18(5), 401-409
Xie XL and Beni G. 1991. A validity measure for fuzzy clustering, IEEE Trans. on Pattern Analysis and Machine Intelligence, 13, 841-847
Webb A. 2002. Statistical Pattern Recognition, 2nd edition, J. Wiley