#### Transcript Fuzzy k

A Toolkit for Remote Sensing Enviroinformatics Clustering Fazlul Shahriar, George Bonev Advisors: Michael Grossberg, Irina Gladkova, Srikanth Gottipati Clustering Module: Issues: k-means - This is a top-down clustering algorithm which attempts to find k representative centers for the data. The initial means are selected from the training data itself. Remotely sensing data is typically vast. Data size requires advanced tools to explore them semi-automatically. Clustering is one such tool. Expectation-Maximization - Estimates the means and covariances of components in Gaussian Mixture Model. Implementation: Many clustering algorithms have been proposed in the literature but they are dispersed in multiple libraries in different languages. Hence it becomes difficult to test these algorithms on applications at hand. Our goal is to create a single library (platform independent) so that users can test them on remote sensing data To accomplish this, we choose Python programming language which gives a MATLAB-like interface and and at the same time lends to deal with large databases Fuzzy k-means - This is a top-down clustering algorithm which attempts to find k representative centers for the data. The initial means are selected from the training data itself. This algorithm uses a slightly different gradient search than the simple standard k-means algorithm, but generally yields the same final solution. Competitive learning - Competitive learning clustering, where the nearest cluster center is updated according to the position of a randomly selected training pattern. Leader follower - Basic leader-follower clustering, which is similar to competitive learning but additionally generates a new cluster center whenever a new input pattern differs by more than threshold distance \theta from existing clusters. Clustering obtained using EM algorithm where 5 clusters were specified and run with a random initial staring point. The algorithm usually gets stuck in a local minima. Clustering obtained with preprocessing step using whitening followed by EM algorithm ADDC (agglomerative clustering) - An on-line (simple-pass) clustering algorithm which accepts a single sample at each step, updates the cluster centers and generates new centers as needed. The algorithm is efficient in that it generates the cluster centers with a single pass of the data. Furthermore, Python allows easy integration with C/C++/R libraries. DSLVQ (distinction sensitive linear vector quantization) - Performs earning vector quantization (i.e., represents a data set by a small number of cluster centers) using a distinction or classification criterion rather than a traditional sum-square-error criterion. Minimum spanning tree (undirected) - Builds a minimum spanning tree for a data set based on nearest neighbors. Connected components - Finds connected components for a data set based on nearest neighbor. Returns a list of the connected components of the given graph. Graph cut - Graph cut clustering is achieved by cutting edges of the graph to form a good set of connected components such that the weights of within-components edges will be minimized compared to across-component edges. Spectral clustering - Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. Code fragment from clustering toolkit IPython - MATLAB-like interface for Python Clustering obtained using K-means algorithm where 5 clusters were specified and run with a random initial staring point. Clustering obtained with preprocessing step using whitening followed by K-means algorithm HDR (hierarchical dimensionality reduction) - Clusters similar features so as to reduce the dimensionality of the data. SOHC (stepwise optimal hierarchical clustering) - Bottom-up clustering. The algorithm starts by assuming each training point is its own cluster and then iteratively merges the two clusters that change a clustering criterion the least, until the desired number of clusters, k are formed. Utility Functions: Make graph (similarity matrix) - Gen a set of data points A, the similarity matrix may be defined as a matrix S where each elements represents a measure of the similarity between points i,j in A. Normalization (standard deviation based) - Normalize a group of observations on a per feature basis. This is done by dividing each feature by its standard deviation across all observations. UniqueRand - Generates unique set of random points drawn from N(0,1) Modes obtained during the mean shift algorithm. Red dots represent the local peaks of the density estimate of the data Clustering obtained using a combination of mean shift and connected components algorithms Training - Nearest neighbor classifier is used to classify test data set with the clustering obtained from trained data set. 3-d cloud of points which could easily be clustered using a parametric method like the ExpectationMaximization (EM) algorithm 2-d cloud of points where clustering using normal distribution based methods could fail while methods like spectral and geometric clustering algorithms could do a better job UniqueVector - Computes the unique set of feature vectors from a given set of feature vectors. Whitening transform - Performed on d-dimensional data set, it first subtracts the sample mean from each point, and then multiplies the data set by inverse of the square root of the covariance matrix Validation Module: Cross validation – statistical method for validating a predictive model. Subsets of the data are held out, to be used as validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy. Bootstrap - statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter. Jackknife - To estimate the bias and standard error in a statistic, when a random sample of observations is used to calculate it. The basic idea behind the jackknife estimator lies in systematically recomputing the statistic estimate leaving out one observation at a time from the sample set. Clustering obtained on 2-d cloud of points by running mean shift procedure with various initializations Trajectories of the mean shift procedures drawn over the density estimate computed over the same data set. The peaks retained for final classification are marked with red dots. BIC - In parametric methods, there might be various candidate models each with a different number of parameters to represent a data set. The Bayesian information criterion is a useful statistical criterion for model selection for parametric methods. Physics based cluster labeling: 1 Physics based cluster labeling: 2 Unsupervised nonparametric classification Colors are the same as used in the scatter plot above MODIS cloud classification over the eastern part of United States. AIC – Tool for nonparametric model selection. Given a data set, several competing models may be ranked according to their AIC, with the one having the lowest AIC being the best. This research has been funded by NOAA-CREST grant # NA06OAR4810162