ClusterAnalysisPart2

Download Report

Transcript ClusterAnalysisPart2

Advanced Statistical Methods for Research Math 736/836
Cluster Analysis
Part 2:
K-means clustering.
Watch out for the occasional paper clip!
©2009 Philip J. Ramsey, Ph.D.
1
 K-means clustering is quite different in its approach from
hierarchical clustering. It is based on disjoint clusters.
 K-means clustering tends to have the same biases as Ward’s method
in terms of characteristics of the clusters.
 In K-means clustering we may start with a predetermined number of
clusters K and assign observations to the clusters in such a way that the
within cluster variation is minimized and the between cluster variation
is maximized.
 It is for this reason that Everitt and Dunn refer to this as an
optimization method. The idea is to minimize the distances between
observations within each cluster while maximizing the distance
between the K clusters.
 This method should typically only be used for larger datasets, say at
least 200. The solutions are highly unstable with smaller datasets.
©2009 Philip J. Ramsey, Ph.D.
2
 The basic K-means cluster algorithm is to start with some prespecified number of clusters or use an algorithm to pick an optimal
number of clusters if the user has no a priori idea how many clusters
should exist in the data.
 The first step is to designate a set of N seed values and assign
observations to clusters based on the nearness of the seed values.
 Next compute the initial K centroids based on the groupings from the
seed points. Then points are reassigned based upon nearness to one of
the centroids.
 Keep reassigning observations to clusters based upon the nearness of
the centroids. Each time a point is moved from one cluster to another,
recalculate the centroids for the clusters involved in the swap.
 Continue the process until no further reassignments take place.
©2009 Philip J. Ramsey, Ph.D.
3
 Example: We use the JMP sample data set Cytometry.JMP. We
will perform K-means clustering with K = 4 clusters specified. The
variables CD3 and CD8 are used for the initial clustering attempt.
©2009 Philip J. Ramsey, Ph.D.
4
 Example: In K-Means cluster
report window we are given a
number of options under the Control
Panel, however the JMP defaults are
recommended – see the JMP help
file if you wish to change some of
the defaults. We can click on “Go”
to have JMP find a solution, or the
user can click on “Step” to see each
iteration of the process.
©2009 Philip J. Ramsey, Ph.D.
5
 Example: After 16 iterations
JMP converges to a solution for
4 clusters.
©2009 Philip J. Ramsey, Ph.D.
6
 Example: Under the hotspot at the top of the report you can choose
to see a biplot that displays the variable rays and the clusters.
©2009 Philip J. Ramsey, Ph.D.
7
 Example: An important question is how many clusters to begin
without a scientific basis for specifying the number of clusters. One
good approach is try some scatter plots and see if any natural
clustering seems to occur. Of course this is limited to 2D or 3D plots
but is a valid basis when no a priori information exists.
Bivariate Fit of CD8 By CD3
A Fit Y by X plot
indicates that 4 clusters
may exist. A 3D Plot in
CD3, CD8, and CD4
also indicated 4
clusters.
500
400
CD8
300
200
100
0
0
100
200
300
400
500
CD3
©2009 Philip J. Ramsey, Ph.D.
8
 Example: A 3D plot of CD3, CD8, and MCB indicated the
possibility of 5 clusters.
©2009 Philip J. Ramsey, Ph.D.
9
 Example: K-means clustering was repeated with CD3, CD8, and
MCB and 5 clusters specified.
©2009 Philip J. Ramsey, Ph.D.
10
 Example: Below is a 3D biplot for the five clusters and three
variables.
©2009 Philip J. Ramsey, Ph.D.
11
 JMP provides a fairly modern version of K-means clustering based
on Self Organizing Maps (SOMs). K-means clustering can be
thought of as a special case of SOMs methodology. JMP provides a
limited but useful application of SOMs to cluster analysis.
 SOMs are a data visualization technique which reduce the
dimensions of data through the use of self-organizing neural
networks. SOMs reduce dimensions by producing a 2D map which
plots the similarities of the data by grouping similar data items
together.
The goal of a SOM is to form clusters in a particular layout on a
cluster grid, such that points in clusters that are near each other in the
SOM grid are also near each other in multivariate space.
 In classical K-means clustering, the structure of the clusters is
arbitrary, but in SOMs the clusters have the grid structure. This grid
structure may be of value in interpreting the clusters.
©2009 Philip J. Ramsey, Ph.D.
12
 Example: We continue with the Cytometry example.
Iterative Clustering
Control Panel
Self Organizing Map
The bandwidth option is a
weighting function used in
estimating the clusters. JMP
help does not provide details on
its form. However, the choice of
bandwidth can impact the final
cluster. Try various values of
the bandwidth and see if a more
interpretable cluster analysis
occurs.
Standardize data by Std Dev
Color while clustering
Shift distances using sampling rates
Use within-cluster std deviations
Bandwidth:
0.75
Cluster Summary
Step
Criterion
18
0
Cluster
Count
Max Dist
1
1785
2.75310404
2
988
2.21211425
3
815
1.77335177
4
1412
1.30760703
Cluster Means
Cluster
CD3
CD8
1
314.022388
291.282325
2
344.491485
117.459102
3
139.366061
135.557004
4
218.502141
108.402641
Cluster Standard Deviations
Cluster
©2009 Philip J. Ramsey, Ph.D.
CD3
CD8
1
22.1007724
25.7207367
2
28.5748537
37.0754192
3
33.4412448
40.2863212
4
23.7036989
35.3287975
13
 Example: Below is the initial SOM layout of the four cluster
locations for the Cytometry example.
Biplot
2
CD8
1
3
1
4
2
P rin 2
0
-1
CD3
-2
-3
-4
-3
-2
-1
0
1
2
3
4
Prin 1
Eigenvalues
1.4604529
©2009 Philip J. Ramsey, Ph.D.
0.5395471
14
 Example: Below is the final SOM for the Cytometry example using
a bandwidth of 0.75 (left biplot) and 0.5 (right biplot). Changing the
bandwidth can significantly change the map.
Biplot
Biplot
2
CD8
2
CD8
1
3
1
1
3
1
4
P rin 2
P rin 2
0
-1
4
0
2
CD3
-1
2
-2
-2
CD3
-3
-3
-4
-3
-2
-1
0
1
2
3
4
-4
-3
-2
0
1
2
3
4
Prin 1
Prin 1
Eigenvalues
Eigenvalues
1.4604529
-1
0.5395471
©2009 Philip J. Ramsey, Ph.D.
1.4604529
0.5395471
15
 Example: Below is the final SOM for the Cytometry example using
a bandwidth of 0.90. Notice that with a high bandwidth we do not
even have four clusters – there is too much smoothing.
Biplot
CD8
2
1
2
P rin 2
4
0
-1
1
3
-2
CD3
-3
-4
-3
-2
-1
0
1
2
3
4
Prin 1
Eigenvalues
1.4604529
©2009 Philip J. Ramsey, Ph.D.
0.5395471
16
 Normal Mixtures is an iterative technique implemented in the Kmeans clustering platform in JMP, although it is not a traditional Kmeans clustering algorithm.
 Both K-means and Hierarchical clustering methods attempt to group
observations into clusters. However the normal mixture approach is
more of an estimation method to characterize the cluster groups.
 If clusters overlap, assigning each observation to one cluster is
problematic. In the overlap areas, there are observations from several
clusters sharing the same space.
 Rather than classifying each observation into a cluster, the mixtures
method estimates the probability that an observation is in each cluster
based upon the assumption of a multivariate normal distribution for
each cluster. The user must specify the number of clusters.
 The method produces fuzzy clusters that may overlap based on the
assignments using the probabilities of cluster membership.
©2009 Philip J. Ramsey, Ph.D.
17
 The concept of normal mixtures is that the observed multivariate
data are generated as a finite mixture of K multivariate normal
distributions. With each distribution contributing some proportion p to
the total number of observations N.
 In cluster analysis the user specifies the number of multivariate
distributions that must be estimated.
 Once the parameters of the K multivariate distributions are
estimated, then it is possible to calculate probabilities that each of the
N observations came from one of the distributions.
 Generally, the observation can be assigned to the cluster for which it
has the highest estimated probability of belonging. In this sense the
observations are not strictly assigned to clusters, but rather can be
assigned classifications based on the probabilities.
 An algorithm estimates the distribution parameters and proportions.
©2009 Philip J. Ramsey, Ph.D.
18
 To illustrate the concept of finite normal mixtures, suppose we have
the bivariate case – there are two variables of interest.
 These two variables take on values that come from one of two
distributions (K = 2) with proportions p and (1-p) respectively.
 For each of the two distributions we have to estimate the centroid μ,
the covariance matrix , and the proportion (unless known in
advance). In clustering we assume the proportions are unknown.
 Our overall bivariate distribution is estimated from the observations
as a combination of the two underlying normal distributions.
f
 X 1, X 2  
p  M V N  μ A ; Σ A   1  p   M V N  μ B ; Σ B 
 For some number of distributions m > 2 and a vector of variables X,
the formula generalizes to
m
f X 

pi  M V N  μ i ; Σ i 
i 1
©2009 Philip J. Ramsey, Ph.D.
19
 The parameters and proportions for the distributions are estimated
using the method of maximum likelihood. We will not delve into the
mathematical details, however a sketch of the approach is provided by
Everitt and Dunn text in the Cluster Analysis chapter.
 Example: We will use the Cytometry data. We have K = 4 clusters
for two variables CD3 and CD8, so we have to estimate m = 4
bivariate normal distributions and proportions. We will use the Kmeans, Normal Mixtures option to estimate the proportions and
parameters of the distributions.
©2009 Philip J. Ramsey, Ph.D.
20
 Example continued: The estimated proportions and distribution
parameters are given below. JMP only provides correlation estimates.
Cluster Summary
Correlations for Normal Mixtures
Step
Criterion
500
3.793e-9
Cluster
Cluster1
Proportion
Mixture Count
1
0.39396573
1905.11684
2
0.17119194
844.329154
3
0.34261552
1686.30176
4
0.09222681
564.252252
Cluster Means
Cluster
CD3
CD8
CD3
CD8
CD3
1.0000
0.3751
CD8
0.3751
1.0000
CD3
CD8
CD3
1.0000
0.0121
CD8
0.0121
1.0000
CD3
CD8
Cluster2
Cluster3
1
218.339067
129.156625
2
339.684674
88.7314432
CD3
1.0000
0.3801
3
318.018017
306.756268
CD8
0.3801
1.0000
4
120.184407
96.4038892
CD3
CD8
CD3
1.0000
0.2413
CD8
0.2413
1.0000
Cluster4
Cluster Standard Deviations
Cluster
CD3
CD8
1
52.5587037
45.1128235
2
25.8393687
33.2087219
3
19.8353249
20.8102507
4
39.0601389
26.6025811
Note: StdDev slightly inflated for regularization.
©2009 Philip J. Ramsey, Ph.D.
21
 Example continued: Below is a 3D view of the mixture density.
Notice that cluster 4 does not resolve due to a low proportion.
#1
#3
#2
#4
©2009 Philip J. Ramsey, Ph.D.
22
 Example continued: Below is a Fit Y by X plot of the clusters
showing 95% density ellipses for each cluster.
Bivariate Fit of CD8 By CD3
500
400
3
CD8
300
200
1
4
100
2
0
0
100
200
300
400
500
CD3
Bivariate Normal Ellipse P=0.900 Cluster==1
©2009 Philip J. Ramsey, Ph.D.Bivariate Normal Ellipse P=0.900
Cluster==2
23
 Example continued: Below is a partial view of the spreadsheet
with the calculated probabilities of cluster membership for the
observations.
©2009 Philip J. Ramsey, Ph.D.
24
 It is possible to use K-means clustering with predetermined or fixed
clusters. As an example, there may be known species that make up a
multivariate dataset and the species are the basis for the clusters.
 The estimated cluster formula could then be saved to the
spreadsheet and then applied to a new set of measurements where the
identity of the species is not known. This takes clustering into the
realm of classification (supervised learning).
 We use the JMP sample dataset OwlDiet. JMP to demonstrate fixed
clustering using Normal Mixtures. The data consists of measurements
on seven species of Malaysian rats that are eaten by owls. There are
two data sets. The first (classification) data set had 179 observations
and contained measures of skull length, teeth row, palentine foramen,
and jaw width for seven known species. The second (observational)
set had 115 observations and no information regarding species. We
will try to identify the unknown species.
©2009 Philip J. Ramsey, Ph.D.
25
 Example: For the owl data we specify species as the “Centers”
variable in the Cluster Analysis launch window. The number of
clusters for K-means clustering is now set by “Species.”
©2009 Philip J. Ramsey, Ph.D.
26
 Example: The observations with no designated species will be
classified into their nearest cluster. Select the “Save Clusters” option
to see the classifications for the observations missing a species
designation.
Cluster Summary
Step
Cluster Summary
Criterion
0
Cluster
0
Criterion
999
3.43e-13
species
Proportion
annandalfi
0.14285714
0
1
2 argentiventer
0.14285714
0
3
exulans
0.14285714
0
3
4
rajah
0.14285714
0
5
surifer
0.14285714
1
Mixture Count
Step
Cluster
species
Proportion
Mixture Count
annandalfi
0.36207334
28.4645603
2 argentiventer
0.12937488
28.680347
exulans
0.05468071
13.4915849
4
rajah
0.17687071
0
5
surifer
0.01698639
3.8828717
52
6
tiomanicus
0.14285714
0
6
tiomanicus
0.04568538
6.96155194
7
whiteheadi
0.14285714
0
7
whiteheadi
0.2143286
160.519084
Cluster centers defined by
species
©2009 Philip J. Ramsey, Ph.D.
Cluster centers defined by
species
27
 Example : Notice in the partial spreadsheet view that the
observations without a species are assigned a cluster (or species).
©2009 Philip J. Ramsey, Ph.D.
28
 Example : If you would like the Cluster column to display species
designations, change the column data type to character (Column Info
window), while the cluster column is highlighted click on the Column
menu and select “recode.” Now recode the cluster numbers to
species.
©2009 Philip J. Ramsey, Ph.D.
29
 Example : A biplot with
the 7 clusters marked on
the plot. Palantine
foramen is not highly
correlated with the other
variables. The species
whiteheadi (cluster 7) has
a very small palatine
foramen. The species
exulans (cluster 3) have a
small skull length, teeth,
and jaw. Notice
considerable overlap in
the clusters.
©2009 Philip J. Ramsey, Ph.D.
Exulans
palatine foramen
jaw length
teeth row
skull length
whiteheadi
30
 Example : A 3D biplot with the 7 clusters marked on the plot as
ellipsoids.
Scatterplot 3D
Exulans
whiteheadi
Principal Components
©2009 Philip J. Ramsey, Ph.D.
Prin1
Prin2
Prin3
31