Transcript Chapter 5

Lecture 6
Statistical Lecture
─ Cluster Analysis
Cluster Analysis
• Grouping similar objects to produce a
classification
• Useful when the structure of the data is
unknown a priori
• Involves assessing the relative distances
between points
Clustering Algorithms
• Partitioning :
Divide the data set into k clusters, where k
needs to be specified beforehand, e.g.
k-means.
Clustering Algorithms
• Hierarchical :
– Agglomerative methods :
Start with each object forming its own little
cluster, and then successively merge
clusters until only one large cluster is left
– Divisive methods :
Start by considering the whole data set as
one cluster, and then split up clusters
until each object is separate
Caution
• Most users are interested in the main structure
of their data, consisting of a few large clusters
• When forming larger clusters, agglomerative
methods may make wrong decisions in the
early steps. (Once one step is wrong, the
whole thing is wrong)
• For divisive methods, the larger clusters are
determined first, so they are less likely to
suffer from wrong decisions in earlier steps
Agglomerative Hierarchical
Clustering Procedure
(1) Each observation begins in a cluster by
itself
(2) The two closest clusters are merged to form
a new cluster that replaces the two old
clusters
(3) Repeat (2) until only one cluster is left
The various clustering methods differ in how
the distance between two clusters is computed.
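As a rough end-to-end illustration of steps (1)-(3), here is a sketch using SciPy's hierarchical-clustering routines (an assumption for illustration, not the lecture's software; the toy coordinates are invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four invented 2-D observations, forming two obvious groups
X = np.array([[0.0, 0.0],
              [0.2, 0.1],
              [5.0, 5.1],
              [5.2, 4.9]])

# Steps (1)-(3): start from singleton clusters and repeatedly merge
# the two closest clusters until one cluster is left; "average"
# selects the between-cluster distance rule (average linkage)
Z = linkage(X, method="average")

# Cut the resulting hierarchy into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2]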
Remarks
• For coordinate data, variables with large
variances tend to have more effect on the
resulting clusters than those with small
variances
• Scaling or transforming the variables might
be needed
• Standardization (standardizing the variables to
mean 0 and standard deviation 1) or
principal components is useful but not
always appropriate
• Outliers should be removed before analysis
Remarks(cont.)
• Nonlinear transformations of the variables
may change the number of population clusters
and should therefore be approached with
caution
• For most applications, the variables should be
transformed so that equal differences are of
equal practical importance
• An interval scale of measurement is required
if raw data are used as input. Ordinal or
ranked coordinate data are generally not
appropriate
Notation
n     number of observations
v     number of variables if data are coordinates
G     number of clusters at any given level of the hierarchy
x_i   the ith observation
C_k   the kth cluster, a subset of {1, 2, …, n}
N_k   number of observations in C_k
Notation(cont.)
x̄      sample mean vector
x̄_k    mean vector for cluster C_k
||x||   Euclidean length of the vector x, that is,
        the square root of the sum of the squares
        of the elements of x
T      = Σ_{i=1}^{n} ||x_i − x̄||²
W_k    = Σ_{i∈C_k} ||x_i − x̄_k||²
Notation(cont.)
P_G      Σ W_j, where the summation is over the G
         clusters at the Gth level of the hierarchy
B_kl     W_m − W_k − W_l, if C_m = C_k ∪ C_l
d(x, y)  any distance or dissimilarity measure
         between observations or vectors x and y
D_kl     any distance or dissimilarity measure
         between clusters C_k and C_l
Clustering Method
─ Average Linkage
The distance between two clusters is defined by
D_kl = Σ_{i∈C_k} Σ_{j∈C_l} d(x_i, x_j) / (N_k N_l)
If d(x, y) = ||x − y||², then
D_kl = ||x̄_k − x̄_l||² + W_k/N_k + W_l/N_l
The combinatorial formula is
D_jm = (N_k D_jk + N_l D_jl) / N_m
if C_m = C_k ∪ C_l
Average Linkage
• The distance between clusters is the average
distance between pairs of observations, one
in each cluster
• It tends to join clusters with small variance
and is slightly biased toward producing
clusters with the same variance
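As a sketch, D_kl can be computed directly from the definition with NumPy (the helper name and toy clusters are mine):

import numpy as np

def average_linkage_dist(Xk, Xl, d=lambda a, b: np.linalg.norm(a - b)):
    # D_kl = sum of d(x_i, x_j) over all pairs, divided by N_k * N_l
    return sum(d(xi, xj) for xi in Xk for xj in Xl) / (len(Xk) * len(Xl))

Xk = np.array([[0.0, 0.0], [1.0, 0.0]])
Xl = np.array([[4.0, 0.0]])
print(average_linkage_dist(Xk, Xl))   # (4 + 3) / 2 = 3.5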
Centroid Method
The distance between two clusters is defined
by
D_kl = ||x̄_k − x̄_l||²
If d(x, y) = ||x − y||², then the combinatorial
formula is
D_jm = (N_k D_jk + N_l D_jl) / N_m − N_k N_l D_kl / N_m²
Centroid Method
• The distance between two clusters is
defined as the squared Euclidean distance
between their centroids or means
• It is more robust to outliers than most other
hierarchical methods but in other respects
may not perform as well as Ward’s method
or average linkage
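In code, the centroid distance is just the squared Euclidean norm of the difference of the cluster means; a sketch with invented clusters:

import numpy as np

Xk = np.array([[0.0, 0.0], [2.0, 0.0]])   # mean (1, 0)
Xl = np.array([[4.0, 0.0]])               # mean (4, 0)
diff = Xk.mean(axis=0) - Xl.mean(axis=0)
print(diff @ diff)   # D_kl = ||x̄_k − x̄_l||² = 9.0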
Complete Linkage
The distance between two clusters is defined
by
D_kl = max_{i∈C_k} max_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
D_jm = max(D_jk, D_jl)
Complete Linkage
• The distance between two clusters is the
maximum distance between an observation
in one cluster and an observation in the
other cluster
• It is strongly biased toward producing
clusters with roughly equal diameters and
can be severely distorted by moderate
outliers
Single Linkage
The distance between two clusters is defined
by
D_kl = min_{i∈C_k} min_{j∈C_l} d(x_i, x_j)
The combinatorial formula is
D_jm = min(D_jk, D_jl)
Single Linkage
• The distance between two clusters is the
minimum distance between an observation
in one cluster and an observation in the
other cluster
• It sacrifices performance in the recovery of
compact clusters in return for the ability to
detect elongated and irregular clusters
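A sketch showing that single and complete linkage differ only in taking the min or the max over the same set of pairwise distances (helper name and data are mine):

import numpy as np

def pairwise(Xk, Xl):
    return [np.linalg.norm(xi - xj) for xi in Xk for xj in Xl]

Xk = np.array([[0.0, 0.0], [1.0, 0.0]])
Xl = np.array([[4.0, 0.0]])
print(min(pairwise(Xk, Xl)))   # single linkage   D_kl = 3.0
print(max(pairwise(Xk, Xl)))   # complete linkage D_kl = 4.0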
Ward’s Minimum-Variance Method
The distance between two clusters is defined
by
D_kl = B_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
If d(x, y) = ||x − y||², then the combinatorial
formula is
D_jm = ((N_j + N_k) D_jk + (N_j + N_l) D_jl − N_j D_kl) / (N_j + N_m)
Ward’s Minimum-Variance Method
• The distance between two clusters is the
ANOVA sum of squares between the two
clusters added up over all the variables
• It tends to join clusters with a small number
of observations
• It is strongly biased toward producing
clusters with roughly the same number of
observations
• It is also very sensitive to outliers
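The Ward distance above can be checked numerically against its definition; a sketch with invented clusters:

import numpy as np

def ward_dist(Xk, Xl):
    # D_kl = ||x̄_k − x̄_l||² / (1/N_k + 1/N_l)
    diff = Xk.mean(axis=0) - Xl.mean(axis=0)
    return (diff @ diff) / (1.0 / len(Xk) + 1.0 / len(Xl))

Xk = np.array([[0.0, 0.0], [2.0, 0.0]])   # mean (1, 0), N_k = 2
Xl = np.array([[4.0, 0.0]])               # mean (4, 0), N_l = 1
print(ward_dist(Xk, Xl))                  # 9 / (1/2 + 1) = 6.0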
Assumptions for WMVM
• Multivariate normal mixture
• Equal spherical covariance matrices
• Equal sampling probabilities
Remarks
• Single linkage tends to lead to the formation
of long straggly clusters
• Average linkage, complete linkage, and Ward’s
method often find spherical clusters even
when the data appear to contain clusters of
other shapes
McQuitty’s Similarity Analysis
The combinatorial formula is
D_jm = (D_jk + D_jl) / 2
Median Method
If d(x, y)=||x – y||2, then the combinatorial
formula is
D_jm = (D_jk + D_jl) / 2 − D_kl / 4
Kth-nearest Neighbor Method
• Prespecify k
• Let rk(x) be the distance from point x to the
kth nearest observation
• Consider a closed sphere centered at x with
radius rk(x), say Sk(x)
Kth-nearest Neighbor Method
• The estimated density at x is defined by
f(x) = (# of observations within S_k(x)) / (volume of S_k(x))
• For any two observations x_i and x_j,
d*(x_i, x_j) = (1/2)(1/f(x_i) + 1/f(x_j))  if d(x_i, x_j) ≤ max(r_k(x_i), r_k(x_j))
d*(x_i, x_j) = ∞  otherwise
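A sketch of r_k(x) and the density estimate for one-dimensional data (the function names are mine, not SAS's; in 1-D the "sphere" S_k(x) is an interval of length 2 r_k(x)):

import numpy as np

def r_k(x, data, k):
    # distance from point x to its kth nearest observation
    return np.sort(np.abs(data - x))[k - 1]

def f_hat(x, data, k):
    # observations within the closed sphere S_k(x), over its volume
    r = r_k(x, data, k)
    count = np.sum(np.abs(data - x) <= r)
    return count / (2 * r)

data = np.array([0.0, 0.1, 0.2, 5.0])
print(f_hat(0.1, data, k=2))   # 3 points within r=0.1 -> 3 / 0.2 = 15.0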
K-Means Algorithm
• It is intended for use with large data sets,
from approximately 100 to 100000
observations
• With small data sets, the results may be
highly sensitive to the order of the
observations in the data set
• It combines an effective method for finding
initial clusters with a standard iterative
algorithm for minimizing the sum of
squared distances from the cluster means
K-Means Algorithm
• Specify the number of clusters, say k
• A set of k points called cluster seeds is
selected as a first guess of the means of the k
clusters
• Each observation is assigned to the nearest
seed to form temporary clusters
• The seeds are then replaced by the means of
the temporary clusters
• The process is repeated until no further
changes occur in the clusters
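A bare-bones sketch of these steps in NumPy (the seeds are assumed given here, and no cluster is allowed to go empty; the FASTCLUS seed-selection rules follow on the next slides):

import numpy as np

def k_means(X, seeds, max_iter=10):
    means = seeds.astype(float).copy()
    for _ in range(max_iter):
        # assign every observation to the nearest current mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each seed by the mean of its temporary cluster
        new_means = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(means))])
        if np.allclose(new_means, means):   # no further change: stop
            break
        means = new_means
    return labels, means

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]])
labels, means = k_means(X, seeds=X[[0, 2]])
print(labels)   # e.g. [0 0 1 1]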
Cluster Seeds
• Select the first complete (no missing values)
observation as the first seed
• The next complete observation that is
separated from the first seed by at least the
prespecified distance becomes the second
seed
• Later observations are selected as new seeds
if they are separated from all previous seeds
by at least the radius, as long as the
maximum number of seeds is not exceeded
Cluster Seeds
If an observation is complete but fails to
qualify as a new seed, two tests can be made
to see if the observation can replace one of
the old seeds
Cluster Seeds(cont.)
• An old seed is replaced if the distance
between the observation and the closest seed
is greater than the minimum distance between
seeds. The seed that is replaced is selected
from the two seeds that are closest to each
other. The seed that is replaced is the one of
these two with the shortest distance to the
closest of the remaining seeds when the other
seed is replaced by the current observation
Cluster Seeds(cont.)
• If the observation fails the first test for seed
replacement, a second test is made. The
observation replaces the nearest seed if the
smallest distance from the observation to all
seeds other than the nearest one is greater
than the shortest distance from the nearest
seed to all other seeds. If this test is failed,
go on to the next observation.
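A sketch of just the basic radius rule for picking seeds; the two replacement tests above are omitted for brevity (the function name and data are mine):

import numpy as np

def select_seeds(X, k, radius):
    seeds = [X[0]]                       # first complete observation
    for x in X[1:]:
        if len(seeds) == k:              # maximum number of seeds reached
            break
        # accept x as a new seed only if it is at least `radius`
        # away from every seed chosen so far
        if all(np.linalg.norm(x - s) >= radius for s in seeds):
            seeds.append(x)
    return np.array(seeds)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 1.0]])
print(select_seeds(X, k=3, radius=1.0))   # keeps rows 0, 2, 3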
Dissimilarity Matrices
n  n dissimilarity matrix
 0

d 2, 1

0


d 3, 1 d 3, 2 0

 




d n, 1 d n, 2  d n, n  1 0
where d(i, j)=d(j, i) measures the “difference”
or dissimilarity between the objects i and j.
Dissimilarity Matrices
d usually satisfies
• d(i, i) = 0
• d(i, j)  0
• d(i, j) = d(j, i)
Dissimilarity
Interval-scaled variables : continuous
measurements on a (roughly) linear scale
(temperature, height, weight, etc.)
• d(i, j) = sqrt( Σ_{f=1}^{v} (x_if − x_jf)² )
(Euclidean distance)
• d(i, j) = Σ_{f=1}^{v} |x_if − x_jf|
(Manhattan distance)
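In code, for two invented observation vectors:

import numpy as np

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([2.0, 0.0, 3.0])

print(np.sqrt(np.sum((xi - xj) ** 2)))  # Euclidean: sqrt(1 + 4 + 0) ≈ 2.236
print(np.sum(np.abs(xi - xj)))          # Manhattan: 1 + 2 + 0 = 3.0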
Dissimilarity(cont.)
• The choice of measurement units strongly
affects the resulting clustering
• The variable with the largest dispersion will
have the largest impact on the clustering
• If all variables are considered equally
important, the data need to be standardized
first
Standardization
• Mean absolute deviation (robust)
z_if = (x_if − m_f) / s_f
where m_f = (1/n) Σ_{i=1}^{n} x_if and s_f = (1/n) Σ_{i=1}^{n} |x_if − m_f|
• Median absolute deviation (robust)
z_if = (x_if − m_f) / s_f
where m_f = median{x_if} and s_f = (1/n) Σ_{i=1}^{n} |x_if − m_f|
• Usual standard deviation
z_if = (x_if − m_f) / s_f
where m_f = (1/n) Σ_{i=1}^{n} x_if and s_f = sqrt( (1/(n−1)) Σ_{i=1}^{n} (x_if − m_f)² )
Continuous Ordinal Variables
These are continuous measurements on an
unknown scale, or where only the ordering is
known but not the actual magnitude.
• Replace the x_if by their ranks r_if ∈ {1, …, M_f}
• Transform the scale to [0, 1] as follows :
z_if = (r_if − 1) / (M_f − 1)
• Compute the dissimilarities as for interval-scaled variables
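A sketch of this rank transformation, with scipy.stats.rankdata supplying the ranks:

import numpy as np
from scipy.stats import rankdata

x = np.array([3.1, 0.5, 2.7, 9.9])
r = rankdata(x)              # ranks r_if in {1, ..., M_f}
z = (r - 1) / (r.max() - 1)  # transformed to [0, 1]
print(z)                     # [0.6667 0. 0.3333 1.]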
Ratio-Scaled Variables
These are positive continuous measurements on a
nonlinear scale, such as an exponential scale. One
example would be the growth of a bacterial population
(say, with a growth function AeBt).
• Simply as interval-scaled variables, though this is not
recommended as it can distort the measurement scale
• As continuous ordinal data
• By first transforming the data (perhaps by taking
logarithms), and then treating the results as
interval-scaled variables
Discrete Ordinal Variables
A variable of this type has M possible values
(scores) which are ordered.
The dissimilarities are computed in the same
way as for continuous ordinal variables.
Nominal Variables
• Such a variable has M possible values,
which are not ordered.
• The dissimilarity between objects i and j is
usually defined as
d(i, j) = (# of variables taking different values for i and j) / (total number of variables)
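A sketch of this simple matching dissimilarity (the helper name and example values are mine):

def nominal_dissim(obj_i, obj_j):
    # variables taking different values, over the total number of variables
    diff = sum(a != b for a, b in zip(obj_i, obj_j))
    return diff / len(obj_i)

print(nominal_dissim(["red", "round", "small"],
                     ["red", "square", "large"]))   # 2/3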
Symmetric Binary Variables
Two possible values, coded 0 and 1, which
are equally important (e.g., male and female).
Consider the contingency table of the objects i
and j :

  i \ j    1    0
    1      a    b
    0      c    d

d(i, j) = (b + c) / (a + b + c + d)
Asymmetric Binary Variables
Two possible values, one of which carries
more importance than the other.
The most meaningful outcome is coded as 1,
and the less meaningful outcome as 0.
Typically, 1 stands for the presence of a
certain attribute (e.g., a particular disease),
and 0 for its absence.
Asymmetric Binary Variables
  i \ j    1    0
    1      a    b
    0      c    d

d(i, j) = (b + c) / (a + b + c)
Cluster Analysis of Flying Mileages
Between 10 American Cities
                  ATL   CHI   DEN   HOU    LA   MIA    NY    SF   SEA    DC
ATLANTA             0
CHICAGO           587     0
DENVER           1212   920     0
HOUSTON           701   940   879     0
LOS ANGELES      1936  1745   831  1374     0
MIAMI             604  1188  1726   968  2339     0
NEW YORK          748   713  1631  1420  2451  1092     0
SAN FRANCISCO    2139  1858   949  1645   347  2594  2571     0
SEATTLE          2182  1737  1021  1891   959  2734  2408   678     0
WASHINGTON D.C.   543   597  1494  1220  2300   923   205  2442  2329     0
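The hierarchical analyses on the following pages can be approximated from this matrix with SciPy; a sketch that reproduces the merge order, though not PROC CLUSTER's printed statistics:

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

lower = [
    [587],
    [1212, 920],
    [701, 940, 879],
    [1936, 1745, 831, 1374],
    [604, 1188, 1726, 968, 2339],
    [748, 713, 1631, 1420, 2451, 1092],
    [2139, 1858, 949, 1645, 347, 2594, 2571],
    [2182, 1737, 1021, 1891, 959, 2734, 2408, 678],
    [543, 597, 1494, 1220, 2300, 923, 205, 2442, 2329],
]
D = np.zeros((10, 10))
for i, row in enumerate(lower, start=1):
    D[i, :i] = row
D = D + D.T

# average linkage on the condensed form of the distance matrix
Z = linkage(squareform(D), method="average")
# first merges: NY-DC (205), LA-SF (347), ATL-CHI (587),
# matching the Cluster History on the next page
print(Z[:3])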
The CLUSTER Procedure
Average Linkage Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242
Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2   Norm RMS Dist   Tie
  9  NEW YORK      WASHINGTON D.C.     2   66.7    .        0.1297
  8  LOS ANGELES   SAN FRANCISCO       2   39.2    .        0.2196
  7  ATLANTA       CHICAGO             2   21.7    .        0.3715
  6  CL7           CL9                 4   14.5   3.4       0.4149
  5  CL8           SEATTLE             3   12.4   7.3       0.5255
  4  DENVER        HOUSTON             2   13.9    .        0.5562
  3  CL6           MIAMI               5   15.5   3.8       0.6185
  2  CL3           CL4                 7   16.0   5.3       0.8005
  1  CL2           CL5                10     .   16.0       1.2967

[Dendrogram: Average Linkage Cluster Analysis]
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242
Cluster History
NCL  Clusters Joined                 FREQ   PSF   PST2   Norm Cent Dist   Tie
  9  NEW YORK      WASHINGTON D.C.     2   66.7    .        0.1297
  8  LOS ANGELES   SAN FRANCISCO       2   39.2    .        0.2196
  7  ATLANTA       CHICAGO             2   21.7    .        0.3715
  6  CL7           CL9                 4   14.5   3.4       0.3652
  5  CL8           SEATTLE             3   12.4   7.3       0.5139
  4  DENVER        CL5                 4   12.4   2.1       0.5337
  3  CL6           MIAMI               5   14.2   3.8       0.5743
  2  CL3           HOUSTON             6   22.1   2.6       0.6091
  1  CL2           CL4                10     .   22.1       1.173

[Dendrogram: Centroid Hierarchical Cluster Analysis]
The CLUSTER Procedure
Single Linkage Cluster Analysis
Mean Distance Between Observations = 1417.133
Cluster History
NCL  Clusters Joined                 FREQ   Norm Min Dist   Tie
  9  NEW YORK      WASHINGTON D.C.     2      0.1447
  8  LOS ANGELES   SAN FRANCISCO       2      0.2449
  7  ATLANTA       CL9                 3      0.3832
  6  CL7           CHICAGO             4      0.4142
  5  CL6           MIAMI               5      0.4262
  4  CL8           SEATTLE             3      0.4784
  3  CL5           HOUSTON             6      0.4947
  2  DENVER        CL4                 4      0.5864
  1  CL3           CL2                10      0.6203

[Dendrogram: Single Linkage Cluster Analysis]
The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis
Root-Mean-Square Distance Between Observations = 1580.242
Cluster History
NCL  Clusters Joined                 FREQ   SPRSQ    RSQ    PSF   PST2   Tie
  9  NEW YORK      WASHINGTON D.C.     2   0.0019   .998   66.7    .
  8  LOS ANGELES   SAN FRANCISCO       2   0.0054   .993   39.2    .
  7  ATLANTA       CHICAGO             2   0.0153   .977   21.7    .
  6  CL7           CL9                 4   0.0296   .948   14.5   3.4
  5  DENVER        HOUSTON             2   0.0344   .913   13.2    .
  4  CL8           SEATTLE             3   0.0391   .874   13.9   7.3
  3  CL6           MIAMI               5   0.0586   .816   15.5   3.8
  2  CL3           CL5                 7   0.1488   .667   16.0   5.3
  1  CL2           CL4                10   0.6669   .000     .   16.0

[Dendrogram: Ward's Minimum Variance Cluster Analysis]
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Initial Seeds
Cluster   SepalLength   SepalWidth   PetalLength   PetalWidth
   1      43.00000000   30.00000000  11.00000000    1.00000000
   2      77.00000000   26.00000000  69.00000000   23.00000000

Minimum Distance Between Initial Seeds = 70.85196
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Iteration History
                        Relative Change in Cluster Seeds
Iteration   Criterion        1         2
    1        11.0638      0.1904    0.3163
    2         5.3780      0.0596    0.0264
    3         5.0718      0.0174    0.00766

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 5.0417
Fisher (1936) Iris Data
The FASTCLUS Procedure
Cluster Summary
                   RMS Std     Max Distance from   Radius    Nearest   Distance Between
Cluster  Frequency Deviation   Seed to Observation Exceeded  Cluster   Cluster Centroids
   1        53      3.7050         21.1621                      2          39.2879
   2        97      5.6779         24.6430                      1          39.2879
Fisher (1936) Iris Data
The FASTCLUS Procedure
Statistics for Variables
Variable      Total STD   Within STD   R-Square   RSQ/(1-RSQ)
SepalLength    8.28066     5.49313     0.562896    1.287784
SepalWidth     4.35866     3.70393     0.282710    0.394137
PetalLength   17.65298     6.80331     0.852470    5.778291
PetalWidth     7.62238     3.57200     0.781868    3.584390
OVER-ALL      10.69224     5.07291     0.776410    3.472463

Pseudo F Statistic = 513.92
Approximate Expected Over-All R-Squared = 0.51539
Cubic Clustering Criterion = 14.806
WARNING: The two above values are invalid for correlated variables.

The pseudo F statistic is
F = (R² / (c − 1)) / ((1 − R²) / (n − c))
c: number of clusters
n: number of observations
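As a quick check, the pseudo F for this two-cluster solution can be recomputed from the OVER-ALL R-square in the table above:

R2, c, n = 0.776410, 2, 150
pseudo_F = (R2 / (c - 1)) / ((1 - R2) / (n - c))
print(pseudo_F)   # ≈ 513.9, matching the printed Pseudo F Statistic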
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02
Cluster Means
Cluster   SepalLength   SepalWidth    PetalLength   PetalWidth
   1      50.05660377   33.69811321   15.60377358    2.90566038
   2      63.01030928   28.86597938   49.58762887   16.95876289

Cluster Standard Deviations
Cluster   SepalLength   SepalWidth    PetalLength   PetalWidth
   1      3.427350930   4.396611045   4.404279486   2.105525249
   2      6.336887455   3.267991438   7.800577673   4.155612484
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species
(Each cell: Frequency / Percent / Row Pct / Col Pct)

CLUSTER      Setosa   Versicolor   Virginica    Total
   1            50          3           0          53
             33.33       2.00        0.00       35.33
             94.34       5.66        0.00
            100.00       6.00        0.00
   2             0         47          50          97
              0.00      31.33       33.33       64.67
              0.00      48.45       51.55
              0.00      94.00      100.00
Total           50         50          50         150
             33.33      33.33       33.33      100.00
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Initial Seeds
Cluster   SepalLength   SepalWidth   PetalLength   PetalWidth
   1      58.00000000   40.00000000  12.00000000    2.00000000
   2      77.00000000   38.00000000  67.00000000   22.00000000
   3      49.00000000   25.00000000  45.00000000   17.00000000

Minimum Distance Between Initial Seeds = 38.23611
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Iteration History
                        Relative Change in Cluster Seeds
Iteration   Criterion        1         2         3
    1         6.7591      0.2652    0.3205    0.2985
    2         3.7097      0         0.0459    0.0317
    3         3.6427      0         0.0182    0.0124

Convergence criterion is satisfied.
Criterion Based on Final Seeds = 3.6289
Fisher (1936) Iris Data
Cluster Summary
                   RMS Std     Max Distance from   Radius    Nearest   Distance Between
Cluster  Frequency Deviation   Seed to Observation Exceeded  Cluster   Cluster Centroids
   1        50      2.7803         12.4803                      3          33.5693
   2        38      4.0168         14.9736                      3          17.9718
   3        62      4.0398         16.9272                      2          17.9718
Fisher (1936) Iris Data
Statistics for Variables
Variable      Total STD   Within STD   R-Square   RSQ/(1-RSQ)
SepalLength    8.28066     4.39488     0.722096    2.598359
SepalWidth     4.35866     3.24816     0.452102    0.825156
PetalLength   17.65298     4.21431     0.943773   16.784895
PetalWidth     7.62238     2.45244     0.897872    8.791618
OVER-ALL      10.69224     3.66198     0.884275    7.641194

Pseudo F Statistic = 561.63
Approximate Expected Over-All R-Squared = 0.62728
Cubic Clustering Criterion = 25.021
WARNING: The two above values are invalid for correlated variables.
Fisher (1936) Iris Data
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02
Cluster Means
Cluster   SepalLength   SepalWidth    PetalLength   PetalWidth
   1      50.06000000   34.28000000   14.62000000    2.46000000
   2      68.50000000   30.73684211   57.42105263   20.71052632
   3      59.01612903   27.48387097   43.93548387   14.33870968

Cluster Standard Deviations
Cluster   SepalLength   SepalWidth    PetalLength   PetalWidth
   1      3.524896872   3.790643691   1.736639965   1.053855894
   2      4.941550255   2.900924461   4.885895746   2.798724562
   3      4.664100551   2.962840548   5.088949673   2.974997167
Fisher (1936) Iris Data
The FREQ Procedure
Table of CLUSTER by Species
(Each cell: Frequency / Percent / Row Pct / Col Pct)

CLUSTER      Setosa   Versicolor   Virginica    Total
   1            50          0           0          50
             33.33       0.00        0.00       33.33
            100.00       0.00        0.00
            100.00       0.00        0.00
   2             0          2          36          38
              0.00       1.33       24.00       25.33
              0.00       5.26       94.74
              0.00       4.00       72.00
   3             0         48          14          62
              0.00      32.00        9.33       41.33
              0.00      77.42       22.58
              0.00      96.00       28.00
Total           50         50          50         150
             33.33      33.33       33.33      100.00