Advanced Statistical Methods for Research, Math 736/836
Discriminant Analysis and Classification
Supervised learning
©2009 Philip J. Ramsey, Ph.D.
 In this section we explore a technique to separate observations into
groups as linear functions of some number of covariates.
 We are interested in developing linear functions that accurately separate the observations into predetermined groups, and in understanding the relative contributions of the covariates in creating these separations.
 Today discriminant analysis is used for more than separation: it is also used as a classification method when a training dataset with known group memberships exists. In this sense discriminant analysis is a method of supervised learning.
 We will discuss linear discriminant functions and quadratic
discriminant functions.
 Discriminant analysis is related to such methods as logistic
regression, CART, and neural nets, and these other methods are
increasingly used in lieu of discriminant analysis for classification.
 The origins of Discriminant analysis seem to be from a paper
published by R.A. Fisher in 1936.
 Fisher was attempting to find some linear combination of
covariates that could be used to discriminate between predetermined
categories or groupings of the observations.
 He was also attempting to reduce the dimensionality of the
covariates such that one could visually discriminate between the
groups.
 In a sense, Discriminant analysis combines concepts of ANOVA
and PCA in developing these linear discriminant functions.
 PCA is an example of unsupervised learning in that no use is made
of any natural groupings in the observations.
 However, discriminant analysis is considered supervised learning since it explicitly makes use of the groupings in developing the functions.
 We begin with the simplest case of k variables or covariates that
can be used to separate a set of N observations into m = 2 groups.
 We assume that both groups have an identical covariance structure Σ among the k variables, but different centroids μ1 and μ2.
 A linear discriminant function is a linear combination of the k variables that maximizes the distance between the centroids.
 In basic discriminant analysis we typically have two samples, one
from each group, with n1 and n2 observations in each group, where N
= n1 + n2.
 We will let Y1j represent a row vector (k elements, one for each column or variable) from group 1; there are n1 such vectors. Likewise, Y2j represents a row vector of length k from group 2, and there are n2 such vectors.
 The linear discriminant functions turn each row vector into a scalar
value Z. In other words we create a linear combination of the k
covariates Z = a′Y where a′ is a vector of coefficients for the linear
combination. Note, in concept, the similarity to PCA.
 We have m = 2 such discriminant functions.
$$Z_{1j} = a'Y_{1j} = a_1 Y_{1j1} + a_2 Y_{1j2} + \cdots + a_k Y_{1jk}$$
$$Z_{2j} = a'Y_{2j} = a_1 Y_{2j1} + a_2 Y_{2j2} + \cdots + a_k Y_{2jk}$$
 The idea is to find estimates of the coefficients a′ such that the
standardized differences between the two scalar means are
maximized. Since the differences could be negative we work with the
squared distances.
$$\bar{Z}_1 = a'\bar{Y}_1, \qquad \bar{Z}_2 = a'\bar{Y}_2 \qquad (\bar{Y}_1,\ \bar{Y}_2 \text{ are the estimated centroids})$$
 The trick is to find the values of the weights a´ such that we have
the maximum separation between the two groups.
 Fisher viewed this as an ANOVA type problem where he wanted to
find the weights such that the variation between the transformed
groups (the linear combinations) was maximized and the within
group variation minimized.
 We can define the between group covariance as a matrix H, which
we explain in detail later in the notes. Also, we can identify the
within group covariance with the matrix E and again we explain this
matrix in detail later.
 Basically we wish to find the weights that maximize the ratio
aHa

aEa
 For only two groups the expression greatly simplifies.
©2009 Philip J. Ramsey, Ph.D.
6
 Let Spool be the pooled covariance matrix (dimension k x k) from
the two groups. Recall we assume both groups have the same
underlying covariance structure, so it makes sense to pool the two
estimates together in order to get a better overall estimate of Σ.
 Let $S_Z$ be the pooled standard deviation of the two columns of scalars $Z_1$ and $Z_2$; then the standardized squared distance between the scalar (column) means is
$$\frac{(\bar{Z}_1 - \bar{Z}_2)^2}{S_Z^2} = \frac{\left[a'(\bar{Y}_1 - \bar{Y}_2)\right]^2}{a'S_{pool}\,a}$$
 The RHS of this expression is our objective function, which we maximize to find the weights; with some calculus, the maximum separation occurs at
$$a_{(k\times 1)} = S_{pool}^{-1}\left(\bar{Y}_1 - \bar{Y}_2\right)$$
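As a small illustration of this formula, here is a sketch (assuming numpy is available; the data values below are hypothetical placeholders, not one of the course datasets) that computes the pooled covariance matrix and the discriminant weights from two groups of raw observations:

```python
import numpy as np

def lda_two_group_weights(Y1, Y2):
    """Fisher weights a = S_pool^{-1}(ybar1 - ybar2) for two groups.

    Y1, Y2 : (n1 x k) and (n2 x k) arrays of observations (rows) on k variables.
    """
    n1, n2 = len(Y1), len(Y2)
    ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S1 = np.cov(Y1, rowvar=False)                               # sample covariance, group 1
    S2 = np.cov(Y2, rowvar=False)                               # sample covariance, group 2
    S_pool = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)    # pooled estimate of Sigma
    return np.linalg.solve(S_pool, ybar1 - ybar2)               # a = S_pool^{-1}(ybar1 - ybar2)

# Hypothetical two-variable example:
rng = np.random.default_rng(1)
Y1 = rng.normal([36, 62], 2.5, size=(10, 2))
Y2 = rng.normal([39, 60], 2.5, size=(10, 2))
print(lda_two_group_weights(Y1, Y2))   # weights for the score Z = a'Y
```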
 What we are doing is finding a direction or line (for two groups) onto which we project the original observations, such that the standardized mean difference between the transformed groups is maximized along that line.
 The solution for a′, intuitively, projects the observations onto a line that runs parallel to a line joining the two centroids of our two groups.
 The difference between the two transformed (projected) means is maximized only when the projection line is parallel to the line joining the centroids; a projection in any other direction results in a smaller difference between the transformed means $\bar{Z}_i$.
 The solution for the weights a′ is not unique; however, the projection direction is unique and is always parallel to the line joining the centroids.
 Example: We use the dataset Table81Rencher.JMP to illustrate
the discriminant function method for 2 groups. The data set consists
of two measurements of strength on samples of steel processed at
two different temperatures; the temperatures form the groups. We
will develop a linear discriminant function to classify the steel into
the two groups based upon the two measurements.
 Example: Notice from the Fit Y by X plot that we would have a
great deal of overlap if we tried to project the observations onto
either the Y axis or X axis. However, if it is possible to project the
points onto a line in another direction, then we could greatly reduce
the overlap between the two groups. This is the concept of a linear
discriminant function.
[Figure: Bivariate Fit of Ultimate by Yield, showing the two temperature groups]
 Example: In the plot, the 'X' symbols connect the centroids for the two groups. The line we project the points onto will always be parallel to a line connecting the centroids, and the exact position of this line depends upon the coefficient estimates a′. Remember the solution shown is not unique, but the projection direction is unique.
[Figure: scatterplot of Ultimate vs. Yield with the two group centroids marked and connected]
 Example: Below are the calculations for the coefficients of the
discriminant functions for the two groups.
36.4 
39.0 
7.92 5.68 
Y1  
Y2  
S pool  



62.6
60.4
5.68
6.29






 0.358 0.323  2.6
1
a  S pool  Y1  Y2   
  2.2 

0.323
0.4508



 1.6426 


1.8327


Z  aY  1.6426Y1  1.8327Y2
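These hand calculations are easy to verify numerically; the following sketch (assuming numpy is available) reproduces the coefficients from the centroids and pooled covariance matrix reported above:

```python
import numpy as np

ybar1 = np.array([36.4, 62.6])          # Temp group 1 centroid (Yield, Ultimate)
ybar2 = np.array([39.0, 60.4])          # Temp group 2 centroid
S_pool = np.array([[7.92, 5.68],
                   [5.68, 6.29]])       # pooled covariance matrix

a = np.linalg.solve(S_pool, ybar1 - ybar2)
print(a)                                # approx. [-1.64, 1.83]

# Discriminant score for any observation y = (Yield, Ultimate):
y = np.array([40.0, 63.0])
print(a @ y)                            # Z = a'y
```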
 Example: Below is a data table showing the values of the
discriminant function Z. On the Z scale we can clearly differentiate
between the two Temperature groups.
 Example: One could also do a two-sample t test on Z to see if a true mean difference exists between the two groups. Below are the two-sample t test results from the JMP Fit Y by X platform. Notice that a highly significant difference exists. Equivalently, we can say that a significant difference exists between the centroids of the original data.
 Example: JMP performs discriminant analysis in the Discriminant platform. The solution from JMP is shown below. Notice that it is a scaled version of the manual solution, but it is also equivalent in its ability to classify into the two groups.
$$Z = a'Y = -0.5705\,Y_1 + 0.6356\,Y_2$$
 Example: The Canon[1] scores below are the discriminant scores for the JMP solution. We can see that the JMP solution equivalently separates the two temperatures into their respective groups.
$$\text{Canon}[1] = a'Y = -0.5705\,Y_1 + 0.6356\,Y_2$$
$$Z = a'Y = -1.6426\,Y_1 + 1.8327\,Y_2$$
 To show the equivalence, note that the ratio of the Z values to the Canon[1] values is a constant = 2.89.
 Example: We can show the conceptual similarity to PCA by doing
a PCA on Yield and Ultimate and show that we can separate the two
Temperature groups by projecting all of the points onto the PC2 line.
 In the PC1 direction we cannot separate the two groups but in the
PC2 direction we can easily separate the two groups, so a projection
on to the PC2 line does provide discrimination between the groups.
 Example: The solution in Discriminant is not the eigendecomposition of S or R used in PCA; however, in concept the idea is to find a coordinate system on which to plot the points (rotate the original axes) such that the groups are separated as much as possible, that is, the transformed means are as far apart as possible. Of course PCA is not intended to separate groups, so the solution is not identical to Discriminant. More on this later.
[Figure: PCA of Yield and Ultimate with the PC1 and PC2 directions overlaid]
 When we have more than two groups, the problem becomes more
complex, however the basic principle of linear discriminant functions
is still the same as in the two group case.
 In order to explain the solution for m > 2 groups we need to
introduce a couple of concepts from one-way Multivariate Analysis
of Variance (MANOVA).
 Recall in univariate one-way Analysis of Variance (ANOVA) the
variation in the observations is broken down into two parts. The first
part is the within variation, which describes the purely random
variation within the replicates in each of the groups. The second part
is the between variation, which describes the variation between
groups.
 If the between variation is large compared to the within variation,
then we have evidence of significant differences between the groups.
 For the univariate one-way ANOVA case the sums of squares define the variation in the two components, and we have the well-known result SS(Total) = SS(Between) + SS(Within); for m groups the formula is
$$\sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{Y}\right)^2 \;=\; \sum_{i=1}^{m} n_i\left(\bar{Y}_i-\bar{Y}\right)^2 \;+\; \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{Y}_i\right)^2$$
$$SS(\text{total}) = SS(\text{between}) + SS(\text{within})$$
 When multivariate responses exist, then we have to re-express the
within and between formulas using matrix notation.
$$SS(\text{between}) = H = \sum_{i=1}^{m} n_i\left(\bar{Y}_i-\bar{Y}\right)\left(\bar{Y}_i-\bar{Y}\right)'$$
$$SS(\text{within}) = E = \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(Y_{ij}-\bar{Y}_i\right)\left(Y_{ij}-\bar{Y}_i\right)'$$
where $\bar{Y}_i$ = centroid of the $i$th group, $\bar{Y}$ = overall centroid, and $Y_{ij}$ = the vector of observations for the $ij$th cell.
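A short sketch (assuming numpy is available; the example data at the bottom are hypothetical placeholders, not one of the course datasets) of how H and E can be computed from a list of group data matrices:

```python
import numpy as np

def between_within_matrices(groups):
    """Compute H (between) and E (within) SSCP matrices for a one-way MANOVA.

    groups : list of (n_i x k) arrays, one per group.
    """
    all_obs = np.vstack(groups)
    grand_mean = all_obs.mean(axis=0)                 # overall centroid
    k = all_obs.shape[1]
    H = np.zeros((k, k))
    E = np.zeros((k, k))
    for Y in groups:
        ybar = Y.mean(axis=0)                         # group centroid
        d = (ybar - grand_mean).reshape(-1, 1)
        H += len(Y) * d @ d.T                         # n_i (ybar_i - ybar)(ybar_i - ybar)'
        resid = Y - ybar
        E += resid.T @ resid                          # sum of (y_ij - ybar_i)(y_ij - ybar_i)'
    return H, E

# Hypothetical three-group example with k = 2 variables:
rng = np.random.default_rng(0)
groups = [rng.normal(mu, 1.0, size=(20, 2)) for mu in ([0, 0], [1, 0.5], [2, 1])]
H, E = between_within_matrices(groups)
```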
 Below is the expression for the standardized squared distance between two groups:
$$\frac{(\bar{Z}_1-\bar{Z}_2)^2}{S_Z^2} = \frac{\left[a'(\bar{Y}_1-\bar{Y}_2)\right]^2}{a'S_{pool}\,a} = \frac{a'(\bar{Y}_1-\bar{Y}_2)(\bar{Y}_1-\bar{Y}_2)'a}{a'S_{pool}\,a}$$
 To generate the analogous multivariate expression for m > 2 groups
we substitute in the H and E matrices.
aHa

aEa
 In this case we search for solutions a′ which maximize λ. Notice
that the trivial solution a′ = 0 is not permissible.
 Notice that λ is a ratio of the between groups variation to the
random within groups variation and we want to maximize the
discrimination between the groups, hence maximize λ.
 By rearranging terms we have the expressions
$$\lambda\,a'Ea = a'Ha$$
$$a'\left(Ha - \lambda Ea\right) = 0 \;\Rightarrow\; \left(H - \lambda E\right)a = 0$$
$$\left(E^{-1}H - \lambda I\right)a = 0$$
 With some calculus it can be shown that the solutions are the s nonzero eigenvalues and associated eigenvectors of $E^{-1}H$; thus the largest eigenvalue is the maximum value of λ, and the associated eigenvector gives the solution a′, the scoring coefficients. Note the similarity to PCA.
 In other words, the solution for the coefficients of the linear discriminant functions is given by the weights from the eigenvectors of $E^{-1}H$.
 Typically the number of required linear discriminant functions is
based upon the magnitudes of the eigenvalues and generally only one
or two are required. The method is inherently graphical.
 In general for m groups and k variables s = min(m-1,k) and is the
rank of H. The number of discriminant functions is no greater than
the rank of H. For m = 2 groups we see that 1 discriminant function
is required.
 The relative importance of each discriminant function can be assessed by the relative contribution of each eigenvalue to the total (recall the proportion of variation explained concept from PCA). For the $i$th discriminant function the relative importance is
$$\frac{\lambda_i}{\sum_{i=1}^{s}\lambda_i}$$
 Typically only the discriminant functions associated with the
largest eigenvalues are retained and the remainder are ignored.
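The eigen-solution and the relative-importance calculation are straightforward to script; a sketch assuming numpy is available and that the H and E matrices have already been computed (for example with a helper like the sketch above):

```python
import numpy as np

def discriminant_functions(H, E):
    """Eigenvalues and scoring coefficients of E^{-1}H, sorted by importance."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(E, H))   # eigen-decomposition of E^{-1}H
    order = np.argsort(eigvals.real)[::-1]                    # largest eigenvalue first
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]                          # columns are coefficient vectors a
    s = int(np.sum(eigvals > 1e-10))                          # number of nonzero eigenvalues
    importance = eigvals[:s] / eigvals[:s].sum()              # lambda_i / sum(lambda_i)
    return eigvals[:s], eigvecs[:, :s], importance
```

Note that E^{-1}H is not symmetric, so a general eigensolver is used and the numerically tiny imaginary parts are discarded; only the s = min(m-1, k) meaningfully nonzero eigenvalues are retained.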
 Example: We use the dataset FootBallHelmet.JMP. This dataset gives head dimension measurements, for a study of football helmets, for three groups of individuals: high school football players, college football players, and non-football players. For each of the m = 3 groups, k = 6 variables are measured. There are a total of N = 90 observations. We use the JMP Discriminant platform to find a solution.
 Example: Results from the Discriminant report
H = Between Matrix

         WDim       Circum     FBEye      EyeHD      EarHD      Jaw
WDim     0.0251034  -0.083175  -0.020621  -0.153333  -0.073172  -0.030989
Circum   -0.083175  0.4528203  0.0949131  0.8631     0.3790986  0.1137473
FBEye    -0.020621  0.0949131  0.0209278  0.1792222  0.0806086  0.0271162
EyeHD    -0.153333  0.8631     0.1792222  1.6478697  0.7207088  0.2114636
EarHD    -0.073172  0.3790986  0.0806086  0.7207088  0.3186539  0.0988646
Jaw      -0.030989  0.1137473  0.0271162  0.2114636  0.0988646  0.0389451

E = Within Matrix

         WDim       Circum     FBEye      EyeHD      EarHD      Jaw
WDim     0.4282299  0.5779448  0.1579195  0.0837931  0.1245517  0.228
Circum   0.5779448  3.1610934  1.0199745  0.6527046  0.3400636  0.5050969
FBEye    0.1579195  1.0199745  0.5459435  0.0768065  0.12859    0.1591149
EyeHD    0.0837931  0.6527046  0.0768065  1.2324636  0.314751   0.0424674
EarHD    0.1245517  0.3400636  0.12859    0.314751   0.6180575  0.0091303
Jaw      0.228      0.5050969  0.1591149  0.0424674  0.0091303  0.3755172

Eigenvectors of E⁻¹H (rows are the canonical functions, columns the variables)

         WDim       Circum     FBEye      EyeHD      EarHD      Jaw
Canon1   -0.948423  0.0036399  0.0064396  0.6474831  0.5043609  0.8285351
Canon2   -1.406775  0.0005126  0.0286176  -0.54027   0.3839132  1.5288556
Canon3   -0.243889  0.9654477  -1.62245   -0.599344  0.6417565  -0.259021
Canon4   0.5168255  -0.381481  -0.432447  -0.147299  0.9427287  0.2331579
Canon5   0.9145116  -0.307908  1.107693   -0.009525  0.1699888  0.4759216
Canon6   -0.353562  -0.139149  1.0937707  -0.209676  0.5939131  -1.005664
 Example: From the report we see that only s = 2 nonzero eigenvalues exist (m - 1 = 2); therefore there are only two useful discriminant functions Z1 and Z2. The coefficients of the functions are obtained from the first two rows of the eigenvector matrix given on the previous slide.
Eigenvalue    Percent   Cum Percent
1.91776348    94.2995   94.2995
0.11593148    5.7005    100.0000
3.0166e-16    0.0000    100.0000
9.8859e-17    0.0000    100.0000
1.8746e-18    0.0000    100.0000
-4.222e-17    0.0000    100.0000
 Since the first eigenvalue is 94.3% of the total, it is possible that
only one discriminant function is actually needed to separate the
groups.
Z1 = -0.948423·WDim + 0.0036399·Circum + 0.0064396·FBEye + 0.6474831·EyeHD + 0.5043609·EarHD + 0.8285351·Jaw
 Example: Below are Fit Y by X plots of the two discriminant
functions by Group. The first function Canon[1] seems to separate
the HSFB group quite well from the other two, while the second
function Canon[2] has little or no discrimination ability between the
groups.
[Figure: Oneway plots of Canon[1] by Group and Canon[2] by Group for the CollFB, HSFB, and NonFB groups]
 Hypothesis tests can be performed for the significance of the
discriminant functions.
 In developing the discriminant functions we have made no assumptions about the distributions of the multivariate groups. In order to perform hypothesis tests, we do have to assume that the groups are multivariate normal with an identical covariance structure Σ, but possibly different centroids μ1, μ2, ..., μm.
 If the discriminant functions are not significant (the null hypothesis), this is equivalent to saying that the true coefficient vectors a′ of the discriminant functions are 0. Stated another way, there is not a significant difference among the m centroids.
 The null hypotheses can be stated as
$$H_0: a_1 = a_2 = \cdots = a_s = 0 \qquad H_A: \text{at least one } a_i \neq 0$$
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_m \qquad H_A: \text{at least two } \mu_i \text{ differ}$$
 A number of multivariate hypothesis tests exist for a difference between a group of centroids. We will not discuss the majority of them in detail.
 One of the multivariate tests, called Wilks' Λ, is particularly useful for hypothesis testing of the discriminant functions. Again, we skip the details of the test for now, but the test can be shown to be a function of the eigenvalues of $E^{-1}H$ from which we derive the discriminant functions.
 Wilks' Λ for s nonzero eigenvalues (the number of discriminant functions) can be shown to be
$$\Lambda_1 = \prod_{i=1}^{s}\frac{1}{1+\lambda_i}$$
 For k variables, m groups, significance level α (usually 0.05), and N observations, the distribution of the Wilks' Λ test statistic can be compared to tabled percentile values $\Lambda_{k,\,m-1,\,N-m,\,\alpha}$. Various approximations exist, and JMP uses an approximation based upon the F distribution.
 Wilks' Λ is a bit unusual as a test statistic in that we reject for smaller values rather than larger. Looking at the formula, note that large eigenvalues will lead to small values of the test statistic:
$$\Lambda_1 = \prod_{i=1}^{s}\frac{1}{1+\lambda_i}$$
 In general larger eigenvalues indicate more significant discriminant
functions, so our test rejects for values of the test statistic below the
critical value.
 A nice aspect of the Wilks' Λ test is that we can also use it to test each of the discriminant functions for significance, in addition to the overall test for the entire set of s functions.
 If the overall Wilks' Λ test rejects, then we are assured that at least the first discriminant function is significant, but this does not tell us about the remaining s - 1 functions.
 To test for the other functions, drop the largest eigenvalue and test the remaining set. This procedure can be followed iteratively for the entire set of s functions.
 The test for the set with the largest eigenvalue deleted is
$$\Lambda_2 = \prod_{i=2}^{s}\frac{1}{1+\lambda_i}$$
 JMP provides the overall Wilks' Λ test, but does not perform tests on the remaining discriminant functions. We can perform these tests by hand, but we will need to compute the F distribution approximation to the Λ distribution to evaluate the tests.
 For the $l$th test $\Lambda_l$ the F approximation is
$$\Lambda_l = \prod_{i=l}^{s}\frac{1}{1+\lambda_i}, \qquad F = \frac{1-\Lambda_l^{1/t}}{\Lambda_l^{1/t}}\cdot\frac{df_2}{df_1}$$
$$t = \sqrt{\frac{(k-l+1)^2(m-l)^2-4}{(k-l+1)^2+(m-l)^2-5}}, \qquad \omega = N-1-\tfrac{1}{2}(k+m)$$
$$df_1 = (k-l+1)(m-l), \qquad df_2 = \omega\,t - \tfrac{1}{2}\bigl[(k-l+1)(m-l)-2\bigr]$$
 Example: Continuing with the football helmet data, we compute the Wilks' test statistics for the s = 2 discriminant functions.
$$\Lambda_1 = \frac{1}{1+1.9178}\cdot\frac{1}{1+0.1159} = 0.307, \qquad F = \frac{1-0.307^{1/2}}{0.307^{1/2}}\cdot\frac{164}{12} = 10.994$$
$$t = \sqrt{\frac{(6)^2(2)^2-4}{(6)^2+(2)^2-5}} = 2, \qquad \omega = 90-1-\tfrac{1}{2}(6+3) = 84.5$$
$$df_1 = (6)(2) = 12, \qquad df_2 = 84.5(2) - \tfrac{1}{2}\bigl[(6)(2)-2\bigr] = 164$$
 From an F distribution table, $F_{0.95,\,12,\,164} = 1.82$. Since our test statistic F = 10.994 >> 1.82, we overwhelmingly reject the null hypothesis; the p-value is < 0.0001. We can now assume that at least the first discriminant function is statistically significant.
 Example: Next we perform the test for the second discriminant
function.
$$\Lambda_2 = \frac{1}{1+0.1159} = 0.896, \qquad F = \frac{1-0.896}{0.896}\cdot\frac{83}{5} = 1.924$$
$$t = \sqrt{\frac{(5)^2(1)^2-4}{(5)^2+(1)^2-5}} = 1, \qquad \omega = 90-1-\tfrac{1}{2}(6+3) = 84.5$$
$$df_1 = (5)(1) = 5, \qquad df_2 = 84.5(1) - \tfrac{1}{2}\bigl[(5)(1)-2\bigr] = 83$$
 From an F distribution table, $F_{0.95,\,5,\,83} = 2.325$. Since our test statistic F = 1.924 < 2.325, we fail to reject the null hypothesis (the p-value is roughly 0.10). We cannot assume that the second discriminant function is statistically significant.
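As a check on these hand computations, here is a short sketch (assuming numpy and scipy are available) that reproduces the partial Wilks' Λ statistics and their F approximations from the two nonzero eigenvalues:

```python
import numpy as np
from scipy import stats

def wilks_partial_tests(eigvals, k, m, N):
    """F approximation to the partial Wilks' Lambda test for each discriminant function."""
    s = len(eigvals)
    omega = N - 1 - 0.5 * (k + m)
    results = []
    for l in range(1, s + 1):
        lam = np.prod(1.0 / (1.0 + np.asarray(eigvals[l - 1:])))          # Lambda_l
        a, b = k - l + 1, m - l
        t = np.sqrt((a**2 * b**2 - 4) / (a**2 + b**2 - 5)) if (a**2 + b**2 - 5) > 0 else 1.0
        df1 = a * b
        df2 = omega * t - 0.5 * (a * b - 2)
        F = (1 - lam**(1 / t)) / lam**(1 / t) * df2 / df1
        results.append((lam, F, df1, df2, stats.f.sf(F, df1, df2)))        # p-value from the F dist
    return results

# Football helmet example: k = 6 variables, m = 3 groups, N = 90 observations
for lam, F, df1, df2, p in wilks_partial_tests([1.9178, 0.1159], k=6, m=3, N=90):
    print(f"Lambda = {lam:.3f}, F = {F:.3f}, df = ({df1:.0f}, {df2:.0f}), p = {p:.4f}")
# Roughly (0.307, 10.99, 12, 164) and (0.896, 1.92, 5, 83), matching the hand calculations above.
```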
 Example: From the discriminant analysis in JMP we can get the
overall test of significance for the 2 discriminant functions.
Test               Value      Approx. F  NumDF  DenDF  Prob>F
Wilks' Lambda      0.307123   10.9941    12     164    <.0001*
Pillai's Trace     0.7611594  8.4994     12     166    <.0001*
Hotelling-Lawley   2.033695   13.7274    12     162    <.0001*
Roy's Max Root     1.9177635  26.5291    6      83     <.0001*
 JMP provides several other tests; however, we will not discuss these in this section. As shown, the Wilks' Λ test is useful for discriminant functions since we can partition the test for each of the discriminant functions.
 All of the multivariate tests overwhelmingly reject the null
hypothesis, so they are in agreement that at least the first
discriminant function is statistically significant.
 Our test indicated that the second function may not be significant.
 Another statistic that can be used to try and determine the
importance of each of the discriminant functions is referred to as the
canonical correlation coefficient.
 The canonical correlation represents the relationship between the
discriminant function and the grouping variable. The higher the
correlation, the greater the ability of the discriminant function to sort
observations into the proper groups.
 In ANOVA a categorical factor is transformed into a dummy
variable (it can only have the values of 1 or 0) in order to construct a
linear model. For m levels of the categorical factor, m-1 dummy
variables are required.
 For example, suppose we have m = 3 groups, then we need 2
dummy variables
 We use the football helmet data as an example
Category  D1  D2
HSFB      1   0
CollFB    0   1
NonFB     0   0
 Canonical correlation measures the association between each of the
discriminant functions and a best linear combination of the dummy
variables associated with the categories of the grouping variable.
 The correlation is a measure of how much variation between the
groups can be explained by the discriminant function.
 Mathematically, the canonical correlation for the $i$th discriminant function can be computed as
$$r_i = \sqrt{\frac{\lambda_i}{1+\lambda_i}}$$
 Example: Again using the football helmet data we have JMP
calculate the canonical correlations for the two discriminant
functions.
Eigenvalue    Percent   Cum Percent   Canonical Corr
1.91776348    94.2995   94.2995       0.81072297
0.11593148    5.7005    100.0000      0.32231605
3.0166e-16    0.0000    100.0000      0
9.8859e-17    0.0000    100.0000      0
1.8746e-18    0.0000    100.0000      0
-4.222e-17    0.0000    100.0000      0
 The first discriminant function seems to be highly correlated with
the groupings, while the second appears weakly correlated.
 The hand computations are
$$r_1^2 = \frac{1.9178}{1+1.9178} = 0.657 \;\Rightarrow\; r_1 = 0.811, \qquad r_2^2 = \frac{0.1159}{1+0.1159} = 0.104 \;\Rightarrow\; r_2 = 0.322$$
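A one-line check (assuming numpy is available) that these canonical correlations follow from the eigenvalues reported by JMP:

```python
import numpy as np

eigvals = np.array([1.91776348, 0.11593148])
canon_corr = np.sqrt(eigvals / (1 + eigvals))
print(canon_corr)   # approx. [0.811, 0.322], matching the JMP report
```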
 Another issue to consider in discriminant analysis is whether or not we require all of the potential k variables in our functions in order for them to classify correctly.
 Often some of the variables may not be of value for classification.
A stepwise model building procedure can be used to see if some of
the variables can be dropped from consideration.
 There are three basic approaches.
 Forward – sequentially enter variables based on ability to
separate the groups.
 Backward – sequentially remove variables based on ability to
separate the groups.
 Stepwise – combine forward and backward (recommended).
 The procedures are based upon computing Wilks' Λ.
 The initial step for the stepwise selection is to compute a univariate
ANOVA for each of the k variables to see how they individually
separate the m group centroids.
 The first variable entered into the model is the one with the
smallest p-value for the univariate F test of significance.
 Since JMP provides stepwise Discriminant analysis we will work
along with the football helmet data to explain the stepwise method.
Discriminant Analysis: Column Selection (click to select columns into the discriminant model)
Columns In: 0    Columns Out: 6    Smallest P to Enter: 0.0000000    Largest P to Remove: .

Column   F Ratio   Prob>F
WDim     2.550     0.0839036
Circum   6.231     0.0029573
FBEye    1.668     0.1946924
EyeHD    58.162    0.0000000
EarHD    22.427    0.0000000
Jaw      4.511     0.0136710
 The F tests and p-values are the univariate ANOVA tests. It
appears that EyeHD is our first candidate to enter the model. On the
next two slides we show that the initial p-values are from a set of 6
univariate ANOVA’s.
 The first three univariate F tests.
Oneway Analysis of WDim By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    2.184000         1.09200       2.5500    0.0839
Error     87   37.256000        0.42823
C. Total  89   39.440000

Oneway Analysis of Circum By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    39.39536         19.6977       6.2313    0.0030*
Error     87   275.01513        3.1611
C. Total  89   314.41049

Oneway Analysis of FBEye By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    1.820722         0.910361      1.6675    0.1947
Error     87   47.497083        0.545943
C. Total  89   49.317806
 The next three univariate F tests.
Oneway Analysis of EyeHD By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    143.36467        71.6823       58.1618   <.0001*
Error     87   107.22433        1.2325
C. Total  89   250.58900

Oneway Analysis of EarHD By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    27.722889        13.8614       22.4274   <.0001*
Error     87   53.771000        0.6181
C. Total  89   81.493889

Oneway Analysis of Jaw By Group
Source    DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Group     2    3.388222         1.69411       4.5114    0.0137*
Error     87   32.670000        0.37552
C. Total  89   36.058222
 After entering EyeHD into the model, partial Wilks' Λ tests are computed for each of the remaining k - 1 variables, and the one with the smallest p-value is entered into the model, since it gives the best separation of the groups given that the first selected variable is already in the model.
$$\Lambda(Y_r \mid Y_1) = \frac{\Lambda(Y_r, Y_1)}{\Lambda(Y_1)}$$
 Here Y1 is the variable entered at step 1 and Yr is one of the
remaining variables. We have to compute 5 partial tests.
 Remember at step 2 a partial test is performed for every possible 2
variable model, since EyeHD is already selected.
 The Wilks' Λ test for a model containing only EyeHD is shown below. This is equivalent to the univariate ANOVA F test.
Test               Value      Exact F   NumDF  DenDF  Prob>F
Wilks' Lambda      0.4278892  58.1618   2      87     <.0001*
Pillai's Trace     0.5721108  58.1618   2      87     <.0001*
Hotelling-Lawley   1.3370535  58.1618   2      87     <.0001*
Roy's Max Root     1.3370535  58.1618   2      87     <.0001*
 We need this test value to compute the partial test values at stage
two.
 With both EyeHD and WDim in the model the test statistic value is
Test               Value      Approx. F  NumDF  DenDF  Prob>F
Wilks' Lambda      0.4003001  24.9635    4      172    <.0001*
Pillai's Trace     0.6134257  19.2446    4      174    <.0001*
Hotelling-Lawley   1.4638372  31.1065    4      170    <.0001*
Roy's Max Root     1.440026   62.6411    2      87     <.0001*
 The associated partial Wilks' test for WDim is
$$\Lambda(\text{WDim} \mid \text{EyeHD}) = \frac{0.4003}{0.4279} = 0.9355$$
 It can be shown that for a model with p variables, the associated partial F test with m - 1 and N - m - p + 1 degrees of freedom is given by
$$F(Y_r \mid Y_1) = \frac{1-\Lambda(Y_r \mid Y_1)}{\Lambda(Y_r \mid Y_1)}\cdot\frac{N-m-p+1}{m-1}$$
 For stage two, the partial F test with 2 and 86 degrees of freedom for a model containing EyeHD and WDim is
$$F(\text{WDim} \mid \text{EyeHD}) = \frac{1-0.9355}{0.9355}\cdot\frac{86}{2} = 2.965$$
 The associated p-value = 0.057 for this partial F test and since it is
the most significant partial F test among the 5 tests, we will enter
WDim into the model.
 With WDim in the model, both variables still appear to be
significant so we do not remove a variable from the model.
 Next we perform the partial F tests for the remaining 4 variables
and select the one with the most significant partial F test.
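The partial Λ and partial F computation used at each stepwise stage is easy to script; a sketch (assuming scipy is available; the Λ values plugged in below are taken from the JMP reports above):

```python
from scipy import stats

def partial_f_test(lambda_full, lambda_reduced, m, N, p):
    """Partial Wilks' Lambda and its F test for the variable added to the reduced model.

    lambda_full    : Wilks' Lambda of the model including the candidate variable (p variables)
    lambda_reduced : Wilks' Lambda of the model without it (p - 1 variables)
    """
    lam_partial = lambda_full / lambda_reduced
    df1, df2 = m - 1, N - m - p + 1
    F = (1 - lam_partial) / lam_partial * df2 / df1
    return lam_partial, F, stats.f.sf(F, df1, df2)

# Step 2 of the football helmet example: EyeHD alone vs. EyeHD + WDim
print(partial_f_test(0.4003, 0.4279, m=3, N=90, p=2))
# approx. (0.9355, 2.96, p ~ 0.057), matching the report
```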
 The partial Wilks' test at step three (p = 3):

Test               Value      Approx. F  NumDF  DenDF  Prob>F
Wilks' Lambda      0.3382838  20.3810    6      170    <.0001*
Pillai's Trace     0.7215719  16.1801    6      172    <.0001*
Hotelling-Lawley   1.7791591  24.9082    6      168    <.0001*
Roy's Max Root     1.6734244  47.9715    3      86     <.0001*

$$\Lambda(Y_r \mid Y_1, Y_2) = \frac{\Lambda(Y_r, Y_1, Y_2)}{\Lambda(Y_1, Y_2)}$$
$$\Lambda(\text{Jaw} \mid \text{WDim}, \text{EyeHD}) = \frac{0.3383}{0.4003} = 0.8451, \qquad F(\text{Jaw} \mid \text{WDim}, \text{EyeHD}) = 7.791, \quad p\text{-value} = 0.0008$$
Discriminant Analysis: Column Selection
Columns In: 2    Columns Out: 4    Smallest P to Enter: 0.0007817    Largest P to Remove: 0.0569292

Column   F Ratio   Prob>F
WDim     2.964     0.0569292
Circum   0.981     0.3792667
FBEye    1.098     0.3381188
EyeHD    58.471    0.0000000
EarHD    3.239     0.0440779
Jaw      7.791     0.0007817
 So, at step 3 we admit Jaw to the model and then check to make certain that all three variables are still significant. If a variable is no longer significant we may opt to remove it from the model; this is the backward and forward aspect of the stepwise procedure.
Discriminant Analysis: Column Selection
Columns In: 3    Columns Out: 3    Smallest P to Enter: 0.0173374    Largest P to Remove: 0.0007817

Column   F Ratio   Prob>F
WDim     8.787     0.0003399
Circum   0.055     0.9462046
FBEye    0.189     0.8281935
EyeHD    49.211    0.0000000
EarHD    4.257     0.0173374
Jaw      7.791     0.0007817
 At step 4 it appears that EarHD is the only candidate to enter the
model.
 At step 4 we entered EarHD into the model, and all three variables previously entered into the model remain significant, so we do not remove any variables.

Discriminant Analysis: Column Selection
Columns In: 4    Columns Out: 2    Smallest P to Enter: 0.9965052    Largest P to Remove: 0.0173374

Column   F Ratio   Prob>F
WDim     11.237    0.0000474
Circum   0.003     0.9970883
FBEye    0.004     0.9965052
EyeHD    20.508    0.0000001
EarHD    4.257     0.0173374
Jaw      8.861     0.0003224
 Since neither of the two remaining variables is even close to significant, we stop with the 4-variable model.
 In the original discriminant functions both Circum and FBEye had nearly 0 weights, so they did not contribute much to the separation capability, and we drop them in our new reduced model.
 We next fit the reduced 4 variable model
Test               Value      Approx. F  NumDF  DenDF  Prob>F
Wilks' Lambda      0.3071512  16.8916    8      168    <.0001*
Pillai's Trace     0.7611056  13.0548    8      170    <.0001*
Hotelling-Lawley   2.0335002  21.0976    8      166    <.0001*
Roy's Max Root     1.9176137  40.7493    4      85     <.0001*
 Notice by Wilks' Λ that the reduced model appears more significant than the original 6-variable model. However, we still do not know if it classifies or separates any better than the full model.
Eigenvalue    Percent   Cum Percent   Canonical Corr
1.91761371    94.3011   94.3011       0.81071212
0.11588645    5.6989    100.0000      0.32225994
4.5614e-16    0.0000    100.0000      0
6.428e-17     0.0000    100.0000      0

Scoring Coefs
         WDim       EyeHD      EarHD      Jaw
Canon1   -0.944694  0.6489574  0.5061521  0.8336918
Canon2   -1.402812  -0.540168  0.38923    1.5393076
Canon3   0          0          0          0
Canon4   0          0          0          0
 In comparing the full and reduced linear discriminant functions, we
notice that the coefficients are not much different. Therefore, it is
unlikely that the new model will separate any more effectively than
the original or full model.
Scoring Coefs (full 6-variable model)
         WDim       Circum     FBEye      EyeHD      EarHD      Jaw
Canon1   -0.948423  0.0036399  0.0064396  0.6474831  0.5043609  0.8285351
Canon2   -1.406775  0.0005126  0.0286176  -0.54027   0.3839132  1.5288556
Canon3   0          0          0          0          0          0
Canon4   0          0          0          0          0          0
Canon5   0          0          0          0          0          0
Canon6   0          0          0          0          0          0

Scoring Coefs (reduced 4-variable model)
         WDim       EyeHD      EarHD      Jaw
Canon1   -0.944694  0.6489574  0.5061521  0.8336918
Canon2   -1.402812  -0.540168  0.38923    1.5393076
Canon3   0          0          0          0
Canon4   0          0          0          0
 On the left is the Mosaic plot of Actual vs. Predicted for the 6
variable model and on the right for the 4 variable model. Both plots
are virtually identical.
[Figure: Mosaic plots of Predicted Group vs. Actual Group for the 6-variable model (left) and the 4-variable model (right); the two plots are virtually identical]
 The procedure we have just demonstrated with the football helmet
data is often referred to as stepwise discriminant analysis.
 The title is a bit of a misnomer since we never compute any
discriminant functions during the stepwise procedure.
 Rather the procedure attempts to find the subset of the original k
variables or covariates which provides for significant separation
between the m group means.
 The procedure might be better named stepwise MANOVA.
 Once the subset of the k variables is found, then one can use this
subset to construct discriminant functions.
 There is confusion on this point, so be wary of the term stepwise
discriminant analysis and what it actually implies.
 Our discussion to this point has mostly focused on the construction
of linear discriminant functions.
 As noted the original purpose of discriminant analysis was not
classification into groups.
 Fisher developed discriminant analysis to provide a graphical
technique that could distinguish between multivariate groups very
much in the spirit of PCA and biplots (not yet invented).
 However, over time discriminant analysis and classification have become synonymous, and this does lead to some confusion.
 In general the linear discriminant functions are only used to build
graphical displays such as the biplots in JMP’s Discriminant
platform.
 Classification in general uses a different set of functions; unfortunately these functions are also often called discriminant functions.
 Fisher’s linear discriminant procedure is nonparametric in that it
makes no assumption about the distribution for each group other than
equal covariance structure.
 It can be shown that for 2 multivariate normal groups with equal covariance matrices, Fisher's linear discriminant functions are optimal for classification. If we depart from these assumptions, this optimality no longer holds.
 Therefore, the linear discriminant functions for 2 groups serve as
optimal linear classification functions – a better term for them.
 For two populations the linear discriminant or classification
function is very straightforward to use as a classification rule for new
sets of observations not used in creating the discriminant functions.
 Assuming that we have no prior knowledge of the probability that
an observation comes from one population or the other the
classification rule is straightforward.
 Assign the new observation vector Y0 (row of a data table) to group
1 if the discriminant function value Z is greater than the midpoint of
the mean discriminant scores for the two groups.
$$Z_0 > \frac{1}{2}\left(\bar{Z}_1 + \bar{Z}_2\right)$$
 $Z_0$ is calculated from the linear discriminant function
$$Z_0 = a'Y_0 = (\bar{Y}_1 - \bar{Y}_2)'S_{pool}^{-1}Y_0$$
 We also have that
$$\frac{1}{2}\left(\bar{Z}_1 + \bar{Z}_2\right) = \frac{1}{2}\left(a'\bar{Y}_1 + a'\bar{Y}_2\right) = \frac{1}{2}\left[(\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}\bar{Y}_1 + (\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}\bar{Y}_2\right] = \frac{1}{2}(\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}(\bar{Y}_1+\bar{Y}_2)$$
 The classification rule is to assign $Y_0$ to group 1 if
$$Z_0 > \frac{1}{2}(\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}(\bar{Y}_1+\bar{Y}_2)$$
 and to assign $Y_0$ to group 2 if
$$Z_0 < \frac{1}{2}(\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}(\bar{Y}_1+\bar{Y}_2)$$
 In the very rare case of equality, randomly assign $Y_0$ to either group.
 The rule implicitly assumes a priori that the probability that the
new observations came from either group is equal or simply 0.5. If
we have prior information that favors one group or the other in terms
of classification, then we can modify the classification rule to take
this new information into account.
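A compact sketch of this two-group rule (assuming numpy is available; the centroids and pooled covariance are the steel-example values computed earlier in the notes):

```python
import numpy as np

def classify_two_groups(Y0, ybar1, ybar2, S_pool):
    """Assign Y0 to group 1 or 2 using Fisher's linear rule with equal priors."""
    a = np.linalg.solve(S_pool, ybar1 - ybar2)      # a = S_pool^{-1}(ybar1 - ybar2)
    z0 = a @ Y0                                     # discriminant score of the new observation
    cutoff = 0.5 * a @ (ybar1 + ybar2)              # midpoint of the two mean scores
    return 1 if z0 > cutoff else 2

ybar1 = np.array([36.4, 62.6])
ybar2 = np.array([39.0, 60.4])
S_pool = np.array([[7.92, 5.68], [5.68, 6.29]])
print(classify_two_groups(np.array([40.0, 63.0]), ybar1, ybar2, S_pool))
```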
 We illustrate the classification rule using the temperature example covered earlier. Recall there are two groups, Temp A and Temp B.
 The classification rule is to assign an observation to group 1 (Temp A) if
$$Z_0 > \frac{1}{2}\left(\bar{Z}_A + \bar{Z}_B\right) = \frac{1}{2}(54.94 + 46.69) = 50.82$$
 Or, group 2 otherwise.
 Next we assume that both groups have a normal distribution with
the same standard deviation, but different means.
 To illustrate the classification we simulated 500 values of Z for
each group assuming normal distributions.
 The Graph Builder display on the next slide illustrates the two Z
distributions created by the linear discriminant. One can see that the
classification rule is quite intuitive.
[Figure: Graph Builder display of the two simulated Z distributions (Temp A and Temp B) with the classification cutoff between them]
 Before discussing the m > 2 groups case, suppose we only have
two groups, but we have prior probabilities p1 and p2 that an
observation may belong to either group.
 In order to use this information we need to assume a probability distribution for each group. The natural choice is to assume that each group is multivariate normal with the same covariance matrix Σ.
 With some algebra, Rencher (2002) shows that the asymptotically optimal classification rule (assuming multivariate normality with equal covariance matrices) reduces to: assign $Y_0$ to group 1 if
$$Z_0 > \frac{1}{2}(\bar{Y}_1-\bar{Y}_2)'S_{pool}^{-1}(\bar{Y}_1+\bar{Y}_2) + \ln\!\left(\frac{p_2}{p_1}\right)$$
 Obviously if p1 = p2 (we say the priors are uniform) then the
equation reduces to the classification rule given previously.
 Example: Suppose for the steel processing example given earlier we have prior probabilities pA = 0.7 and pB = 0.3 that a sample of steel was processed at one of the two temperatures.
 Our classification criterion becomes: assign to Temp A if
$$Z_0 > \frac{1}{2}\left(\bar{Z}_1 + \bar{Z}_2\right) + \ln\!\left(\frac{p_2}{p_1}\right) = \frac{1}{2}(54.94 + 46.69) + \ln\!\left(\frac{0.3}{0.7}\right) = 49.97$$
 Suppose we test a new sample of steel without knowledge of which
temperature it was processed at and the values are Yield = 40 and
Ultimate = 63. Using the linear discriminant function estimated
earlier, Z0 = 49.76; therefore we assign the sample to temperature B.
Why?
 Notice that we have assigned the sample to B, but the discriminant
value is close to the cutoff and we are not very confident in our
classification.
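A quick numeric check of the prior-adjusted cutoff and of the score for the new sample (a sketch assuming numpy is available):

```python
import numpy as np

zbar_A, zbar_B = 54.94, 46.69
pA, pB = 0.7, 0.3

cutoff = 0.5 * (zbar_A + zbar_B) + np.log(pB / pA)
print(round(cutoff, 2))                 # approx. 49.97

a = np.array([-1.6426, 1.8327])         # linear discriminant coefficients from earlier
z0 = a @ np.array([40.0, 63.0])         # new sample: Yield = 40, Ultimate = 63
print(round(z0, 2))                     # approx. 49.76, just below the cutoff -> Temp B
```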
 For more than two groups we can take a different approach to
classification based upon the distance of a new observation from the
estimated centroids of each group and the posterior probability that a
new observation belongs to each group. Note this procedure can be
used for only two groups and is analogous to Fisher’s procedure.
 We will once again assume that the observations come from distributions that are multivariate normal with the same covariance matrix Σ and possibly different centroids.
 We wish to develop a rule that assigns or classifies a new
observation to a group based upon the highest posterior probability
that the observation came from that distribution.
 We assume that we have k variables that are measured on m groups
that we designate π1, π2, … , πm. Associated with each of the m
groups is a prior probability of membership p1, p2, … , pm. If the
priors are equal then we say they are uniform, as discussed earlier.
 The squared Mahalanobis distance of an observation vector $Y_0$ from the centroid of the $i$th group is
$$D_i^2(Y_0) = (Y_0-\bar{Y}_i)'S_{pool}^{-1}(Y_0-\bar{Y}_i) = Y_0'S_{pool}^{-1}Y_0 - 2\bar{Y}_i'S_{pool}^{-1}Y_0 + \bar{Y}_i'S_{pool}^{-1}\bar{Y}_i$$
 Notice that the first term on the right hand side is not a function of i
and can be ignored for classification – it is constant for all groups.
 Rencher (2002) shows that an optimal linear classification function is then
$$L_i(Y_0) = \bar{Y}_i'S_{pool}^{-1}Y_0 - \frac{1}{2}\bar{Y}_i'S_{pool}^{-1}\bar{Y}_i$$
 Assign the observation vector $Y_0$ to the group for which $L_i$ is a maximum, which is the same group for which $D_i^2$ is smallest.
 However this simple linear classification rule does not allow us to
make use of our prior information about the probabilities of
belonging to the different groups.
 If we assume that we have prior probabilities of membership for
each of the m groups, then our linear classification rule is simply
modified to incorporate this information.
 Let pi be the prior probability of membership in the ith group, then
we have the classification rule (Rencher, 2002)
$$L_i(Y_0) = \ln p_i + \bar{Y}_i'S_{pool}^{-1}Y_0 - \frac{1}{2}\bar{Y}_i'S_{pool}^{-1}\bar{Y}_i$$
 Again assign Y0 to the group for which the function is maximized.
 Assuming multivariate normal distributions with equal covariance
the rule is optimal in terms of misclassification errors.
 However, more is possible than simply using the prior probabilities to adjust the classification functions.
 Basically what we wish to do is estimate the probability that Y0
came from a particular group given we have observed the data and
we call this the posterior probability of membership since it is
estimated after we observe the data.
 We will introduce a concept called Bayes Rule to estimate the
posterior probabilities given the data and prior probabilities.
 Our classification rule will then be based upon the posterior
probability of membership in each group.
 We assign the observation vector Y0 to the group for which the
posterior probability of membership is highest.
 Note this is the rule JMP uses in the Discriminant platform to
classify observations.
 Assume that $f_i(Y)$ represents the probability density function for the $i$th group, with prior probability $p_i$; then using Bayes' Rule the posterior probability for the $i$th group is
$$P(\pi_i \mid Y_0) = \frac{P(\pi_i \text{ and } Y_0)}{P(Y_0)} = \frac{p_i\, f_i(Y_0)}{\sum_{j=1}^{m} p_j\, f_j(Y_0)}$$
 For general probability distributions these posterior probabilities may be very difficult or impossible to calculate in closed form. However, for the multivariate normal distribution they are straightforward.
 Recall the estimated density function for the multivariate normal:
$$f_i(Y_0) = \frac{1}{(2\pi)^{k/2}\,|S_{pool}|^{1/2}}\exp\!\left[-\tfrac{1}{2}(Y_0-\bar{Y}_i)'S_{pool}^{-1}(Y_0-\bar{Y}_i)\right] = \frac{1}{(2\pi)^{k/2}\,|S_{pool}|^{1/2}}\exp\!\left[-\tfrac{1}{2}D_i^2(Y_0)\right]$$
 For the multivariate normal, our expression for the posterior probabilities, if we assume an equal covariance matrix, becomes
$$P(\pi_i \mid Y_0) = \frac{p_i\exp\!\left[-0.5\,D_i^2(Y_0)\right]}{\sum_{j=1}^{m} p_j\exp\!\left[-0.5\,D_j^2(Y_0)\right]}$$
 Furthermore, if the priors are uniform and equal to a value p, they
also drop out of the expression. The above expression is how JMP
calculates the posterior probabilities of membership.
 The observation vector Y0 is then assigned to the group for which
the posterior probability of membership is highest.
 Example: We will use the dataset Iris.JMP to demonstrate the impact of prior probabilities on the classification of observations to m groups. This is the classic dataset that R.A. Fisher first used to demonstrate the concept of discriminant analysis. The data consist of measurements on 150 Iris plants for which the species is known. The goal is to estimate a discriminant function to separate the three species and to classify them based on the measurements. Below is a partial view of the data table.
 We first show the linear discriminant functions. These functions are used to generate the biplot and are not directly used for classification.
 In the biplot the confidence ellipsoids for each centroid are based on a multivariate normal distribution.
 We next show a partial view of the calculated probabilities for each
group under the assumption that the prior probabilities are equal
(uniform). Note there are 3 misclassifications out of 150.
Discriminant Scores (uniform priors): Number Misclassified = 3, Percent Misclassified = 2, -2LogLikelihood = 6.56

Row  Actual      SqDist(Actual)  Prob(Actual)  -Log(Prob)  Predicted     Prob(Pred)
71   versicolor  8.66970         0.2532        1.373       * virginica   0.7468
73   versicolor  4.87619         0.8155        0.204       versicolor    0.8155
78   versicolor  4.66698         0.6892        0.372       versicolor    0.6892
84   versicolor  8.43926         0.1434        1.942       * virginica   0.8566
120  virginica   8.19641         0.7792        0.249       virginica     0.7792
124  virginica   3.57858         0.9029        0.102       virginica     0.9029
127  virginica   3.90184         0.8116        0.209       virginica     0.8116
128  virginica   3.31470         0.8658        0.144       virginica     0.8658
130  virginica   9.08495         0.8963        0.109       virginica     0.8963
134  virginica   7.23593         0.2706        1.307       * versicolor  0.7294
135  virginica   15.83301        0.9340        0.068       virginica     0.9340
139  virginica   4.09385         0.8075        0.214       virginica     0.8075
('*' indicates misclassified)

Counts: Actual Rows by Predicted Columns
            setosa  versicolor  virginica
setosa      50      0           0
versicolor  0       48          2
virginica   0       1           49
 Now suppose we have prior information about the occurrence of
the three species from the population where the sample was
collected. Suppose we know that 15% are Setosa, 25% are
Versicolor, and 60% are Virginica.
You can specify your own prior probabilities by selecting the "Specify Prior" option in the main menu of the Discriminant Analysis report window.
 Notice that the posterior probabilities are changed once we specify
the nonuniform priors. We now have 4 misclassifications. However
our priors in this case were completely arbitrary and in practice are
hopefully based on scientific understanding.
[Posterior probabilities table under the specified nonuniform priors]
 You can save the posterior probabilities to the data table by
selecting the option “Score Options” and then the option “Save
Formulas”.
 The posterior probabilities are stored in the data table in the format
shown below. Here we assume uniform priors for the three groups.
 The column Prob[0] represents the denominator in the posterior
probability calculation using Bayes rule. Note that the prior
probabilities are factored into the SqDist[ ] functions as the term
-2ln(pi), which is equivalent to the posterior probability formula
shown earlier.
[Saved formula columns: Prob[0], Prob[setosa], Prob[versicolor], Prob[virginica]]
 Example: We will use the OwlDiet.JMP data. Below are the results of the analysis using uniform priors for the seven species. Notice that with uniform priors we have 13.97% misclassified.
 Example: With priors proportional to occurrence in the sample we
reduce the misclassification percentage.
 Example: Let's revisit the football helmet data. Recall we earlier examined the data using the biplots from the linear discriminant functions. We now reexamine the data looking at linear classification instead, again assuming uniform priors on class membership. Notice that 24 rows have been misclassified, or 26.67%. The highlighted rows in the table are misclassifications.
 Example: From the Mosaic plot of predicted group membership
vs. actual membership, we can see that significant misclassification
occurs for the NonFB and CollFB groups.
[Figure: Mosaic plot of Predicted Group vs. Actual Group]
 The ROC curves plot P(Correct Classification) on the Y axis vs. P(Incorrect Classification) on the X axis, for each group. A perfect classifier has Sensitivity = 1.0.
[Figure: Receiver Operating Characteristic curves by group]
Group    Area
CollFB   0.8681
HSFB     0.9750
NonFB    0.8383
 Recall that linear classification (discriminant) analysis makes an
assumption of multivariate normal distributions for each of the m
groups and assumes that all groups share a common covariance
structure or matrix.
 A variation of classification analysis exists, where one does not
assume equal covariance structure for the groups and this version is
referred to as quadratic discriminant analysis.
 The boundaries between the groups in quadratic discriminant
analysis are literally quadratic in shape, hence the term quadratic.
 The application of quadratic discriminant classification is analogous
to the linear version except the discriminant score functions are more
complicated and have more parameters to estimate.
 In general, linear discriminant analysis uses simpler functions but can be biased if the equal covariance assumption is invalid. Quadratic discriminant functions require more parameters and are therefore more variable.
 The quadratic discriminant function for the $i$th group, assuming one is interested in classification, can be shown to be (we omit considerable mathematical detail)
$$Q_i(Y_0) = (Y_0-\bar{Y}_i)'S_i^{-1}(Y_0-\bar{Y}_i) + \ln|S_i| - 2\ln p_i$$
 The rule is to assign the observation to the group for which $Q_i$ is a minimum.
 Notice that the first term on the right hand side is just the squared Mahalanobis distance for the observation vector $Y_0$, computed with the group's own covariance matrix $S_i$. Also notice that the discriminant score is weighted by the determinant of the covariance matrix; larger covariance matrices have a larger penalty applied to the score.
 The quadratic classification rules cannot be reduced to linear
functions.
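A sketch of the quadratic rule (assuming numpy is available; each group now contributes its own covariance matrix and prior, and the values below are hypothetical placeholders):

```python
import numpy as np

def quadratic_scores(Y0, centroids, covariances, priors):
    """Quadratic discriminant scores; assign Y0 to the group with the smallest score."""
    scores = []
    for ybar, S, p in zip(centroids, covariances, priors):
        d = Y0 - ybar
        mahal2 = d @ np.linalg.solve(S, d)                 # (Y0 - ybar)' S_i^{-1} (Y0 - ybar)
        scores.append(mahal2 + np.log(np.linalg.det(S)) - 2 * np.log(p))
    return np.array(scores)

# Hypothetical two-group, two-variable example:
centroids = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covariances = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
scores = quadratic_scores(np.array([1.5, 0.5]), centroids, covariances, [0.5, 0.5])
print(scores, scores.argmin())    # smallest score wins
```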
 We illustrate quadratic discriminant classification with the football
helmet data. In the Discriminant platform in JMP Quadratic
discriminant analysis is one of the options available.
 First let’s examine the three covariance matrices to see if there may
be differences – no straightforward tests exist for equal covariance
structure and we will rely on visual assessment.
Multivariate Group=CollFB, Covariance Matrix
         Circum    FBEye     EyeHD     EarHD     Jaw
Circum   2.87868   0.92928   0.19468   0.09354   0.30833
FBEye    0.92928   0.55206   -0.06338  -0.00053  0.12813
EyeHD    0.19468   -0.06338  1.15200   0.08697   -0.15703
EarHD    0.09354   -0.00053  0.08697   0.57016   -0.00791
Jaw      0.30833   0.12813   -0.15703  -0.00791  0.37702

Multivariate Group=HSFB, Covariance Matrix
         Circum    FBEye     EyeHD     EarHD     Jaw
Circum   4.21346   1.43068   0.77888   0.86021   0.72031
FBEye    1.43068   0.70553   0.21049   0.41351   0.23305
EyeHD    0.77888   0.21049   1.08764   0.54023   0.17529
EarHD    0.86021   0.41351   0.54023   0.89195   0.08218
Jaw      0.72031   0.23305   0.17529   0.08218   0.47816

Multivariate Group=NonFB, Covariance Matrix
         Circum    FBEye     EyeHD     EarHD     Jaw
Circum   2.39114   0.69997   0.98455   0.06645   0.48666
FBEye    0.69997   0.38024   0.08331   -0.02721  0.11617
EyeHD    0.98455   0.08331   1.45775   0.31706   0.10915
EarHD    0.06645   -0.02721  0.31706   0.39206   -0.04689
Jaw      0.48666   0.11617   0.10915   -0.04689  0.27137

Some differences do seem to exist in the sample covariance structures for the groups. For purposes of example we will assume the three matrices are different.
 Remember the linear discriminant functions are unchanged by
selecting quadratic discriminant analysis and as a result the biplot is
unchanged. It is the classification probabilities where we see the
difference.
[Posterior classification reports: Quadratic vs. Linear]
 The key difference in the classification algorithm for quadratic discriminant (classification) analysis is that we assume a different covariance matrix in the probability calculation for each of the groups.
 Recall in linear discriminant analysis we use an overall pooled
covariance matrix for the probability calculations in all of the groups.
 However, both methods do rely on an assumption that each group
follows a multivariate normal distribution.
 In general, both methods are robust to the normal assumption
(quadratic classification is more sensitive), however the equal
covariance assumption is problematic in many cases for linear
discriminant analysis.
 If the equal covariance assumption is invalid then the linear
procedure is often quite biased in terms of correct classification.
 Unfortunately the quadratic procedure requires the estimation of
more parameters and the classification formulas are more variable.
 A compromise classification procedure can be used between the
linear and quadratic methods.
 The method due to Friedman (1988) attempts to find a compromise
between the bias of linear classification and the added variability of
quadratic classification functions.
 The method is often referred to as regularized discriminant analysis, although once again it is a classification procedure that is quite different from Fisher's discriminant analysis.
 We will not delve into the mathematical details of Friedman's method; however, a copy of his paper can be found online.
 The key to his method is to find values for two tuning parameters, λ and γ, which are used to create a regularized covariance matrix for each of the groups; choosing these values is difficult in practice, and we omit the details. JMP implements the method, but you have to supply the values of the two constants.