
Mixture Discriminant Analysis for Sparse Data Matrices
Kurt Salmela
Maritz Research
MART 2012
1
Mixture Discriminant Analysis for Sparse Data Matrices
                                                                     Page
 What it is and when to use it . . . . . . . . . . . . . . . . . .  3
 Mixture discriminant analysis (MDA) . . . . . . . . . . . . . . .  5
 Extending the MDA model . . . . . . . . . . . . . . . . . . . . . 23
    Ridge regression
    Lasso
    Elastic net
 Putting it all together -- Sparse or Flexible MDA . . . . . . . . 28
 Example in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
 Scoring new cases in Excel . . . . . . . . . . . . . . . . . . . . Not Shown Today
2
Mixture Discriminant Analysis for Sparse Data Matrices
What it is
 Conceptually: Non-linear discriminant analysis with a
subset of predictor variables found automatically.
 Technically, a combination of:
 Mixture discriminant analysis --- the non-linear aspect
 Ridge regression and Lasso – parameter shrinkage for
parameter stabilization and to find a good subset of the original
variables
3
Mixture Discriminant Analysis for Sparse Data Matrices
When to use it
• When predicting groups and one or more of the following apply:
– You want better prediction accuracy
– You want a good subset of predictors
– You have a large p, small n situation
– There may not be enough training cases for tree-based models
– You need a way to score new data in Excel or a programming language rather than a black-box predictor like Random Forests
4
Mixture Discriminant Analysis
5
Problems with Simple Linear Discriminant Analysis
 Often has low prediction accuracy when the class
boundaries are complex and nonlinear (under-fits).
 Can yield unstable results in the presence of highly
correlated predictor variables (over-fits).
 Doesn’t do well with skewed distributions in the predictors.
 Can’t reliably estimate a covariance matrix in large p small
n situations.
6
What We’d Like to Do
 Model irregular boundaries to obtain higher prediction
accuracy.
 E.g., bring in some non-linear capability.
 An early form of non-linear discriminant analysis was quadratic discriminant analysis (QDA): simple, but limited to the exponential form, and it adds more parameters.
 Better would be more flexible non-linear forms, even nonparametric classifiers.
7
How to Get There?
First Need to Generalize Linear Discrim
 Involves re-casting Linear Discriminant Analysis (LDA) as a linear
regression problem.
 It has been shown that LDA can be performed by a:
1. sequence of linear regressions
2. followed by an LDA or other classification method on the fitted values from
the multivariate regression fits.1
Regression of this (the indicator matrix Y, classes coded 0/1) on X yields the predicted values of Y (Y'). LDA of Y' = regular LDA on the original data.

Y                         X                      Y' (fitted values)
Group2    0 1 0           x1 x2 x3 x4 x5          .3   .5   .2
Group2    0 1 0           x1 x2 x3 x4 x5          .1  1.2   .3
Group1    1 0 0           x1 x2 x3 x4 x5          .5   .1   .1
Group3    0 0 1           x1 x2 x3 x4 x5          .3   .2  1.4
8
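A minimal sketch of this regression-then-LDA idea in R (my own illustration on the iris data, not code from any of the packages discussed later; MASS::lda stands in for the final classification step):

library(MASS)
Y <- model.matrix(~ Species - 1, data = iris)     # classes coded as a 0/1 indicator matrix
X <- as.matrix(iris[, 1:4])
fit  <- lm(Y ~ X)                                 # the sequence of linear regressions (one multivariate fit)
Yhat <- fitted(fit)                               # the fitted values Y'
# LDA on the fitted values; one column is dropped because the rows of Yhat sum to 1
ldaFit <- lda(Yhat[, 1:2], grouping = iris$Species)

Per the result cited above, classifying with LDA on these fitted values reproduces regular LDA on the original data.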
Linear Discriminant Analysis (For Review)
[Plot: two groups, Group A and Group B, separated by a single linear discriminant boundary.]
9
Mixture Discriminant Analysis (MDA)2,3
What it Does
1. Divides each of the original groups/classes into two or more sub-classes (typically via k-means).
2. Fits discriminant functions through each set of sub-classes, yielding a probability of membership in each sub-class for each subject/respondent.
3. Follows with the same type of regularized discriminant or optimal scoring analysis (think canonical correlation) of this blurred response matrix. See next page.
10
[Diagram: Class A divided into sub-classes A1, A2, A3; Class B divided into sub-classes B1, B2, B3.]
There can be a different number of sub-classes in each class.
Mixture Discriminant Analysis (MDA)
What it Does
Regularized LDA formulation with the Y classes coded as a series of 1-0 variables:

Y                         X
Group2    0 1 0           x1 x2 x3 x4 x5
Group2    0 1 0           x1 x2 x3 x4 x5
Group1    1 0 0           x1 x2 x3 x4 x5
Group3    0 0 1           x1 x2 x3 x4 x5

Steps 1 and 2 yield the blurred response matrix Z (this step could use a nonparametric or nonlinear approach). Each row of Z holds the sub-class membership probabilities, and the sub-class probabilities sum to 1:

Z                                                      X
Group2    0   0   0   .2  .5  .3   0   0   0   0      x1 x2 x3 x4 x5
Group2    0   0   0   .7  .1  .2   0   0   0   0      x1 x2 x3 x4 x5
Group1   .4  .1  .5    0   0   0   0   0   0   0      x1 x2 x3 x4 x5
Group3    0   0   0    0   0   0   .2  .6  .1  .1     x1 x2 x3 x4 x5
Step 3: Starting over now using this Z and X matrix instead of the original Y and
X matrix yields the final probabilities of sub-class membership across all classes.
Assign subject to the class with the highest resulting sub-class probability.
11
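A small hedged illustration of steps 1-3 in R, using the mda package that is introduced later (the iris data and 3 sub-classes per class are arbitrary choices for the sketch):

library(mda)
mdaFit <- mda(Species ~ ., data = iris, subclasses = 3)   # sub-classes found automatically, then the regularized fit
mdaFit                                                    # prints the training misclassification error
predict(mdaFit, iris[1:5, ], type = "posterior")          # class membership probabilities for a few cases
predict(mdaFit, iris[1:5, ])                              # final class assignments (highest probability)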
MDA Assumptions
 Normal distribution for each sub-class
 Several papers have shown this is not critical. In the presence of skewed
predictors, MDA has been shown to predict better than linear discriminant
analysis, logistic regression, and linear logistic regression of ranks,
especially when kurtosis is small. 4,5
 Equal covariance matrices for the sub-classes within a
class.
 As implemented in the mda and sparseLDA packages in R, to reduce the
number of parameters needed for estimation, and for the generalizations
implemented in sparseLDA.
 Models have been developed for sub-classes with different covariances
 Don’t need equal covariance matrices across the higher
level groups.
12
A Pictorial Two Group Example
[Scatter plot of two groups on predictors x1 and x2.]
13
Simple Linear Discriminant Model
[Scatter plot of the two groups on x1 and x2 with the fitted linear discriminant boundary.]
Model accuracy on hold-out: 70%
14
Divide the Two Groups into Five Each
[Scatter plot showing each of the two groups divided into five sub-classes.]
15
Do a Regularized (Regressionized) Discriminant Analysis Within Each Set of Five Sub-Classes
[Scatter plot showing discriminant boundaries fit among the sub-classes.]
16
Put it all Together with the Step 3 Final Model
[Scatter plot showing the combined sub-class discriminant model on x1 and x2.]
17
Envision How the Separate Linear Discriminant Functions Could Conceivably Be Thought of as Non-Linear via Piecewise Linear Segments
[Scatter plot: MDA with 5 sub-classes per original class, showing a piecewise-linear class boundary.]
Model accuracy on hold-out: 78% (linear was 70%).
18
Analyst Decisions for MDA
 How to divide the groups?
 The R mda package does k-means or LVQ (learning vector quantization).
 How many sub-groups per group?
 There is no agreed-upon best method.
 Typically use cross-validation to find the highest prediction accuracy on hold-out samples.
 Can use the caret package in R to try many levels easily (see the sketch below).
19
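A hedged sketch of that tuning step, assuming caret's built-in "mda" model type (its tuning parameter is the number of sub-classes) and using the iris data as a stand-in:

library(caret)
library(mda)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)              # 5-fold cross-validation
mdaTune <- train(Species ~ ., data = iris,
                 method = "mda",                             # caret's wrapper around mda::mda
                 tuneGrid = expand.grid(subclasses = 1:5),   # try 1 to 5 sub-classes per class
                 trControl = ctrl)
mdaTune                                                      # cross-validated accuracy for each setting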
Potential Bias With MDA
 Recall that with regular LDA, the goal is to maximize the variance between classes while minimizing the variance within each class.
 MDA maximizes the variance between all sub-classes while minimizing the variance within each. This may bias towards a solution that maximizes the variance between sub-classes of the same class.
 Mixture Subclass Discriminant Analysis has been proposed, which, in simple terms, emphasizes the spread between sub-classes in different classes.6
 Mixture Subclass Discriminant Analysis hasn't caught on yet (it is new; 2011 paper). MDA, even with its bias, is currently very popular in many fields of study due to its prediction accuracy, conceptual simplicity, and speed.
 Some people don't see the bias as an issue, especially in gene research and toxicology, where the sub-classes may reflect reality better than the higher-level classes.
20
Extending the MDA Model
21
Optional add-ons: Penalized Discriminant Analysis7
via Ridge Regression and the Lasso
 Why?
 Shrinking/flattening the parameters can result in more stable prediction
models when there are highly correlated predictors. (Ridge regression.)
 Gets rid of variables that don't do much; selects a good subset. (Lasso.)
 Ability to handle large p, small n situations.
 Would be nice to achieve both of these goals while obtaining
the discriminant functions.
 Can be done in the M step of methods that can be implemented via EM
 Easily done with the R sparseLDA package
22
Ridge Regression
 Starts from the assumption that lower variance in the parameters is
desirable. Probably came from situations with high multicollinearity.
 Flattens or shrinks the parameters to be more equal.
 Introduces a little prediction bias so as to achieve more stable
estimates.
$$\hat{\beta}^{\,\text{ridge}} = \arg\min_{\beta}\;\Bigg\{\underbrace{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{regular regression part}} \;+\; \underbrace{\lambda\sum_{j=1}^{p}\beta_j^{2}}_{\text{ridge regression penalty}}\Bigg\}$$
Tuning parameter lambda: how to set it? Set it according to what you're trying to accomplish (do you want more averaging/flattening of the betas or less?), typically using cross-validation trial and error, ideally with more than one hold-out sample. The caret package in R can try many levels and check where RMSE is lowest.
23
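One hedged way to see ridge shrinkage in action is the glmnet package (a plain regression example, separate from the smda workflow shown later); alpha = 0 gives the pure ridge penalty and cv.glmnet picks lambda by cross-validation:

library(glmnet)
x <- as.matrix(mtcars[, -1])            # predictors from a built-in example data set
y <- mtcars$mpg                         # response
cvRidge <- cv.glmnet(x, y, alpha = 0)   # alpha = 0: ridge penalty only
cvRidge$lambda.min                      # lambda with the lowest cross-validated error
coef(cvRidge, s = "lambda.min")         # the shrunken (flattened) coefficients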
The Lasso
$$\hat{\beta}^{\,\text{lasso}} = \arg\min_{\beta}\;\Bigg\{\underbrace{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{regular regression part}} \;+\; \underbrace{\lambda\sum_{j=1}^{p}\lvert\beta_j\rvert}_{\text{lasso penalty}}\Bigg\}$$
Tuning parameter: how to set it? Trial and error, balancing prediction accuracy against the number of predictor variables desired. Numerous "best" methods have been proposed; for a review see 8.
 Note the similarity to ridge regression.
 Setting the lasso tuning parameter high enough will force some
betas to zero, reducing the number of predictor variables.
24
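The same kind of hedged glmnet sketch for the lasso; alpha = 1 gives the pure lasso penalty, and raising lambda forces more betas to exactly zero:

library(glmnet)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
lassoFit <- glmnet(x, y, alpha = 1)     # lasso fit over a path of lambda values
coef(lassoFit, s = 0.5)                 # at a moderate lambda, several betas are exactly zero
cvLasso <- cv.glmnet(x, y, alpha = 1)   # or pick lambda by cross-validation
coef(cvLasso, s = "lambda.1se")         # a sparser model within one standard error of the best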
The Elastic Net
 Combine ridge regression and the lasso. Zou and Hastie, 2005 9
 A compromise between the two to get the shrinkage in the parameters
of ridge regression for more stable prediction accuracy and the variable
selection of the lasso.
$$\hat{\beta}^{\,\text{elastic net}} = \arg\min_{\beta}\;\Bigg\{\underbrace{\sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}}_{\text{regular regression part}} \;+\; \underbrace{\lambda_2\sum_{j=1}^{p}\beta_j^{2}}_{\text{ridge regression penalty}} \;+\; \underbrace{\lambda_1\sum_{j=1}^{p}\lvert\beta_j\rvert}_{\text{lasso penalty}}\Bigg\}$$
Two tuning parameters to set.
R package caret helps you try different
values easily.
25
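A hedged glmnet sketch of the compromise. Note that glmnet parameterizes the penalty with a single lambda plus a mixing weight alpha between the ridge (alpha = 0) and lasso (alpha = 1) parts, rather than the two separate lambdas written above:

library(glmnet)
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
cvEnet <- cv.glmnet(x, y, alpha = 0.5)   # half ridge, half lasso; lambda picked by cross-validation
coef(cvEnet, s = "lambda.min")           # some overall shrinkage plus some betas set exactly to zero

The caret package (method = "glmnet") can tune alpha and lambda jointly.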
Putting it All Together
Mixture discriminant analysis (MDA)
+ ridge regression
+ the lasso          (ridge regression + the lasso = the elastic net)
= Sparse mixture discriminant analysis
26
Sparse mixture discriminant analysis
 Available in the smda function in the R package sparseLDA 10
 One form of "Flexible Discriminant Analysis" in the book Elements of Statistical Learning – Data Mining, Inference, and Prediction, Hastie, T., Tibshirani, R., and Friedman, J., 2nd edition, Springer, 2008.
Notes:
 For MDA by itself (no ridge, no lasso), you can also use the mda function in the mda R package. Or, alternatively, use the smda function in the sparseLDA package and set the tuning parameters for ridge regression and the lasso to zero.
 For sparse LDA by itself (no MDA), use the sda function in sparseLDA: it combines regular linear discriminant analysis with ridge regression and the lasso.
27
Example in R using the Iris data set
library(mda)         # load package that contains the example iris data set
library(sparseLDA)   # load the package with the smda function

# create dummy variables for the Y dependent variable
iris$Y1 <- ifelse(iris$Species=="setosa",1,0)
iris$Y2 <- ifelse(iris$Species=="versicolor",1,0)
iris$Y3 <- ifelse(iris$Species=="virginica",1,0)

head(iris)           # show a few rows to see what the data looks like now
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Y1 Y2 Y3
1          5.1         3.5          1.4         0.2  setosa  1  0  0
2          4.9         3.0          1.4         0.2  setosa  1  0  0
3          4.7         3.2          1.3         0.2  setosa  1  0  0
4          4.6         3.1          1.5         0.2  setosa  1  0  0
5          5.0         3.6          1.4         0.2  setosa  1  0  0
6          5.4         3.9          1.7         0.4  setosa  1  0  0

The dependent variable needs to be coded as dummy variables.
28
Example in R
# divide data into training and holdout at a 75/25 mix
set.seed(12345)
positions <- sample(nrow(iris), size=floor(nrow(iris)*.75))
training  <- iris[positions,]
testing   <- iris[-positions,]

Xtr  <- as.matrix(training[,1:4])   # define the X matrix, saves having to name all the predictors
Xtst <- as.matrix(testing[,1:4])    # do the same for the holdout test data
Ytr  <- as.matrix(training[,6:8])   # define the Y matrix from the dummy columns
Ytst <- as.matrix(testing[,6:8])

#---------------------------------------------------------------------------------------------------
# important! first normalize the predictors to mean of zero and unit length.
# the math assumes no intercept.
#---------------------------------------------------------------------------------------------------
Xtr2 <- normalize(Xtr)
Xn   <- Xtr2$Xc                     # retrieve the normalized scores from object Xc
29
Example in R
# do the sparse mixture discriminant analysis
smdaFit <- smda(
  x = Xn,
  y = Ytr,
  Rj = c(2,2,2),    # number of sub-groups for each group: here 2 sub-groups in each of the original 3 groups
  lambda = 1e-6,    # the lambda-2 ridge penalty in the elastic net formula
  stop = -2,        # a positive value is the lambda-1 lasso penalty; a negative value means the number of
                    # variables to stop at (difficult to get it to stop at an exact number)
  maxIte = 10,
  trace = TRUE,
  tol = 1e-2)       # another stopping criterion, the change in RSS; watch the trace output if it is taking too long
30
Example in R
# Done. Now we can test the model on a holdout sample.
Xtstn <- normalizetest(Xtst, Xtr2)   # normalize the holdout data to the training set means and vector lengths
testout <- predict(smdaFit, Xtstn)   # run the model on the test/holdout data

testout$class                        # show the predicted classes
 [1] Y1 Y1 Y1 Y1 Y1 Y1 Y1 Y1 Y1 Y1 Y1 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y2 Y3 Y3
[26] Y3 Y3 Y3 Y3 Y3 Y3 Y3 Y3 Y3 Y3 Y3 Y2 Y3
Levels: Y1 Y2 Y3

# set up to compare the predicted class to the actual class
predicted.species <- ifelse(testout$class=="Y1", "setosa",
                     ifelse(testout$class=="Y2", "versicolor",
                     ifelse(testout$class=="Y3", "virginica", "Unknown")))
31
Example in R
confusion(predicted.species, testing$Species)   # show the confusion matrix

            true
Predicted    setosa versicolor virginica
  setosa         11          0         0
  versicolor      0         12         1
  virginica       0          0        14
attr(,"error")
[1] 0.02631579                                  # 97% match

Note: A linear model does just as well on this data set.
32
Classifying/Scoring New Observations using Excel
 Requires 10 steps. Too many?
 You will need to print and paste the following 4 objects (tables/matrices of numbers) into Excel.

smdaFit$beta   # the betas
smdaFit$fit    # discriminant coefficients, training data set sub-group means, and training data set sub-group probabilities
# the next two are needed to normalize new data to the means and unit lengths of the training data predictors
Xtr2$mx        # predictor means in the training set
Xtr2$vx        # lengths (square roots of the sums of squares) of each X in the training data, i.e. the unit lengths of each X vector

The actual output object name needed follows the $ symbol. The naming to the left of the $ symbol is based on the example on the previous pages.
33
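Before jumping to Excel, a heavily hedged sketch of the first scoring steps in R. The object names follow the earlier example, new.x is a hypothetical new case, and the assumption that the normalized predictors times smdaFit$beta give the discriminant scores should be checked against the sparseLDA documentation:

new.x  <- c(5.9, 3.0, 4.2, 1.5)         # a hypothetical new observation on the 4 predictors
new.xn <- (new.x - Xtr2$mx) / Xtr2$vx   # center by the training means, scale by the training vector lengths
scores <- new.xn %*% smdaFit$beta       # assumed: discriminant scores from the beta matrix
scores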
Jump to Excel scoring example
34
References
1. Hastie, T., Tibshirani, R., and Buja, A., "Flexible Discriminant Analysis by Optimal Scoring," Journal of the American Statistical Association, December 1994, Vol. 89, No. 428.
2. Hastie, T., and Tibshirani, R., "Discriminant Analysis by Gaussian Mixtures," Journal of the Royal Statistical Society Series B, 1996, 58:158-176.
3. Daza, L., and Acuna, E., "Combining Classifiers based on Gaussian Mixtures," Proceedings of the International Conference on Computer, Communication, and Control Technologies, 2003.
4. Rausch, J., and Kelley, K., "A comparison of linear and mixture models for discriminant analysis under non-normality," Behavior Research Methods, 2009, 41(1), 85-98.
5. Marmion, M., Luoto, M., Heikkinen, R., and Thuiller, W., "The performance of state-of-the-art modelling techniques depends on the geographical distribution of the species," Ecological Modelling, 2009, 220, 3512-3520.
6. Gkalelis, N., Mezaris, V., and Kompatsiaris, I., "Mixture subclass discriminant analysis," IEEE Signal Processing Letters, Vol. 18, No. 5, pp. 319-322, May 2011.
7. Hastie, T., Buja, A., and Tibshirani, R., "Penalized Discriminant Analysis," The Annals of Statistics, 1995, Vol. 23, No. 1, 73-102.
8. Tibshirani, Robert, Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, Ryan, "Strong rules for discarding predictors in lasso-type problems," Journal of the Royal Statistical Society Series B, March 2012, Vol. 74, Issue 2, 245-266.
9. Zou, H., and Hastie, T., "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, 2005, 67(2), 301-320.
10. Clemmensen, L., Hastie, T., Witten, D., and Ersbøll, B., "Sparse Discriminant Analysis," Technometrics, 53(4): 406-413, 2011. Original paper 2008, from the Department of Informatics and Mathematical Modeling, Technical University of Denmark, and the Statistics Department of Stanford University, California, U.S.A. The sparseLDA package is available on the CRAN web site.

For a short overview, see the chapter on Flexible Discriminants in the book Elements of Statistical Learning – Data Mining, Inference, and Prediction, Hastie, T., Tibshirani, R., and Friedman, J., Second edition, Springer, 2008.
35
Explore next?
 Sparse PLS (Chun and Keles, University of Wisconsin, 2010).
 Builds on the elastic net and sparse discriminant ideas discussed here, and combines them with sparse principal components analysis (SPCA).
 For regression or classification.
36