sas institute
Matt Bogard
Office of Institutional Research
Western Kentucky University
Purpose
• Are there opportunities at the applicant stage to improve our yield, implement cost savings, and shape our freshman class to maximize retention?
• Is there a way of knowing which applicants are most likely to enroll and be retained?
Methodology
• Machine Learning vs. Statistical Inference
– Emphasis on accurate predictions vs. inferences
about particular roles of specific variables
• Decision Trees
• Ensemble Methods
– Gradient Boosting
– Neural Networks
Decision Tree Basics: Algorithm
• Chooses variables and split values to create data partitions that differ on the outcome of interest (retention)
• Evaluates candidate splits using an adjusted χ² p-value
• Prunes the tree, using validation data, to derive the most accurate predictions with the fewest possible splits
• The final model is characterized by the split values for each explanatory variable, which form a set of rules for classifying new cases.
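The split search described above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS Enterprise Miner implementation. It assumes a single numeric predictor and a binary outcome, and it uses a Bonferroni correction as a stand-in for the adjustment. The function names and the toy GPA data are hypothetical.

```python
import math

def chi2_stat(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def chi2_pvalue_df1(stat):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def best_split(values, outcomes):
    """Return (threshold, adjusted_p) for the best binary split x < threshold."""
    distinct = sorted(set(values))
    thresholds = [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]
    best = None
    for t in thresholds:
        left = [y for x, y in zip(values, outcomes) if x < t]
        right = [y for x, y in zip(values, outcomes) if x >= t]
        a, b = sum(left), len(left) - sum(left)
        c, d = sum(right), len(right) - sum(right)
        p = chi2_pvalue_df1(chi2_stat(a, b, c, d))
        # Bonferroni adjustment (an assumption): multiply by the number of splits tried.
        adj_p = min(1.0, p * len(thresholds))
        if best is None or adj_p < best[1]:
            best = (t, adj_p)
    return best

# Hypothetical toy data: high school GPA and retention (1 = retained).
hs_gpa = [2.0, 2.2, 2.5, 2.8, 3.0, 3.2, 3.5, 3.8]
retained = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, adjusted_p = best_split(hs_gpa, retained)
print(threshold, adjusted_p)
```

A full tree would apply the same search recursively within each partition and then prune against validation data.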
Basic Decision Tree Visualization
Benefits of Decision Trees
• "Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems." – Leo Breiman, Statistical Modeling: The Two Cultures (Statistical Science, 2001)
• Non-parametric and non-linear
• No distributional assumptions
• Treat the data generation process as unknown
• No required functional form for predictors
• Identify complex interactions
Ensemble Methods
• Generalization error: how well a model predicts across training, validation, and test data sets
• Ensemble: the combined predictions of several learners or models
• The generalization error of a weighted combination of predictors in an ensemble is equal to the average error of the individual predictors minus the 'disagreement' among them (Krogh, 1997, Statistical Mechanics of Ensemble Learning, Physical Review)
• Ensemble Error is smaller than the weighted average of the
error of a single optimized predictor
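The "average error minus disagreement" statement can be checked numerically. The sketch below assumes squared error, weights that sum to 1, and a single target value. The function name and the three predictions are hypothetical.

```python
def ambiguity_decomposition(preds, weights, target):
    """Return (ensemble_error, average_error, disagreement) under squared error."""
    ensemble = sum(w * f for w, f in zip(weights, preds))
    # Weighted average of each predictor's own squared error.
    avg_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    # Weighted spread of the predictors around the ensemble prediction.
    disagreement = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, avg_error, disagreement

# Hypothetical predictions from three models for one applicant whose outcome is 1.
preds = [0.9, 0.6, 0.75]
weights = [0.5, 0.25, 0.25]
ens_err, avg_err, disagreement = ambiguity_decomposition(preds, weights, 1.0)
print(ens_err, avg_err - disagreement)
```

Because the disagreement term is never negative, the ensemble error in this setting cannot exceed the weighted average of the individual errors.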
Gradient Boosting
• Boosting algorithms:
ensemble of a series of
weak learners.
• Fit a series of trees using
resampled training data
weighted by classification
accuracy of previous tree
• Combined series of trees
form a single model
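A series of weak learners combined into one model can be sketched as follows. Note the swap: rather than the resampling and reweighting described above, this sketch uses the residual-fitting form of gradient boosting from Friedman (2001) with squared-error loss and one-split "stumps". It is not the SAS Enterprise Miner procedure. All function names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that minimizes squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]  # (threshold, left_value, right_value)

def stump_predict(stump, x):
    t, lmean, rmean = stump
    return lmean if x < t else rmean

def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps in sequence, each one to the residuals of the model so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    """The combined series of stumps acts as a single model."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data: one predictor and a 0/1 retention outcome.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]
model = fit_boosted(xs, ys)
base_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [boosted_predict(model, x) for x in xs])
print(base_mse, boosted_mse)
```

Each stump alone is a weak predictor. The training error falls because every round corrects what the previous rounds left unexplained.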
Neural Networks
• A nonlinear model of complex
relationships with 'hidden' layers
• Using logistic activation functions,
NNETS can be visualized as an
ensemble of logits
• Y = W0 + W1 H1 + W2 H2 + W3 H3 + W4 H4, where
• H1 = logit(w10 + w11 x1 + w12 x2)
• H2 = logit(w20 + w21 x1 + w22 x2)
• H3 = logit(w30 + w31 x1 + w32 x2)
• H4 = logit(w40 + w41 x1 + w42 x2)
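The equations above amount to a forward pass through one hidden layer with four units and two inputs. The sketch below reads logit(·) as the logistic activation function 1 / (1 + exp(-z)), which is an assumption based on the "logistic activation functions" wording. The weights are hypothetical placeholders, not fitted values.

```python
import math

def logistic(z):
    """Logistic activation, used here for each hidden unit."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, hidden_weights, output_weights):
    """Compute Y = W0 + sum_k Wk * Hk, with Hk = logistic(wk0 + wk1*x1 + wk2*x2)."""
    hidden = [logistic(w0 + w1 * x1 + w2 * x2) for (w0, w1, w2) in hidden_weights]
    bias, unit_weights = output_weights[0], output_weights[1:]
    return bias + sum(W * h for W, h in zip(unit_weights, hidden))

# Hypothetical weights: one (wk0, wk1, wk2) triple per hidden unit H1..H4.
hidden_weights = [(0.1, 0.5, -0.3), (-0.2, 0.4, 0.6), (0.0, -0.7, 0.2), (0.3, 0.1, 0.1)]
output_weights = [0.05, 0.8, -0.4, 0.3, 0.6]  # W0, W1, W2, W3, W4
y = forward(1.0, 2.0, hidden_weights, output_weights)
print(y)
```

Each hidden unit has the form of a logistic regression on the inputs, which is why the network can be pictured as a weighted combination of logits.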
Gradient Boosting vs. Decision Trees vs. NNETs vs. Logistic Regression
• Decision Trees and Gradient Boosting are both robust to the data generation process
• Decision Trees - more transparent model
structure, which is lost in ensemble methods like
gradient boosting and neural networks
• Neural Networks have issues with input selection
and are more complex to train
• Decision tree posterior probability distribution
may not be very smooth
Gradient Boosting vs. Decision Trees vs. Logistic Regression
• Logistic Regression provides
– A smooth posterior probability distribution
– A less transparent model structure than decision trees, but a more transparent one than GB
– A tool for inference, or an agnostic learning algorithm based on a specified functional form*
* Some may refuse to make this distinction and make inferences where inappropriate
Machine Learning vs. Inference
• Trees can guide and direct further inferential
work, but can be misleading in terms of causal
relationships if you are not careful
Fitting the Models
Results
• Focus: how well does the model predict
behavior vs. inferences about the roles of
specific variables
• Tradeoff between discrimination (measured by ROC) and model calibration (Cook, 2007)
• Gradient Boosting outperformed the other
models based on calibration
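The difference between discrimination and calibration can be illustrated with two small functions. This is a sketch under assumptions: discrimination is measured as the area under the ROC curve (AUC), and calibration as a binned gap between mean predicted probability and observed rate. The slides do not say which calibration statistic was used. The scores and labels are made up.

```python
def auc(scores, labels):
    """Probability that a random positive case outscores a random negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_gap(scores, labels, n_bins=2):
    """Weighted mean of |mean predicted - observed rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, y))
    gap = 0.0
    for b in bins:
        if b:
            mean_pred = sum(s for s, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            gap += len(b) / len(scores) * abs(mean_pred - observed)
    return gap

# Hypothetical predicted probabilities and outcomes (1 = retained).
labels = [0, 0, 1, 0, 1, 1, 1, 1]
scores_a = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
# A monotone transform keeps the ranking (same AUC) but distorts the probabilities.
scores_b = [s ** 3 for s in scores_a]

print(auc(scores_a, labels), auc(scores_b, labels))
print(calibration_gap(scores_a, labels), calibration_gap(scores_b, labels))
```

Two models can rank applicants identically and still differ in how closely their predicted probabilities match observed rates, so a comparison on ROC alone can miss differences in calibration.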
Scorecard
• Using our models, we can sort applicants into 4 categories for enrollment propensity and predicted retention.
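One way to obtain four categories is to cross a high/low cut on each predicted probability. The slides do not give the cutoffs or the category labels, so the 0.5 thresholds, the labels, and the applicant records below are all assumptions for illustration.

```python
def scorecard_category(p_enroll, p_retain, enroll_cut=0.5, retain_cut=0.5):
    """Assign one of four categories from two predicted probabilities."""
    enroll = "high" if p_enroll >= enroll_cut else "low"
    retain = "high" if p_retain >= retain_cut else "low"
    return f"{enroll} enrollment / {retain} retention"

# Hypothetical applicants: (id, P(enroll), P(retain)).
applicants = [
    ("A001", 0.82, 0.77),
    ("A002", 0.81, 0.35),
    ("A003", 0.22, 0.90),
    ("A004", 0.15, 0.30),
]

categories = {app_id: scorecard_category(pe, pr) for app_id, pe, pr in applicants}
for app_id, category in categories.items():
    print(app_id, category)
```

In practice the cutoffs would be chosen from the score distributions and recruiting priorities rather than fixed at 0.5.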
Implementation: Use advanced analytics to
develop a strategic recruitment and retention
strategy
Ad hoc Reports
• Report by Counselor/Region/Territory
• Report by County/School
• Report by Student demographics
• Report by Prospect Source
• …other??
IR-DSS
Detail Reporting
Additional Reading
• Bogard, M.T. (2013). A Data Driven Analytic Strategy for Increasing Yield and Retention at Western Kentucky University Using SAS Enterprise BI and SAS Enterprise Miner. Paper 044-2013. SAS Institute Inc. 2013. Proceedings of the SAS® Global Forum 2013 Conference. Cary, NC.
• DeVille, Barry. (2006). Decision Trees for Business Intelligence and Data Mining Using SAS® Enterprise Miner. SAS® Institute.
• de Ville, Barry and Neville, Padraic. (2013). SAS® Institute.
• Friedman, Jerome H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232. Available at http://stat.stanford.
• Hastie, Tibshirani and Friedman. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer-Verlag.
• Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16 (3), 199–231.
• Cook, Nancy R. (2007). Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction. Circulation, 115 (7), 928-35.
• Krogh, A. & Sollich, P. (1997, January). Statistical mechanics of ensemble learning. Physical Review E (Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics), 55 (1), 811-825.