Transcript Slide 1

Model assessment and cross-validation
- overview
• Bias-variance decomposition and tradeoff
• Analytical validation methods (AIC, BIC)
• Resampling methods (cross-validation, bootstrap methods)
Bias, variance and model complexity
Bias-variance decomposition and tradeoff
Expected prediction error
$$\mathrm{Err}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big)$$
We would like to minimize Err!
Increasing the complexity of models increases variance and decreases bias
Example: Smoothing based on nearest neighbours.
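The tradeoff can be illustrated by a small simulation. The sketch below (Python with NumPy; the true function sin(2πx), the noise level, and the test point are illustrative assumptions, not from the slides) refits a k-nearest-neighbour smoother on many training sets and estimates squared bias and variance at one point: small k (a complex model) gives low bias and high variance, large k the opposite.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true regression function (illustrative choice)
sigma, n, x0 = 0.3, 50, 0.25             # noise sd, training size, test point

def knn_smooth(x_train, y_train, x0, k):
    """k-nearest-neighbour estimate at x0: average of the k closest responses."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

for k in (1, 5, 25):
    preds = []
    for _ in range(2000):                # repeated training sets from the same model
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds.append(knn_smooth(x, y, x0, k))
    preds = np.asarray(preds)
    bias2 = (preds.mean() - f(x0)) ** 2  # squared bias at x0
    var = preds.var()                    # variance of the estimate at x0
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}")
```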
Loss function, training error, and test error
- quantitative response (regression)
• Loss functions (examples): squared error $L(Y, \hat f(X)) = (Y - \hat f(X))^2$, absolute error $L(Y, \hat f(X)) = |Y - \hat f(X)|$
• Training error: $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$
• Test error: $\mathrm{Err} = E\big[L(Y, \hat f(X))\big]$, where the expectation is taken over the joint distribution of (X, Y)
Loss function, training error, and test error
- qualitative response (classification)
Model: $G \in \{1, 2, \dots, K\}$, $p_k(X) = \Pr(G = k \mid X)$.
Decision rule:
$$\hat G(X) = \arg\max_k \hat p_k(X)$$
Examples of loss functions:
• 0-1 Loss
$$L\big(G, \hat G(X)\big) = I\big(G \neq \hat G(X)\big)$$
• Cross-entropy loss (log-likelihood)
$$L\big(G, \hat p(X)\big) = -2\sum_{k=1}^{K} I(G = k)\,\log \hat p_k(X) = -2\,\log \hat p_G(X)$$
Training error and test error
- qualitative response (classification)
• Training error
– 0-1-loss: misclassification rate
– Cross-entropy loss:
$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat p_{G_i}(x_i)$$
• Test error:
$$\mathrm{Err} = E\big[L\big(G, \hat G(X)\big)\big]$$
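A small illustration of the two training-error measures above (a sketch in Python with NumPy; the probability table and class labels are invented for the example):

```python
import numpy as np

# Illustrative predicted class probabilities p_hat[i, k] ~ Pr(G = k | x_i)
# and observed class labels g (coded 0..K-1); the numbers are made up.
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.5, 0.4, 0.1]])
g = np.array([0, 1, 2, 1])

g_hat = p_hat.argmax(axis=1)                 # decision rule: arg max_k p_hat_k(x)
err_01 = np.mean(g_hat != g)                 # 0-1 loss: misclassification rate
err_ce = -2.0 / len(g) * np.log(p_hat[np.arange(len(g)), g]).sum()  # cross-entropy

print(f"misclassification rate       = {err_01:.3f}")
print(f"cross-entropy training error = {err_ce:.3f}")
```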
Model selection
Assume that the given model fα(x) depends on a tuning parameter (model complexity parameter) α.
Examples of α:
• No. predictors (multiple regression)
• Degree of a polynomial (polynomial regression)
• Penalty factor (smoothing splines, ridge regression)
• Window width (kernels)
The aim of model selection: To find a model having minimum
test error
Model selection and assessment
Model selection
Estimate the performance of different models in order to
choose the best one.
Model assessment
Having chosen the final model, estimate the test error
Model selection and assessment
in data-rich applications
• Training set (to produce a fit, approx. 50%)
• Validation set (for model selection, 25%)
• Test set (for model assessment, 25%)
Example (splines)
1. Fit models with smoothing factors λ1, …, λn to the training set.
2. Using the fitted splines f1*(x), …, fn*(x), estimate the prediction error on the validation set and choose the model #i with minimal error.
3. Estimate the generalization error of model #i using the test set (a code sketch follows below).
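A possible sketch of the whole procedure (Python with NumPy and SciPy's UnivariateSpline; the simulated data, the candidate smoothing factors, and the exact split are illustrative assumptions, with SciPy's smoothing factor s standing in for the λ's above):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)                  # illustrative true function

# simulated data, split roughly 50 / 25 / 25 into training / validation / test sets
x = rng.uniform(0, 1, 200)
y = f(x) + rng.normal(0, 0.3, x.size)
idx = rng.permutation(x.size)
tr, va, te = idx[:100], idx[100:150], idx[150:]
order = np.argsort(x[tr])                            # spline fitting needs increasing x

def fit_spline(s):
    """Fit a smoothing spline to the training set with smoothing factor s."""
    return UnivariateSpline(x[tr][order], y[tr][order], s=s)

# 1. fit candidate models, 2. select by validation error, 3. assess on the test set
smoothing = [0.5, 1, 2, 5, 10, 20]
val_err = [np.mean((y[va] - fit_spline(s)(x[va])) ** 2) for s in smoothing]
best = smoothing[int(np.argmin(val_err))]
test_err = np.mean((y[te] - fit_spline(best)(x[te])) ** 2)
print(f"chosen smoothing factor s = {best}, estimated generalization error = {test_err:.4f}")
```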
Model selection and assessment in applications
with insufficient data
• Analytical expressions to select and assess models
– Cp (correction for the number of inputs or basis functions)
– AIC (Akaike’s information criterion)
– BIC (Bayesian information criterion)
• Resampling
– Cross-validation
– The bootstrap
Analytical validation methods
Background:
A model typically overfits the data.
The prediction error will on average be higher than the
training error
Terminology:
The difference between the (in-sample) prediction error and the training error is called the optimism.
Basic idea:
Find an analytical expression for the optimism.
Optimism
• Training error rate: $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$
• In-sample error:
1
Errin  EY 
N
N
 L(Y
i 1
i
new

ˆ
, f ( xi )) 

• Optimism: $\mathrm{op} \equiv \mathrm{Err}_{\mathrm{in}} - \overline{\mathrm{err}}$
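A simulation sketch of the optimism for a linear model (Python/NumPy; the design matrix, d = 5 parameters, and the noise level are illustrative choices). It compares the average gap between the in-sample error and the training error with the analytical value 2dσ²/N given on the next slides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 30, 5, 1.0                       # sample size, no. of parameters, noise sd
X = rng.normal(size=(n, d))                    # fixed design
beta = rng.normal(size=d)
mu = X @ beta                                  # true mean of Y

opt = []
for _ in range(5000):
    y = mu + rng.normal(0, sigma, n)           # training responses
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    err_train = np.mean((y - y_hat) ** 2)      # training error err-bar
    # in-sample error: average squared loss against fresh responses at the same x_i
    y_new = mu + rng.normal(0, sigma, (200, n))
    err_in = np.mean((y_new - y_hat) ** 2)
    opt.append(err_in - err_train)

print(f"simulated average optimism = {np.mean(opt):.4f}")
print(f"theory 2*d*sigma^2/N       = {2 * d * sigma**2 / n:.4f}")
```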
Analytical expressions for the optimism
1. For squared error, 0-1 loss, and some other loss functions:
$$\omega = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$$
2. For linear models with d inputs or basis functions:
$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2, \quad\text{so}\quad \omega = 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$
Cp scores
Basic idea:
$$\mathrm{Err}_{\mathrm{in}} = E_Y(\overline{\mathrm{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$
when d parameters are fitted under squared loss.
Compute
$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$$
and choose the model with the smallest Cp score.
Properties:
• The penalization increases with increasing model complexity (d) and decreases as the training sample size (N) increases
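A hedged illustration of the Cp computation (Python/NumPy; the nested models over simulated predictors and the σ̂² estimate taken from the full model are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 1.0
X_full = rng.normal(size=(n, 10))
beta = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0, 0, 0])   # only 3 predictors matter
y = X_full @ beta + rng.normal(0, sigma, n)

# sigma^2 is usually estimated from a low-bias (large) model; here from the full fit
resid_full = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
sigma2_hat = np.sum(resid_full ** 2) / (n - X_full.shape[1])

for d in range(1, 11):                          # nested models using the first d predictors
    X = X_full[:, :d]
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean(resid ** 2)               # training error
    cp = err_bar + 2 * d / n * sigma2_hat       # Cp score
    print(f"d={d:2d}  err={err_bar:.3f}  Cp={cp:.3f}")
```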
Akaike’s information criterion
When the likelihood is maximized, it holds asymptotically that
$$-2\,E\big[\log \Pr_{\hat\theta}(Y)\big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N},$$
where $\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr_{\hat\theta}(y_i)$.
Given a tuning parameter α, we set
$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2,$$
where d(α) is the effective number of parameters.
For Gaussian models, AIC is equivalent to Cp
Effective number of parameters (d(α))
For linear smoothers:
$$\hat y = S\,y$$
Examples:
• Simple linear regression (here d(S) equals the exact number of parameters), ridge regression
• Smoothing splines
• Kernel smoothers
Define the effective number of parameters as
$$d(S) = \mathrm{trace}(S)$$
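A short sketch of d(S) = trace(S) for a ridge-regression smoother (Python/NumPy; the design matrix and the penalty grid are invented). At λ = 0 the trace equals the exact number of parameters, and it shrinks as the penalty grows:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))                       # illustrative design matrix

for lam in (0.0, 1.0, 10.0, 100.0):
    # ridge "hat" matrix: y_hat = S y with S = X (X'X + lam I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    d_eff = np.trace(S)                            # effective number of parameters
    print(f"lambda={lam:6.1f}  d(S) = trace(S) = {d_eff:.2f}")
```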
Bayesian information criterion
Based on Bayesian theory, we set
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$$
• For Gaussian models:
$$\mathrm{BIC} = \frac{N}{\hat\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\hat\sigma_\varepsilon^2\Big]$$
Properties:
• BIC is proportional to AIC with log(N) in place of the factor 2
• Since log(N) > 2 for N > 7, BIC penalizes complex models more heavily than AIC
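A hedged comparison of AIC and BIC in the Gaussian forms above (Python/NumPy; the polynomial-degree example and the σ̂² estimate from a deliberately flexible fit are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 80
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 0.4, n)        # true model is quadratic

# estimate sigma^2 from a deliberately flexible (low-bias) model, here degree 8
res_flex = y - np.polyval(np.polyfit(x, y, 8), x)
sigma2_hat = np.sum(res_flex ** 2) / (n - 9)

for deg in range(1, 7):                                   # candidate polynomial degrees
    d = deg + 1                                           # number of fitted parameters
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    err_bar = np.mean(resid ** 2)                         # training error
    aic = err_bar + 2 * d / n * sigma2_hat                # Gaussian AIC (same form as Cp)
    bic = n / sigma2_hat * (err_bar + np.log(n) * d / n * sigma2_hat)
    print(f"degree={deg}  AIC={aic:.4f}  BIC={bic:.2f}")
```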
Features of AIC and BIC
For large samples (asymptotic property):
• BIC chooses the right model (if it is present among the alternatives)
• AIC tends to choose models that are too complex
For small samples:
• BIC tends to choose models that are too simple
• AIC is OK
Resampling methods
Cross-validation
K-fold cross-validation (rough scheme):
1. Divide the data set into K roughly equally sized subsets.
2. Remove subset #i and fit the model using the remaining data.
3. Predict the function values for subset #i using the fitted model.
4. Repeat steps 2-3 for i = 1, …, K.
5. CV = the average loss (e.g., squared difference) between observed and predicted values; a code sketch follows below.
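A minimal from-scratch sketch of K-fold cross-validation under squared-error loss (Python/NumPy; the generic fit/predict interface and the polynomial-degree example are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def k_fold_cv(x, y, fit, predict, K=10, seed=0):
    """Generic K-fold cross-validation for squared-error loss.

    fit(x_train, y_train) -> model;  predict(model, x_new) -> predictions.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)   # K roughly equal subsets
    sq_err = np.empty(len(y))
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        model = fit(x[train], y[train])                  # fit on the remaining data
        sq_err[fold] = (y[fold] - predict(model, x[fold])) ** 2
    return sq_err.mean()                                 # CV estimate of prediction error

# illustrative use: choosing a polynomial degree by 10-fold CV
rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 120)
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)
for deg in (1, 3, 5, 9):
    cv = k_fold_cv(x, y,
                   fit=lambda xt, yt, deg=deg: np.polyfit(xt, yt, deg),
                   predict=lambda coef, xn: np.polyval(coef, xn))
    print(f"degree={deg}  CV={cv:.4f}")
```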
Resampling methods
Cross-validation
Note: if K = N, the method is leave-one-out cross-validation.
K-fold cross-validation:
$$\mathrm{CV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big),$$
where κ(i) is the fold containing observation i and $\hat f^{-k}$ denotes the fit computed with fold k removed.
Model selection using cross-validation
Given models f(x, α) depending on a tuning (complexity) parameter α, compute
$$\mathrm{CV}(\hat f, \alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i; \alpha)\big)$$
and choose the value of α with the smallest CV.
Prediction error and cross-validation curve
estimated from a single training set
Generalized cross-validation
Basic idea: approximate the leave-one-out CV to make it faster
Used for linear smoothers $\hat y = S\,y$:
$$\mathrm{GCV}(\hat f) = \frac{1}{N}\sum_{i=1}^{N}\Big[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\Big]^2$$
Note: in smoothing problems, GCV is similar to AIC.
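A hedged sketch of GCV for a ridge-type linear smoother (Python/NumPy; the data and the penalty grid are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
X = rng.normal(size=(n, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, n)      # illustrative data

for lam in (0.1, 1.0, 10.0, 100.0):
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)   # linear smoother
    y_hat = S @ y
    gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / n)) ** 2)          # GCV score
    print(f"lambda={lam:6.1f}  trace(S)={np.trace(S):5.2f}  GCV={gcv:.4f}")
```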
The bootstrap
Why do we need the bootstrap?
• To estimate the uncertainty of parameter estimates
Example:
• Having a sample X1, …, Xn from an unknown distribution, we estimate its mean (expectation) by the sample mean
• How do we quantify the uncertainty of the sample mean?
Implementation of the crude nonparametric
bootstrap
1. Sample with replacement from the observed data to obtain a bootstrap sample
2. Repeat step 1 B times
3. Compute the parameter estimate on each bootstrap sample
4. Compute the variance of the estimates from step 3 (a code sketch follows below)
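A sketch of steps 1-4 for the standard error of a sample mean (Python/NumPy; the exponential sample, its size, and B = 2000 replicates are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(scale=2.0, size=40)        # observed sample from an "unknown" distribution

B = 2000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)  # sample with replacement
    boot_means[b] = resample.mean()                       # estimate on the bootstrap sample

se_boot = boot_means.std(ddof=1)               # bootstrap standard error of the sample mean
print(f"sample mean  = {x.mean():.3f}")
print(f"bootstrap SE = {se_boot:.3f}   (cf. s/sqrt(n) = {x.std(ddof=1)/np.sqrt(x.size):.3f})")
```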
Bootstrap replicates
Benefits and drawbacks of the nonparametric
bootstrap
1. The uncertainty of an estimate can be obtained without any
information about the underlying distribution
2. It is not applicable to small data sets
3. If the distribution is known (except for the values of some parameters), the bootstrap may be slightly less efficient than conventional parametric methods
The bootstrap and estimation of prediction errors
• Fit the model to bootstrap samples (which play the role of training sets) and examine how well it predicts the original training observations (which play the role of test data)
An improved bootstrap estimator of the
prediction error
The leave-one-out bootstrap behaves roughly like two-fold CV, and therefore produces upward-biased estimates of the expected prediction error.
Solution:
The .632 estimator, $\widehat{\mathrm{Err}}^{(.632)} = 0.368\,\overline{\mathrm{err}} + 0.632\,\widehat{\mathrm{Err}}^{(1)}$, pulls the bias down (a code sketch follows below).
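A hedged sketch of the leave-one-out bootstrap and the .632 combination (Python/NumPy; the simulated data, the degree-5 polynomial model, and B = 200 bootstrap samples are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)
deg = 5                                                   # illustrative model: degree-5 polynomial

fit = np.polyfit(x, y, deg)
err_train = np.mean((y - np.polyval(fit, x)) ** 2)        # training error (downward biased)

B = 200
loo_err = [[] for _ in range(n)]                          # errors for obs i from fits not containing i
for _ in range(B):
    idx = rng.integers(0, n, n)                           # bootstrap sample (with replacement)
    coef = np.polyfit(x[idx], y[idx], deg)
    out = np.setdiff1d(np.arange(n), idx)                 # observations left out of this sample
    for i in out:
        loo_err[i].append((y[i] - np.polyval(coef, x[i])) ** 2)

err_loo = np.mean([np.mean(e) for e in loo_err if e])     # leave-one-out bootstrap error (upward biased)
err_632 = 0.368 * err_train + 0.632 * err_loo             # .632 estimator pulls the bias down
print(f"training error     = {err_train:.4f}")
print(f"leave-one-out boot = {err_loo:.4f}")
print(f".632 estimator     = {err_632:.4f}")
```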