EE Dept presentation 06


Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary
2.0 Objectives
1. Quantify the notions of explanation, prediction, and model
2. Introduce terminology
3. Describe basic learning methods
4. Explain the importance of complexity control for generalization
Learning as Induction
Induction ~ function estimation from data
Deduction ~ prediction for new inputs
a.k.a. the Standard Inductive Learning Setting
2.1 Data Encoding + Preprocessing
Common Types of Input & Output Variables
(input variables ~ features)
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
• Classification: categorical output
• Regression: real-valued output
• Ranking: ordinal output
Data Preprocessing and Scaling
• Preprocessing is required with observational data (step 4 in the general experimental procedure)
• Basic preprocessing includes
- summary univariate statistics: mean, standard deviation, min + max value, range, boxplot, performed independently for each input/output
- detection (removal) of outliers
- scaling of input/output variables (may be necessary for some learning algorithms)
• Visual inspection of data is tedious but useful
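To make these steps concrete, here is a minimal Python/numpy sketch (added to this transcript, not part of the original slides) of per-variable summary statistics, a crude outlier flag, and min-max scaling to [0,1]; the data values are made up for illustration.

import numpy as np

# Example data: one row per sample, one column per variable (hypothetical values).
X = np.array([[0.023, 0.4],
              [62.0, 1320.0],
              [4.235, 50.4],
              [187.1, 419.0],
              [3.3, 25.6]])

# Summary univariate statistics, computed independently for each variable.
for j in range(X.shape[1]):
    col = X[:, j]
    print(f"var {j}: mean={col.mean():.3f} std={col.std():.3f} "
          f"min={col.min():.3f} max={col.max():.3f} range={np.ptp(col):.3f}")

# Crude outlier detection: flag samples more than 3 standard deviations from the mean.
z = np.abs(X - X.mean(axis=0)) / X.std(axis=0)
outliers = np.any(z > 3.0, axis=1)
X_clean = X[~outliers]

# Min-max scaling of each variable to the [0,1] range.
X_scaled = (X_clean - X_clean.min(axis=0)) / (X_clean.max(axis=0) - X_clean.min(axis=0))
print(X_scaled)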
Animal Body & Brain Weight Data
(original, unscaled)
Removing Outliers
Remove the outliers (Brachiosaurus, Diplodocus, Triceratops, African elephant, Asian elephant) and plot the data scaled to the [0,1] range:
[Scatter plot: Brain weight (y-axis) vs. Body weight (x-axis), both scaled to [0,1]]
2.2 Terminology and Learning Problems
• Input and output variables
[Diagram: input x → System → output y, with z denoting other factors]
• Learning ~ estimation of f(x): x → y
• Loss function L(y, f(x)) measures the quality of the prediction ŷ = f(x)
• Loss function:
- defined for common learning tasks
- has to be related to application requirements
Supervised Learning: Regression
• Data in the form (x, y), where
- x is multivariate input (i.e. vector)
- y is univariate output (real-valued)
• Regression loss function: L(y, f(x)) = (y − f(x))²
⇒ Estimation of a real-valued function x → y
Supervised Learning: Classification
• Data in the form (x, y), where
- x is multivariate input (i.e. vector)
- y is categorical output (class label)
• Loss function for binary classification:
L(y, f(x)) = 0 if y = f(x), 1 if y ≠ f(x)
⇒ Estimation of an indicator function x → y
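For illustration (not part of the original slides), a minimal Python/numpy sketch of the two loss functions defined above; averaging a loss over the training samples gives the empirical risk used later.

import numpy as np

def squared_loss(y, f_x):
    # Regression loss: L(y, f(x)) = (y - f(x))^2
    return (y - f_x) ** 2

def zero_one_loss(y, f_x):
    # Binary classification loss: 0 if y == f(x), 1 otherwise
    return np.where(y == f_x, 0.0, 1.0)

# Empirical risk = average loss over the available samples (values made up).
y_true = np.array([1.2, 0.7, -0.3])
y_pred = np.array([1.0, 0.9, -0.1])
print("mean squared loss:", squared_loss(y_true, y_pred).mean())

labels = np.array([0, 1, 1, 0])
pred_labels = np.array([0, 1, 0, 0])
print("mean 0/1 loss:", zero_one_loss(labels, pred_labels).mean())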
Unsupervised Learning
• Data in the form (x), where
- x is multivariate input (i.e. vector)
• Goal: data reduction or clustering
⇒ Clustering = estimation of a mapping x → c
Inductive Learning Setting
• Predictive Estimator observes samples (x, y) and returns an estimated response ŷ = f(x, w)
• Recall ‘first-principle’ vs ‘empirical’ knowledge
Two modes of inference: identification vs imitation
• Minimization of Risk:
∫ Loss(y, f(x, w)) dP(x, y) → min
Example: Regression estimation
Given: training data (x_i, y_i), i = 1, 2, ..., n
Find a function f(x, w) that minimizes the squared error for a large number (N) of future samples:
Σ_{k=1}^{N} [y_k − f(x_k, w)]² → min
or, equivalently, in terms of the (unknown) distribution:
∫ (y − f(x, w))² dP(x, y) → min
BUT future data is unknown ~ P(x, y) is unknown
Discussion
• Math formulation is useful for quantifying
- explanation ~ fitting error (on training data)
- generalization ~ prediction error
• Natural assumptions
- future is similar to the past: stationary P(x, y), i.i.d. data
- a discrepancy measure or loss function, e.g. MSE
• What if these assumptions do not hold?
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary
Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to data)
• Example: linear regression f(x) = (w · x) + b
Σ_{i=1}^{n} (y_i − (w · x_i) − b)² → min
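A minimal sketch of this parametric approach in Python/numpy (added for illustration): the linear model f(x) = (w · x) + b is fitted by solving the least-squares problem above; the synthetic data and coefficients below are made up.

import numpy as np

# Synthetic training data: n samples, d-dimensional inputs (for illustration only).
rng = np.random.default_rng(0)
n, d = 30, 2
X = rng.uniform(-1, 1, size=(n, d))
y = X @ np.array([1.5, -0.8]) + 0.3 + 0.1 * rng.standard_normal(n)

# Append a constant column so the bias b is estimated together with w.
X1 = np.hstack([X, np.ones((n, 1))])

# Least-squares solution of sum_i (y_i - (w . x_i) - b)^2 -> min
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = coef[:-1], coef[-1]

def f(x):
    # Fitted parametric model f(x) = (w . x) + b
    return x @ w + b

print("w =", w, "b =", b)
print("training MSE:", np.mean((y - f(X)) ** 2))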
Parametric Modeling: Classification
Given training data (x_i, y_i), i = 1, 2, ..., n
(a) Estimate a linear decision boundary
(b) Estimate a third-order decision boundary
Non-Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n
Estimate the model (for a given x_0) as a ‘local average’ of the training data.
Note: need to define ‘local’ and ‘average’
• Example: k-nearest neighbors regression
f(x_0) = (1/k) Σ_{j=1}^{k} y_j , averaging over the k nearest neighbors of x_0
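A minimal Python/numpy sketch of k-nearest-neighbors regression as just defined (added for illustration); the training values below are made up.

import numpy as np

def knn_regress(x0, X_train, y_train, k):
    # Distances from the query point x0 to every training input.
    dist = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(dist)[:k]
    # Local average: f(x0) = (1/k) * sum of their y-values.
    return y_train[nearest].mean()

# Tiny 1-d example (values chosen only for illustration).
X_train = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
y_train = np.array([0.02, 0.11, 0.27, 0.48, 0.80])
print(knn_regress(np.array([0.4]), X_train, y_train, k=2))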
Example of kNN Regression
• Ten training samples from y = x² + 0.1x + N(0, σ²), where σ² = 0.25
• Using k-nn regression with k = 1 and k = 4:
Data Reduction Approach
Given training data, estimate the model as
‘compact encoding’ of the data.
Note: ‘compact’ ~ # of bits to encode the model
• Example: piece-wise linear regression
How many parameters needed
for two-linear-component model?
Data Reduction Approach (cont’d)
Data Reduction approaches are commonly used
for unsupervised learning tasks.
• Example: clustering.
Training data encoded by 3 points (cluster centers)
Issues:
- How to find centers?
- How to select the
number of clusters?
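To make the ‘compact encoding’ idea concrete, here is a small k-means-style sketch in Python/numpy (added for illustration, not a method prescribed by the slides): the data are encoded by 3 cluster centers, while center initialization and the choice of the number of clusters remain the open issues noted above.

import numpy as np

def kmeans(X, n_clusters=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers at randomly chosen data points.
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # Move each center to the mean of its assigned samples.
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

# Made-up 2-d data; the 3 centers are the 'compact encoding' of the data.
X = np.random.default_rng(1).uniform(0, 1, size=(60, 2))
centers, labels = kmeans(X, n_clusters=3)
print("cluster centers:\n", centers)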
Standard Inductive Setting
• Model estimation ~ inductive step, i.e. estimating a function from data samples
• Prediction ~ deductive step
⇒ the (Standard) Inductive Learning Setting
• Discussion: which of the 3 modeling approaches follow standard inductive learning?
• How do humans perform inductive inference?
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling
2.5 Application Example
2.6 Summary
Prediction Accuracy
• All modeling approaches implement ‘data fitting’ ~ explaining the data
• BUT the true goal ~ prediction
• Model explanation ~ fitting error, training error, empirical risk
• Prediction accuracy ~ generalization, test error, prediction risk
• Trade-off between training and test error is controlled by ‘model complexity’
Explanation vs Prediction
(a) Classification
(b) Regression
Complexity Control: parametric modeling
Consider regression estimation
• Ten training samples from y = x² + N(0, σ²), where σ² = 0.25
• Fitting linear and 2nd-order polynomial models:
Complexity Control: local estimation
Consider regression estimation
• Ten training samples from y = x² + N(0, σ²), where σ² = 0.25
• Using k-nn regression with k = 1 and k = 4:
Complexity Control (cont’d)
• Complexity (of admissible models)
affects generalization (for future data)
• Specific complexity indices for
– Parametric models: ~ # of parameters
– Local modeling: size of local region
– Data reduction: # of clusters
• Complexity control = choosing good complexity (~ good generalization) for given (training) data
How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate prediction error as a function of fitting error and model complexity
For regression problems: R = r(DoF/n) · R_emp
Example analytic criteria for regression:
• Schwartz Criterion: r(p, n) = 1 + p(1 − p)⁻¹ ln n
• Akaike’s FPE: r(p) = (1 + p)(1 − p)⁻¹
where p = DoF/n, n ~ sample size, DoF ~ degrees-of-freedom
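A small Python sketch (added for illustration) showing how such analytic criteria would be applied, assuming the reconstructed forms given above: the empirical risk is multiplied by the penalization factor r(p, n) with p = DoF/n; the numbers used are made up.

import math

def schwartz(p, n):
    # Schwartz Criterion penalization factor: r(p, n) = 1 + p * (1 - p)**-1 * ln(n)
    return 1.0 + p / (1.0 - p) * math.log(n)

def akaike_fpe(p):
    # Akaike's Final Prediction Error factor: r(p) = (1 + p) / (1 - p)
    return (1.0 + p) / (1.0 - p)

# Example: empirical (fitting) risk R_emp of a model with DoF parameters,
# estimated from n samples (all numbers made up for illustration).
R_emp, DoF, n = 0.05, 6, 25
p = DoF / n
print("estimated prediction risk (Schwartz):", schwartz(p, n) * R_emp)
print("estimated prediction risk (FPE):     ", akaike_fpe(p) * R_emp)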
Resampling
• Split available data into 2 sets: Training + Validation
(1) Use training set for model estimation (via data fitting)
(2) Use validation data to estimate the prediction error of the model
• Change model complexity index and repeat (1) and (2)
• Select the final model providing lowest (estimated) prediction error
BUT results are sensitive to data splitting
K-fold cross-validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z_1, Z_2, …, Z_k} of size n/k
2. For each ‘left-out’ validation set Z_i:
- use remaining data to estimate the model ŷ = f_i(x)
- estimate the prediction error on Z_i: r_i = (k/n) Σ_{(x, y) ∈ Z_i} (f_i(x) − y)²
3. Estimate the average prediction risk as R_cv = (1/k) Σ_{i=1}^{k} r_i
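A minimal Python/numpy sketch of this k-fold procedure (added for illustration); the fit and predict stand-ins below use simple least-squares line fitting, but any learning method could be plugged in.

import numpy as np

def kfold_cv_risk(X, y, fit, predict, k=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)            # k disjoint subsets of size ~n/k
    risks = []
    for i in range(k):
        val = folds[i]                        # 'left-out' validation set Z_i
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])       # estimate f_i from the remaining data
        resid = predict(model, X[val]) - y[val]
        risks.append(np.mean(resid ** 2))     # prediction error r_i on Z_i
    return np.mean(risks)                     # R_cv = average of r_1 ... r_k

# Stand-in learning method: least-squares fit of a straight line to one input.
fit = lambda X, y: np.polyfit(X, y, deg=1)
predict = lambda model, X: np.polyval(model, X)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 40)
y = X ** 2 + 0.5 * rng.standard_normal(40)
print("R_cv:", kfold_cv_risk(X, y, fit, predict, k=5))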
Example of model selection(1)
• 25 samples are generated as y = sin²(2πx) + noise,
with x uniformly sampled in [0,1], and noise ~ N(0,1)
• Regression is estimated using polynomials of degree m = 1, 2, …, 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation.
The curve shows the polynomial model, along with the training and validation data points, for one partitioning.

m     Estimated R via cross-validation
1     0.1340
2     0.1356
3     0.1452
4     0.1286
5     0.0699
6     0.1130
7     0.1892
8     0.3528
9     0.3596
10    0.4006
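A compact sketch (Python/numpy, added for illustration) of the model-selection loop behind this example: generate data of the stated form, fit a polynomial of each degree m, and pick the degree with the smallest 5-fold cross-validation estimate. The numerical results will of course differ from the table above, which comes from the original experiment.

import numpy as np

rng = np.random.default_rng(0)
n, k = 25, 5
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) ** 2 + rng.standard_normal(n)   # y = sin^2(2*pi*x) + noise

idx = rng.permutation(n)
folds = np.array_split(idx, k)

def cv_risk(degree):
    # 5-fold cross-validation estimate of prediction risk for a given polynomial degree.
    risks = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        risks.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(risks)

risks = {m: cv_risk(m) for m in range(1, 11)}
best_m = min(risks, key=risks.get)
print("CV risk per degree:", {m: round(r, 4) for m, r in risks.items()})
print("selected degree:", best_m)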
Example of model selection(2)
• Same data set, but estimated using k-nn regression.
• Optimal value k = 7 is chosen according to 5-fold cross-validation model selection.
The curve shows the k-nn model, along with the training and validation data points, for one partitioning.

k     Estimated R via cross-validation
1     0.1109
2     0.0926
3     0.0950
4     0.1035
5     0.1049
6     0.0874
7     0.0831
8     0.0954
9     0.1120
10    0.1227
Test Error
• The previous example shows two models (estimated from the same data by different methods)
• Which model is better (i.e. has lower test error)?
Note: the polynomial model has lower cross-validation error
• Double resampling (for estimating test error):
partition the data into learning / validation / test sets
• Test data should never be used for model estimation
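A minimal Python/numpy sketch of double resampling (added for illustration, using k-nn regression as a stand-in method): the data are partitioned into learning, validation, and test subsets; model selection uses only the validation set, and the test set is touched once, at the very end.

import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    # k-nn regression: average y over the k nearest training inputs.
    preds = []
    for x0 in X_query:
        nearest = np.argsort(np.abs(X_train - x0))[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = x ** 2 + 0.5 * rng.standard_normal(60)

# Partition the data into learning / validation / test subsets.
idx = rng.permutation(60)
learn, valid, test = idx[:30], idx[30:45], idx[45:]

# Model selection: choose k using the validation set only.
val_errors = {k: np.mean((knn_predict(x[learn], y[learn], x[valid], k) - y[valid]) ** 2)
              for k in range(1, 11)}
best_k = min(val_errors, key=val_errors.get)

# Test error: computed once, on data never used for model estimation or selection.
test_error = np.mean((knn_predict(x[learn], y[learn], x[test], best_k) - y[test]) ** 2)
print("selected k:", best_k, " estimated test error:", round(test_error, 4))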
Application Example
• Haberman’s Survival Data Set
- 5-year survival of female patients (following
surgery for breast cancer)
- 306 cases (patients)
- inputs: Age, number of positive axillary nodes
• Method: k-NN classifier (only odd k-values)
Note: input values are pre-scaled to [0,1]
• Model selection via LOO cross-validation
• Optimal k=45 yields min LOO error 22.75%
Model selection for k-NN classifier via cross-validation
Optimal decision boundary for k=45
k     Error (%)
1     42
3     30.67
7     26
15    24.33
…     …
45    21.67
47    22.33
51    23
53    24.33
57    24
61    25
99    26.33
Estimating test error of a method
• For the same example (Haberman’s data), what is the true test error of the k-NN method?
• Use double resampling, i.e. 5-fold cross-validation to estimate the test error, and LOO cross-validation to estimate the optimal k for each training fold:

Fold    Optimal k    LOO error    Test error
1       11           22.5%        28.33%
2       37           25%          13.33%
3       37           23.33%       16.67%
4       33           24.17%       25%
5       35           18.75%       35%
mean                 22.75%       23.67%

Note: the optimal k-values differ across folds;
the average test error is larger than the average validation error.
Summary and Discussion
• Learning as function estimation (from data)
~ standard inductive learning setting
• Common types of learning problems:
classification, regression, clustering
• Non-standard learning settings
• Model estimation via data fitting (ERM)
• Model complexity and generalization:
- how to measure model complexity
- several complexity (tuning) parameters