Data mining and statistical learning – lecture 13
Separating hyperplane

[Figure: two classes of training points plotted against $x_1$ and $x_2$; the line $x^T\beta + \beta_0 = 0$ is a separating hyperplane.]
Optimal separating hyperplane – support vector classifier

Find the hyperplane $x^T\beta + \beta_0 = 0$ that creates the biggest margin between the training points for class 1 and class -1.

[Figure: two-class data with the maximum-margin hyperplane; the margin on each side of the decision border is indicated.]
Formulation of the optimization problem

$$\max_{\beta,\ \beta_0,\ \|\beta\|=1} C \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge C, \quad i = 1, \ldots, N$$

Here $y_i(x_i^T\beta + \beta_0)$ is the signed distance from $x_i$ to the decision border, with $y = 1$ for one of the groups and $y = -1$ for the other.
Two equivalent formulations of the optimization problem

$$\max_{\beta,\ \beta_0,\ \|\beta\|=1} C \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge C, \quad i = 1, \ldots, N$$

$$\min_{\beta,\ \beta_0} \|\beta\| \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, \quad i = 1, \ldots, N$$
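A minimal numerical sketch of the second formulation, assuming synthetic, linearly separable data and using a general-purpose constrained optimizer (scipy's SLSQP) instead of the dedicated quadratic-programming solver a real SVM implementation would use; all variable names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two 2-D point clouds with labels -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.3, 0.05, (20, 2)), rng.normal(0.7, 0.05, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Parameter vector theta = (beta_1, beta_2, beta_0)
def objective(theta):
    beta = theta[:-1]
    return 0.5 * beta @ beta            # minimizing ||beta|| <=> minimizing ||beta||^2 / 2

def margin_constraints(theta):
    beta, beta0 = theta[:-1], theta[-1]
    return y * (X @ beta + beta0) - 1   # y_i (x_i^T beta + beta_0) - 1 >= 0 for every i

res = minimize(objective, x0=np.ones(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
beta, beta0 = res.x[:-1], res.x[-1]
print("beta =", beta, "beta0 =", beta0, "margin width =", 2 / np.linalg.norm(beta))
```

With this scaling the margin between the two classes is $2/\|\beta\|$, so minimizing $\|\beta\|$ maximizes the margin.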
Optimal separating hyperplane – overlapping classes

Find the hyperplane that creates the biggest margin subject to $\sum_i \xi_i \le \text{constant}$, where the slack variables $\xi_i \ge 0$ measure how far each observation lies on the wrong side of its margin.

[Figure: overlapping two-class data with the decision border $x^T\beta + \beta_0 = 0$; slack variables $\xi^*$ are marked for the points that violate the margin.]
Characteristics of the support vector classifier

Points well inside their class boundary do not play a big role in shaping the decision border.

Cf. linear discriminant analysis (LDA), for which the decision boundary is determined by the covariance matrix of the class distributions and their centroids.
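A short illustrative sketch of this property, assuming scikit-learn is available and using synthetic data (all names are made up): only the support vectors, the points on or inside the margin, determine the fitted border, and the cost parameter C plays the role of the slack budget in the overlapping-class case.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.35, 0.10, (50, 2)), rng.normal(0.65, 0.10, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # smaller C <=> larger slack budget
print("number of support vectors per class:", clf.n_support_)

# Dropping some points that are well inside their class (not support vectors)
# and refitting leaves the decision border essentially unchanged.
keep = np.ones(len(X), dtype=bool)
keep[np.setdiff1d(np.arange(len(X)), clf.support_)[:10]] = False
clf2 = SVC(kernel="linear", C=1.0).fit(X[keep], y[keep])
print("coefficients before:", clf.coef_, "after:", clf2.coef_)
```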
Support vector machines using basis expansions (polynomials, splines)

$$f(x) = h(x)^T\beta + \beta_0$$

[Figure: two-class data and a linear decision border in the transformed feature space with coordinates $h_1(x)$ and $h_2(x)$.]
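A minimal sketch of this idea, assuming scikit-learn and a quadratic polynomial as one possible choice of $h$ (data and names are illustrative): expand the inputs with a polynomial basis and fit a linear support vector classifier in the enlarged feature space.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (200, 2))
y = np.where((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.1, 1, -1)  # circular class boundary

# h(x): all monomials up to degree 2, i.e. (x1, x2, x1^2, x1*x2, x2^2)
H = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

clf = LinearSVC(C=1.0, max_iter=10000).fit(H, y)   # linear SVC in the enlarged space
print("training accuracy:", clf.score(H, y))
```

A decision border that is linear in $h(x)$ is nonlinear (here, roughly circular) in the original inputs.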
Characteristics of support vector machines

The dimension of the enlarged feature space can be very large.

Overfitting is prevented by a built-in shrinkage of the beta coefficients.

Irrelevant inputs can create serious problems.
The SVM as a penalization method

Misclassification: f(x) < 0 when y = 1, or f(x) > 0 when y = -1.

Loss function:
$$\sum_{i=1}^{N} \left[1 - y_i f(x_i)\right]_+$$

Loss function + penalty:
$$\sum_{i=1}^{N} \left[1 - y_i f(x_i)\right]_+ + \lambda \|\beta\|^2$$
The SVM as a penalization method

Minimizing the loss function + penalty
$$\sum_{i=1}^{N} \left[1 - y_i f(x_i)\right]_+ + \lambda \|\beta\|^2$$
is equivalent to fitting a support vector machine to data.

The penalty factor $\lambda$ is a function of the constant providing an upper bound on $\sum_{i=1}^{N} \xi_i$.
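A bare-bones sketch of this penalized formulation with a linear $f(x) = x^T\beta + \beta_0$, using plain subgradient descent rather than the solver a real SVM library would use; the data, step size and $\lambda$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.35, 0.12, (60, 2)), rng.normal(0.65, 0.12, (60, 2))])
y = np.array([-1] * 60 + [1] * 60)
lam, step = 0.01, 0.05

beta, beta0 = np.zeros(2), 0.0
for _ in range(2000):
    f = X @ beta + beta0
    active = (1 - y * f) > 0                 # points inside the margin or misclassified
    # Subgradient of sum_i [1 - y_i f(x_i)]_+  +  lam * ||beta||^2
    g_beta = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * beta
    g_beta0 = -y[active].sum()
    beta -= step * g_beta / len(X)
    beta0 -= step * g_beta0 / len(X)

hinge = np.maximum(0, 1 - y * (X @ beta + beta0)).sum()
print("hinge loss:", hinge, "penalty:", lam * beta @ beta)
```

Only the observations with $1 - y_i f(x_i) > 0$ contribute to the subgradient, which is the penalization-method view of why points well inside their class do not shape the border.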
Some characteristics of different learning methods

| Characteristic | Neural networks | Support vector machines | Trees | MARS |
|---|---|---|---|---|
| Natural handling of data of "mixed" type | Poor | Poor | Good | Good |
| Handling of missing values | Poor | Poor | Good | Good |
| Robustness to outliers in input space | Poor | Poor | Good | Poor |
| Insensitive to monotone transformations of inputs | Poor | Poor | Good | Poor |
| Computational scalability (large N) | Poor | Poor | Good | Good |
| Ability to deal with irrelevant inputs | Poor | Poor | Good | Good |
| Ability to extract linear combinations of features | Good | Good | Poor | Poor |
| Interpretability | Poor | Poor | Fair | Good |
| Predictive power | Good | Good | Poor | Fair |
The ε-insensitive error function

$$V_\varepsilon(r) = \begin{cases} 0 & \text{if } |r| < \varepsilon \\ |r| - \varepsilon & \text{otherwise} \end{cases}$$

[Figure: plot of $V_\varepsilon(r)$ against $r$; the function is zero on $(-\varepsilon, \varepsilon)$ and increases linearly outside that interval.]
SVMs for linear regression

Estimate the regression coefficients by minimizing
$$H(\beta, \beta_0) = \sum_{i=1}^{N} V_\varepsilon(y_i - f(x_i)) + \frac{\lambda}{2}\|\beta\|^2$$

(i) The fitting is less sensitive than OLS to outliers.
(ii) Errors of size less than ε are ignored.
(iii) Typically, the parameter estimates are functions of only a minor subset of the observations.
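A small sketch that minimizes $H(\beta, \beta_0)$ directly for a linear $f(x) = x^T\beta + \beta_0$, assuming synthetic data and using a general-purpose derivative-free optimizer in place of the dedicated solver; names and tuning values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (80, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.05, 80)
eps, lam = 0.1, 0.01

def v_eps(r):
    # epsilon-insensitive error: zero inside (-eps, eps), linear outside
    return np.maximum(0.0, np.abs(r) - eps)

def H(theta):
    beta, beta0 = theta[:-1], theta[-1]
    return v_eps(y - (X @ beta + beta0)).sum() + 0.5 * lam * beta @ beta

res = minimize(H, x0=np.zeros(2), method="Powell")   # derivative-free, since H is not smooth
print("beta =", res.x[:-1], "beta0 =", res.x[-1])
```

Residuals smaller than ε contribute nothing to the objective, so only the observations outside the ε-tube influence the fitted coefficients.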
Ensemble methods

• Bootstrapping (Chapter 8)
• Bagging (Chapter 8)
• Boosting (Chapter 10)
• Bagging and boosting in SAS EM
Major types of ensemble methods

• Manipulation of the model
• Manipulation of the data set
Terminology

• Bagging = manipulation of the data set
• Boosting = manipulation of the model
The bootstrap

We would like to determine a functional F(P) of an unknown probability distribution P.

The bootstrap: compute F(P*), where P* is an approximation of P.
Resampling techniques – the bootstrap method

[Figure: an observed data set $x_1, x_2, \ldots, x_N$ and several resampled data sets $x_1^*, x_2^*, \ldots, x_N^*$ obtained by sampling with replacement from the observed data; some observed values appear several times in a resample and others not at all.]
The bootstrap for assessing the accuracy of an estimate or prediction

Compute $\mathrm{Var}(e(\mathbf{X}))$, where $e(\mathbf{X})$ is an estimate or prediction computed from the observed data.

Bootstrap samples $\mathbf{X}_k^* = (X_{k1}^*, \ldots, X_{kn}^*)$ are generated by sampling with replacement from the observed data.

1. Generate N bootstrap samples and compute $T_k = e(\mathbf{X}_k^*)$.
2. Compute the sample variance of the $T_k$.
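A compact sketch of this procedure, with made-up data and the sample median standing in for $e(\mathbf{X})$:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=100)   # observed data
n_boot = 1000

# 1. Generate bootstrap samples by resampling with replacement and compute T_k
T = np.array([np.median(rng.choice(x, size=len(x), replace=True))
              for _ in range(n_boot)])

# 2. The sample variance of the T_k estimates Var(e(X))
print("bootstrap variance of the median:", T.var(ddof=1))
```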
Bagging – using the bootstrap to improve a prediction

Question: given the model Y = f(X) + ε and a set of observed values Z = {(Y_i, X_i), i = 1, …, N}, what is $E_P[\hat{f}(x)]$, where P denotes the distribution of (X, Y)?

Solution: replace P with P*:
• Produce B bootstrap samples $Z_1^*, \ldots, Z_B^*$ and, for each sample, compute $\hat{f}^{*b}(x)$.
• Compute the sample mean by averaging over the bootstrap functions.
Bagging

Formula:
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$$

Construct graphs, compute the average.
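A minimal bagging sketch for regression, assuming scikit-learn decision trees as the base learner and synthetic data (all names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.2, 200)
B = 50

fits = []
for b in range(B):
    idx = rng.choice(len(X), size=len(X), replace=True)    # bootstrap sample Z*_b
    fits.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = np.array([[0.25], [0.75]])
f_bag = np.mean([f.predict(x_new) for f in fits], axis=0)  # (1/B) * sum_b f*_b(x)
print("bagged predictions:", f_bag)
```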
Properties of bagging

• Bagging of fitted functions reduces the variance.
• Bagging makes good predictions better and bad predictions worse.
• If the fitted function is linear, it will asymptotically coincide with the bagged estimate (as B → ∞).
Bagging for classification

Given a K-class classification problem with Z = {Y_i, X_i, i = 1, …, N} and a computed indicator function (or class probabilities)
$$\hat{f}(x) = \big(p_1(x), \ldots, p_K(x)\big), \qquad \hat{G}(x) = \arg\max_k p_k(x),$$
we produce a bagging estimate
$$\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$$
and predict the class by taking the arg max over the averaged class probabilities.
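A corresponding sketch for classification, again assuming scikit-learn trees and synthetic two-class data: the class-probability vectors are averaged over the bootstrap fits and the bagged class is the arg max.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, (200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # two classes, K = 2
B = 50

probas = []
for b in range(B):
    idx = rng.choice(len(X), size=len(X), replace=True)    # bootstrap sample
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    probas.append(clf.predict_proba(X))                    # f*_b(x) = class probabilities

f_bag = np.mean(probas, axis=0)           # average over the B bootstrap classifiers
G_bag = f_bag.argmax(axis=1)              # predicted class = arg max_k p_k(x)
print("bagged training accuracy:", (G_bag == y).mean())
```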
Boosting – basic idea

Consider a 2-class problem with Y ∈ {-1, 1} and a classifier G(x).

Produce a sequence of classifiers and combine them. The weights of misclassified observations are increased to force the algorithm to classify them correctly at the next step.
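A sketch of this idea in the style of AdaBoost.M1, assuming scikit-learn decision stumps as the weak classifier and synthetic data (names and tuning values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, (300, 2))
y = np.where((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.1, 1, -1)
M = 20

w = np.full(len(X), 1.0 / len(X))          # observation weights
stumps, alphas = [], []
for m in range(M):
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = G.predict(X) != y
    err = w[miss].sum() / w.sum()
    alpha = np.log((1 - err) / err)        # weight of this classifier in the final vote
    w *= np.exp(alpha * miss)              # increase weights of misclassified observations
    stumps.append(G)
    alphas.append(alpha)

# Combined classifier: sign of the weighted vote over the sequence of classifiers
F = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```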
Boosting

[Figure-only slides illustrating the boosting procedure.]
Boosting – comments

• Boosting can be modified for regression.
• AdaBoost.M1 can be modified to handle categorical output.
Bagging and boosting in SAS EM

• Create a diagram: Input node (define the target!) – Partition node – Group processing node – your model – Ensemble node.
• Comment: boosting works only for classification (categorical output).
Group processing: General

Modes:
• Unweighted resampling for bagging
• Weighted resampling for boosting
Group processing – unweighted resampling for bagging

• Specify the sample size.
Group processing – weighted resampling for boosting

• Specify the target.
Ensemble results