Transcript Lecture 12

Neural networks (NN) and
Multivariate Adaptive Regression Splines (MARS)
• Different types of neural networks
• Considerations in neural network modelling
• Multivariate Adaptive Regression Splines
Feed-forward neural network
• Input layer
• Hidden layer(s)
• Output layer
[Figure: network diagram with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
Terminology
• Feed-forward network
– Nodes in one layer are connected to nodes in the next layer
• Recurrent network
– Nodes in one layer may be connected to nodes in a previous layer or within the same layer
Multilayer perceptrons
• Any number of inputs
• Any number of outputs
• One or more hidden layers with
any number of units.
• Linear combinations of the
outputs from one layer form inputs
to the following layers
• Sigmoid activation functions in the
hidden layers.
[Figure: multilayer perceptron with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
Parameters in a multilayer perceptron
z_m = \sigma(\alpha_{0m} + \alpha_m^T X),  m = 1, \ldots, M        (combination function C1)

f_k = g_k(\beta_{0k} + \beta_k^T Z),  k = 1, \ldots, K        (combination function C2)

• C1, C2 : combination functions
• \sigma, g_k : activation functions
• \alpha_{0m}, \beta_{0k} : biases of the hidden and output units
• \alpha_{jm}, \beta_{mk} : weights of the connections
[Figure: multilayer perceptron with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
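As an illustration of the formulas above, here is a minimal NumPy sketch of one forward pass through a single-hidden-layer perceptron. The array shapes, the sigmoid hidden activation and the softmax output activation are assumptions made for the example, not prescriptions from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(t):
    e = np.exp(t - t.max())            # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(x, alpha0, alpha, beta0, beta):
    """x: (p,), alpha0: (M,), alpha: (p, M), beta0: (K,), beta: (M, K)."""
    z = sigmoid(alpha0 + x @ alpha)    # hidden units z_1, ..., z_M
    f = softmax(beta0 + z @ beta)      # outputs f_1, ..., f_K
    return f
```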
Least squares fitting of neural networks
Consider a simple perceptron (no hidden layer)
f_k = \sigma(\alpha_{0k} + \alpha_k^T X) = \sigma(\alpha_{0k} + \alpha_{1k} X_1 + \ldots + \alpha_{pk} X_p),  k = 1, \ldots, K

Find the weights and biases minimizing the error function

R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} ( f_k(x_i) - y_{ik} )^2

[Figure: simple perceptron with inputs x1, …, xp and outputs f1, …, fK]
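A sketch of minimizing R(θ) for such a perceptron by plain gradient descent, assuming a sigmoid activation and a single output; the learning rate and the number of iterations are arbitrary choices, not values from the lecture.

```python
import numpy as np

def fit_perceptron(X, y, lr=0.1, n_iter=1000):
    """Least-squares fit of f(x) = sigmoid(alpha0 + alpha^T x) by gradient descent."""
    N, p = X.shape
    alpha0, alpha = 0.0, np.zeros(p)
    for _ in range(n_iter):
        f = 1.0 / (1.0 + np.exp(-(alpha0 + X @ alpha)))   # predictions f(x_i)
        grad = 2.0 * (f - y) * f * (1.0 - f)              # dR/d(linear predictor)
        alpha0 -= lr * grad.sum()
        alpha -= lr * (X.T @ grad)
    return alpha0, alpha
```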
Alternative measures of fit
• For regression we normally use the sum-of-squared errors as measure of fit

  R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} ( f_k(x_i) - y_{ik} )^2

• For classification we use either squared errors or cross-entropy (deviance)

  R(\theta) = -\sum_{k=1}^{K} \sum_{i=1}^{N} y_{ik} \log f_k(x_i)

  and the corresponding classifier is argmax_k f_k(x)

• The measure of fit can also be adapted to specific distributions, such as Poisson distributions
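The two measures of fit can be written directly in NumPy; this sketch assumes predictions f and 0/1 indicator targets y stored as N-by-K arrays.

```python
import numpy as np

def sum_of_squares(f, y):
    return np.sum((f - y) ** 2)                # regression / classification

def cross_entropy(f, y, eps=1e-12):
    return -np.sum(y * np.log(f + eps))        # deviance; eps guards against log(0)

def classify(f):
    return np.argmax(f, axis=1)                # argmax_k f_k(x)
```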
Combination and activation functions
• Combination function
  – Linear combination: \alpha_{0m} + \sum_j \alpha_{jm} x_j
  – Radial combination: -\alpha_{0m}^2 \sum_j (\alpha_{jm} - x_j)^2
• Activation function in the hidden layer
  – Identity
  – Sigmoid
• Activation function in the output layer
  – Softmax
  – Identity

Softmax: g_k(T) = \frac{\exp(T_k)}{\sum_{l=1}^{K} \exp(T_l)},  where T_k = \beta_{0k} + \beta_k^T z
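A small sketch of the two combination functions and the softmax activation; the sign and scaling of the radial combination mirror the reconstruction above and should be read as an assumption.

```python
import numpy as np

def linear_combination(x, alpha0_m, alpha_m):
    return alpha0_m + np.dot(alpha_m, x)

def radial_combination(x, alpha0_m, alpha_m):
    # negative squared distance to the "centre" alpha_m, scaled by alpha0_m^2
    return -alpha0_m ** 2 * np.sum((alpha_m - x) ** 2)

def softmax(T):
    e = np.exp(T - np.max(T))                  # stabilised exponentials
    return e / e.sum()                         # g_k(T) = exp(T_k) / sum_l exp(T_l)
```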
Ordinary radial basis function networks (ORBF)
• Input and output layers and one hidden layer
• Hidden layer:
  – Combination function = radial
  – Activation function = exponential, softmax
• Output layer:
  – Combination function = linear
  – Activation function = any, normally the identity
[Figure: radial basis function network with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]
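A minimal sketch of one forward pass through an ORBF network along these lines: radial combination with an exponential activation in the hidden layer, and a linear output with identity activation. Parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def orbf_forward(x, widths, centres, beta0, beta):
    """x: (p,), widths: (M,), centres: (M, p), beta0: (K,), beta: (M, K)."""
    # hidden layer: z_m = exp( -widths_m^2 * ||centres_m - x||^2 )
    z = np.exp(-widths ** 2 * np.sum((centres - x) ** 2, axis=1))
    # output layer: f_k = beta_0k + beta_k^T z  (identity activation)
    return beta0 + z @ beta
```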
Issues in neural network modelling
• Preliminary training – learning with different initial weights
(since multiple local minima are possible)
• Scaling of the inputs is important (standardization)
• The number of nodes in the hidden layer(s)
• The choice of activation function in the output layer
– Interval – identity
– Nominal – softmax
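The first two points can be illustrated with scikit-learn: standardize the inputs and refit the network from several random initial weight vectors, keeping the best fit. The estimator, its settings and the selection rule are assumptions made for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def train_with_restarts(X, y, n_restarts=5, n_hidden=10):
    Xs = StandardScaler().fit_transform(X)             # standardize the inputs
    best = None
    for seed in range(n_restarts):                     # different initial weights
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="logistic",
                           max_iter=2000, random_state=seed).fit(Xs, y)
        if best is None or net.loss_ < best.loss_:     # keep the lowest training loss
            best = net
    return best
```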
Overcoming over-fitting
1. Early stopping
2. Adding a penalty function
Objective function = Error function + Penalty term

Penalty term = \lambda \left( \sum_{j,m} \alpha_{jm}^2 + \sum_{m,k} \beta_{mk}^2 \right)
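A sketch of the penalized objective for a single-hidden-layer network, where lam plays the role of the penalty factor λ; the function simply adds the weight-decay term to an already computed error.

```python
import numpy as np

def penalized_objective(error, alpha, beta, lam):
    """error: unpenalized error R(theta); alpha, beta: weight arrays; lam: penalty factor."""
    penalty = np.sum(alpha ** 2) + np.sum(beta ** 2)   # sum of squared weights
    return error + lam * penalty
```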
MARS: Multivariate Adaptive Regression Splines
An adaptive procedure for regression that can be
regarded as a generalization of stepwise linear
regression
Reflected pair of functions
with a knot at the value x1
[Figure: the reflected pair (x − x1)+ and (x1 − x)+ plotted as functions of x, with a knot at x1]
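A reflected pair is just a pair of hinge functions; a minimal sketch with the knot t as a parameter:

```python
import numpy as np

def reflected_pair(x, t):
    """Return (x - t)_+ and (t - x)_+ for an array of x values, with a knot at t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)
```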
Reflected pairs of functions
with knots at the values x1 and x2
[Figure: the reflected pairs (x − x1)+, (x1 − x)+ and (x − x2)+, (x2 − x)+ plotted as functions of x, with knots at x1 and x2]
MARS with a single input X
taking the values x1, …, xN
Form the collection

C = \{ (X - t)_+, (t - X)_+ : t \in \{x_1, x_2, \ldots, x_N\} \}

of basis functions.

Construct models of the form

f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)

where each h_m(X) is a function in C or a product of two or more such functions.
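A sketch of this construction for first-order (no product) terms: with the knots fixed at chosen data values, build the design matrix of reflected pairs and estimate β0, …, βM by least squares. The adaptive forward selection of MARS is deliberately omitted.

```python
import numpy as np

def mars_design_matrix(x, knots):
    cols = [np.ones_like(x)]                    # column for the intercept beta_0
    for t in knots:
        cols.append(np.maximum(x - t, 0.0))     # (x - t)_+
        cols.append(np.maximum(t - x, 0.0))     # (t - x)_+
    return np.column_stack(cols)

def fit_fixed_knots(x, y, knots):
    H = mars_design_matrix(x, knots)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # least-squares coefficients
    return beta
```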
MARS model with a single input X
taking the values x1, x2
[Figure: E(Y) plotted against X for a MARS model with knots at x1 and x2]
MARS: Multivariate Adaptive Regression Splines
At each stage we consider as a new basis
function pair all products of functions already in
the model with one of the reflected pairs in the
set C
Although each basis function depends only on a single Xj, it is considered as a function over the entire input space
MARS: Multivariate Adaptive Regression Splines
- model selection
Models built by the forward MARS procedure typically overfit the data, so a backward deletion procedure is applied

The size of the final model is determined by generalized cross-validation (GCV)
An upper limit can be set on the order of
interaction
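A sketch of the generalized cross-validation criterion used to pick the model size. The effective number of parameters is taken as (number of basis functions) + c × (number of knots) with c = 3, which is the usual textbook convention rather than something stated on the slide.

```python
import numpy as np

def gcv(y, y_hat, n_basis, n_knots, c=3.0):
    """GCV = average squared residual divided by (1 - M/N)^2."""
    N = len(y)
    m_eff = n_basis + c * n_knots               # effective number of parameters
    rss = np.sum((y - y_hat) ** 2)
    return (rss / N) / (1.0 - m_eff / N) ** 2
```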
The MARS model can be viewed as a generalization
of the classification and regression tree (CART)
[Figure: example data in the (x1, x2) plane]
Some characteristics of different learning methods
Characteristic                                       Neural networks   Trees   MARS
Natural handling of data of "mixed" type             Poor              Good    Good
Handling of missing values                           Poor              Good    Good
Robustness to outliers in input space                Poor              Good    Poor
Insensitive to monotone transformations of inputs    Poor              Good    Poor
Computational scalability (large N)                  Poor              Good    Good
Ability to deal with irrelevant inputs               Poor              Good    Good
Ability to extract linear combinations of features   Good              Poor    Poor
Interpretability                                     Poor              Fair    Good
Predictive power                                     Good              Poor    Fair
Separating hyperplane
x^T \beta + \beta_0 = 0

[Figure: two classes of points in the (x1, x2) plane separated by the hyperplane x^T β + β_0 = 0]
Optimal separating hyperplane
- support vector classifier
Find the hyperplane x^T \beta + \beta_0 = 0 that creates the biggest margin between the training points for class 1 and class -1.

[Figure: two separable classes in the (x1, x2) plane, the separating hyperplane, and the margin on either side of it]
Formulation
of the optimization problem
\max_{\beta, \beta_0, \|\beta\| = 1} C

subject to y_i (x_i^T \beta + \beta_0) \ge C,  i = 1, \ldots, N

Here y_i (x_i^T \beta + \beta_0) is the signed distance to the decision border, with y = 1 for one of the groups and y = -1 for the other.
Two equivalent formulations
of the optimization problem
\max_{\beta, \beta_0, \|\beta\| = 1} C
subject to y_i (x_i^T \beta + \beta_0) \ge C,  i = 1, \ldots, N

\min_{\beta, \beta_0} \|\beta\|
subject to y_i (x_i^T \beta + \beta_0) \ge 1,  i = 1, \ldots, N
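A sketch of solving the second formulation numerically with scikit-learn: a linear support vector classifier with a very large cost parameter approximates the hard-margin problem, and the margin width follows from the fitted β. The toy data are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two linearly separable point clouds in the (x1, x2) plane (toy data)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, (20, 2)),
               rng.normal([2.0, 2.0], 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)        # large C ~ hard margin
beta, beta0 = clf.coef_[0], clf.intercept_[0]      # hyperplane: x^T beta + beta0 = 0
print("margin width:", 2.0 / np.linalg.norm(beta)) # distance between the margin lines
print("support vectors:\n", clf.support_vectors_)
```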
Characteristics of the support vector classifier
Points well inside their class boundary do not play a big
role in the shaping of the decision border
Cf. linear discriminant analysis (LDA) for which the
decision boundary is determined by the covariance matrix
of the class distributions and their centroids
Support vector machines
using basis expansions (polynomials, splines)
f(x) = h(x)^T \beta + \beta_0

[Figure: two classes plotted in the transformed coordinates (h1(x), h2(x)), separated by the boundary f(x) = 0]
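In practice the basis expansion h(x) is usually handled implicitly through a kernel; a sketch with scikit-learn's polynomial kernel on an invented two-class data set:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy two-class data

# a degree-3 polynomial kernel corresponds to an implicit polynomial basis expansion h(x)
clf = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```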
Characteristics of support vector machines
The dimension of the enlarged feature space can be very
large
Overfitting is prevented by a built-in shrinkage of beta
coefficients
Irrelevant inputs can create serious problems
The SVM as a penalization method
Misclassification: f(x) < 0 when y=1 or f(x)>0 when y=-1
Loss function:

\sum_{i=1}^{N} [1 - y_i f(x_i)]_+

Loss function + penalty:

\sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \lambda \|\beta\|^2
The SVM as a penalization method
Minimizing the loss function + penalty

\sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \lambda \|\beta\|^2

is equivalent to fitting a support vector machine to data

The penalty factor \lambda is a function of the constant providing an upper bound on the sum of the slack variables, \sum_{i=1}^{N} \xi_i
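The penalization view is easy to write down directly; a sketch of the hinge-loss-plus-penalty objective for a linear decision function f(x) = xᵀβ + β0:

```python
import numpy as np

def hinge_objective(beta, beta0, X, y, lam):
    """Sum of [1 - y_i f(x_i)]_+ plus lam * ||beta||^2, with f(x) = x^T beta + beta0."""
    f = X @ beta + beta0
    hinge = np.maximum(0.0, 1.0 - y * f)        # [1 - y_i f(x_i)]_+
    return hinge.sum() + lam * np.dot(beta, beta)
```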
Some characteristics of different learning methods
Characteristic                                       Neural networks   SVM    Trees   MARS
Natural handling of data of "mixed" type             Poor              Poor   Good    Good
Handling of missing values                           Poor              Poor   Good    Good
Robustness to outliers in input space                Poor              Poor   Good    Poor
Insensitive to monotone transformations of inputs    Poor              Poor   Good    Poor
Computational scalability (large N)                  Poor              Poor   Good    Good
Ability to deal with irrelevant inputs               Poor              Poor   Good    Good
Ability to extract linear combinations of features   Good              Good   Poor    Poor
Interpretability                                     Poor              Poor   Fair    Good
Predictive power                                     Good              Good   Poor    Fair