Regression, neural networks


Regression, Artificial Neural Networks
16/03/2016
Regression
Regression
– Supervised learning: based on training examples, learn a model that performs well on previously unseen examples.
– Regression: predicting real values
Regression
Training dataset: $\{x_i, r_i\}$, $r_i \in \mathbb{R}$
Evaluation metric: least squared error
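For concreteness (not spelled out on the slide), the least squared error of a regressor $g$ with parameters $w$ over the training set can be written as
$J(w) = \frac{1}{2} \sum_{i=1}^{N} \bigl( r_i - g(x_i) \bigr)^2$,
where the factor $\frac{1}{2}$ matches the error function $J(w)$ used in the backpropagation slides below.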
Linear regression
Linear regression
$g(x) = w_1 x + w_0$
The gradient of the squared error is 0 at the least-squares solution (sketched below).
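The closed-form solution is not legible in this transcript; for the one-dimensional model above, setting the gradient of the squared error to zero gives the standard least-squares estimates
$w_1 = \dfrac{\sum_i (x_i - \bar{x})(r_i - \bar{r})}{\sum_i (x_i - \bar{x})^2}, \qquad w_0 = \bar{r} - w_1 \bar{x}$,
where $\bar{x}$ and $\bar{r}$ denote the sample means.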
Regression variants
– + MLE → Bayes
– k nearest neighbours:
  • mean or
  • distance-weighted average
– Decision tree:
  • constant predictions or various linear models at the leaves
Regression SVM
Artificial Neural Networks
Artificial neural networks
• Motivation: simulating the information processing mechanisms of the nervous system (the human brain)
• Structure: a huge number of densely connected, mutually interacting processing units (neurons)
• It learns from experience (training instances)
Some neurobiology…
• Neurons have many inputs and a single output
• The output is either excited or not
• The inputs from other neurons determine whether the neuron fires
• Each input synapse has a weight
A neuron in maths
A neuron computes a weighted sum of its inputs. If the sum is above a threshold $T$, it fires (outputs 1); otherwise its output is 0 or -1.
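A minimal sketch of such a threshold neuron (my own illustration, assuming a ±1 output and NumPy arrays for the inputs and weights):

import numpy as np

def threshold_neuron(x, w, T):
    """Fire (+1) if the weighted sum of the inputs reaches the threshold T, else -1."""
    net = np.dot(w, x)
    return 1 if net >= T else -1

# Example: two inputs with unit weights and threshold 0.5 behaves like a logical OR on {0, 1} inputs
print(threshold_neuron(np.array([1.0, 0.0]), np.array([1.0, 1.0]), 0.5))  # prints 1
print(threshold_neuron(np.array([0.0, 0.0]), np.array([1.0, 1.0]), 0.5))  # prints -1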
Statistics about the human brain
• #neurons: ~ $10^{11}$
• Avg. #connections per neuron: $10^{4}$
• Signal sending time: $10^{-3}$ sec
• Face recognition: $10^{-1}$ sec
Motivation
(machine learning point of view)
• Goal: non-linear classification
– Linear machines are not satisfactory in several real-world situations
– Which non-linear function family should we choose?
– Neural networks: latent non-linear patterns are learnt from the data
Perceptron
Multilayer perceptron =
Neural Network
Different representations at the various layers
Multilayer perceptron
Feedforward neural networks
• Connections only to the next layer
• The weights of the connections (between two layers) can be changed
• Activation functions are used to calculate whether a neuron fires
• Three-layer network:
  • Input layer
  • Hidden layer
  • Output layer
Network function
• The network function of neuron j:
$net_j = \sum_{i=1}^{d} x_i w_{ji} + w_{j0} = \sum_{i=0}^{d} x_i w_{ji} = \mathbf{w}_j^t \mathbf{x}$,
where $i$ indexes the input neurons and $w_{ji}$ is the weight between neurons $i$ and $j$.
• $w_{j0}$ is the bias
Activation function
The activation function is a non-linear function of the net value:
$y_j = f(net_j)$
(if it were linear, the whole network would be linear)
The sign activation function:
$f(net) = \operatorname{sgn}(net) = \begin{cases} +1 & \text{if } net \geq 0 \\ -1 & \text{if } net < 0 \end{cases}$
[Figure: step-shaped output $o_i$ as a function of $net_j$, jumping at the threshold $T_j$]
Differentiable activation
functions
• Enables gradient descent-based learning
• The sigmoid function:
$f(net_j) = \dfrac{1}{1 + e^{-(net_j - T_j)}}$
[Figure: sigmoid output as a function of $net_j$, centred at the threshold $T_j$ and ranging from 0 to 1]
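A property used implicitly in the backpropagation derivation below (a standard fact about the sigmoid, not spelled out in this transcript) is that its derivative can be computed from its value:
$f'(net_j) = f(net_j)\bigl(1 - f(net_j)\bigr)$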
Output layer
$net_k = \sum_{j=1}^{n_H} y_j w_{kj} + w_{k0} = \sum_{j=0}^{n_H} y_j w_{kj} = \mathbf{w}_k^t \mathbf{y}$,
where $k$ indexes the output neurons and $n_H$ is the number of hidden neurons
• Binary classification: sign function
• Multi-class classification: a neuron for
each of the classes, the argmax is
predicted (discriminant function)
• Regression: linear transformation
Example: XOR (with inputs $x_1, x_2 \in \{-1, +1\}$)
– Hidden unit $y_1$ computes $x_1 + x_2 + 0.5$:
  • $\geq 0 \Rightarrow y_1 = +1$
  • $< 0 \Rightarrow y_1 = -1$
  i.e. $y_1$ represents $x_1$ OR $x_2$
– Hidden unit $y_2$ computes $x_1 + x_2 - 1.5$:
  • $\geq 0 \Rightarrow y_2 = +1$
  • $< 0 \Rightarrow y_2 = -1$
  i.e. $y_2$ represents $x_1$ AND $x_2$
– The output neuron computes $z_1 = 0.7 y_1 - 0.4 y_2 - 1$; $\operatorname{sgn}(z_1)$ is 1 iff $y_1 = 1$ and $y_2 = -1$,
  i.e. the network computes ($x_1$ OR $x_2$) AND NOT($x_1$ AND $x_2$), which is XOR
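As a quick check (my own sketch, not part of the slides), the small network above can be evaluated over all four ±1 input combinations:

def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    """Two-layer network from the example: hidden units for OR and AND, combined by the output unit."""
    y1 = sgn(x1 + x2 + 0.5)          # x1 OR x2
    y2 = sgn(x1 + x2 - 1.5)          # x1 AND x2
    z1 = 0.7 * y1 - 0.4 * y2 - 1.0   # output unit
    return sgn(z1)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# Prints +1 exactly when one input is +1 and the other is -1, i.e. XOR.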
General (three-layer) feedforward network ($c$ output units)
• $g_k(\mathbf{x}) = z_k = f_k\left( \sum_{j=1}^{n_H} w_{kj}\, f_j\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right), \qquad k = 1, \ldots, c$
– The hidden units with their activation functions can express non-linear functions
– The activation functions can differ across neurons (but in practice the same one is used)
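A minimal NumPy sketch of this forward pass (my own illustration; the weight-matrix layout, the sigmoid hidden activation and the linear output are assumptions, not taken from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, W_out):
    """Three-layer feedforward pass.
    W_hidden has shape (n_H, d+1), W_out has shape (c, n_H+1); column 0 holds the biases."""
    x = np.append(1.0, x)         # prepend the bias input x_0 = 1
    y = sigmoid(W_hidden @ x)     # hidden activations y_j = f(net_j)
    y = np.append(1.0, y)         # prepend the bias unit y_0 = 1
    z = W_out @ y                 # output net values net_k (linear outputs)
    return z                      # apply sgn / argmax / a sigmoid on top as the task requires

# Example with d = 2 inputs, n_H = 3 hidden units and c = 2 outputs
rng = np.random.default_rng(0)
print(forward(np.array([0.5, -1.0]), rng.normal(size=(3, 3)), rng.normal(size=(2, 4))))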
Universal approximation
theorem
The universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function (on a compact input domain) to arbitrary accuracy.
But the theorem does not give any hint on how to design activation functions for particular problems/datasets.
Training of neural networks
(backpropagation)
Training of neural networks
• The network topology is given
• The same activation function is
used at each hidden neuron and it
is given
• Training = calibration of weights
• On-line learning: instances are processed one at a time, over multiple epochs
Training of neural networks
1. Forward propagation:
an input vector is propagated through the network
2. Weight update (backpropagation):
the weights of the network are changed in order to decrease the difference between the predicted and gold-standard values
Training of neural networks
we can calculate (propagate back) the
error signal for each hidden neuron
• tk is the target (gold standard) value of
output neuron k, zk is the prediction at
output neuron k (k = 1, …, c) and w are
the weights
• Error: $J(\mathbf{w}) = \dfrac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \dfrac{1}{2}\, \lVert \mathbf{t} - \mathbf{z} \rVert^2$
– backpropagation is a gradient descent algorithm
• initial weights are random, then
$\Delta \mathbf{w} = -\eta\, \dfrac{\partial J}{\partial \mathbf{w}}$
Backpropagation
The error of the weights between the hidden and output layers:
$\dfrac{\partial J}{\partial w_{kj}} = \dfrac{\partial J}{\partial net_k} \cdot \dfrac{\partial net_k}{\partial w_{kj}} = -\delta_k\, \dfrac{\partial net_k}{\partial w_{kj}}$
the error signal for output neuron $k$:
$\delta_k = -\dfrac{\partial J}{\partial net_k}$
because $net_k = \mathbf{w}_k^t \mathbf{y}$:
$\dfrac{\partial net_k}{\partial w_{kj}} = y_j$
and, since $z_k = f(net_k)$:
$\delta_k = -\dfrac{\partial J}{\partial net_k} = -\dfrac{\partial J}{\partial z_k} \cdot \dfrac{\partial z_k}{\partial net_k} = (t_k - z_k)\, f'(net_k)$
The change of the weights between the hidden and output layers:
$\Delta w_{kj} = \eta\, \delta_k\, y_j = \eta\, (t_k - z_k)\, f'(net_k)\, y_j$
The gradient of the hidden units:
$y_j = f(net_j), \qquad net_j = \sum_{i=0}^{d} w_{ji} x_i$
$\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j} \cdot \dfrac{\partial y_j}{\partial net_j} \cdot \dfrac{\partial net_j}{\partial w_{ji}}$
$\dfrac{\partial J}{\partial y_j} = \dfrac{\partial}{\partial y_j} \left[ \dfrac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k)\, \dfrac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, \dfrac{\partial z_k}{\partial net_k} \cdot \dfrac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(net_k)\, w_{kj}$
The error signal of the hidden units:
$\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj}\, \delta_k$
The weight change between the input and hidden layers:
$\Delta w_{ji} = \eta\, x_i\, \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj}\, \delta_k \right] f'(net_j)\, x_i$
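A compact NumPy sketch of one online backpropagation update implementing these formulas (my own illustration; it assumes sigmoid hidden units and linear output units, so $f'(net_k) = 1$ at the output):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    """One online backpropagation update; returns the updated weight matrices."""
    x = np.append(1.0, x)                       # bias input x_0 = 1
    net_hidden = W_hidden @ x
    y = np.append(1.0, sigmoid(net_hidden))     # hidden activations with bias unit y_0 = 1
    z = W_out @ y                               # linear output units
    delta_out = t - z                           # delta_k = (t_k - z_k) * f'(net_k), f' = 1 for linear outputs
    # delta_j = f'(net_j) * sum_k w_kj * delta_k   (skip the bias column of W_out)
    f_prime = sigmoid(net_hidden) * (1.0 - sigmoid(net_hidden))
    delta_hidden = f_prime * (W_out[:, 1:].T @ delta_out)
    W_out = W_out + eta * np.outer(delta_out, y)            # delta w_kj = eta * delta_k * y_j
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)   # delta w_ji = eta * delta_j * x_i
    return W_hidden, W_out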
Backpropagation
Calculate the error signal for the output neurons and update the weights between the output and hidden layers:
$\delta_k = (t_k - z_k)\, f'(net_k)$
update the weights into output neuron $k$:
$\Delta w_{kj} = \eta\, \delta_k\, y_j$
[Figure: network diagram (input, hidden, output) highlighting the hidden-to-output weights]
Backpropagation
Calculate the error signal for hidden
neurons
$\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj}\, \delta_k$
[Figure: network diagram (input, hidden, output) showing the output error signals propagating back to hidden neuron $j$]
Backpropagation
Update the weights between the input
and hidden neurons
updating the weights into hidden neuron $j$:
$\Delta w_{ji} = \eta\, \delta_j\, x_i$
[Figure: network diagram (input, hidden, output) highlighting the input-to-hidden weights]
Training of neural networks
$\mathbf{w}$ is initialised randomly
Begin  init: $n_H$; $\mathbf{w}$; stopping criterion $\theta$; learning rate $\eta$; $m \leftarrow 0$
  do  $m \leftarrow m + 1$
      $\mathbf{x}^m \leftarrow$ a sampled training instance
      $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j x_i$;  $w_{kj} \leftarrow w_{kj} + \eta\, \delta_k y_j$
  until $\lVert \nabla J(\mathbf{w}) \rVert < \theta$
  return $\mathbf{w}$
End
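Putting the pieces together, a small runnable sketch of this online training loop (my own illustration; the toy data, the network sizes and the stopping rule based on the accumulated error-signal magnitude are assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, T, n_hidden=5, eta=0.05, theta=1e-4, max_epochs=5000, seed=0):
    """Online backpropagation for a one-hidden-layer regression network with a linear output."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_hidden, d + 1))   # input -> hidden (bias in column 0)
    W2 = rng.normal(scale=0.5, size=(1, n_hidden + 1))   # hidden -> output (bias in column 0)
    for epoch in range(max_epochs):
        total = 0.0
        for x, t in zip(X, T):
            x = np.append(1.0, x)
            net_h = W1 @ x
            y = np.append(1.0, sigmoid(net_h))
            z = W2 @ y
            delta_out = t - z                                            # linear output => f' = 1
            delta_h = sigmoid(net_h) * (1 - sigmoid(net_h)) * (W2[:, 1:].T @ delta_out)
            W2 += eta * np.outer(delta_out, y)                           # w_kj <- w_kj + eta * delta_k * y_j
            W1 += eta * np.outer(delta_h, x)                             # w_ji <- w_ji + eta * delta_j * x_i
            total += np.sum(delta_out**2) + np.sum(delta_h**2)
        if total / len(X) < theta:                                       # crude stand-in for ||grad J(w)|| < theta
            break
    return W1, W2

# Toy regression problem: learn r = sin(x)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
W1, W2 = train(X, np.sin(X))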
Stopping criteria
• if the change in $J(\mathbf{w})$ is smaller than the threshold $\theta$
• Problem: estimating the change from a single training instance. Use bigger batches for change estimation:
$J = \sum_{p=1}^{n} J_p$
Stopping based on the performance
on a validation dataset
– Use held-out instances, not seen during training, to estimate the performance of the supervised learner (to avoid overfitting)
– Stop at the minimum error on the validation set
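A sketch of this early-stopping rule (my own illustration; the model object and the train_one_epoch and validation_error callables are hypothetical stand-ins for the training step and the validation-set error described above):

import copy

def train_with_early_stopping(model, train_one_epoch, validation_error, max_epochs=1000):
    """Keep the weights that achieved the minimum error on the validation set."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical: one backpropagation pass over the training set
        error = validation_error(model)        # hypothetical: J(w) measured on the validation set
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)  # remember the best weights seen so far
    return best_model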
Notes on backpropagation
• It can get stuck in local minima
• In practice, the local minima found are often close to the global one
• Training multiple times from various randomly initialised weights might help
– we can take the trained network with the minimal error (on a validation set)
– there are voting schemes for combining the networks
Questions of network design
• How many hidden neurons?
– too few neurons cannot learn complex patterns
– too many neurons can easily overfit
– choose using a validation set?
• Learning rate!?
Outlook
History of neural networks
• Perceptron: one of the first machine learners, ~1950
• Backpropagation: multilayer perceptrons, 1975
• Deep learning: popular again since 2006
Deep learning
(auto-encoder pretraining)
Recurrent neural networks
short term memory
http://www.youtube.com/watch?v=vmDByFN6eig