Financial Informatics –XVI:
Supervised Backpropagation Learning
Khurshid Ahmad,
Professor of Computer Science,
Department of Computer Science
Trinity College,
Dublin-2, IRELAND
November 19th, 2008.
https://www.cs.tcd.ie/Khurshid.Ahmad/Teaching.html
1
Preamble
Neural Networks 'learn' by adapting in accordance with a
training regimen: Five key algorithms.
ERROR-CORRECTION OR PERFORMANCE LEARNING
HEBBIAN OR COINCIDENCE LEARNING
BOLTZMANN LEARNING (STOCHASTIC NET LEARNING)
COMPETITIVE LEARNING
FILTER LEARNING (GROSSBERG'S NETS)
2
ANN Learning Algorithms
[Block diagram: learning with a teacher (error-correction learning). The environment supplies a vector describing its state to both the teacher and the learning system; the teacher's desired response is compared with the learning system's actual response, and the resulting error signal is fed back to the learning system.]
3
ANN Learning Algorithms
[Block diagram: the environment supplies a vector describing its state directly to the learning system; no teacher, desired response or error signal is present.]
4
ANN Learning Algorithms
[Block diagram: reinforcement learning. The environment supplies a primary reinforcement signal to the critic and a state-vector input to the learning system; the critic converts the primary reinforcement into a heuristic reinforcement signal for the learning system, which in turn produces actions that act on the environment.]
5
Back-propagation Algorithm:
Supervised Learning
Backpropagation (BP) is amongst the ‘most popular algorithms for ANNs’: Paul Werbos, the person who first worked on the algorithm in the 1970s, has estimated that between 40% and 90% of real-world ANN applications use the BP algorithm. Werbos traces the algorithm to the psychologist Sigmund Freud’s theory of psychodynamics, and he applied the algorithm in political forecasting.
• David Rumelhart, Geoffrey Hinton and others applied the BP algorithm in the 1980s to problems related to supervised learning, particularly pattern recognition.
• The most useful applications of the BP algorithm have been to problems related to prediction and control.
6
Back-propagation Algorithm:
Architecture of a BP system
7
Back-propagation Algorithm
BASIC DEFINITIONS
1. Backpropagation is a procedure for efficiently calculating
the derivatives of some output quantity of a non-linear
system, with respect to all inputs and parameters of that
system, through calculations proceeding backwards from
outputs to inputs.
2. Backpropagation is any technique for adapting the weights
or parameters of a nonlinear system by somehow using such
derivatives or the equivalent.
According to Paul Werbos there is no such thing as
a “backpropagation network”; he used an ANN
design called a multilayer perceptron.
8
Back-propagation Algorithm
Paul Werbos provided a rule for updating the weights of a
multi-layered network undergoing supervised learning.
It is this weight adaptation rule which is called
backpropagation.
Typically, a fully connected feedforward network is trained
using the BP algorithm: activation in such networks travels
from the input layer towards the output layer, and each unit
in one layer is connected to every unit in the next layer.
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
9
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Forward Sweep: This sweep is similar to that of any other
feedforward ANN – the input stimulus is presented to the network,
each unit computes the weighted sum of its inputs and then passes
this sum through a squash function, and the ANN subsequently
generates an output.
The ANN may have a number of hidden layers, as in the
multi-layer perceptron, and the output from each hidden
layer becomes the input to the next layer forward.
10
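As an illustrative sketch of the forward sweep described above (not taken from the slides; the weights and inputs below are made-up values), each unit forms the weighted sum of its inputs and passes it through the logistic squash function:

import math

def logistic(v):
    # Squash function: 1 / (1 + exp(-v))
    return 1.0 / (1.0 + math.exp(-v))

def forward_layer(weights, inputs):
    # weights[j][i] is the weight from input unit i to unit j of this layer;
    # unit j computes the weighted sum v_j and outputs y_j = logistic(v_j).
    return [logistic(sum(w * x for w, x in zip(row, inputs))) for row in weights]

# Hypothetical 2-input, 2-unit hidden layer followed by a single output unit.
hidden = forward_layer([[0.1, 0.2], [0.3, 0.1]], [0.1, 0.9])
output = forward_layer([[0.2, 0.1]], hidden)[0]

The output of each layer is fed forward as the input to the next layer, exactly as described above.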
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Backwards Sweep: This sweep is similar to the forward sweep,
except that what is ‘swept’ back are the error values. These values
are essentially the differences between the desired output and the
actual output (e_j = d_j - o_j).
The ANN may have a number of hidden layers, as in the
multi-layer perceptron, and the output from each hidden layer
becomes the input to the next layer forward.
In the backward sweep, each output unit sends its error back to the
nearest hidden layer, which in turn passes it on to the next
hidden layer back. No error signal is sent to the input units.
11
Back-propagation Algorithm
For each input vector, associate a target output vector.
while not STOP
    STOP = TRUE
    for each input vector
        • perform a forward sweep to find the actual output
        • obtain an error vector by comparing the actual and target output
        • if the actual output is not within tolerance, set STOP = FALSE
        • use the backward sweep to determine weight changes
        • update weights
12
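The loop above can be sketched in Python as follows; this is a minimal illustration for a network with one hidden layer of logistic units and no bias terms, and the initialisation, tolerance and learning rate are assumptions rather than values from the slides.

import math, random

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_bp(samples, n_hidden=2, eta=0.8, tolerance=0.05, max_epochs=10000):
    # samples: list of (input vector, target) pairs for a single-output network.
    n_in = len(samples[0][0])
    w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    for _ in range(max_epochs):
        stop = True
        for x, d in samples:
            # Forward sweep.
            y_hid = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
            y_out = logistic(sum(w * h for w, h in zip(w_out, y_hid)))
            e = d - y_out
            if abs(e) > tolerance:
                stop = False
            # Backward sweep: local gradients for the output and hidden units.
            d_out = e * y_out * (1.0 - y_out)
            d_hid = [h * (1.0 - h) * d_out * w for h, w in zip(y_hid, w_out)]
            # Update weights (delta rule).
            w_out = [w + eta * d_out * h for w, h in zip(w_out, y_hid)]
            w_hid = [[w + eta * dh * xi for w, xi in zip(row, x)]
                     for row, dh in zip(w_hid, d_hid)]
        if stop:
            break
    return w_hid, w_out

# Example usage on the slides' input pattern [0.1, 0.9] with target 0.9.
trained = train_bp([([0.1, 0.9], 0.9)])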
Back-propagation Algorithm
Example: Perform a complete forward and backward sweep of
a 2-2-1 network (2 input units, 2 hidden units and 1 output unit)
with the following architecture. The target output is d = 0.9 and
the input is [0.1 0.9].
[Network diagram: the 2-2-1 architecture, with units labelled 0–6 and initial connection weights drawn from the values 0.1, 0.2 and 0.3; target output d = 0.9.]
13
Back-propagation Algorithm
Forward Pass: starting from the given inputs and initial weights, the forward sweep yields an output of 0.288 from the output unit.
The target value is 0.9, so the error for the output unit is:
(0.900 - 0.288) × 0.288 × (1 - 0.288) = 0.125
i.e. (d - y) × y × (1 - y), the output error multiplied by the derivative of the logistic output.
[Figure: example of a forward pass and a backward pass through a 2-2-2-1 feedforward network; the inputs to and outputs from each unit, and the error values propagated back through the layers, are shown in boxes.]
14
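A quick numeric check of the output-unit error quoted above, using only the figures shown on the slide (output 0.288, target 0.9):

# Output-unit error as used on the slide: (d - y) * y * (1 - y)
d, y = 0.9, 0.288
print(round((d - y) * y * (1 - y), 3))  # 0.125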
Back-propagation Algorithm
New weights calculated from the errors derived above; each update has the form new weight = old weight + (η × error) × input to the unit, with learning rate η = 0.8:
-2 + (0.8 × 0.125) × 1 = -1.900
3 + (0.8 × 0.125) × 0.122 = 3.012
1 + (0.8 × 0.125) × 0.728 = 1.073
3 + (0.8 × 0.04) × 1 = 3.032
-2 + (0.8 × 0.025) × 1 = -1.980
-2 + (0.8 × 0.04) × 0.5 = -1.984
2 + (0.8 × 0.025) × 0.993 = 2.019
2 + (0.8 × 0.025) × 0.5 = 2.010
2 + (0.8 × -0.007) × 1 = 1.994
-2 + (0.8 × -0.007) × 0.1 = -2.001
3 + (0.8 × -0.001) × 0.1 = 2.999
-4 + (0.8 × 0.04) × 0.993 = -3.968
2 + (0.8 × -0.001) × 1 = 1.999
3 + (0.8 × -0.001) × 0.9 = 2.999
-2 + (0.8 × -0.007) × 0.9 = -2.005
15
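Each line above follows the same update rule; as a small sketch, the first update (an old weight of -2 on a connection whose input is 1, with error 0.125) can be reproduced as:

# new_weight = old_weight + eta * error * input_to_the_unit
eta = 0.8
old_w, error, unit_input = -2.0, 0.125, 1.0
print(old_w + eta * error * unit_input)  # -1.9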
Back-propagation Algorithm
A derivation of the BP algorithm
The error signal at the output of neuron j at the nth
training cycle is given as:
e_j(n) = d_j(n) - y_j(n)
The instantaneous value of the error energy for neuron j is:
\frac{1}{2} e_j^2(n)
The total error energy E(n) can be computed by summing the
instantaneous energy over all the neurons in the output layer:
E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)
where the set C includes all output neurons.
16
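For instance (a made-up toy case, not from the slides), with two output neurons whose desired responses are d_1 = 1.0, d_2 = 0.0 and whose actual outputs are y_1 = 0.7, y_2 = 0.2:

E(n) = \frac{1}{2}\left[(1.0 - 0.7)^2 + (0.0 - 0.2)^2\right] = \frac{1}{2}(0.09 + 0.04) = 0.065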
Back-propagation Algorithm
Derivative or differential coefficient
For a function f(x) at the argument x, the derivative is the limit of the
differential coefficient
\frac{f(x + \Delta x) - f(x)}{\Delta x}
as \Delta x \to 0.
[Figure: the curve y = f(x) with the chord PQ joining P = (x, y) and Q = (x + \Delta x, y + \Delta y).]
\frac{dy}{dx} = \lim \frac{\Delta y}{\Delta x} \quad \text{as } P \text{ approaches } Q
17
Back-propagation Algorithm
Derivative or differential coefficient
Typically defined for a function of a single
variable: if the left-hand and right-hand limits
exist and are equal, the derivative is the gradient of the
curve at x, and is the limit of the gradient
of the chord joining the points (x, f(x))
and (x + \Delta x, f(x + \Delta x)). The function of x
defined as this limit for each argument x is
the first derivative y' = f'(x).
18
Back-propagation Algorithm
Partial derivative or partial differential coefficient
The derivative of a function of two or more
variables with respect to one of these
variables, the others being regarded as
constant; written as:
\frac{\partial f}{\partial x}
19
Back-propagation Algorithm
Total Derivative
The derivative of a function of two or more
variables with respect to a single parameter
in terms of which these variables are expressed:
if z = f(x, y) with parametric equations x = U(t), y = V(t),
then under appropriate conditions the total
derivative is:
\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}
20
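A simple worked example (not from the slides): with z = xy, x = t^2 and y = \sin t,

\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt} = y \cdot 2t + x \cdot \cos t = 2t\sin t + t^2\cos t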
Back-propagation Algorithm
Total or Exact Differential
The differential of a function of two or more
variables with respect to a single parameter in terms
of which these variables are expressed, equal to the
sum of the products of each partial derivative of the
function with the corresponding increment. If
z = f(x, y), x = U(t), y = V(t)
then under appropriate conditions, the total
differential is:
dz = \frac{\partial z}{\partial x} dx + \frac{\partial z}{\partial y} dy
21
Back-propagation Algorithm
Chain rule of calculus
A theorem that may be used in the differentiation of
a function of a function:
\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx}
where y is a differentiable function of t, and t is a
differentiable function of x. This enables a function
y = f(x) to be differentiated by finding a suitable
intermediate variable u such that y is a differentiable
function of u and u is a differentiable function of x.
22
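A simple worked example (not from the slides): with y = \sin u and u = x^2,

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = \cos u \cdot 2x = 2x\cos(x^2)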
Back-propagation Algorithm
Chain rule of calculus
Similarly for partial differentiation
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x}
where f is a function of u and u is a function of x.
23
Back-propagation Algorithm
Chain rule of calculus
Now consider the error signal at the output of a
neuron j at iteration n, i.e. the presentation of the nth
training example:
e_j(n) = d_j(n) - y_j(n)
where d_j is the desired output and y_j is the actual
output, and the total error energy over all the
neurons in the output layer:
E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)
where C is the set of all the neurons in the
output layer.
24
Back-propagation Algorithm
Chain rule of calculus
If N is the total number of patterns, the
averaged squared error energy is:
E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)
Note that e_j is a function of y_j and w_{ij} (the
weights of the connections between neurons in
two adjacent layers ‘i’ and ‘j’).
25
Back-propagation Algorithm
A derivation of the BP algorithm
The total error energy E(n) can be computed by summing the
instantaneous energy over all the neurons in the output
layer:
E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)
where the set C includes all the neurons in the output layer.
Averaging E(n) over all N training patterns gives the
averaged squared error energy:
E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)
The instantaneous error energy E(n), and therefore the
average energy E_{av}, is a function of the free parameters,
including synaptic weights and bias levels.
26
Back-propagation Algorithm
The originators of the BP algorithm suggest
that
\Delta w_{ij}(n) = -\eta \frac{\partial E}{\partial w_{ij}}
where \eta is the learning-rate parameter of the BP
algorithm. The minus sign indicates the use of
gradient descent in weight space, seeking a direction
for the weight change that reduces the value of E(n):
\Delta w_{ij}(n) = \eta \, \delta_j(n) \, y_i(n)
27
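A minimal sketch of this gradient-descent update for a single weight, assuming the gradient \partial E / \partial w_{ij} has already been computed (the values below are made up):

def update_weight(w, grad_E_wrt_w, eta=0.1):
    # Gradient descent: move the weight against the gradient of the error energy.
    return w - eta * grad_E_wrt_w

print(update_weight(w=0.3, grad_E_wrt_w=-0.05))  # 0.305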
Back-propagation Algorithm
A derivation of the BP algorithm
E(n) is a function of e_j(n);
e_j(n) is a function of y_j(n);
y_j(n) is a function of v_j(n);
v_j(n) is a function of w_{ji}(n).
Hence E(n) is a function of a function of a function of a function of w_{ji}(n).
28
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation algorithm trains a
multilayer perceptron by propagating back
some measure of responsibility to a hidden
(non-output) unit.
Back-propagation:
• Is a local rule for synaptic adjustment;
• Takes into account the position of a neuron in
a network to indicate how that neuron’s weights are
to change.
29
Back-propagation Algorithm
A derivation of the BP algorithm
Layers in back-propagating multi-layer perceptron
1. First layer – comprises the fixed input units;
2. Second layer (and possibly several subsequent layers)
– comprises the trainable ‘hidden’ units carrying an internal
representation;
3. Last layer – comprises the trainable output units of
the multi-layer perceptron.
30
Back-propagation Algorithm
A derivation of the BP algorithm
Modern back-propagation algorithms are based on a
formalism for propagating back the changes in the error
energy E with respect to all the weights w_{ij} from a unit
(i) to its inputs (j).
More precisely, what is being computed during the
backward propagation is the rate of change of the error
energy E with respect to the network's weights. This
computation is also called the computation of the
gradient:
\frac{\partial E}{\partial w_{ij}}
31
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backward
propagation is the rate of change of the error energy E with
respect to the network's weights. This computation is also called
the computation of the gradient:
\frac{\partial E}{\partial w_{ij}} = \frac{\partial \left( \sum_j e_j^2 / 2 \right)}{\partial w_{ij}} = \frac{\partial \left( \sum_j (d_j - y_j)^2 / 2 \right)}{\partial w_{ij}} = \frac{1}{2} \sum_j \frac{\partial (d_j - y_j)^2}{\partial w_{ij}} = \frac{1}{2} \cdot 2 \cdot \sum_j (d_j - y_j) \cdot \frac{\partial (d_j - y_j)}{\partial w_{ij}}
32
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backward
propagation is the rate of change of the error energy E with
respect to the network's weights. This computation is also called
the computation of the gradient:
\frac{\partial E}{\partial w_{ij}} = \sum_j e_j \cdot \frac{\partial (d_j - y_j)}{\partial w_{ij}} = \sum_j e_j \cdot \left( 0 - \frac{\partial y_j}{\partial w_{ij}} \right) = - \sum_j e_j \frac{\partial y_j}{\partial w_{ij}}
33
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation learning rule is formulated
to change the connection weights w_{ij} so as to
reduce the error energy E by gradient descent:
\Delta w_{ij} = w_{ij}^{new} - w_{ij}^{old} \propto - \frac{\partial E}{\partial w_{ij}} = \sum_j e_j \frac{\partial y_j}{\partial w_{ij}} = \sum_j (d_j - y_j) \frac{\partial y_j}{\partial w_{ij}}
34
Back-propagation Algorithm
A derivation of the BP algorithm
The error energy E is a function of the error e, the output y,
the weighted sum of all the inputs v, and of the weights w_{ji}:
E = E(e_j, w_{ji}, y_j, v_j)
According to the chain rule, then:
\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial w_{ji}}
35
Back-propagation Algorithm
A derivation of the BP algorithm
Further applications of the chain rule suggest that:
\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ji}}
\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j} \cdot \frac{\partial v_j}{\partial w_{ji}}
36
Back-propagation Algorithm
A derivation of the BP algorithm
Each of the partial derivatives can be simplified as follows:
\frac{\partial E(n)}{\partial e_j} = e_j(n), \quad \text{since } E(n) = \frac{1}{2}\sum_{j \in C} e_j^2(n)
\frac{\partial e_j}{\partial y_j} = -1, \quad \text{since } e_j(n) = d_j(n) - y_j(n)
\frac{\partial y_j}{\partial v_j} = \varphi_j'(v_j(n)), \quad \text{since } y_j(n) = \varphi_j(v_j(n))
\frac{\partial v_j}{\partial w_{ji}} = y_i(n), \quad \text{since } v_j(n) = \sum_i w_{ji}(n) \, y_i(n)
37
Back-propagation Algorithm
A derivation of the BP algorithm
The rate of change of the error energy E
with respect to changes in the synaptic
weights is:
\frac{\partial E(n)}{\partial w_{ji}} = - e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)
38
Back-propagation Algorithm
A derivation of the BP algorithm
The so-called delta rule suggests that
\Delta w_{ji}(n) = - \eta \frac{\partial E(n)}{\partial w_{ji}}
or
\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)
where \delta_j(n) is called the local gradient.
39
Back-propagation Algorithm
A derivation of the BP algorithm
The local gradient is given by
\delta_j(n) = - \frac{\partial E(n)}{\partial v_j(n)}
or
\delta_j(n) = e_j(n) \, \varphi_j'(v_j(n))
40
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j:
\Delta w_{ji}(n) = \eta \, e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)
The weight adjustment requires the
computation of the error signal e_j(n).
41
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta
rule suggests that
\Delta w_{ji}(n) = \eta \, e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)
Logistic function:
\varphi(v_j(n)) = \frac{1}{1 + \exp(-v_j(n))}
Derivative of \varphi(v_j(n)) with respect to v_j(n):
\varphi'(v_j(n)) = \left[ \frac{1}{1 + \exp(-v_j(n))} \right]^2 \exp(-v_j(n))
\varphi'(v_j(n)) = \frac{1}{1 + \exp(-v_j(n))} \left[ 1 - \frac{1}{1 + \exp(-v_j(n))} \right]
42
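The identity above, \varphi'(v) = \varphi(v)\,[1 - \varphi(v)], can be checked numerically; a small sketch (the test value v = 0.5 is arbitrary):

import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

v = 0.5
y = logistic(v)
analytic = y * (1.0 - y)                                    # phi'(v) = y * (1 - y)
numeric = (logistic(v + 1e-6) - logistic(v - 1e-6)) / 2e-6  # central difference
print(abs(analytic - numeric) < 1e-6)  # True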
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta
rule suggests that
\Delta w_{ji}(n) = \eta \, e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)
Since y_j(n) = \varphi_j(v_j(n)), the derivative of the logistic
function can be written as
\varphi'(v_j(n)) = y_j(n) \left[ 1 - y_j(n) \right]
and
\Delta w_{ji}(n) = \eta \, e_j(n) \, y_j(n) \left[ 1 - y_j(n) \right] y_i(n)
43
Back-propagation Algorithm
A derivation of the BP algorithm
For the case of the hidden neuron (j), the delta rule suggests that
\Delta w_{ji}(n) = \eta \, e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)
The weight adjustment requires the computation of
the error signal, but the situation is complicated in
that we do not have a (set of) desired output(s) for
the hidden neurons and consequently we will have
difficulty in computing the error signal ej for the
hidden neuron j.
44
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
Recall that the local gradient of the hidden neuron j is
given as \delta_j(n), and that y_j(n) is the output of the
neuron and equals \varphi_j(v_j(n)).
\delta_j(n) = - \frac{\partial E(n)}{\partial v_j(n)} = - \frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)}
(we have used the chain rule of calculus here)
\delta_j(n) = - \frac{\partial E(n)}{\partial y_j(n)} \, \varphi_j'(v_j(n))
45
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the ‘error’ energy
relevant to the hidden neurons is given as
E(n) = \frac{1}{2} \sum_k e_k^2(n)
where the sum runs over the output neurons k.
The rate of change of this error energy with respect to
y_j(n) – the output of hidden neuron j, which feeds the
output layer – is given as:
\frac{\partial E(n)}{\partial y_j(n)} = \frac{1}{2} \sum_k \frac{\partial e_k^2(n)}{\partial y_j(n)} = \sum_k e_k(n) \, \frac{\partial e_k(n)}{\partial y_j(n)}
46
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): applying the chain
rule once more, the rate of change of the error energy
with respect to y_j(n) is given as:
\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \, \frac{\partial e_k(n)}{\partial v_k(n)} \cdot \frac{\partial v_k(n)}{\partial y_j(n)}
47
Back-propagation Algorithm
A derivation of the BP algorithm
For the weights w_{kj} feeding the output neurons (k)
from the hidden neuron j, the delta rule suggests that
\Delta w_{kj}(n) = \eta \, e_k(n) \, \varphi_k'(v_k(n)) \, y_j(n)
The error signal for a hidden neuron has, thus, to be
computed recursively in terms of the error signals
of ALL the neurons to which the hidden neuron j is
directly connected. Therefore, the backpropagation
algorithm becomes complicated.
48
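The recursion described above leads to the standard expression \delta_j = \varphi_j'(v_j) \sum_k \delta_k w_{kj} for a hidden neuron; a minimal sketch for logistic units (the numbers are illustrative, not from the slides):

def hidden_delta(y_j, downstream_deltas, downstream_weights):
    # Local gradient of a hidden neuron with a logistic activation:
    # delta_j = y_j * (1 - y_j) * sum_k(delta_k * w_kj)
    return y_j * (1.0 - y_j) * sum(d_k * w_kj
                                   for d_k, w_kj in zip(downstream_deltas, downstream_weights))

# One hidden neuron with output 0.5, feeding one output neuron whose local
# gradient is 0.125 through a weight of 3.0.
print(hidden_delta(0.5, [0.125], [3.0]))  # 0.09375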
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the local
gradient is given by
\delta_j(n) = - \frac{\partial E(n)}{\partial v_j(n)}
or
\delta_j(n) = e_j(n) \, \varphi_j'(v_j(n))
49
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the local
gradient is given by
\delta_j(n) = - \frac{\partial E(n)}{\partial v_j(n)}
which can be redefined as
\delta_j(n) = - \frac{\partial E(n)}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j}
\delta_j(n) = - \frac{\partial E(n)}{\partial y_j} \, \varphi_j'(v_j(n))
50
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
The error energy for the hidden neuron can be
given as
E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)
Note that we have used the index k instead of
the index j in order to avoid confusion.
51