Back-propagation Algorithm


Back-propagation Algorithm:
Supervised Learning
Backpropagation (BP) is among the most popular algorithms for ANNs: Paul Werbos, who first worked on the algorithm in the 1970s, has estimated that between 40% and 90% of real-world ANN applications use the BP algorithm. Werbos traces the algorithm to the psychologist Sigmund Freud's theory of psychodynamics, and he first applied it to political forecasting.
• David Rumelhart, Geoffrey Hinton and others applied the BP algorithm in the 1980s to problems of supervised learning, particularly pattern recognition.
• The BP algorithm has proved most useful in problems related to prediction and control.
1
Back-propagation Algorithm
BASIC DEFINITIONS
1. Backpropagation is a procedure for efficiently calculating
the derivatives of some output quantity of a non-linear
system, with respect to all inputs and parameters of that
system, through calculations proceeding backwards from
outputs to inputs.
2. Backpropagation is any technique for adapting the weights
or parameters of a nonlinear system by somehow using such
derivatives or the equivalent.
According to Paul Werbos there is no such thing as a "backpropagation network"; he used an ANN design called a multilayer perceptron.
2
Back-propagation Algorithm
Paul Werbos provided a rule for updating the weights of a multi-layered network undergoing supervised learning. It is this weight adaptation rule which is called backpropagation.
Typically, a fully connected feedforward network is trained using the BP algorithm: activation in such networks travels from the input layer towards the output layer, and each unit in one layer is connected to every unit in the next layer.
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
3
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Forward Sweep: This sweep is similar to that of any other feedforward ANN – the input stimulus is presented to the network, each unit computes the weighted sum of its inputs and passes this sum through a squashing function, and the ANN subsequently generates an output.
The ANN may have a number of hidden layers, as in the multilayer perceptron, and the output from each hidden layer becomes the input to the next layer forward.
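A minimal Python sketch of the forward sweep just described (illustrative only; the names `sigmoid`, `forward_sweep` and `weights` are not from the slides, and bias units are omitted for brevity):

import numpy as np

def sigmoid(v):
    # squashing function used throughout these notes
    return 1.0 / (1.0 + np.exp(-v))

def forward_sweep(x, weights):
    # `weights` is a list of weight matrices, one per layer; each unit's activation
    # is the squashed weighted sum of the previous layer's outputs.
    activations = [np.asarray(x, dtype=float)]
    for W in weights:
        v = W @ activations[-1]           # weighted sum of inputs to the layer
        activations.append(sigmoid(v))    # output becomes the next layer's input
    return activations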
4
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward
sweep and backward sweep.
Backward Sweep: This sweep is similar to the forward sweep, except that what is 'swept' are the error values. These values are essentially the differences between the desired output and the actual output, $e_j = d_j - o_j$.
The ANN may have a number of hidden layers, as in the multilayer perceptron, and the output from each hidden layer becomes the input to the next layer forward.
In the backward sweep the output units send their errors back to the nearest hidden layer, which in turn passes them on to the next hidden layer. No error signal is sent to the input units.
5
Back-propagation Algorithm
For each input vector associate a target output vector
while not STOP
    STOP = TRUE
    for each input vector
        • perform a forward sweep to find the actual output
        • obtain an error vector by comparing the actual and target output
        • if the actual output is not within tolerance, set STOP = FALSE
        • use the backward sweep to determine weight changes
        • update weights
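A minimal Python sketch of this training loop for a single-hidden-layer network (illustrative; the names `train_bp`, `hidden_w`, `output_w` and the default values for `eta`, `tol` and `max_epochs` are assumptions, not part of the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_bp(inputs, targets, hidden_w, output_w, eta=0.25, tol=0.1, max_epochs=1000):
    for _ in range(max_epochs):
        stop = True
        for x, d in zip(inputs, targets):
            # forward sweep
            y_h = sigmoid(hidden_w @ x)
            y_o = sigmoid(output_w @ y_h)
            # error vector: e = d - actual output
            e = d - y_o
            if np.any(np.abs(e) > tol):
                stop = False
            # backward sweep: local gradients (deltas) for output and hidden units
            delta_o = e * y_o * (1.0 - y_o)
            delta_h = y_h * (1.0 - y_h) * (output_w.T @ delta_o)
            # weight updates: delta_w = eta * local gradient * input to the weight
            output_w = output_w + eta * np.outer(delta_o, y_h)
            hidden_w = hidden_w + eta * np.outer(delta_h, x)
        if stop:
            break
    return hidden_w, output_w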
6
Back-propagation Algorithm
Example: Perform a complete forward and backward sweep of a 2-2-1 network (2 input units, 2 hidden-layer units and 1 output unit) with the following architecture. The target output is d = 0.9. The input is [0.1 0.9].
[Figure: 2-2-1 network with numbered units (inputs 1 and 2, hidden units 4 and 5, output unit 6, bias units 0 and 3) and labelled connection weights; the full calculation for this network, including the weight values, is worked out on slide 52.]
7
Back-propagation Algorithm
Forward Pass
Target value is 0.9, so the error for the output unit is:
(0.900 – 0.288) × 0.288 × (1 – 0.288) = 0.125
[Figure: example of a forward pass and a backward pass through a 2-2-2-1 feedforward network; the boxes show the input to and output from each unit and the error values. The output of the network for this pass is 0.288.]
8
Back-propagation Algorithm
New weights calculated following the errors derived above (learning rate 0.8; each new weight = old weight + 0.8 × error × input to the weight):
-2 + (0.8 × 0.125) × 1 = -1.900
3 + (0.8 × 0.125) × 0.122 = 3.012
1 + (0.8 × 0.125) × 0.728 = 1.073
3 + (0.8 × 0.04) × 1 = 3.032
-2 + (0.8 × 0.04) × 0.5 = -1.984
-4 + (0.8 × 0.04) × 0.993 = -3.968
-2 + (0.8 × 0.025) × 1 = -1.980
2 + (0.8 × 0.025) × 0.5 = 2.010
2 + (0.8 × 0.025) × 0.993 = 2.019
2 + (0.8 × -0.007) × 1 = 1.994
-2 + (0.8 × -0.007) × 0.1 = -2.001
-2 + (0.8 × -0.007) × 0.9 = -2.005
2 + (0.8 × -0.001) × 1 = 1.999
3 + (0.8 × -0.001) × 0.1 = 2.999
3 + (0.8 × -0.001) × 0.9 = 2.999
9
Back-propagation Algorithm
A derivation of the BP algorithm
The error signal at the output of neuron j at the nth training cycle is given as:
$e_j(n) = d_j(n) - y_j(n)$
The instantaneous value of the error energy for neuron j is
$\frac{1}{2} e_j^2(n)$
The total error energy E(n) can be computed by summing the instantaneous energy over all the neurons in the output layer:
$E(n) = \sum_{j \in C} \frac{1}{2} e_j^2(n)$
where the set C includes all output neurons.
10
Back-propagation Algorithm
Derivative or differential coefficient
For a function f(x) at the argument x, the derivative is the limit of the differential coefficient
$\frac{f(x + \Delta x) - f(x)}{\Delta x}$
as $\Delta x \to 0$.
[Figure: the curve y = f(x) with a chord PQ joining the points P = (x, y) and Q = (x + Δx, y + Δy).]
$\frac{dy}{dx} = \lim \frac{\Delta y}{\Delta x}$ as P approaches Q.
11
Back-propagation Algorithm
Derivative or differential coefficient
Typically defined for a function of a single variable: if the left-hand and right-hand limits exist and are equal, the derivative is the gradient of the curve at x, and is the limit of the gradient of the chord joining the points (x, f(x)) and (x + Δx, f(x + Δx)). The function of x defined as this limit for each argument x is the first derivative $y' = f'(x)$.
12
Back-propagation Algorithm
Partial derivative or partial differential coefficient
The derivative of a function of two or more variables with respect to one of these variables, the others being regarded as constant; written as:
$\frac{\partial f}{\partial x}$
13
Back-propagation Algorithm
Total Derivative
The derivative of a function of two or more variables with respect to a single parameter in terms of which these variables are expressed:
if z = f(x, y) with parametric equations x = U(t), y = V(t),
then under appropriate conditions the total derivative is:
$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$
14
Back-propagation Algorithm
Total or Exact Differential
The differential of a function of two or more variables with respect to a single parameter in terms of which these variables are expressed, equal to the sum of the products of each partial derivative of the function with the corresponding increment. If z = f(x, y), x = U(t), y = V(t), then under appropriate conditions the total differential is:
$dz = \frac{\partial z}{\partial x}dx + \frac{\partial z}{\partial y}dy$
15
Back-propagation Algorithm
Chain rule of calculus
A theorem that may be used in the differentiation of a function of a function:
$\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx}$
where y is a differentiable function of t, and t is a differentiable function of x. This enables a function y = f(x) to be differentiated by finding a suitable intermediate variable u such that y is a differentiable function of u and u is a differentiable function of x.
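As a concrete illustration (not taken from the slides): let $y = \sin t$ with $t = x^2$; then $\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx} = (\cos t)(2x) = 2x\cos(x^2)$.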
16
Back-propagation Algorithm
Chain rule of calculus
Similarly for partial differentiation:
$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x}$
where f is a function of u and u is a function of x.
17
Back-propagation Algorithm
Chain rule of calculus
Now consider the error signal at the output of a neuron j at iteration n, i.e. the presentation of the nth training example:
$e_j(n) = d_j(n) - y_j(n)$
where $d_j$ is the desired output and $y_j$ is the actual output, and the total error energy over all the neurons in the output layer is:
$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
where C is the set of all the neurons in the output layer.
18
Back-propagation Algorithm
Chain rule of calculus
If N is the total number of patterns, the averaged squared error energy is:
$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
Note that $e_j$ is a function of $y_j$ and $w_{ij}$ (the weights of the connections between neurons in two adjacent layers i and j).
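A small Python sketch of these two quantities, assuming the desired and actual outputs are held in NumPy arrays (the names `error_energy` and `average_error_energy` are illustrative):

import numpy as np

def error_energy(d, y):
    e = d - y                      # e_j(n) = d_j(n) - y_j(n)
    return 0.5 * np.sum(e ** 2)    # E(n) = 1/2 * sum_j e_j(n)^2

def average_error_energy(D, Y):
    # D, Y: arrays of shape (N, n_outputs) holding all N training patterns
    return np.mean([error_energy(d, y) for d, y in zip(D, Y)])   # E_av = (1/N) * sum_n E(n)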
19
Back-propagation Algorithm
A derivation of the BP algorithm
The total error energy E(n) can be computed by summing the instantaneous energy over all the neurons in the output layer:
$E(n) = \sum_{j \in C} \frac{1}{2} e_j^2(n)$
where the set C includes all the neurons in the output layer.
The averaged squared error energy over all N training patterns is:
$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
The instantaneous error energy E(n), and therefore the average energy $E_{av}$, is a function of the free parameters, including synaptic weights and bias levels.
20
Back-propagation Algorithm
The originators of the BP algorithm suggest that
$\Delta w_{ij}(n) = -\eta \frac{\partial E}{\partial w_{ij}}$
where $\eta$ is the learning-rate parameter of the BP algorithm. The minus sign indicates the use of gradient descent in weight space: we seek a direction for the weight change that reduces the value of E(n).
$\Delta w_{ij}(n) = \eta\, \delta_j(n)\, y_i(n)$
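A minimal Python sketch of this update (the function names and the way the gradient is supplied are illustrative assumptions):

def update_weight(w_ij, grad_E, eta):
    # Gradient descent: the minus sign moves the weight against dE/dw_ij,
    # seeking a direction of change that reduces the error energy E(n).
    return w_ij - eta * grad_E

def delta_rule_update(w_ij, delta_j, y_i, eta):
    # Equivalent form using the local gradient: delta_w_ij = eta * delta_j * y_i
    return w_ij + eta * delta_j * y_i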
21
Back-propagation Algorithm
A derivation of the BP algorithm
$E(n)$ is a function of $e_j(n)$
$e_j(n)$ is a function of $y_j(n)$
$y_j(n)$ is a function of $v_j(n)$
$v_j(n)$ is a function of $w_{ji}(n)$
Thus E(n) is a function of a function of a function of a function of $w_{ji}(n)$.
22
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation algorithm trains a
multilayer perceptron by propagating back
some measure of responsibility to a hidden
(non-output) unit.
Back-propagation:
• Is a local rule for synaptic adjustment;
• Takes into account the position of a neuron in a network to indicate how the neuron's weights are to change.
23
Back-propagation Algorithm
A derivation of the BP algorithm
Layers in a back-propagating multi-layer perceptron:
1. First layer – comprises the fixed input units;
2. Second layer (and possibly several subsequent layers) – comprises trainable 'hidden' units carrying an internal representation;
3. Last layer – comprises the trainable output units of the multi-layer perceptron.
24
Back-propagation Algorithm
A derivation of the BP algorithm
Modern back-propagation algorithms are based on a formalism for propagating back the changes in the error energy E with respect to all the weights $w_{ij}$ from a unit (i) to its inputs (j).
More precisely, what is being computed during the backward propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient:
$\frac{\partial E}{\partial w_{ij}}$
25
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backward propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial \left(\sum_j e_j^2 / 2\right)}{\partial w_{ij}} = \frac{\partial \left(\sum_j (d_j - y_j)^2 / 2\right)}{\partial w_{ij}} = \frac{1}{2} \cdot 2 \cdot \sum_j (d_j - y_j) \cdot \frac{\partial (d_j - y_j)}{\partial w_{ij}}$
26
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backward propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient:
$\frac{\partial E}{\partial w_{ij}} = \sum_j e_j \left( \frac{\partial d_j}{\partial w_{ij}} - \frac{\partial y_j}{\partial w_{ij}} \right) = \sum_j e_j \left( 0 - \frac{\partial y_j}{\partial w_{ij}} \right) = -\sum_j e_j \frac{\partial y_j}{\partial w_{ij}}$
27
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation learning rule is formulated to change the connection weights $w_{ij}$ so as to reduce the error energy E by gradient descent:
$\Delta w_{ij} = w_{ij}^{new} - w_{ij}^{old} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \sum_j e_j \frac{\partial y_j}{\partial w_{ij}} = \eta \sum_j (d_j - y_j) \frac{\partial y_j}{\partial w_{ij}}$
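One way to sanity-check this gradient expression is to compare it against a finite-difference estimate for a single sigmoid output unit. A Python sketch (all names and numerical values are illustrative; the input values 0.1 and 0.9 merely echo the worked example):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def error_energy(w, x, d):
    # E = 1/2 (d - y)^2 for a single sigmoid unit, y = sigmoid(w . x)
    y = sigmoid(np.dot(w, x))
    return 0.5 * (d - y) ** 2

def analytic_gradient(w, x, d):
    # dE/dw_i = -(d - y) * y * (1 - y) * x_i, following the derivation above
    y = sigmoid(np.dot(w, x))
    return -(d - y) * y * (1.0 - y) * x

def numerical_gradient(w, x, d, h=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        g[i] = (error_energy(w_plus, x, d) - error_energy(w_minus, x, d)) / (2 * h)
    return g

# The two gradients should agree to several decimal places.
w = np.array([0.1, -0.2, 0.1])
x = np.array([1.0, 0.1, 0.9])   # bias input plus two illustrative inputs
d = 0.9
print(analytic_gradient(w, x, d), numerical_gradient(w, x, d))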
28
Back-propagation Algorithm
A derivation of the BP algorithm
The error energy E is a function of the error e, the output y, the weighted sum of all the inputs v, and of the weights $w_{ji}$:
$E = E(e_j, w_{ji}, y_j, v_j)$
According to the chain rule then:
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial w_{ji}}$
29
Back-propagation Algorithm
A derivation of the BP algorithm
Further applications of the chain rule suggest that:
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ji}}$
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j} \cdot \frac{\partial v_j}{\partial w_{ji}}$
30
Back-propagation Algorithm
A derivation of the BP algorithm
Each of the partial derivatives can be simplified as follows:
$\frac{\partial E(n)}{\partial e_j} = \frac{\partial}{\partial e_j}\left(\frac{1}{2}\sum_{j \in C} e_j^2(n)\right) = e_j$
$\frac{\partial e_j}{\partial y_j} = \frac{\partial}{\partial y_j}\left(d_j - y_j\right) = -1$
$\frac{\partial y_j}{\partial v_j} = \frac{\partial}{\partial v_j}\,\varphi\!\left(v_j(n)\right) = \varphi'\!\left(v_j(n)\right)$
$\frac{\partial v_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\left(\sum_i w_{ji}(n)\, y_i(n)\right) = y_i(n)$
31
Back-propagation Algorithm
A derivation of the BP algorithm
The rate of change of the error energy E with respect to changes in the synaptic weights is therefore:
$\frac{\partial E(n)}{\partial w_{ji}} = -e_j(n)\,\varphi'\!\left(v_j(n)\right) y_i(n)$
32
Back-propagation Algorithm
A derivation of the BP algorithm
The so-called delta rule suggests that
$\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}}$
or
$\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$
where $\delta$ is called the local gradient.
33
Back-propagation Algorithm
A derivation of the BP algorithm
The local gradient $\delta$ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$
or
$\delta_j(n) = e_j(n)\,\varphi_j'\!\left(v_j(n)\right)$
34
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j:
$\Delta w_{ji}(n) = \eta\, e_j(n)\,\varphi_j'\!\left(v_j(n)\right) y_i(n)$
The weight adjustment requires the computation of the error signal.
35
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\,\varphi_j'\!\left(v_j(n)\right) y_i(n)$
Logistic function:
$\varphi\!\left(v_j(n)\right) = \frac{1}{1 + \exp\!\left(-v_j(n)\right)}$
Derivative of $\varphi\!\left(v_j(n)\right)$ with respect to $v_j(n)$:
$\varphi'\!\left(v_j(n)\right) = \frac{\exp\!\left(-v_j(n)\right)}{\left(1 + \exp\!\left(-v_j(n)\right)\right)^2}$
$\varphi'\!\left(v_j(n)\right) = \frac{1}{1 + \exp\!\left(-v_j(n)\right)}\left(1 - \frac{1}{1 + \exp\!\left(-v_j(n)\right)}\right)$
36
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\,\varphi_j'\!\left(v_j(n)\right) y_i(n)$
With $y_j(n) = \varphi\!\left(v_j(n)\right)$,
$\varphi'\!\left(v_j(n)\right) = y_j(n)\left[1 - y_j(n)\right]$
and therefore
$\Delta w_{ji}(n) = \eta\, e_j(n)\, y_j(n)\left[1 - y_j(n)\right] y_i(n)$
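A small Python sketch of the logistic function and this convenient form of its derivative (function names are illustrative):

import numpy as np

def logistic(v):
    # phi(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def logistic_derivative(v):
    # phi'(v) = phi(v) * (1 - phi(v)) = y * (1 - y), as derived above
    y = logistic(v)
    return y * (1.0 - y)

# Quick check against the closed form exp(-v) / (1 + exp(-v))^2:
v = 0.485
print(logistic_derivative(v), np.exp(-v) / (1.0 + np.exp(-v)) ** 2)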
37
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neuron j: the delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\,\varphi_j'\!\left(v_j(n)\right) y_i(n)$
The weight adjustment requires the computation of the error signal, but the situation is complicated in that we do not have a (set of) desired output(s) for the hidden neurons, and consequently we will have difficulty in computing the error signal $e_j$ for the hidden neuron j.
38
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
Recall that the local gradient of the hidden neuron j is given as $\delta_j$ and that the output $y_j$ equals $\varphi_j(v_j)$.
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)}$
(we have used the chain rule of calculus here)
$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi_j'\!\left(v_j(n)\right)$
39
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the 'error' energy related to the hidden neurons is given as
$E(n) = \frac{1}{2} \sum_k e_k^2(n)$
The rate of change of the error energy with respect to $y_j(n)$, which acts as the input during the backward pass, is given as:
$\frac{\partial E(n)}{\partial y_j(n)} = \frac{1}{2} \sum_k \frac{\partial e_k^2(n)}{\partial y_j(n)}$
$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k \frac{\partial e_k(n)}{\partial y_j(n)}$
40
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the rate of change of the error energy with respect to $y_j(n)$, which acts as the input during the backward pass, is given as:
$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \cdot \frac{\partial v_k(n)}{\partial y_j(n)}$
41
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the neurons (k) fed by the hidden neuron j: the delta rule suggests that
$\Delta w_{kj}(n) = \eta\, e_k(n)\,\varphi_k'\!\left(v_k(n)\right) y_j(n)$
The error signal for a hidden neuron therefore has to be computed recursively in terms of the error signals of ALL the neurons to which the hidden neuron j is directly connected. This is what makes the back-propagation algorithm more involved for the hidden layers.
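Completing this argument in the standard notation (this closed form is not written out on the slide itself): the local gradient of hidden neuron j is obtained from the local gradients of the neurons k it feeds, $\delta_j(n) = \varphi_j'\!\left(v_j(n)\right) \sum_k \delta_k(n)\, w_{kj}(n)$, after which the usual update $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$ applies to the hidden layer as well.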
42
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the local gradient $\delta$ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j}$
or
$\delta_j(n) = e_j(n)\,\varphi_j'\!\left(v_j(n)\right)$
43
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): the local gradient $\delta$ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j}$
which can be rewritten as
$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j}$
$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j}\,\varphi'\!\left(v_j(n)\right)$
44
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
The error energy for the hidden neuron can be given as
$E(n) = \sum_{k \in C} \frac{1}{2} e_k^2(n)$
Note that we have used the index k instead of the index j in order to avoid confusion.
45
Examples
1.
(a) Describe why an elementary perceptron cannot solve the XOR problem.
(b) Consider a neural network with two input nodes and one output node connected through a single hidden layer with two neurons, a 2-2-1 network, for solving the XOR problem. Draw the architectural graph and the signal-flow graph of the above-mentioned network, clearly showing the connection weights between the input, hidden and output layers of the 2-2-1 network.
(c) Tabulate the inputs and outputs for the 2-2-1 network, showing the hidden-layer and output responses to the input vectors [1 1], [1 0], [0 1] and [0 0].
(d) Plot the decision boundaries constructed by hidden neurons 1 and 2 of the 2-2-1 network above and the decision boundary constructed by the complete 2-2-1 network.
46
Examples
2.
(a) To which of the two paradigms, 'learning with a teacher' and 'learning without a teacher', do the following learning algorithms belong: perceptron learning; Hebbian learning; the δ-learning rule; and competitive learning? Justify your answers. Draw block diagrams explaining how a neural network can be trained using algorithms which learn with or without a teacher.
(b) According to Amari's general learning rule, the weight vector $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ increases in proportion to the product of the input x and the learning signal r. The learning signal r is in general a function of $w_i$ and x, and for supervised learning networks it is also a function of the teacher's signal $d_i$.
Consider an initial weight vector of w1^T = [1 -1 0 0.5] and 3 training vectors:
x1^T = [1 -2 0 -1]; x2^T = [0 1.5 -0.5 -1]; x3^T = [-1 1 0.5 -1]
The teacher's responses for x1, x2 and x3 are d1 = -1, d2 = -1 and d3 = 1 respectively.
47
Examples
2.
(i) Describe how the weight vector w1^T will change in one cycle when each of the vectors x1, x2 and x3 is shown to a discrete bipolar neuron which learns according to the perceptron learning rule. Assume the learning constant c to be 0.1.
(ii) Describe how the weight vector w1^T will change if we used the δ-learning rule instead of the perceptron learning rule.
(iii) Describe how the weights will change in one cycle for a network which learns according to the Hebbian learning rule but is trained with the following inputs:
x1^T = [1 -2 1.5 0]; x2^T = [1 -0.5 -2 -1.5]; x3^T = [0 1 -1 1.5]
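For reference, a minimal Python sketch of a single perceptron-learning-rule update of the kind asked for in part (i) (the names `sgn` and `perceptron_step` are illustrative; this sketch does not work the exercise):

import numpy as np

def sgn(v):
    # discrete bipolar activation
    return 1.0 if v >= 0 else -1.0

def perceptron_step(w, x, d, c=0.1):
    # one update: w <- w + c * (d - o) * x, with o = sgn(w . x)
    o = sgn(np.dot(w, x))
    return w + c * (d - o) * x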
48
Examples
3.
(a) Define the weight adaptation rule described in the neural networks literature as back-propagation in terms of the two patterns of computation used in the implementation of the rule: the forward and backward pass.
(b) Consider a multi-layer feedforward network with a 2-2-1 architecture:
[Figure: 2-2-1 network with input units 1 and 2, hidden units 3 and 4, and output unit 5.]
This network is to be trained using the back-propagation rule on an input pattern x, with x^T = [0.1 0.9], and a target output d of 0.9.
Perform a complete forward- and backward-pass computation using the input x and the target d for one cycle. Compute the outputs of the hidden-layer neurons 3 and 4 and of the output neuron 5. Compute the error for the output node and for the hidden nodes.
49
Examples
4.
(a) A self-organising map has been trained with an input consisting of eight Gaussian clouds with unit variance but different centres, located at eight points around a cube: (0,0,0,0,0,0,0,0), (4,0,0,0,0,0,0,0) … (0,4,4,0,0,0,0,0) (Figure 1). This 3-dimensional input structure was represented by a two-dimensional lattice of 10 by 10 neurons forming the output layer of a Kohonen self-organising map (SOM). Table 1 shows the feature map.
[Table 1: Feature map provided by the SOM – a 10 × 10 grid in which each output neuron is labelled with the cluster (1–8) it represents.]
[Fig. 1: Input pattern – the eight Gaussian clouds.]
50
Examples
4.
(a) cont.
Do you think that the SOM has classified the Gaussian clouds correctly? Draw the clusters out in the feature map. What conclusions can be drawn from the statement that 'the SOM algorithm preserves the topological relationships that exist in the input space'?
(b) A self-organised feature map (SOFM) with three input units and four output units is to be trained using the four training vectors:
x1^T = [0.8 0.7 0.4]; x2^T = [0.6 0.9 0.9]; x3^T = [0.3 0.4 0.1]; x4^T = [0.1 0.1 0.3]
The initial weight matrix is:
[0.5 0.4]
[0.6 0.2]
[0.8 0.5]
The initial radius is zero and the learning rate α is set to 0.5. Calculate the weight changes during the first cycle through the training data, taking the training vectors in the order given above.
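A minimal Python sketch of one SOM update step with neighbourhood radius zero (the winner selection by smallest Euclidean distance and the names `som_step` and `W` are assumptions; the question itself does not prescribe an implementation):

import numpy as np

def som_step(W, x, alpha=0.5):
    # W holds one weight column per output unit; only the winning unit
    # (the one whose weight vector is closest to x) moves towards the input.
    distances = np.linalg.norm(W.T - x, axis=1)   # distance from x to each unit's weights
    winner = int(np.argmin(distances))
    W[:, winner] += alpha * (x - W[:, winner])    # move the winner's weights towards x
    return winner, W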
51
Backpropagation/Training Algorithm
Forward pass from the inputs to the hidden layer and then to the output layer (unit 6), with units 0 and 3 being the bias units for the input and hidden layers respectively:
net4 = (1.0 × 0.1) + (0.1 × –0.2) + (0.9 × 0.1) = 0.170
o4 = 1 / (1 + exp(–0.170)) = 0.542
net5 = (1.0 × 0.1) + (0.1 × –0.1) + (0.9 × 0.3) = 0.360
o5 = 1 / (1 + exp(–0.360)) = 0.589
Similarly, from the hidden to the output layer:
net6 = (1.0 × 0.2) + (0.542 × 0.2) + (0.589 × 0.3) = 0.485
o6 = 1 / (1 + exp(–0.485)) = 0.619
The error for the output node is:
δ6 = (0.9 – 0.619) × 0.619 × (1 – 0.619) = 0.066
The errors for the hidden nodes are:
δ5 = 0.589 × (1.0 – 0.589) × (0.066 × 0.3) = 0.005
δ4 = 0.542 × (1.0 – 0.542) × (0.066 × 0.2) = 0.003
Note that the hidden unit errors are used to update the first layer of weights. There is no error calculated for the bias unit, as no weights from the first layer connect to the hidden bias.
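The same forward pass and error signals can be reproduced with a few lines of Python (a sketch; the array names are illustrative and the weight values are simply read off the calculation above):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([1.0, 0.1, 0.9])             # bias input, x1, x2
w4 = np.array([0.1, -0.2, 0.1])           # weights into hidden unit 4
w5 = np.array([0.1, -0.1, 0.3])           # weights into hidden unit 5
w6 = np.array([0.2, 0.2, 0.3])            # weights into output unit 6 (bias, o4, o5)

o4 = sigmoid(np.dot(w4, x))                            # 0.542
o5 = sigmoid(np.dot(w5, x))                            # 0.589
o6 = sigmoid(np.dot(w6, np.array([1.0, o4, o5])))      # 0.619

d = 0.9
delta6 = (d - o6) * o6 * (1 - o6)         # 0.066
delta5 = o5 * (1 - o5) * (delta6 * 0.3)   # 0.005 (0.3 is the weight from unit 5 to unit 6)
delta4 = o4 * (1 - o4) * (delta6 * 0.2)   # 0.003 (0.2 is the weight from unit 4 to unit 6)
print(o4, o5, o6, delta6, delta5, delta4)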
52
Backpropagation/Training Algorithm
The rate of learning for this example is taken as 0.25. There is no need to include a momentum term for this first pattern since there are no previous weight changes:
Δw5,6 = 0.25 × 0.066 × 0.589 = 0.010
The new weight is: 0.3 + 0.010 = 0.310
Δw4,6 = 0.25 × 0.066 × 0.542 = 0.009
Then: 0.2 + 0.009 = 0.209
Δw3,6 = 0.25 × 0.066 × 1.0 = 0.017
Finally: 0.2 + 0.017 = 0.217
The calculation of the new weights for the first layer is left as an exercise.
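And the corresponding weight updates, again as a sketch with illustrative names:

eta = 0.25
delta6, o4, o5 = 0.066, 0.542, 0.589   # values from the forward/backward pass above

dw_56 = eta * delta6 * o5              # 0.25 * 0.066 * 0.589 = 0.010 -> new w5,6 = 0.3 + 0.010 = 0.310
dw_46 = eta * delta6 * o4              # 0.25 * 0.066 * 0.542 = 0.009 -> new w4,6 = 0.2 + 0.009 = 0.209
dw_36 = eta * delta6 * 1.0             # 0.25 * 0.066 * 1.0   = 0.017 -> new w3,6 = 0.2 + 0.017 = 0.217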
53
Backpropagation/Training Algorithm
Introduction to 'artificial' neural nets
McCulloch-Pitts Neurons
Rosenblatt's Perceptron
Unsupervised Learning:
    Hebbian Learning
    Kohonen Maps
Supervised Learning:
    (Perceptron Learning)
    Backpropagation Algorithm
3 Questions; Answer any 2
54
COURSEWORK
Problem Area (Related Application Area)
Why Neural Nets (Learning)?
Learning Algorithm
Network Architecture (Total No. of neurons; layers?)
Data: Training Data; Testing Data
Evaluation of Test Results
Conclusions + Opinions
YOUR EVALUATION OF THE WHOLE SIMULATION
Max 4 pages
55
ORAL EXAMINATION
Demonstrate your system
Answer questions related to the learning algorithms and network architecture
Questions on potential applications
How do you construct your input representation?
How does your system produce output signals?
How did you choose your test data and training data?
Limitations of neural nets?
56