
WK5 – Dynamic Networks:
Time Delayed & Recurrent Networks
Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete
Spring Semester, 2009
CS 476: Networks of Neural Computation, CSD, UOC, 2009
Contents
•Sequence Learning
•Time Delayed Networks I: Implicit Representation
•Time Delayed Networks II: Explicit Representation
•Recurrent Networks I: Elman + Jordan Networks
•Recurrent Networks II: Back Propagation Through Time
•Conclusions
Sequence Learning
•MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of pairs;
•Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.
•Typically the input signal to a dynamic network is an element of the sequence, and the network then produces as a response the rest of the sequence.
•To learn sequences we need to include some form of memory (short-term memory) in the network.
Sequence Learning II
•We can introduce memory effects in two principal ways:
•Implicit: e.g. a time-lagged signal as input to a static network, or recurrent connections
•Explicit: e.g. the Temporal Backpropagation method
•In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. In the explicit form the environment can be non-stationary, i.e. the network can track changes in the structure of the signal.
Time Delayed Networks I
•The time delayed approach includes two basic types of networks:
•Implicit Representation of Time: We combine a memory structure in the input layer of the network with a static network model.
•Explicit Representation of Time: We explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).
•Typical forms of memories that are used are the Tapped Delay Line and the Gamma Memory family.
Time Delayed Networks I
•The Tapped Delay Line form of memory is shown below for an input signal x(n):
[Figure: tapped delay line memory for x(n)]
•The Gamma form of memory is defined by:
g_p(n) = \binom{n-1}{p-1} \mu^p (1-\mu)^{n-p}, \quad n \ge p
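The two memory structures just named can be made concrete in a few lines of code. The sketch below is an illustration, not part of the lecture: it builds the state vector of a tapped delay line of depth p and evaluates the gamma memory kernel g_p(n) = C(n-1, p-1) μ^p (1-μ)^(n-p); the signal values and the parameter μ = 0.6 are arbitrary choices.

```python
# Minimal sketch (assumption, not from the slides): the two memory structures
# above, for discrete time n = 0, 1, 2, ... and a memory parameter mu in (0, 1).
from math import comb

def tapped_delay_line(x, p):
    """Return the state vector [x(n), x(n-1), ..., x(n-p)] for every n,
    padding with zeros before the start of the signal."""
    return [[x[n - l] if n - l >= 0 else 0.0 for l in range(p + 1)]
            for n in range(len(x))]

def gamma_kernel(n, p, mu):
    """Gamma memory kernel g_p(n) = C(n-1, p-1) * mu**p * (1-mu)**(n-p), n >= p."""
    if n < p:
        return 0.0
    return comb(n - 1, p - 1) * mu**p * (1 - mu)**(n - p)

# Example: depth-3 taps of a short signal and a few gamma kernel values.
signal = [1.0, 0.5, -0.2, 0.7, 0.1]
print(tapped_delay_line(signal, p=3)[4])        # [0.1, 0.7, -0.2, 0.5]
print([round(gamma_kernel(n, p=2, mu=0.6), 4) for n in range(2, 7)])
```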
Time Delayed Networks I
•The Gamma Memory is shown below:
[Figure: gamma memory structure]
Time Delayed Networks I
•In the implicit representation approach we combine a static network (e.g. MLP / RBF) with a memory structure (e.g. tapped delay line).
•An example is shown below (the NETtalk network):
[Figure: the NETtalk network]
Time Delayed Networks I
•We present the data in a sliding window. For example, in NETtalk the middle group of input neurons presents the letter in focus. The rest of the input groups, three before & three after, present context.
•The purpose is to predict, for example, the next symbol in the sequence.
•The NETtalk model (Sejnowski & Rosenberg, 1987) has:
•203 input nodes
•80 hidden neurons
•26 output neurons
•18629 weights
•It used the BP method for training.
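As an illustration of the sliding-window presentation (a hedged sketch, not the original NETtalk code), the function below builds 7-symbol windows with three context symbols on each side of the letter in focus; the padding symbol and the choice of target are assumptions made for the example.

```python
# Minimal sketch (assumption): sliding windows over a symbol sequence, with
# three symbols of context on each side of the letter in focus.
def sliding_windows(sequence, left=3, right=3, pad="_"):
    """Return (window, focus) pairs. A NETtalk-style network would map the
    one-hot coded window to the phoneme of the focus letter (or, for
    next-symbol prediction, to the symbol that follows it)."""
    padded = [pad] * left + list(sequence) + [pad] * right
    pairs = []
    for i in range(left, left + len(sequence)):
        window = padded[i - left:i + right + 1]   # 7 symbols in total
        pairs.append((window, padded[i]))
    return pairs

for window, focus in sliding_windows("hello")[:3]:
    print(window, "-> focus:", focus)
```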
Time Delayed Networks II
•In explicit time representation, neurons have a spatio-temporal structure, i.e. a synapse arriving at a neuron is not a scalar number but a vector of weights, which is used for convolution of the time-delayed input signal of a previous neuron with the synapse.
•A schematic representation of a neuron is given below:
[Figure: a neuron with vector (FIR) synapses]
Time Delayed Networks II
•The output of the neuron in this case is given by:
y_j(n) = \varphi\left( \sum_{l=0}^{p} w_j(l) \, x(n-l) + b_j \right)
•In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:
y(n) = \sum_{j=1}^{m_1} w_j y_j(n) + b_0 = \sum_{j=1}^{m_1} w_j \, \varphi\left( \sum_{l=0}^{p} w_j(l) \, x(n-l) + b_j \right) + b_0
Where p is the depth of the memory and b_0 is the bias of the output neuron.
Time Delayed Networks II
•In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer:
y_j(n) = \varphi\left( \sum_{i=1}^{m_0} \sum_{l=0}^{p} w_{ji}(l) \, x_i(n-l) + b_j \right)
•The output of any synapse is given by the convolution sum:
s_{ji}(n) = \mathbf{w}_{ji}^{T} \mathbf{x}_i(n) = \sum_{l=0}^{p} w_{ji}(l) \, x_i(n-l)
Time Delayed Networks II
Where the state vector x_i(n) and weight vector w_ji for synapse i are defined as follows:
•x_i(n) = [x_i(n), x_i(n-1), …, x_i(n-p)]^T
•w_ji = [w_ji(0), w_ji(1), …, w_ji(p)]^T
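A minimal sketch of the definitions above (an illustration, not the lecture's code): each synapse i of neuron j holds a weight vector w_ji, its output is the convolution sum s_ji(n), and the neuron output is the activation applied to the sum of the synaptic outputs plus the bias. The dimensions, random values and the tanh activation are assumptions.

```python
# Minimal sketch (assumption): a neuron with FIR synapses, computing
# s_ji(n) = sum_l w_ji(l) x_i(n-l) and y_j(n) = phi(sum_i s_ji(n) + b_j).
import numpy as np

def fir_neuron_output(W, X, b, phi=np.tanh):
    """W: (m0, p+1) weight vectors w_ji, one row per input synapse i.
    X: (m0, p+1) state vectors x_i(n) = [x_i(n), ..., x_i(n-p)].
    Returns y_j(n)."""
    s = np.sum(W * X, axis=1)          # convolution sum per synapse
    return phi(np.sum(s) + b)

rng = np.random.default_rng(0)
m0, p = 4, 3                            # 4 input synapses, memory depth 3
W = rng.normal(size=(m0, p + 1))
X = rng.normal(size=(m0, p + 1))        # delayed inputs at time n
print(fir_neuron_output(W, X, b=0.1))
```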
Time Delayed Networks II: Learning Law
•To train such a network we use the Temporal BackPropagation algorithm. We present the algorithm below.
•Assume that neuron j lies in the output layer and its response is denoted by y_j(n) at time n, while its desired response is given by d_j(n).
•We can define an instantaneous value for the sum of squared errors produced by the network as follows:
E(n) = \frac{1}{2} \sum_j e_j^2(n)
Time Delayed Networks II: Learning Law-1
•The error signal at the output layer is defined by:
e_j(n) = d_j(n) - y_j(n)
•The idea is to minimise an overall cost function, calculated over all time:
E_{total} = \sum_n E(n)
•We could proceed as usual by calculating the gradient of the cost function over the weights. This implies that we need to calculate the instantaneous gradient:
\frac{\partial E_{total}}{\partial \mathbf{w}_{ji}} = \sum_n \frac{\partial E(n)}{\partial \mathbf{w}_{ji}}
Time Delayed Networks II: Learning Law-2
•However, for this approach to work we need to unfold the network in time (i.e. to convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages:
•A loss of symmetry between the forward and backward passes for the calculation of the instantaneous gradient;
•No nice recursive formula for the propagation of error terms;
•A need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network.
Time Delayed Networks II: Learning Law-3
•For these reasons we prefer to calculate the gradient of the cost function as follows:
\frac{\partial E_{total}}{\partial \mathbf{w}_{ji}} = \sum_n \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}}
•Note that in general:
\frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}} \neq \frac{\partial E(n)}{\partial \mathbf{w}_{ji}}
•The equality holds only when we take the sum over all time.
Time Delayed Networks II: Learning Law-4
•To calculate the weight update we use the steepest descent method:
\mathbf{w}_{ji}(n+1) = \mathbf{w}_{ji}(n) - \eta \, \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}(n)}
Where \eta is the learning rate.
•We calculate the terms in the above relation as follows:
\frac{\partial v_j(n)}{\partial \mathbf{w}_{ji}(n)} = \mathbf{x}_i(n)
This follows from the definition of the induced local field v_j(n).
Time Delayed Networks II: Learning Law-5
•We define the local gradient as:
\delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)}
•Thus we can write the weight update equation in the familiar form:
\mathbf{w}_{ji}(n+1) = \mathbf{w}_{ji}(n) + \eta \, \delta_j(n) \, \mathbf{x}_i(n)
•We need to calculate the \delta for the cases of the output and hidden layers.
Time Delayed Networks II: Learning Law-6
•For the output layer the local gradient is given by:
\delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)} = - \frac{\partial E(n)}{\partial v_j(n)} = e_j(n) \, \varphi'(v_j(n))
•For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:
\delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)} = - \sum_{r \in A} \sum_k \frac{\partial E_{total}}{\partial v_r(k)} \frac{\partial v_r(k)}{\partial v_j(n)}
Time Delayed Networks II: Learning Law-7
•By re-writing we get the following:
\delta_j(n) = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)}
•Finally, putting it all together we get:
\delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{k=n}^{n+p} \delta_r(k) \, w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{l=0}^{p} \delta_r(n+l) \, w_{rj}(l)
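The final formula can be exercised numerically. The sketch below is illustrative only, with made-up numbers: it computes the hidden-layer local gradient δ_j(n) = φ'(v_j(n)) Σ_r Σ_l δ_r(n+l) w_rj(l) and then applies the update w_ji(n+1) = w_ji(n) + η δ_j(n) x_i(n); tanh and its derivative are assumed as the activation.

```python
# Minimal sketch (assumption): hidden-layer local gradient of temporal
# backpropagation, followed by the steepest-descent weight update.
import numpy as np

def hidden_delta(v_j_n, deltas_next, W_rj, phi_prime):
    """deltas_next: (|A|, p+1) array holding delta_r(n), ..., delta_r(n+p).
    W_rj: (|A|, p+1) FIR weights w_rj(l) from neuron j to each neuron r in A."""
    return phi_prime(v_j_n) * np.sum(deltas_next * W_rj)

eta = 0.05
phi_prime = lambda v: 1.0 - np.tanh(v) ** 2                      # tanh derivative
deltas_next = np.array([[0.2, -0.1, 0.05], [0.0, 0.3, -0.2]])    # |A| = 2, p = 2
W_rj = np.array([[0.4, 0.1, -0.3], [0.2, -0.5, 0.6]])
delta_j = hidden_delta(v_j_n=0.7, deltas_next=deltas_next,
                       W_rj=W_rj, phi_prime=phi_prime)

w_ji = np.array([0.3, -0.2, 0.1])                # FIR weight vector w_ji
x_i_n = np.array([0.5, 0.4, -0.1])               # state vector x_i(n)
w_ji = w_ji + eta * delta_j * x_i_n              # w_ji(n+1) = w_ji(n) + eta*delta*x
print(delta_j, w_ji)
```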
Time Delayed Networks II: Learning Law-8
Where l is the layer level
Contents
Sequences
Time Delayed I
Time Delayed II
Recurrent I
Recurrent II
Conclusions
CS 476: Networks of Neural Computation, CSD, UOC, 2009
Recurrent I
•A network is called recurrent when there are connections which feed back to previous layers or neurons, including self-connections. An example is shown next:
[Figure: a recurrent network with feedback connections]
•Successful early models of recurrent networks are:
•The Jordan Network
•The Elman Network
Recurrent I
•The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context neurons in a 1-to-1 fashion. The context units also feed back to themselves.
•The network is trained by using the Backpropagation algorithm.
•A schematic is shown in the next figure:
[Figure: the Jordan network]
Recurrent I
•The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context neurons in a 1-to-1 fashion. The hidden neurons' connections to the context units are constant and equal to 1. It is also called the Simple Recurrent Network (SRN).
•The network is trained by using the Backpropagation algorithm.
•A schematic is shown in the next figure:
[Figure: the Elman network (SRN)]
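A minimal sketch of one forward step of an Elman network (an illustration under assumptions, not the lecture's code): the context vector is simply a copy of the previous hidden activations, reflecting the fixed 1-to-1 connections of weight 1 described above; training itself would use standard backpropagation as stated. The sizes, random weights and the tanh/linear activations are assumptions.

```python
# Minimal sketch (assumption): one forward step of an Elman network (SRN).
import numpy as np

def elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o):
    h = np.tanh(W_in @ u + W_ctx @ context + b_h)   # hidden layer
    y = W_out @ h + b_o                             # (linear) output layer
    return y, h                                     # h becomes the next context

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
W_in = rng.normal(scale=0.3, size=(n_hid, n_in))
W_ctx = rng.normal(scale=0.3, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.3, size=(n_out, n_hid))
b_h, b_o = np.zeros(n_hid), np.zeros(n_out)

context = np.zeros(n_hid)                           # empty context at n = 0
for u in rng.normal(size=(4, n_in)):                # a short input sequence
    y, context = elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o)
print(y)
```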
Recurrent II
•More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.
•Typical paradigms of complex recurrent models are:
•The Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
•The State Space Model
•The Recurrent Multilayer Perceptron (RMLP)
•Schematic representations of the networks are given in the next slides:
Recurrent II-1
•The structure of the NARX model includes:
•A static MLP network;
•A current input u(n) and its delayed versions up to a time q;
•A time delayed version of the current output y(n) which feeds back to the input layer. The memory of the delayed output vector is in general p.
•The output is calculated as:
y(n+1) = F(y(n), …, y(n-p+1), u(n), …, u(n-q+1))
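A sketch of the NARX recursion above (illustrative only; the network F, its size, the random weights and the input sequence are assumptions): a small static MLP is applied to the concatenation of the p delayed outputs and q delayed inputs, and its prediction is fed back into the output delay line.

```python
# Minimal sketch (assumption): one NARX step,
# y(n+1) = F(y(n), ..., y(n-p+1), u(n), ..., u(n-q+1)).
import numpy as np
from collections import deque

def mlp_F(z, W1, b1, w2, b2):
    return float(w2 @ np.tanh(W1 @ z + b1) + b2)     # single linear output

p, q = 2, 3                                          # output / input memory depths
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.3, size=(6, p + q)); b1 = np.zeros(6)
w2 = rng.normal(scale=0.3, size=6); b2 = 0.0

y_taps = deque([0.0] * p, maxlen=p)                  # y(n), ..., y(n-p+1)
u_taps = deque([0.0] * q, maxlen=q)                  # u(n), ..., u(n-q+1)
for u_n in [0.5, -0.1, 0.8, 0.3]:
    u_taps.appendleft(u_n)
    y_next = mlp_F(np.array(list(y_taps) + list(u_taps)), W1, b1, w2, b2)
    y_taps.appendleft(y_next)                        # feed the output back
print(y_next)
```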
Recurrent II-2
•A schematic of the NARX model is as follows:
[Figure: the NARX model]
Recurrent II-3
•The structure of the State Space model includes:
•An MLP network with a single hidden layer;
•The hidden neurons define the state of the network;
•A linear output layer;
•A feedback of the hidden layer to the input layer, assuming a memory of q lags;
•The output is determined by the coupled equations:
x(n+1) = f(x(n), u(n))
y(n+1) = C x(n+1)
Recurrent II-4
Where f is a suitable nonlinear function characterising the hidden layer. x is the state vector, as produced by the hidden layer; it has q components. y is the output vector and it has p components. The input vector is given by u and it has m components.
•A schematic representation of the network is given below:
[Figure: the state space model]
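The coupled state-space equations can be sketched directly (an illustration under assumptions: a tanh hidden layer for f, random weights, and the dimensions q, m, p chosen arbitrarily):

```python
# Minimal sketch (assumption): x(n+1) = f(x(n), u(n)), y(n+1) = C x(n+1),
# with f realised by a single tanh hidden layer of q state units.
import numpy as np

q, m, p = 4, 3, 2
rng = np.random.default_rng(3)
Wa = rng.normal(scale=0.3, size=(q, q))   # state-to-hidden weights
Wb = rng.normal(scale=0.3, size=(q, m))   # input-to-hidden weights
C = rng.normal(scale=0.3, size=(p, q))    # linear read-out

def step(x, u):
    x_next = np.tanh(Wa @ x + Wb @ u)     # f(x(n), u(n))
    return x_next, C @ x_next             # x(n+1), y(n+1)

x = np.zeros(q)
for u in rng.normal(size=(5, m)):
    x, y = step(x, u)
print(x, y)
```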
Recurrent II-5
•The structure of the RMLP includes:
•One or more hidden layers;
•Feedback around each layer;
•The general structure of a static MLP network;
•The output is calculated as follows (assuming that x_I, x_II and x_O are the first, second and output layer outputs):
x_I(n+1) = φ_I(x_I(n), u(n))
x_II(n+1) = φ_II(x_II(n), x_I(n+1))
x_O(n+1) = φ_O(x_O(n), x_II(n+1))
Recurrent II-6
Where the functions φ_I(•), φ_II(•) and φ_O(•) denote the activation functions of the corresponding layers.
•A schematic representation is given below:
[Figure: the RMLP]
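A sketch of one RMLP step with two hidden layers (illustrative only; the tanh activations and the particular additive form of the feedback are assumptions about how φ_I, φ_II and φ_O might be realised):

```python
# Minimal sketch (assumption): one step of a two-hidden-layer RMLP with
# feedback around each layer, following the three coupled equations above.
import numpy as np

rng = np.random.default_rng(4)
n_u, n1, n2, n_o = 3, 5, 4, 2
A1, B1 = rng.normal(scale=0.3, size=(n1, n1)), rng.normal(scale=0.3, size=(n1, n_u))
A2, B2 = rng.normal(scale=0.3, size=(n2, n2)), rng.normal(scale=0.3, size=(n2, n1))
Ao, Bo = rng.normal(scale=0.3, size=(n_o, n_o)), rng.normal(scale=0.3, size=(n_o, n2))

def rmlp_step(xI, xII, xO, u):
    xI_next = np.tanh(A1 @ xI + B1 @ u)           # phi_I(x_I(n), u(n))
    xII_next = np.tanh(A2 @ xII + B2 @ xI_next)   # phi_II(x_II(n), x_I(n+1))
    xO_next = np.tanh(Ao @ xO + Bo @ xII_next)    # phi_O(x_O(n), x_II(n+1))
    return xI_next, xII_next, xO_next

xI, xII, xO = np.zeros(n1), np.zeros(n2), np.zeros(n_o)
for u in rng.normal(size=(4, n_u)):
    xI, xII, xO = rmlp_step(xI, xII, xO, u)
print(xO)
```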
Recurrent II-7
•Some theorems on the computational power of recurrent networks:
•Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.
•Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.
Recurrent II-8
•Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.
Recurrent II-9
•The training of recurrent networks can be done with two methods:
•BackPropagation Through Time
•Real-Time Recurrent Learning
•We can train a recurrent network with either epoch-based or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns but rather denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network.
•At the end of an epoch the network stabilises.
Recurrent II-10
•Some useful heuristics for the training are given below:
•Lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;
•The training should begin with a small training sample and then its size should be incrementally increased as the training proceeds;
•The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion;
Recurrent II-11
•The use of weight decay during training is recommended; weight decay was discussed in WK3.
•The BackPropagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:
•Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.
•Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.
Recurrent II-12
•The network N* is related to the original network N as follows:
•For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N;
•In every layer of network N* there is a copy of each neuron in network N;
•For every time step l ∈ [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.
•The following example explains the idea of unfolding:
Recurrent II-13
•We assume that we have a network with two neurons which is unfolded for a number of steps, n:
[Figure: a two-neuron recurrent network unfolded in time]
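The unfolding idea can be mimicked in code (a sketch, not the lecture's example): each time step becomes one "layer" of the two-neuron network, and every layer reuses a copy of the same recurrent weights W, exactly as described in the previous slide. The weight values and the tanh activation are assumptions.

```python
# Minimal sketch (assumption): unfolding a two-neuron recurrent network in time.
import numpy as np

W = np.array([[0.5, -0.3],    # recurrent weights (copied into every layer)
              [0.8,  0.2]])
Win = np.array([0.7, -0.4])   # input weights of the two neurons

def unfold_forward(u_seq, x0=np.zeros(2)):
    """Run the unfolded network; return the state of every layer (time step)."""
    states = [x0]
    for u in u_seq:
        states.append(np.tanh(W @ states[-1] + Win * u))
    return states                     # layer k holds the state at time n0 + k

for k, x in enumerate(unfold_forward([1.0, 0.0, -0.5, 0.3])):
    print(f"layer {k}: {x}")
```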
Recurrent II-14
•We present now the method of Epochwise BackPropagation Through Time.
•Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time.
•We can define the following cost function:
E_{total}(n_0, n_1) = \frac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in A} e_j^2(n)
Recurrent II-15
Where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and e_j(n) is the error signal at the output of such a neuron, measured with respect to some desired response.
Recurrent II-16
•The algorithm proceeds as follows:
1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training. Rather, what is important is for the initial state of the new epoch to be different from the state reached by the network at the end of the previous epoch;
Recurrent II-17
2. First a single forward pass of the data through the network for the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;
3. A single backward pass over this past record is performed to compute the values of the local gradients:
\delta_j(n) = - \frac{\partial E_{total}(n_0, n_1)}{\partial v_j(n)}
for all j ∈ A and n0 < n ≤ n1. This computation is performed by the formula:
Recurrent II-18
\delta_j(n) =
\begin{cases}
\varphi'(v_j(n)) \, e_j(n) & \text{for } n = n_1 \\
\varphi'(v_j(n)) \left[ e_j(n) + \sum_{k \in A} w_{kj} \, \delta_k(n+1) \right] & \text{for } n_0 < n < n_1
\end{cases}
Where φ'(•) is the derivative of an activation function with respect to its argument, and v_j(n) is the induced local field of neuron j.
The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.
Recurrent II-19
4. Once the computation of back-propagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight w_ji of neuron j:
\Delta w_{ji} = - \eta \, \frac{\partial E_{total}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n) \, x_i(n-1)
Where η is the learning rate parameter and x_i(n-1) is the input applied to the ith synapse of neuron j at time n-1.
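Putting steps 2 to 4 together, the sketch below (an illustration under assumptions, not the lecture's code) runs epochwise BPTT on a small fully recurrent network in which every neuron has a desired response (A = all neurons): a forward pass over one epoch is recorded, the deltas are computed backwards from n1 to n0+1 with the two-case formula, and the accumulated adjustment Δw_ji = η Σ_n δ_j(n) x_i(n-1) is applied. The external input weights Win, the tanh activation and all numerical values are assumptions.

```python
# Minimal sketch (assumption): epochwise BPTT for a fully recurrent network,
# with v(n) = W x(n-1) + Win u(n) and x(n) = tanh(v(n)).
import numpy as np

def epochwise_bptt(W, Win, u_seq, d_seq, x0, eta=0.05):
    T = len(u_seq)                                   # times n0+1, ..., n1
    # forward pass over the epoch, saving the complete record
    xs, vs = [x0], []
    for n in range(T):
        v = W @ xs[-1] + Win @ u_seq[n]
        vs.append(v)
        xs.append(np.tanh(v))
    # backward pass: deltas from n1 back to n0+1
    deltas = [None] * T
    for n in reversed(range(T)):
        e = d_seq[n] - xs[n + 1]                     # error signal e_j(n)
        back = W.T @ deltas[n + 1] if n < T - 1 else 0.0   # sum_k w_kj delta_k(n+1)
        deltas[n] = (1.0 - np.tanh(vs[n]) ** 2) * (e + back)
    # weight adjustment accumulated over the whole epoch
    dW = eta * sum(np.outer(deltas[n], xs[n]) for n in range(T))
    dWin = eta * sum(np.outer(deltas[n], u_seq[n]) for n in range(T))
    return W + dW, Win + dWin

rng = np.random.default_rng(5)
K, m = 3, 2                                          # 3 neurons, 2 external inputs
W = rng.normal(scale=0.3, size=(K, K))
Win = rng.normal(scale=0.3, size=(K, m))
u_seq = rng.normal(size=(6, m))                      # one epoch = one sequence
d_seq = rng.normal(size=(6, K))
W, Win = epochwise_bptt(W, Win, u_seq, d_seq, x0=np.zeros(K))
```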
Recurrent II-20
•There is a potential problem with the method, called the Vanishing Gradients Problem, i.e. the corrections calculated for the weights are not large enough when using methods based on steepest descent.
•However, this is currently a research problem and one has to consult the literature for details.
Conclusions
•Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.
•Time representation takes place explicitly or implicitly.
•The implicit form includes time-delayed versions of the input vector followed by the use of a static network model, or the use of recurrent networks.
•The explicit form uses a generalisation of the MLP model where a synapse is now modelled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron, but rather the inner product of the synapse's weight vector with the time-delayed state vector of the previous neuron.
Conclusions I
•The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm.
•The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm.
•In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.