
ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: TRTRL, Implementation Considerations, Apprenticeship Learning
November 3, 2010
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
Outline
Recap on RNNs
Implementation and usage issues with RTRL
  Computational complexity and resources required
  Vanishing gradient problem
Apprenticeship learning
Recap on RNNs
RNNs are potentially much more powerful than FFNNs
  Can capture temporal dependencies
  Embed a complex state representation (i.e., memory)
  Serve as models of discrete-time dynamic systems
They are (very) complex to train
  TDNN – performance limited by the input window
  RTRL – calculates a dynamic gradient on-line
RTRL reviewed
RTRL is a gradient-descent-based method
It relies on sensitivities

  p^k_{ij}(t+1) = f'_k(net_k(t)) [ \sum_{l \in U \cup I} w_{kl} p^l_{ij}(t) + \delta_{ik} z_j(t) ]

expressing the impact of any weight w_{ij} on the activation of neuron k.
The algorithm then consists of computing the weight changes

  \Delta w_{ij}(t) = \alpha \sum_{k \in U} e_k(t) p^k_{ij}(t)
Let’s look at the resources involved …
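As a concrete picture of what these updates involve, here is a minimal NumPy sketch of one RTRL step for a small fully connected network. The shapes, the logistic activation, and all variable names are my own illustrative choices, not code from the lecture:

```python
import numpy as np

# Minimal RTRL sketch: N recurrent units, M external inputs, logistic activations.
N, M = 8, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(N, N + M))   # w_kl over z = [y, x]
y = np.zeros(N)                              # unit outputs y_k
p = np.zeros((N, N, N + M))                  # sensitivities p^k_ij
alpha = 0.05                                 # learning rate

def rtrl_step(x, target, mask):
    """One time step: forward pass, sensitivity update, weight update."""
    global y, p, W
    z = np.concatenate([y, x])               # z_j: previous outputs + current inputs
    y_new = 1.0 / (1.0 + np.exp(-(W @ z)))
    fprime = y_new * (1.0 - y_new)

    # p^k_ij(t+1) = f'_k(net_k(t)) [ sum_l w_kl p^l_ij(t) + delta_ik z_j(t) ]
    # (input nodes carry zero sensitivity, so the sum effectively runs over units)
    p_new = np.einsum('kl,lij->kij', W[:, :N], p)
    for i in range(N):
        p_new[i, i, :] += z                  # the delta_ik z_j term
    p_new *= fprime[:, None, None]

    # Delta w_ij(t) = alpha * sum_k e_k(t) p^k_ij(t); e_k nonzero only on target units
    e = np.where(mask, target - y_new, 0.0)
    W += alpha * np.einsum('k,kij->ij', e, p_new)

    y, p = y_new, p_new
    return y
```

The sensitivity array p alone is N x N x (N+M); that cubic footprint is exactly the burden examined on the next slides.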
Implementing RTRL – computations involved
The key component in RTRL is the sensitivity matrix
It must be calculated for each neuron
RTRL, however, is NOT local …

  p^k_{ij}(t+1) = f'_k(net_k(t)) [ \sum_{l \in U \cup I} w_{kl} p^l_{ij}(t) + \delta_{ik} z_j(t) ]

There are N^3 sensitivities and each update sums over N terms, so every time
step costs on the order of N^4 operations.
Can the calculations be efficiently distributed?
Implementing RTRL – storage requirements
Let’s assume a fully-connected network of N neurons
Memory resources
  Weight matrix w_{ij}: ~N^2 values
  Activations y_k: ~N values
  Sensitivity matrix: ~N^3 values
  Total memory requirement: O(N^3)
Let's go over an example:
  Assume we have 1000 neurons in the system
  Each value requires 20 bits to represent
  → ~20 Gb (about 2.5 GB) of storage!!
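The helper below just re-does the example's arithmetic; the function name and the 20-bit default are taken from the numbers above, everything else is mine:

```python
# Back-of-the-envelope RTRL storage, mirroring the example above.
def rtrl_storage_bits(n_neurons: int, bits_per_value: int = 20) -> int:
    weights = n_neurons ** 2          # w_ij
    activations = n_neurons           # y_k
    sensitivities = n_neurons ** 3    # p^k_ij dominates everything else
    return (weights + activations + sensitivities) * bits_per_value

bits = rtrl_storage_bits(1000)
print(f"{bits / 1e9:.1f} Gb (~{bits / 8 / 2**30:.2f} GiB)")  # ~20.0 Gb, ~2.3 GiB
```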
Possible solutions – static subgrouping
Zipser et al. (1989) suggested static grouping of neurons
Relaxing the "fully-connected" requirement
  Has backing in neuroscience
  Average "branching factor" in the brain is ~1000
Reduced the complexity by simply leaving out elements of the sensitivity
matrix based upon a subgrouping of neurons
  Neurons are subgrouped arbitrarily
  Sensitivities between groups are ignored
  All connections still exist in the forward path
If g is the number of subgroups, then (see the sketch after this list) …
  Storage is O(N^3/g^2)
  Computational speedup is g^3
  Communications: each node communicates with N/g nodes
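A quick check of those figures; the N=1000, g=10 values below are arbitrary, chosen only to echo the earlier example:

```python
# Resource scaling under static subgrouping into g groups.
def subgrouped_rtrl(n: int, g: int):
    storage = n ** 3 / g ** 2   # retained sensitivity entries: O(N^3 / g^2)
    speedup = g ** 3            # computational speedup over full RTRL
    fan_out = n / g             # nodes each unit exchanges sensitivities with
    return storage, speedup, fan_out

print(subgrouped_rtrl(1000, 10))   # (1e7 entries, 1000x speedup, 100 nodes)
```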
Possible solutions – static subgrouping (cont.)
Zipser’s empirical tests indicate that these networks can solve
many of the problems full RTRL solves
One caveat of the subgrouped RTRL training is that each
subnet must have at least one unit for which a target exists
(since gradient information is not exchanged between groups)
Others have proposed dynamic subgrouping
  Subgrouping based on maximal gradient information
  Not realistic for hardware realization
Open research question: how can the gradient be calculated without the
O(N^3) storage requirement?
Truncated Real Time Recurrent Learning (TRTRL)
Motivation:
To obtain a scalable version of the RTRL algorithm while
minimizing performance degradation
How?
Limit the sensitivities of each neuron to its ingress
(incoming) and egress (outgoing) links
Performing Sensitivity Calculations in TRTRL
For all nodes that are not in the output set, the egress
sensitivity values for node i are calculated by imposing k=j in
the original RTRL sensitivity equation, such that
  p^i_{ji}(t+1) = f'_i(s_i(t)) [ w_{ij} p^j_{ji}(t) + \delta_{ji} y_i(t) ]

Similarly, the ingress sensitivity values for node j are given by

  p^i_{ij}(t+1) = f'_i(s_i(t)) [ w_{ij} p^j_{ij}(t) + z_j(t) ]

For output neurons, a nonzero sensitivity element must exist in
order to update the weights

  p^k_{ij}(t+1) = f'_k(s_k(t)) [ w_{ki} p^i_{ij}(t) + w_{kj} p^j_{ij}(t) + \delta_{ik} z_j(t) ]
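Read this way, each non-output neuron keeps only its ingress and egress sensitivities, updated from purely local quantities. A possible vectorized reading of the two truncated updates above (my own array layout; self-connections are assumed absent and the output-neuron case is omitted):

```python
import numpy as np

# TRTRL sensitivity sketch: p_in[i, j] ~ p^i_ij (ingress), p_eg[i, j] ~ p^i_ji (egress).
N = 12
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(N, N))   # w_ij: weight into unit i from unit j
p_in = np.zeros((N, N))
p_eg = np.zeros((N, N))

def trtrl_sensitivity_step(y, z):
    """y: current activations (logistic units); z: previous outputs/inputs z_j."""
    global p_in, p_eg
    fprime = y * (1.0 - y)
    # p^i_ij(t+1) = f'_i(s_i(t)) [ w_ij p^j_ij(t) + z_j(t) ]
    p_in_new = fprime[:, None] * (W * p_eg.T + z[None, :])
    # p^i_ji(t+1) = f'_i(s_i(t)) [ w_ij p^j_ji(t) + delta_ji y_i(t) ]
    p_eg_new = fprime[:, None] * (W * p_in.T + np.diag(y))
    p_in, p_eg = p_in_new, p_eg_new
```

Note the cross-coupling: each ingress update reads the matching egress sensitivity of the source node and vice versa, and the total state is two N x N arrays, i.e. O(N^2).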
Resource Requirements of TRTRL
The network structure remains the same with TRTRL,
only the calculation of sensitivities is reduced
Significant reduction in resource requirements …
Computational load for each neuron drops from O(N^3) to O(2KN), where K
denotes the number of output neurons
Total computational complexity is now O(2KN^2)
Storage requirements drop from O(N^3) to O(N^2)
Example revisited: for N=100 and 10 outputs → ~100k multiplications and only
~20 kB of storage!
Further TRTRL Improvements – Clustering of Neurons
TRTRL introduced localization and memory improvement
Clustered TRTRL adds scalability by reducing the number
of long connection lines between processing elements
(Figure: clustered network topology, with input and output connections between clusters)
Test case #1: Frequency Doubler
Input: sin(x), target output sin(2x)
Both networks had 12 neurons
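For reference, the training signal for this test is easy to generate; the step size and sequence length below are my own arbitrary choices:

```python
import numpy as np

# Frequency-doubler training sequence: input sin(t), desired output sin(2t).
t = np.arange(0.0, 20.0 * np.pi, 0.1)
x = np.sin(t)        # network input
d = np.sin(2.0 * t)  # target output at double the frequency
```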
Vanishing Gradient Problem
Recap on goals:
  Find temporal dependencies in data with an RNN
  The idea behind RTRL: when an error value is found, apply it to inputs
  seen an indefinite number of epochs ago
In 1994, Bengio et al. showed that both BPTT and RTRL suffer from the
problem of vanishing gradient information
  When using gradient-based training rules, the "error signal" that is
  applied to previous inputs tends to vanish
Because of this, long-term dependencies in the data are often overlooked
  Short-term memory is OK; long-term (>10 epochs) is lost
Vanishing Gradient Problem (cont.)
(Figure: RNN block with input x_t, internal state s_t, and output y_t)

  y_t = f(s_{t-1}, x_t)
A learning error yields gradients on the outputs, and therefore on the state
variables s_t
Since the weights (parameters) are shared across time,

  \frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial s_t}\frac{\partial s_t}{\partial W} = \sum_{\tau \le t} \frac{\partial E_t}{\partial s_t}\frac{\partial s_t}{\partial s_\tau}\frac{\partial s_\tau}{\partial W}

  \frac{\partial s_t}{\partial s_\tau} = \frac{\partial s_t}{\partial s_{t-1}}\frac{\partial s_{t-1}}{\partial s_{t-2}}\cdots\frac{\partial s_{\tau+1}}{\partial s_\tau} = f'_t\, f'_{t-1}\cdots f'_{\tau+1} \to 0 \quad \text{for large } t-\tau
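A tiny numeric illustration of why that product dies out: for logistic units each factor |f'| is at most 0.25, so the product shrinks roughly geometrically with the lag. The scalar state and random pre-activations below are simplifying assumptions:

```python
import numpy as np

# Product of logistic derivatives over increasing lags t - tau.
rng = np.random.default_rng(0)
s = rng.uniform(-2.0, 2.0, size=50)            # pre-activations along a trajectory
fprime = np.exp(-s) / (1.0 + np.exp(-s)) ** 2  # logistic f' at each step (<= 0.25)
for lag in (1, 5, 10, 20, 40):
    print(lag, np.prod(fprime[:lag]))          # decays roughly like 0.25**lag
```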
What is Apprenticeship Learning
Many times we want to train an agent based on a reference controller
  Riding a bicycle
  Flying a plane
Starting from scratch may take a very long time
  Particularly for large state/action spaces
  May cost a lot (e.g. a helicopter crashing)
Process:
  Train the agent on the reference controller
  Evaluate the trained agent
  Improve the trained agent
Note: the reference controller can be anything (e.g. a heuristic controller
for the Car Race problem)
Formalizing Apprenticeship Learning
Let's assume we have a reference policy π from which we want our agent to
learn
We would first like to learn the (approximate) value function V^π
Once we have V^π, we can try to improve it based on the policy improvement
theorem, i.e.

  \pi'(s) = \arg\max_a Q^\pi(s, a) \;\Rightarrow\; V^{\pi'}(s) \ge V^{\pi}(s)

By acting greedily with respect to the original policy we obtain a better
policy!
In practice, many issues must be considered, such as state-space coverage
and exploration/exploitation
  Train with zero exploration, then explore gradually …
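A minimal tabular sketch of that greedy improvement step; the array shapes and the epsilon-greedy wrapper are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def improved_policy(Q: np.ndarray) -> np.ndarray:
    """Greedy improvement: pi'(s) = argmax_a Q^pi(s, a); Q has shape (n_states, n_actions)."""
    return np.argmax(Q, axis=1)

def act(Q: np.ndarray, state: int, epsilon: float,
        rng: np.random.Generator = np.random.default_rng()) -> int:
    """Epsilon-greedy execution: train with epsilon near zero, then explore gradually."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))
```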