
An Introduction to Reinforcement
Learning (Part 1)
Jeremy Wyatt
Intelligent Robotics Lab
School of Computer Science
University of Birmingham
[email protected]
www.cs.bham.ac.uk/~jlw
www.cs.bham.ac.uk/research/robotics
What is Reinforcement Learning (RL)?
• Learning from punishments and rewards
• Agent moves through world, observing states and rewards
• Adapts its behaviour to maximise some function of
reward
[Figure: an example trajectory of states s1, s2, ..., actions a1, a2, ..., and rewards, with r1 = +3, r4 = -1, r5 = -1, r9 = +50.]
Return: A long-term measure of performance
• Let's assume our agent acts according to some rules, called a policy, π
• The return $R_t$ is a measure of the long-term reward collected after time t
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, \qquad 0 \le \gamma \le 1$
[Figure: the same example trajectory, with rewards $r_1 = +3$, $r_4 = -1$, $r_5 = -1$, $r_9 = +50$.]
$R_0 = 3 + \gamma^3(-1) + \gamma^4(-1) + \dots + \gamma^8(50)$
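To make the return concrete, here is a minimal Python sketch (the helper name `discounted_return` and the choice of γ = 0.9 are my own, not from the slides) that evaluates the sum above for a finite reward sequence:

```python
# Discounted return: R_t = sum_{k>=0} gamma^k * r_{t+1+k}, truncated to a finite episode.
def discounted_return(rewards, gamma):
    """rewards[k] holds r_{t+1+k}; gamma is the discount factor, 0 <= gamma <= 1."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (gamma ** k) * r
    return R

# The example trajectory above: r1 = +3, r4 = r5 = -1, r9 = +50, all other rewards 0.
rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]
print(discounted_return(rewards, gamma=0.9))
# = 3 + 0.9**3 * (-1) + 0.9**4 * (-1) + 0.9**8 * 50  (approx. 23.14)
```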
Value = Utility = Expected Return
• $R_t$ is a random variable:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}$
• So it has an expected value in a state under a given policy:
$V^{\pi}(s_t) = E\{R_t \mid s_t, \pi\} = E\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+1+k} \,\Big|\, s_t, \pi\Big\}$
• The RL problem is to find an optimal policy $\pi^*$ that maximises the expected value in every state
$\pi(s, a) = \Pr(A = a \mid S = s)$
Markov Decision Processes (MDPs)
• The transitions between states are uncertain
• The probabilities depend only on the current state
[Figure: a two-state MDP with states 1 and 2, actions a1 and a2, reward r = 0 in state 1 and r = 2 in state 2, and transition probabilities $p^{a}_{ss'}$ labelling the arcs.]
$p^{a}_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
$P = \begin{pmatrix} p^{1}_{11} & p^{1}_{12} \\ p^{2}_{11} & p^{2}_{12} \end{pmatrix}, \qquad R = \begin{pmatrix} 0 \\ 2 \end{pmatrix}$
• Transition matrix P and reward function R
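To make the tabular representation concrete, here is a minimal Python sketch of a two-state, two-action MDP in this form. The particular probability values are invented for illustration, and the convention that the reward is received on entering the next state is one possible reading of the diagram; only the shapes (one transition matrix per action, one reward per state) follow the slide.

```python
import numpy as np

# A toy two-state MDP (all probability values invented for illustration).
# P[a][s, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a); each row sums to 1.
# R[s'] = reward received on entering state s' (r = 0 for state 1, r = 2 for state 2).
P = {
    0: np.array([[0.9, 0.1],
                 [0.2, 0.8]]),   # action a1
    1: np.array([[0.3, 0.7],
                 [0.6, 0.4]]),   # action a2
}
R = np.array([0.0, 2.0])

def step(state, action, rng=np.random.default_rng()):
    """Sample one transition: (s, a) -> (r_{t+1}, s_{t+1})."""
    next_state = rng.choice(2, p=P[action][state])
    return R[next_state], next_state

print(step(0, 1))   # e.g. (2.0, 1) with probability 0.7
```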
Summary of the story so far…
• Some key elements of RL problems:
– A class of sequential decision making problems
– We want $\pi^*$
– Performance metric, short term and long term:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}$
$V^{\pi}(s_t) = E\{R_t \mid s_t, \pi\} = E\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+1+k} \,\Big|\, s_t, \pi\Big\}$
[Figure: trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \dots$]
Summary of the story so far…
• Some common elements of RL solutions
– Exploit conditional independence
– Randomised interaction
Bellman equations
• Conditional independence allows us to define expected return in
terms of a recurrence relation:
$V^{\pi}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \{ R^{\pi(s)}_{ss'} + \gamma V^{\pi}(s') \}$
where
$p^{\pi(s)}_{ss'} = \Pr(s' \mid s, \pi(s))$
and
$R^{\pi(s)}_{ss'} = E\{ r_{t+1} \mid s_{t+1} = s', s_t = s, \pi \}$

$Q^{*}(s, a) = \sum_{s' \in S} p^{a}_{ss'} \{ R^{a}_{ss'} + \gamma V^{*}(s') \}$
where
$V^{*}(s) = \max_{a \in A} Q^{*}(s, a)$
Two types of bootstrapping
• We can bootstrap using explicit knowledge of P and R
(Dynamic Programming)
$\hat{V}^{\pi}_{n}(s) = \sum_{s' \in S} p^{\pi(s)}_{ss'} \{ R^{\pi(s)}_{ss'} + \gamma \hat{V}^{\pi}_{n-1}(s') \}$
• Or we can bootstrap using samples from P and R
(Temporal Difference learning)
$\hat{V}^{\pi}_{t+1}(s_t) = \hat{V}^{\pi}_{t}(s_t) + \alpha \big( r_{t+1} + \gamma \hat{V}^{\pi}_{t}(s_{t+1}) - \hat{V}^{\pi}_{t}(s_t) \big)$
[Figure: a single observed transition $s_t, a_t = \pi(s_t), r_{t+1}, s_{t+1}$.]
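As a sketch of the first kind of bootstrapping, the following Python loop performs repeated DP backups of $\hat{V}$ for a fixed policy on a toy two-state problem; the arrays `P_pi` and `R_pi` (transition probabilities and expected rewards under the policy) and γ = 0.9 are invented for illustration. The sampled form of the same backup is sketched in code after the TD(0) algorithm below.

```python
import numpy as np

# Toy two-state problem under a fixed policy pi (all numbers invented for illustration):
# P_pi[s, s'] = Pr(s' | s, pi(s)),  R_pi[s, s'] = expected reward for the transition s -> s'.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([[0.0, 2.0],
                 [0.0, 2.0]])
gamma = 0.9

# One DP backup: V_n(s) = sum_{s'} P_pi[s, s'] * (R_pi[s, s'] + gamma * V_{n-1}(s'))
def dp_sweep(V):
    return (P_pi * (R_pi + gamma * V[None, :])).sum(axis=1)

V = np.zeros(2)            # arbitrary initial estimate
for _ in range(100):       # repeated sweeps converge to V^pi
    V = dp_sweep(V)
print(V)
```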
TD(0) learning
t := 0
π is the policy to be evaluated
Initialise $\hat{V}^{\pi}_t(s)$ arbitrarily for all $s \in S$
Repeat
    select an action $a_t$ from $\pi(s_t)$
    observe the transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_{t+1}$
    update $\hat{V}^{\pi}(s_t)$ according to
        $\hat{V}^{\pi}_{t+1}(s_t) = \hat{V}^{\pi}_{t}(s_t) + \alpha \big( r_{t+1} + \gamma \hat{V}^{\pi}_{t}(s_{t+1}) - \hat{V}^{\pi}_{t}(s_t) \big)$
    t := t + 1
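A hedged Python rendering of the TD(0) loop above. The environment interface (`env.reset()`, `env.step(a)` returning next state, reward and a done flag) and the callable `policy` are assumptions of this sketch, not part of the slide.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_steps, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation.

    Assumed interface: env.reset() -> state, env.step(a) -> (next_state, reward, done),
    policy(state) -> action (the fixed policy pi being evaluated).
    """
    V = defaultdict(float)                 # V_hat, initialised arbitrarily (here to 0)
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)                      # select a_t from pi(s_t)
        s_next, r, done = env.step(a)      # observe s_t --a_t--> s_{t+1}, r_{t+1}
        # V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t))
        V[s] += alpha * (r + gamma * (0.0 if done else V[s_next]) - V[s])
        s = env.reset() if done else s_next
    return V
```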
On and Off policy learning
• On-policy: evaluate the policy you are following, e.g. TD learning
$\hat{V}^{\pi}_{t+1}(s_t) = \hat{V}^{\pi}_{t}(s_t) + \alpha \big( r_{t+1} + \gamma \hat{V}^{\pi}_{t}(s_{t+1}) - \hat{V}^{\pi}_{t}(s_t) \big)$
• Off-policy: evaluate one policy while following another policy
• E.g. one-step Q-learning
$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_{t}(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_{t}(s_{t+1}, b) - \hat{Q}_{t}(s_t, a_t) \big)$
Off policy learning of control
• Q-learning is powerful because
– it allows us to evaluate $\pi^*$
– while taking non-greedy actions (explore)
• ε-greedy is a simple and popular exploration rule:
  • take the greedy action with probability 1 - ε
  • take a random action with probability ε
• Q-learning is guaranteed to converge for MDPs (with the right exploration policy)
• Is there a way of finding $\pi^*$ with an on-policy learner?
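A minimal Python sketch of one-step Q-learning with ε-greedy exploration. The environment interface, the action list, and the values of α, γ and ε are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_steps, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy one-step Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)                 # Q_hat(s, a), initialised to 0
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy: random action with probability epsilon, otherwise greedy
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r, done = env.step(a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a))
        target = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * target - Q[(s, a)])
        s = env.reset() if done else s_next
    return Q
```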
On policy learning of control: Sarsa
• Discard the max operator
• Learn about the policy you are following
$\hat{Q}^{\pi}_{t+1}(s_t, a_t) = \hat{Q}^{\pi}_{t}(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \hat{Q}^{\pi}_{t}(s_{t+1}, a_{t+1}) - \hat{Q}^{\pi}_{t}(s_t, a_t) \big)$
• Change the policy gradually
• Guaranteed to converge for Greedy in the Limit with Infinite Exploration (GLIE) policies
On policy learning of control: Sarsa
t := 0
Initialise $\hat{Q}^{\pi}_t(s, a)$ arbitrarily for all $s \in S$, $a \in A$
select an action $a_t$ from explore($\hat{Q}^{\pi}_t(s_t, a)$)
Repeat
    observe the transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_{t+1}$
    select an action $a_{t+1}$ from explore($\hat{Q}^{\pi}_t(s_{t+1}, a)$)
    update $\hat{Q}^{\pi}_t(s_t, a_t)$ according to
        $\hat{Q}^{\pi}_{t+1}(s_t, a_t) = \hat{Q}^{\pi}_{t}(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \hat{Q}^{\pi}_{t}(s_{t+1}, a_{t+1}) - \hat{Q}^{\pi}_{t}(s_t, a_t) \big)$
    t := t + 1
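The same loop rendered in Python for Sarsa; here `explore` is ε-greedy over the current Q estimates, which is one common choice consistent with, but not specified by, the slide. The environment interface and parameter values are again assumptions.

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_steps, alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy Sarsa: the update uses the action actually selected next."""
    Q = defaultdict(float)

    def explore(state):
        # epsilon-greedy over the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(state, b)])

    s = env.reset()
    a = explore(s)                               # select a_t from explore(Q(s_t, .))
    for _ in range(num_steps):
        s_next, r, done = env.step(a)            # observe s_t --a_t--> s_{t+1}, r_{t+1}
        a_next = explore(s_next)                 # select a_{t+1} from explore(Q(s_{t+1}, .))
        target = 0.0 if done else Q[(s_next, a_next)]
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
        Q[(s, a)] += alpha * (r + gamma * target - Q[(s, a)])
        if done:
            s = env.reset()
            a = explore(s)
        else:
            s, a = s_next, a_next
    return Q
```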
Summary: TD, Q-learning, Sarsa
• TD learning
Vˆtp1 ( st )  Vˆtp ( st )   (rt 1   Vˆtp ( st 1 )  Vˆtp ( st ))
st
at
rt+1
st+1
• One-step Q-learning
$\hat{Q}_{t+1}(s_t, a_t) = \hat{Q}_{t}(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_{t}(s_{t+1}, b) - \hat{Q}_{t}(s_t, a_t) \big)$
• Sarsa learning
Qˆtp1 ( st , at )  Qˆtp ( st , at )   rt 1   Qˆtp ( st 1 , at 1 )  Qˆtp ( st , at )
at
at+1
st+1
st
rt+1


15
Speeding up learning: Eligibility traces, TD(λ)
• TD learning only passes the TD error to one state
Vˆtp1 ( st )  Vˆtp ( st )   (rt 1  Vˆtp ( st 1 )  Vˆtp ( st ))
st-2
at-2
rt-1
st-1
at-1
rt
st
at
rt+1
st+1
if st  s
 1
• We add an eligibility for each state: et 1 ( s)  
l et ( s) otherwise

where 0  l  1
• Update Vˆtp ( s ) in every state s  S proportional to the eligibility
Vˆtp1 ( s )  Vˆtp ( s )   (rt 1  Vˆtp ( st 1 )  Vˆtp ( st ))et ( s )
16
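A sketch of tabular TD(λ) with replacing traces, matching the update above; the γλ decay of the trace follows the standard Sutton and Barto formulation, and the environment/policy interface and parameter values are assumptions of this sketch.

```python
from collections import defaultdict

def td_lambda(env, policy, num_steps, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with replacing eligibility traces."""
    V = defaultdict(float)
    e = defaultdict(float)                     # eligibility e_t(s) for every visited state
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]   # TD error
        e[s] = 1.0                             # replacing trace: e(s_t) <- 1
        for state in list(e):
            V[state] += alpha * delta * e[state]   # update every state in proportion to e
            e[state] *= gamma * lam                # decay: e(s) <- gamma * lambda * e(s)
        if done:
            s = env.reset()
            e.clear()                          # traces are usually reset between episodes
        else:
            s = s_next
    return V
```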
Eligibility traces for learning control: Q(λ)
• There are various eligibility trace methods for Q-learning
• Update for every (s, a) pair:
$\hat{Q}_{t+1}(s, a) = \hat{Q}_{t}(s, a) + \alpha \big( r_{t+1} + \gamma \max_{b \in A} \hat{Q}_{t}(s_{t+1}, b) - \hat{Q}_{t}(s_t, a_t) \big)\, e_t(s, a)$
[Figure: trajectory $s_{t-1}, a_{t-1}, r_t, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}$, with the greedy action $a_{\text{greedy}}$ marked at $s_{t+1}$.]
• Pass information backwards through a non-greedy action
• Lose the convergence guarantee of one-step Q-learning
• Watkins's solution: zero all eligibilities after a non-greedy action
• Problem: you lose most of the benefit of eligibility traces
Eligibility traces for learning control: Sarsa(λ)
• Solution: use Sarsa since it’s on policy
• Update for every (s, a) pair:
$\hat{Q}^{\pi}_{t+1}(s, a) = \hat{Q}^{\pi}_{t}(s, a) + \alpha \big( r_{t+1} + \gamma \hat{Q}^{\pi}_{t}(s_{t+1}, a_{t+1}) - \hat{Q}^{\pi}_{t}(s_t, a_t) \big)\, e_t(s, a)$
[Figure: trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}$.]
• Keeps convergence guarantees
Approximate Reinforcement Learning
• Why?
– To learn in reasonable time and space
(avoid Bellman’s curse of dimensionality)
– To generalise to new situations
• Solutions
– Approximate the value function
– Search in the policy space
– Approximate a model (and plan)
Linear Value Function Approximation
• Simplest useful class of problems
• Some convergence results
• We'll focus on linear TD(λ)
Weight vector at time t: $\theta_t = (\theta_t(1), \theta_t(2), \dots, \theta_t(n))^{\top}$
Feature vector for state s: $\phi_s = (\phi_s(1), \phi_s(2), \dots, \phi_s(n))^{\top}$
Our value estimate: $\hat{V}^{\pi}_t(s) = \theta_t^{\top} \phi_s$
Our objective is to minimise: $MSE(\theta_t) = \sum_{s \in S} P(s) \big[ V^{\pi}(s) - \hat{V}^{\pi}_t(s) \big]^2$
Value Function Approximation: features
• There are numerous schemes; CMACs and RBFs are popular
• CMAC: n tiles in the space (aggregate over all tilings)
$\phi_s = (\phi_s(1), \phi_s(2), \dots, \phi_s(n))^{\top}$
• Features $\phi_s(i) = 1$ or $0$
• Properties
– Coarse coding
– Regular tiling → efficient access
– Use random hashing to reduce memory
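A much-simplified tile-coding sketch for a one-dimensional state, just to show binary features $\phi_s(i) \in \{0, 1\}$ aggregated over a few offset tilings; real CMAC implementations use hashing and higher-dimensional tiles, and the tile counts and offsets here are arbitrary choices.

```python
import numpy as np

def cmac_features(x, low=0.0, high=1.0, num_tilings=4, tiles_per_tiling=8):
    """Binary feature vector for a scalar state x: one active tile per tiling."""
    phi = np.zeros(num_tilings * tiles_per_tiling)
    width = (high - low) / tiles_per_tiling
    for tiling in range(num_tilings):
        offset = tiling * width / num_tilings           # each tiling is slightly shifted
        index = int((x - low + offset) / width)
        index = min(max(index, 0), tiles_per_tiling - 1)
        phi[tiling * tiles_per_tiling + index] = 1.0    # coarse coding: phi_s(i) = 1 or 0
    return phi

print(cmac_features(0.37))   # a sparse 0/1 vector with num_tilings ones
```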
Linear Value Function Approximation
• We perform gradient descent using $\theta_t$
• The update equation for TD(λ) becomes
$\theta_{t+1} = \theta_t + \alpha \big( r_{t+1} + \gamma \hat{V}^{\pi}_t(s_{t+1}) - \hat{V}^{\pi}_t(s_t) \big)\, e_t$
• Where the eligibility trace is an n-dimensional vector updated using
$e_t = \gamma \lambda e_{t-1} + \phi_{s_t}$
• If the states are presented with the frequency they would be seen under the policy π you are evaluating, TD(λ) converges close to $\theta^{*}$
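Putting the pieces together, a hedged Python sketch of linear gradient-descent TD(λ): `theta` is the weight vector, `phi` is a feature map such as the tile coder above, and the environment/policy interface, step size and trace decay are assumptions consistent with the update equations above.

```python
import numpy as np

def linear_td_lambda(env, policy, phi, n_features, num_steps,
                     alpha=0.01, gamma=0.9, lam=0.8):
    """Linear TD(lambda): V_hat(s) = theta . phi(s), with an n-dim eligibility vector."""
    theta = np.zeros(n_features)
    e = np.zeros(n_features)                     # eligibility trace vector e_t
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        v = theta @ phi(s)                       # V_hat(s_t) = theta^T phi(s_t)
        v_next = 0.0 if done else theta @ phi(s_next)
        delta = r + gamma * v_next - v           # TD error
        e = gamma * lam * e + phi(s)             # e_t = gamma * lambda * e_{t-1} + phi(s_t)
        theta += alpha * delta * e               # theta_{t+1} = theta_t + alpha * delta * e_t
        if done:
            s = env.reset()
            e = np.zeros(n_features)
        else:
            s = s_next
    return theta
```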
Value Function Approximation (VFA)
Convergence results
• Linear TDl converges if we visit states using the on-policy
distribution
• Off policy Linear TD(l and linear Q learning are known to diverge
in some cases
• Q-learning, and value iteration used with some averagers (including
k-Nearest Neighbour and decision trees) has almost sure
convergence if particular exploration policies are used
• A special case of policy iteration with Sarsa style updates and linear
function approximation converges
• Residual algorithms are guaranteed to converge but only very
slowly
23
Value Function Approximation (VFA)
TD-Gammon
• TD(λ) learning and a Backprop net with one hidden layer
• 1,500,000 training games (self-play)
• Equivalent in skill to the top dozen human players
• Backgammon has ~10^20 states, so can't be solved using DP
Model-based RL: structured models
• Transition model P is represented compactly using a
Dynamic Bayes Net
(or factored MDP)
• V is represented as a tree
• Backups look like goal
regression operators
• Converging with the AI
planning community
Reinforcement Learning with Hidden State
• Learning in a POMDP, or k-Markov environment
[Figure: a POMDP trajectory; the agent receives observations $o_t, o_{t+1}, o_{t+2}, \dots$ rather than the underlying states $s_t, s_{t+1}, s_{t+2}, \dots$, while taking actions $a_t, a_{t+1}, \dots$ and receiving rewards $r_{t+1}, r_{t+2}, \dots$]
• Planning in POMDPs is intractable
• Factored POMDPs look promising
• Policy search can work well
Policy Search
• Why not search directly for a policy?
• Policy gradient methods and Evolutionary methods
• Particularly good for problems with hidden state
Other RL applications
• Elevator Control (Barto & Crites)
• Space shuttle job scheduling (Zhang & Dietterich)
• Dynamic channel allocation in cellphone networks (Singh
& Bertsekas)
Hot Topics in Reinforcement Learning
• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical reinforcement learning
• Learning in processes with hidden state (e.g. POMDPs)
• Policy search methods
Reinforcement Learning: key papers
Overviews
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press,
1998.
J. Wyatt, Reinforcement Learning: A Brief Overview. Perspectives on Adaptivity and
Learning. Springer Verlag, 2003.
L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
Value Function Approximation
D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
Eligibility Traces
S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
Reinforcement Learning: key papers
Structured Models and Planning
C. Boutilier, T. Dean and S. Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.
R. Dearden, C. Boutilier and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.
B. Sallans. Reinforcement Learning for Factored Markov Decision Processes. Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 2001.
K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, University of California, Berkeley, 2002.
Reinforcement Learning: key papers
Policy Search
R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
R. Sutton, D. McAllester, S. Singh, Y. Mansour. Policy Gradient Methods for
Reinforcement Learning with Function Approximation. NIPS 12, 2000.
Hierarchical Reinforcement Learning
R. Sutton, D. Precup and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
R. Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD
Thesis, University of California, Berkeley, 1998.
A. Barto and S. Mahadevan. Recent Advances in Hierarchical Reinforcement
Learning. Discrete Event Systems Journal 13: 41-77, 2003.
Reinforcement Learning: key papers
Exploration
N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35:117-154, 1999.
J. Wyatt. Exploration control in reinforcement learning using optimistic model
selection. In Proceedings of 18th International Conference on Machine Learning,
2001.
POMDPs
L. Kaelbling, M. Littman, A. Cassandra. Planning and Acting in Partially Observable
Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.