Dynamic Programming

Chapter 4: Dynamic Programming
Objectives of this chapter:
• Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
• Show how DP can be used to compute value functions, and hence, optimal policies
• Discuss efficiency and utility of DP
Policy Evaluation
Policy Evaluation: for a given policy π, compute the state-value function V^π.

Recall the state-value function for policy π:

    V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s \Big\}

Bellman equation for V^\pi:

    V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]

— a system of |S| simultaneous linear equations
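Because the equations are linear in V^π, they can be solved directly. A minimal NumPy sketch, assuming tabular dynamics stored as arrays P[s, a, s'] and R[s, a, s'] and a stochastic policy pi[s, a] (these array names are illustrative, not from the slides):

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the Bellman equations (I - gamma * P_pi) V = r_pi for a fixed policy.

    P[s, a, s'] : transition probabilities   (assumed array layout)
    R[s, a, s'] : expected immediate rewards
    pi[s, a]    : action probabilities under the policy
    Assumes gamma < 1 so the linear system has a unique solution.
    """
    n_states = P.shape[0]
    # Policy-weighted transition matrix and expected one-step reward.
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
    # The |S| simultaneous linear equations, solved directly.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```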
Iterative Methods
    V_0 \rightarrow V_1 \rightarrow \cdots \rightarrow V_k \rightarrow V_{k+1} \rightarrow \cdots \rightarrow V^\pi

Each step from V_k to V_{k+1} is a “sweep.”
A sweep consists of applying a backup operation to each state.
A full policy evaluation backup:
Vk 1 (s)    (s,a) P R   Vk (s)
a
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
s
a
ss
a
ss
3
Iterative Policy Evaluation
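A hedged Python sketch of iterative policy evaluation with full, in-place sweeps, using the same assumed P, R, pi array layout as in the earlier snippet:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Iterative policy evaluation with full sweeps.

    Repeatedly applies the full backup
        V(s) <- sum_a pi(s,a) sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V(s'))
    until the largest change in a sweep falls below theta.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):          # one sweep = one backup per state
            v_new = np.sum(pi[s] * np.sum(P[s] * (R[s] + gamma * V), axis=1))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                   # in-place update
        if delta < theta:
            return V
```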
A Small Gridworld
• An undiscounted episodic task
• Nonterminal states: 1, 2, . . ., 14
• One terminal state (shown twice, as the shaded squares)
• Actions that would take the agent off the grid leave the state unchanged
• Reward is –1 until the terminal state is reached
Iterative Policy Eval for the Small Gridworld
π = random (uniform) action choices
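The grid of V_k values from the figure can be recomputed with a short standalone sketch; the state numbering and terminal handling below are my own encoding of the slide's setup:

```python
import numpy as np

# 4x4 gridworld: states 0..15 laid out row-major; 0 and 15 are the terminal
# corner ("one terminal state shown twice"). Reward is -1 per step, gamma = 1.
N = 4
TERMINALS = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(s, a):
    """Deterministic move; actions that would leave the grid leave the state unchanged."""
    row, col = divmod(s, N)
    nr, nc = row + a[0], col + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = row, col
    return nr * N + nc, -1.0

V = np.zeros(N * N)
while True:
    delta = 0.0
    for s in range(N * N):
        if s in TERMINALS:
            continue
        # Equiprobable random policy: average the four one-step lookaheads.
        returns = []
        for a in ACTIONS:
            s2, reward = step(s, a)
            returns.append(reward + V[s2])   # undiscounted: gamma = 1
        v_new = np.mean(returns)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-10:
        break

print(V.reshape(N, N).round(1))   # nonterminal values converge to -14, -18, -20, -22
```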
Policy Improvement
Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

    Q^\pi(s,a) = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
               = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]

It is better to switch to action a for state s if and only if

    Q^\pi(s,a) > V^\pi(s)
Policy Improvement Cont.
Do this for all states to get a new policy π′ that is greedy with respect to V^π:

    \pi'(s) = \arg\max_a Q^\pi(s,a)
            = \arg\max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]

Then V^{\pi'} \ge V^\pi.
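A sketch of the full greedy improvement step (“greedification”) over all states at once, under the same assumed array layout:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Return a deterministic policy that is greedy with respect to V."""
    # Q[s, a] = sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V[s']) for every (s, a).
    Q = np.sum(P * (R + gamma * V), axis=2)
    return np.argmax(Q, axis=1)
```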
Policy Improvement Cont.
What if V^{\pi'} = V^\pi?

i.e., for all s ∈ S,

    V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi'}(s') \big] ?

But this is the Bellman optimality equation.

So V^{\pi'} = V^* and both π and π′ are optimal policies.
Policy Iteration
0  V
0
1
 1  V 
policy evaluation
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
 V 
*
*
*
policy improvement
“greedification”
10
Policy Iteration
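A compact sketch of the full policy iteration algorithm under the same assumed tabular arrays (a sketch, not the book's exact pseudocode; assumes gamma < 1):

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)          # arbitrary deterministic starting policy
    while True:
        # --- policy evaluation (iterative, full sweeps) ---
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                v_new = np.sum(P[s, a] * (R[s, a] + gamma * V))
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # --- policy improvement ("greedification") ---
        Q = np.sum(P * (R + gamma * V), axis=2)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):      # policy stable -> optimal
            return policy, V
        policy = new_policy
```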
Jack’s Car Rental
Jack owns two car-rental locations.

Customers arrive at both locations at random, with rentals and returns following a Poisson distribution:

    P(X = n) = \frac{\lambda^n}{n!} e^{-\lambda}

Suppose λ = 3 and 4 for rental requests and λ = 3 and 2 for returns at the two locations.

• Jack earns $10 for each car he rents out.
• Moving a car from one location to the other costs $2.
• There can be at most 20 cars at each location.
• At most 5 cars can be moved overnight.

Formulate this as an MDP with γ = 0.9:
• States: the number of cars at each location.
• Actions: the number of cars moved each night.
• The starting policy π_0 never moves any car.
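The Poisson probabilities that drive the car-rental dynamics are easy to tabulate. A small sketch (truncating at the 20-car capacity is my own simplification for illustration):

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(X = n) = lam**n / n! * exp(-lam)."""
    return lam ** n / factorial(n) * exp(-lam)

# Rental-request distributions at the two locations (lambda = 3 and 4),
# truncated at the 20-car capacity.
requests_loc1 = [poisson_pmf(n, 3) for n in range(21)]
requests_loc2 = [poisson_pmf(n, 4) for n in range(21)]
```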
Value Iteration
Recall the full policy evaluation backup:

    V_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]

Here is the full value iteration backup:

    V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]
Value Iteration Cont.
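A minimal sketch of value iteration with the same assumed P and R arrays:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Turn the Bellman optimality equation into an update rule.

    Each sweep applies V(s) <- max_a sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V(s')).
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = np.sum(P * (R + gamma * V), axis=2)   # Q[s, a] one-step lookahead
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            # Also return a deterministic policy that is greedy with respect to V.
            return V_new, np.argmax(Q, axis=1)
        V = V_new
```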
Gambler’s Problem
A gambler repeatedly bets on coin flips; if the coin comes up heads, he wins as much as he staked, otherwise he loses his stake.

The game ends when he reaches $100 or loses all his money.

The state is his capital,

    s ∈ {1, 2, ..., 99}

and the actions are his bets,

    a ∈ {1, 2, ..., min(s, 100 − s)}.

The reward is +1 on reaching the goal and zero otherwise, so the optimal policy maximizes the probability of reaching the goal.

If the probability of the coin coming up heads is p = 0.4, then the optimal policy is as shown in the figure.
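For reference, a self-contained sketch that solves the gambler's problem by value iteration with p = 0.4 as on the slide (the encoding of the terminal states is my own choice):

```python
import numpy as np

GOAL = 100
P_HEADS = 0.4

V = np.zeros(GOAL + 1)   # terminal values stay 0; the +1 reward for winning is in the backup

def action_returns(s, V):
    """Expected return of each possible stake when the gambler's capital is s."""
    stakes = range(1, min(s, GOAL - s) + 1)
    return [P_HEADS * (float(s + a == GOAL) + V[s + a]) + (1 - P_HEADS) * V[s - a]
            for a in stakes]

# Value iteration: sweep until the value function stops changing.
while True:
    delta = 0.0
    for s in range(1, GOAL):
        best = max(action_returns(s, V))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-12:
        break

# Greedy (optimal) policy; many ties exist, so tie-breaking changes which optimal policy appears.
policy = [0] + [1 + int(np.argmax(action_returns(s, V))) for s in range(1, GOAL)]
```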
Asynchronous DP
• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
• Repeat until a convergence criterion is met:
  – Pick a state at random and apply the appropriate backup (see the sketch after this list).
• Still needs lots of computation, but does not get locked into hopelessly long sweeps.
• Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
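A minimal sketch of the random-state variant described above, again using the assumed tabular P and R arrays (one of many possible asynchronous schemes):

```python
import numpy as np

def asynchronous_value_iteration(P, R, gamma, n_backups=100_000, rng=None):
    """Apply the value-iteration backup to one randomly chosen state at a time."""
    rng = np.random.default_rng() if rng is None else rng
    V = np.zeros(P.shape[0])
    for _ in range(n_backups):            # in practice: until a convergence criterion is met
        s = rng.integers(P.shape[0])      # pick a state at random
        V[s] = np.max(np.sum(P[s] * (R[s] + gamma * V), axis=1))
    return V
```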
Generalized Policy Iteration
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor for the convergence of GPI is given in the figure.
Efficiency of DP
• Finding an optimal policy is polynomial in the number of states…
• BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called “the curse of dimensionality”).
• In practice, classical DP can be applied to problems with a few million states.
• Asynchronous DP can be applied to larger problems, and is well suited to parallel computation.
• It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary
• Policy evaluation: backups without a max
• Policy improvement: form a greedy policy, if only locally
• Policy iteration: alternate the above two processes
• Value iteration: backups with a max
• Full backups (to be contrasted later with sample backups)
• Generalized Policy Iteration (GPI)
• Asynchronous DP: a way to avoid exhaustive sweeps
• Bootstrapping: updating estimates based on other estimates
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction