Transcript Source - Department of Computing Science

From Markov Decision Processes to Artificial Intelligence
Rich Sutton
with thanks to: Andy Barto, Satinder Singh, Doina Precup
The steady march of computing science is changing artificial intelligence
• More computation-based approximate methods
– Machine learning, neural networks, genetic algorithms
• Machines are taking on more of the work
– More data, more computation
– Fewer handcrafted solutions, less human understandability
• More search
– Exponential methods are still exponential…
but compute-intensive methods are increasingly winning
• More general problems
– stochastic, non-linear, optimal
– real-time, large
[Figure: agent-world interaction loop, alternating states and actions]
The problem is to predict and control a doubly branching interaction unfolding over time, with a long-term goal.
Sequential, state-action-reward problems are ubiquitous
• Walking
• Flying a helicopter
• Playing tennis
• Logistics
• Inventory control
• Intruder detection
• Repair or replace?
• Visual search for objects
• Playing chess, Go, Poker
• Medical tests, treatment
• Conversation
• User interfaces
• Marketing
• Queue/server control
• Portfolio management
• Industrial process control
• Pipeline failure prediction
• Real-time load balancing
Markov Decision Processes (MDPs)
• Discrete time: $t = 0, 1, 2, 3, \ldots$
• States: $s_t \in S$
• Actions: $a_t \in A$
• Policy: $\pi : S \times A \to [0,1]$, with $\pi(s,a) = \Pr\{a_t = a \mid s_t = s\}$
• Transition probabilities: $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
• Rewards: $r_t \in \Re$, with $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
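To make this notation concrete, here is a minimal sketch (not part of the talk) of a finite MDP in Python; the two states, two actions, and all numbers are made up purely for illustration.

```python
import random

# A tiny made-up MDP, matching the slide's notation:
# P[s][a][s'] = Pr{s_{t+1}=s' | s_t=s, a_t=a}, R[s][a][s'] = E{r_{t+1} | s, a, s'}
states = ["s0", "s1"]
actions = ["left", "right"]

P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.7, "s1": 0.3}, "right": {"s0": 0.0, "s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0, "s1": 1.0}, "right": {"s0": 0.0, "s1": 2.0}},
    "s1": {"left": {"s0": 0.5, "s1": 0.0}, "right": {"s0": 0.0, "s1": -1.0}},
}

def policy(s):
    """A stochastic policy pi(s, a) = Pr{a_t=a | s_t=s}; here uniform random."""
    return {a: 1.0 / len(actions) for a in actions}

def step(s, a):
    """Sample (s', r) from the MDP's dynamics."""
    probs = P[s][a]
    s_next = random.choices(list(probs), weights=list(probs.values()))[0]
    return s_next, R[s][a][s_next]
```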
MDPs Part II: The Objective
• "Maximize cumulative reward"
• Define the value of being in a state under a policy as
$$V^\pi(s) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_0 = s, \pi\}$$
where delayed rewards are discounted by $\gamma \in [0,1]$. (There are other possibilities...)
• Defines a partial ordering over policies, with at least one optimal policy. (Needs proving.)
Markov Decision Processes
• Extensively studied since 1950s
• In Optimal Control
– Specializes to Riccati equations for linear systems
– And to HJB equations for continuous time systems
– Only general, nonlinear, optimal-control framework
• In Operations Research
– Planning, scheduling, logistics
– Sequential design of experiments
– Finance, marketing, inventory control, queuing, telecomm
• In Artificial Intelligence (last 15 years)
– Reinforcement learning, probabilistic planning
• Dynamic Programming is the dominant solution method
Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
– The curse of dimensionality
• Reinforcement Learning (RL)
– TD(λ) algorithm
– TD-Gammon example
– Acrobot example
• RL significantly extends DP methods for solving MDPs
– RoboCup example
• Conclusion, from the AI point of view
– Spy plane example
The Principle of Optimality
• Dynamic Programming (DP) requires a decomposition
into subproblems
• In MDPs this comes from the Independence of Path
assumption
• Values can be written in terms of successor values, e.g.,
$$V^\pi(s) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_0 = s, \pi\}
= E\{r_1 + \gamma V^\pi(s_1) \mid s_0 = s, \pi\}
= \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$
the "Bellman equations"
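As an illustration of how the Bellman equation is used (a sketch, not code from the talk), iterative policy evaluation simply applies the last line above as an assignment until the values stop changing; it assumes the hypothetical states, actions, P, R, and policy structures from the earlier MDP sketch.

```python
def evaluate_policy(states, actions, P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: apply the Bellman equation as an update."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                policy(s)[a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                for a in actions
                for s2 in states
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:          # values have (numerically) stopped changing
            return V
```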
Dynamic Programming:
Sweeping through the states, updating an approximation to the optimal value function.
For example, Value Iteration:
Initialize: $V(s) := 0 \quad \forall s$
Do forever: $V(s) := \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V(s')\right] \quad \forall s \in S$
In the limit, $V \to V^*$. Pick any of the maximizing actions to get $\pi^*$.
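A minimal sketch of this Value Iteration loop in Python, assuming the same hypothetical states/actions/P/R tables as in the earlier MDP example:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Sweep over all states, backing up each V(s) with the max over actions."""
    V = {s: 0.0 for s in states}            # Initialize: V(s) := 0 for all s
    while True:
        delta = 0.0
        for s in states:                    # one sweep
            v_new = max(
                sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:                     # in the limit, V approaches V*
            break
    # Pick any maximizing action in each state to get an (approximately) optimal policy
    pi = {
        s: max(
            actions,
            key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states),
        )
        for s in states
    }
    return V, pi
```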
DP is repeated backups, shallow lookahead searches
[Figure: one-step backup diagram from state s through action a and transition probabilities $P^a_{ss'}$ to successor states s' and s'', with rewards $R^a_{ss'}$ and values V(s'), V(s'')]
Dynamic Programming is the dominant
solution method for MDPs
• Routinely applied to problems with millions of states
• Worst case scales polynomially in |S| and |A|
• Linear Programming has better worst-case bounds
but in practice scales 100s of times worse
• On large stochastic problems, only DP is feasible
Perennial Difficulties for DP
1. Large state spaces ("the curse of dimensionality")
2. Difficulty calculating the dynamics $P^a_{ss'}$ and $R^a_{ss'}$, e.g., from a simulation
3. Unknown dynamics
$$V(s) := \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V(s')\right] \quad \forall s \in S$$
The Curse of Dimensionality
Bellman, 1961
• The number of states grows exponentially with
dimensionality -- the number of state variables
• Thus, on large problems,
– Can’t complete even one sweep of DP
• Can’t enumerate states, need sampling!
– Can’t store separate values for each state
• Can’t store values in tables, need function approximation!
$$V(s) := \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V(s')\right] \quad \forall s \in S$$
Reinforcement Learning:
Using experience in place of dynamics
Let $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, \ldots$ be an observed sequence with actions selected by $\pi$.
For every time step $t$,
$$V^\pi(s_t) = E\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t, \pi\}$$
(the "Bellman equation"), which suggests the DP-like update:
$$V(s_t) := E\{r_{t+1} + \gamma V(s_{t+1}) \mid s_t, \pi\}.$$
We don’t know this expected value, but we know the actual $r_{t+1} + \gamma V(s_{t+1})$, an unbiased sample of it.
In RL, we take a step toward this sample, e.g., half way:
"Tabular TD(0)":
$$V(s_t) := \tfrac{1}{2} V(s_t) + \tfrac{1}{2}\left[r_{t+1} + \gamma V(s_{t+1})\right]$$
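For concreteness, a minimal tabular TD(0) sketch (not from the talk), reusing the hypothetical states, policy, and step helpers from the earlier MDP example; alpha = 0.5 corresponds to the "half way" step above.

```python
import random

def tabular_td0(states, policy, step, start_state, alpha=0.5, gamma=0.9, n_steps=10000):
    """Tabular TD(0): move V(s_t) part way toward the sample r_{t+1} + gamma*V(s_{t+1})."""
    V = {s: 0.0 for s in states}
    s = start_state
    for _ in range(n_steps):
        probs = policy(s)
        a = random.choices(list(probs), weights=list(probs.values()))[0]
        s_next, r = step(s, a)                  # experience in place of dynamics
        target = r + gamma * V[s_next]          # unbiased sample of the expected value
        V[s] += alpha * (target - V[s])         # alpha = 0.5 is the "half way" step
        s = s_next
    return V
```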
Temporal-Difference Learning (Sutton, 1988)
Updating a prediction based on its change (temporal difference) from one moment to the next.
Tabular TD(0):
$$V(s_t) := V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right], \qquad t = 0, 1, 2, \ldots$$
The bracketed term is the temporal difference between the first prediction and the better, later prediction.
Or, V is, e.g., a neural network with parameter $\theta$. Then use gradient-descent TD(0):
$$\theta := \theta + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]\nabla_\theta V(s_t), \qquad t = 0, 1, 2, \ldots$$
TD(λ), λ > 0, uses differences from later predictions as well.
TD-Gammon (Tesauro, 1992-1995)
[Figure: neural network with 162 input units, trained by TD error; output ≈ probability of winning]
Start with a random network.
Play millions of games against itself.
Learn a value function from this simulated experience.
This produces arguably the best player in the world.
Action selection by 2-3 ply search (Tesauro, 1992).
TD-Gammon vs an Expert-Trained Net
[Figure: fraction of games won against Gammontool (.45 to .70) vs. number of hidden units (0, 10, 20, 40, 80), comparing TD-Gammon + features, TD-Gammon, EP + features ("Neurogammon"), and EP (a BP net trained from expert moves)]
Examples of Reinforcement Learning
• Elevator Control
Crites & Barto
– (Probably) world's best down-peak elevator controller
• Walking Robot
Benbrahim & Franklin
– Learned critical parameters for bipedal walking
• Robocup Soccer Teams
e.g., Stone & Veloso, Riedmiller et al.
– RL is used in many of the top teams
• Inventory Management
Van Roy, Bertsekas, Lee & Tsitsiklis
– 10-15% improvement over industry standard methods
• Dynamic Channel Assignment
Singh & Bertsekas, Nie & Haykin
– World's best assigner of radio channels to mobile telephone calls
• KnightCap and TDleaf
Baxter, Tridgell & Weaver
– Improved chess play from intermediate to master in 300 games
Does function approximation beat
the curse of dimensionality?
• Yes… probably
• FA makes dimensionality per se largely irrelevant
• With FA, computation seems to scale with the
complexity of the solution (crinkliness of the value
function) and how hard it is to find it
• If you can get FA to work!
FA in DP and RL (1st bit)
• Conventional DP works poorly with FA
– Empirically [Boyan and Moore, 1995]
– Diverges with linear FA [Baird, 1995]
– Even for prediction (evaluating a fixed policy)
[Baird, 1995]
• RL works much better
– Empirically [many applications and Sutton, 1996]
– TD(λ) prediction converges with linear FA [Tsitsiklis & Van Roy, 1997]
– TD(λ) control converges with linear FA [Perkins & Precup, 2002]
• Why? Following actual trajectories in RL ensures that every state is updated at least as often as it is the basis for updating
– DP+FA fails: more transitions can go into a state than go out
– RL+FA works: real trajectories always leave a state after entering it
Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
– The curse of dimensionality
• Reinforcement Learning (RL)
– TD(λ) algorithm
– TD-Gammon example
– Acrobot example
• RL significantly extends DP methods for solving MDPs
– RoboCup example
• Conclusion, from the AI point of view
– Spy plane example
The Mountain Car Problem (Moore, 1990)
[Figure: underpowered car in a valley; goal at the top of the right hill; "Gravity wins"]
Minimum-Time-to-Goal Problem
SITUATIONS: car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always –1 until car reaches the goal
No discounting
Value Functions Learned while solving the Mountain Car problem (Sutton, 1996)
[Figure: learned value function over car position and velocity, with the goal region marked; value = estimated time to goal (lower is better); representation is sparse, coarse tile coding (CMAC)]
Tile Coding (CMAC) (Albus, 1980)
An example of sparse coarse-coded networks: a fixed, expansive re-representation of the input produces features that feed a linear last layer.
Coarse: large receptive fields
Sparse: few features present at one time
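A rough sketch of one way to implement such a tile coder for two continuous variables; the ranges below are the usual mountain-car bounds, used only for illustration, and this is not Albus's or the talk's exact implementation.

```python
def tile_indices(x, y, n_tilings=8, tiles_per_dim=9,
                 x_range=(-1.2, 0.6), y_range=(-0.07, 0.07)):
    """Sparse, coarse tile coding for two continuous variables (e.g. car
    position and velocity). Each of n_tilings grids is offset by a fraction
    of a tile width, so exactly n_tilings features are active (1) out of
    n_tilings * tiles_per_dim**2 total binary features."""
    active = []
    x_unit = (x_range[1] - x_range[0]) / tiles_per_dim   # width of one tile
    y_unit = (y_range[1] - y_range[0]) / tiles_per_dim
    for t in range(n_tilings):
        x_off = x_range[0] - t * x_unit / n_tilings      # offset this tiling
        y_off = y_range[0] - t * y_unit / n_tilings
        xi = min(int((x - x_off) / x_unit), tiles_per_dim - 1)
        yi = min(int((y - y_off) / y_unit), tiles_per_dim - 1)
        active.append(t * tiles_per_dim ** 2 + xi * tiles_per_dim + yi)
    return active

# A linear value estimate is then just a sum of the few active weights:
# V(pos, vel) = sum(weights[i] for i in tile_indices(pos, vel))
```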
The Acrobot Problem (Sutton, 1996; e.g., Dejong & Spong, 1994)
[Figure: two-link pendulum with a fixed base, joint angles θ1 and θ2, and torque applied at the second joint; goal: raise the tip above the line]
Minimum-Time-to-Goal:
4 state variables: 2 joint angles, 2 angular velocities
Reward = –1 per time step
Tile coding with 48 tilings
The RoboCup Soccer Competition (Stone & Sutton, 2001)
13 continuous state variables (for 3 vs 2):
– 11 distances among the players, the ball, and the center of the field
– 2 angles to takers along passing lanes
RoboCup Feature Vectors (Stone & Sutton, 2001)
Full soccer state → 13 continuous state variables → sparse, coarse tile coding → huge binary feature vector $\phi_s$ (about 400 1's and 40,000 0's) → linear map $\theta$ → action values
Learning Keepaway Results (Stone & Sutton, 2001)
[Figure: learning curves from multiple, independent runs of TD(λ) for 3v2 keepaway against handcrafted takers]
Hajime Kimura’s RL Robots (dynamics knowledge)
[Video stills: Before; Backward; After; New Robot, Same algorithm]
Assessment re: DP
• RL has added some new capabilities to DP methods
– Much larger MDPs can be addressed (approximately)
– Simulations can be used without explicit probabilities
– Dynamics need not be known or modeled
• Many new applications are now possible
– Process control, logistics, manufacturing, telecomm, finance,
scheduling, medicine, marketing…
• Theoretical and practical questions remain open
Outline
• Markov decision processes (MDPs)
• Dynamic Programming (DP)
– The curse of dimensionality
• Reinforcement Learning (RL)
– TD(λ) algorithm
– TD-Gammon example
– Acrobot example
• RL significantly extends DP methods for solving MDPs
– RoboCup example
• Conclusion, from the AI point of view
– Spy plane example
A lesson for AI:
The Power of a “Visible” Goal
• In MDPs, the goal (reward) is part of the data,
part of the agent’s normal operation
• The agent can tell for itself how well it is doing
• This is very powerful… we should do more of it in AI
• Can we give AI tasks visible goals?
– Visual object recognition? Better would be active vision
– Story understanding? Better would be dialog, e.g., call routing
– User interfaces, personal assistants
– Robotics… say mapping and navigation, or search
• The usual trick is to make them into long-term
prediction problems
• Must be a way. If you can’t feel it, why care about it?
Assessment re: AI
• DP and RL are potentially powerful probabilistic
planning methods
– But typically don’t use logic or structured representations
• How are they as an overall model of thought?
– Good mix of deliberation and immediate judgments (values)
– Good for causality, prediction, not for logic, language
• The link to data is appealing…but incomplete
– MDP-style knowledge may be learnable, tuneable, verifiable
– But only if the “level” of the data is right
• Sometimes seems too low-level, too flat
Ongoing and Future Directions
• Temporal abstraction
[Sutton, Precup, Singh, Parr, others]
– Generalize transitions to include macros, “options”
– Multiple overlying MDP-like models at different levels
• State representation
[Littman, Sutton, Singh, Jaeger...]
– Eliminate the nasty assumption of observable state
– Get really real with data
– Work up to higher-level, yet grounded, representations
• Neuroscience of reward systems
[Dayan, Schultz, Doya]
– Dopamine reward system behaves remarkably like TD
• Theory and practice of value function approximation
[everybody]
Spy Plane Example (Reconnaissance Mission Planning) (Sutton & Ravindran, 2001)
[Figure: map of reconnaissance sites with rewards (values such as 15, 25, 50, 100), mean time between weather changes of 25, options shown as routes between sites, the base, and about 100 decision steps per mission]
• Mission: Fly over (observe) most valuable sites and return to base
• Stochastic weather affects observability (cloudy or clear) of sites
• Limited fuel
• Intractable with classical optimal control methods
• Temporal scales:
– Actions: which direction to fly now
– Options: which site to head for
• Options compress space and time
– Reduce steps from ~600 to ~6
– Reduce states from ~10^11 to ~10^6
$$Q^*_O(s, o) = r^o_s + \sum_{s'} p^o_{ss'} V^*_O(s')$$
where s ranges over any state (~10^6) and s' over sites only (6)
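As a hedged sketch of how such option values might be computed, the backup above can be iterated like ordinary value iteration over an option model; the r and p tables below (expected discounted option reward and discounted termination probabilities) are assumed inputs for illustration, not the talk's actual model.

```python
def option_value_iteration(states, options, r, p, n_iters=1000):
    """Value iteration over an option model: r[s][o] is the expected
    (discounted) reward for running option o from s, and p[s][o][s'] is the
    discounted probability of terminating in s'."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        for s in states:
            # back up V with the best option, per Q*_O(s,o) = r_s^o + sum_s' p_ss'^o V*_O(s')
            V[s] = max(
                r[s][o] + sum(p[s][o][s2] * V[s2] for s2 in p[s][o])
                for o in options
            )
    Q = {
        (s, o): r[s][o] + sum(p[s][o][s2] * V[s2] for s2 in p[s][o])
        for s in states
        for o in options
    }
    return V, Q
```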
Spy Plane Results (Sutton & Ravindran, 2001)
[Figure: expected reward per mission (roughly 30 to 60) under high-fuel and low-fuel conditions, comparing the SMDP planner, the SMDP planner with re-evaluation, and the static re-planner]
• SMDP planner:
– Assumes options followed to completion
– Plans optimal SMDP solution
• SMDP planner with re-evaluation of options on each step:
– Plans as if options must be followed to completion
– But actually takes them for only one step
– Re-picks a new option on every step
• Static planner:
– Assumes weather will not change
– Plans optimal tour among clear sites
– Re-plans whenever weather changes
Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner.
Didn’t have time for
• Action Selection
– Exploration/Exploitation
– Action values vs. search
– How learning values leads to policy improvements
• Different returns, e.g., the undiscounted case
• Exactly how FA works, backprop
• Exactly how options work
– How planning at a high level can affect primitive actions
• How states can be abstracted to affordances
– And how this directly builds on the option work