Structure and Synthesis of Robot Motion Introduction


Decision Making
in Robots and Autonomous Agents
The Markov Decision Process (MDP) model
Subramanian Ramamoorthy
School of Informatics
25 January, 2013
In the MAB Model…
• We were in a single casino and the only decision was which of a set of n arms to pull
– except perhaps in the very last slides, there was exactly one state!
We asked the following,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of
payout changing based on how you pull arms?
• What happens if you only obtain a net reward corresponding
to a long sequence of arm pulls (at the end)?
Decision Making Agent-Environment Interface
Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$
Agent observes state at step $t$: $s_t \in S$
produces action at step $t$: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \mathbb{R}$
and resulting next state: $s_{t+1}$
[Diagram: interaction trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]
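A minimal sketch of this interaction loop in Python (a toy, self-contained environment; the class, method names, and dynamics are illustrative assumptions, not part of the lecture):

import random

class ToyEnvironment:
    """Toy 2-state environment, used only to illustrate the interface."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Markov dynamics: next state and reward depend only on the
        # current state and the chosen action.
        next_state = (self.state + action) % 2
        reward = 1.0 if next_state == 0 else 0.0
        self.state = next_state
        return next_state, reward

env = ToyEnvironment()
s = env.reset()                    # observe s_0
for t in range(10):
    a = random.choice([0, 1])      # agent produces a_t in A(s_t)
    s, r = env.step(a)             # environment returns r_{t+1} and s_{t+1}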
Markov Decision Processes
• A model of the agent-environment system
• Markov property = history doesn’t matter, only current state
• If state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
– state and action sets
– one-step “dynamics” defined by transition probabilities:
$P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$ for all $s, s' \in S$, $a \in A(s)$.
– reward probabilities:
$R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$.
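As one minimal way to hold these quantities in code, here is a nested-dictionary encoding of a finite MDP in Python (the states, actions, and numbers are illustrative placeholders, not values from the lecture):

# P[s][a] is a list of (probability, next_state, expected_reward) triples,
# i.e. a tabular form of P^a_{ss'} together with R^a_{ss'}.
P = {
    "s0": {"a0": [(0.8, "s0", 1.0), (0.2, "s1", 0.0)],
           "a1": [(1.0, "s1", 0.5)]},
    "s1": {"a0": [(1.0, "s0", 0.0)]},
}

states = list(P)
actions = {s: list(P[s]) for s in states}

# Sanity check: outgoing probabilities sum to 1 for every (s, a) pair.
for s in states:
    for a in actions[s]:
        assert abs(sum(p for p, _, _ in P[s][a]) - 1.0) < 1e-9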
An Example Finite MDP
Recycling Robot
• At each step, robot has to decide whether it should (1) actively
search for a can, (2) wait for someone to bring it a can, or (3) go to
home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power
while searching, it has to be rescued (which is bad).
• Decisions made on basis of current energy level: high, low.
• Reward = number of cans collected
Recycling Robot MDP
$S = \{\mathrm{high}, \mathrm{low}\}$
$A(\mathrm{high}) = \{\mathrm{search}, \mathrm{wait}\}$
$A(\mathrm{low}) = \{\mathrm{search}, \mathrm{wait}, \mathrm{recharge}\}$
$R^{\mathrm{search}}$ = expected no. of cans while searching
$R^{\mathrm{wait}}$ = expected no. of cans while waiting
$R^{\mathrm{search}} > R^{\mathrm{wait}}$
Enumerated In Tabular Form
If you were given this much, what can you say about
the behaviour (over time) of the system?
A Very Brief Primer on
Markov Chains and Decisions
A model, as originally developed in Operations
Research/Stochastic Control theory
Stochastic Processes
• A stochastic process is an indexed collection of random variables $\{X_t\}$.
– e.g., collection of weekly demands for a product
• One type: at a particular time t, labelled by integers, the system is
found in exactly one of a finite number of mutually exclusive
and exhaustive categories or states, also labelled by integers
• The process could be embedded, in that time points correspond to
occurrences of specific events (or time may be equi-spaced)
• The random variables may depend on one another, e.g., $X_{t+1}$ may depend on $X_t$
Markov Chains
• The stochastic process is said to have the Markovian property if
$P\{X_{t+1} = j \mid X_0 = k_0, X_1 = k_1, \ldots, X_{t-1} = k_{t-1}, X_t = i\} = P\{X_{t+1} = j \mid X_t = i\}$
• The Markovian property means that the conditional probability
of a future event, given any past events and the current state, is
independent of the past states and depends only on the present state
• These conditional probabilities $P\{X_{t+1} = j \mid X_t = i\}$ are called transition probabilities
• They are stationary if time-invariant, and are then written $p_{ij}$
Markov Chains
• Looking forward in time, one has n-step transition probabilities $p_{ij}^{(n)}$
• One can collect the one-step probabilities into a transition matrix $P = [\,p_{ij}\,]$
• A stochastic process is a finite-state Markov chain if it has:
– a finite number of states
– the Markovian property
– stationary transition probabilities
– a set of initial probabilities $P\{X_0 = i\}$ for all $i$
Markov Chains
• n-step transition probabilities can be obtained from 1-step
transition probabilities recursively (Chapman-Kolmogorov)
• We can get this via the matrix too
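For reference, the Chapman-Kolmogorov equations being alluded to (standard material, restated here because the slide's formula did not survive the transcript) are:

\[
p_{ij}^{(n)} \;=\; \sum_{k} p_{ik}^{(m)}\, p_{kj}^{(n-m)}, \qquad 0 < m < n,
\]

which in matrix form says that the $n$-step matrix is the $n$-th power of the one-step matrix, $P^{(n)} = P^{n}$.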
• First Passage Time: number of transitions to go from i to j for
the first time
– If i = j, this is the recurrence time
– In general, this itself is a random variable
Markov Chains
• n-step recursive relationship for first passage time
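The recursion itself is not in the transcript; the standard relationship is:

\[
f_{ij}^{(1)} = p_{ij}, \qquad
f_{ij}^{(n)} = \sum_{k \neq j} p_{ik}\, f_{kj}^{(n-1)}, \quad n > 1,
\]

where $f_{ij}^{(n)}$ is the probability that the first passage from $i$ to $j$ takes exactly $n$ transitions.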
• For fixed $i$ and $j$, these $f_{ij}^{(n)}$ are nonnegative numbers such that $\sum_{n=1}^{\infty} f_{ij}^{(n)} \le 1$
• If $\sum_{n=1}^{\infty} f_{jj}^{(n)} = 1$, that state is a recurrent state; it is absorbing if $f_{jj}^{(1)} = 1$
Markov Chains: Long-Run Properties
Consider the 8-step transition matrix of the inventory example:
Interesting property: probability of being in state j after 8
weeks appears independent of initial level of inventory.
• For an irreducible ergodic Markov chain, one has limiting
probabilities $\pi_j = \lim_{n \to \infty} p_{ij}^{(n)}$
• The reciprocal gives you the recurrence time $m_{jj}$
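These limiting probabilities satisfy the usual steady-state equations:

\[
\pi_j = \sum_{i} \pi_i\, p_{ij} \;\;\text{for all } j, \qquad \sum_{j} \pi_j = 1, \qquad m_{jj} = \frac{1}{\pi_j}.
\]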
Markov Decision Model
• Consider the following application: machine maintenance
• A factory has a machine that deteriorates rapidly in quality
and output and is inspected periodically, e.g., daily
• Inspection declares the machine to be in one of four possible states:
– 0: Good as new
– 1: Operable, minor deterioration
– 2: Operable, major deterioration
– 3: Inoperable
• Let Xt denote this observed state
– evolves according to some “law of motion”, so it is a stochastic process
– Furthermore, assume it is a finite state Markov chain
Markov Decision Model
• Transition matrix is based on the following:
• Once the machine goes inoperable, it stays there until it is repaired
– If there are no repairs, it eventually reaches this state, which is absorbing!
• Repair is an action – a very simple maintenance policy.
– e.g., repair takes the machine from state 3 to state 0
Markov Decision Model
• There are costs as the system evolves:
– State 0: cost 0
– State 1: cost 1000
– State 2: cost 3000
• Replacement cost, taking state 3 to 0, is 4000 (and lost
production of 2000), so cost = 6000
• The modified transition probabilities are:
Markov Decision Model
• Simple question:
What is the average cost of this maintenance policy?
• Compute the steady-state probabilities: how?
• The (long-run) expected average cost per day then follows from these
probabilities and the per-state costs
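Both steps can be sketched numerically in Python; the transition matrix below is an illustrative placeholder (the lecture's matrix is on the slide but not in this transcript), while the per-state costs follow the figures given above:

import numpy as np

# Placeholder 4-state transition matrix for the simple policy
# "do nothing until inoperable, then replace (state 3 -> state 0)".
P = np.array([
    [0.000, 0.875, 0.0625, 0.0625],
    [0.000, 0.750, 0.1250, 0.1250],
    [0.000, 0.000, 0.5000, 0.5000],
    [1.000, 0.000, 0.0000, 0.0000],
])
cost = np.array([0.0, 1000.0, 3000.0, 6000.0])   # expected cost per day, by state

# Steady-state probabilities: solve pi P = pi together with sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

avg_cost = pi @ cost   # long-run expected average cost per day
print("steady state:", pi, "  average cost per day:", avg_cost)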
Markov Decision Model
• Consider a slightly more elaborate policy:
– When inoperable or needing major repairs, replace the machine
• Transition matrix now changes a little bit
• Permit one more thing: overhaul
– Go back to the minor-deterioration state (1) for the next time step
– Not possible if truly inoperable, but the machine can go from major to minor deterioration
• Key point about the system behaviour: it evolves according to
– the “laws of motion”
– the sequence of decisions made (actions from {1: none, 2: overhaul, 3: replace})
• The stochastic process is now defined in terms of {Xt} and {Dt}
– A policy, R, is a rule for making decisions
• Could use all history, although the popular choice is (current-)state-based
Markov Decision Model
• There is a space of potential policies, e.g.,
• Each policy defines a transition matrix, e.g., for Rb
Which policy is best?
Need costs….
Markov Decision Model
• $C_{ik}$ = expected cost incurred during the next transition if the system is
in state $i$ and decision $k$ is made (costs in thousands of dollars)
State    Decision 1 (none)    Decision 2 (overhaul)    Decision 3 (replace)
0               0                      4                        6
1               1                      4                        6
2               3                      4                        6
3               ∞                      ∞                        6
• The long run average expected cost for each policy may be
computed using
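The formula being referred to (the standard long-run average-cost expression; restated here because the slide's equation is not in the transcript) is:

\[
E(C) \;=\; \sum_{i=0}^{3} C_{i\,R(i)}\, \pi_i,
\]

where $R(i)$ is the decision that policy $R$ prescribes in state $i$ and $\pi$ is the steady-state distribution of the Markov chain induced by that policy.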
Rb is best
Markov Decision Processes
Solution using Dynamic Programming
(*some notation changes upcoming)
The RL Problem
Main Elements:
• States, s
• Actions, a
• State transition dynamics – often stochastic & unknown
• Reward ($r$) process – possibly stochastic
Objective: Policy $\pi_t(s,a)$
– a probability distribution over
actions given the current state
Assumption:
Environment defines
a finite-state MDP
Back to Our Recycling Robot MDP
$S = \{\mathrm{high}, \mathrm{low}\}$
$A(\mathrm{high}) = \{\mathrm{search}, \mathrm{wait}\}$
$A(\mathrm{low}) = \{\mathrm{search}, \mathrm{wait}, \mathrm{recharge}\}$
$R^{\mathrm{search}}$ = expected no. of cans while searching
$R^{\mathrm{wait}}$ = expected no. of cans while waiting
$R^{\mathrm{search}} > R^{\mathrm{wait}}$
Given an enumeration of transitions and corresponding
costs/rewards, what is the best sequence of actions?
We want to maximize the criterion:
$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}$
So, what must one do?
The Shortest Path Problem
Finite-State Systems and Shortest Paths
– the state space is a finite set at each stage $k$
– action $a_k$ takes you from $s_k$ to $f_k(s_k, a_k)$ at a cost $g_k(s_k, a_k)$
Length ≈ Cost ≈ sum of the lengths of the arcs
Solve this first
$V_k(i) = \min_j \left[\, a^{k}_{ij} + V_{k+1}(j) \,\right]$
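A compact Python sketch of this backward recursion on a made-up staged graph (the stages, nodes, and arc costs are arbitrary illustrations):

# arc_cost[k][i][j]: cost of the arc from node i at stage k to node j at stage k+1.
arc_cost = [
    {0: {0: 2.0, 1: 5.0}, 1: {0: 4.0, 1: 1.0}},   # stage 0 -> stage 1
    {0: {0: 3.0, 1: 6.0}, 1: {0: 2.0, 1: 2.0}},   # stage 1 -> stage 2
]
terminal_cost = {0: 0.0, 1: 1.0}                   # cost-to-go at the final stage

# Backward recursion: V_k(i) = min_j [ a^k_{ij} + V_{k+1}(j) ].
V = terminal_cost
for k in reversed(range(len(arc_cost))):
    V = {i: min(c + V[j] for j, c in arcs.items())
         for i, arcs in arc_cost[k].items()}

print(V)   # V[i] = length of the shortest path from node i at stage 0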
Value Functions
• The value of a state is the expected return starting from that
state; depends on the agent’s policy:
State-value function for policy $\pi$:
$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right\}$
• The value of taking an action in a state under policy $\pi$ is the
expected return starting from that state, taking that action,
and thereafter following $\pi$:
Action-value function for policy $\pi$:
$Q^{\pi}(s, a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s, a_t = a \right\}$
Recursive Equation for Value
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \gamma^{3} r_{t+4} + \cdots$
$\;\;= r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^{2} r_{t+4} + \cdots \right)$
$\;\;= r_{t+1} + \gamma R_{t+1}$
So:
$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \}$
$\;\;= E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$
Optimality in MDPs – Bellman Equation
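The slide's equations did not survive the transcript; in the notation used above, the standard Bellman optimality equations for a finite MDP are:

\[
V^{*}(s) = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right],
\qquad
Q^{*}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right].
\]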
Policy Evaluation
How do we compute $V^{\pi}(s)$ for an arbitrary policy $\pi$? (The prediction problem)
For a given MDP, this yields a system of simultaneous equations
– as many unknowns as states (a BIG, $|S|$-dimensional linear system!)
Solve iteratively, with a sequence of value functions $V_0, V_1, V_2, \ldots$, each computed from the last by
$V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V_k(s') \right]$
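A short Python sketch of this iteration, using the nested-dictionary MDP encoding from the earlier sketch (names, tolerance, and discount factor are illustrative assumptions):

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iteratively compute V^pi for a finite MDP.

    P[s][a] is a list of (prob, next_state, reward) triples;
    policy[s][a] is the probability of choosing a in s.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(policy[s][a] *
                        sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V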
Policy Improvement
Does it make sense to deviate from $\pi(s)$ at any state (while following the
policy everywhere else)? Let us for now assume a deterministic $\pi(s)$
- Policy Improvement Theorem [Howard/Blackwell]
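In symbols (standard statement, added because the slide's formulas are not in the transcript): if a deterministic policy $\pi'$ satisfies

\[
Q^{\pi}\!\left(s, \pi'(s)\right) \;\ge\; V^{\pi}(s) \quad \text{for all } s \in S,
\]

then $V^{\pi'}(s) \ge V^{\pi}(s)$ for all $s$. In particular, the greedy policy

\[
\pi'(s) = \arg\max_{a} Q^{\pi}(s,a) = \arg\max_{a} \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
\]

is at least as good as $\pi$.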
Computing Better Policies
Starting with an arbitrary policy, we would like to approach truly optimal
policies. So, we compute new policies by acting greedily with respect to the current value function (the improvement rule above).
Are we restricted to deterministic policies? No.
With stochastic policies, the improvement step carries over, provided the new policy puts probability only on actions that maximize $Q^{\pi}(s,a)$.
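Alternating evaluation and greedy improvement yields policy iteration; a compact Python sketch, reusing the hypothetical MDP encoding and the policy_evaluation helper from the earlier sketches:

def policy_iteration(P, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until stable."""
    pi = {s: next(iter(P[s])) for s in P}   # arbitrary initial deterministic policy
    while True:
        # Evaluate the current deterministic policy.
        policy = {s: {a: 1.0 if a == pi[s] else 0.0 for a in P[s]} for s in P}
        V = policy_evaluation(P, policy, gamma)
        # Improve greedily with respect to V.
        stable = True
        for s in P:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q, key=q.get)
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V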
Grid-World Example
Iterative Policy Evaluation in Grid World
Note: The value function can be searched
greedily to find long-term optimal actions