REINFORCEMENT LEARNING
LEARNING TO PERFORM THE BEST ACTIONS BY REWARDS
Tayfun Gürel
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
RL brings a way of programming agents by reward and punishment, without specifying how the task is to be achieved. (Kaelbling, 1996)
 Based on trial-and-error interactions
 A set of problems rather than a set of techniques
[Figure: The standard reinforcement-learning model — the agent receives an input i describing the environment state s, chooses an action a, and receives a scalar reward r]
The reinforcement learning model consists of:
 a discrete set of environment states S;
 a discrete set of agent actions A; and
 a set of scalar reinforcement signals, typically {0, 1} or the real numbers
(different from supervised learning)
An example dialog for the agent-environment relationship:
Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.
...
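The interaction loop in this dialog can be sketched in code. This is a minimal sketch, not from the original slides: the `Environment` interface (`reset`, `available_actions`, `step`) and the dictionary-based policy are illustrative assumptions.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env` is a hypothetical object: it reports the current state, the
# available actions, and returns (reward, next_state) for a chosen action.

import random

def run_episode(env, policy, steps=100):
    """Run one interaction episode and return the total reward collected."""
    state = env.reset()                            # e.g. "state 65"
    total_reward = 0.0
    for _ in range(steps):
        actions = env.available_actions(state)
        action = policy.get(state, random.choice(actions))
        reward, state = env.step(state, action)   # environment replies with r and s'
        total_reward += reward
    return total_reward
```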
Some Examples
Bioreactor
actions: stirring rate, temperature control
states: sensory readings related to chemicals
reward: instant production rate of target chemical

Recycling robot
actions: search for a can, wait, or recharge
states: low battery, high battery
reward: + for having a can, - for running out of battery

ROADMAP:
 Introduction to the problem
→ Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
Models of Optimal Behavior:
The agent tries to maximize one of the following:
 finite-horizon model:  E[ Σ_{t=0}^{h} r_t ]
 infinite-horizon discounted model:  E[ Σ_{t=0}^{∞} γ^t r_t ],  0 ≤ γ < 1
 average-reward model:  lim_{h→∞} E[ (1/h) Σ_{t=0}^{h} r_t ]
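As a concrete illustration (not part of the original slides), the three criteria can be computed for a sampled reward sequence; the infinite-horizon and average-reward cases appear here as finite approximations.

```python
# Illustrative computation of the three optimality criteria for a
# sampled reward sequence r_0, r_1, ..., r_{h-1}.

def finite_horizon_return(rewards):
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    # finite approximation of E[ sum_t gamma^t * r_t ]
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    # finite approximation of lim_{h->inf} (1/h) E[ sum_t r_t ]
    return sum(rewards) / len(rewards)

rewards = [7, -4, 5]   # rewards from the example dialog above
print(finite_horizon_return(rewards), discounted_return(rewards), average_reward(rewards))
```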
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
→ An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
k-armed bandit
 k gambling machines; h pulls are allowed
 The machines are not equivalent: the agent tries to learn their pay-off probabilities
 Tradeoff between exploitation and exploration
Solution Strategy 1: Dynamic Programming Approach
 A belief state: (n_1, w_1, ..., n_k, w_k), where n_i is the number of pulls of machine i so far and w_i is the number of pay-offs it has returned
 Expected pay-off remaining: V*(n_1, w_1, ..., n_k, w_k)
 Probability of action i being paid: ρ_i = (w_i + 1) / (n_i + 2)
If Σ_i n_i = h, then V*(n_1, w_1, ..., n_k, w_k) = 0, because there are no remaining pulls.
So, all values can be recursively computed:
V*(n_1, w_1, ..., n_k, w_k) = max_i [ ρ_i (1 + V*(..., n_i+1, w_i+1, ...)) + (1 − ρ_i) V*(..., n_i+1, w_i, ...) ]
Update the probabilities after each action.
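A sketch of this dynamic programme, under the assumptions above (uniform prior, pay-off probability estimated as (w_i + 1)/(n_i + 2)); the function names and memoization strategy are illustrative, not from the slides.

```python
# Belief-state dynamic programming for a k-armed bandit (illustrative sketch).
# V(belief) is the expected pay-off remaining; pulls stop after h total pulls.

from functools import lru_cache

def bandit_value(k, h):
    """Expected remaining pay-off V* from the empty belief state, for k arms and h pulls."""

    @lru_cache(maxsize=None)
    def V(belief):                          # belief = ((n_1, w_1), ..., (n_k, w_k))
        if sum(n for n, _ in belief) == h:  # no pulls remaining
            return 0.0
        best = 0.0
        for i, (n, w) in enumerate(belief):
            rho = (w + 1) / (n + 2)         # estimated pay-off probability of arm i
            win = list(belief)
            win[i] = (n + 1, w + 1)
            lose = list(belief)
            lose[i] = (n + 1, w)
            value = rho * (1.0 + V(tuple(win))) + (1 - rho) * V(tuple(lose))
            best = max(best, value)
        return best

    return V(tuple((0, 0) for _ in range(k)))

print(bandit_value(k=2, h=3))   # expected total pay-off from 3 optimally chosen pulls
```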
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
→ Markov Decision Process
 Model Free Methods
 Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
MARKOV DECISION PROCESS
The k-armed bandit gives an immediate reward. What about DELAYED REWARD?
Characteristics of an MDP:
 a set of states: S
 a set of actions: A
 a reward function: R: S × A → ℝ
 a state transition function: T: S × A → Π(S)
   T(s, a, s'): the probability of a transition from s to s' using action a
MDP EXAMPLE:
[Figure: an example MDP showing its states and rewards, and its transition function]
Bellman Equation:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
Greedy policy selection:
π(s) = argmax_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
Value Iteration Algorithm:
Initialize V(s) arbitrarily; then repeatedly update, for every state s,
V(s) := max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
Stop iterating when no V(s) changes by more than ε.
Policy difference ratio ≤ 2εγ / (1 − γ)  (Williams & Baird 1993b)
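A minimal value-iteration sketch, assuming the model is given as explicit tables T[s][a] (a dict of next-state probabilities) and R[s][a]; these data structures are assumptions for illustration, not from the slides.

```python
# Tabular value iteration for a small MDP (illustrative sketch).

def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < epsilon:          # stop when no V(s) changes by more than epsilon
            break
    # greedy policy extraction
    policy = {
        s: max(actions, key=lambda a: R[s][a] + gamma *
               sum(p * V[s2] for s2, p in T[s][a].items()))
        for s in states
    }
    return V, policy
```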
AN ALTERNATIVE ITERATION (Singh, 1993), important for model-free learning: iterate over Q values instead,
Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')
Policy Iteration Algorithm:
Choose an arbitrary policy π; repeatedly evaluate it (compute V^π) and improve it greedily until it no longer changes.
Policies converge faster than values. Why faster convergence?
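A corresponding policy-iteration sketch, using iterative sweeps for the evaluation step instead of solving the linear system exactly; it follows the same table conventions as the value-iteration sketch above.

```python
# Tabular policy iteration (illustrative sketch): evaluate, improve, repeat.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=100):
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # policy evaluation (iterative sweeps instead of the exact linear solve)
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
        # policy improvement: act greedily with respect to the evaluated values
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] + gamma *
                       sum(p * V[s2] for s2, p in T[s][a].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return V, policy
```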
POLICY ITERATION ON A GRID WORLD
[Figure sequence: state-value tables over successive iterations of policy iteration on a grid world; the values spread outward from the rewarding cell (0.8, 0.64, 0.51, ...) until convergence, reaching about 0.95 next to the goal]
MDP Graphical Representation
[Figure: an MDP drawn as a graph; arc labels such as β and α denote the transition probabilities T(s, action, s')]
Note the similarity to HMMs.
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
→ Model Free Methods
 Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
Model Free Methods
Models of the environment: T: S × A → Π(S) and R: S × A → ℝ
Do we know them? Do we have to know them?
 Monte Carlo Methods
 Adaptive Heuristic Critic
 Q Learning
Monte Carlo Methods
Idea: hold statistics about the rewards observed from each state and take their average; this average is V(s).
 Based only on experience
 Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected)
 Incremental in an episode-by-episode sense, not a step-by-step sense
Problem: unvisited <s, a> pairs (the problem of maintaining exploration).
For every <s, a>, make sure that P(<s, a> is selected as a start state and action) > 0 (the assumption of exploring starts).
Monte Carlo Control
How to select policies: estimate Q^π(s, a) from episode averages (similar to policy evaluation), then choose the greedy policy π(s) = argmax_a Q(s, a).
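A sketch of first-visit Monte Carlo policy evaluation, assuming a hypothetical `generate_episode(policy)` helper that returns one terminated episode as a list of (state, reward) pairs; the helper and the table layout are illustrative assumptions.

```python
# First-visit Monte Carlo policy evaluation (illustrative sketch).

from collections import defaultdict

def mc_policy_evaluation(generate_episode, policy, n_episodes=1000, gamma=1.0):
    """Estimate V(s) under a fixed policy by averaging observed returns."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy)         # [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}
        for t, (state, _) in enumerate(episode):   # index of each state's first visit
            first_visit.setdefault(state, t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):  # walk backwards, accumulating returns
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:            # record the return only at first visits
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```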
ADAPTIVE HEURISTIC CRITIC & TD(λ)
The AHC splits the problem in two: a critic that learns the value function, and an actor that adjusts the policy using the critic's evaluations.
How the AHC (critic) learns — the TD(0) algorithm:
V(s) := V(s) + α ( r + γ V(s') − V(s) )
TD(λ) generalizes this rule by also updating recently visited states via eligibility traces.
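The TD(0) rule can be written as a one-line update; this sketch assumes a dictionary-backed value table and is illustrative only.

```python
# One TD(0) update step (illustrative sketch).

from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) := V(s) + alpha * (r + gamma * V(s') - V(s)); returns the TD error."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error        # in AHC, this error signal also drives the actor's update

V = defaultdict(float)     # value table with a default of 0 for unseen states
td0_update(V, s="state 65", r=7.0, s_next="state 15")
```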
Q LEARNING
Q values in value iteration:
Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')
But we do not know T and R. Instead, use the sampled update:
Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
If α is decayed properly, the Q values will converge. (Singh 1994)
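A tabular Q-learning sketch with ε-greedy exploration, reusing the hypothetical `env` interface from the earlier interaction-loop sketch; all names and parameters are illustrative.

```python
# Tabular Q-learning with epsilon-greedy exploration (illustrative sketch).

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, steps=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                              # Q[(state, action)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            if random.random() < epsilon:               # explore
                a = random.choice(actions)
            else:                                       # exploit the current estimates
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s_next = env.step(s, a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```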
Q-LEARNING: REMARKS
 Simpler than AHC learning
 Exploration insensitive: the Q values converge to the optimal values regardless of how the agent explores, as long as every <s, a> pair is tried often enough
 The analogue of value iteration for the model-free setting
 The most popular model-free learning algorithm
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
→ Model Based Methods
 Partially Observable Markov Decision Process
 Applications
 Conclusion
Model Based Methods
Model-free methods do not learn the model parameters T and R.
 Inefficient use of data! Learn the model.
Certainty Equivalent Methods: first learn a model of the environment by keeping statistics, then learn the actions to take.
Objections:
 Arbitrary division between the learning phase and the acting phase
 Initial data gathering (how to choose an exploration strategy without knowing the model?)
 What about changes in the environment?
 Better: learn the model and act simultaneously.
DYNA
After taking action a in state s, ending up in s' with reward r:
1. Update the transition model T and the reward function R with the new sample.
2. Update the Q values using the model:
   Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')
3. Do k more such updates for randomly chosen (s, a) pairs.
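A sketch of one Dyna-style step. For simplicity it stores only the last observed outcome per (s, a) pair (Dyna-Q style) rather than full T and R tables, so it is an approximation of the scheme described above; all names are illustrative.

```python
# One Dyna-style step: real update plus k simulated planning updates (sketch).

import random
from collections import defaultdict

def dyna_step(Q, model, s, a, r, s_next, actions, k=10, alpha=0.1, gamma=0.9):
    model[(s, a)] = (r, s_next)                                # 1. update the learned model
    best_next = max(Q[(s_next, act)] for act in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # 2. direct update
    for _ in range(k):                                         # 3. k simulated updates
        ps, pa = random.choice(list(model.keys()))
        pr, ps_next = model[(ps, pa)]
        best = max(Q[(ps_next, act)] for act in actions)
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])

Q = defaultdict(float)   # Q[(state, action)]
model = {}               # (state, action) -> (reward, next_state)
```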
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
→ Partially Observable Markov Decision Process
 Applications
 Conclusion
POMDPs
What if the state information (from the sensors) is noisy? This is mostly the case!
MDP techniques are then suboptimal: two hallways that produce the same sensor readings are not the same state.
POMDPs – A Solution Strategy
SE: a belief-state estimator (can be based on an HMM) that maintains a probability distribution over the hidden states.
π: MDP techniques applied to the belief state.
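A sketch of the belief-state estimator SE as a Bayes filter, assuming the transition and observation models are available as tables T[s][a][s'] and O[s'][a][o]; these table layouts are assumptions for illustration.

```python
# Belief-state update after taking action a and observing o (illustrative sketch).

def belief_update(b, a, o, states, T, O):
    """Return the new belief b'(s') ∝ O(s', a, o) * sum_s b(s) * T(s, a, s')."""
    new_b = {}
    for s_next in states:
        pred = sum(b[s] * T[s][a][s_next] for s in states)   # prediction step
        new_b[s_next] = O[s_next][a][o] * pred               # correction step
    norm = sum(new_b.values())
    return {s: v / norm for s, v in new_b.items()} if norm > 0 else b
```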
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
 Generalization
 Partially Observable Markov Decision Process
→ Applications
 Conclusion
APPLICATIONS
 Juggling robot: dynamic programming (Schaal & Atkeson 1994)
 Box-pushing robot: Q-learning (Mahadevan & Connell 1991a)
 Disk-collecting robot: Q-learning (Mataric 1994)
ROADMAP:
 Introduction to the problem
 Models of Optimal Behavior
 An Immediate Reward Example and Solution
 Markov Decision Process
 Model Free Methods
 Model Based Methods
 Generalization
 Partially Observable Markov Decision Process
 Applications
→ Conclusion
Conclusion
 RL is not supervised learning
 Planning rather than classification
 Poor performance on large problems; new methods are needed (e.g. shaping, imitation, reflexes)
RL/MDPs extend HMMs. How?
An MDP can be drawn as a graph — is it possible to represent it as an HMM?
Relation to HMMs
The recycling robot example revisited as an HMM problem:
[Figure: dynamic Bayesian network with a chain of Battery(t−1), Battery(t), Battery(t+1) nodes influenced by Action(t−1), Action(t), Action(t+1) nodes]
Battery = {low, high}
Action = {wait, search, recharge}
Not representable as an HMM: the actions are chosen by the agent rather than emitted by the hidden state.
HMMs vs MDPs
Once we have the MDP representation of a problem, we can do inference just as with an HMM, by fixing a policy and converting the MDP into a probabilistic automaton. The reverse is not possible, because HMMs have no actions or rewards.
Use HMMs to do probabilistic reasoning over time; use MDPs/RL to optimize behavior.
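As a small illustration of the reduction mentioned above (a sketch under the same table conventions used earlier, not from the slides): fixing a policy collapses the MDP's transition function into an ordinary Markov chain, over which HMM-style inference can then be run.

```python
# Fixing a policy turns T[s][a] (MDP) into chain transitions P[s][s'] (sketch).

def mdp_to_markov_chain(states, T, policy):
    """Return P[s] = T[s][policy[s]], the Markov chain induced by the policy."""
    return {s: dict(T[s][policy[s]]) for s in states}
```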