Transcript Document
Partially Observable Markov
Decision Process
By
Nezih Ergin Özkucur
Contents
Markov Decision Process (MDP)
Value Iteration Algorithm
Reinforcement Learning
Partially Observable Markov Decision Process
(POMDP)
POMDP vs MDP
Value Function Representation
Exact Algorithms
Kalman Filtering
ART2A Network
ARKAQ Learning Algorithm
Markov Decision Process
Consider an agent that must act rationally in
an environment.
At each discrete time step, the agent must choose
one of the available actions.
In the long run, the agent tries to obtain good
cumulative results.
An MDP is a way to model this kind of problem.
Once the problem is modeled, we can run
automated algorithms to solve it.
MDP Components
An MDP is defined by a tuple (S, A, T, R) where
S is a finite set of states, which describes the situation of
the environment.
A is a finite set of actions, from which the agent must choose
at each time step.
T (state transition function) is a mapping from S x A to
probability distributions over S. T(s,a,s') is the
probability of ending up in state s' when the agent was in state s
and chose action a.
R (reward function) is a mapping from S x A to real
numbers.
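As a concrete illustration, here is a minimal sketch of an MDP written directly as the (S, A, T, R) tuple in Python. The two-state, two-action problem and all of its numbers are hypothetical; only the shape of the components matters.

S = ["s0", "s1"]                  # finite state set
A = ["stay", "move"]              # finite action set

# T[(s, a)] is a probability distribution over next states s'.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "move"): 0.0,
}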
Value Iteration Algorithm
A policy (π) is a mapping from S to A
which gives the action to select in each state.
The value of a state is the expected long-term
return starting from that state.
The algorithm's update rule (γ is the discount factor):
V(s) ← max_a [ R(s,a) + γ Σ_s' T(s,a,s') V(s') ]
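The update rule can be sketched in Python as below, continuing the hypothetical two-state MDP from the previous sketch; the discount factor gamma and the stopping threshold theta are assumed values, not taken from the slides.

gamma, theta = 0.95, 1e-6

V = {s: 0.0 for s in S}
while True:
    delta = 0.0
    for s in S:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') * V(s') ]
        new_v = max(
            R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
            for a in A
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# Greedy policy with respect to the converged values.
policy = {
    s: max(A, key=lambda a: R[(s, a)]
           + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
    for s in S
}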
Q-Learning Algorithm
Action values: Q(s,a) is the expected long-term return of
taking action a in state s and acting well afterwards.
Update rule (α is the learning rate):
Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
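A minimal sketch of the update in Python, reusing the hypothetical MDP above; alpha, gamma, and epsilon are assumed hyperparameters, and the experience tuples (s, a, r, s') are expected to come from interaction with the environment.

import random

alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = {(s, a): 0.0 for s in S for a in A}

def choose_action(s):
    # epsilon-greedy exploration over the current action values
    if random.random() < epsilon:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
    target = r + gamma * max(Q[(s_next, a2)] for a2 in A)
    Q[(s, a)] += alpha * (target - Q[(s, a)])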
Partially Observable Markov Decision
Process
Consider an MDP in which the agent cannot
observe the state completely.
We can model this problem with a POMDP.
A POMDP has two more components:
O is the finite observation set.
O(s,a,o) is the probability of making
observation o from state s after having
taken action a.
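As a sketch, the two extra components can be added to the earlier hypothetical MDP like this; the observation labels and probabilities are made up for illustration.

O_set = ["goal", "non-goal"]       # finite observation set

# Obs[(s, a)] is a probability distribution over observations o,
# i.e. the probability of observing o from state s after action a.
Obs = {
    ("s0", "stay"): {"goal": 0.1, "non-goal": 0.9},
    ("s0", "move"): {"goal": 0.1, "non-goal": 0.9},
    ("s1", "stay"): {"goal": 0.8, "non-goal": 0.2},
    ("s1", "move"): {"goal": 0.8, "non-goal": 0.2},
}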
Agent’s Internal State
The agent can represent the situation of the environment
with belief states.
A belief state (b) is a probability distribution over
S. b(s) is the probability of being in state s when the belief
state is b.
The next belief state can be calculated from the previous one:
b'(s') = O(s',a,o) Σ_s T(s,a,s') b(s) / Pr(o | a, b),
where the denominator is a normalizing factor.
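A minimal sketch of this belief update in Python, using the hypothetical T and Obs dictionaries from the earlier sketches:

def belief_update(b, a, o):
    # b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') * b(s);
    # normalize at the end so the new belief sums to 1.
    b_next = {}
    for s_next in S:
        pred = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in S)
        b_next[s_next] = Obs[(s_next, a)][o] * pred
    norm = sum(b_next.values())
    if norm > 0.0:
        b_next = {s2: p / norm for s2, p in b_next.items()}
    return b_next

# Example: start from a uniform belief and incorporate one step.
b0 = {s: 1.0 / len(S) for s in S}
b1 = belief_update(b0, "move", "goal")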
MDP vs POMDP
Belief State Example
Observations: [ goal, non-goal ]
Step 1: b = [ 0.33  0.33  0  0.33 ]
Step 2: b = [ 0  0.5  0  0.5 ]
Step 3: b = [ 0  0  0  1 ]
Value Iteration Algorithm
We can rewrite the transition probabilities and the reward
function over belief states, e.g. ρ(b,a) = Σ_s b(s) R(s,a),
and then try to apply the value iteration algorithm.
The problem here is how to represent the value
function, and how to iterate over the infinite belief
space.
Value Function Representation
The value function can be approximated with a set of vectors;
the resulting function has the Piecewise Linear and Convex (PWLC) property:
V(b) = max_α Σ_s α(s) b(s)
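A minimal sketch of evaluating such a representation: the value of a belief is the maximum dot product between the belief and a set of alpha-vectors. The two vectors below are hypothetical.

import numpy as np

Gamma = [np.array([1.0, 0.0]),    # alpha-vector that is best near s0
         np.array([0.2, 0.8])]    # alpha-vector that is best near s1

def value(b, Gamma):
    # V(b) = max_{alpha in Gamma} sum_s alpha(s) * b(s)
    return max(float(np.dot(alpha, b)) for alpha in Gamma)

print(value(np.array([0.5, 0.5]), Gamma))   # 0.5 for these vectors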
Witness Algorithm
Start with a set of belief states b located at the corners of the belief space.
At each iteration, find a witness point: a belief state at which
the current vector set does not give the correct backed-up value.
Calculate the new vector at the witness point and add it to the
vector set (a point-based backup sketch follows below).
Stop when no witness point can be found.
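The "calculate new vector" step can be sketched as the standard point-based backup at a belief point; this is not the full witness machinery, and the array layout (transition matrices, observation matrices, reward vectors, discount factor) is assumed here for illustration.

import numpy as np

def backup(b, Gamma, T, Obs, R, gamma=0.95):
    # Compute the new alpha-vector at belief b, assuming:
    #   T[a]   : |S| x |S| matrix with T[a][s, s'] = T(s, a, s')
    #   Obs[a] : |S| x |O| matrix with Obs[a][s', o] = O(s', a, o)
    #   R[a]   : length-|S| vector with R[a][s] = R(s, a)
    #   Gamma  : list of current alpha-vectors (length-|S| arrays)
    best_vec, best_val = None, -np.inf
    for a in range(len(T)):
        alpha_a = R[a].astype(float).copy()
        for o in range(Obs[a].shape[1]):
            # For each observation, pick the current vector that is
            # best at b after the (a, o) transition.
            cand = [T[a] @ (Obs[a][:, o] * g) for g in Gamma]
            alpha_a = alpha_a + gamma * max(cand, key=lambda v: float(b @ v))
        if float(b @ alpha_a) > best_val:
            best_vec, best_val = alpha_a, float(b @ alpha_a)
    return best_vec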
Incremental Pruning Algorithm
Heuristic Search Value Iteration Algorithm
(HSVI)
ARKAQ Learning
Result of ARKAQ Learning Algorithm
4x4 Grid Problem
References
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra.
Planning and acting in partially observable stochastic domains.
Artificial Intelligence, 101:99-134, 1998.
Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang.
Incremental pruning: A simple, fast, exact method for partially
observable Markov decision processes. In Proc. of Uncertainty in
Artificial Intelligence (UAI), 1997.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning:
An Introduction. MIT Press, Cambridge, MA, 1998.
Trey Smith and Reid Simmons. Heuristic search value iteration for
POMDPs. In Proc. of UAI, 2004.
Alp Sardag. Autonomous Strategy Planning Under Uncertainty.
PhD Thesis, Boğaziçi University, 2006.