Transcript Document

Partially Observable Markov Decision Process
By Nezih Ergin Özkucur
Contents
Markov Decision Process (MDP)
Value Iteration Algorithm
Reinforcement Learning
Partially Observable Markov Decision Process (POMDP)
POMDP vs MDP
Value Function Representation
Exact algorithms
Kalman Filtering
ART2A Network
ARKAQ Learning Algorithm
Markov Decision Process
Consider an agent that must act rationally in an environment.
At each discrete time step, the agent must choose one of the available actions.
Over the long term, the agent tries to obtain good results.
An MDP is a way to model this kind of problem.
By modeling the problem, we can run automated algorithms to solve it.
MDP Components
An MDP can be defined by a tuple (S, A, T, R) where
S is a finite set of states, which describes the situation of the environment.
A is a finite set of actions, from which the agent must choose at each time step.
T (the state transition function) is a mapping from S x A to probability distributions over S. T(s, a, s') is the probability of ending up in state s' when the agent was in state s and chose action a.
R (the reward function) is a mapping from S x A to the real numbers.
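
As a concrete illustration of the tuple above, here is a minimal Python sketch of a made-up two-state MDP; the state names, action names, and numbers are assumptions for illustration, not taken from the slides.

S = ["ok", "broken"]          # finite state set
A = ["use", "repair"]         # finite action set

# T[(s, a)] is a probability distribution over the next state s'
T = {
    ("ok", "use"):        {"ok": 0.9, "broken": 0.1},
    ("ok", "repair"):     {"ok": 1.0, "broken": 0.0},
    ("broken", "use"):    {"ok": 0.0, "broken": 1.0},
    ("broken", "repair"): {"ok": 1.0, "broken": 0.0},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ("ok", "use"): 1.0,      ("ok", "repair"): -0.5,
    ("broken", "use"): -1.0, ("broken", "repair"): -0.5,
}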
Value Iteration Algorithm
A policy (π) is a mapping from S to A which gives the action to select in each state.
The value of a state is the expected long-term return starting from that state.
The algorithm's update rule:
V(s) ← max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ]
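
Below is a minimal Python sketch of this update, reusing the S, A, T, R dictionaries from the earlier example; the discount factor gamma and the stopping threshold eps are assumed values.

def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    # V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') * V(s') ]
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:    # stop once the values stop changing
            return V

# e.g. print(value_iteration(S, A, T, R))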
Q-Learning Algorithm
Action values: Q(s, a) is the expected long-term return of taking action a in state s and acting optimally afterwards.
Update rule:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
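
A minimal Python sketch of one tabular Q-learning step, assuming Q is a dictionary keyed by (state, action) pairs and the agent has just observed a transition (s, a, r, s'); the learning rate alpha is an assumed value.

# Q can be initialized as: Q = {(s, a): 0.0 for s in S for a in A}
def q_update(Q, s, a, r, s_next, A, alpha=0.1, gamma=0.95):
    # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
    target = r + gamma * max(Q[(s_next, a2)] for a2 in A)
    Q[(s, a)] += alpha * (target - Q[(s, a)])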
Partially Observable Markov Decision Process
Consider an MDP in which the agent cannot observe the state completely.
We can model this problem with a POMDP.
A POMDP has 2 more components:
O is the finite observation set.
O(s, a, o) is the probability of making observation o from state s after having taken action a.
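
Continuing the made-up two-state example from the MDP slide, here is a minimal sketch of these two extra components; the observation names and probabilities (a noisy sensor) are assumptions for illustration.

Obs = ["looks_ok", "looks_broken"]    # finite observation set

# O[(s, a)] is a distribution over the observation made from state s
# after having taken action a
O = {
    ("ok", "use"):        {"looks_ok": 0.8, "looks_broken": 0.2},
    ("ok", "repair"):     {"looks_ok": 0.8, "looks_broken": 0.2},
    ("broken", "use"):    {"looks_ok": 0.3, "looks_broken": 0.7},
    ("broken", "repair"): {"looks_ok": 0.3, "looks_broken": 0.7},
}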
Agent’s Internal State
The agent can represent the situation of the environment with belief states.
A belief state (b) is a probability distribution over S. b(s) is the probability of being in state s when the belief state is b.
The next belief can be calculated from the previous one: after taking action a and making observation o,
b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s), normalized so that the entries sum to 1.
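
A minimal Python sketch of this belief update, using the T and O dictionaries from the examples above; the normalization constant equals Pr(o | b, a).

def belief_update(b, a, o, S, T, O):
    # b'(s') is proportional to O(s',a,o) * sum_s T(s,a,s') * b(s)
    b_next = {s2: O[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in S)
              for s2 in S}
    norm = sum(b_next.values())       # Pr(o | b, a)
    return {s2: p / norm for s2, p in b_next.items()}

# e.g. belief_update({"ok": 0.5, "broken": 0.5}, "use", "looks_broken", S, T, O)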
MDP vs POMDP
Belief State Example
Observations: [ goal, non-goal ]
Step 1: b = [ 0.33, 0.33, 0, 0.33 ]
Step 2: b = [ 0, 0.5, 0, 0.5 ]
Step 3: b = [ 0, 0, 0, 1 ]
Value Iteration Algorithm
We can rewrite the transition probabilities and the reward function over beliefs and try to apply the value iteration algorithm (see the sketch below).
The problem is how to represent the value function, and how to iterate over the infinite belief space.
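
To make the rewrite concrete, the usual belief-MDP formulation (a sketch of the standard construction, not copied from the slides) treats each belief b as a state of a new, continuous-state MDP:
ρ(b, a) = Σ_{s} b(s) R(s, a)
V(b) ← max_a [ ρ(b, a) + γ Σ_o Pr(o | b, a) V(b^{a,o}) ]
where b^{a,o} is the belief obtained from b after taking action a and observing o. The difficulty is that b ranges over a continuous simplex, so V cannot be stored as a simple table.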
Value Function Representation
The value function can be approximated with a set of vectors (α-vectors); it has the Piecewise Linear and Convex (PWLC) property: V(b) is the maximum, over the vectors, of the dot product of the vector with b.
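
A minimal Python sketch of evaluating such a PWLC value function, assuming beliefs and vectors are plain lists of per-state values; the example vectors are made up.

def pwlc_value(b, alpha_vectors):
    # V(b) = max over alpha-vectors of the dot product alpha . b
    return max(sum(a_s * b_s for a_s, b_s in zip(alpha, b))
               for alpha in alpha_vectors)

alphas = [[1.0, 0.0], [0.2, 0.8]]       # two alpha-vectors over 2 states
print(pwlc_value([0.6, 0.4], alphas))   # -> 0.6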
Witness Algorithm
Start with a set of belief points b located at the corners of the belief space.
At each iteration, find a witness point: a belief b at which the value given by the current vector set falls short of the exact one-step backup (the condition is sketched after this list).
Calculate the new vector at that point and add it to the vector set.
Stop when no witness point exists.
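
The missing witness condition can be stated roughly as follows (a paraphrase of the witness test from Kaelbling et al., 1998; the notation Γ_a for the current vector set of action a is this sketch's assumption):
there exists a belief b such that Q_a(b) > max_{α in Γ_a} α · b,
where Q_a(b) is the exact one-step backup for action a at b. Such a b is the witness, and the vector generated at b is the one added to the set.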
Incremental Pruning Algorithm
Heuristic Search Value Iteration Algorithm (HSVI)
ARKAQ Learning
Result of ARKAQ Learning Algorithm
4x4 Grid Problem
References
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, Volume 101, pp. 99-134, 1998.
Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of Uncertainty in Artificial Intelligence (UAI), 1997.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
Trey Smith and Reid Simmons. Heuristic Search Value Iteration for POMDPs. In Proc. of UAI, 2004.
Alp Sardag. Autonomous Strategy Planning Under Uncertainty. PhD Thesis, Boğaziçi University, 2006.