Transcript Document
Propagating Uncertainty In POMDP
Value Iteration with Gaussian Process
Written by Eric Tuttle and Zoubin Ghahramani
Presented by Hui Li
May 20, 2005
Outline:
• Framework of POMDP
• Framework of Gaussian Process
• Gaussian Process Value Iteration
• Results
• Conclusions
Framework of POMDP
The POMDP is defined by the tuple $\langle S, A, T, R, \Omega, O \rangle$:
S is a finite set of states of the world.
A is a finite set of actions.
$T: S \times A \rightarrow \Pi(S)$ is the state-transition function, the probability of an action changing the world state from one to another, $T(s, a, s')$.
$R: S \times A \rightarrow \mathbb{R}$ is the reward for the agent in a given world state after performing an action, $R(s, a)$.
$\Omega$ is a finite set of observations.
$O: S \times A \rightarrow \Pi(\Omega)$ is the observation function, the probability of making a certain observation after performing a particular action and landing in state $s'$, $O(s', a, o)$.
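As a concrete illustration of the tuple, here is a minimal sketch in Python of a hypothetical two-state, two-action, two-observation POMDP; all names and numbers are made up for illustration and are not from the paper.

import numpy as np

# Hypothetical 2-state / 2-action / 2-observation POMDP <S, A, T, R, Omega, O>.
# All numbers below are illustrative only.
S = ["s0", "s1"]          # states
A = ["a0", "a1"]          # actions
Omega = ["o0", "o1"]      # observations

# T[a][s, s'] = T(s, a, s') = Pr(s' | s, a): state-transition function
T = {
    "a0": np.array([[0.9, 0.1],
                    [0.2, 0.8]]),
    "a1": np.array([[0.5, 0.5],
                    [0.5, 0.5]]),
}

# R[a][s] = R(s, a): immediate reward
R = {"a0": np.array([1.0, 0.0]),
     "a1": np.array([0.0, 2.0])}

# O[a][s', o] = O(s', a, o) = Pr(o | s', a): observation function
O = {
    "a0": np.array([[0.8, 0.2],
                    [0.3, 0.7]]),
    "a1": np.array([[0.6, 0.4],
                    [0.4, 0.6]]),
}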
A POMDP agent can be decomposed into two parts: a state estimator (SE) and a policy ($\pi$).
[Figure: the agent's state estimator SE turns observations o from the WORLD into a belief b; the policy $\pi$ maps b to an action a, which acts back on the WORLD.]
The goal of a POMDP agent is to select actions which maximize its expected total sum of future rewards, $E\left[\sum_{t=0}^{T} \gamma^t r_t\right]$.
Two functions are most often used in reinforcement learning algorithms:
• value function (function of state)
$$v^{\pi}(s) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right]$$
Optimal value function: $v^*(s) = \max_{\pi} v^{\pi}(s)$
• Q function (function of state-action)
$$Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right]$$
Optimal Q function: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$
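To make the value-function definition concrete, here is a small sketch that estimates $v^{\pi}(s)$ by averaging discounted returns over Monte Carlo rollouts; the step and policy callables are assumed stand-ins for an environment and a policy, not anything defined in the paper.

import numpy as np

def mc_value_estimate(step, policy, s0, gamma=0.95, horizon=100, n_rollouts=1000):
    """Estimate v^pi(s0) = E_pi[ sum_k gamma^k r_{t+k+1} | s_t = s0 ] by averaging
    truncated discounted returns. step(s, a) -> (s', r) and policy(s) -> a are
    hypothetical stand-ins."""
    returns = []
    for _ in range(n_rollouts):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = step(s, a)
            g += discount * r
            discount *= gamma
        returns.append(g)
    return np.mean(returns)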
The key assumption of POMDP is that the state is unknown, only partially observable. We rely on the concept of a belief state, denoted b, to represent a probability distribution over states. The belief is a sufficient statistic for a given history:
$$b_t := \Pr(s_t \mid b_0, a_0, o_1, \ldots, o_{t-1}, a_{t-1}, o_t) = \Pr(s_t \mid a_{t-1}, o_t, b_{t-1})$$
After taking an action a and seeing an observation o, the agent updates its belief state using Bayes' rule:
$$b_a^o(s') = SE(b, a, o) = \Pr(s' \mid o, a, b) = \frac{\Pr(o \mid s', a, b)\,\Pr(s' \mid a, b)}{\Pr(o \mid a, b)} = \frac{\Pr(o \mid s', a, b)\sum_{s \in S}\Pr(s' \mid a, b, s)\,\Pr(s \mid a, b)}{\Pr(o \mid a, b)} = \frac{O(s', a, o)\sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}$$
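A direct transcription of this update in Python, reusing the T and O arrays from the earlier sketch (the normalizer $\Pr(o \mid a, b)$ is simply the sum of the unnormalized terms over s'):

import numpy as np

def belief_update(b, a, o, T, O):
    """SE(b, a, o): Bayes' rule update of the belief vector b after action a
    (a dict key) and observation o (an index). T[a][s, s'] = T(s, a, s');
    O[a][s', o] = O(s', a, o)."""
    # Unnormalized posterior: O(s', a, o) * sum_s T(s, a, s') b(s)
    b_next = O[a][:, o] * (b @ T[a])
    # Normalize by Pr(o | a, b)
    pr_o = b_next.sum()
    if pr_o == 0.0:
        raise ValueError("Observation o has zero probability under (b, a).")
    return b_next / pr_o, pr_o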
Bellman's equations for POMDP are as follows:
Bellman's equation for the value function:
$$V^*(b) = \max_{a}\left[\sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, V^*(SE(b, a, o))\right]$$
Bellman's equation for the Q function:
$$Q^*(b, a) = \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, \max_{a'} Q^*(SE(b, a, o), a')$$
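As an illustration of the Q-function backup, here is a sketch of a single backup for one (b, a) pair, reusing belief_update and the R, T, O structures above; V_next stands for whatever estimate of the value of the updated beliefs is available, and gamma is the discount factor (both names are assumptions of this sketch, not the paper's notation).

def q_backup(b, a, T, R, O, V_next, gamma=0.95):
    """One-step backup: Q(b, a) = sum_s R(s, a) b(s)
       + gamma * sum_o Pr(o | a, b) * V_next(SE(b, a, o))."""
    immediate = float(R[a] @ b)
    future = 0.0
    n_obs = O[a].shape[1]
    for o in range(n_obs):
        try:
            b_next, pr_o = belief_update(b, a, o, T, O)
        except ValueError:        # Pr(o | a, b) = 0: this observation contributes nothing
            continue
        future += pr_o * V_next(b_next)
    return immediate + gamma * future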
Framework of Gaussian Process regression
A Gaussian process regressor defines a distribution over possible
functions that could fit the data. In particular, the distribution of
a function y(x) is a Gaussian process if the probability density
p(y(x1), y(x2), …, y(xN)) for any finite set of points {x1,…, xN} is
a multivariate Gaussian.
Assume we have a Gaussian process with mean 0 and covariance function $K(x_i, x_j)$.
Suppose we have observed a set of training points and target function values $D = \{(x_n, t_n),\; n = 1, \ldots, N\}$.
$$t = y + \eta \;\sim\; N(0, C)$$
with $\eta \sim N(0, \sigma^2 I)$, a Gaussian noise term. Then $C = \sigma^2 I + K$.
For a new data point $x'$, we have
$$y(x') \mid \{(x_n, t_n)\}_{n=1}^{N}, x' \;\sim\; N\!\left(K(\mathbf{x}, x')^T C^{-1} t,\; K(x', x') - K(\mathbf{x}, x')^T C^{-1} K(\mathbf{x}, x')\right)$$
where $K(\mathbf{x}, x')$ denotes the vector of covariances between the training inputs and $x'$.
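A minimal numerical sketch of these predictive equations; the function and argument names (gp_predict, k, sigma2) are mine, and the kernel k is assumed to be supplied.

import numpy as np

def gp_predict(X, t, x_new, k, sigma2):
    """GP regression posterior at x_new given training inputs X (list of points)
    and targets t (length N). Returns the predictive mean and variance."""
    N = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    C = K + sigma2 * np.eye(N)                       # C = K + noise covariance
    k_star = np.array([k(X[i], x_new) for i in range(N)])
    mean = k_star @ np.linalg.solve(C, t)            # K(x, x')^T C^{-1} t
    var = k(x_new, x_new) - k_star @ np.linalg.solve(C, k_star)
    return mean, var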
One general choice of covariance function is:
$$K(x_i, x_j) = \theta_1 \exp\!\left(-\tfrac{1}{2}\, \|x_i - x_j\|^2_{W}\right) + \theta_2, \qquad \|v\|^2_{W} = v^T W v$$
with W a diagonal matrix.
$\theta_1$: expected amplitude of the function
$\theta_2$: a bias term that accommodates non-zero-mean functions.
Using maximum likelihood or MAP methods, the
parameters of the covariance function can be tuned.
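In code, this covariance might look as follows; theta1, theta2 and the diagonal weights w are the hyperparameters that would be tuned by maximum likelihood or MAP, and the symbol names follow the reconstruction above rather than the paper.

import numpy as np

def weighted_se_kernel(xi, xj, theta1=1.0, theta2=0.1, w=None):
    """K(xi, xj) = theta1 * exp(-0.5 * (xi - xj)^T W (xi - xj)) + theta2,
    with W = diag(w). theta1 sets the amplitude, theta2 a constant bias."""
    xi = np.atleast_1d(xi).astype(float)
    xj = np.atleast_1d(xj).astype(float)
    if w is None:
        w = np.ones_like(xi)
    d = xi - xj
    return theta1 * np.exp(-0.5 * float(d @ (w * d))) + theta2

It can be plugged into gp_predict above, e.g. gp_predict(X, t, x_new, weighted_se_kernel, sigma2=0.01).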
Gaussian Process Value Iteration
Q function:
$$Q_t(b, a) = \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, \max_{a'} Q_{t-1}(SE(b, a, o), a')$$
Model each of the action value functions $Q(\cdot, a)$ as a Gaussian process. According to the definition of the Gaussian process, the values of $Q_{t-1}(\cdot, a)$ at the updated beliefs follow a multivariate normal distribution with mean $\mu_{a, b^o}$ and covariance $\Sigma_{a, b^o}$.
The major problem in computing the distribution of $Q_t(b, a)$ is the max operator.
Two approximate ways of dealing with the max operator:
1. Approximate the max operator as simply passing through the random variable with the highest mean:
$$\max_{a}\left[Q_{t-1}(SE(b, a, o), a)\right] \approx Q_{t-1}(SE(b, a, o), a^*(b^o)), \qquad \text{where } a^*(b^o) = \arg\max_{a} \mu_{a, b^o}$$
2. Take into account the effects of the max operator, but ignore correlations among the function values.
If $q_1$ and $q_2$ are independent with distributions
$$q_1 \sim N(\mu_1, \sigma_1^2), \qquad q_2 \sim N(\mu_2, \sigma_2^2),$$
then the first two moments of the variable $q = \max(q_1, q_2)$ are given by:
$$E(q) = \mu_1 \Phi(m) + \mu_2 \Phi(-m) + s\,\phi(m)$$
$$E(q^2) = (\mu_1^2 + \sigma_1^2)\,\Phi(m) + (\mu_2^2 + \sigma_2^2)\,\Phi(-m) + (\mu_1 + \mu_2)\, s\,\phi(m)$$
where
$$s^2 = \sigma_1^2 + \sigma_2^2, \qquad m = (\mu_1 - \mu_2)/s,$$
and $\Phi$ is the cdf and $\phi$ the pdf of a zero-mean, unit-variance normal.
The variable q can then be approximated by a Gaussian distribution with these moments. Based on that, we can use a Gaussian distribution to approximate $\max_{a}\left[Q_{t-1}(SE(b, a, o), a)\right]$:
$$\max_{a}\left[Q_{t-1}(SE(b, a, o), a)\right] \sim N\!\left(\mu_{a^*(b^o), b^o},\; \Sigma_{a^*(b^o), b^o}\right)$$
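A small numerical sketch of both approximations for a pair of independent action values (the moment expressions follow the formulas above; Phi and phi come from scipy.stats.norm; the function names are mine):

import numpy as np
from scipy.stats import norm

def max_by_highest_mean(mu, var):
    """Method 1: pass through the random variable with the highest mean."""
    i = int(np.argmax(mu))
    return mu[i], var[i]

def max_moment_match(mu1, var1, mu2, var2):
    """Method 2: moment-match q = max(q1, q2) for independent q1 ~ N(mu1, var1),
    q2 ~ N(mu2, var2), then treat q as Gaussian with these moments."""
    s = np.sqrt(var1 + var2)
    m = (mu1 - mu2) / s
    Phi, phi = norm.cdf(m), norm.pdf(m)
    e_q = mu1 * Phi + mu2 * (1.0 - Phi) + s * phi
    e_q2 = (mu1**2 + var1) * Phi + (mu2**2 + var2) * (1.0 - Phi) + (mu1 + mu2) * s * phi
    return e_q, e_q2 - e_q**2     # mean and variance of the Gaussian approximation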
Both methods produce a Gaussian approximation for the max of a set of normally distributed vectors. And since $Q_t^a$ is related to $Q_{t-1}^*$ by a linear transformation, we have:
$$Q_t^a = g + G\, Q_{t-1}^* \;\sim\; N\!\left(G\mu^* + g,\; G\Sigma^* G^T\right)$$
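Propagating the Gaussian through this linear backup is a single line of linear algebra; a minimal sketch, assuming g, G, mu_star and Sigma_star have already been assembled:

import numpy as np

def propagate_linear(g, G, mu_star, Sigma_star):
    """If Q* ~ N(mu_star, Sigma_star) and Q_t^a = g + G Q*, then
    Q_t^a ~ N(G mu_star + g, G Sigma_star G^T)."""
    mean = G @ mu_star + g
    cov = G @ Sigma_star @ G.T
    return mean, cov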
Results
Conclusions
• In this paper, the authors presented an algorithm, Gaussian process value iteration, for approximate value iteration in POMDPs.
• The results using GPs are comparable to those of the classical methods.