Transcript Document

Propagating Uncertainty in POMDP Value Iteration with Gaussian Processes
Written by Eric Tuttle and Zoubin Ghahramani
Presented by Hui Li
May 20, 2005
Outline:
• Framework of POMDP
• Framework of Gaussian Process
• Gaussian Process Value Iteration
• Results
• Conclusions
Framework of POMDP
The POMDP is defined by the tuple $\langle S, A, T, R, \Omega, O \rangle$ (a rough array sketch of these components follows this list):
• $S$ is a finite set of states of the world.
• $A$ is a finite set of actions.
• $T: S \times A \to \Pi(S)$ is the state-transition function, the probability that an action changes the world state from one state to another, $T(s, a, s')$.
• $R: S \times A \to \mathbb{R}$ is the reward the agent receives in a given world state after performing an action, $R(s, a)$.
• $\Omega$ is a finite set of observations.
• $O: S \times A \to \Pi(\Omega)$ is the observation function, the probability of making a certain observation after performing a particular action and landing in state $s'$, $O(s', a, o)$.
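For illustration only (this is not from the paper), the finite components of the tuple could be stored as NumPy arrays; the sizes and names below are hypothetical.

    import numpy as np

    # Hypothetical sizes, for illustration only.
    n_states, n_actions, n_obs = 2, 3, 2
    rng = np.random.default_rng(0)

    # T[s, a, s'] = Pr(s' | s, a): transition function; rows over s' sum to 1.
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)

    # R[s, a]: expected reward for taking action a in state s.
    R = rng.random((n_states, n_actions))

    # O[s', a, o] = Pr(o | s', a): observation function; rows over o sum to 1.
    O = rng.random((n_states, n_actions, n_obs))
    O /= O.sum(axis=2, keepdims=True)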
A POMDP agent can be decomposed into two parts: a state estimator (SE) and a policy ($\pi$).
[Figure: the agent-world loop. The WORLD emits an observation o; inside the AGENT, the state estimator SE updates the belief b from the observation and the previous belief, and the policy $\pi$ maps b to an action a that is sent back to the WORLD.]
The goal of a POMDP agent is to select actions which maximize its expected sum of discounted future rewards, $E\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$.
Two functions are most often used in reinforcement learning algorithms:
• value function (a function of the state)
$V^{\pi}(s) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t} = s\right]$
Optimal value function: $V^{*}(s) = \max_{\pi} V^{\pi}(s)$
• Q function (a function of the state-action pair)
$Q^{\pi}(s, a) = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t} = s, a_{t} = a\right]$
Optimal Q function: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$
The key assumption of a POMDP is that the state is unknown, i.e. only partially observable. We rely on the concept of a belief state, denoted b, to represent a probability distribution over states. The belief is a sufficient statistic for a given history:
$b_{t} := \Pr(s_{t} \mid b_{0}, a_{0}, o_{1}, \ldots, a_{t-1}, o_{t}) = \Pr(s_{t} \mid a_{t-1}, o_{t}, b_{t-1})$
After taking an action a and seeing an observation o, the
agent updates its belief state using Bayes’ rule:
bao ( s' )  SE(b, a , o)  P r(s' | o, a, b)


P r(o | s' , a , b) P r(s' | a , b)
P r(o | a, b)
P r(o | s' , a , b) P r(s' | a, b, s ) P r(s | a , b)
sS
P r(o | a, b)
 O ( s ' , a , o )  T ( s , a , s ' ) b( s )
sS
Bellman's equations for the POMDP are as follows:
Bellman's equation for the value function:
$V^{*}(b) = \max_{a}\left[\sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, V^{*}(SE(b, a, o))\right]$
Bellman's equation for the Q function:
$Q^{*}(b, a) = \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, \max_{a'} Q^{*}(SE(b, a, o), a')$
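As a rough sketch (not the paper's algorithm), a single backup of the Q equation at one belief point could look as follows; v_next is a hypothetical stand-in for max over a' of Q*(SE(b, a, o), a'), and the array layout matches the earlier sketches.

    import numpy as np

    def q_backup(b, a, R, T, O, v_next, gamma=0.95):
        """One Bellman backup of Q(b, a) at a single belief point b."""
        immediate = R[:, a] @ b                   # sum_s R(s, a) b(s)
        pred = T[:, a, :].T @ b                   # Pr(s' | a, b)
        future = 0.0
        for o in range(O.shape[2]):
            p_o = O[:, a, o] @ pred               # Pr(o | a, b)
            if p_o > 0:
                b_next = O[:, a, o] * pred / p_o  # SE(b, a, o)
                future += p_o * v_next(b_next)    # value of the updated belief
        return immediate + gamma * future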
Framework of Gaussian Process regression
A Gaussian process regressor defines a distribution over possible
functions that could fit the data. In particular, the distribution of
a function y(x) is a Gaussian process if the probability density
p(y(x1), y(x2), …, y(xN)) for any finite set of points {x1,…, xN} is
a multivariate Gaussian.
Assume we have a Gaussian process with mean 0 and
covariance function K(xi, xj).
Suppose we have observed a set of training points and
target function values D = {(xn,tn), n = 1,…, N}.
t  y  η ~ N (0, C)
 ~ N(0, ), a Gaussian noise. Then C =  + K.
With a new data x’, we have
y( x' ) | ({(xn , tn )nN1}, x' ) ~ N (K(x, x' )T C1t, K ( x' , x' )  K(x, x' )T C1K(x, x' ))
One general choice of covariance function is:
$K(x_{i}, x_{j}) = \theta_{1} \exp\!\left(-\tfrac{1}{2} \| x_{i} - x_{j} \|_{W}^{2}\right) + \theta_{2}, \qquad \| v \|_{W}^{2} = v^{T} W v$
with $W$ a diagonal matrix.
$\theta_{1}$: expected amplitude of the function.
$\theta_{2}$: a bias term that accommodates non-zero-mean functions.
Using maximum likelihood or MAP methods, the
parameters of the covariance function can be tuned.
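A minimal sketch of GP prediction with this covariance; theta1, theta2, the diagonal weight w, and the noise variance sigma2 are hypothetical values, not those used in the paper.

    import numpy as np

    def gp_predict(X, t, x_star, theta1=1.0, theta2=0.1, w=1.0, sigma2=0.01):
        """Posterior mean and variance of y(x*) under
        K(xi, xj) = theta1 * exp(-0.5 * ||xi - xj||_W^2) + theta2."""
        def k(A, B):
            # ||xi - xj||_W^2 with W = w * I (diagonal)
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * w).sum(-1)
            return theta1 * np.exp(-0.5 * d2) + theta2

        C = k(X, X) + sigma2 * np.eye(len(X))     # C = K + Sigma (noise)
        k_star = k(X, x_star[None, :])[:, 0]      # K(x, x*)
        mean = k_star @ np.linalg.solve(C, t)
        var = k(x_star[None, :], x_star[None, :])[0, 0] \
              - k_star @ np.linalg.solve(C, k_star)
        return mean, var

    # Example use with made-up 1-D data:
    X = np.array([[0.0], [1.0], [2.0]])
    t = np.array([0.0, 1.0, 0.5])
    mean, var = gp_predict(X, t, np.array([1.5]))

In practice the hyperparameters theta1, theta2, w, and sigma2 would be tuned by maximum likelihood or MAP, as noted above.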
Gaussian Process Value Iteration
Q function:
$Q_{t}(b, a) = \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} \Pr(o \mid a, b)\, \max_{a'} Q_{t-1}(SE(b, a, o), a')$
Model each of the action value functions $Q(\cdot, a)$ as a Gaussian process. By the definition of a Gaussian process, the values $Q_{t-1}(\cdot, a)$ at the updated beliefs follow a multivariate normal distribution with mean $\mu_{a,bo}$ and covariance $\Sigma_{a,bo}$.
The major problem in computing the distribution of $Q_{t}(b, a)$ is the max operator.
There are two approximate ways of dealing with the max operator:
1. Approximate the max operator as simply passing through the random variable with the highest mean:
$\max_{a'} [Q_{t-1}(SE(b, a, o), a')] \approx Q_{t-1}(SE(b, a, o), a^{*}(bo))$
where $a^{*}(bo) = \arg\max_{a'} \mu_{a', bo}$.
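A small sketch of approximation 1; mu and var are hypothetical arrays holding, for one updated belief, the GP posterior mean and variance of Q_{t-1} under each candidate action.

    import numpy as np

    def max_by_highest_mean(mu, var):
        """Approximation 1: replace the max over actions by the
        component with the largest posterior mean, passed through."""
        a_star = int(np.argmax(mu))   # a*(bo) = argmax_a mu_{a,bo}
        return mu[a_star], var[a_star]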
2. Take into account the effects of the max operator, but
ignore correlations among the function values.
If $q_{1}$ and $q_{2}$ are independent with distributions
$q_{1} \sim N(\mu_{1}, \sigma_{1}^{2}), \qquad q_{2} \sim N(\mu_{2}, \sigma_{2}^{2}),$
then the first two moments of the variable $q = \max(q_{1}, q_{2})$ are given by:
$E(q) = \mu_{1} \Phi(m) + \mu_{2} \Phi(-m) + s\, \phi(m)$
$E(q^{2}) = (\mu_{1}^{2} + \sigma_{1}^{2}) \Phi(m) + (\mu_{2}^{2} + \sigma_{2}^{2}) \Phi(-m) + (\mu_{1} + \mu_{2})\, s\, \phi(m)$
where
$s^{2} = \sigma_{1}^{2} + \sigma_{2}^{2}, \qquad m = (\mu_{1} - \mu_{2}) / s,$
and $\Phi$ is the cdf and $\phi$ the pdf of a zero-mean, unit-variance normal.
q can then be approximated by a Gaussian distribution with these moments.
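A sketch of these moment-matching formulas using SciPy's standard normal cdf and pdf; the function and variable names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def max_gaussian_moments(mu1, var1, mu2, var2):
        """Gaussian moment-matching of q = max(q1, q2) for independent
        q1 ~ N(mu1, var1), q2 ~ N(mu2, var2)."""
        s = np.sqrt(var1 + var2)
        m = (mu1 - mu2) / s
        e_q = mu1 * norm.cdf(m) + mu2 * norm.cdf(-m) + s * norm.pdf(m)
        e_q2 = ((mu1**2 + var1) * norm.cdf(m)
                + (mu2**2 + var2) * norm.cdf(-m)
                + (mu1 + mu2) * s * norm.pdf(m))
        return e_q, e_q2 - e_q**2   # matched mean and variance

The returned mean and variance define the Gaussian used to approximate q, as described above.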
Based on this, we can use a Gaussian distribution to approximate $\max_{a'} [Q_{t-1}(SE(b, a, o), a')]$:
$\max_{a'} [Q_{t-1}(SE(b, a, o), a')] \sim N(\mu_{a^{*}(bo), bo},\; \Sigma_{a^{*}(bo), bo})$
Both methods produce a Gaussian approximation for the max of a set of normally distributed variables. And since $Q_{t}^{a}$ is related to $Q_{t-1}^{*}$ by a linear transformation, we have:
$Q_{t}^{a} = g + G Q_{t-1}^{*} \sim N(G\mu^{*} + g,\; G\Sigma^{*}G^{T})$
Results
Conclusions
• In this paper, the authors presented an algorithm that uses Gaussian processes for approximate value iteration in POMDPs.
• The results using GPs are comparable to those of the classical methods.