Multi-agent MDP and POMDP Models for Coordination
Markov Models for
Multi-Agent Coordination
Maayan Roth
Multi-Robot Reading Group
April 13, 2005
Multi-Agent Coordination
Teams:
– Agents work together to achieve a common goal
– No individual motivations
Objective:
– Generate policies (individually or globally) that yield the best team performance
Observability [Pynadath and Tambe, 2002]
“Observability” – the degree to which agents can, either individually or as a team, identify the current world state
– Individual observability: MMDP [Boutilier, 1996]
– Collective observability: O1 + O2 + … + On uniquely identify the state, e.g. [Xuan, Lesser, Zilberstein, 2001]
– Collective partial observability: DEC-POMDP [Bernstein et al., 2000], POIPSG [Peshkin, Kim, Meuleau, Kaelbling, 2000]
– Non-observability
Communication [Pynadath and Tambe, 2002]
“Communication” – explicit message-passing from one agent to another
– Free Communication: no cost to send messages; transforms MMDP to MDP, DEC-POMDP to POMDP
– General Communication: communication is available but has cost or is limited
– No Communication: no explicit message-passing
Taxonomy of Cooperative MAS

                             Full            Collective      Collectively Partially   Unobservability
                             Observability   Observability   Observable
No / General Communication   MMDP            DEC-MDP         DEC-POMDP                Open-loop control?
Free Communication           MDP             MDP             POMDP                    Open-loop control?
Complexity

                 Individually    Collectively    Collectively Partially
                 Observable      Observable      Observable
No Comm.         P-complete      NEXP-complete   NEXP-complete
General Comm.    P-complete      NEXP-complete   NEXP-complete
Free Comm.       P-complete      P-complete      PSPACE-complete
Multi-Agent MDP (MMDP) [Boutilier, 1996]
Also called IPSG (identical payoff stochastic game)
M = <S, {Ai}i=1..m, T, R>
– S is the set of possible world states
– {Ai}i=1..m defines the set of joint actions <a1, …, am>, where ai ∈ Ai
– T defines transition probabilities over joint actions
– R is the team reward function
State is fully observable by each agent
P-complete
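Because the state is fully observable by every agent, an MMDP is just a single-agent MDP whose action set is the set of joint actions, so standard value iteration applies. Below is a minimal Python sketch of that reduction on a made-up two-state, two-agent domain; the states, actions, and reward numbers are illustrative, not from the talk.

import itertools

states = ["s1", "s2"]
agent_actions = [["a", "b"], ["a", "b"]]                 # A_1, A_2
joint_actions = list(itertools.product(*agent_actions))  # <a1, a2>

# T[s][ja][s'] = transition probability; R[s][ja] = shared team reward.
# Uniform transitions and a single non-zero reward, purely for illustration.
T = {s: {ja: {s2: 1.0 / len(states) for s2 in states} for ja in joint_actions}
     for s in states}
R = {s: {ja: 0.0 for ja in joint_actions} for s in states}
R["s1"][("a", "a")] = 10.0

def value_iteration(gamma=0.95, iters=100):
    # Treat joint actions exactly like ordinary MDP actions.
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R[s][ja] + gamma * sum(T[s][ja][s2] * V[s2] for s2 in states)
                    for ja in joint_actions)
             for s in states}
    return V

print(value_iteration())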
Coordination Problems
[Figure: Boutilier's coordination example. From s1, joint action <a,*> leads to s2 and <b,*> leads to s3. In s2, matching joint actions <a,a> or <b,b> yield +10 (s5), while mismatched <a,b> or <b,a> yield -10 (s6). From s3, any joint action <*,*> yields a safe +5 (s4).]
Multiple equilibria – two optimal joint policies
Randomization with Learning
Expand the MMDP to include an FSM that selects actions based on whether the agents are coordinated or uncoordinated
[Figure: coordination FSM. In the uncoordinated state U, each agent plays random(a,b). On outcome <a,a> the FSM moves to state A (always play a); on <b,b> it moves to state B (always play b); on <a,b> or <b,a> it stays in U.]
Build the expanded MMDP by adding a coordination mechanism at every state where a coordination problem is discovered
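A minimal Python sketch of the coordination FSM above, assuming the two symmetric actions a and b from the example; the class and method names are my own, not from the talk.

import random

class CoordinationFSM:
    def __init__(self):
        self.state = "U"                       # start uncoordinated

    def act(self):
        if self.state == "U":
            return random.choice(["a", "b"])   # randomize until coordinated
        return "a" if self.state == "A" else "b"

    def observe_joint_action(self, my_action, other_action):
        # Once both agents happen to pick the same action, lock it in.
        if self.state == "U" and my_action == other_action:
            self.state = "A" if my_action == "a" else "B"

Each agent runs its own copy of this FSM; because the transition depends only on the commonly observed joint action, the copies stay in the same FSM state.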
Multi-Agent POMDP
DEC-POMDP [Bernstein et al., 2000], MTDP [Pynadath et al., 2002], POIPSG [Peshkin et al., 2000]
M = <S, {Ai}i=1..m, T, {Ωi}i=1..m, O, R>
– S is the set of possible world states
– {Ai}i=1..m defines the set of joint actions <a1, …, am>, where ai ∈ Ai
– T defines transition probabilities over joint actions
– {Ωi}i=1..m defines the set of joint observations <ω1, …, ωm>, where ωi ∈ Ωi
– O defines observation probabilities over joint actions and joint observations
– R is the team reward function
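For concreteness, here is a sketch of how the DEC-POMDP tuple might be held in code, assuming dictionary-based T, O, and R; the field names and typing choices are illustrative, not part of any of the cited frameworks.

from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[str, ...]        # <a1, ..., am>
JointObs = Tuple[str, ...]           # <w1, ..., wm>

@dataclass
class DecPOMDP:
    states: List[str]                                         # S
    agent_actions: List[List[str]]                            # {A_i}
    agent_observations: List[List[str]]                       # {Omega_i}
    T: Dict[Tuple[str, JointAction], Dict[str, float]]        # P(s' | s, a)
    O: Dict[Tuple[str, JointAction], Dict[JointObs, float]]   # P(w | s', a)
    R: Dict[Tuple[str, JointAction], float]                   # shared team reward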
Complexity [Bernstein, Zilberstein, Immerman, 2000]
For all m ≥ 2, DEC-POMDP with m agents is NEXP-complete
– “For a given DEC-POMDP with finite time horizon T and an integer K, is there a policy which yields a total reward ≥ K?”
– Agents must reason about all possible actions and observations of their teammates
For all m ≥ 3, DEC-MDP with m agents is NEXP-complete
Dynamic Programming [Hansen, Bernstein, Zilberstein, 2004]
Finds optimal solution
Partially observable stochastic game (POSG) framework
– Iterate between a DP step and a pruning step that removes dominated strategies
Experimental results show speed-up in some domains
– Still NEXP-complete, so hyper-exponential in the worst case
No communication
Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
DEC-MDP – collective observability
– Local state transitions
– Each agent observes its local state
Transition independence:
– Individual actions only affect local state transitions
– Team connected through the joint reward
Coverage set algorithm – finds the optimal policy quickly in experimental domains
No communication
Efficient Policy Computation [Nair, Pynadath, Yokoo, Tambe, Marsella, 2003]
Joint Equilibrium-Based Search for Policies (JESP):
– initialize random policies for all agents
– for each agent i:
    fix the policies for all other agents
    find the optimal policy for agent i
– repeat until no policies change
Finds locally optimal policies with potentially exponential speedups over finding the global optimum (see the sketch below)
No communication
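A runnable sketch of JESP's alternating best-response loop. To keep it self-contained, it best-responds in a one-shot identical-payoff matrix game rather than over full DEC-POMDP policies, so the point is the loop structure, not the inner policy optimization; the payoff numbers are illustrative.

import random

actions = ["a", "b"]
# Shared team reward for each joint action (illustrative coordination payoffs).
R = {("a", "a"): 10, ("b", "b"): 10, ("a", "b"): -10, ("b", "a"): -10}

def jesp(num_agents=2, max_iters=100):
    policy = [random.choice(actions) for _ in range(num_agents)]   # random init
    for _ in range(max_iters):
        changed = False
        for i in range(num_agents):
            # Fix everyone else's policy and best-respond for agent i.
            def team_reward(ai):
                joint = tuple(ai if j == i else policy[j] for j in range(num_agents))
                return R[joint]
            best = max(actions, key=team_reward)
            if best != policy[i]:
                policy[i], changed = best, True
        if not changed:        # local equilibrium: no agent wants to deviate
            break
    return policy

print(jesp())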
Communication [Pynadath and Tambe, 2002]
COM-MTDP framework for analyzing communication
– Enhance the model with {Σi}i=1..m, where Σi is the set of possible messages agent i can send
– Reward function includes the cost (≤ 0) of communication
If communication is free:
– DEC-POMDP reducible to a single-agent POMDP (PSPACE-complete)
– Optimal communication policy is to communicate at every time step
Under general communication (no restrictions on cost), DEC-POMDP is still NEXP-complete
– Σi may contain every possible observation history
COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, send the entire observation history since the last SYNC
SYNC brings agents to a synchronized belief over world states
Policies are indexed by the root synchronized belief and the observation history since the last SYNC
[Figure: belief tree for the two-agent tiger domain. The synchronized root belief at t=0 is (SL 0.5, SR 0.5). After a = {Listen, Listen}, observing ω = HL or ω = HR leads to beliefs over (state, teammate-observation) pairs, e.g. for ω = HL: (SL, HL) 0.7225, (SL, HR) 0.1275, (SR, HR) 0.1275, (SR, HL) 0.0225. Choosing a = SYNC at t=2 collapses these back to a synchronized root belief, e.g. (SL 0.5, SR 0.5) or (SL 0.97, SR 0.03).]
“At-most K” heuristic – there must be a SYNC within at most K timesteps
Dec-Comm [Roth, Simmons, Veloso, 2005]
At plan-time, assume communication is free
– Generate a joint policy using a single-agent POMDP solver
Reason over possible joint beliefs to execute the policy
Use communication to integrate local observations into the team belief
Tiger Domain: (States, Actions)
Two-agent tiger problem [Nair et al., 2003]:
Individual Actions: ai ∈ {OpenL, OpenR, Listen}
Robot can open the left door, open the right door, or listen
S: {SL, SR}
Tiger is either behind the left door or behind the right door
Tiger Domain: (Observations)
Individual Observations: ωi ∈ {HL, HR}
Robot can hear the tiger behind the left door or hear the tiger behind the right door
Observations are noisy and independent.
Tiger Domain: (Reward)
Coordination problem – agents must act together for maximum reward
Listen has a small cost (-1 per agent)
Both agents opening the door with the tiger leads to a medium negative reward (-50)
Maximum reward (+20) when both agents open the door with the treasure
Minimum reward (-100) when only one agent opens the door with the tiger
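The stated payoffs can be collected into a joint reward table; below is a Python sketch that fills in only the (state, joint action) combinations described above. The mixed Listen/Open cases are not given on this slide, so they are left out rather than guessed.

# (world state, <a1, a2>) -> team reward, using the numbers stated above.
R = {
    ("SL", ("Listen", "Listen")): -2,    # -1 per agent
    ("SR", ("Listen", "Listen")): -2,
    ("SL", ("OpenL", "OpenL")):  -50,    # both open the tiger door
    ("SR", ("OpenR", "OpenR")):  -50,
    ("SL", ("OpenR", "OpenR")):  +20,    # both open the treasure door
    ("SR", ("OpenL", "OpenL")):  +20,
    ("SL", ("OpenL", "OpenR")): -100,    # only one agent opens the tiger door
    ("SL", ("OpenR", "OpenL")): -100,
    ("SR", ("OpenL", "OpenR")): -100,
    ("SR", ("OpenR", "OpenL")): -100,
}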
Joint Beliefs
Joint belief (b^t) – distribution over world states
Why compute possible joint beliefs?
– action coordination
– transition and observation functions depend on the joint action
– an agent can't accurately estimate the belief if the joint action is unknown
To ensure action coordination, agents can only reason over information known by all teammates
Possible Joint Beliefs
[Figure: tree of possible joint beliefs in the tiger domain; each agent's own observation here is HL. Root L0: b: P(SL) = 0.5, p(b) = 1.0, ω = {}. After a = <Listen, Listen>, the leaves L1 are ω = {HL,HL} with P(SL) = 0.8, p(b) = 0.29; ω = {HL,HR} with P(SL) = 0.5, p(b) = 0.21; ω = {HR,HL} with P(SL) = 0.5, p(b) = 0.21; ω = {HR,HR} with P(SL) = 0.2, p(b) = 0.29.]
How should agents select actions over joint beliefs?
p^t = p^(t-1) · P(ω^t | b^(t-1), a^(t-1))
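A minimal sketch of growing this tree of possible joint beliefs, assuming generic callables T(s, a, s_next) and O(s_next, a, w) for the joint transition and observation probabilities; the function names are illustrative, not from the paper.

def update(belief, p_leaf, joint_action, joint_obs, states, T, O):
    # One leaf expansion: b'(s') ∝ O(s',a,ω) Σ_s T(s,a,s') b(s),
    # and the leaf weight update p' = p · P(ω | b, a).
    unnorm = {s2: O(s2, joint_action, joint_obs) *
                  sum(T(s, joint_action, s2) * belief[s] for s in states)
              for s2 in states}
    prob_obs = sum(unnorm.values())          # P(ω | b, a)
    if prob_obs == 0.0:
        return None                          # impossible joint observation
    new_belief = {s2: v / prob_obs for s2, v in unnorm.items()}
    return new_belief, p_leaf * prob_obs

def expand_leaves(leaves, joint_action, joint_observations, states, T, O):
    # Expand every (belief, p) leaf by every possible joint observation.
    children = []
    for belief, p_leaf in leaves:
        for w in joint_observations:
            child = update(belief, p_leaf, joint_action, w, states, T, O)
            if child is not None:
                children.append(child)
    return children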
Q-POMDP Heuristic
Select a joint action over the possible joint beliefs
Q-MDP (Littman et al., 1995)
– approximate solution to a large POMDP using the underlying MDP
Q_MDP(b) = argmax_a Σ_{s∈S} b(s) V_a(s)
Q-POMDP
– approximate solution to a DEC-POMDP using the underlying single-agent POMDP
Q_POMDP(L^t) = argmax_a Σ_{L_i^t ∈ L^t} p(L_i^t) Q(b(L_i^t), a)
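A sketch of the Q-POMDP selection rule, assuming the leaves are (belief, p) pairs as produced by the expansion sketch above and that Q(belief, joint_action) comes from the single-agent POMDP policy computed at plan-time; q_pomdp is an illustrative name.

def q_pomdp(leaves, joint_actions, Q):
    # argmax over joint actions of the expected Q-value across all
    # possible joint beliefs: Σ_i p(L_i) · Q(b(L_i), a).
    return max(joint_actions,
               key=lambda a: sum(p * Q(belief, a) for belief, p in leaves))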
Q-POMDP Heuristic
Q_POMDP(L^t) = argmax_a Σ_{L_i^t ∈ L^t} p(L_i^t) Q(b(L_i^t), a)
Choose the joint action by computing expected reward over all leaves
[Figure: the same joint-belief tree – root with P(SL) = 0.5, p(b) = 1.0, ω = {}, and leaves ω = {HL,HL}, {HL,HR}, {HR,HL}, {HR,HR} with P(SL) = 0.8, 0.5, 0.5, 0.2 and p(b) = 0.29, 0.21, 0.21, 0.29.]
Agents will independently select the same joint action…
but the action choice is very conservative (always <Listen,Listen>)
DEC-COMM: use communication to add local observations to the joint belief
Dec-Comm Example
[Figure: joint-belief tree after one step; the agent's own observation is HL. Root: P(SL) = 0.5, p(b) = 1.0, ω = {}. Leaves L1: ω = {HL,HL} with P(SL) = 0.8, p(b) = 0.29; ω = {HL,HR} with P(SL) = 0.5, p(b) = 0.21; ω = {HR,HL} with P(SL) = 0.5, p(b) = 0.21; ω = {HR,HR} with P(SL) = 0.2, p(b) = 0.29. The circled nodes are the leaves consistent with the agent's own observation HL.]
aNC = Q-POMDP(L1) = <Listen,Listen>
L* = circled nodes
aC = Q-POMDP(L*) = <Listen,Listen>
Don’t communicate
Dec-Comm Example (cont’d)
[Figure: joint-belief tree after two steps; agent 1's own observations are <HL,HL>. Root: P(SL) = 0.5, p(b) = 1.0, ω = {}. After a = <Listen, Listen>, the L1 leaves have P(SL) = 0.85, 0.5, 0.5, 0.15 with p(b) = 0.29, 0.21, 0.21, 0.29 for ω = {HL,HL}, {HL,HR}, {HR,HL}, {HR,HR}. After a second <Listen, Listen>, the L2 leaves range from P(SL) = 0.97 (history {<HL,HL>, <HL,HL>}, p(b) = 0.12) down to P(SL) = 0.16 (p(b) = 0.06), with the circled nodes being those consistent with agent 1's observations <HL,HL>.]
aNC = Q-POMDP(L2) = <Listen, Listen>
L* = circled nodes
aC = Q-POMDP(L*) = <OpenR,OpenR>
V(aC) - V(aNC) > ε
Agent 1 communicates
…
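A sketch of the communication decision shown above, reusing the illustrative q_pomdp helper from before. The agent compares the action the team would take from the full set of possible joint beliefs against the action it would take if its own observations were known, and communicates when the expected gain exceeds ε. Evaluating both actions over the leaves consistent with the agent's own observations is my reading of V(aC) - V(aNC); the slide does not spell this out.

def decide_communication(all_leaves, my_leaves, joint_actions, Q, epsilon=0.1):
    # all_leaves: every possible joint belief (belief, p) pair.
    # my_leaves: the leaves consistent with this agent's own observation history.
    def value(leaves, a):
        return sum(p * Q(b, a) for b, p in leaves)

    a_nc = q_pomdp(all_leaves, joint_actions, Q)   # action without communicating
    a_c = q_pomdp(my_leaves, joint_actions, Q)     # action if observations were shared
    return value(my_leaves, a_c) - value(my_leaves, a_nc) > epsilon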
Dec-Comm Example (cont’d)
[Figure: the same two-step joint-belief tree as on the previous slide.]
…
Agent 1 communicates <HL,HL>
Agent 2 communicates <HL,HL>
Q-POMDP(L2) = <OpenR, OpenR>
Agents open right door!
References
Becker, R., Zilberstein, S., Lesser, V., Goldman, C. V. Transition-independent decentralized Markov decision processes. In
International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.
Bernstein, D., Zilberstein, S., Immerman, N. The complexity of decentralized control of Markov decision processes. In
Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
Boutilier, C. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint
Conference on Artificial Intelligence, 1999.
Hansen, E. A., Bernstein, D. S., Zilberstein, S. Dynamic programming for partially observable stochastic games. In National
Conference on Artificial Intelligence, 2004.
Nair, R., Roth, M., Yokoo, M., Tambe, M. Communication for improving policy computation in distributed POMDPs. To appear
in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2004.
Nair, R., Pynadath, D., Yokoo, M., Tambe, M., Marsella, S. Taming decentralized POMDPs: Towards efficient policy
computation for multiagent settings. In Proceedings of International Joint Conference on Artificial Intelligence, 2003.
Peshkin, L., Kim, K.-E., Meuleau, N., Kaelbling, L. P. Learning to cooperate via policy search. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 2000.
Pynadath, D. and Tambe, M. The communicative multiagent team decision problem: Analyzing teamwork theories and models.
In Journal of Artificial Intelligence Research, 2002.
Roth, M., Simmons, R., Veloso, M. Reasoning about joint beliefs for execution-time communication decisions. To appear in
International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2005.
Xuan, P., Lesser, V., Zilberstein, S. Communication decisions in multi-agent cooperation: Model and experiments. In
Proceedings of the International Joint Conference on Artificial Intelligence, 2001.