Multiagent models for
partially observable environments
Matthijs Spaan
Institute for Systems and Robotics
Instituto Superior Técnico
Lisbon, Portugal
Reading group meeting, March 26, 2007

Overview

Multiagent models for partially observable environments:

Non-communicative models.

Communicative models.

Game-theoretic models.

Some algorithms.

Talk based on survey by Frans Oliehoek (2006).

Multiagent planning frameworks

Aspects:

communication

on-line vs. off-line

centralized vs. distributed

cooperative vs. self-interested

observability

factored reward

Partially observable stochastic games

Partially observable stochastic games (POSGs) (Hansen et al., 2004):

Extension of stochastic games (Shapley, 1953).

Hence self-interested.

Agents do not observe each other’s observations or actions.

Decentralized POMDPs

Decentralized partially observable Markov decision processes (Dec-POMDPs) (Bernstein et al., 2002):

Cooperative version of POSGs.

Only one reward, i.e., reward functions are identical for each agent.

Reward function $R : S \times A_1 \times \cdots \times A_n \to \mathbb{R}$.

Dec-MDPs:

Jointly observable Dec-POMDP: the joint observation $o = (o_1, \ldots, o_n)$ identifies the state.

But each agent only observes $o_i$.

MTDP (Pynadath and Tambe, 2002): essentially identical to Dec-POMDP.
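
The definitions above can be made concrete with a minimal Python sketch that stores a Dec-POMDP-style model and checks the Dec-MDP condition that the joint observation identifies the state. The class name `DecPOMDP`, its field names, and the four-state toy problem are hypothetical and chosen only for illustration. The observation model is simplified to the set of joint observations that can occur in each state, and transitions are left out entirely.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

JointAction = Tuple[str, ...]
JointObs = Tuple[str, ...]


@dataclass
class DecPOMDP:
    """Hypothetical container; field names are illustrative assumptions."""
    states: List[str]
    actions: List[List[str]]                     # actions[i] is A_i
    reward: Callable[[str, JointAction], float]  # R : S x A_1 x ... x A_n -> R
    possible_obs: Dict[str, Set[JointObs]]       # joint observations possible in each state

    def joint_actions(self) -> List[JointAction]:
        """Enumerate A_1 x ... x A_n."""
        return list(product(*self.actions))

    def is_jointly_observable(self) -> bool:
        """Dec-MDP condition: every joint observation occurs in only one state."""
        state_of: Dict[JointObs, str] = {}
        for s, obs_set in self.possible_obs.items():
            for o in obs_set:
                if state_of.setdefault(o, s) != s:
                    return False
        return True

    def agent_identifies_state(self, i: int) -> bool:
        """Would agent i's own observation o_i alone identify the state?"""
        state_of: Dict[str, str] = {}
        for s, obs_set in self.possible_obs.items():
            for o in obs_set:
                if state_of.setdefault(o[i], s) != s:
                    return False
        return True


# Toy problem (assumed): the state is a pair of bits; agent i sees only bit i.
states = ["00", "01", "10", "11"]
model = DecPOMDP(
    states=states,
    actions=[["stay", "go"], ["stay", "go"]],
    reward=lambda s, a: 1.0 if a[0] == a[1] else 0.0,  # one shared reward
    possible_obs={s: {(s[0], s[1])} for s in states},
)

print(model.is_jointly_observable())    # True
print(model.agent_identifies_state(0))  # False
```

In this toy problem the two observations together pin down the state, but neither agent can do so alone, which is the situation the Dec-MDP definition describes.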

Interactive POMDPs

Interactive POMDPs (Gmytrasiewicz and Doshi, 2005):

For self-interested agents.

Each agent keeps a belief over world states and other agents’ models.

An agent’s model: local observation history, policy, observation function.

Leads to infinite hierarchy of beliefs.

Dec-POMDPs with communication

Dec-POMDP-Com (Goldman and Zilberstein, 2004)

Dec-POMDP plus:

$\Sigma$ is the alphabet of all possible messages.

$\sigma_i$ is a message sent by agent $i$.

$C_\Sigma : \Sigma \to \mathbb{R}$ is the cost of sending a message.

Reward depends on message sent: $R(s, a_1, \sigma_1, \ldots, a_n, \sigma_n, s')$.

Instantaneous broadcast communication.

Fixed semantics.

Two policies: for domain-level actions, and for communicating.

Closely related model: Com-MTDP (Pynadath and Tambe, 2002).
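
As a rough illustration of a reward that depends on the messages sent, the sketch below builds $R(s, a_1, \sigma_1, \ldots, a_n, \sigma_n, s')$ from a domain-level reward and a per-message cost table. The additive decomposition (domain reward minus the summed message costs), the possibility of sending no message (`None`), and all names such as `make_com_reward` are assumptions made for this example; the model itself allows the reward to depend on messages in an arbitrary way.

```python
from typing import Callable, Dict, Optional, Sequence

Message = Optional[str]  # assumption: None stands for "send nothing"


def make_com_reward(
    domain_reward: Callable[[str, Sequence[str], str], float],
    message_cost: Dict[str, float],
) -> Callable[[str, Sequence[str], Sequence[Message], str], float]:
    """Combine a domain reward with message costs (assumed to be additive)."""

    def reward(s: str, actions: Sequence[str],
               messages: Sequence[Message], s_next: str) -> float:
        # Sum the cost C_Sigma(sigma_i) over all messages actually sent.
        cost = sum(message_cost[m] for m in messages if m is not None)
        return domain_reward(s, actions, s_next) - cost

    return reward


# Hypothetical two-agent example: both agents are rewarded for naming the state.
cost_table = {"left": 0.5, "right": 0.5}
reward = make_com_reward(
    domain_reward=lambda s, a, s_next: 10.0 if all(ai == s for ai in a) else -1.0,
    message_cost=cost_table,
)

# Agent 1 broadcasts "left"; agent 2 stays silent.
print(reward("left", ("left", "left"), ("left", None), "left"))  # 9.5
```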

Extensive form games

8-card poker:

Dec-POMDP complexity results

                       Observability
Communication          fully   jointly   partial   none
none                   P       NEXP      NEXP      NP
general                P       NEXP      NEXP      NP
free, instantaneous    P       P         PSPACE    NP

Dynamic programming for POSGs

Dynamic programming for POSGs (Hansen et al., 2004).

Uncertainty over state and the other agent’s future conditional plans.

Define value function $V_t$ over state and other agent’s depth-$t$ policy trees: an $|S|$-dimensional vector for each pair of policy trees.

Computing the value function $V_{t+1}$ requires backing up all combinations of all agents’ depth-$t$ policy trees.

Prune (very weakly) dominated strategies.

Optimal for cooperative settings (DEC-POMDP).

Still infeasible for all but the smallest problems.
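
To give a feeling for why exhaustive backups become infeasible so quickly, the following sketch counts policy trees when nothing is pruned. A depth-$(t+1)$ tree is a root action plus one depth-$t$ subtree per observation, so the count grows from $N_t$ to $|A| \cdot N_t^{|O|}$. The function names and the example sizes (2 agents, 3 actions, 2 observations) are assumptions for illustration, not taken from the talk.

```python
def num_policy_trees(num_actions: int, num_obs: int, depth: int) -> int:
    """Number of distinct depth-`depth` policy trees for a single agent."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    count = num_actions  # depth 1: a tree is just one action
    for _ in range(depth - 1):
        # Exhaustive backup: choose a root action and one subtree per observation.
        count = num_actions * count ** num_obs
    return count


def num_joint_policies(num_agents: int, num_actions: int,
                       num_obs: int, depth: int) -> int:
    """Combinations of one tree per agent (same action/observation sizes assumed)."""
    return num_policy_trees(num_actions, num_obs, depth) ** num_agents


for t in range(1, 5):
    print(t, num_policy_trees(3, 2, t), num_joint_policies(2, 3, 2, t))
```

With these sizes a single agent already has 2187 trees at depth 3 and 14348907 at depth 4, which is why pruning dominated trees matters and why only very small problems remain tractable.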

Some algorithms

Joint Equilibrium based Search for Policies (Nair et al., 2003)

Use alternating maximization.

Converges to Nash equilibrium, which is a local optimum.

Keeps belief over state and other agents’ observation histories.

With the other agents’ policies fixed, each agent’s best-response problem is a POMDP; it is transformed to an MDP over the belief states and solved using value iteration.
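
The idea of alternating maximization, and why it can stop in a local optimum, can be sketched on a much simpler problem than a Dec-POMDP. The example below assumes a one-shot game with a single shared payoff matrix, so each "best response" is just an argmax over a row or column instead of the POMDP solved in JESP. The function name and payoff values are hypothetical.

```python
from typing import List, Tuple


def alternating_maximization(payoff: List[List[float]],
                             start: Tuple[int, int],
                             max_rounds: int = 100) -> Tuple[int, int]:
    """Improve one agent's action at a time until neither agent changes."""
    a1, a2 = start
    for _ in range(max_rounds):
        # Agent 1 best-responds while agent 2's action is held fixed.
        new_a1 = max(range(len(payoff)), key=lambda i: payoff[i][a2])
        # Agent 2 best-responds to agent 1's updated action.
        new_a2 = max(range(len(payoff[0])), key=lambda j: payoff[new_a1][j])
        if (new_a1, new_a2) == (a1, a2):
            break  # no agent can improve alone: a Nash equilibrium
        a1, a2 = new_a1, new_a2
    return a1, a2


# Shared payoff: rows are agent 1's actions, columns are agent 2's actions.
payoff = [[5.0, 0.0],
          [0.0, 10.0]]

print(alternating_maximization(payoff, start=(0, 0)))  # (0, 0)
print(alternating_maximization(payoff, start=(0, 1)))  # (1, 1)
```

Starting from `(0, 0)` the procedure stays at payoff 5 because neither agent can improve by deviating alone, even though `(1, 1)` yields 10. The result depends on the starting point, which is the sense in which the equilibrium found is only a local optimum.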

Some algorithms (1)

Set-Coverage algorithm (Becker et al., 2004):

For transition-independent Dec-MDPs with a particular joint reward structure.

Bounded Policy Iteration for Dec-POMDPs (Bernstein et al., 2005):

Optimize a finite-state controller with a bounded size.

Alternating maximization.
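
For readers unfamiliar with finite-state controllers, the sketch below shows what a policy of bounded size looks like for one agent: a fixed number of nodes, an action attached to each node, and a transition to the next node driven by the agent's own observation. The controller here is deterministic for simplicity, which may differ from the representation used in the cited work, and the optimization step itself is not shown. The class, node numbering, and the action and observation names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FiniteStateController:
    """Deterministic controller with a fixed (bounded) number of nodes."""
    action_of: Dict[int, str]              # node -> action to execute
    next_node: Dict[Tuple[int, str], int]  # (node, observation) -> next node
    node: int = 0                          # current node; starts at node 0

    def act(self) -> str:
        return self.action_of[self.node]

    def observe(self, obs: str) -> None:
        self.node = self.next_node[(self.node, obs)]


def run(controller: FiniteStateController, observations: List[str]) -> List[str]:
    """Execute the controller: act, then move on the received observation."""
    actions = []
    for obs in observations:
        actions.append(controller.act())
        controller.observe(obs)
    return actions


# Two-node example: keep listening until something is heard, then open once.
controller = FiniteStateController(
    action_of={0: "listen", 1: "open"},
    next_node={(0, "quiet"): 0, (0, "hear"): 1,
               (1, "quiet"): 0, (1, "hear"): 0},
)
actions = run(controller, ["quiet", "hear", "quiet"])
print(actions)  # ['listen', 'listen', 'open']
```

Because the number of nodes is fixed in advance, memory use does not grow with the horizon, unlike the policy trees used in dynamic programming.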