Solving Interleaved and Blended
Sequential Decision-Making Problems
through Modular Neuroevolution
Jacob Schrum
([email protected])
Risto Miikkulainen
([email protected])
Introduction
• Challenge: Discover behavior automatically
  – Simulations
  – Robotics
  – Video games
• Why challenging?
  – Complex domains
  – Multiple agents
  – Multiple objectives
  – Multimodal behavior required
When is Multimodal Behavior Needed?
• Agent must exhibit multiple behaviors
• Domain consists of multiple tasks
• How to identify/distinguish tasks?
  – Interleaved
    • Each task is separate
    • Sharp borders between tasks
  – Blended
    • Tasks partially overlap
    • Blurred border between tasks
• What types of policies work best?

• Previous results show great success in Ms. Pac-Man (Schrum and Miikkulainen, GECCO 2014)
  – Ms. Pac-Man has blended tasks
  – Neuroevolution discovers own task divisions
• What about interleaved tasks?
  – Introduce a version of Ms. Pac-Man with interleaved tasks
    • Ghosts are all threats or all edible
• What about human-specified task divisions?
  – Evolve multitask networks
    • Mixture of threat/edible ghosts
    • One policy for each task
• Understand problem better with formalism (in paper)
Ms. Pac-Man
• Multimodal behavior needed
• Predator/prey variant
  – Pac-Man takes on both roles
• Goals: Maximize score by
  – Eating all pills in each level
  – Avoiding threatening ghosts
  – Eating ghosts (after power pill)
• Non-deterministic
  – Very noisy evaluations
• Four mazes
  – Behavior must generalize

Human Play
Two Versions of Ms. Pac-Man
(Figure labels: state contains a threat ghost / state contains an edible ghost)
• Blended
  – Original game
  – Edible and threat ghosts present simultaneously
  – Distinct threat/edible regions
  – Four ghosts can be a mix of threat/edible/imprisoned
• Interleaved
  – Modified version
  – Ghosts all edible or all threat
  – Ghosts stay in prison after being eaten
  – Easier to eat ghosts and stay alive in this version
Modular Policy
• One policy consisting of several policies/modules
  – Number preset, or learned
• Means of arbitration also needed
  – Human-specified, or learned via preference neurons
(Figure: network diagrams with Inputs and Outputs, contrasting Multitask (Caruana 1997) and Preference Neurons)
Constructive Neuroevolution
• Genetic Algorithms + Neural Networks
• Three basic mutations + Crossover
  – Perturb Weight
  – Add Connection
  – Add Node
• Other structural mutations possible (NEAT by Stanley and Miikkulainen 2004)
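As a rough sketch (in Python, not the actual MM-NEAT Java code), the three basic mutations could operate on a simple connection-list genome like the hypothetical `Genome` class below, assuming integer neuron ids and at least one existing connection:

```python
import random

class Connection:
    """A weighted link from one neuron to another in the genome."""
    def __init__(self, source, target, weight):
        self.source, self.target, self.weight = source, target, weight

class Genome:
    """Minimal genome: neuron ids plus a list of connections (hypothetical)."""
    def __init__(self, neurons, connections):
        self.neurons = list(neurons)
        self.connections = list(connections)

    def perturb_weight(self, scale=0.5):
        # Nudge the weight of one randomly chosen existing connection.
        conn = random.choice(self.connections)
        conn.weight += random.gauss(0.0, scale)

    def add_connection(self):
        # Link two randomly chosen neurons with a new random weight.
        source, target = random.choice(self.neurons), random.choice(self.neurons)
        self.connections.append(Connection(source, target, random.uniform(-1.0, 1.0)))

    def add_node(self):
        # Split an existing connection by routing it through a new neuron.
        conn = random.choice(self.connections)
        new_id = max(self.neurons) + 1
        self.neurons.append(new_id)
        self.connections.append(Connection(conn.source, new_id, 1.0))
        self.connections.append(Connection(new_id, conn.target, conn.weight))
        self.connections.remove(conn)
```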
Module Mutation
• A mutation that adds a module
• MM(Duplicate) copies a previous module
• New module can diverge as weights change
• Can happen more than once
(Figure: MM(Duplicate); cf. Calabretta et al. 2000)
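Continuing in the same hypothetical style, MM(Duplicate) can be sketched very simply if a modular network's output layer is viewed as a list of per-module weights:

```python
import copy
import random

def module_mutation_duplicate(modules):
    """MM(Duplicate) sketch: append an exact copy of a randomly chosen module.

    'modules' is a hypothetical list in which each entry holds one module's
    output-layer weights (policy + preference neurons). The copy behaves
    identically at first, so behavior is preserved; later weight mutations let
    it diverge. Nothing prevents this mutation from firing more than once.
    """
    modules.append(copy.deepcopy(random.choice(modules)))
    return modules
```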
Experiments
• Evolved using Modular Multiobjective NEAT (http://nn.cs.utexas.edu/?mm-neat)
• Each network type evaluated in 30 runs
  – Standard networks with one module (1M)
  – Nets with two (2M) or three (3M) modules + preference neurons
  – Nets starting with one module, subject to MM(D)
  – Multitask networks with human-specified task divisions
    • Interleaved: one module for edible, one for threats (MT)
    • Blended: one module if any ghost is edible, one otherwise (MT2)
    • Blended: one module for edible, one for threats, one for mixed (MT3)
← Interleaved Results
• All modular approaches superior
• Final differences significant (p < 0.05)
• Higher scores overall (easier domain)
• Different behaviors have similar scores

Blended Results →
• Preference neurons superior
• Multitask as bad as one module
• Harder domain with lower scores
• Certain behaviors distinctly better
• Final differences significant (p < 0.05)
Interleaved Post-Evolution Evaluations: Primary Module Usage vs. Score
(Scatter plot: one cluster of champions uses separate modules for threat and edible ghosts; another uses separate modules for escaping toward a power pill and everything else. SMALL GAP and LITTLE OVERLAP between the clusters.)
Blended Post-Evolution Evaluations: Primary Module Usage vs. Score
(Scatter plot: one cluster uses separate modules for threat and edible ghosts, an imperfect division; another uses separate modules for escaping toward a power pill and everything else. LARGER GAP and MORE OVERLAP between the clusters.)
Threat/Edible Split With Multitask Networks
(Videos: Interleaved Two Module Multitask Champion; Blended Two Module Multitask Champion; Blended Three Module Multitask Champion)
Similar behavior is exhibited in both domains. It is competent, but not outstanding.
Threat/Edible Split With Preference Neurons
(Videos: Interleaved Two Module Champion; Blended Three Module Champion, which only uses two modules)
MM(D) champions are similar. Significant usage of 3+ modules does not occur.
Escape/Luring Behavior (Only Possible With Preference Neurons)
(Videos: Interleaved Two Module Champion; Blended Two Module Champion)
A special “escape” module is rarely used, but its use has a big impact.
• Escaping ghosts when nearly surrounded helps Ms. Pac-Man survive longer.
• Escaping to a power pill after luring ghosts close helps eat ghosts to maximize points.
Discussion
• Obvious division is between edible and threat
  – Works fine in interleaved domain
  – Causes problems in blended domain
    • Strict Multitask divisions do not perform well
• Better division: one module used when surrounded
  – Very asymmetrical: surprising
  – Useful in interleaved domain, vital in blended domain
  – Module activates when Pac-Man is almost surrounded
    • Often leads to eating a power pill: luring
    • Helps Pac-Man escape in other risky situations
Conclusion
• Interleaved tasks
  – Many approaches work well
  – Multitask networks with one module per task are sufficient
  – Preference neurons can still learn a better task division
• Blended tasks
  – Harder scenario, encountered in more realistic domains
  – Human-specified task divisions (Multitask) break down
  – Evolution needs freedom to discover its own task division
  – Results in a novel, successful task division (escaping/luring)
Questions?
Contact me:
[email protected]
Movies:
http://nn.cs.utexas.edu/?interleaved-pm
http://nn.cs.utexas.edu/?blended-pm
Code:
http://nn.cs.utexas.edu/?mm-neat
Auxiliary Slides
Multimodal Behavior
• Animals can perform many different tasks (Flying, Nesting, Foraging)
• Imagine learning a monolithic policy as complex as a cardinal’s behavior: HOW?
• Problem more tractable if broken into component behaviors
Formalism (1/2)
• Markov Decision Process: (states $S$, actions $A$, transition probabilities $P$, rewards $R$)
• Episodic tasks: $e = \langle s_1, \dots, s_m \rangle$ (visited states over $m$ steps)
• Interleaved Tasks
  – Partition of states $\{S_1, \dots, S_T\}$ (they cover $S$ and are pairwise disjoint)
  – Agent does not switch tasks excessively
  – Episode's thrashing rate: $\tau_e = x_e / m$ (transitions over time steps), where
    $x_e = |\{s_i \mid s_i \in S_j \wedge s_{i+1} \in S_k \wedge j \neq k\}|$
    (sequential states in an episode are in different tasks)
  – Domain's thrashing rate: average $\tau_e$ across all $e$ (should be low)
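To make the thrashing rate concrete, here is a small made-up episode (not from the slides or paper):

```latex
% Illustrative only: an episode of m = 6 states whose tasks are
% S_1, S_1, S_2, S_2, S_1, S_1. Consecutive states fall in different
% tasks at two positions (steps 2-3 and 4-5), so x_e = 2 and
\tau_e = \frac{x_e}{m} = \frac{2}{6} \approx 0.33
```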
Formalism (2/2)
• Semantic predicates (high-level understanding of domain)
  – One for each task: $\mu_1, \dots, \mu_T : S \to \{\text{True}, \text{False}\}$
  – $\forall i \in \{1, \dots, T\}\,\forall s \in S_i\,(\mu_i(s) = \text{True})$ (each predicate identifies states in a task)
  – $\forall i, j \in \{1, \dots, T\}\,\forall s \in S_i\,(i \neq j \Rightarrow \mu_j(s) = \text{False})$ (and only that task)
  – Good predicates should naturally lead to a low thrashing rate
  – Define predicate's task: $\text{task}(\mu_i) = \{s \in S \mid \mu_i(s) = \text{True}\} \supseteq S_i$
• Blended Tasks
  – Good predicates do not lead to disjoint tasks: $\text{task}(\mu_i) \cap \text{task}(\mu_j)$ is not empty for some $i \neq j$
  – Could treat the area of overlap as a separate interleaved task, but this is unnatural/forced
  – Blended region: $\exists s \in S\,(\mu_i(s) \wedge \mu_j(s))$ (some states are in two semantic tasks)
  – Non-blended region: $\exists s \in S\,(\mu_i(s) \wedge \neg\mu_j(s))$ (neither task is a subset of the other)
Previous Work in Pac-Man
• Custom Simulators
  – Genetic Programming: Koza 1992
  – Neuroevolution: Gallagher & Ledwich 2007, Burrow & Lucas 2009, Tan et al. 2011
  – Reinforcement Learning: Burrow & Lucas 2009, Subramanian et al. 2011, Bom 2013
  – Alpha-Beta Tree Search: Robles & Lucas 2009
• Screen Capture Competition: Requires Image Processing
  – Evolution & Fuzzy Logic: Handa & Isozaki 2008
  – Influence Map: Wirth & Gallagher 2008
  – Ant Colony Optimization: Emilio et al. 2010
  – Monte-Carlo Tree Search: Ikehata & Ito 2011
  – Decision Trees: Foderaro et al. 2012
• Pac-Man vs. Ghosts Competition: Pac-Man
  – Genetic Programming: Alhejali & Lucas 2010, 2011, 2013, Brandstetter & Ahmadi 2012
  – Monte-Carlo Tree Search: Samothrakis et al. 2010, Alhejali & Lucas 2013
  – Influence Map: Svensson & Johansson 2012
  – Ant Colony Optimization: Recio et al. 2012
• Pac-Man vs. Ghosts Competition: Ghosts
  – Neuroevolution: Wittkamp et al. 2008
  – Evolved Rule Set: Gagne & Congdon 2012
  – Monte-Carlo Tree Search: Nguyen & Thawonmos 2013
Evolved Direction Evaluator
• Inspired by Brandstetter and Ahmadi (CIG 2012)
• Net with a single output and direction-relative sensors
• Each time step, run the net for each available direction
• Pick the direction with the highest net output
(Figure: argmax over Left/Right Preference outputs selects Left)
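A minimal Python sketch of this scheme, where `network` and `sensors_for` are stand-ins for the evolved net and the direction-relative sensor code:

```python
def choose_direction(network, sensors_for, state, available_directions):
    """Evaluate the single-output net once per candidate direction; argmax wins."""
    preferences = {d: network(sensors_for(state, d)) for d in available_directions}
    return max(preferences, key=preferences.get)

# Toy usage with stand-in functions (illustrative only):
toy_net = lambda xs: sum(xs)                      # pretend evolved network
toy_sensors = lambda state, d: [state.get(d, 0)]  # pretend direction-relative sensors
state = {"left": 0.9, "right": 0.2, "up": 0.4}
print(choose_direction(toy_net, toy_sensors, state, ["left", "right", "up"]))  # -> "left"
```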
Preference Neuron Arbitration
• How can the network decide which module to use?
  – Find the preference neuron (grey) with maximum output
  – Corresponding policy neurons (white) define behavior
(Figure: network with Inputs and Outputs [0.6, 0.1, 0.5, 0.7]; since 0.7 > 0.1, Module 2 is used, and its policy neuron's output of 0.5 defines the agent's behavior)
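A sketch of the arbitration step in Python, assuming the output vector groups each module's policy neuron(s) followed by its preference neuron, as in the slide's example:

```python
def arbitrate(outputs, num_modules=2, neurons_per_module=1):
    """Pick the module whose preference neuron is largest; return its policy outputs."""
    block = neurons_per_module + 1  # policy neurons plus one preference neuron
    best_module, best_pref = None, float("-inf")
    for m in range(num_modules):
        pref = outputs[m * block + neurons_per_module]
        if pref > best_pref:
            best_module, best_pref = m, pref
    policy = outputs[best_module * block : best_module * block + neurons_per_module]
    return best_module, policy

# The slide's example: preference 0.7 > 0.1 selects Module 2, whose policy output is 0.5.
print(arbitrate([0.6, 0.1, 0.5, 0.7]))  # -> (1, [0.5])
```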
Pareto-based Multiobjective Optimization (Pareto 1890)
• Imagine a game with two objectives:
  – Damage Dealt
  – Health Remaining
• Attack and retreat modes?
(Plot: tradeoff between objectives; one point has high health but did not deal much damage, another dealt a lot of damage but lost a lot of health)
• $\vec{v}$ dominates $\vec{u}$, i.e. $\vec{v} \succ \vec{u}$, iff
  1. $\forall i \in \{1, \dots, n\}\,(v_i \geq u_i)$, and
  2. $\exists i \in \{1, \dots, n\}\,(v_i > u_i)$
• Non-dominated points are best:
  $A \subseteq F$ is Pareto optimal $\iff$ $A$ contains all points in $F$ s.t. $\forall x \in A\,\neg\exists y \in F\,(y \succ x)$
• Accomplished using NSGA-II (Deb et al. 2000)
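A compact sketch of the dominance test and the Pareto-front filter for maximized objectives (illustrative, not the paper's code):

```python
def dominates(v, u):
    """v dominates u: at least as good in every objective, strictly better in one."""
    return all(vi >= ui for vi, ui in zip(v, u)) and any(vi > ui for vi, ui in zip(v, u))

def pareto_front(points):
    """Keep exactly the points that no other point dominates."""
    return [x for x in points if not any(dominates(y, x) for y in points if y is not x)]

# Toy scores as (damage dealt, health remaining):
scores = [(10, 2), (3, 9), (8, 8), (2, 1)]
print(pareto_front(scores))  # (10, 2), (3, 9), (8, 8) remain; (2, 1) is dominated
```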
Non-dominated Sorting Genetic Algorithm II (Deb et al. 2000)
• Population P of size N; evaluate P
• Use mutation (& crossover) to get P′ of size N; evaluate P′
• Calculate non-dominated fronts of P ∪ P′ (size 2N)
• New population of size N taken from the highest fronts of P ∪ P′
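A simplified single generation of this scheme might look as follows; `evaluate` and `reproduce` are placeholders for domain-specific code, and crowding-distance tie-breaking on the last front is omitted for brevity:

```python
import random

def dominates(v, u):
    return all(a >= b for a, b in zip(v, u)) and any(a > b for a, b in zip(v, u))

def fronts(scored):
    """Repeatedly peel off the non-dominated layer of (genome, score) pairs."""
    remaining, layers = list(scored), []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q[1], p[1]) for q in remaining if q is not p)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers

def nsga2_generation(parents, evaluate, reproduce, N):
    """One generation: combine parents and offspring, keep the best N by front rank."""
    offspring = [reproduce(random.choice(parents)) for _ in range(N)]
    combined = [(g, evaluate(g)) for g in parents + offspring]
    survivors = []
    for layer in fronts(combined):
        take = min(N - len(survivors), len(layer))
        survivors.extend(g for g, _ in layer[:take])  # crowding distance omitted
        if len(survivors) == N:
            break
    return survivors
```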
Direction Evaluator + Modules
• Network is evaluated in each direction
• For each direction, a module is chosen
  – Human-specified (Multitask) or via Preference Neurons
  – Chosen module's policy neuron sets the direction preference
(Figure example:
  [Left Inputs] → [0.5, 0.1, 0.7, 0.6]: 0.6 > 0.1, so Left Preference is 0.7
  [Right Inputs] → [0.3, 0.8, 0.9, 0.1]: 0.8 > 0.1, so Right Preference is 0.3
  0.7 > 0.3, so Ms. Pac-Man chooses to go left, based on Module 2)
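Combining the two previous sketches, per-direction module arbitration could look like this; the toy "network" simply echoes the output vectors from the slide's example:

```python
def modular_direction_choice(network, sensors_for, state, directions,
                             neurons_per_module=1):
    """For each direction, let preference neurons pick a module, then use that
    module's policy output as the direction's preference; go the argmax way."""
    block = neurons_per_module + 1
    best_dir, best_value, best_module = None, float("-inf"), None
    for d in directions:
        outputs = network(sensors_for(state, d))
        prefs = [outputs[m * block + neurons_per_module]
                 for m in range(len(outputs) // block)]
        module = prefs.index(max(prefs))
        value = outputs[module * block]  # chosen module's policy neuron
        if value > best_value:
            best_dir, best_value, best_module = d, value, module
    return best_dir, best_module

# The slide's example: Left outputs [0.5, 0.1, 0.7, 0.6], Right outputs [0.3, 0.8, 0.9, 0.1].
toy_net = lambda xs: xs                  # the "network" just echoes precomputed outputs
toy_sensors = lambda state, d: state[d]  # the "sensors" look up those outputs
state = {"left": [0.5, 0.1, 0.7, 0.6], "right": [0.3, 0.8, 0.9, 0.1]}
print(modular_direction_choice(toy_net, toy_sensors, state, ["left", "right"]))
# -> ('left', 1): left preference 0.7 from Module 2 beats right preference 0.3 from Module 1
```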