schrum.proposal - Neural Network Research Group


Evolving Multimodal Behavior
PhD Proposal
Jacob Schrum
11/4/09
Introduction

Challenge: Discover behavior automatically
 Simulations, video games, robotics

Why challenging?
 Noisy sensors
 Complex domains
 Continuous states/actions
 Multiple agents, teamwork
 Multiple objectives
 Multimodal behavior required (focus)
What is Multimodal Behavior?

Working definition:
 Agent exhibits distinct kinds of actions under different circumstances

Examples:
 Offensive & defensive modes in soccer
 Search for weapons or opponents in a video game
 Animal with foraging & fleeing modes

Very important for teams
 Roles correspond to modes
 Example domains will involve teamwork
Previous Approaches

Design Approaches
 Hand code in a structured manner

Value-Function Based Approaches
 Learn the utility of actions (RL)

Evolutionary Approaches
 Selectively search based on performance
Design

Subsumption Architecture (Brooks 1986)
 Hierarchical design
 Lower levels independent of higher levels
 Built incrementally
 Common in robotics
 Hand coded
Value-Function Based (1/2)

MAXQ (Dietterich 1998)
 Hand-designed hierarchy
 TD learning at multiple levels
 Reduces state space
 Taxi domain
 Still just a grid world
 Discrete states & actions
Value-Function Based (2/2)

Basis Behaviors (Matarić 1997)
 Low-level behaviors pre-defined
 Learn high-level control
 Discrete state space
 High-level features (conditions)
 Reward shaping necessary
 Applied to real robots
 Too much expert knowledge
Evolutionary (1/2)

Layered Evolution (Togelius 2004)
 Evolve components of subsumption architecture
 Applied to:
 EvoTanks (Thompson and Levine 2008)
 Unreal Tournament 2004 (van Hoorn et al. 2009)
 Must specify:
 Hierarchy
 Training tasks
 Similar to Layered Learning (Stone 2000)
Evolutionary (2/2)

Neuro-Evolving Robotic Operatives (NERO)
(Stanley et al. 2005)
 Machine learning game
 Train robot army
 Many objectives
 Weighted sum: z-scores method
 User changes weights during training
 Dynamic objective management leads to multimodal behavior
Multiple Objectives

Multimodal problems are typically multiobjective
 Modes associated with objectives

Traditional: weighted sum (Cohon 1978)
 Must tune the weights
 Only one solution
 Each point on the tradeoff surface corresponds to one specific set of weights
 Bad for non-convex surfaces: non-convex regions cannot be captured by a weighted sum
 Need a better formalism
Greatest Mass Sarsa
(Sprague and Ballard 2003)

Multiple MDPs with shared action space
 Learn each component via the Sarsa(0) update rule:
Q_i(s,a) ← Q_i(s,a) + α[r_i + γ·Q_i(s′,a′) − Q_i(s,a)]
 Best value is the sum of the component values:
a* = argmax_a Σ_i Q_i(s,a)
 Used in sidewalk navigation task
 Like weighted sum
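The update and arbitration rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the states, actions, and rewards are hypothetical, and each component MDP keeps its own Q-table over a shared action space.

```python
# Sketch of Greatest Mass Sarsa (Sprague and Ballard 2003): each component
# MDP keeps its own Q-table, updated with Sarsa(0); the agent acts on the
# sum of the component values ("greatest mass").

class GreatestMassSarsa:
    def __init__(self, n_components, actions, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.alpha = alpha
        self.gamma = gamma
        # One Q-table per component MDP, all sharing the same action space.
        self.Q = [dict() for _ in range(n_components)]

    def q(self, i, s, a):
        return self.Q[i].get((s, a), 0.0)

    def act(self, s):
        # Greatest mass: pick the action whose summed component value is best.
        return max(self.actions,
                   key=lambda a: sum(self.q(i, s, a) for i in range(len(self.Q))))

    def update(self, s, a, rewards, s2, a2):
        # Sarsa(0) update applied independently to each component Q-table.
        for i, r in enumerate(rewards):
            target = r + self.gamma * self.q(i, s2, a2)
            self.Q[i][(s, a)] = self.q(i, s, a) + self.alpha * (target - self.q(i, s, a))

agent = GreatestMassSarsa(n_components=2, actions=["left", "right"])
agent.update("s0", "right", [1.0, 0.5], "s1", "right")
print(agent.act("s0"))  # "right": only action with positive summed value
```

Because the components are simply summed, this behaves like a fixed equal-weight weighted sum, which is why it shares the weighted sum's limitations.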

Convex Hull Iteration
(Barrett and Narayanan 2008)

Changes MDP formalism:
 Vector-valued reward
 Find solutions for all possible weightings w: maximize w · V
 Results in a compact set of solutions
 Different trade-offs
 Cannot capture non-convex surfaces
 Discrete states/actions only

Need a way to capture non-convex surfaces!

Pareto-based Multiobjective Optimization
(Pareto 1890)

Imagine a game with two objectives:
 Damage Dealt
 Health Remaining

A score vector v dominates u iff:
1. v is at least as good as u in every objective, and
2. v is strictly better than u in at least one objective.

The points of a population that are not dominated are the best: the Pareto Front
 One extreme: high health, but did not deal much damage
 Other extreme: dealt a lot of damage, but lost lots of health
 Tradeoff between the objectives
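The dominance test and the resulting Pareto front can be written directly from the definition above. The sample scores are hypothetical, with both objectives maximized:

```python
# Direct transcription of the dominance relation: v dominates u iff v is at
# least as good in every objective and strictly better in at least one.

def dominates(v, u):
    at_least_as_good = all(vi >= ui for vi, ui in zip(v, u))
    strictly_better = any(vi > ui for vi, ui in zip(v, u))
    return at_least_as_good and strictly_better

def pareto_front(points):
    # Keep the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Scores as (damage dealt, health remaining):
scores = [(10, 2), (3, 9), (8, 8), (2, 1)]
print(pareto_front(scores))  # [(10, 2), (3, 9), (8, 8)]
```

Note that the front contains multiple incomparable trade-offs, including points a weighted sum could never select on a non-convex surface.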
Non-dominated Sorting Genetic Algorithm II
(Deb et al. 2000)

1. Population P with size N; evaluate P
2. Use mutation to get P′ of size N; evaluate P′
3. Calculate non-dominated fronts of {P ∪ P′}, size 2N
4. New population of size N from the highest fronts of {P ∪ P′}
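The four steps above can be sketched as one NSGA-II generation. This is a simplified illustration: the crowding-distance tiebreaker used by the real algorithm is omitted, and `fitness` and `mutate` are placeholder stand-ins, not the proposal's evolved networks.

```python
import random

# Minimal sketch of one NSGA-II generation (Deb et al. 2000), following the
# four numbered steps on the slide above.

def dominates(v, u):
    return all(a >= b for a, b in zip(v, u)) and any(a > b for a, b in zip(v, u))

def nondominated_fronts(scores):
    # Peel off successive Pareto fronts of the combined population.
    remaining = list(range(len(scores)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def nsga2_generation(population, fitness, mutate, n):
    combined = population + [mutate(p) for p in population]   # {P ∪ P′}, size 2N
    scores = [fitness(p) for p in combined]
    survivors = []
    for front in nondominated_fronts(scores):                 # highest fronts first
        for i in front:
            if len(survivors) < n:
                survivors.append(combined[i])
    return survivors

random.seed(0)
pop = [(random.random(), random.random()) for _ in range(4)]
fitness = lambda p: p                                         # maximize both genes
mutate = lambda p: (p[0] + random.gauss(0, 0.1), p[1] + random.gauss(0, 0.1))
new_pop = nsga2_generation(pop, fitness, mutate, n=4)
print(len(new_pop))  # 4
```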
Constructive Neuroevolution
Genetic Algorithms + Neural Networks
 Build structure incrementally
 Good at generating control policies
 Three basic mutations (no crossover used):
 Perturb Weight
 Add Connection
 Add Node
 Other structural mutations possible (more later)
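The three basic mutations can be sketched on a toy genome. The encoding here (a node list plus a connection dict) is hypothetical, chosen only to make the operators concrete; it is not the proposal's actual network representation.

```python
import random

# Sketch of the three basic constructive-neuroevolution mutations.

def perturb_weight(genome, sigma=0.5):
    # Pick one connection and nudge its weight.
    key = random.choice(list(genome["connections"]))
    genome["connections"][key] += random.gauss(0, sigma)

def add_connection(genome):
    # Connect two nodes with a random weight if no such link exists yet.
    a = random.choice(genome["nodes"])
    b = random.choice(genome["nodes"])
    if (a, b) not in genome["connections"]:
        genome["connections"][(a, b)] = random.uniform(-1, 1)

def add_node(genome):
    # Splice a new node into an existing connection, roughly preserving
    # behavior: the in-link gets weight 1, the out-link keeps the old weight.
    (a, b), w = random.choice(list(genome["connections"].items()))
    new = max(genome["nodes"]) + 1
    genome["nodes"].append(new)
    del genome["connections"][(a, b)]
    genome["connections"][(a, new)] = 1.0
    genome["connections"][(new, b)] = w

genome = {"nodes": [0, 1], "connections": {(0, 1): 0.7}}
add_node(genome)
print(len(genome["nodes"]), len(genome["connections"]))  # 3 2
```

Splitting connections this way is what lets structure grow incrementally: each mutation starts from a network that already works and changes it only slightly.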
Evolution of Teamwork

Heterogeneous
 Different roles
 Cooperation harder to evolve
 Team-level multimodal behavior

Homogeneous
 Shared policy
 Individuals know how teammates act
 Individuals fill roles as needed: multimodal
Completed Work

Benefits of Multiobjective Neuroevolution
 Pareto-based evolution leads to multimodal behavior

Targeting Unachieved Goals (TUG)
 Speed up evolution with objective management

Evolving Multiple Output Modes
 Allow networks to have multiple policies/modes

Need a domain to experiment in …
Battle Domain


Evolved monsters (yellow)
Scripted fighter (green)
 Approaches nearest monster
 Swings bat repeatedly

Monsters can hurt fighter
Bat can hurt monsters
Multiple objectives:
 Deal damage
 Avoid damage
 Stay alive

Can multimodal teamwork evolve?
Benefits of Multiobjective Neuroevolution

Research Questions:
 Is NSGA-II better than z-scores (weighted sum)?
 Are homogeneous or heterogeneous teams better?

30 trials for each combination
Three evaluations per individual
 Average scores to overcome noisy evaluations
Incremental Evolution

Hard to evolve against the scripted strategy
 Could easily fail to evolve interesting behavior

Incremental evolution against increasing fighter speeds
 0%, 40%, 80%, 90%, 95%, 100%
 Increase speed when all goals are met
 End when goals are met at 100%
Goals

Each objective has a goal:
 At least 50 damage to bot (1 kill)
 Less than 20 damage per monster on average (2 hits)
 Survive at least 540 time steps (90% of trial)

Is average population performance high enough?
 A goal is achieved when the average population score in that objective meets the goal value
 When all goals are achieved, increase the speed
Evolved Behaviors

Baiting + Side-Swiping
 Lure fighter
 Turns allow team to catch up
 Attacks on left side of fighter

Taking Turns
 Hit and run
 Next counter-clockwise monster rushes in
 Fighter hit on left side

Multimodal behaviors!
Multiobjective Conclusions
NSGA-II faster than z-scores
 NSGA-II more likely to generate multimodal behavior

However:
 Many runs did not finish or were slow
 Several “successful” runs did not have multimodal behavior
Targeting Unachieved Goals

Research Question:
 How to speed up evolution and make it more reliable?

When an objective’s goal is met, stop using it
 Restore the objective if scores drop below the goal
 Focuses search on the most challenging objectives
 Combine NSGA-II with TUG

[Figure: tough objectives]
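The objective-management rule above can be sketched in a few lines. This is an illustrative sketch, not the proposal's implementation: the objective names and goal thresholds are hypothetical, and population-average scores are assumed to be recomputed every generation (which also handles restoring a regressed objective).

```python
# Sketch of TUG's objective management: select only on objectives whose
# goals the population has not (or no longer) achieved.

def active_objectives(avg_scores, goals):
    # An objective is active while the population-average score is below its
    # goal; recomputing each generation restores objectives that regress.
    active = set()
    for name, goal in goals.items():
        if avg_scores[name] < goal:
            active.add(name)
    return active

goals = {"damage": 50, "health": 80, "time": 540}
avg = {"damage": 62, "health": 75, "time": 560}
print(sorted(active_objectives(avg, goals)))  # ['health']
```

Selection (e.g. NSGA-II) then runs only on the active objectives, so no effort is wasted maintaining pressure on goals that are already met.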
Evolved Behaviors

Alternating Baiting
 Bait until another monster hits
 Then baiting monster attacks
 Fighter knocked back and forth

Synchronized Formation
 Move as a group
 Fighter chases one bait
 Other monster rushes in with side-swipe attacks

More multimodal behaviors!
TUG Conclusions

TUG results in huge speed-up
 No wasted effort on achieved goals
 TUG runs finish more reliably
 Heterogeneous runs have more multimodal behavior than homogeneous

However:
 Some runs still did not finish
 Some “successful” runs still did not have multimodal behavior
Fight or Flight



Separate Fight and Flight trials
Fight = Battle Domain
Flight:
 Scripted prey (red) instead of fighter
 Has no bat; has to escape
 Monsters confine and damage it
 New objective: deal damage in Flight

Flight task requires teamwork
Requires multimodal behavior
New-Mode Mutation
Encourage multimodal behavior
 New mode receives inputs from a preexisting mode
 Initially very similar to its parent mode
 Maximum preference node determines which mode is active
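The mode structure and the mutation can be sketched as below. The linear "network" is a stand-in for the evolved neural networks in the proposal, used only to make the arbitration and copy-and-perturb ideas concrete; the parameter names are hypothetical.

```python
import random

# Sketch of a multi-output-mode network: each mode has its own policy
# outputs plus one preference value, and the mode whose preference value is
# highest controls the agent.

class MultiModeNet:
    def __init__(self, n_inputs, n_outputs):
        self.n_inputs, self.n_outputs = n_inputs, n_outputs
        # modes[i] = (policy weight matrix, preference weight vector)
        self.modes = [self._random_mode()]

    def _random_mode(self):
        policy = [[random.uniform(-1, 1) for _ in range(self.n_inputs)]
                  for _ in range(self.n_outputs)]
        pref = [random.uniform(-1, 1) for _ in range(self.n_inputs)]
        return (policy, pref)

    def activate(self, x):
        dot = lambda w: sum(wi * xi for wi, xi in zip(w, x))
        # Arbitration: the mode with the maximum preference value wins.
        policy, _ = max(self.modes, key=lambda m: dot(m[1]))
        return [dot(row) for row in policy]

    def new_mode_mutation(self, sigma=0.1):
        # Copy an existing mode and perturb it slightly, so the new mode
        # starts out behaving very similarly to its parent.
        policy, pref = random.choice(self.modes)
        jitter = lambda w: [wi + random.gauss(0, sigma) for wi in w]
        self.modes.append(([jitter(row) for row in policy], jitter(pref)))

net = MultiModeNet(n_inputs=3, n_outputs=2)
net.new_mode_mutation()
print(len(net.modes), len(net.activate([0.5, -0.2, 1.0])))  # 2 2
```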
Evolving Multiple Output Modes

Research Question:
 How to evolve teams that do well in both tasks?

Compare 1Mode to ModeMutation
Three evaluations in Fight and three in Flight
 Same networks for two different tasks
1Mode Behaviors

Aggressive + Corralling
 Aggressive in Fight task
 Take lots of damage
 Deal lots of damage
 Corralling in Flight task

Run/Rush + Crowding
 Run/Rush in Fight task
 Good timing on attack
 Kill fighter w/o taking too much damage
 Crowding in Flight task
 Get too close to prey
 Knock prey out and it escapes

Networks can’t handle both tasks!
ModeMutation Behaviors

Alternating Baiting + Corralling
 Alternating baiting in Fight task
 Corralling in Flight task
 Spread out to prevent escape
 Individuals rush in to attack
Hit into Crowd + Crowding
 Hitting into Crowd in Fight task
 One attacker knocks fighter into others
 Crowding in Flight task
 Rush prey, ricochet back and forth
 Sometimes knocks prey free

Networks succeed at both tasks!
Mode Mutation Conclusions
ModeMutation slower than 1Mode
 ModeMutation better at producing multimodal behaviors

Harder task resulted in more failed runs
 Many unused output modes created
 Slows down execution
 Bloats output layer
Proposed Work

Extensions:
1. Avoiding Stagnation by Promoting Diversity
2. Extending Evolution of Multiple Output Modes
3. Heterogeneous Teams Using Subpopulations
4. Open-Ended Evolution + TUG

Evaluate in new tasks
Killer App: Unreal Tournament 2004
1. Avoiding Stagnation by Promoting Diversity
Behavioral diversity avoids stagnation
 Add a diversity objective (Mouret et al. 2009)
 Behavior vector: given input vectors, concatenate the resulting outputs
 Diversity objective: average distance from the other behavior vectors in the population

[Figure: example behavior vectors]
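The diversity objective above can be sketched directly: feed every individual the same probe inputs, concatenate its outputs into a behavior vector, and score each individual by its average distance from the rest of the population. The probe inputs and toy policies below are hypothetical.

```python
import math

# Sketch of the behavioral-diversity objective (Mouret et al. 2009 style).

def behavior_vector(policy, probe_inputs):
    vec = []
    for x in probe_inputs:
        vec.extend(policy(x))          # concatenate outputs over all probes
    return vec

def diversity_scores(vectors):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # Average distance from every other behavior vector in the population.
    return [sum(dist(v, u) for u in vectors if u is not v) / (len(vectors) - 1)
            for v in vectors]

probes = [(0.0,), (1.0,)]
policies = [lambda x: [x[0]], lambda x: [x[0]], lambda x: [-x[0]]]
vecs = [behavior_vector(p, probes) for p in policies]
scores = diversity_scores(vecs)
print(scores.index(max(scores)))  # 2: the odd one out is the most diverse
```

Used as an extra objective under NSGA-II, this rewards individuals for behaving differently, regardless of how well they score on the task objectives.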
2. Extending Evolution of Multiple Output Modes

Encourage mode differences
 Random input sources
 Probabilistic arbitration
 Bad modes less likely to persist
 Like softmax action selection

Restrict New-Mode Mutation
 New objective: punish unused modes, reward used modes
 Delete similar modes
 Based on behavior metric
 Limit number of modes: make best use of limited resources
 Dynamically increase the limit?
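The probabilistic arbitration idea can be sketched with softmax selection. This is an illustrative sketch under assumed preference values: instead of always picking the mode with the maximum preference node, sample a mode with probability proportional to the softmax of the preferences, so weak modes are used (and thus selected against) less often without being cut off entirely.

```python
import math
import random

# Softmax arbitration over mode-preference values: returns the index of the
# chosen mode; higher preference means higher selection probability.

def softmax_choose_mode(preferences, temperature=1.0):
    exps = [math.exp(p / temperature) for p in preferences]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1              # guard against floating-point round-off

random.seed(1)
picks = [softmax_choose_mode([2.0, 0.0, -1.0]) for _ in range(1000)]
print(picks.count(0) > picks.count(1) > picks.count(2))  # True
```

The temperature parameter controls how sharply the best mode dominates; as temperature goes to zero this approaches the maximum-preference arbitration used by the original New-Mode Mutation.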
3. Heterogeneous Teams Using Subpopulations

Each team member from a different subpopulation (Yong 2007)


Encourages division of labor across teammates
Different roles lead to multimodal team behavior
4. Open-Ended Evolution + TUG

Keep increasing goals
 Evolution always has something to strive towards
 Preserves benefits of TUG
 Does not settle early

When to increase goals?
 When all goals are achieved
 As individual goals are achieved
New Tasks


More tasks require more modes
Investigate single-agent tasks
 Only teams so far
Investigate complementary objectives
 Does TUG only help contradictory objectives?
 Complementary objectives are hard when combined with others

Tasks:
1. Predator
 • Opposite of Flight
 • Partial observability
2. Sink the Ball
 • Very different from previous tasks
 • Needs more distinct modes?
 • Less mode sharing?
Unreal Tournament 2004


Commercial First-Person Shooter (FPS)
Challenging domain
 Continuous state and action
 Multiobjective
 Partial information
 Multimodal behaviors required

Programming API: Pogamut
Competitions:
 Botprize
 Deathmatch
Unreal Deathmatch



Packaged bots are hand-coded
Previous winners of Botprize hand-coded
Learning attempts:
 Simplified version of the game (van Hoorn et al. 2009)
 Limited to certain behaviors (Kadlec 2008)

Multimodal behavior in the full game: not done yet
Unreal Teams

Team Deathmatch
 Largely ignored?

Capture the Flag
 Teams protect own flag
 Bring enemy flag to base
 GP approach could not beat UT bots (Kadlec 2008)

Domination
 King of the hill
 Teams defend key locations
 RL approach learned group strategy on top of hand-coded bots (Smith et al. 2007)
Review

System for developing multimodal behavior
 Multiobjective Evolution
 Targeting Unachieved Goals
 New-Mode Mutation
 Behavioral Diversity
 Extending Mode Mutation
 Subpopulations
 Open-Ended Evolution

Final evaluation in Unreal Tournament 2004
Conclusion

Create a system that:
 Automatically discovers multimodal behavior
 Needs no high-level hierarchy
 Needs no low-level behaviors
 Works in continuous, noisy environments
 Discovers team behavior as well

Agents with an array of different useful behaviors
 Lead to better agents/behaviors in simulations, games, and robotics

Questions?
Auxiliary Slides
Design (2)

Behavior Trees (Isla 2005)
 Top-down approach
 Used first in Halo 2
 Other commercial games since
 “Brute Force Approach to Common Sense”
 Add a behavior for every situation
 Hand coded
Evolutionary (0)

Dangerous Foraging (Stanley et al. 2003)
 Don’t know if food is safe or poison
 Partial information
 Multimodal:
 Eat food
 Avoid food

Adaptive Teams of Agents (Bryant and Miikkulainen 2003)
 Roman legions defend town
 Barbarians plunder
 Homogeneous teams
 Multimodal:
 Defend town
 Chase barbarians
Other MOEAs

PESA-II (Corne et al. 2001)
 External archive and internal population
 Region-based selection using squeeze factor

SPEA2 (Zitzler and Thiele 1999)
 External archive and internal population
 Fitness based on strength and density

[Figure: archive diagram annotated with strength = 0, strength = 1, squeeze factor = 1, squeeze factor = 2]
1. Predator

Scripted agent is predator
 Chases nearest monster
Monsters run to avoid damage
 Opposite behavior from Flight
 When combined with Flight, task is partially observable
Complementary objectives:
 Avoid damage
 Stay alive
2. Sink the Ball

Monsters push ball around
 No scripted agent
 Monster agent sensors tuned to ball

Move ball to goal to sink it
Very different from previous tasks
 More distinct behavioral modes
Complementary objectives:
 Minimize distance between ball and goal
 Minimize time to sink the ball
Heterogeneous z-scores vs. NSGA-II
Homogeneous z-scores vs. NSGA-II
 The z-scores method was faster

Heterogeneous NSGA-II vs. TUG
Homogeneous NSGA-II vs. TUG

Heterogeneous 1Mode vs. ModeMutation
 1Mode was faster
Homogeneous 1Mode vs. ModeMutation
 1Mode was faster