Reinforcement Learning in Real-Time Strategy Games

Nick Imrei
Supervisors: Matthew Mitchell & Martin Dick

Outline

• Reasons
  – What this research is about
  – Motivation and Aim
• Background
  – RTS games
  – Reinforcement Learning explained
  – Applying RL to RTS
• This project
  – Methodology
  – Evaluation
  – Summary

Motivation and Aims

• Problem:
  – AI has been a neglected area – game developers have adopted the "not broken, so why fix it" philosophy
  – Internet thrashing – my own experience
• Aim:
  – Use learning to develop a human-like player
  – Simulate beginner → intermediate level play
  – Use RL and A-life-like techniques
  – e.g. Black and White, Pengi [Scott]

RTS Games – The Domain

• Two or more teams of individuals/cohorts in a war-like situation on a series of battlefields
• Teams can have a variety of:
  – Weapons
  – Units
  – Resources
  – Buildings
• Players are required to manage all of the above to achieve the end goal (destroy all units, capture the flag, etc.)
• e.g. Command & Conquer, Starcraft, Age of Empires, Red Alert, Empire Earth

Challenges offered in RTS games

• Real-time constraints on actions
• High-level strategies combined with low-level tactics
• Multiple goals and choices

The Aim and Approach

• Create a human-like opponent
  – Realistic
  – Diverse behavior (not boring)
  – This is difficult to do!
• Tactics and strategy
  – Agents will be reactive to the environment
  – Learn rather than code – reinforcement learning

The Approach Part 1 – Reinforcement Learning

• Reward and penalty
  – Action rewards / penalties:
    • Penalize being shot
    • Reward killing a player on the other team
  – Strategic rewards / penalties:
    • Securing / occupying a certain area
    • Staying in certain group formations
    • Destroying all enemy units
• Aim to receive the maximum reward over time
• Problem: credit assignment
  – What rewards should be given to which behaviors?

The Approach Part 2 – Credit Assignment

• States and actions
  – Decide on a state space and an action space
  – Assign values to:
    • States, or
    • States and actions
• Train the agent in this space

Reinforcement Learning example

[The worked example on the original slides consisted of diagrams only]

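As a stand-in for those diagrams, the following minimal sketch shows the same mechanism on a toy problem: a one-dimensional strip where the only reward sits at the far end, so the delayed reward has to propagate back through a Q-table. The strip world, constants and variable names are illustrative assumptions, not details of the project.

    # Minimal Q-learning sketch on a 1-D strip: the agent starts at cell 0 and is
    # rewarded only on reaching the last cell, showing how a delayed reward is
    # propagated back through the Q-table. All numbers here are illustrative.
    import random

    N_STATES = 6            # cells 0..5, goal at cell 5
    ACTIONS = [-1, +1]      # move left / move right
    ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def step(state, action):
        """Apply an action; only the last cell gives a reward and ends the episode."""
        nxt = min(max(state + action, 0), N_STATES - 1)
        return nxt, (1.0 if nxt == N_STATES - 1 else 0.0), nxt == N_STATES - 1

    for episode in range(200):
        state, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < EPSILON:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            nxt, reward, done = step(state, action)
            # one-step Q-learning update
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = nxt

    # After training, the greedy policy in every non-goal cell is "move right".
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
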
Why use Reinforcement Learning?

• Well suited to problems where there is a delayed reward (tactics and strategy)
• The trained agent moves in (worst-case) linear time (reactive)
• Problems:
  – Large state spaces (addressed with state aggregation)
  – Long training times (addressed with experience replay and shaping; a replay sketch follows)

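Of the two remedies for long training times, experience replay is the more mechanical one, so here is a rough sketch of it: past transitions are stored once and the Q-update is re-applied to random samples of them, so each expensive simulation step is reused many times. The buffer size, batch size and function names are assumptions, not details from the talk or from Lin's thesis.

    # Illustrative experience-replay buffer for a tabular Q-learner.
    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10_000):
            self.buffer = deque(maxlen=capacity)    # old transitions fall off the end

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def replay_updates(Q, actions, buffer, alpha=0.5, gamma=0.9):
        """Re-run the one-step Q-learning update on a sampled minibatch."""
        for s, a, r, s2, done in buffer.sample():
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
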
The Approach Part 3 – Getting Diversity

• A-life-like behavior using aggregated state spaces

[Diagram on the original slide: an agent and its aggregated state space]

Research Summary

• Investigate this approach using a simple RTS game
• Issues:
  – Empirical research
    • Applying RL in a novel way
    • Not using the entire state space
  – Need to investigate:
    • Appropriate reward functions
    • Appropriate state spaces
  – Problems with training:
    • Will need lots of trials – the propagation problem
    • The number of trials can be reduced using shaping [Mahadevan] and experience replay [Lin]
    • Self-play – other possibilities include A* and human opponents (cf. Tesauro, Samuel; a skeleton follows)

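To give a feel for the self-play option, the skeleton below has the learner play repeated matches against a periodically refreshed frozen copy of itself; a scripted (e.g. A*) or human opponent could be slotted in instead. The match and update functions are hypothetical placeholders, not part of the project's code.

    # Self-play training loop skeleton. `play_match`, `update_from_match` and
    # `copy_policy` are hypothetical placeholders for the game simulation, the
    # RL update and a policy snapshot.
    def self_play_training(agent, play_match, update_from_match, copy_policy,
                           episodes=1000, refresh_every=50):
        opponent = copy_policy(agent)                  # frozen snapshot of the learner
        for episode in range(episodes):
            trajectory = play_match(agent, opponent)   # run one simulated battle
            update_from_match(agent, trajectory)       # apply the RL updates
            if (episode + 1) % refresh_every == 0:
                opponent = copy_policy(agent)          # let the opponent catch up
        return agent
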
Methodology

• Hypothesis:
  – "The combination of RL and reduced state spaces in a rich (RTS) environment will lead to human-like gameplay"
• Empirical investigation to test the hypothesis
• Evaluate system behavior
  – Analyze the observed results
  – Describe interesting phenomena

Evaluation

• Measure the diversity of strategies
  – How big a change (and what type) is required to change the behaviour – a qualitative analysis
• Success of strategies
  – i.e. what level of gameplay is achieved
  – Time to win, points scored, resemblance to human play
• Compare to human strategies
  – "10 requirements of a challenging and realistic opponent" [Scott]

Summary thus far…

• Interested in a human-level game program
• Want to avoid brittle, predictable programmed solutions
• Search the program space for the most diverse solutions, using RL to direct the search
  – This allows specification of results without needing to specify how they are achieved
• Evaluate the results

The Game – Maps and Terrain

• Two armies of equal size on an n*n map
• Terrain:
  – Grass, Trees, Boundary Squares and Swamp
  – All units can move on these squares; however, different terrain types affect a soldier's attributes in different ways (illustrated below)

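The talk does not give the exact attribute effects, so the sketch below only shows one plausible way to represent terrain-dependent modifiers; the terrain names match the slide, but every numeric factor is a placeholder.

    # Hypothetical terrain modifiers: each terrain type scales some soldier
    # attributes. All factors are placeholders, not values from the talk.
    TERRAIN_MODIFIERS = {
        "grass":    {},                                    # baseline, no change
        "trees":    {"sight_range": 0.5, "speed": 0.8},    # reduced vision, slower
        "swamp":    {"speed": 0.5, "fatigue_rate": 1.5},   # slow and tiring
        "boundary": {"speed": 0.7},                        # placeholder edge penalty
    }

    def effective(attributes, terrain):
        """Apply a terrain's multipliers to a soldier's base attributes."""
        mods = TERRAIN_MODIFIERS.get(terrain, {})
        return {name: value * mods.get(name, 1.0) for name, value in attributes.items()}
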
The Game – Soldiers

• Soldier attributes include (a data-structure sketch follows):
  – Sight Range
  – Weapon Range
  – Fatigue
  – Speed
  – Health
  – Direction
  – Relation Lines

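For concreteness, here is a plain data-structure sketch of the attribute list above; the field types, default values and the (dx, dy) direction encoding are assumptions made for illustration.

    # Sketch of the soldier attribute list as a data structure.
    from dataclasses import dataclass, field

    @dataclass
    class Soldier:
        sight_range: float = 8.0        # how far the unit can see
        weapon_range: float = 4.0       # how far it can shoot
        fatigue: float = 0.0            # accumulates as the unit moves
        speed: float = 1.0              # squares moved per time step
        health: float = 100.0
        direction: tuple = (0, 1)       # facing, as a (dx, dy) step (assumed encoding)
        relation_lines: list = field(default_factory=list)  # links to related units
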
Experiments Part 1: Hand-coded Strategies

• Create 8 different hand-coded strategies
  – Incl. Horde, Disperse, Central Defense, etc.
• Test their effectiveness based on:
  – Time taken to win
  – Time taken to eliminate an enemy once spotted
  – Damage sustained when victorious

Results of Experiments Part 1

• Units deployed closer resulted in quicker games.
• No strategy was consistently successful against all others.
• The 3 most successful were:
  – Occupy
  – Horde
  – Central Defense
• Strategies meant nothing once army sizes were > 150 on an 80*80 map.

Experiments Part 2: Control Architectures

• Centralized
  – All units are controlled by one entity
  – Units only do what is commanded (no auto-behavior)
  – View area = the central controller's viewscreen
  – Group formation
  – Unit selection
  – Unit commanding

Experiments Part 2: Control Architectures

• Localized
  – Units are independently maneuvered and controlled, à la Artificial Life
  – Viewing space is only what each unit sees individually
  – Formation: cohorts
  – Unit selection and movement are done via an A-life state machine (see the sketch below)

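As a concrete (and heavily simplified) picture of such an A-life state machine, the sketch below switches a single unit between a few behavioral states using only its local view; the states and thresholds are assumptions, not the project's actual rule set.

    # Toy per-unit A-life state machine: each unit acts on local information only.
    def alife_step(unit, local_view):
        """Update the unit's behavioral state and return (state, action)."""
        if unit["state"] == "wander":
            if local_view["enemies_in_sight"]:
                unit["state"] = "engage"
            elif unit["health"] < 30:
                unit["state"] = "retreat"
        elif unit["state"] == "engage":
            if unit["health"] < 30:
                unit["state"] = "retreat"
            elif not local_view["enemies_in_sight"]:
                unit["state"] = "wander"
        elif unit["state"] == "retreat":
            if unit["health"] > 70:
                unit["state"] = "wander"

        actions = {"wander": "move_with_cohort",
                   "engage": "move_and_shoot_nearest",
                   "retreat": "move_to_health_spot"}
        return unit["state"], actions[unit["state"]]
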
Experiments Part 2: Control Architectures

• Testing:
  – Take the best 3 strategies from Part 1
  – Program them in both a centralized and a localized manner
  – Assess their effectiveness using the criteria from Part 1
  – Observe the realism of the 6 new hand-coded strategies

Results of Experiments Part 2

• As individual unit sight and weapon range increased, localized control performed better.
• A-life control performed better on rougher terrain, whereas centralized control often got stuck.
• Centralized formation takes less time, hence it did better in situations where the ArmySize : MapSize ratio increased.

Results of Experiments Part 2

• Realism evaluation:
  – Localized control more closely resembles a group of soldiers
  – Centralized control better resembles human gameplay
• Given its success, a localized framework is used as a template for the learning agent

Learning Agents – Architecture

• Each agent works off the same learning table (sketched below)
  – This is expected to speed up learning – each agent learns from everyone's mistakes rather than just its own
• Agents are trained against all opponents from Parts 1 and 2
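A minimal sketch of the shared-table idea, assuming a tabular Q-learner: every agent reads from and writes to the same table, so an update made from one soldier's experience immediately changes the values all the others act on. The class layout and hyperparameters are illustrative assumptions.

    # Shared learning table: one Q-table, updated and queried by every agent.
    from collections import defaultdict

    class SharedQTable:
        def __init__(self, actions, alpha=0.2, gamma=0.9):
            self.q = defaultdict(float)      # (state, action) -> value, shared by all agents
            self.actions = actions
            self.alpha, self.gamma = alpha, gamma

        def best_action(self, state):
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    # Every agent on the team holds a reference to the same table, e.g.:
    # table = SharedQTable(actions=["left", "forward", "right", "back", "shoot"])
    # for agent in team: agent.table = table
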

Learning Agents – Representing the world

• States
  – Divide the sight range up into sections
  – Each section can contain a combination of an ally, a health spot, an enemy or none
  – Together with being on or off a health spot, this gives 288 possible world states (an illustrative encoding is sketched below)
• Actions
  – Move & shoot (left, forward, right, back)

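As a rough illustration of how such a sectioned observation can be collapsed into a single discrete state ID, the sketch below summarizes each sight-range sector by a few flags and appends the on/off-health-spot bit. The sector count and flag set are assumptions and do not reproduce the 288-state space quoted on the slide.

    # Illustrative aggregated state encoding (assumed sectoring, not the project's).
    def encode_state(sector_contents, on_health_spot):
        """sector_contents: list of per-sector boolean dicts {'ally', 'enemy', 'health'}."""
        state_id = 1 if on_health_spot else 0
        for sector in sector_contents:
            # 3 flags per sector -> 8 possible contents per sector
            code = (sector["ally"] << 2) | (sector["enemy"] << 1) | sector["health"]
            state_id = state_id * 8 + code
        return state_id

    # Example: ally in the left sector, enemy ahead, nothing to the right.
    sectors = [{"ally": True,  "enemy": False, "health": False},
               {"ally": False, "enemy": True,  "health": False},
               {"ally": False, "enemy": False, "health": False}]
    print(encode_state(sectors, on_health_spot=False))
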
Learning Agents – Representing the world

• Rewards:
  – Positive:
    • Shooting an enemy
    • Moving to a health spot
  – Negative:
    • Being shot / killed
    • Being on a health spot when health is full
• Reinforcement (a sampled update is sketched below):
  – Q(s,a) = R(s,a) + γ Σ_{s′} P(s′|s,a) max_{a′} Q(s′,a′)

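If the transition probabilities P(s′|s,a) are not modelled explicitly, the expectation in the relation above has to be estimated from sampled transitions; the usual one-step Q-learning update does exactly that. A minimal sketch, with an assumed learning rate:

    # One-step Q-learning update approximating the Bellman relation from samples.
    def q_update(Q, state, action, reward, next_state, actions, alpha=0.2, gamma=0.9):
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
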
Results of Experiments Part 3

• Learning of behaviors was achieved in only a few simulations
• Agents developed the following behaviors:
  – Shoot when an enemy is seen, unless health is low
  – If health is low, move to a health spot
  – Units form a health-spot queue
  – Diverting a centralized opponent's attention
• Learning agents were consistently successful against all others bar the centralized Horde
• Agents were told what to do – not how to do it
• Human testing didn't prove too successful!

Conclusions

• A localized approach was found to be more successful overall than a centralized one.
• Given that the game has no base or resource element, the all-out aggressive strategies fared the best.
• Learning strategies were successful against most programmed ones.
• Diversion and health-spot-sharing behaviors were observed.

Future Work

• Extending the RTS game so it has:
  – Resources and resource gathering
  – Different unit types
  – Base building and maintenance
• Testing the RL/A-life framework in other game genres, including role-playing games, sim games and sports.

References

• Bob Scott. The illusion of intelligence. AI Game Programming Wisdom, pages 16–20, 2002.
• Sridhar Mahadevan and Jonathan Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55, pages 311–364, 1992.
• L. Lin. Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA, 1993.
• Mark Bishop Ring. Continual Learning in Reinforcement Environments. MIT Press, 1994.

Stay Tuned!

• For more information, see http://www.csse.monash.edu.au/~ngi/
• Thanks for listening!