Evolutionary Reinforcement Learning Systems

Evolutionary Reinforcement Learning Systems
Presented by Alp Sardağ
Goal
• Two main branches of reinforcement learning:
– Search the space of functions that assign utility
values to states.
– Search the space of functions that assign values to
behaviors (Q-learning).
• Describe the evolutionary algorithm approach to
reinforcement learning and compare and contrast it
with the more common TD methods.
Sequential Decision Task
This is an example of a sequential decision task; the optimal sequence
for an agent that starts from a1 is R, D, R, D, D, R, R, D.
Learning from reinforcements
• The EA and TD methods have similarities:
– Solve difficult sequential decision tasks.
– These methods are normally model-free,
whereas dynamic programming approaches
learn from a complete mathematical model of
the underlying system.
Supervised vs. Unsupervised Learning
• In supervised learning the agent receives the
correct decisions paired with specific sensory
input.
• In reinforcement learning, the agent does not learn
from examples of correct behavior but uses system
payoffs as a guide to form effective decision
policies.
– A decision maker may not receive feedback after
every decision.
– They are applicable in problems where significant
domain knowledge is either unavailable or costly to
obtain.
TD Reinforcement Learning
Q-learning update scheme:
Q(x,a)Q(x,a)+(R(i)+maxQ(y,b)-Q(x,a))
After a long run:
• Q-values will converge to the optimal values.
• A reinforcement learning system can thus use the Q-values to
evaluate each action that is possible from a given state.
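Below is a minimal Python sketch of this tabular update; the learning rate, discount factor, and default Q-value of zero are illustrative assumptions, not values from the slides.

from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative assumption)
GAMMA = 0.9   # discount factor (illustrative assumption)

Q = defaultdict(float)   # Q-values for (state, action) pairs, default 0

def q_update(x, a, reward, y, actions):
    # Q(x,a) <- Q(x,a) + ALPHA * (R + GAMMA * max_b Q(y,b) - Q(x,a))
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += ALPHA * (reward + GAMMA * best_next - Q[(x, a)])

def best_action(x, actions):
    # After convergence, the Q-values evaluate each action possible from state x.
    return max(actions, key=lambda a: Q[(x, a)])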
Evolutionary Algorithm
• Evolutionary algorithms are global search techniques.
• They are built on Darwin’s theory of evolution by natural
selection.
• Numerous potential solutions are encoded in structures,
called chromosomes.
• During each iteration, the EA evaluates solutions and
generates offspring based on the fitness of each solution in
the task.
• Substructures, or genes, of the solutions are then modified
through genetic operators such as mutation or
recombination.
• The idea: structures that led to good solutions in previous
evaluations can be mutated or combined to form even better
solutions.
Basic Steps in Evolutionary Algorithm
Procedure EA
begin
   t = 0;
   initialize P(t);
   evaluate structures in P(t);
   while termination condition not satisfied do
   begin
      t = t + 1;
      select P(t) from P(t-1);
      alter structures in P(t);
      evaluate structures in P(t);
   end
end.
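A runnable Python sketch of the procedure above; the bit-string chromosomes, the toy fitness function, binary tournament selection, and the mutation rate are illustrative assumptions.

import random

POP_SIZE, GENES, MUTATION_RATE, GENERATIONS = 20, 16, 0.05, 100  # assumed parameters

def evaluate(chromosome):
    # Toy fitness: number of 1-genes (stands in for the payoff from the task).
    return sum(chromosome)

def select(population, fitnesses):
    # Select based on fitness: binary tournament (one common choice).
    i, j = random.sample(range(len(population)), 2)
    return population[i] if fitnesses[i] >= fitnesses[j] else population[j]

def alter(chromosome):
    # Genetic operator: point mutation of individual genes.
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

# initialize P(t), then iterate until the termination condition is met
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for t in range(GENERATIONS):
    fitnesses = [evaluate(c) for c in population]          # evaluate structures in P(t-1)
    population = [alter(select(population, fitnesses))     # select P(t) from P(t-1)
                  for _ in range(POP_SIZE)]                # alter structures in P(t)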
Similarities between EA and TD
• EA methods can learn "online" through direct
interaction with the underlying system.
• They are adaptive to changes.
• Online learning is often time-consuming and
dangerous. Therefore researchers in EA and TD
learning train in simulation, then apply the learned
control policy to the real system.
Differences between TD and EA approach
• This can be summarized along three
dimensions:
– Policy representation
– Credit assignment
– Memory
Policy Representation
• What they represent:
– TD methods form q(x,a)v.
– EA methods uses direct mapping from state to
recommended action (e.G. Q(x) a).
• Conclusion: TD try to solve a more diffucult
problem than reinforcement posed. In
addition to choosing best action, it tells us
why.
The size of hypothesis space
• The hypothesis space for a direct policy
representation (e.g. Q(x) → a)
– The total number of functions: c^n
where n: number of states & c: number of actions
• The hypothesis space for a value function
representation (e.g. Q(x,a) → v)
– The total number of functions: w^(c·n)
where w: the number of possible values
(a numeric example is sketched below)
• NOTE: the size of the hypothesis space does not
reflect the difficulty of the problem nor the
efficiency of the methods that search the space.
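A small Python illustration of the two counts above; the problem sizes n, c, and w are made-up examples, not taken from the slides.

n, c, w = 10, 4, 100   # states, actions, distinct values (assumed toy sizes)

direct_policy_functions = c ** n          # mappings Q(x) -> a
value_functions = w ** (c * n)            # mappings Q(x,a) -> v

print(direct_policy_functions)   # 4**10 = 1,048,576
print(value_functions)           # 100**40 = 10^80, a vastly larger space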
Credit Assignment
• TD reinforcement learning: Credit is chained
backward. In this manner, payoffs are distributed
across a sequence of actions. A single reward value
becomes associated with each individual state and
decision pair.
• EA reinforcement learning: rewards are only
associated with sequences of decisions. Thus,
which individual decisions are most responsible
for a good or poor decision policy is irrelevant to
the EA.
Issue in Credit Assignment
• In TD, the update is focused on a single state and action.
• In EA, after a recombination the evaluation is on the entire
sequence of actions.
Memory
• TD reinforcement learning: maintains
statistics concerning every state-action
pair. This method sustains information
about both good and bad state-action pairs.
• EA reinforcement learning: maintains
information only about good state-action
pairs. Thus, memory losses occur.
Example
• Unlike TD, the information content of the EA population
actually decreases during learning.
• State loss can also occur in EA reinforcement learning.
Issues in RL
• Exploration vs. Exploitation
• Perceptual Aliasing
• Generalization
• Unstable Environments
Exploration vs. Exploitation for TD
• Too much exploration could result in lower average
rewards and too little could prevent the learning system
from discovering new optimal states.
– Solution 1:
– Solution 2:
U(i) ← (1−α) U(i) + α (R(i) + max_a F(Σ_j M^a_ij U(j), N(a,i)))
where
F(u,n) = R+   if n < Ne
         u    otherwise
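A Python sketch of the exploration function F used in the update above; the optimistic reward estimate R+ and the trial threshold Ne are given assumed values here.

R_PLUS = 2.0   # optimistic estimate of the best possible reward (assumed value)
N_E = 5        # trials before the learned estimate is trusted (assumed value)

def F(u, n):
    # Return the optimistic value R+ while action a has been tried in state i
    # fewer than Ne times (n = N(a,i)); otherwise return the learned value u.
    return R_PLUS if n < N_E else u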
Exploration vs. Exploitation for EA
• Too much exploration could result in lower
average rewards and too little could prevent the
learning system from discovering new optimal
states.
– The solution comes from the nature of EA. Early in an
evolutionary search, selection pressure is low because
most policies within the population have similarly low
fitness. As the evolution proceeds, highly fit policies
evolve, increasing the selective pressure within the
population (sketched below).
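A small Python sketch of that effect using fitness-proportional selection; the fitness values are invented for illustration.

def selection_probabilities(fitnesses):
    # Fitness-proportional selection: each policy's chance of reproducing.
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

# Early search: similar low fitnesses -> near-uniform selection (much exploration).
print(selection_probabilities([1.0, 1.2, 0.9, 1.1]))
# Later search: a few highly fit policies dominate -> strong selection pressure.
print(selection_probabilities([1.0, 9.0, 0.5, 0.5]))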
Perceptual Aliasing or Hidden State
• In real world situations, the agent often will
not have access to complete information on
the state of its world.
– TD methods are vulnerable to hidden states.
Ambiguous state information misleads the TD
method.
– Since EA methods associate values with entire
decision sequences, credits are based on net
results.
Example
In TD reinforcement learning: (example figure omitted)
Generalization
• The generalization of policy decisions from
one area of the state space to another.
• Since the number of possible states of the
world grows exponentially with the size of
the task, the solution is to apply action
decisions from observed states to
unobserved states (e.g. ANNs, rule bases; see the sketch below).
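A minimal Python sketch of generalizing value estimates with a linear approximator over state features, one stand-in for the ANN or rule-base options above; the feature encoding and its dimension are assumptions.

import numpy as np

DIM = 8                    # assumed number of state features
weights = np.zeros(DIM)    # parameters shared by every state

def features(state):
    # Hypothetical encoding: hash the state into a fixed feature vector so that
    # states sharing features also share (part of) their value estimate.
    rng = np.random.default_rng(abs(hash(state)) % (2**32))
    return rng.random(DIM)

def value(state):
    return float(weights @ features(state))

def update(state, target, alpha=0.1):
    # One update moves the shared weights, so it also changes the estimates of
    # unobserved states with similar features -- the source of generalization.
    global weights
    weights += alpha * (target - value(state)) * features(state)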
Comparison of EA and TD
• Generalization of TD: In some large-scale
problems, approximating the value function works
well, while in some simple toy problems it fails. In a
discrete table one update affects one state, whereas
in the generalized case it affects more than one state
and can create noise. (Who cares about toy problems?)
• Generalization of EA: Since it makes less frequent
updates and bases them on more global information,
it is more difficult for a single observation to affect
the global decision policy.
Unstable Environments
• The agent must adapt its current decision policy in
response to changes that occur in its world.(e.g.
Faulty sensors, new obstructions, etc.)
– Because TD makes constant updates to the decision policy,
it should respond to changes as soon as they occur.
– Since EAs do not update any policy until an individual
or a population of individuals has been completely
evaluated over several actions, the response to changes
is delayed.
Evolutionary Reinforcement Learning
Implementations
• Learning Classifier Systems
• SAMUEL
• GENITOR
• SANE
Learning Classifier Systems
• Messages trigger classifiers, which are symbolic if-then rules that map
sensory input to actions.
• Classifiers are in a competition, resolved by a bidding algorithm.
• Classifier messages may trigger new classifiers.
• They predate TD learning.
• The EA selects, mutates, and recombines the classifiers that received the most
credit from the bidding algorithm.
• The results were disappointing.
SAMUEL
• A system that learns to solve sequential decision problems.
• SAMUEL searches the space of decision policies for sets of condition-action rules.
• Each individual is a rule set that specifies its behavior.
• Each gene is a rule that maps states to actions.
• Three major components:
– Problem specific module: task environment simulation
– Performance module: interacts with the world, obtain payoffs
– Learning module: uses GA to evolve behaviors.
GENITOR
• GENITOR uses neural networks to represent the
decision policy, which generalizes over the state
space.
• GENITOR relies solely on its EA to adjust the
weights in the NN.
• Each NN (individual) is represented in the population as
a sequence of connection weights (sketched below).
• Crossover is realized on the basis of the
offspring's performance.
• Genetic operators are applied asynchronously.
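A Python sketch of the weight-sequence representation; the one-point crossover and Gaussian mutation here are generic operators used for illustration, not necessarily GENITOR's actual operators.

import random

def crossover(parent_a, parent_b):
    # Recombine two weight-vector chromosomes at a single random point.
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(weights, rate=0.1, scale=0.5):
    # The EA alone adjusts the connection weights (no gradient training).
    return [w + random.gauss(0, scale) if random.random() < rate else w
            for w in weights]

# Each individual is one NN, flattened into a sequence of connection weights.
parent_a = [random.uniform(-1, 1) for _ in range(12)]
parent_b = [random.uniform(-1, 1) for _ in range(12)]
offspring = mutate(crossover(parent_a, parent_b))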
SANE
• SANE (Symbiotic, Adaptive Neuro-Evolution) was
designed as a fast, efficient method for building NNs in
domains where it is not possible to generate training data.
• The NN forms a direct mapping from sensors to actions and
provides effective generalization over the state space.
• Individuals are complete NNs.
• Two separate populations (sketched below):
– Population of neurons: the population of building blocks of NNs.
– Population of network blueprints: the population of combinations of
building blocks of NNs.
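A Python sketch of the two-population structure described above; the sizes and the per-neuron weight layout are illustrative assumptions, not SANE's actual data format.

import random

NUM_NEURONS, NUM_BLUEPRINTS, NEURONS_PER_NET = 100, 50, 8   # assumed sizes
WEIGHTS_PER_NEURON = 10                                     # assumed layout

# Population of neurons: each is a building block, here just its connection weights.
neuron_population = [[random.uniform(-1, 1) for _ in range(WEIGHTS_PER_NEURON)]
                     for _ in range(NUM_NEURONS)]

# Population of network blueprints: each is a combination of building blocks,
# i.e. the indices of the neurons that together form one complete network.
blueprint_population = [random.sample(range(NUM_NEURONS), NEURONS_PER_NET)
                        for _ in range(NUM_BLUEPRINTS)]

def build_network(blueprint):
    # Assemble the complete NN an individual blueprint points to.
    return [neuron_population[i] for i in blueprint]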
Conclusion
• EA and TD learn through interactions with
the actual system.
• EA and TD do not require a precise
mathematical model of the domain.
• The main differences:
– Policy representation
– Credit assignment
– Memory