Reinforcement learning and human behavior

Hanan Shteingart and Yonatan Loewenstein
MTAT.03.292 Seminar in Computational Neuroscience
Zurab Bzhalava
Introduction
• Operant Learning
• The dominant computational approach to modeling
operant learning is model-free RL
• Human behavior is far more complex than
model-free RL alone suggests
• Remaining Challenges
Reinforcement Learning
RL: A class of learning problems in which an agent interacts
with an unfamiliar, dynamic and stochastic environment
Goal: Learn a policy to maximize some measure of long-term
reward
Markov Decision Process
• A (finite) set of states S
• A (finite) set of actions A
• Transition Model: T(s, a, s’) = P(s’ | s, a)
• Reward Function: R(s)
• γ is a discount factor, γ ∈ [0, 1]
• Policy π
• Optimal policy π*
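How the discount factor and the optimal policy fit together can be written out explicitly. This is a standard textbook formulation rather than something taken from the slide; the symbol V^π for the value of a policy is an assumption introduced here.

```latex
V^{\pi}(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi \right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s \in S
```

A γ close to 0 makes the agent care mostly about immediate reward, while a γ close to 1 weights distant rewards almost as heavily as immediate ones.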
Markov Decision Process
Bellman equation:
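With the definitions above, the Bellman optimality equation is commonly written as V(s) = R(s) + γ max_a Σ_s’ T(s, a, s’) V(s’). The sketch below applies that update repeatedly (value iteration) until the values stop changing. The two-state, two-action MDP and all of its numbers are invented for illustration, and the symbol V and the function name are assumptions, not taken from the slides.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8, max_iter=10_000):
    """T[s, a, s2] = P(s2 | s, a); R[s] = reward received in state s."""
    V = np.zeros(len(R))
    for _ in range(max_iter):
        # Bellman backup: V(s) = R(s) + gamma * max_a sum_s2 T(s, a, s2) * V(s2)
        V_new = (R[:, None] + gamma * (T @ V)).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final values
    Q = R[:, None] + gamma * (T @ V)
    return V, Q.argmax(axis=1)

# Hypothetical MDP: action 0 tends to keep the current state,
# action 1 tends to switch to the other state. Only state 1 is rewarded.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.1, 0.9], [0.8, 0.2]],   # transitions from state 1 under actions 0, 1
])
R = np.array([0.0, 1.0])
gamma = 0.9

V, policy = value_iteration(T, R, gamma)
print("values:", V)
print("policy:", policy)
```

In this toy problem the greedy policy switches out of the unrewarded state and stays in the rewarded one.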
Biological Algorithms
• Behavioral control
• Evaluate the world quickly
• Choose appropriate behavior based on those
valuations
Midbrain's dopamine neurons
• Central role in guiding our behavior and
thoughts
• Valuation of our world
– Value of money
– Other human being
• Major role in decision-making
• Reward-dependent learning
• Malfunction in mental illness
• Related to Parkinson's disease
• Schizophrenia
Reinforcement signals define an
agent's goals
1. Organism is in state X and receives reward
information;
2. Organism queries stored value of state X;
3. Organism updates stored value of state X
based on current reward information;
4. Organism selects action based on stored
policy;
5. Organism transitions to state Y and receives
reward information.
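The five steps above can be sketched as a simple learning loop. This is a minimal illustration, not the model from the cited papers: the two-state environment, the reward probabilities, the fixed stochastic policy and the learning rate alpha are all assumptions. Step 3 uses a plain delta rule on the current reward; a temporal-difference update would also use the stored value of the next state.

```python
import random

def simulate(n_steps=20000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    reward_prob = {"X": 0.2, "Y": 0.8}   # hypothetical chance of reward in each state
    switch_prob = {"X": 0.5, "Y": 0.5}   # stored (fixed) policy: chance of switching state
    value = {"X": 0.0, "Y": 0.0}         # stored state values
    state = "X"
    for _ in range(n_steps):
        # 1. organism is in `state` and receives reward information
        reward = 1.0 if rng.random() < reward_prob[state] else 0.0
        # 2. organism queries the stored value of the current state
        predicted = value[state]
        # 3. organism updates the stored value from the current reward
        value[state] = predicted + alpha * (reward - predicted)
        # 4. organism selects an action based on the stored policy
        action = "switch" if rng.random() < switch_prob[state] else "stay"
        # 5. organism transitions; reward arrives at the top of the next iteration
        if action == "switch":
            state = "Y" if state == "X" else "X"
    return value

value = simulate()
print(value)
```

The stored values drift toward the average reward experienced in each state, so the value of Y ends up higher than that of X.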
The reward-prediction error
hypothesis
Difference between the experienced and
predicted “reward” of an event
• Neurons of the ventral tegmental area
• phasic activity changes encode a 'prediction
error about summed future reward'
prediction-error signal encoded in
dopamine neuron firing.
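In temporal-difference terms, a prediction error about summed future reward is commonly written as below. The symbols δ_t, r_t, V, γ and the learning rate α are standard notation assumed here; they do not appear on the slide.

```latex
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```

A positive δ_t means the outcome was better than predicted and a negative δ_t means it was worse; under the hypothesis, phasic dopamine activity tracks this quantity.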
Value binding
Human reward responses
• Orbitofrontal Cortex (OFC)
• Amygdala (Amyg)
• Nucleus Accumbens
• Sublenticular extended amygdala
• Hypothalamus (Hyp)
• Ventral Tegmental Area (VTA)
Model-based RL vs Model-free RL
• goal-directed vs habitual behaviors
• Implemented by two anatomically distinct
systems (subject of debate)
• Some findings suggest:
– Medial striatum is more engaged during planning
– Lateral striatum is more engaged during choices in
extensively trained tasks
Model-based RL vs Model-free RL
(b) Model-free RL
(c) Model-based RL
Human subjects exhibited a mixture of both effects.
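The contrast between the two families can be illustrated with a toy sketch. A model-free learner caches action values and adjusts them only from experienced outcomes; a model-based learner computes action values on demand from a world model (transition probabilities and rewards). The function names, the two-state environment and all numbers are hypothetical, and this is not the task or model used in the cited studies.

```python
import numpy as np

def model_free_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: move the cached value Q[s, a] toward a sampled target."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def model_based_values(T, R, gamma=0.9, n_sweeps=200):
    """Plan with a world model: T[s, a, s2] = P(s2 | s, a), R[s] = reward in s."""
    V = np.zeros(len(R))
    for _ in range(n_sweeps):
        V = (R[:, None] + gamma * (T @ V)).max(axis=1)
    return R[:, None] + gamma * (T @ V)

# Hypothetical world: action a always leads to state a.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0
R_old = np.array([0.0, 1.0])   # state 1 is rewarded during training

# Train the model-free learner by random exploration.
rng = np.random.default_rng(0)
Q_mf = np.zeros((2, 2))
s = 0
for _ in range(5000):
    a = int(rng.integers(2))
    s_next = a
    model_free_update(Q_mf, s, a, R_old[s], s_next)
    s = s_next

# The reward structure changes: now state 0 is the rewarded one.
R_new = np.array([1.0, 0.0])
Q_mb = model_based_values(T, R_new)

print("model-free choice (cached):", Q_mf.argmax(axis=1))   # still prefers action 1
print("model-based choice (replanned):", Q_mb.argmax(axis=1))  # prefers action 0
```

The cached values keep favoring the old goal until new experience overwrites them (habit-like), while replanning with the updated model switches at once (goal-directed). A mixture of the two would weight both sets of values.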
Challenges in relating human
behavior to RL algorithms
• Humans tend to alternate rather than repeat an
action after receiving a positively surprising
payoff
• Tremendous heterogeneity in reports on human
operant learning
• Probability matching or not
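The last point refers to a classic contrast, which a small worked computation makes concrete. Suppose that on each trial option A is the rewarded one with probability p, and otherwise option B is. A maximizing agent always picks the more likely option; a probability-matching agent picks A on a fraction p of trials. The value p = 0.7 and the function name below are chosen purely for illustration.

```python
def expected_reward(choice_prob_a, p_a):
    """Expected payoff per trial when A is rewarded with probability p_a, else B."""
    return choice_prob_a * p_a + (1 - choice_prob_a) * (1 - p_a)

p = 0.7
matching = expected_reward(p, p)        # 0.7 * 0.7 + 0.3 * 0.3, about 0.58
maximizing = expected_reward(1.0, p)    # always choose A, about 0.70

print(f"matching:   {matching:.2f}")
print(f"maximizing: {maximizing:.2f}")
```

Matching earns less than maximizing whenever p differs from 0.5, which is why its appearance in human choices is treated as a puzzle for reward-maximizing accounts.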
Heterogeneity in world model
Learning the world model
Questions?
Reference List:
• Reinforcement learning and human behavior
Hanan Shteingart and Yonatan Loewenstein
• The ubiquity of model-based reinforcement learning
Bradley B. Doll, Dylan A. Simon and Nathaniel D. Daw
• Computational roles for dopamine in behavioral control
P. Read Montague, Steven E. Hyman and Jonathan D. Cohen