Lecture 4 - TeachLine


Decision making
Probability in games of chance
Blaise Pascal
1623 - 1662
How much should I bet on ’20’?
$E[\mathrm{gain}] = \sum_x \mathrm{gain}(x)\,\Pr(x)$
Decisions under uncertainty
Maximize expected value
(Pascal)
Bets should be assessed according to
$\sum_x p(x)\,\mathrm{gain}(x)$
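As a concrete illustration (a sketch with made-up numbers, not from the lecture), the "How much should I bet on '20'?" question is read here as a single-number bet on a 37-slot roulette wheel paying 35 to 1:

```python
# Minimal sketch (hypothetical numbers): expected gain of a 1-unit bet on '20',
# computed as E[gain] = sum_x gain(x) * Pr(x).
outcomes = {
    "win":  {"gain": 35.0, "prob": 1 / 37},   # payout if '20' comes up
    "lose": {"gain": -1.0, "prob": 36 / 37},  # the stake is lost otherwise
}

expected_gain = sum(o["gain"] * o["prob"] for o in outcomes.values())
print(f"E[gain] = {expected_gain:.4f}")       # about -0.027 per unit staked
```

The sum is slightly negative, so under Pascal's criterion of maximizing expected value the bet should not be taken at any stake.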
Decisions under uncertainty
The value of an alternative is a monotonic function of the
• Probability of reward
• Magnitude of reward
Do Classical Decision Variables
Influence Brain Activity in LIP?
LIP
Varying Movement Value
Platt and Glimcher 1999
What Influences LIP?
Related to Movement Desirability
• Value/Utility of Reward
• Probability of Reward
Varying Movement Probability
Decisions under uncertainty
Neural activity in area LIP depends on:
• Probability of reward
• Magnitude of reward
Relative or absolute reward?
Dorris and Glimcher 2004
Maximization of utility
Consider a set of alternatives $X$ and a binary relation $\succsim\, \subseteq X \times X$ on it, interpreted as "preferred at least as much as".
Consider the following three axioms:
C1. Completeness: for every $x, y \in X$, $x \succsim y$ or $y \succsim x$.
C2. Transitivity: for every $x, y, z \in X$, $x \succsim y$ and $y \succsim z$ imply $x \succsim z$.
C3. Separability
Theorem: A binary relation can be represented by a real-valued (utility) function $u$ if and only if it satisfies C1-C3.
Under these conditions, the function $u$ is unique up to an increasing transformation.
(Cantor 1915)
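To make the theorem concrete for a finite set of alternatives (where C1-C2 already suffice and C3 is only needed for larger sets), here is a minimal sketch, not from the lecture; the juice example and the prefers helper are hypothetical:

```python
# Minimal sketch: a finite preference relation, brute-force checks of C1-C2,
# and a utility function that represents it.
from itertools import product

def is_complete(X, prefers):
    # C1: for every x, y, either x is weakly preferred to y or vice versa
    return all(prefers(x, y) or prefers(y, x) for x, y in product(X, repeat=2))

def is_transitive(X, prefers):
    # C2: x >= y and y >= z (in preference) imply x >= z
    return all(prefers(x, z)
               for x, y, z in product(X, repeat=3)
               if prefers(x, y) and prefers(y, z))

def utility(X, prefers):
    # For finite X satisfying C1-C2, u(x) = #{y : x weakly preferred to y}
    # represents the relation: x is weakly preferred to y iff u(x) >= u(y).
    return {x: sum(prefers(x, y) for y in X) for x in X}

# Hypothetical example: three juices ranked B over A over C
rank = {"B": 3, "A": 2, "C": 1}

def prefers(x, y):
    return rank[x] >= rank[y]

X = list(rank)
assert is_complete(X, prefers) and is_transitive(X, prefers)
print(utility(X, prefers))  # {'B': 3, 'A': 2, 'C': 1}
```

Any increasing transformation of these utility values (e.g. doubling them) represents the same preferences, which is the uniqueness statement of the theorem.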
A face utility function?
Is there an explicit representation of the 'value' of a choice in the brain?
Neurons in the orbitofrontal cortex encode value
Padoa-Schioppa and Assad, 2006
Examples of neurons encoding the chosen value
A neuron encoding the value of A
A neuron encoding the value of B
A neuron encoding the chosen juice taste
Encoding takes place at different times
post-offer (a, d, e, blue),
pre-juice (b, cyan),
post-juice (c, f, black)
How does the brain learn the values?
The computational problem
The goal is to maximize the sum of rewards
$V_t = E\left[\sum_{\tau=t}^{\mathrm{end}} r_\tau\right]$
The computational problem
The value of the state S1 depends on the policy
If the animal chooses ‘right’ at S1,
$V(S_1) = R(\text{ice cream}) + V(S_2)$
How to find the optimal policy in a
complicated world?
• If the values of the different states are known, then this task is easy:
$V(S_t) = r_t + V(S_{t+1})$
How can the values of the different states
be learned?
$V(S_t) = r_t + V(S_{t+1})$
$V(S_t)$ = the value of the state at time $t$
$r_t$ = the (average) reward delivered at time $t$
$V(S_{t+1})$ = the value of the state at time $t+1$
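As an illustration (a sketch with assumed numbers, not part of the lecture), when the reward at every state of a trial is known, the relation $V(S_t) = r_t + V(S_{t+1})$ can simply be evaluated backwards from the end of the trial; the nine-state trial with a unit reward at state 8 matches the example worked through below.

```python
# Sketch: compute state values by sweeping backwards through a known trial.
rewards = [0, 0, 0, 0, 0, 0, 0, 1, 0]    # states 1..9; reward of size 1 at state 8

V = [0.0] * (len(rewards) + 1)           # one extra slot for "after the last state"
for t in reversed(range(len(rewards))):
    V[t] = rewards[t] + V[t + 1]         # V(S_t) = r_t + V(S_{t+1})

print(V[:len(rewards)])                  # every state up to the reward has value 1;
                                         # the final state has value 0
```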
The TD (temporal difference) learning algorithm
$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t$
where $\alpha$ is the learning rate and
$\delta_t = r_t + V(S_{t+1}) - V(S_t)$
is the TD error.
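A minimal sketch of this update in code (my own illustration, not the lecture's materials): V is a table of state values indexed by state, alpha is the learning rate, and no discount factor is used, matching the equations above.

```python
def td_step(V, s, s_next, r, alpha):
    """One tabular TD(0) step: compute the TD error
    delta_t = r_t + V(S_{t+1}) - V(S_t), then nudge V(S_t) toward the target."""
    delta = r + V[s_next] - V[s]
    V[s] += alpha * delta
    return delta
```

The next slides apply exactly this update by hand, state by state, over the first trials of a conditioning experiment.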
Schultz, Dayan and Montague, Science, 1997
[Figure: a trial represented as a chain of states 1-9; the CS is presented partway through the trial and a reward of size 1 is delivered at state 8.]
Before trial 1:
$V(S_1) = V(S_2) = \dots = V(S_9) = 0$
In trial 1:
• No reward in states 1-7:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$
$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t = 0$
• Reward of size 1 in state 8:
$\delta_t = r_t + V(S_9) - V(S_8) = 1$
$V(S_8) \leftarrow V(S_8) + \alpha\,\delta_t = \alpha$
Before trial 2:
$V(S_1) = V(S_2) = \dots = V(S_7) = V(S_9) = 0$
$V(S_8) = \alpha$
In trial 2, for states 1-6:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$
$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t = 0$
For state 7:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = \alpha$
$V(S_7) \leftarrow V(S_7) + \alpha\,\delta_t = \alpha^2$
Before trial 2:
$V(S_1) = V(S_2) = \dots = V(S_7) = V(S_9) = 0$
$V(S_8) = \alpha$
For state 8:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 1 - \alpha$
$V(S_8) \leftarrow V(S_8) + \alpha\,\delta_t = \alpha + \alpha(1-\alpha) = \alpha(2-\alpha)$
Before trial 3:
$V(S_1) = V(S_2) = \dots = V(S_6) = V(S_9) = 0$
$V(S_7) = \alpha^2$, $V(S_8) = \alpha(2-\alpha)$
In trial 3, for states 1-5:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$
$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t = 0$
For state 6:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = \alpha^2$
$V(S_6) \leftarrow V(S_6) + \alpha\,\delta_t = \alpha^3$
Before trial 3:
$V(S_1) = V(S_2) = \dots = V(S_6) = V(S_9) = 0$
$V(S_7) = \alpha^2$, $V(S_8) = \alpha(2-\alpha)$
For state 7:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = \alpha(2-\alpha) - \alpha^2 = 2\alpha(1-\alpha)$
$V(S_7) \leftarrow V(S_7) + \alpha\,\delta_t = \alpha^2 + 2\alpha^2(1-\alpha) = 3\alpha^2 - 2\alpha^3$
For state 8:
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 1 - \alpha(2-\alpha) = (1-\alpha)^2$
$V(S_8) \leftarrow V(S_8) + \alpha\,\delta_t = \alpha(2-\alpha) + \alpha(1-\alpha)^2 = 3\alpha - 3\alpha^2 + \alpha^3$
After many trials:
$V(S_1) = V(S_2) = \dots = V(S_8) = 1$, $V(S_9) = 0$
$\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$
except at the CS, whose time is unknown (so it cannot be predicted).
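The hand calculation above can be reproduced with a short simulation (a sketch with an arbitrary learning rate, not the lecture's code). Because each state is updated before its successor within a trial, value propagates backwards by one state per trial, just as in the worked example, and after many trials the values settle at $V(S_1)=\dots=V(S_8)=1$, $V(S_9)=0$, with zero TD error inside the trial.

```python
import numpy as np

alpha, n_trials = 0.5, 200        # learning rate and number of trials (arbitrary)
V = np.zeros(10)                  # V[1..9]; index 0 unused, V[9] stays 0
rewards = {8: 1.0}                # reward of size 1 delivered at state 8

for trial in range(n_trials):
    for s in range(1, 9):         # sweep states 1..8; state 9 ends the trial
        delta = rewards.get(s, 0.0) + V[s + 1] - V[s]   # TD error
        V[s] += alpha * delta                           # TD update

print(np.round(V[1:], 3))         # -> [1. 1. 1. 1. 1. 1. 1. 1. 0.]
```

With alpha = 0.5 the first trials reproduce the numbers above ($V(S_8) = 0.5$ after trial 1, then $V(S_7) = 0.25$ and $V(S_8) = 0.75$ after trial 2).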
Schultz, 1998
“We found that these neurons encoded the difference between
the current reward and a weighted average of previous rewards,
a reward prediction error, but only for outcomes that were
better than expected”.
Bayer and Glimcher, 2005