A Short Trip to Reinforcement Learning


Reinforcement Learning
Presented by: Bibhas Chakraborty and Lacey Gunter
7/16/2015
What is Machine Learning?
 A method to learn about some phenomenon
from data, when there is little scientific theory
(e.g., physical or biological laws) relative to
the size of the feature space.
 The goal is to make an “intelligent” machine,
so that it can make decisions (or, predictions)
in an unknown situation.
 The science of learning plays a key role in
areas like statistics, data mining and artificial
intelligence. It also arises in engineering,
medicine, psychology and finance.
Types of Learning
 Supervised Learning
- Training data: (X,Y). (features, label)
- Predict Y, minimizing some loss.
- Regression, Classification.
 Unsupervised Learning
- Training data: X. (features only)
- Find “similar” points in high-dim X-space.
- Clustering.
Types of Learning (Cont’d)
 Reinforcement Learning
- Training data: (S, A, R). (State-Action-Reward)
- Develop an optimal policy (sequence of
decision rules) for the learner so as to
maximize its long-term reward.
- Robotics, Board game playing programs.
Example of Supervised Learning
 Predict the price of a stock in 6 months from
now, based on economic data. (Regression)
 Predict whether a patient, hospitalized due to a
heart attack, will have a second heart attack.
The prediction is to be based on demographic,
diet and clinical measurements for that patient.
(Logistic Regression)
Example of Supervised Learning
 Identify the numbers in a handwritten ZIP code,
from a digitized image (pixels). (Classification)
Example of Unsupervised Learning
From the DNA micro-array data, determine which genes are most “similar” in terms of their expression profiles. (Clustering)
Examples of Reinforcement Learning
 How should a robot behave so as to optimize its
“performance”? (Robotics)
 How to automate the motion of a helicopter?
(Control Theory)
 How to make a good chess-playing program?
(Artificial Intelligence)
History of Reinforcement Learning
 Roots in the psychology of animal learning
(Thorndike, 1911).
 Another independent thread was the problem
of optimal control, and its solution using
dynamic programming (Bellman, 1957).
 Idea of temporal difference learning (on-line
method), e.g., playing board games (Samuel,
1959).
A major breakthrough was the discovery of Q-learning (Watkins, 1989).
What is special about RL?
 RL is learning how to map states to actions, so
as to maximize a numerical reward over time.
 Unlike other forms of learning, it is a multistage
decision-making process (often Markovian).
 An RL agent must learn by trial-and-error. (Not
entirely supervised, but interactive)
 Actions may affect not only the immediate
reward but also subsequent rewards (Delayed
effect).
Elements of RL
 A policy
- A map from state space to action space.
- May be stochastic.
 A reward function
- It maps each state (or, state-action pair) to
a real number, called reward.
 A value function
- Value of a state (or, state-action pair) is the
total expected reward, starting from that
state (or, state-action pair).
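To make these three elements concrete, here is a minimal Python sketch for a toy finite problem; the states, dynamics, rewards, and the policy table below are invented for illustration and are not from the slides.

```python
# Toy 3-state, 2-action problem; all numbers below are made up for illustration.
STATES = [0, 1, 2]
ACTIONS = [0, 1]

def reward(state, action):
    """Reward function: maps a state-action pair to a real number."""
    return 1.0 if (state == 2 and action == 1) else 0.0

policy = {0: 1, 1: 1, 2: 1}  # a deterministic policy: map from state space to action space

def step(state, action):
    """Toy deterministic dynamics: action 1 moves one state to the right."""
    return min(state + action, 2)

def value(state, horizon=3):
    """Value of a state: total reward obtained by following `policy` for `horizon` steps."""
    total = 0.0
    for _ in range(horizon):
        action = policy[state]
        total += reward(state, action)
        state = step(state, action)
    return total

print({s: value(s) for s in STATES})  # states closer to 2 accumulate more reward
```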
The Precise Goal
 To find a policy that maximizes the Value
function.
 There are different approaches to achieve this
goal in various situations.
 Q-learning and A-learning are just two different
approaches to this problem. But essentially both
are temporal-difference methods.
The Basic Setting
Training data: n finite-horizon trajectories of the form
$$\{s_0, a_0, r_0, \ldots, s_T, a_T, r_T, s_{T+1}\}.$$
Deterministic policy: a sequence of decision rules $\{\pi_0, \pi_1, \ldots, \pi_T\}$.
Each $\pi_t$ maps from the observable history (states and actions) to the action space at that time point.
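As a data structure, one such trajectory might be stored as sketched below; this is only our illustration of the layout implied by the slide (the `Trajectory` class and the numbers are made up).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    """One finite-horizon trajectory {s_0, a_0, r_0, ..., s_T, a_T, r_T, s_{T+1}}."""
    states: List[int]     # s_0, ..., s_T, s_{T+1}   (length T + 2)
    actions: List[int]    # a_0, ..., a_T            (length T + 1)
    rewards: List[float]  # r_0, ..., r_T            (length T + 1)

# Training data: n such trajectories (here n = 2 and T = 1, with made-up values).
data = [
    Trajectory(states=[0, 1, 2], actions=[1, 1], rewards=[0.0, 1.0]),
    Trajectory(states=[0, 0, 1], actions=[0, 1], rewards=[0.0, 0.0]),
]

T = len(data[0].actions) - 1
assert all(len(traj.states) == T + 2 for traj in data)
```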
Value and Advantage
Time t state value function, for history $(s_t, a_{t-1})$, is
$$V_t(s_t, a_{t-1}) = E\left[\sum_{j=t}^{T} r_j(S_j, A_j, S_{j+1}) \,\middle|\, S_t = s_t, A_{t-1} = a_{t-1}\right].$$
Time t state-action value function (Q-function) is
$$Q_t(s_t, a_t) = E\left[r_t(S_t, A_t, S_{t+1}) + V_{t+1}(S_{t+1}, A_t) \,\middle|\, S_t = s_t, A_t = a_t\right].$$
Time t advantage (A-function) is
$$\mu_t(s_t, a_t) = Q_t(s_t, a_t) - V_t(s_t, a_{t-1}).$$
Optimal Value and Advantage
Optimal time t value function for history $(s_t, a_{t-1})$ is
$$V_t^*(s_t, a_{t-1}) = \max_{\pi} E\left[\sum_{j=t}^{T} r_j(S_j, A_j, S_{j+1}) \,\middle|\, S_t = s_t, A_{t-1} = a_{t-1}\right].$$
Optimal time t Q-function is
$$Q_t^*(s_t, a_t) = E\left[r_t(S_t, A_t, S_{t+1}) + V_{t+1}^*(S_{t+1}, A_t) \,\middle|\, S_t = s_t, A_t = a_t\right].$$
Optimal time t A-function is
$$\mu_t^*(s_t, a_t) = Q_t^*(s_t, a_t) - V_t^*(s_t, a_{t-1}).$$
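When the problem is Markovian and the transition model is known, these optimal quantities can be computed by backward induction, with the history collapsing to the current state. The sketch below does this for a small invented chain; the transition matrices, rewards, and horizon are all assumptions for illustration, not from the slides.

```python
import numpy as np

# Invented finite-horizon MDP: 3 states, 2 actions, horizon T = 2.
n_states, n_actions, T = 3, 2, 2
# P[a][s, s'] = transition probability, R[s, a] = expected immediate reward.
P = [np.array([[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5]]),   # action 0: tends to stay or move left
     np.array([[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]])]   # action 1: tends to move right
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])          # reward 1 for acting in state 2

# Backward induction: V*_{T+1} = 0, then for t = T, ..., 0
#   Q*_t(s, a) = R(s, a) + sum_{s'} P(s' | s, a) V*_{t+1}(s'),
#   V*_t(s)    = max_a Q*_t(s, a),   mu*_t(s, a) = Q*_t(s, a) - V*_t(s).
V = np.zeros(n_states)              # V*_{T+1} = 0
for t in range(T, -1, -1):
    Q = np.stack([R[:, a] + P[a] @ V for a in range(n_actions)], axis=1)
    V = Q.max(axis=1)
    mu = Q - V[:, None]             # optimal advantages (<= 0, zero at the best action)
    print(f"t={t}  V*={np.round(V, 3)}  greedy actions={Q.argmax(axis=1)}")
```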
Return (sum of the rewards)
The conditional expectation of the return is
$$E\left[\sum_{t=0}^{T} r_t \,\middle|\, S_T, A_T\right] = \sum_{t=0}^{T} \mu_t(S_t, A_t) + \sum_{t=0}^{T-1} \delta_{t+1}(S_{t+1}, A_t) + V_0(S_0),$$
where the advantages $\mu$ are
$$\mu_t(S_t, A_t) = Q_t(S_t, A_t) - V_t(S_t, A_{t-1})$$
and the $\delta$'s, called temporal-difference errors, are
$$\delta_t(S_t, A_{t-1}) = r_{t-1}(S_t, A_{t-1}) + V_t(S_t, A_{t-1}) - Q_{t-1}(S_{t-1}, A_{t-1}).$$
Return (continued)
Conditional expectation of the return is a telescoping sum:
$$E\left[\sum_{t=0}^{T} r_t \,\middle|\, S_T, A_T\right] = \sum_{t=0}^{T} \mu_t + \sum_{t=0}^{T-1} \delta_{t+1} + V_0(S_0)$$
$$= \sum_{t=0}^{T} [Q_t - V_t] + \sum_{t=0}^{T-1} [r_t + V_{t+1} - Q_t] + V_0(S_0).$$
Temporal-difference errors have conditional mean zero:
$$E[r_{t-1}(S_t, A_{t-1}) + V_t(S_t, A_{t-1}) \mid S_{t-1}, A_{t-1}] = Q_{t-1}(S_{t-1}, A_{t-1}).$$
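To see the telescoping concretely, take T = 1 (two decision points). This worked instance is our addition, not from the slides:
$$\sum_{t=0}^{1} \mu_t + \delta_1 + V_0(S_0) = [Q_0 - V_0(S_0)] + [Q_1 - V_1] + [r_0 + V_1 - Q_0] + V_0(S_0) = r_0 + Q_1(S_1, A_1),$$
and since $V_2 \equiv 0$ gives $Q_1(S_1, A_1) = E[r_1 \mid S_1, A_1]$, the right-hand side equals $E[r_0 + r_1 \mid S_1, A_1]$, the conditional expectation of the return.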
Q-Learning
Watkins, 1989
 Estimate the Q-function using some approximator
(for example, linear regression or neural networks
or decision trees etc.).
Derive the estimated policy as the arg max of the estimated Q-function.
 Allow different parameter vectors at different time
points.
 Let us illustrate the algorithm with linear
regression as the approximator, and of course,
squared error as the appropriate loss function.
Q-Learning Algorithm
Set $Q_{T+1} \equiv 0$.
For $t = T, T-1, \ldots, 0$:
$$Y_t = r_t + \max_{a_{t+1}} Q_{t+1}(S_{t+1}, A_t, a_{t+1}; \hat{\theta}_{t+1}),$$
$$\hat{\theta}_t = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left(Y_{t,i} - Q_t(S_{t,i}, A_{t,i}; \theta)\right)^2.$$
The estimated policy satisfies
$$\hat{\pi}_{Q,t}(s_t, a_{t-1}) = \arg\max_{a_t} Q_t(s_t, a_t; \hat{\theta}_t), \quad \forall t.$$
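Below is a minimal sketch of this backward recursion with a linear approximator fitted by ordinary least squares. The feature map `phi`, the simulated trajectories, and all variable names are our own illustration under made-up dynamics, not code from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, n_actions = 2, 50, 2          # horizon T, n trajectories, binary actions

def phi(s, a):
    """Linear features for Q_t(s, a): intercept, state, action, interaction."""
    return np.array([1.0, s, a, s * a])

# Simulate made-up training trajectories: s_{t+1} = s_t + a_t + noise, r_t = s_t + a_t.
S = np.zeros((n, T + 2))
A = rng.integers(0, n_actions, size=(n, T + 1))
Rw = np.zeros((n, T + 1))
for t in range(T + 1):
    Rw[:, t] = S[:, t] + A[:, t]
    S[:, t + 1] = S[:, t] + A[:, t] + rng.normal(0.0, 0.1, size=n)

theta = [np.zeros(4) for _ in range(T + 2)]   # theta[T + 1] stays 0, i.e. Q_{T+1} = 0
for t in range(T, -1, -1):
    # Y_t = r_t + max_{a'} Q_{t+1}(S_{t+1}, a'; theta_{t+1})
    q_next = np.stack([np.array([phi(s, a) @ theta[t + 1] for s in S[:, t + 1]])
                       for a in range(n_actions)], axis=1)
    Y = Rw[:, t] + q_next.max(axis=1)
    # Least-squares regression of Y_t on the features of (S_t, A_t).
    X = np.array([phi(S[i, t], A[i, t]) for i in range(n)])
    theta[t], *_ = np.linalg.lstsq(X, Y, rcond=None)

def pi_hat(t, s):
    """Estimated policy: greedy with respect to the fitted Q_t."""
    return int(np.argmax([phi(s, a) @ theta[t] for a in range(n_actions)]))

print([pi_hat(0, s) for s in (0.0, 1.0, 2.0)])
```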
What is the intuition?
The Bellman equation gives
$$Q_t^*(S_t, A_t) = E\left[r_t + \max_{a_{t+1}} Q_{t+1}^*(S_{t+1}, A_t, a_{t+1}) \,\middle|\, S_t, A_t\right].$$
If $Q_{t+1} = Q_{t+1}^*$ and the training set were infinite, then Q-learning minimizes
$$E\left[r_t + \max_{a_{t+1}} Q_{t+1}^*(S_{t+1}, A_t, a_{t+1}) - Q_t(S_t, A_t; \theta)\right]^2,$$
which is equivalent to minimizing
$$E\left[Q_t^*(S_t, A_t) - Q_t(S_t, A_t; \theta)\right]^2.$$
A Success Story
 TD Gammon (Tesauro, G., 1992)
- A Backgammon playing program.
- Application of temporal difference learning.
- The basic learner is a neural network.
- It trained itself to world-class level by playing against itself and learning from the outcomes. So smart!!
- More information:
http://www.research.ibm.com/massive/tdl.html
A-Learning
Murphy, 2003 and Robins, 2004
 Estimate the A-function (advantages) using
some approximator, as in Q-learning.
Derive the estimated policy as the arg max of the estimated A-function.
 Allow different parameter vectors at different
time points.
 Let us illustrate the algorithm with linear
regression as the approximator, and of course,
squared error as the appropriate loss function.
A-Learning Algorithm
(Inefficient Version)
For $t = T, T-1, \ldots, 0$:
$$Y_t = \sum_{j=t}^{T} r_j - \sum_{j=t+1}^{T} \mu_j(S_j, A_j; \hat{\theta}_j),$$
$$\hat{\theta}_t = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left(Y_{t,i} - \mu_t(S_{t,i}, A_{t,i}; \theta) - E(\mu_t \mid S_{t,i}, A_{t-1,i})\right)^2,$$
$$\mu_t(S_t, A_t; \hat{\theta}_t) \leftarrow \mu_t(S_t, A_t; \hat{\theta}_t) - \max_{a_t} \mu_t(S_t, A_{t-1}, a_t; \hat{\theta}_t).$$
The estimated policy satisfies
$$\hat{\pi}_{A,t}(s_t, a_{t-1}) = \arg\max_{a_t} \mu_t(s_t, a_t; \hat{\theta}_t), \quad \forall t.$$
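A simplified sketch of this backward construction is given below. It follows the slide's recipe for Y_t and the re-centering of the fitted advantage, but it handles the nuisance term E(· | S_t, A_{t-1}) crudely, by including history main-effect features in the same least-squares fit; that shortcut, the feature maps, and the simulated data are our assumptions, not the slide's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, n_actions = 2, 50, 2          # horizon, trajectories, binary actions

# Made-up data, same toy dynamics as in the Q-learning sketch.
S = np.zeros((n, T + 2))
A = rng.integers(0, n_actions, size=(n, T + 1))
Rw = np.zeros((n, T + 1))
for t in range(T + 1):
    Rw[:, t] = S[:, t] + A[:, t]
    S[:, t + 1] = S[:, t] + A[:, t] + rng.normal(0.0, 0.1, size=n)

def h(s):
    """History main-effect features (nuisance part): intercept and state."""
    return np.array([1.0, s])

def g(s, a):
    """Action part: only A_t and its interaction with the history."""
    return np.array([a, s * a])

def mu(s, a, theta):
    """Fitted advantage (action part of the regression only)."""
    return g(s, a) @ theta

theta = [np.zeros(2) for _ in range(T + 1)]
for t in range(T, -1, -1):
    # Y_t = sum_{j >= t} r_j - sum_{j > t} mu_j(S_j, A_j; theta_j), with re-centered mu_j.
    Y = Rw[:, t:].sum(axis=1)
    for j in range(t + 1, T + 1):
        Y = Y - np.array([mu(S[i, j], A[i, j], theta[j]) -
                          max(mu(S[i, j], a, theta[j]) for a in range(n_actions))
                          for i in range(n)])
    # One regression with both history and action features; keep the action part as mu_t.
    X = np.array([np.concatenate([h(S[i, t]), g(S[i, t], A[i, t])]) for i in range(n)])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    theta[t] = coef[2:]

def pi_hat(t, s):
    """Estimated policy: argmax over actions of the fitted advantage."""
    return int(np.argmax([mu(s, a, theta[t]) for a in range(n_actions)]))

print([pi_hat(0, s) for s in (0.0, 1.0, 2.0)])
```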
Differences between Q and A-learning
Q-learning
- At time t we model the main effects of the history $(S_t, A_{t-1})$, the action $A_t$, and their interaction.
- Our $Y_{t-1}$ is affected by how we modeled the main effect of the history $(S_t, A_{t-1})$ at time t.
A-learning
- At time t we model only the effect of $A_t$ and its interaction with $(S_t, A_{t-1})$.
- Our $Y_{t-1}$ does not depend on a model of the main effect of the history $(S_t, A_{t-1})$ at time t.
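The modeling difference shows up directly in the regression design. Here is a small sketch (our own illustration, with an arbitrary choice of features) of the design row each method would use for one observation at time t, where `s` stands for the state history and `a_prev` for the previous action:

```python
import numpy as np

def q_learning_row(s, a_prev, a):
    """Q-learning design: history main effects (s, a_prev), the action a, and their interactions."""
    return np.array([1.0, s, a_prev, a, s * a, a_prev * a])

def a_learning_row(s, a_prev, a):
    """A-learning design: only the action a and its interactions with the history."""
    return np.array([a, s * a, a_prev * a])

print(q_learning_row(s=1.5, a_prev=0, a=1))  # includes terms that model the history itself
print(a_learning_row(s=1.5, a_prev=0, a=1))  # history main effects are left unmodeled
```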
Q-Learning Vs. A-Learning
Relative merits and demerits are not yet completely understood.
 Q-learning has low variance but high
bias.
 A-learning has high variance but low bias.
 Comparison of Q-learning with A-learning
involves a bias-variance trade-off.
References
Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction.
Murphy, S.A. (2003). Optimal Dynamic Treatment Regimes. JRSS-B.
Blatt, D., Murphy, S.A. and Zhu, J. (2004). A-Learning for Approximate Planning.
Murphy, S.A. (2004). A Generalization Error for Q-Learning.