A Short Trip to Reinforcement Learning
Reinforcement Learning
Presented by:
Bibhas Chakraborty and Lacey Gunter
What is Machine Learning?
A method to learn about some phenomenon from data, when there is little scientific theory (e.g., physical or biological laws) relative to the size of the feature space.
The goal is to make an “intelligent” machine, so that it can make decisions (or predictions) in an unknown situation.
The science of learning plays a key role in areas like statistics, data mining and artificial intelligence. It also arises in engineering, medicine, psychology and finance.
Types of Learning
Supervised Learning
- Training data: (X,Y). (features, label)
- Predict Y, minimizing some loss.
- Regression, Classification.
Unsupervised Learning
- Training data: X. (features only)
- Find “similar” points in high-dim X-space.
- Clustering.
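A minimal sketch of the two settings in code (this assumes scikit-learn and NumPy are available; the synthetic data and variable names are purely illustrative, not from the talk):

```python
# Supervised vs. unsupervised learning on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # features

# Supervised: a label Y is also observed; fit a predictor X -> Y
# by minimizing squared-error loss (regression).
Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, Y)

# Unsupervised: only X is observed; look for "similar" points (clustering).
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```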
Types of Learning (Cont’d)
Reinforcement Learning
- Training data: (S, A, R). (State-Action-Reward)
- Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
- Robotics, board-game-playing programs.
Example of Supervised Learning
Predict the price of a stock six months from now, based on economic data. (Regression)
Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
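A hedged sketch of the second example (assuming scikit-learn; the feature columns and data below are hypothetical stand-ins, not actual clinical measurements):

```python
# Logistic regression for predicting a second heart attack (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Columns stand in for demographic, diet and clinical measurements.
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)          # 1 = second heart attack observed

clf = LogisticRegression().fit(X, y)
risk = clf.predict_proba(X[:5])[:, 1]     # predicted risk for five patients
```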
Example of Supervised Learning
Identify the numbers in a handwritten ZIP code, from a digitized image (pixels). (Classification)
Example of Unsupervised Learning
From the DNA micro-array data, determine which genes are most “similar” in terms of their expression profiles. (Clustering)
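A sketch of what such a clustering might look like in code (assuming SciPy; the 50 x 20 gene-by-array matrix is synthetic and illustrative):

```python
# Hierarchical clustering of genes by expression profile (illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
expression = rng.normal(size=(50, 20))    # rows = genes, columns = arrays

# Correlation distance groups genes whose expression profiles move together.
dist = pdist(expression, metric="correlation")
tree = linkage(dist, method="average")
gene_clusters = fcluster(tree, t=4, criterion="maxclust")
```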
Examples of Reinforcement Learning
How should a robot behave so as to optimize its “performance”? (Robotics)
How to automate the motion of a helicopter? (Control Theory)
How to make a good chess-playing program? (Artificial Intelligence)
History of Reinforcement Learning
Roots in the psychology of animal learning (Thorndike, 1911).
Another independent thread was the problem of optimal control and its solution using dynamic programming (Bellman, 1957).
The idea of temporal-difference learning (an online method), e.g., for playing board games (Samuel, 1959).
A major breakthrough was the discovery of Q-learning (Watkins, 1989).
What is special about RL?
RL is learning how to map states to actions so as to maximize a numerical reward over time.
Unlike other forms of learning, it is a multistage decision-making process (often Markovian).
An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)
Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).
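A minimal sketch of this trial-and-error interaction loop (the `env` object and its `reset`/`step`/`actions` interface are hypothetical stand-ins, not from the talk):

```python
# One episode of interaction: the agent acts, observes a reward and the next
# state, and records the (state, action, reward) triples it will learn from.
import random

def run_episode(env, q_table, horizon, epsilon=0.1):
    state = env.reset()
    trajectory = []
    for t in range(horizon):
        if random.random() < epsilon:          # explore: try something new
            action = random.choice(env.actions)
        else:                                  # exploit current value estimates
            action = max(env.actions,
                         key=lambda a: q_table.get((t, state, a), 0.0))
        state_next, reward = env.step(action)  # delayed effects show up later
        trajectory.append((state, action, reward))
        state = state_next
    return trajectory
```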
Elements of RL
A policy
- A map from state space to action space.
- May be stochastic.
A reward function
- It maps each state (or state-action pair) to a real number, called the reward.
A value function
- The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
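A toy illustration of the three elements (the two-state environment, its dynamics and rewards below are invented for illustration):

```python
# Policy, reward function and value function for a tiny deterministic problem.
states, actions = ["low", "high"], ["wait", "act"]

def policy(state):                 # policy: state -> action (deterministic here)
    return "act" if state == "low" else "wait"

def reward(state, action):         # reward function: (state, action) -> real number
    return 1.0 if (state, action) == ("low", "act") else 0.0

def value(state, t, horizon):      # value: total expected reward from (state, t)
    if t > horizon:
        return 0.0
    a = policy(state)
    next_state = "high" if a == "act" else "low"   # toy deterministic dynamics
    return reward(state, a) + value(next_state, t + 1, horizon)

print(value("low", 0, horizon=3))  # total reward this policy collects from "low"
```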
The Precise Goal
To find a policy that maximizes the value function.
There are different approaches to achieve this goal in various situations.
Q-learning and A-learning are two different approaches to this problem, but essentially both are temporal-difference methods.
The Basic Setting
Training data: $n$ finite-horizon trajectories, of the form $\{s_0, a_0, r_0, \ldots, s_T, a_T, r_T, s_{T+1}\}$.
Deterministic policy: a sequence of decision rules $\{\pi_0, \pi_1, \ldots, \pi_T\}$.
Each $\pi_t$ maps from the observable history (states and actions) to the action space at that time point.
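One possible in-memory representation of this training data (the container below is my own choice, assuming the trajectory format above):

```python
# n finite-horizon trajectories {s_0, a_0, r_0, ..., s_T, a_T, r_T, s_{T+1}}.
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    states: List[float]    # s_0, ..., s_T, s_{T+1}   (length T + 2)
    actions: List[int]     # a_0, ..., a_T            (length T + 1)
    rewards: List[float]   # r_0, ..., r_T            (length T + 1)

training_data: List[Trajectory] = []   # n such trajectories

# A deterministic policy is then a list of decision rules pi_0, ..., pi_T,
# each mapping the observable history at time t to an action.
```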
Value and Advantage
Time $t$ state value function, for history $(s_t, a_{t-1})$:
$$V_t(s_t, a_{t-1}) = E\left[\sum_{j=t}^{T} r_j(S_j, A_j, S_{j+1}) \,\middle|\, S_t = s_t,\, A_{t-1} = a_{t-1}\right].$$
Time $t$ state-action value function (Q-function):
$$Q_t(s_t, a_t) = E\left[r_t(S_t, A_t, S_{t+1}) + V_{t+1}(S_{t+1}, A_t) \,\middle|\, S_t = s_t,\, A_t = a_t\right].$$
Time $t$ advantage (A-function):
$$\mu_t(s_t, a_t) = Q_t(s_t, a_t) - V_t(s_t, a_{t-1}).$$
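As a concrete (if naive) illustration, these quantities can be estimated by Monte Carlo from trajectories stored as in the sketch on the previous slide, averaging the remaining rewards over trajectories whose observed history matches; this only makes sense for small, discrete state and action spaces, and the helper below is purely illustrative:

```python
# Naive Monte Carlo estimate of V_t(s_t, a_{t-1}): average the remaining rewards
# over trajectories whose history (states up to t, actions up to t-1) matches.
def estimate_value(trajectories, t, state_prefix, action_prefix):
    returns = [
        sum(traj.rewards[t:])
        for traj in trajectories
        if traj.states[: t + 1] == state_prefix and traj.actions[:t] == action_prefix
    ]
    return sum(returns) / len(returns) if returns else float("nan")

# Q_t(s_t, a_t) would additionally match the action at time t, and the advantage
# is then the difference of the two estimates, mu_t = Q_t - V_t.
```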
Optimal Value and Advantage
Optimal time $t$ value function, for history $(s_t, a_{t-1})$:
$$V_t^*(s_t, a_{t-1}) = \max_{\pi} E\left[\sum_{j=t}^{T} r_j(S_j, A_j, S_{j+1}) \,\middle|\, S_t = s_t,\, A_{t-1} = a_{t-1}\right].$$
Optimal time $t$ Q-function:
$$Q_t^*(s_t, a_t) = E\left[r_t(S_t, A_t, S_{t+1}) + V_{t+1}^*(S_{t+1}, A_t) \,\middle|\, S_t = s_t,\, A_t = a_t\right].$$
Optimal time $t$ A-function:
$$\mu_t^*(s_t, a_t) = Q_t^*(s_t, a_t) - V_t^*(s_t, a_{t-1}).$$
Return (sum of the rewards)
The conditional expectation of the return is
$$E\!\left[\sum_{t=0}^{T} r_t \,\middle|\, S_T, A_T\right] = \sum_{t=0}^{T} \mu_t(S_t, A_t) + \sum_{t=0}^{T-1} \varepsilon_{t+1}(S_{t+1}, A_t) + V_0(S_0),$$
where the advantages $\mu$ are
$$\mu_t(S_t, A_t) = Q_t(S_t, A_t) - V_t(S_t, A_{t-1}),$$
and the $\varepsilon$, called temporal-difference errors, are
$$\varepsilon_t(S_t, A_{t-1}) = r_{t-1}(S_{t-1}, A_{t-1}, S_t) + V_t(S_t, A_{t-1}) - Q_{t-1}(S_{t-1}, A_{t-1}).$$
Return (continued)
The conditional expectation of the return is a telescoping sum:
$$E\!\left[\sum_{t=0}^{T} r_t \,\middle|\, S_T, A_T\right] = \sum_{t=0}^{T} \mu_t + \sum_{t=0}^{T-1} \varepsilon_{t+1} + V_0(S_0) = \sum_{t=0}^{T} \left[Q_t - V_t\right] + \sum_{t=0}^{T-1} \left[r_t + V_{t+1} - Q_t\right] + V_0(S_0).$$
Temporal-difference errors have conditional mean zero:
$$E\left[r_{t-1}(S_{t-1}, A_{t-1}, S_t) + V_t(S_t, A_{t-1}) \,\middle|\, S_{t-1}, A_{t-1}\right] = Q_{t-1}(S_{t-1}, A_{t-1}).$$
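To see why the sum telescopes, substitute the definitions of $\mu_t$ and $\varepsilon_{t+1}$ and cancel terms (the short verification below assumes the end-of-horizon convention $V_{T+1} \equiv 0$):

```latex
\begin{aligned}
\sum_{t=0}^{T}\left[Q_t - V_t\right] + \sum_{t=0}^{T-1}\left[r_t + V_{t+1} - Q_t\right]
  &= \sum_{t=0}^{T-1} r_t + Q_T(S_T, A_T) + \sum_{t=1}^{T} V_t - \sum_{t=0}^{T} V_t \\
  &= \sum_{t=0}^{T-1} r_t + Q_T(S_T, A_T) - V_0(S_0).
\end{aligned}
```

Adding $V_0(S_0)$ and using $E[r_T \mid S_T, A_T] = Q_T(S_T, A_T)$ (which holds when $V_{T+1} \equiv 0$) recovers the conditional expectation of the full return $\sum_{t=0}^{T} r_t$.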
Q-Learning
(Watkins, 1989)
Estimate the Q-function using some approximator (for example, linear regression, neural networks or decision trees).
Derive the estimated policy as an argmax of the estimated Q-function.
Allow different parameter vectors at different time points.
Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.
Q-Learning Algorithm
Set $Q_{T+1} \equiv 0$.
For $t = T, T-1, \ldots, 0$:
$$Y_t = r_t + \max_{a_{t+1}} Q_{t+1}\big(S_{t+1}, A_t, a_{t+1}; \hat{\theta}_{t+1}\big),$$
$$\hat{\theta}_t = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \Big[Y_{t,i} - Q_t\big(S_{t,i}, A_{t,i}; \theta\big)\Big]^2.$$
The estimated policy satisfies
$$\hat{\pi}_{Q,t}(s_t, a_{t-1}) = \arg\max_{a_t} Q_t\big(s_t, a_t; \hat{\theta}_t\big), \quad \forall t.$$
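A hedged sketch of this backward recursion with linear regression as the approximator (assuming NumPy and scikit-learn; the data layout, with one array of states, actions and rewards per time point, is my own choice, and for simplicity the sketch uses only the current state, rather than the full history, as the regression feature):

```python
# Batch Q-learning by backward induction, fitting one linear model per time point.
import numpy as np
from sklearn.linear_model import LinearRegression

def batch_q_learning(states, actions, rewards, action_space, T):
    """states[t], actions[t], rewards[t]: length-n arrays for time point t."""
    q_models = [None] * (T + 1)
    for t in range(T, -1, -1):
        # Target Y_t = r_t + max_a Q_{t+1}(S_{t+1}, a; theta_{t+1}), with Q_{T+1} = 0.
        if t == T:
            y = rewards[t]
        else:
            n = len(rewards[t])
            next_q = np.column_stack([
                q_models[t + 1].predict(
                    np.column_stack([states[t + 1], np.full(n, a)]))
                for a in action_space
            ])
            y = rewards[t] + next_q.max(axis=1)
        # Least-squares regression of Y_t on (state, action) features.
        q_models[t] = LinearRegression().fit(
            np.column_stack([states[t], actions[t]]), y)
    return q_models   # the estimated policy is the argmax of each fitted Q_t
```

With a model that is purely linear in (state, action), the argmax over actions is trivial; in practice one would include state-action interaction terms, as discussed on the later comparison slide.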
What is the intuition?
The Bellman equation gives
$$Q_t^*(S_t, A_t) = E\left[r_t + \max_{a_{t+1}} Q_{t+1}^*\big(S_{t+1}, A_t, a_{t+1}\big) \,\middle|\, S_t, A_t\right].$$
If $\hat{Q}_{t+1} = Q_{t+1}^*$ and the training set were infinite, then Q-learning minimizes
$$E\left[r_t + \max_{a_{t+1}} Q_{t+1}^*\big(S_{t+1}, A_t, a_{t+1}\big) - Q_t\big(S_t, A_t; \theta\big)\right]^2,$$
which is equivalent to minimizing
$$E\Big[Q_t^*(S_t, A_t) - Q_t\big(S_t, A_t; \theta\big)\Big]^2.$$
A Success Story
TD-Gammon (Tesauro, G., 1992)
- A backgammon-playing program.
- An application of temporal-difference learning.
- The basic learner is a neural network.
- It trained itself to world-class level by playing against itself and learning from the outcomes. So smart!!
- More information: http://www.research.ibm.com/massive/tdl.html
A-Learning
(Murphy, 2003; Robins, 2004)
Estimate the A-function (advantages) using some approximator, as in Q-learning.
Derive the estimated policy as an argmax of the estimated A-function.
Allow different parameter vectors at different time points.
Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.
A-Learning Algorithm
(Inefficient Version)
For $t = T, T-1, \ldots, 0$:
$$Y_t = \sum_{j=t}^{T} r_j - \sum_{j=t+1}^{T} \hat{\mu}_j\big(S_j, A_j; \hat{\psi}_j\big),$$
$$\hat{\psi}_t = \arg\min_{\psi} \frac{1}{n} \sum_{i=1}^{n} \Big[Y_{t,i} - \big\{\mu_t\big(S_{t,i}, A_{t,i}; \psi\big) - E\big(\mu_t \mid S_{t,i}, A_{t-1,i}\big)\big\}\Big]^2,$$
$$\hat{\mu}_t\big(S_t, A_t; \hat{\psi}_t\big) = \mu_t\big(S_t, A_t; \hat{\psi}_t\big) - \max_{a_t} \mu_t\big(S_t, A_{t-1}, a_t; \hat{\psi}_t\big).$$
The estimated policy satisfies
$$\hat{\pi}_{A,t}(s_t, a_{t-1}) = \arg\max_{a_t} \mu_t\big(s_t, a_t; \hat{\psi}_t\big), \quad \forall t.$$
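A hedged sketch of one backward step of this algorithm, for a binary action randomized with known probability and a simple linear advantage model in the current state only, $\mu_t(s, a; \psi) = a(\psi_0 + \psi_1 s)$ (the model and all names are my own illustrative choices, not the authors' implementation):

```python
# One A-learning step: with mu_t(s, a; psi) = a * (psi_0 + psi_1 * s) and known
# P(A_t = 1 | history) = p, the centered advantage mu_t - E(mu_t | history)
# equals (a - p) * (psi_0 + psi_1 * s), so psi solves an ordinary least squares.
import numpy as np

def a_learning_step(s, a, y, p):
    """s, a, y, p: length-n arrays of states, binary actions, targets Y_t, P(A_t=1)."""
    X = (a - p)[:, None] * np.column_stack([np.ones_like(s), s])
    psi, *_ = np.linalg.lstsq(X, y, rcond=None)

    gain = psi[0] + psi[1] * s                     # mu_t(s, 1) - mu_t(s, 0)
    mu_hat = a * gain - np.maximum(gain, 0.0)      # centered so max_a mu_hat = 0
    policy = (gain > 0).astype(int)                # argmax over a in {0, 1}
    return psi, mu_hat, policy

# Earlier time points reuse the returned mu_hat values to form their targets
# Y_t = sum of future rewards minus the fitted advantages at later time points.
```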
Differences between Q and A-learning
Q-learning
At time $t$ we model the main effect of the history $(S_t, A_{t-1})$, the effect of the action $A_t$, and their interaction.
Our $Y_{t-1}$ is affected by how we modeled the main effect of the history $(S_t, A_{t-1})$ at time $t$.
A-learning
At time $t$ we model only the effect of $A_t$ and its interaction with $(S_t, A_{t-1})$.
Our $Y_{t-1}$ does not depend on a model of the main effect of the history $(S_t, A_{t-1})$ at time $t$.
Q-Learning Vs. A-Learning
Relative merits and demerits are not yet completely understood.
Q-learning has low variance but high bias.
A-learning has high variance but low bias.
A comparison of Q-learning with A-learning thus involves a bias-variance trade-off.
References
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Murphy, S. A. (2003). Optimal Dynamic Treatment Regimes. Journal of the Royal Statistical Society, Series B.
Blatt, D., Murphy, S. A. and Zhu, J. (2004). A-Learning for Approximate Planning.
Murphy, S. A. (2004). A Generalization Error for Q-Learning.