REINFORCEMENT LEARNING
Group 11
Ashish Meena (04005006)
Rohitashwa Bhotica (04005010)
Hansraj Choudhary (04d05005)
Piyush Kedia (04d05009)
CSE, IIT Bombay
Outline
Introduction
Learning Models
Motivation
Reinforcement Learning Framework
Q-Learning Algorithm
Applications
Summary
Introduction
Machine Learning
◦ Construction of programs that automatically improve with experience.
Types of Learning
◦ Supervised Learning
◦ Unsupervised Learning
◦ Reinforcement Learning
Supervised Learning
- Training data: (X,Y). (features, label)
- Predict Y, minimizing some loss.
- Regression, Classification.
Example
◦ Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
Unsupervised Learning
- Training data: X. (features only)
- Find "similar" points in high-dimensional X-space.
- Clustering.
Example
◦ From the DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
Reinforcement Learning
Training data: (S, A, R). (State, Action, Reward)
Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
Reinforcement Learning
[Diagram: the agent-environment interaction loop. The agent observes state s_t, its policy selects action a_t, and the environment returns reward r_t and the next state s_{t+1}, producing the trajectory s0, a0 : r0 → s1, a1 : r1 → s2, a2 : r2 → ...]
Motivation
Supervised and unsupervised learning fail in many situations.
Example
◦ Flight control systems
Given the set of all sensor readings at a given time, decide how the flight controls should function.
In the case of supervised learning, the labels for many of these inputs are unknown.
By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.
Learning by Interaction
• Learning to ride a bicycle
• Actions:
• Turn handle bars RIGHT.
• Turn handle bars LEFT.
• Rewards:
• Positive if the cycle is perfectly balanced.
• Negative if the angle between the cycle and the ground decreases.
Inspiration from Natural Learning
Dopamine
◦ A neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates
◦ Regulates and controls behavior by inducing pleasurable effects equivalent to rewards
◦ Motivates us to proactively perform certain activities
Markov Decision Processes
A discrete set of environment states S.
A discrete set of agent actions A.
At each discrete time step t the agent observes state s_t ∈ S and chooses action a_t ∈ A
Receives reward r_t
State changes to s_{t+1}
Markov Assumptions
s_{t+1} = δ(s_t, a_t)
r_t = r(s_t, a_t)
δ and r are usually unknown to the learner
δ and r may also be non-deterministic
Simple deterministic example
Model-free v/s Model-based Methods
Model-free Methods
◦ Eliminate the need to know the model
◦ Learn a policy without estimating the model.
◦ Example: Q-Learning
Model-Based Methods
◦ The model consists of the transition probability function and the reward function.
◦ Interactively estimate the model and calculate the policy.
◦ Example: Dyna
Precise Goal
Maximization of long-term reward.
An optimal policy has to be learnt in order to do this.
Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.
Models of Optimal Behaviour
Three main models
Finite Horizon Model
◦ Optimize the reward for the next h steps
◦ r_t represents the reward for the (t+1)-th action performed
◦ h = 1 represents a greedy strategy
Models of Optimal Behaviour (2)
Infinite Horizon Discounted Model
◦ Models future rewards as being less valuable than rewards in the present
◦ Has an infinite horizon, i.e. considers infinitely many actions in the future
◦ Makes use of a discount factor γ between 0 and 1 which represents the importance of the future
◦ This is the model used in the Q-Learning algorithm
Models of Optimal Behaviour (3)
Average-Reward Model
◦ Reward per action is considered
◦ The infinite future is taken into account
◦ Rewards in the future are equally valuable
The three objectives are summarized in the sketch below.
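The three optimality models can be written compactly as follows (a sketch in standard notation; E denotes expectation, h is the horizon length and γ is the discount factor; the slides do not give these formulas explicitly):

\[
\begin{aligned}
\text{Finite horizon:} \quad & E\Big[\sum_{k=0}^{h-1} r_{t+k}\Big] \\
\text{Infinite horizon discounted:} \quad & E\Big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\Big], \qquad 0 \le \gamma < 1 \\
\text{Average reward:} \quad & \lim_{h \to \infty} \tfrac{1}{h}\, E\Big[\sum_{k=0}^{h-1} r_{t+k}\Big]
\end{aligned}
\]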
The Learning Task
Execute actions, observe results and
◦ Learn a policy π : S → A maximizing
V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
for all states s ∈ S
The infinite horizon discounted model is being used here.
Value Function
For each possible policy π that the agent can adopt, the value function is
V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
In terms of the value function, the learning task can be reformulated as learning the optimal policy π* such that
π* = argmax_π V^π(s), for all s ∈ S
Learning the Value Function
• The value function of the optimal policy, written V*, is
V*(s) = max_π V^π(s)
• A look-ahead search can be performed from any state to determine the optimal policy, using
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• Learning V* is only possible when the agent has perfect knowledge of both δ and r.
• Thus the need to define the Q function arises.
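As a small illustration of this look-ahead rule, here is a minimal Python sketch (the reward function r(s, a), the deterministic transition function delta(s, a), the action list and the V_star table are assumed to be available; these names are illustrative, not from the slides):

def greedy_policy(s, actions, r, delta, V_star, gamma=0.9):
    """One-step look-ahead: pick the action maximizing r(s, a) + gamma * V*(delta(s, a))."""
    return max(actions, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])

Note that this rule is usable only when r and delta are known, which is exactly why the Q function is introduced next.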
Example V* Values
[Figure: example V* values for each state of the grid world, computed with γ = 0.9.]
The Q Function
The evaluation function Q(s, a) is defined as
Q(s, a) = r(s, a) + γ V*(δ(s, a))
If the Q function is learnt, the optimal action can be selected without knowing δ and r:
π*(s) = argmax_a Q(s, a)
Learning the Q function
Notice the relationship between Q and V*:
V*(s) = max_{a'} Q(s, a')
Rewriting the value of Q using this relationship gives
Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
Learning the Q-value
FOR each (s, a) DO
◦ Initialize the table entry: Q̂(s, a) ← 0
Observe the current state s
WHILE (true) DO
◦ Select an action a and execute it
◦ Receive the immediate reward r
◦ Observe the new state s'
◦ Update the table entry for Q̂(s, a) using the training rule:
Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
◦ Move: s ← s'
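A minimal sketch of this tabular Q-learning loop in Python (the environment interface env.reset() / env.step(), the epsilon-greedy action selection and all hyper-parameter values are illustrative assumptions, not part of the slides):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (illustrative sketch)."""
    Q = defaultdict(float)                    # table entry Q(s, a), initialized to 0
    for _ in range(episodes):
        s = env.reset()                       # each episode begins in a (random) start state
        done = False
        while not done:
            # Select an action: explore with probability epsilon, otherwise act greedily on Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)     # execute the action, observe reward and new state
            # Training rule: Q(s, a) <- r + gamma * max over a' of Q(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, x)] for x in actions)
            s = s_next                        # move to the new state
    return Q

The update simply overwrites the old table entry, exactly as in the training rule above; a learning rate is only needed in the non-deterministic case discussed later.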
Q Learning Illustration
Consider that an optimal strategy is being learnt for the given set of states and the rewards for actions shown below.
[Figure: grid world with goal state G; each action (arrow) is labelled with its immediate reward: 100 for actions entering G and 0 for every other action.]
The reward for all actions is 0 unless the action moves to the goal state.
Q Learning Illustration(2)
The actual values of V* and Q, taking γ = 0.9, are:
Learning Step
Since an absorbing state exists, learning will consist of a series of episodes, each with a random start state.
Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
= 0 + 0.9 × max{63, 81, 100}
= 90
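The same update, checked numerically in Python (the three values 63, 81 and 100 are the current Q̂ estimates for the actions available in state s2, exactly as in the illustration above):

gamma = 0.9
q_s2 = [63, 81, 100]            # current Q-hat estimates for the actions available in s2
r = 0                           # immediate reward for moving right from s1
print(r + gamma * max(q_s2))    # prints 90.0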
Uncertain State Transitions
A state transition probability function T(s, a, s') is used
◦ Denotes the probability of a transition from state s to s' when action a is performed.
The Q function is modified to
Q(s, a) = r(s, a) + γ Σ_{s' ∈ S} T(s, a, s') max_{a'} Q(s', a')
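A minimal sketch of this expected-value backup for the non-deterministic case (the state list, action list, reward table r, transition table T and current Q estimates are assumed to be plain Python lists and dictionaries; all of these names are illustrative):

def q_backup(s, a, states, actions, r, T, Q, gamma=0.9):
    """One backup of Q(s, a) = r(s, a) + gamma * sum over s' of T(s, a, s') * max over a' of Q(s', a')."""
    expected_future = sum(
        T[(s, a, s_next)] * max(Q[(s_next, a_next)] for a_next in actions)
        for s_next in states
    )
    return r[(s, a)] + gamma * expected_future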
Applications
Cell Phone Channel Allocation
Cobot: A Social Reinforcement Learning Agent
Car Simulation: Using Reinforcement Learning
Network Packet Routing
Elevator Scheduling
Use of RL to improve the performance of natural language question answering systems
Reinforcement Learning Methods for Military Applications
Cell Phone Channel Allocation
Learns channel allocations for cell phones
◦ Channels are limited
◦ Allocations affect adjacent cells
◦ Want to minimize dropped and blocked calls
Cell Phone Channel Allocation Cont…
States
◦ Occupied and unoccupied channels for each cell
Availability: Number of free channels for cell
Actions
◦ Call arrival
Evaluate the possible free channels
Assign the one that has the highest value (see the sketch after this slide)
◦ Call termination
Free the channel
Consider reassigning each ongoing call to the just-released channel
Perform the reassignment (if any) with the highest value
Rewards and Values
◦ The reward is the number of ongoing calls
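A minimal sketch of the value-based assignment rule at call arrival (the free_channels list and the learned value function value(cell, channel) are illustrative assumptions; the slides do not specify how the value function is represented):

def assign_channel(cell, free_channels, value):
    """On call arrival, pick the free channel whose assignment has the highest learned value;
    return None (the call is blocked) if no channel is free."""
    if not free_channels:
        return None
    return max(free_channels, key=lambda ch: value(cell, ch))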
Performance of FA, BDCL, and RL
FA = fixed assignment method, BDCL = borrowing with directional channel locking, RL = reinforcement learning
Cobot: A Social Reinforcement Learning Agent
Cobot is a software agent
◦ Applies RL in a complex, human, online, chat-based social environment
The goal is to interact with other members and to become a real part of the social fabric
Takes certain actions on his own initiative
Any user can reward or punish him.
Cobot maintains an incremental database of "social statistics"
◦ e.g. how frequently and in what ways users interacted with one another, and he provides summaries of these statistics as a service
Cobot Cont…
States
◦ One state space corresponding to each user
◦ The state space contains a number of features containing statistics about that particular user
Actions
◦ Null Action: choose to remain silent for this time period.
◦ Topic Starters: introduce a conversational topic.
◦ Social Commentary: make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly."
Cobot cont…
The RL Reward Function
◦ Reward verb
E.g. hug
◦ Punish verb
E.g. spank
◦ These verbs give a numerical (positive and negative, respectively) training signal to Cobot
Car Simulation: Using Reinforcement Learning
The drivers do not know the track information beforehand
The car takes appropriate actions to avoid bumping into the wall by learning from past experience and the given rewards
States
◦ The state of the car is represented by two variables:
The distance of the car to the left wall of the track
The car's velocity towards the right wall
Car Simulation cont…
Action
◦ Turn left or right to go in the right direction
Reward
◦ The car is given a negative reward if
it bumps into the wall
it goes backwards
◦ Positive if
it is going in the correct direction
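A minimal sketch of such a reward function (the state variables, thresholds and reward magnitudes are illustrative assumptions; the slides give no numeric values):

def reward(bumped_wall, forward_velocity):
    """Illustrative car-simulation reward: punish crashing or reversing, reward forward progress."""
    if bumped_wall:
        return -1.0          # negative reward for bumping into the wall
    if forward_velocity < 0:
        return -0.5          # negative reward for going backwards
    return 0.1               # small positive reward for going in the correct direction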
Conclusion
The basic reinforcement learning model consists of:
◦ a set of environment states S
◦ a set of actions A, and
◦ a set of scalar "rewards"
The learner takes actions in an environment so as to maximize its long-term reward.
Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique.
Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training.
References
Bill Smart: Reinforcement Learning: A User's Guide. Department of Computer Science and Engineering, Washington University in St. Louis, ICAC 2005. http://www.cse.wustl.edu/~wds/
Tom Mitchell: Machine Learning. McGraw Hill, 1997.
Harmon, M., Harmon, S.: Reinforcement Learning: A Tutorial. Wright State University, 2000.
Rich Sutton: Reinforcement Learning: Tutorial. AT&T Labs. http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt
Kaelbling, L.P., Littman, M.L., and Moore, A.W.: "Reinforcement learning: A survey". Journal of Artificial Intelligence Research, 4, 1996.
http://en.wikipedia.org/wiki/Reinforcement_learning
Isbell, C., Shelton, C., Kearns, M., Singh, S., and Stone, P.: "A Social Reinforcement Learning Agent". In Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. Winner of the Best Paper Award.
Questions & Comments