
REINFORCEMENT LEARNING
Group 11
Ashish Meena (04005006)
Rohitashwa Bhotica (04005010)
Hansraj Choudhary (04d05005)
Piyush Kedia (04d05009)
Outline

• Introduction
• Learning Models
• Motivation
• Reinforcement Learning Framework
• Q-Learning Algorithm
• Applications
• Summary
Introduction

• Machine Learning
◦ Construction of programs that automatically improve with experience.

• Types of Learning
◦ Supervised Learning
◦ Unsupervised Learning
◦ Reinforcement Learning
Supervised Learning

• Training data: (X, Y) (features, label)
• Predict Y, minimizing some loss.
• Regression, Classification.

• Example
◦ Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
Unsupervised Learning

• Training data: X (features only)
• Find "similar" points in high-dimensional X-space.
• Clustering.

• Example
◦ From the DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
Reinforcement Learning

• Training data: (S, A, R) (State, Action, Reward)
• Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
Reinforcement Learning

[Figure: agent-environment interaction loop. The agent, following its policy, observes state s_t and takes action a_t; the environment returns reward r_t and the next state, producing the trajectory s_0, a_0, r_0; s_1, a_1, r_1; s_2, a_2, r_2; ...]
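To make this loop concrete, here is a minimal sketch in Python of the agent-environment interaction shown in the figure; the environment, policy and reward used here are hypothetical placeholders, not part of the original slides.

```python
import random

# A toy, hypothetical environment: states are the integers 0..4, the agent can
# move left or right, and a reward of 1 is given for reaching state 4.
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def policy(state):
    # Placeholder policy: choose between the two actions at random.
    return random.choice(["left", "right"])

state = 0
for t in range(10):
    action = policy(state)                    # agent chooses a_t in state s_t
    next_state, reward = step(state, action)  # environment returns r_t and s_{t+1}
    print(f"t={t}  s={state}  a={action}  r={reward}")
    state = next_state
```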
Motivation

• Supervised and unsupervised learning fail in many situations.
• Example
◦ FLIGHT CONTROL SYSTEMS
- For the set of all sensor readings at a given time, deciding how the flight controls should function.
- In the case of supervised learning, the labels for many features are unknown.
- By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.
Learning by Interaction

• Learning to ride a bicycle
• Actions:
◦ Turn handle bars RIGHT.
◦ Turn handle bars LEFT.
• Rewards:
◦ Positive if the cycle is perfectly balanced.
◦ Negative if the angle between the cycle and the ground decreases.
Inspiration from Natural Learning

• Dopamine
◦ Neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates
◦ Regulates and controls behavior by inducing pleasurable effects equivalent to rewards
◦ Motivates us to proactively perform certain activities
Markov Decision Processes

• A discrete set of environment states S.
• A discrete set of agent actions A.
• At each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• Receives reward r_t
• State changes to s_{t+1}

Markov Assumptions

• s_{t+1} = δ(s_t, a_t)
• r_t = r(s_t, a_t)
• δ and r are usually unknown to the learner
• δ and r may also be non-deterministic
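As a small illustration of these assumptions, the following sketch encodes a made-up three-state deterministic MDP in Python, with the transition function δ and reward function r written out explicitly (the particular states, actions and rewards are invented for this example).

```python
# A toy, hypothetical deterministic MDP with three states and two actions.
STATES = ["s0", "s1", "s2"]
ACTIONS = ["stay", "advance"]

def delta(state, action):
    """Transition function: s_{t+1} = delta(s_t, a_t)."""
    if action == "stay":
        return state
    return {"s0": "s1", "s1": "s2", "s2": "s2"}[state]

def reward(state, action):
    """Reward function: r_t = r(s_t, a_t); 10 for entering s2, 0 otherwise."""
    return 10.0 if state != "s2" and delta(state, action) == "s2" else 0.0

print(delta("s1", "advance"), reward("s1", "advance"))  # -> s2 10.0
```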
Simple deterministic example
Model-free v/s Model-based Methods

• Model-free Methods
◦ Eliminate the need to know the model
◦ Learn a policy without estimating the model.
◦ Example: Q-Learning

• Model-based Methods
◦ Model consists of the Transition Probability Function and the Reward Function.
◦ Interactively estimate the model and calculate the policy.
◦ Example: Dyna
Precise Goal

• Maximization of long-term reward.
• An optimal policy has to be learnt in order to do this.
• Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.
Models of Optimal Behaviour

Three main models

• Finite Horizon Model
◦ Optimize the expected reward for the next h steps: E[ Σ_{t=0}^{h-1} r_t ]
◦ r_t represents the reward for the (t+1)th action performed
◦ h = 1 represents a greedy strategy
Models of Optimal Behaviour (2)

• Infinite Horizon Discounted Model
◦ Models future rewards as being less valuable than rewards in the present
◦ Has an infinite horizon, i.e. considers infinitely many actions in the future
◦ Makes use of a discount factor γ between 0 and 1 which represents the importance of the future: E[ Σ_{t=0}^{∞} γ^t r_t ]
◦ This is the model used in the Q-Learning algorithm
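To illustrate the discounted model described above, here is a minimal sketch that evaluates the discounted sum of a hypothetical finite reward sequence with γ = 0.9.

```python
# Discounted return for a (hypothetical) finite reward sequence:
# R = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([0, 0, 100], gamma=0.9))  # 0.81 * 100, approximately 81.0
```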
Models of Optimal Behaviour (3)

• Average-Reward Model
◦ Reward per action is considered
◦ Infinite future is taken into account
◦ Rewards in the future are equally valuable
The Learning Task

• Execute actions, observe results, and
◦ Learn a policy π : S → A maximizing V^π(s) for all states s ∈ S

• The infinite horizon discounted model is being used here
Value Function

• For each possible policy π the agent can adopt, the value function is
V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + ...

• In terms of the value function, the learning task can be reformulated as learning the optimal policy π* such that
π* = argmax_π V^π(s), for all s ∈ S
Learning the Value Function

• The value function of the optimal policy is written as
V*(s) = max_π V^π(s)

• A look-ahead search can be performed from any state to determine the optimal policy using
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

• Learning V* is only possible when the agent has perfect knowledge of both δ and r.
• Thus the need to define the Q function arises.
Example V* Values

[Figure: grid world annotated with the value V*(s) of each state, computed with γ = 0.9.]
The Q Function

• The evaluation function Q(s, a) is defined as
Q(s, a) = r(s, a) + γ V*(δ(s, a))

• If the Q-function is learnt, the optimal action can be selected without knowing δ and r:
π*(s) = argmax_a Q(s, a)
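A small sketch of this selection rule, once a Q table is available, is given below; the table values are hypothetical.

```python
# Greedy action selection from a learnt Q table (hypothetical values).
Q = {("s1", "left"): 81.0, ("s1", "right"): 100.0, ("s1", "up"): 72.9}

def optimal_action(Q, state, actions):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

print(optimal_action(Q, "s1", ["left", "right", "up"]))  # -> right
```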
Learning the Q function

• Notice the relationship between Q and V*:
V*(s) = max_{a'} Q(s, a')

• Rewriting the value of Q in terms of itself:
Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')
Learning the Q-value

• FOR each <s, a> DO
◦ Initialize table entry: Q̂(s, a) ← 0
• Observe current state s
• WHILE (true) DO
◦ Select action a and execute it
◦ Receive immediate reward r
◦ Observe new state s'
◦ Update table entry for Q̂(s, a) using the training rule:
Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(s', a')
◦ Move: record transition from s to s'
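A compact Python sketch of this tabular Q-learning procedure is given below. The function signature, the episode structure and the purely random exploration strategy are assumptions made for this illustration, not part of the original slides.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, reward, goal, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world, following the training
    rule on this slide: Q(s, a) <- r(s, a) + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)                    # all table entries start at 0
    for _ in range(episodes):
        s = random.choice(states)             # each episode has a random start state
        while s != goal:                      # episodes end at the absorbing goal state
            a = random.choice(actions(s))     # explore by picking actions at random
            s_next, r = delta(s, a), reward(s, a)
            Q[(s, a)] = r + gamma * max(
                (Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
            s = s_next
    return Q
```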
Q Learning Illustration

• Consider that an optimal strategy is being learnt for the given set of states and rewards for actions.

[Figure: grid world with goal state G; arrows denote actions, each labelled with its immediate reward: 100 for the actions entering G and 0 for all others.]

• The reward for all actions is 0 unless the action moves to the goal state.
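Using the q_learning sketch given earlier, a grid world of this kind can be encoded and solved as shown below. The exact cell layout here is an assumption (a six-cell grid with the goal in a corner); with γ = 0.9 the learnt values approach numbers such as 100, 90 and 81, consistent with the values quoted on the following slides.

```python
# Hypothetical encoding of a six-cell grid world: cells 0 1 2 on the top row
# (cell 2 is the goal G) and cells 3 4 5 on the bottom row.
GOAL = 2
MOVES = {0: {"right": 1, "down": 3},
         1: {"left": 0, "right": 2, "down": 4},
         2: {},                                   # absorbing goal state
         3: {"up": 0, "right": 4},
         4: {"left": 3, "up": 1, "right": 5},
         5: {"left": 4, "up": 2}}

actions = lambda s: list(MOVES[s]) or ["stay"]    # goal state only allows "stay"
delta = lambda s, a: MOVES[s].get(a, s)
reward = lambda s, a: 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0

Q = q_learning(list(MOVES), actions, delta, reward, GOAL)
print(round(Q[(0, "right")]), round(Q[(1, "left")]), round(Q[(1, "right")]))
# With gamma = 0.9 these converge towards 90, 81 and 100 respectively.
```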
Q Learning Illustration (2)

• The actual values of V* and Q, taking γ = 0.9, are:

[Figure: the same grid world annotated with the Q(s, a) value for each action and the V*(s) value for each state.]
Learning Step

• Since an absorbing state exists, learning will consist of a series of episodes, each with a random start state.

Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
              = 0 + 0.9 × max[63, 81, 100]
              = 90
Uncertain State Transitions

• A state transition probability function T(s, a, s') is used
◦ Denotes the probability of a transition from state s to s' when action a is performed.

• The Q function is modified to
Q(s, a) = r(s, a) + γ Σ_{s' ∈ S} T(s, a, s') max_{a'} Q(s', a')
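As a small numerical illustration of this backup, the sketch below computes one Q value from made-up transition probabilities and current estimates.

```python
# One Q backup under uncertain transitions (all numbers are hypothetical).
gamma = 0.9
r = 0.0
transitions = {"s1": 0.8, "s2": 0.2}   # T(s, a, s') for a fixed (s, a)
max_q = {"s1": 90.0, "s2": 50.0}       # current max_a' Q(s', a') for each s'

q_value = r + gamma * sum(p * max_q[s_next] for s_next, p in transitions.items())
print(q_value)  # 0 + 0.9 * (0.8*90 + 0.2*50) = 73.8
```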
Applications

• Cell Phone Channel Allocation
• Cobot: A Social Reinforcement Learning Agent
• Car Simulation: Using Reinforcement Learning
• Network Packet Routing
• Elevator Scheduling
• Use of RL to improve the performance of natural language question answering systems
• Reinforcement Learning Methods for Military Applications
Cell Phone Channel Allocation

• Learns channel allocations for cell phones
◦ Channels are limited
◦ Allocations affect adjacent cells
◦ Want to minimize dropped and blocked calls
Cell Phone Channel Allocation Cont…

• States
◦ Occupied and unoccupied channels for each cell
- Availability: number of free channels for a cell

• Actions
◦ Call arrival
- Evaluate the possible free channels
- Assign the one that has the highest value
◦ Call termination
- Free the channel
- Consider reassigning each ongoing call to the just-released channel
- Perform the reassignment (if any) with the highest value

• Rewards and Values
◦ Reward is the number of ongoing calls
Performance of FA, BDCL, and RL

[Figure: performance comparison of the three methods. FA = fixed assignment; BDCL = borrowing with directional channel locking; RL = reinforcement learning.]
Cobot: A Social Reinforcement Learning Agent

• Cobot is a software agent
◦ Applies RL in a complex, human, online social chat-based environment

• The goal is to interact with other members and to become a real part of the social fabric
• Takes certain actions on its own initiative
• Any user can reward or punish it.
• Cobot keeps an incremental database of "social statistics"
◦ e.g. how frequently and in what ways users interacted with one another, and provides summaries of these statistics as a service
Cobot Cont…

• States
◦ One state space corresponding to each user
- The state space contains a number of features containing statistics about that particular user

• Actions
◦ Null Action: choose to remain silent for this time period.
◦ Topic Starters: introduce a conversational topic.
◦ Social Commentary: make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly."
Cobot Cont…

• The RL Reward Function
◦ Reward verb
- e.g. hug
◦ Punish verb
- e.g. spank
◦ These verbs give a numerical (positive and negative, respectively) training signal to Cobot
Car Simulation: Using Reinforcement Learning

• The drivers do not know the track information beforehand
• The car takes appropriate actions to avoid bumping into the wall by learning from past experience and the given rewards

• States
◦ The state of the car is represented by two variables
- the distance of the car to the left wall of the track
- the car's velocity towards the right wall
Car Simulation cont…

• Action
◦ Turn left or right to go in the right direction

• Reward
◦ The car is given a negative reward if
- it bumps into the wall
- it goes backwards
◦ Positive if
- it goes in the correct direction
Conclusion

• The basic reinforcement learning model consists of:
◦ a set of environment states S
◦ a set of actions A, and
◦ a set of scalar "rewards"

• The learner takes actions in an environment so as to maximize its long-term reward.
• Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique.
• Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training.
References

• Bill Smart: Reinforcement Learning: A User's Guide. Department of Computer Science and Engineering, Washington University in St. Louis, ICAC 2005. http://www.cse.wustl.edu/~wds/
• Tom Mitchell: Machine Learning. McGraw Hill, 1997.
• Harmon, M., Harmon, S.: Reinforcement Learning: A Tutorial. Wright State University, 2000.
• Rich Sutton: Reinforcement Learning: Tutorial. AT&T Labs. http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt
• Kaelbling, L.P., Littman, M.L., and Moore, A.W.: "Reinforcement learning: A survey". Journal of Artificial Intelligence Research, 4, 1996.
• http://en.wikipedia.org/wiki/Reinforcement_learning
• Isbell, C., Shelton, C., Kearns, M., Singh, S., and Stone, P.: "A Social Reinforcement Learning Agent". In Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. (Winner of Best Paper Award)
Questions & Comments