ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 1: Course Logistics, Introduction
August 23, 2012
Dr. Itamar Arel
College of Engineering
Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2012
But first, a quick anonymous survey …
Outline
Course logistics and requirements
Course roadmap & outline
Introduction
Course Objectives
Introduce the concepts & principles governing reinforcement-based machine learning systems
Review fundamental theory:
  Markov Decision Processes (MDPs)
  Dynamic Programming (DP)
  Practical systems; the role of Neural Networks in NDP
  RL learning schemes (Q-Learning, TD-Learning, etc.)
  Limitations of existing techniques and how they can be improved
Discuss software and hardware implementation considerations
Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents
Course Prerequisites
A course on probability theory, or equivalent background in probability theory, is required
Matlab/C/C++ competency
  A Matlab tutorial has been posted on the course website (under the schedule page)
Open-mindedness & imagination …
Course Assignments
2 small projects
  Main goal – provide students with basic hands-on experience in ML behavioral simulation and result interpretation
  Analysis complemented by simulation
  MATLAB programming oriented
  Reports should include all background, explanations and results
5 problem assignment sets
  Will cover the majority of the topics discussed in class
  Assignments should be handed in before the beginning of class
Final project
  Each student/group is assigned a topic
  Project report & in-class presentation
Sony AIBO Lab
Located at MK 630
6 Sony AIBO dog robots (3rd generation)
Local wireless network (for communicating with the dogs)
Code for lab project(s) will be written in Matlab
  An interface has been prepared
Time slots should be coordinated with the instructor & TA
Textbooks & Reference Material
Lecture notes will be posted weekly on the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) as well as …
  Updated schedule
  Assignment sets, sample codes
  Grades
  General announcements
Reading assignments will be posted on the schedule page
Textbook:
  R. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” 1998. (available online!)
Grading policy & office hours
2 Small Projects – 25% (12.5 points each)
5 Assignment sets – 25% (5 points each)
Midterm (in-class) – 20%
Final project – 30%
Instructor: Dr. Itamar Arel
  Office hours: T/Th 2:00 – 3:00 PM (MK 608)
  Email: [email protected]
TA: Derek Rose ([email protected]) – office @ MK 606
  Office hours: contact the TA
Students are strongly encouraged to visit the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates, etc.
UTK Academic Honesty Statement
An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.
Bottom line: DO YOUR OWN WORK!
Understanding and constructing intelligence
What is intelligence? How do we define/evaluate it?
How can we design adaptive systems that optimize their performance as time goes by?
What are the limitations of RL-based algorithms?
How can artificial Neural Networks help us scale intelligent systems?
In what ways can knowledge be efficiently represented?
This course is NOT about …
  Robotics
  Machine learning (in the general sense)
  Legacy AI – symbolic reasoning, logic, etc.
  Image/vision/signal processing
  Control systems theory
Why the course name “RL in AI”?
Course Outline & Roadmap
Introduction
Review of basic probability theory
  Discrete-time/space probability theory
  Discrete Markov Chains
Dynamic Programming
  Markov Decision Processes (MDPs)
  Partially Observable Markov Decision Processes (POMDPs)
Approximate Dynamic Programming (a.k.a. Reinforcement Learning)
  Temporal Difference (TD) Learning, Planning
Midterm – Tuesday, Oct 9, 2012
Neuro-Dynamic Programming
  Feedforward & Recurrent Neural Networks
  Neuro-dynamic RL architecture
Applications and case studies
Final project presentations – Nov 27 – Dec 4, 2012
A detailed schedule is posted on the course website
What is Machine Learning?
Discipline focusing on computer algorithms that learn to perform “intelligent” tasks
Learning is based on observation of data
Generally: learning to do better in the future based on what has been observed/experienced in the past
ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc.
Examples of “ML” problems:
  Optical character recognition
  Face detection
  Spoken language understanding
  Customer segmentation
  Weather prediction, etc.
Introduction
Why do we need good ML technology?
  Human beings are lazy creatures …
  Service robotics – a $10B market in 2015, in Japan alone!
  Pattern recognition (speech, vision)
  Data mining
  Military applications
  … many more
Many ML problems can be formulated as RL problems …
Introduction (cont.)
Learning by interacting with our environment is probably the first thing that occurs to us when we think about the nature of learning
Humans have no direct teachers
We do have a direct sensorimotor connection to the environment
We learn as we go along
  Interaction with the environment teaches us what “works” and what doesn’t
  We construct a “model” of our environment
This course explores a computational approach to learning from interaction with the environment
What is Reinforcement Learning
Reinforcement learning is learning what to do – how to map situations to actions – in order to maximize a long-term objective function driven by rewards.
It is a form of unsupervised learning
Two key components at the core of RL:
  Trial-and-error – adapting the internal representation, based on experience, to improve future performance
  Delayed reward – actions are produced so as to yield long-term (not just short-term) rewards
The “agent” must be able to:
  Sense its environment
  Produce actions that can affect the environment
  Have a goal (a “momentary” cost metric) relating to its state
What is Reinforcement Learning (cont.)
RL attempts to solve the credit assignment problem:
  What is the long-term impact of an action taken now?
Unique to RL systems – a major challenge in ML
  Necessitates an accurate model of the environment being controlled/interacted with
  Something animals and humans do very well, and computers do very poorly
We’ll spend most of the semester formalizing solutions to this problem:
  Philosophically
  Computationally
  Practically (implementation considerations)
The Big Picture
Artificial Intelligence ⊃ Machine Learning ⊃ Reinforcement Learning
Types of Machine Learning:
  Supervised Learning: learn from labeled examples
  Unsupervised Learning: process unlabeled examples
    Example: clustering data into groups
  Reinforcement Learning: learn from interaction
    Defined by the problem
    Many approaches are possible (including evolutionary)
    Here we will focus on a particular family of approaches
    Autonomous learning
Software vs. Hardware
Historically, ML has been on CS turf
  Confinement to the Von Neumann architecture
Software limits scalability …
  The human brain has ~10^11 processors operating at once
  However, each runs at only ~150 Hz
  It’s the massive parallelism that gives it its power
  Even 256 processors is not “massive parallelism”
The “Computer Engineering” perspective:
  FPGA devices (reconfigurable computing)
  GPUs
  ASIC prospects
  UTK/MIL group focus
Exploration vs. Exploitation
A fundamental trade-off in RL:
  Exploitation of what worked in the past (to yield high reward)
  Exploration of new, alternative action paths, so as to learn how to make better action selections in the future
The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task
On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
We will review mathematical methods proposed to address this basic issue
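
To make the trade-off concrete, below is a minimal sketch (in Python, for illustration only; the course itself uses MATLAB) of the classic ε-greedy rule on a toy multi-armed-bandit task. The reward model and all names here are assumptions, not course code: with probability ε the agent explores a random action; otherwise it exploits the current best estimate, and each action’s expected reward is estimated by an incremental sample average over many tries.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))                      # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

# On a stochastic task each action must be tried many times, so we keep
# an incremental sample-average estimate of each action's expected reward.
q = [0.0] * 5                        # value estimates for 5 hypothetical actions
n = [0] * 5                          # how many times each action was tried
for _ in range(1000):
    a = epsilon_greedy(q)
    r = random.gauss(0.1 * a, 1.0)   # stand-in noisy reward (an assumption)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]        # incremental sample-average update
```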
The Reinforcement Learning Framework
[Diagram: the Agent (a.k.a. Controller) sends Actions to the Environment (a.k.a. Plant); the Environment returns Observations and a Reward to the Agent.]
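
In code, the closed loop in the diagram might be sketched as follows (a hypothetical Python skeleton, not course material; the class and method names are placeholders):

```python
class Environment:                       # a.k.a. the Plant
    def step(self, action):
        """Apply the action; return (observation, reward)."""
        raise NotImplementedError        # problem-specific

class Agent:                             # a.k.a. the Controller
    def act(self, observation):
        """Map the current observation to an action (the policy)."""
        raise NotImplementedError

    def learn(self, obs, action, reward, next_obs):
        """Adapt the internal representation based on experience."""
        raise NotImplementedError

def run(agent, env, obs, steps=100):
    """The closed loop: observe, act, receive a reward, learn, repeat."""
    for _ in range(steps):
        action = agent.act(obs)
        next_obs, reward = env.step(action)
        agent.learn(obs, action, reward, next_obs)
        obs = next_obs
```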
Some Examples of RL
A master chess player
  Planning – anticipating replies and counter-replies
  Immediate, intuitive judgment
A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station
Playing backgammon
  Obviously, a strategy is necessary
  Some luck is involved (a stochastic game)
In all cases, the agent tries to achieve a goal despite uncertainty about its environment
The effect of an action cannot be fully predicted
Experience allows the agent to improve its performance over time
Origins of Reinforcement Learning
Artificial Intelligence
Control Theory (MDP)
Operations Research
Cognitive Science and Psychology
More recently, Neuroscience
RL has solid foundations and is a well-established research field
Elements of Reinforcement Learning
Beyond the agent and the environment, we have the following four main elements in RL:
1) Policy – defines the learning agent’s way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states.
  o Usually stochastic (adapts as you go along)
  o Enough to determine the agent’s behavior
2) Reward function – defines the goal in an RL learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
  o The agent’s goal is to maximize the reward over time
  o May be stochastic
  o Drives the policy employed and its adaptation
Elements of Reinforcement Learning (cont.)
3) Value function – whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run.
  o Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
  o It allows the agent to look over the “horizon”
  o Actions are derived from value estimates, not rewards
  o We measure rewards, but we estimate and act upon values – this corresponds to strategic/long-term thinking
  o Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals)
  o Obtaining a good value function is a key challenge in designing good RL systems
Elements of Reinforcement Learning (cont.)
4) Model – an observable entity that mimics the behavior of the environment.
  o For example, given a state and action, the model might predict the resultant next state and next reward
  o As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain
  o Models are used for planning – any way of deciding on a course of action by considering possible future scenarios before they actually occur (see the sketch below)
  o Note that RL can work (sometimes very well) with an incomplete model
  o We’ll go over a range of model platforms to achieve the above
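
A minimal sketch of such planning, assuming a predictive model with the (hypothetical) signature model(state, action) -> (next_state, reward) and a value estimate over states:

```python
def plan_one_step(state, actions, model, value):
    """Pick the action whose *imagined* one-step outcome looks best.

    The model is consulted instead of the real environment, so candidate
    futures are weighed before any action is actually taken.
    """
    def lookahead(action):
        next_state, predicted_reward = model(state, action)  # imagined step
        return predicted_reward + value(next_state)
    return max(actions, key=lookahead)
```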
As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to solve optimally.
An Extended Example: Tic-Tac-Toe
Consider a classical tic-tac-toe game, where the winner places three marks in a row, horizontally, vertically or diagonally
Let’s assume:
  We are playing against an imperfect player
  Draws and losses are equally bad for us
Q: Can we design a player that will find imperfections in the opponent’s play and learn to maximize its chances of winning?
Classical machine learning schemes would never visit a state that has the potential to lead to a loss
We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss
An Extended Example: Tic-Tac-Toe (cont.)
Using dynamic programming (DP), we can compute an optimal solution for any opponent
  However, we would need a specification of the opponent (e.g. state-action probabilities)
  Such information is usually unavailable to us
In RL, we estimate this information from experience
  We later apply DP, or other sequential decision-making schemes, based on the model we obtained from experience
A policy tells the agent how to make its next move based on the state of the board
  Winning probabilities can be derived by knowing the opponent
An Extended Example: Tic-Tac-Toe (cont.)
How do we solve this in RL?
  Set up a table of numbers – one for each state of the game
  Each number will reflect the probability of winning from that particular state
  This is treated as the state’s value, and the entire learned table denotes the value function
  If V(a) > V(b), then state a is preferred over state b
Initial values:
  All states with three X’s in a row have a win probability of 1
  All states with three O’s in a row have a win probability of 0
  All other states are preset to a probability of 0.5
When playing the game, we make the move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero-valued states (exploratory moves)
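
A sketch of this setup (assuming a board state encoded as a 9-character string of 'X', 'O', and ' '; the encoding and helper names are illustrative, not from the slides):

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def wins(player, state):
    """True if `player` has three marks in a row in the 9-char board string."""
    return any(all(state[i] == player for i in line) for line in LINES)

values = {}   # state -> estimated probability of winning from that state

def V(state):
    """Lazily initialize: 1 for three X's in a row, 0 for three O's, else 0.5."""
    if state not in values:
        if wins("X", state):
            values[state] = 1.0
        elif wins("O", state):
            values[state] = 0.0
        else:
            values[state] = 0.5
    return values[state]

def choose_move(candidate_states, epsilon=0.1):
    """Mostly greedy on value (exploitation), occasionally random (exploration)."""
    if random.random() < epsilon:
        return random.choice(candidate_states)
    return max(candidate_states, key=V)
```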
An Extended Example: Tic-Tac-Toe (cont.)
While playing, we update the values of the states
The current value of the earlier state is adjusted to be closer to the value of the later state:
  V(s) ← V(s) + α[V(s′) − V(s)]
where 0 < α < 1 is a learning parameter (step-size parameter), s is the state before the move, and s′ is the state after the move
This update rule is an example of a temporal-difference learning method
The method performs quite well – it converges to the optimal policy (for a fixed opponent)
It can be adjusted to allow for slowly-changing opponents
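
In code, the update is a one-liner applied after each move (reusing the hypothetical values table from the sketch above; ALPHA is the step-size parameter):

```python
ALPHA = 0.1   # step-size parameter, 0 < ALPHA < 1

def td_update(values, s, s_next):
    """Temporal-difference update: move V(s) toward V(s')."""
    values[s] += ALPHA * (values[s_next] - values[s])
```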
An Extended Example: Tic-Tac-Toe (cont.)
RL features encountered …
  Emphasis on learning from interaction – in this case, with the opponent
  A clear goal – correct planning takes into account delayed rewards
    For example, setting up traps for a shortsighted opponent
  No model of the opponent exists a priori
Although this example is a good one, RL methods can also …
  be applied to infinite-horizon problems (with no terminal states)
  be applied to cases where there is no external adversary (e.g. a “game against nature”)
Backgammon example: ~10^20 states, made tractable using neural nets
For next class …
Read chapter 1 in SB (up to and including section 1.5)
Read chapter 2 in SB