Machine Learning


Transcript Machine Learning

Learning
Improving the performance
of the agent
-w.r.t. the external performance
measure
Dimensions:
What can be learned?
--Any of the boxes representing
the agent’s knowledge
--action description, effect probabilities,
causal relations in the world (and the
probabilities of causation), utility models
(sort of through credit assignment), sensor
data interpretation models
What feedback is available?
--Supervised, unsupervised,
“reinforcement” learning
--Credit assignment problem
What prior knowledge is available?
-- “Tabula rasa” (agent’s head is a blank slate) or pre-existing knowledge
Dimensions of Learning
• “Representation” of the knowledge
• Degree of Guidance
– Tabula Rasa
• No background knowledge other than the training examples
– Supervised
• Teacher provides training examples & solutions
• E.g. Classification
– Unsupervised
• No assistance from teacher
• E.g. Clustering; inducing hidden variables
– In-between
• Either feedback is given only for some of the examples
– Semi-supervised Learning
• Or feedback is provided after a sequence of decisions is made
– Reinforcement Learning
– Knowledge-based learning
• Examples are interpreted in the context of existing knowledge
• Degree of Background Knowledge
– Knowledge Level vs. Speedup Learning
• If you do have background knowledge, then a question is whether the learned knowledge is “entailed” by the background knowledge or not (entailment can be logical or probabilistic)
• If it is entailed, then it is called “deductive” learning
• If it is not entailed, then it is called inductive learning
Inductive Learning
(Classification Learning)
•
Given a set of labeled examples
– Find the rule that underlies
the labeling
• (so you can use it to
predict future unlabeled
examples)
– Tabula rasa, fully supervised
• Too hard as given.. Need to
constrain the space of rules
• Bias: Start with a specific
form of hypothesis space
• With a given bias, Inductive
learning reduces to
winnowing through the
hypotheses space—checking
to see which of them fit the
data best
--similar to predicting credit card fraud,
predicting who is likely to respond to junk mail,
predicting what items you are likely to buy
Closely related to
* Function learning
or curve-fitting
(regression)
Uses different biases in predicting Russell’s waiting habits
(Figure: the same “Russell waits” training data fed to learners with different biases)
-- K-nearest neighbors
-- Decision Trees: examples are used to learn the topology and the order of questions
-- Association rules: examples are used to learn the support and confidence of association rules, e.g. “If patrons=full and day=Friday then wait (0.3/0.7)”, “If wait>60 and Reservation=no then wait (0.4/0.9)”
-- SVMs
-- Neural Nets: examples are used to learn the topology and the edge weights
-- Naïve Bayes (Bayes-net learning): examples are used to learn the topology and the CPTs over the attribute nodes (Patrons?, Wait time?, Friday?), e.g. P(Patrons | Russell waits):
         None   some   full
  RW=T   0.3    0.2    0.5
  RW=F   0.4    0.3    0.3
Inductive Learning
(Classification Learning)
•
Given a set of labeled examples,
and a space of hypotheses
– Find the rule that underlies
the labeling
• (so you can use it to
predict future unlabeled
examples)
– Tabula rasa, fully supervised
• Idea:
– Loop through all hypotheses
• Rank each hypothesis in
terms of its match to data
• Pick the best hypothesis
Main variations:
• Bias: what “sort” of rule are you looking for?
– If you are looking for only conjunctive hypotheses, there are just 3^n
• Search:
– Greedy search (decision tree learner)
– Systematic search (version space learner)
– Iterative search (neural net learner)
It can be shown that the sample complexity of PAC learning is proportional to 1/ε, 1/δ AND log |H|
The main problem is that the space of hypotheses is too large.
Given examples described in terms of n boolean variables, there are 2^(2^n) different hypotheses.
For 6 features, there are 2^64 = 18,446,744,073,709,551,616 hypotheses.
A good hypothesis will
have fewest false positives
(Fh+) and fewest false negatives (Fh-)
[Ideally, we want them to be zero]
False +ve: The learner classifies the example
as +ve, but it is actually -ve
Rank(h) = f(Fh+, Fh-)
--f depends on the domain
by default f=Sum; but can give
different weights to different
errors (Cost-based learning)
Medical domain
--Higher cost for F-
--But also high cost for F+
Spam Mailer
--Very low cost for F+
--Higher cost for F-
Terrorist/Criminal Identification
--High cost for F+ (for the individual)
--High cost for F- (for the society)
Ranking hypotheses
H1: Russell waits only in italian restaurants
false +ves: X10,
false –ves: X1,X3,X4,X8,X12
H2: Russell waits only in cheap french restaurants
False +ves:
False –ves: X1,X3,X4,X6,X8,X12
Final exam options (vote):
Option 1. Take-home (that will be given after reading day and will be due 12/10)
Option 2. In-class exam (on the 10th)
Option 3. No exam. Rao gets rest, and everybody gets A+
11/28
Blog questions
Decision Tree learning
Perceptron learning
Complexity measured in number of
Samples required to PAC-learn
What is a reasonable goal in designing
a learner?
• (Idea) The learner must classify all new instances (test cases) correctly, always
• Any test cases?
– Test cases drawn from the same distribution as the training cases
• Always?
– Maybe the training samples are not completely representative of the test samples
– So, we go with “probably”
• Correctly?
– May be impossible if the training data has noise (the teacher may make mistakes too)
– So, we go with “approximately”
• The goal of a learner then is to
produce a probably
approximately correct (PAC)
hypothesis, for a given approximation (error rate) ε and probability δ.
• When is a learner A better than
learner B?
– For the same ε, δ bounds, A needs fewer training samples than B to reach PAC.
Learning
Curves
N ≥ 1/ε (ln 1/δ + ln |H|)
Deriving Sample Complexity for PAC Learning
• IDEA: We want to compute the probability that a “bad” hypothesis (one that makes more than ε error on the test cases) is chosen for being consistent with the training examples, and constrain it to be less than δ.
• Probability that an h_b ∈ H_bad is consistent with a single training example is ≤ (1 − ε) (since the error rate of h_b > ε).
– This holds ONLY because we assume training and test instances are drawn from the same distribution.
• The probability that it is consistent with all N training examples is ≤ (1 − ε)^N.
• The probability that at least one bad hypothesis does this is ≤ |H_bad| (1 − ε)^N ≤ |H| (1 − ε)^N (since |H_bad| ≤ |H|).
• We want this probability to be less than δ.
– That is, |H| (1 − ε)^N ≤ δ.
• Since (1 − ε) ≤ e^(−ε), we can have it if |H| e^(−εN) ≤ δ, i.e.
  N ≥ 1/ε (ln 1/δ + ln |H|)
(Figure: the target function f and a bad hypothesis h_b shown over a scatter of +/− training points; h_b disagrees with f on more than an ε fraction of the instances.)
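To get a feel for the bound, here is a quick numeric check (a Python sketch; the ε, δ and |H| values below are purely illustrative):

import math

def pac_sample_bound(epsilon, delta, hypothesis_space_size):
    # N >= (1/eps) * (ln(1/delta) + ln|H|)
    return math.ceil((1.0 / epsilon) *
                     (math.log(1.0 / delta) + math.log(hypothesis_space_size)))

# conjunctive hypotheses over n=6 boolean features: |H| = 3^6 = 729
print(pac_sample_bound(0.1, 0.05, 3 ** 6))    # about 96 examples
# all boolean functions over 6 features: |H| = 2^(2^6) = 2^64
print(pac_sample_bound(0.1, 0.05, 2 ** 64))   # about 474 examples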
Learning Decision Trees---How?
Basic Idea:
--Pick an attribute
--Split examples in terms
of that attribute
--If all examples are +ve
label Yes. Terminate
--If all examples are –ve
label No. Terminate
--If some are +ve, some are –ve
continue splitting recursively
(Special case: Decision Stumps
If you don’t feel like splitting
any further, return the majority
label )
Which one to pick?
Decision Trees & Sample
Complexity
• Decision Trees can represent any boolean function
• ..So PAC-learning decision trees should be exponentially hard (since there are 2^(2^n) hypotheses)
• ..however, decision tree learning algorithms use
greedy approaches for learning a good (rather than
the optimal) decision tree
– Thus, using greedy rather than exhaustive search of
hypotheses space is another way of keeping complexity
low (at the expense of losing PAC guarantees)
Depending on the order we pick, we can get smaller or bigger trees
Which tree is better?
Why do you think so??
Basic Idea:
--Pick an attribute
--Split examples in terms
of that attribute
--If all examples are +ve
label Yes. Terminate
--If all examples are –ve
label No. Terminate
--If some are +ve, some are –ve
continue splitting recursively
--if no attributes left to split?
(label with majority element)
Would you split on
patrons or Type?
The Information Gain Computation
Given a node with N+ positive and N- negative examples:
  P+ = N+ / (N+ + N-)
  P- = N- / (N+ + N-)
  I(P+, P-) = -P+ log2(P+) - P- log2(P-)
  (the expected number of bits/comparisons needed to tell whether a given example is +ve or -ve)
In general, given k mutually exclusive and exhaustive events E1…Ek whose probabilities are p1…pk, the “information” content (entropy) is defined as Σ_i -pi log2 pi.
Splitting on feature fk partitions the examples into subsets with counts (N1+, N1-), (N2+, N2-), …, (Nk+, Nk-), each with its own I(Pi+, Pi-).
A split is good if it reduces the entropy. The residual information after splitting on fk is
  Σ_{i=1..k} [Ni+ + Ni-] / [N+ + N-] · I(Pi+, Pi-)
The difference between I(P+, P-) and this residual information is the information gain.
So, pick the feature with the largest information gain, i.e. the smallest residual information.
A simple example

  Ex   Masochistic   Anxious   Nerdy   HATES EXAM
  1    F             T         F       Y
  2    F             F         T       N
  3    T             F         F       N
  4    T             T         T       Y

V(M) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1
V(A) = 2/4 * I(1, 0) + 2/4 * I(0, 1) = 0
V(N) = 2/4 * I(1/2, 1/2) + 2/4 * I(1/2, 1/2) = 1

So Anxious is the best attribute to split on.
Once you split on Anxious, the problem is solved.
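A minimal sketch of this computation (Python); the attribute names and the residual_info helper are just for this toy table, and the best attribute is the one with the smallest residual information (largest gain):

import math

def entropy(pos, neg):
    # I(P+, P-) = -P+ log2 P+ - P- log2 P-
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p, q = pos / total, neg / total
    return -p * math.log2(p) - q * math.log2(q)

def residual_info(examples, attr, label='hates_exam'):
    # Sum_i (Ni+ + Ni-)/(N+ + N-) * I(Pi+, Pi-) after splitting on attr
    n = len(examples)
    rem = 0.0
    for value in set(e[attr] for e in examples):
        subset = [e for e in examples if e[attr] == value]
        pos = sum(1 for e in subset if e[label] == 'Y')
        rem += len(subset) / n * entropy(pos, len(subset) - pos)
    return rem

examples = [
    {'masochistic': 'F', 'anxious': 'T', 'nerdy': 'F', 'hates_exam': 'Y'},
    {'masochistic': 'F', 'anxious': 'F', 'nerdy': 'T', 'hates_exam': 'N'},
    {'masochistic': 'T', 'anxious': 'F', 'nerdy': 'F', 'hates_exam': 'N'},
    {'masochistic': 'T', 'anxious': 'T', 'nerdy': 'T', 'hates_exam': 'Y'},
]
for attr in ['masochistic', 'anxious', 'nerdy']:
    print(attr, residual_info(examples, attr))   # 'anxious' gives 0.0 -> best split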
m-fold cross-validation
Split N examples into
m equal sized parts
for i=1..m train with
all parts except ith
test with the ith part
Evaluating the Decision Trees
Russell Domain
Lesson:
Every bias makes
some concepts easier to
learn and others
harder to learn…
“Majority” function
(say yes if majority of
attributes are yes)
Learning curves… Given N examples, partition them into Ntr the training set and Ntest the test instances
Loop for i = 1 to |Ntr|
  Loop for Ns in subsets of Ntr of size i
    Train the learner over Ns
    Test the learned pattern over Ntest and compute the accuracy (% correct)
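A sketch of that procedure (Python); train and classify are placeholders for whatever learner is being plotted, and random subsets of each size are averaged rather than enumerating all subsets:

import random

def learning_curve(examples, train, classify, n_test=25, trials=20):
    # examples: list of (x, y) pairs; partition into a training pool and a held-out test set
    random.shuffle(examples)
    test, pool = examples[:n_test], examples[n_test:]
    curve = []
    for i in range(1, len(pool) + 1):
        accs = []
        for _ in range(trials):                       # average over random subsets of size i
            subset = random.sample(pool, i)
            h = train(subset)                         # learner returns a hypothesis
            correct = sum(classify(h, x) == y for x, y in test)
            accs.append(correct / len(test))
        curve.append((i, sum(accs) / len(accs)))      # (training-set size, mean accuracy)
    return curve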
Problems with Info. Gain Heuristics
• Feature correlation: We are splitting on one feature at a time
• The Costanza party problem
– No obvious easy solution…
• Overfitting: We may look too hard for patterns where there are none
– E.g. Coin tosses classified by the day of the week, the shirt I was
wearing, the time of the day etc.
– Solution: Don’t consider splitting if the information gain given by
the best feature is below a minimum threshold
• Can use the χ² test for statistical significance
– Will also help when we have noisy samples…
• We may prefer features with very high branching
– e.g. Branch on the “universal time string” for Russell restaurant example
– Branch on social security number to look for patterns on who will get an A
– Solution: “gain ratio” --ratio of information gain with the attribute A to the
information content of answering the question “What is the value of A?”
• The denominator is smaller for attributes with smaller domains.
Decision Stumps
• Decision stumps are decision
trees where the leaf nodes do not
necessarily have all +ve or all –
ve training examples
– Could happen either because
examples are noisy and misclassified or because you want to
stop before reaching pure leaves
• When you reach that node, you
return the majority label as the
decision.
• (We can associate a confidence
with that decision using the P+
and P-)
(Figure: splitting on feature fk sends the N+ positive and N- negative examples into leaves with counts (N1+, N1-), (N2+, N2-), …, (Nk+, Nk-); the confidence at a leaf is estimated as P+ = N1+ / (N1+ + N1-).)
Sometimes, the best decision tree for a problem
could be a decision stump (see coin toss example next)
12/3: The Last Class
Fill & Return the participation forms
Today’s Agenda: Perceptrons (until 11:30)
Interactive Review
Take Home Final will be delivered by e-mail ~Wed
Will be due ~12/10
Decision Surface Learning
(aka Neural Network Learning)
• Idea: Since classification is
really a question of finding
a surface to separate the
+ve examples from the -ve
examples, why not directly
search in the space of
possible surfaces?
• Mathematically, a surface
is a function
– Need a way of learning
functions
– “Threshold units”
“Neural Net” is a collection of threshold units
with interconnections
A threshold unit with inputs I1, I2, weights w1, w2, and threshold t = k outputs
  1 if w1·I1 + w2·I2 > k
  0 otherwise
(The hard threshold is not differentiable; a differentiable alternative is used for gradient-based learning.)
Feed Forward (uni-directional connections)
  -- Single Layer: any linear decision surface can be represented by a single-layer neural net
  -- Multi-Layer: any “continuous” decision surface (function) can be approximated to any degree of accuracy by some 2-layer neural net
Recurrent (bi-directional connections)
  -- Can act as associative memory
The “Brain” Connection
A Threshold Unit …is sort of like a neuron
Threshold Functions (the hard threshold vs. a differentiable one)
Perceptron Networks
What happened to the
“Threshold”?
--Can model as an extra
weight with static input
A unit with inputs I1, I2, weights w1, w2, and threshold t = k is equivalent to a unit with an extra input I0 = -1, weight w0 = k, and threshold t = 0.
Perceptron Learning
• Perceptron learning algorithm
Loop through training examples
– If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the jth input unit by α·Ij, where Ij is the jth input value and α a learning rate
– If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the jth input unit by α·Ij
– Otherwise, do nothing
Until “convergence”
A nice applet at:
http://neuron.eng.wayne.edu/java/Perceptron/New38.html
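A minimal sketch of this update rule (Python), with the threshold folded in as weight w0 on a constant -1 input as described above; the AND data and the epoch limit are only illustrative:

def perceptron_train(examples, n_inputs, alpha=0.1, epochs=100):
    # w[0] plays the role of the threshold (its input is always I0 = -1)
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        converged = True
        for x, target in examples:            # x: tuple of inputs, target: 0 or 1
            inputs = [-1.0] + list(x)
            out = 1 if sum(wi * ii for wi, ii in zip(w, inputs)) > 0 else 0
            if out != target:                 # mistake: nudge each weight by alpha*(T-O)*Ij
                converged = False
                for j in range(len(w)):
                    w[j] += alpha * (target - out) * inputs[j]
        if converged:
            break
    return w

# e.g. learn AND, which is linearly separable
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = perceptron_train(data, n_inputs=2)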
Perceptron Learning as Gradient
Descent Search in the weight-space
Define the error over the training examples as
  E = ½ Σ_i (T - O)²
where T is the target output and O the unit’s output, O = g(Σ_j W_j I_j). So
  E(W) = ½ Σ_i ( T - g(Σ_j W_j I_j) )²
with
  g(x) = 1 / (1 + e^(-x))    (sigmoid fn)
  g′(x) = g(x)(1 - g(x))
Differentiating with respect to a weight W_j gives
  ∂E/∂W_j = - I_j (T - O) g′(Σ_j W_j I_j)
so gradient descent updates each weight as
  W_j ← W_j + α I_j (T - O) g′(Σ_j W_j I_j)
(Often a constant learning rate parameter is used instead.)
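The same loop for a sigmoid unit, using the gradient update above (a sketch; a constant learning rate is assumed, and the data format matches the perceptron sketch earlier):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_sigmoid_unit(examples, n_inputs, alpha=0.5, epochs=5000):
    w = [0.0] * (n_inputs + 1)                        # w[0] is the bias weight (input -1)
    for _ in range(epochs):
        for x, target in examples:
            inputs = [-1.0] + list(x)
            net = sum(wi * ii for wi, ii in zip(w, inputs))
            out = sigmoid(net)
            # Wj <- Wj + alpha * Ij * (T - O) * g'(net), with g'(x) = g(x)(1 - g(x))
            grad = (target - out) * out * (1.0 - out)
            for j in range(len(w)):
                w[j] += alpha * grad * inputs[j]
    return w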
Can Perceptrons Learn All Boolean Functions?
--Are all boolean functions linearly separable?
Comparing Perceptrons and Decision Trees
in Majority Function and Russell Domain
Majority function
Majority function is linearly separable..
Russell Domain
Russell domain is apparently not....
Encoding: one input unit per attribute. The unit takes as many
distinct real values as the size of attribute domain
Max-Margin Classification &
Support Vector Machines
• Any line that separates the +ve & –ve examples is a solution
• And perceptron learning finds one of them
– But could we have a preference among these?
– We may want the line that provides maximum margin (equidistant from the nearest +ve/-ve examples)
• The nearest +ve and –ve points holding up the line are called support vectors
•
This changes the problem into an
optimization one
– Quadratic Programming can be
used to directly find such a line
Learning is Optimization after all!
Lagrangian Dual
Two ways to learn non-linear
decision surfaces
(1) First transform the data into a higher-dimensional space, then find a linear surface there
– Which is guaranteed to exist
• Transform it back to the original space
• The TRICK is to do this without explicitly doing the transformation
(2) Learn non-linear surfaces directly (as multi-layer neural nets)
• The trick is to do training efficiently
– Back Propagation to the rescue..
Linear Separability in High Dimensions
“Kernels” allow us to consider separating surfaces in high-D
without first converting all points to high-D
Kernelized Support Vector Machines
• Turns out that it is not always necessary to first map the data into high-D and then do linear separation
• The quadratic programming formulation for SVMs winds up using only the pair-wise dot products of training vectors
• The dot product is a form of similarity metric between points
• If you replace that dot product by any non-linear function, you will, in essence, be transforming the data into some high-dimensional space and then finding the max-margin linear classifier in that space
– Which will correspond to some wiggly surface in the original dimension
• The trick is to find the RIGHT similarity function
– Which is a form of prior knowledge
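One way to see the “replace the dot product” idea without the full quadratic-programming machinery is a kernelized perceptron (a sketch, not the SVM formulation above); the polynomial kernel and the XOR data are only for illustration:

def poly_kernel(x, y, degree=2):
    # (x . y + 1)^degree -- implicitly maps the points into a higher-dimensional space
    return (sum(a * b for a, b in zip(x, y)) + 1) ** degree

def kernel_perceptron(examples, kernel, epochs=50):
    # Keep one multiplier per training point instead of an explicit weight vector
    alphas = [0.0] * len(examples)
    for _ in range(epochs):
        for i, (x, y) in enumerate(examples):         # labels y in {-1, +1}
            score = sum(a * yj * kernel(xj, x)
                        for a, (xj, yj) in zip(alphas, examples))
            if y * score <= 0:                        # mistake: remember this example
                alphas[i] += 1.0
    def classify(x):
        s = sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alphas, examples))
        return 1 if s > 0 else -1
    return classify

# XOR is not linearly separable, but it is separable under the polynomial kernel
xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
classify = kernel_perceptron(xor, poly_kernel)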
Polynomial Kernel: K(A, A′) = ((A/100 − 1)(A′/100 − 1) − 0.5)^6
Those who ignore easily available domain knowledge are
doomed to re-learn it…
Santayana’s brother
Domain-knowledge & Learning
• Classification learning is a problem addressed by both people
from AI (machine learning) and Statistics
• Statistics folks tend to “distrust” domain-specific bias.
– Let the data speak for itself…
– ..but this is often futile. The very act of “describing” the data points
introduces bias (in terms of the features you decided to use to
describe them..)
• …but much human learning occurs because of strong domain-specific bias..
• Machine learning is torn by these competing influences..
– In most current state of the art algorithms, domain knowledge is
allowed to influence learning only through relatively narrow
avenues/formats (E.g. through “kernels”)
• Okay in domains where there is very little (if any) prior knowledge (e.g.
what part of proteins are doing what cellular function)
• ..restrictive in domains where there already exists human expertise..
Multi-layer Neural Nets
How come back-prop
doesn’t get stuck in
local minima?
One answer: It is actually hard for local minima to form in high-D, as the “trough” has to be closed in all dimensions
Multi-Network Learning can learn Russell Domains
Russell Domain
…but does it slowly…
Practical Issues in Multi-layer
network learning
• For multi-layer networks, we need to learn both
the weights and the network topology
– Topology is fixed for perceptrons
• If we go with too many layers and connections, we
can get over-fitting as well as sloooow
convergence
– Optimal brain damage
• Start with more than needed hidden layers as well as
connections; after a network is learned, remove the nodes and
connections that have very low weights; retrain
Humans make 0.2%
Neumans (postmen) make 2%
Other impressive applications:
-- no-hands across America
-- learning to speak
K-nearest-neighbor
The test example’s class is determined by the class of the majority of its k nearest neighbors
Need to define an appropriate distance measure
-- sort of easy for real-valued vectors
-- harder for categorical attributes
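A sketch of the k-nearest-neighbor rule for real-valued vectors, using Euclidean distance (as noted above, categorical attributes would need a different distance measure):

import math
from collections import Counter

def knn_classify(query, examples, k=3):
    # examples: list of (vector, label) pairs; distance: Euclidean
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    # majority label among the k nearest neighbors
    return Counter(label for _, label in nearest).most_common(1)[0][0]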
True hypothesis eventually dominates…
probability of indefinitely producing uncharacteristic data → 0
Bayesian prediction is optimal
(Given the hypothesis prior,
all other predictions are less likely)
Also, remember the Economist article that shows
that humans have strong priors..
..note that the Economist article says humans are
able to learn from few examples only because of priors..
So, BN learning is just probability estimation!
(as long as data is complete!)
How Well (and WHY) DOES NBC
WORK?
• Naïve bayes classifier is darned easy to implement
– Good learning speed, classification speed
– Modest space storage
– Supports incrementality
• It seems to work very well in many scenarios
– Lots of recommender systems (e.g. Amazon books recommender) use it
– Peter Norvig, the director of Machine Learning at GOOGLE said, when
asked about what sort of technology they use “Naïve bayes”
• But WHY?
– NBC’s estimate of class probability is quite bad
•
BUT classification accuracy is different from probability estimate accuracy
– [Domingos & Pazzani, 1996] analyze this
Bayes Network Learning
• Bias: The relation between the class label and class
attributes is specified by a Bayes Network.
• Approach
– Guess Topology
– Estimate CPTs
(Figure: a naïve Bayes net for the Russell domain — class node “Russell waits (RW)” with attribute children Wait time?, Patrons?, Friday?; e.g. the CPT for P(Patrons | RW): RW=T → None 0.3, some 0.2, full 0.5; RW=F → None 0.4, some 0.3, full 0.3.)
• Simplest case: Naïve Bayes
– Topology of the network is “class label” causes all the attribute
values independently
– So, all we need to do is estimate CPTs P(attrib|Class)
• In Russell domain, P(Patrons|willwait)
– P(Patrons=full|willwait=yes)=
#training examples where patrons=full and will wait=yes
#training examples where will wait=yes
– Given a new case, we use bayes rule to compute the class label
Class label is the disease; attributes are symptoms
Naïve Bayesian Classification
• Problem: Classify a given example E into one of the
classes among [C1, C2 ,…, Cn]
– E has k attributes A1, A2 ,…, Ak and each Ai can take d
different values
• Bayes Classification: Assign E to class Ci that
maximizes P(Ci | E)
P(Ci| E) = P(E| Ci) P(Ci) / P(E)
• P(Ci) and P(E) are a priori knowledge (or can be easily extracted
from the set of data)
• Estimating P(E|Ci) is harder
– Requires P(A1=v1 A2=v2….Ak=vk|Ci)
• Assuming d values per attribute, we will need n·d^k probabilities
• Naïve Bayes Assumption: Assume all attributes are independent, so P(E | Ci) = Π_i P(Ai=vi | Ci)
– The assumption is BOGUS, but it seems to WORK (and needs only n·d·k probabilities)
NBC in terms of BAYES networks.. (figure contrasting the NBC assumption with a more realistic assumption)
Estimating the probabilities for NBC
Given an example E described as A1=v1 A2=v2….Ak=vk we want to
compute the class of E
– Calculate P(Ci | A1=v1 A2=v2….Ak=vk) for all classes Ci and say
that the class of E is the one for which P(.) is maximum
– P(Ci | A1=v1 A2=v2….Ak=vk) = Π_i P(Ai=vi | Ci) P(Ci) / P(A1=v1 A2=v2….Ak=vk)
  (the denominator is a common factor across classes)
Given a set of training N examples that have already been classified into n
classes Ci
Let #(Ci) be the number of examples that are labeled as Ci
Let #(Ci, Ai=vi) be the number of examples labeled as Ci that have attribute Ai set to value vi
P(Ci) = #(Ci)/N
P(Ai=vi | Ci) = #(Ci, Ai=vi) / #(Ci)
Example
P(willwait=yes) = 6/12 = .5
P(Patrons=“full”|willwait=yes) = 2/6=0.333
P(Patrons=“some”|willwait=yes)= 4/6=0.666
Similarly we can show that
P(Patrons=“full”|willwait=no)
=0.6666
P(willwait=yes|Patrons=full) = P(Patrons=full|willwait=yes) * P(willwait=yes) / P(Patrons=full)
                             = k * 0.333 * 0.5
P(willwait=no|Patrons=full)  = k * 0.666 * 0.5
(where k = 1/P(Patrons=full) is the common normalizing factor)
Using M-estimates to improve probability estimates
• The simple frequency based estimation of P(Ai=vj|Ck) can be
inaccurate, especially when the true value is close to zero, and
the number of training examples is small (so the probability that
your examples don’t contain rare cases is quite high)
• Solution: Use M-estimate
P(Ai=vi | Ci) = [#(Ci, Ai=vi) + m·p] / [#(Ci) + m]
– p is the prior probability of Ai taking the value vi
• If we don’t have any background information, assume uniform
probability (that is 1/d if Ai can take d values)
– m is a constant—called “equivalent sample size”
• If we believe that our sample set is large enough, we can keep m small.
Otherwise, keep it large.
• Essentially we are augmenting the #(Ci) normal samples with m more
virtual samples drawn according to the prior probability on how Ai takes
values
– Popular values p=1/|V| and m=|V| where V is the size of the vocabulary
Also, to avoid overflow errors do addition of logarithms of probabilities
(instead of multiplication of probabilities)
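Putting the NBC pieces together, a sketch (Python) that uses M-estimates and sums of log-probabilities; the example/attribute representation here is an assumption, not from the slides:

import math
from collections import defaultdict

def train_nbc(examples, m=1.0):
    # examples: list of (attribute_dict, class_label) pairs
    class_count = defaultdict(int)                # #(Ci)
    value_count = defaultdict(int)                # #(Ci, Ai=vi)
    domain = defaultdict(set)                     # observed values of each attribute
    for attrs, c in examples:
        class_count[c] += 1
        for a, v in attrs.items():
            value_count[(c, a, v)] += 1
            domain[a].add(v)
    def classify(attrs):
        # attributes in the query are assumed to appear in the training data
        best, best_score = None, float('-inf')
        for c, nc in class_count.items():
            score = math.log(nc / len(examples))                 # log P(Ci)
            for a, v in attrs.items():
                p = 1.0 / len(domain[a])                         # uniform prior p = 1/d
                # M-estimate: (#(Ci, Ai=v) + m*p) / (#(Ci) + m)
                score += math.log((value_count[(c, a, v)] + m * p) / (nc + m))
            if score > best_score:
                best, best_score = c, score
        return best
    return classify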
Reinforcement Learning
Based on slides from Bill Smart
http://www.cse.wustl.edu/~wds/
What is RL?
“a way of programming agents by reward and
punishment without needing to specify how the
task is to be achieved”
[Kaelbling, Littman, & Moore, 96]
Basic RL Model
1. Observe state, st
2. Decide on an action, at
3. Perform action
4. Observe new state, st+1
5. Observe reward, rt+1
6. Learn from experience
7. Repeat
(Figure: the agent sends actions A to the World, which returns states S and rewards R.)
Goal: Find a control policy that will maximize the observed
rewards over the lifetime of the agent
An Example: Gridworld
Canonical RL domain
• States are grid cells
• 4 actions: N, S, E, W
• Reward of +1 for entering the top right cell
• -0.01 for every other move
• Maximizing the sum of rewards (i.e., minimizing the accumulated -0.01 penalties) ⇒ shortest path, in this instance
The Promise of Learning
The Promise of RL
Specify what to do, but not how to do it
• Through the reward function
• Learning “fills in the details”
Better final solutions
• Based on actual experience, not programmer assumptions
Less (human) time needed for a good solution
Learning Value Functions
We still want to learn a value function
• We’re forced to approximate it iteratively
• Based on direct experience of the world
Four main algorithms:
• Certainty equivalence
• Temporal Difference (TD) learning
• Q-learning
• SARSA
Certainty Equivalence
Collect experience by moving through the world
• s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, r4, s4, a4, r5, s5, ...
Use these to estimate the underlying MDP
• Transition function, T: S × A → S
• Reward function, R: S × A × S → ℝ
Compute the optimal value function for this MDP
• And then compute the optimal policy from it
Temporal Difference (TD)
[Sutton, 88]
TD-learning estimates the value function directly
• Don’t try to learn the underlying MDP
Keep an estimate of V^π(s) in a table
• Update these estimates as we gather more experience
• Estimates depend on the exploration policy, π
• TD is an on-policy method
TD-Learning Algorithm
1. Initialize V^π(s) to 0, ∀s
2. Observe state, s
3. Perform action, π(s)
4. Observe new state, s′, and reward, r
5. V^π(s) ← (1 − α)V^π(s) + α(r + γ V^π(s′))
6. Go to 2
0 ≤ α ≤ 1 is the learning rate
• How much attention do we pay to new experiences
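A sketch of this loop in code (Python); the env.reset()/env.step() interface is an assumed stand-in for interaction with the world:

from collections import defaultdict

def td_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    # Tabular TD(0) estimate of V^pi
    V = defaultdict(float)                        # V^pi(s), initialized to 0
    for _ in range(episodes):
        s = env.reset()                           # assumed interface: reset() -> state
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)         # assumed: step(a) -> (state, reward, done)
            V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s_next])
            s = s_next
    return V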
TD-Learning
V^π(s) is guaranteed to converge to V*(s)
• After an infinite number of experiences
• If we decay the learning rate:
  Σ_{t=0}^∞ α_t = ∞  and  Σ_{t=0}^∞ α_t² < ∞
  • α_t = c / (c + t) will work
In practice, we often don’t need value convergence
• Policy convergence generally happens sooner
Actor-Critic Methods
[Barto, Sutton, & Anderson, 83]
TD only evaluates a particular policy
• Does not learn a better policy
We can change the policy as we learn V
• Policy is the actor
• Value-function estimate is the critic
(Figure: the policy (actor) sends action a to the World, which returns state s and reward r to the value function (critic) and to the actor.)
Success is generally dependent on the starting policy being “good enough”
Q-Learning
[Watkins & Dayan, 92]
Q-learning iteratively approximates the state-action
value function, Q
• Again, we’re not going to estimate the MDP directly
• Learns the value function and policy simultaneously
Keep an estimate of Q(s, a) in a table
• Update these estimates as we gather more
experience
• Estimates do not depend on exploration policy
• Q-learning is an off-policy method
Q-Learning Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it
4. Observe next state, s′, and reward, r
5. Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′))
6. Go to 2
0 ≤ α ≤ 1 is the learning rate
• We need to decay this, just like TD
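A tabular sketch of the algorithm (Python), again assuming an env.reset()/env.step() interface; it uses ε-greedy action selection, which the next slide discusses. For SARSA, the max over next actions would be replaced by the Q-value of the action actually chosen next.

import random
from collections import defaultdict

def q_learn(env, actions, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(lambda: random.uniform(-0.01, 0.01))   # small random initial Q(s, a)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                  # explore
                a = random.choice(actions)
            else:                                          # exploit: greedy on current Q
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q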
Picking Actions
We want to pick good actions most of the time, but
also do some exploration
• Exploring means that we can learn better policies
• But, we want to balance known good actions with
exploratory ones
• This is called the exploration/exploitation problem
Picking Actions
ε-greedy
• With probability ε, pick a random action
• Otherwise, pick the best (greedy) action
Boltzmann (Soft-Max)
• Pick an action based on its Q-value:
  P(a | s) = e^(Q(s,a)/τ) / Σ_{a′} e^(Q(s,a′)/τ), where τ is the “temperature”
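A sketch of Boltzmann action selection (Python), assuming a table Q indexed by (state, action) pairs as in the Q-learning sketch above:

import math
import random

def boltzmann_action(Q, s, actions, temperature=1.0):
    # P(a|s) proportional to exp(Q(s,a)/temperature)
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]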
SARSA
SARSA iteratively approximates the state-action
value function, Q
• Like Q-learning, SARSA learns the policy and the
value function simultaneously
Keep an estimate of Q(s, a) in a table
• Update these estimates based on experiences
• Estimates depend on the exploration policy
• SARSA is an on-policy method
• Policy is derived from current value estimates
SARSA Algorithm
1. Initialize Q(s, a) to small random values, ∀s, a
2. Observe state, s
3. Pick an action, a, and do it (just like Q-learning)
4. Observe next state, s′, and reward, r
5. Q(s, a) ← (1 − α)Q(s, a) + α(r + γ Q(s′, π(s′)))
6. Go to 2
0 ≤ α ≤ 1 is the learning rate
• We need to decay this, just like TD
On-Policy vs. Off-Policy
On-policy algorithms
• Final policy is influenced by the exploration policy
• Generally, the exploration policy needs to be “close”
to the final policy
• Can get stuck in local maxima
Off-policy algorithms
• Final policy is independent of exploration policy
• Can use arbitrary exploration policies
• Will not get stuck in local maxima
Convergence Guarantees
The convergence guarantees for RL are “in the
limit”
• The word “infinite” crops up several times
Don’t let this put you off
• Value convergence is different than policy
convergence
• We’re more interested in policy convergence
• If one action is really better than the others, policy
convergence will happen relatively quickly
Rewards
Rewards measure how well the policy is doing
• Often correspond to events in the world
• Current load on a machine
• Reaching the coffee machine
• Program crashing
• Everything else gets a 0 reward
Things work better if the rewards are incremental
• For example, distance to goal at each step
• These reward functions are often hard to design
The Markov Property
RL needs a set of states that are Markov
• Everything you need to know to make a decision is included in the state
• Not allowed to consult the past
Rule-of-thumb
• If you can calculate the reward function from the state without any additional information, you’re OK
(Figure: a gridworld from S to G with a key K; “holding key” and “not holding key” are separate states.)
But, What’s the Catch?
RL will solve all of your problems, but:
• We need lots of experience to train from
• Taking random actions can be dangerous
• It can take a long time to learn
• Not all problems fit into the MDP framework
Learning Policies Directly
An alternative approach to RL is to reward whole
policies, rather than individual actions
• Run whole policy, then receive a single reward
• Reward measures success of the whole policy
If there are a small number of policies, we can
exhaustively try them all
• However, this is not possible in most interesting
problems
Policy Gradient Methods
Assume that our policy, π, has a set of n real-valued parameters, θ = {θ1, θ2, θ3, ..., θn}
• Running the policy with a particular θ results in a reward, r_θ
• Estimate the reward gradient, ∂R/∂θi, for each θi
• Update: θi ← θi + α ∂R/∂θi
  (here α is another learning rate)
Policy Gradient Methods
This results in hill-climbing in policy space
• So, it’s subject to all the problems of hill-climbing
• But, we can also use tricks from search, like random
restarts and momentum terms
This is a good approach if you have a
parameterized policy
• Typically faster than value-based methods
• “Safe” exploration, if you have a good policy
• Learns locally-best parameters for that policy
An Example: Learning to Walk
[Kohl & Stone, 04]
RoboCup legged league
• Walking quickly is a big advantage
Robots have a parameterized gait controller
• 11 parameters
• Controls step length, height, etc.
Robots walk across soccer pitch and are timed
• Reward is a function of the time taken
An Example: Learning to Walk
Basic idea
1. Pick an initial θ = {θ1, θ2, ..., θ11}
2. Generate N testing parameter settings by perturbing θ:
   θj = {θ1 + δ1, θ2 + δ2, ..., θ11 + δ11},  δi ∈ {−ε, 0, ε}
3. Test each setting, and observe rewards: θj → rj
4. For each θi ∈ θ, compute the average reward of the settings where θi was perturbed by +ε, by 0, and by −ε, and set θ′i ← θi + d, where d is +ε, 0, or −ε according to which group had the largest average reward
5. Set θ ← θ′, and go to 2
(A short code sketch of this loop follows below.)
An Example: Learning to Walk
Initial
Final
Video: Nate Kohl & Peter Stone, UT Austin
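A sketch of the parameter-perturbation loop from the “Learning to Walk” example above (Python); evaluate stands in for timing an actual robot run, and the step size and counts are illustrative:

import random

def policy_search(theta, evaluate, epsilon=0.05, n_policies=15, iterations=50):
    # theta: list of real-valued policy parameters; evaluate(theta) -> reward
    for _ in range(iterations):
        # Generate perturbed settings and score them
        deltas = [[random.choice((-epsilon, 0.0, epsilon)) for _ in theta]
                  for _ in range(n_policies)]
        rewards = [evaluate([t + d for t, d in zip(theta, ds)]) for ds in deltas]
        new_theta = []
        for i, t in enumerate(theta):
            # Average reward of the settings where parameter i was moved -eps, 0, +eps
            avg = {}
            for step in (-epsilon, 0.0, epsilon):
                rs = [r for ds, r in zip(deltas, rewards) if ds[i] == step]
                avg[step] = sum(rs) / len(rs) if rs else float('-inf')
            new_theta.append(t + max(avg, key=avg.get))   # step in the best direction
        theta = new_theta
    return theta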
Value Function or Policy Gradient?
When should I use policy gradient?
• When there’s a parameterized policy
• When there’s a high-dimensional state space
• When we expect the gradient to be smooth
When should I use a value-based method?
• When there is no parameterized policy
• When we have no idea how to solve the problem
Summary for Part I
Background
• MDPs, and how to solve them
• Solving MDPs with dynamic programming
• How RL is different from DP
Algorithms
• Certainty equivalence
• TD
• Q-learning
• SARSA
• Policy gradient