mstreeter_nescai_200.. - Carnegie Mellon University
Download
Report
Transcript mstreeter_nescai_200.. - Carnegie Mellon University
An Asymptotically
Optimal Algorithm for
the Max k-Armed
Bandit Problem
Matthew Streeter
& Stephen Smith
Carnegie Mellon University
NESCAI, April 29 2006
1
Outline
Problem statement & motivations
Modeling payoff distributions
An asymptotically optimal algorithm
2
D1
Probability
The k-Armed
Bandit
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 1
D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 2
D3
Probability
You are in a room with k
slot machines
Pulling the arm of
machine i returns a payoff
drawn (independently at
random) from unknown
distribution Di
Allowed n total pulls
Goal: maximize total
payoff
> 50 years of papers
Probability
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 3
3
D1
Probability
The Max k-Armed
Bandit
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 1
D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 2
D3
Probability
You are in a room with k
slot machines
Pulling the arm of
machine i returns a payoff
drawn (independently at
random) from unknown
distribution Di
Allowed n total pulls
Goal: maximize highest
payoff
Introduced ~2003
Probability
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 3
4
D1
Probability
The Max k-Armed
Bandit: Motivations
Given: some optimization
problem, k randomized
heuristics
Assumption:
Each time you run
a run hasTabu
each
the Search
same
heuristic, get a solution
computational cost
with a certain quality
Allowed n runs
Hill Climbing
Goal: maximize quality of
best solution
Cicirello & Smith (2005)
show competitive
performance on RCPSP
Simulated
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Probability
D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
D3
Probability
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Annealing
5
The Max k-Armed
Bandit: Example
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Given n pulls, what strategy maximizes the
(expected) maximum payoff?
If n=1, should pull arm 1 (higher mean)
If n=1000, should pull arm 2 (higher variance)
6
Modeling Payoff
Distributions
7
Can’t Handle Arbitrary Payoff
Distributions
Probability
1
Arm 1
0.8
Arm 2
0.6
0.4
0.2
0
0
1
2
3
Payoff
4
5
Needle in the haystack: can’t distinguish arms
until you get payoff > 0, at which point highest
payoff can’t be improved
8
Assumption
We will assume each machine returns
payoff from a generalized extreme value
(GEV) distribution
Why? Extremal Types Theorem: max. of n
independent draws from some fixed
distribution converges in distribution a GEV
Compare to Central Limit Theorem: sum
converges in distribution
of n draws
a Gaussian
9
The GEV distribution
Z has a GEV distribution if
1
z s
Pr[Z z] exp1 s
for constants s, , and > 0.
determines mean
determines standard deviation
s determines shape
10
Example payoff distribution:
Job Shop Scheduling
Job shop scheduling: assign start times to
operations, subject to constraints.
Length of schedule = latest completion
time of any operation
Goal: find a schedule with minimum
length
Many heuristics (branch and bound,
simulated annealing...)
11
Example payoff distribution:
Job Shop Scheduling
“ft10” is a notorious instance of the job
shop scheduling problem
Heuristic h: do hill-climbing 500 times
Ran h 1000 times on ft10; fit GEV to
payoff data
12
Example payoff distribution:
Distribution truncated
Job shop scheduling
at 931. Optimal schedule
length = 930 (Carlier
1
& Pinson, 1986)
z 931 .123
G fit (z) exp
197
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
-(schedule length)
Best of 50,000 sampled
schedules has length
1014
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
num. runs
13
An Asymptotically
Optimal Algorithm
14
Notation
mi(t) = expected maximum payoff you get
from pulling the ith arm t times
m*(t) = max1ik mi(t)
S(t) = expected maximum payoff you get
by following strategy S for t pulls
15
The Algorithm
Strategy S* ( and to be determined):
For i from 1 to k:
For remaining n-kD pulls:
Using D pulls, estimate mi(n). Pick D so that with
probability 1-, estimate is within of true mi(n).
Pull arm with max. estimated mi(n)
Guarantee: S*(n) = m*(n) - o(1).
16
The GEV distribution
Z has a GEV distribution if
1
z s
Pr[Z z] exp1 s
for constants s, , and > 0.
determines mean
determines standard deviation
s determines shape
17
Behavior of the GEV
E[max. payoff]
20
s>0
s=0
15
Lots of algebra
Not so bad
s<0
10
5
0
1
10
100
1000
10000
Num. pulls
18
Predicting mi(n)
Estimation procedure: linear interpolation!
E[max. payoff]
20
15
10
Empirical mi(1)
Predicted mi(n)
5
Empirical mi(2)
0
1
10
100
1000
10000
Num. pulls
Estimate mi(1) and mi(2) , then interpolate to get
mi(n)
19
Predicting mi(n): Lemma
Let X be a random variable with
(unknown) mean and standard deviation
max. O(-2 log -1) samples of X suffice to
obtain an estimate such that with
probability at least 1-, estimate is within
of true value.
Proof idea: use “median of means”
20
Predicting mi(n)
E[max. payoff]
20
15
10
Empirical mi(1)
Predicted mi(n)
5
Empirical mi(2)
0
1
10
100
1000
10000
Num. pulls
Equation for line:
mi(n) = mi(1)+[mi(1)-mi(2)](log n)
Estimating mi(n) requires O((log n)2 -2 log -1)
pulls
21
The Algorithm
Strategy S* ( and to be determined):
For i from 1 to k:
Using D pulls, estimate mi(n). Pick D so that with
probability 1-, estimate is within of true mi(n).
For remaining n-kD pulls:
Pull arm with max. predicted mi(n)
Guarantee: S*(n) = m*(n) - o(1)
Three things make S* less than optimal:
m*(n) - m*(n-kD)
22
Analysis
Three things make S* less than optimal:
m*(n) - m*(n-kD)
Setting =n-2, =n-1/3 takes care of the first two.
Then:
m*(n)-m*(n-kD)
= O(log n - log(n-kD))
= O(kD/n)
= O(k(log n)2 -2(log -1)/n)
= O(k(log n)3 n-1/3) = o(1)
23
Summary & Future Work
Defined max k-armed bandit problem
and discussed applications to heuristic
search
Presented an asymptotically optimal
algorithm for GEV payoff distributions
(we analyzed special case s=0)
Working on applications to scheduling
problems
24
The Extremal Types Theorem
Define Mn = max. of n draws, and suppose
lim n P rn Mn z G(z) z
where each rn is a linear “rescaling function”.
Then G is either a point mass or a “generalized
extreme value distribution”:
1
z s
G(z) exp1 s
for constants s, , and > 0.
25