mstreeter_nescai_200.. - Carnegie Mellon University

Transcript mstreeter_nescai_200.. - Carnegie Mellon University

An Asymptotically
Optimal Algorithm for
the Max k-Armed
Bandit Problem
Matthew Streeter
& Stephen Smith
Carnegie Mellon University
NESCAI, April 29 2006
1
Outline



Problem statement & motivations
Modeling payoff distributions
An asymptotically optimal algorithm
2
D1
Probability
The k-Armed
Bandit
Payoff



Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 1
D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 2
D3
Probability

You are in a room with k
slot machines
Pulling the arm of
machine i returns a payoff
drawn (independently at
random) from unknown
distribution Di
Allowed n total pulls
Goal: maximize total
payoff
> 50 years of papers
Probability

Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 3
3
D1
Probability
The Max k-Armed
Bandit
Payoff



Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 1
D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 2
D3
Probability

You are in a room with k
slot machines
Pulling the arm of
machine i returns a payoff
drawn (independently at
random) from unknown
distribution Di
Allowed n total pulls
Goal: maximize highest
payoff
Introduced ~2003
Probability

Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Machine 3
4

D1
Probability
The Max k-Armed
Bandit: Motivations
Given: some optimization
problem, k randomized
heuristics
Assumption:
Each time you run
a run hasTabu
each
the Search
same
heuristic, get a solution
computational cost
with a certain quality
Allowed n runs
Hill Climbing
Goal: maximize quality of
best solution
Cicirello & Smith (2005)
show competitive
performance on RCPSP
Simulated
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Probability

D2
Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.


D3
Probability

Payoff
Quick Time™a nd a
TIFF ( Unco mpre ssed ) dec ompr esso r
ar e nee ded to see this pictur e.
Annealing
5
The Max k-Armed
Bandit: Example
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.

Given n pulls, what strategy maximizes the
(expected) maximum payoff?


If n=1, should pull arm 1 (higher mean)
If n=1000, should pull arm 2 (higher variance)
6
Modeling Payoff
Distributions
7
Can’t Handle Arbitrary Payoff
Distributions
Probability
1
Arm 1
0.8
Arm 2
0.6
0.4
0.2
0
0

1
2
3
Payoff
4
5
Needle in the haystack: can’t distinguish arms
until you get payoff > 0, at which point highest
payoff can’t be improved
8
Assumption



We will assume each machine returns
payoff from a generalized extreme value
(GEV) distribution
Why? Extremal Types Theorem: max. of n
independent draws from some fixed
distribution converges in distribution a GEV
Compare to Central Limit Theorem: sum
converges in distribution
of n draws
a Gaussian
9
The GEV distribution

Z has a GEV distribution if
1 

  z   s 
Pr[Z  z]  exp1 s
 
     


for constants s, , and  > 0.

 determines mean
 determines standard deviation
s determines shape
10
Example payoff distribution:
Job Shop Scheduling




Job shop scheduling: assign start times to
operations, subject to constraints.
Length of schedule = latest completion
time of any operation
Goal: find a schedule with minimum
length
Many heuristics (branch and bound,
simulated annealing...)
11
Example payoff distribution:
Job Shop Scheduling



“ft10” is a notorious instance of the job
shop scheduling problem
Heuristic h: do hill-climbing 500 times
Ran h 1000 times on ft10; fit GEV to
payoff data
12
Example payoff distribution:
Distribution truncated
Job shop scheduling
at 931. Optimal schedule
length = 930 (Carlier
1 

& Pinson, 1986)
 z  931 .123 
G fit (z)  exp
  197


QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
-(schedule length)





Best of 50,000 sampled
schedules has length
1014
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
num. runs
13
An Asymptotically
Optimal Algorithm
14
Notation



mi(t) = expected maximum payoff you get
from pulling the ith arm t times
m*(t) = max1ik mi(t)
S(t) = expected maximum payoff you get
by following strategy S for t pulls
15
The Algorithm

Strategy S* ( and  to be determined):

For i from 1 to k:


For remaining n-kD pulls:


Using D pulls, estimate mi(n). Pick D so that with
probability 1-, estimate is within  of true mi(n).
Pull arm with max. estimated mi(n)
Guarantee: S*(n) = m*(n) - o(1).
16
The GEV distribution

Z has a GEV distribution if
1 

  z   s 
Pr[Z  z]  exp1 s
 
     


for constants s, , and  > 0.

 determines mean
 determines standard deviation
s determines shape
17
Behavior of the GEV
E[max. payoff]
20
s>0
s=0
15
Lots of algebra
Not so bad
s<0
10
5
0
1
10
100
1000
10000
Num. pulls
18
Predicting mi(n)

Estimation procedure: linear interpolation!
E[max. payoff]
20
15
10
Empirical mi(1)
Predicted mi(n)
5
Empirical mi(2)
0
1
10
100
1000
10000
Num. pulls

Estimate mi(1) and mi(2) , then interpolate to get
mi(n)
19
Predicting mi(n): Lemma


Let X be a random variable with
(unknown) mean  and standard deviation
 max. O(-2 log -1) samples of X suffice to
obtain an estimate  such that with
probability at least 1-, estimate is within 
of true value.
Proof idea: use “median of means”
20
Predicting mi(n)
E[max. payoff]
20
15
10
Empirical mi(1)
Predicted mi(n)
5
Empirical mi(2)
0
1
10
100
1000
10000
Num. pulls


Equation for line:
mi(n) = mi(1)+[mi(1)-mi(2)](log n)
Estimating mi(n) requires O((log n)2 -2 log -1)
pulls
21
The Algorithm

Strategy S* ( and  to be determined):

For i from 1 to k:


Using D pulls, estimate mi(n). Pick D so that with
probability 1-, estimate is within  of true mi(n).
For remaining n-kD pulls:

Pull arm with max. predicted mi(n)

Guarantee: S*(n) = m*(n) - o(1)

Three things make S* less than optimal:





m*(n) - m*(n-kD)
22
Analysis

Three things make S* less than optimal:







m*(n) - m*(n-kD)
Setting =n-2, =n-1/3 takes care of the first two.
Then:
m*(n)-m*(n-kD)
= O(log n - log(n-kD))
= O(kD/n)
= O(k(log n)2 -2(log -1)/n)
= O(k(log n)3 n-1/3) = o(1)
23
Summary & Future Work



Defined max k-armed bandit problem
and discussed applications to heuristic
search
Presented an asymptotically optimal
algorithm for GEV payoff distributions
(we analyzed special case s=0)
Working on applications to scheduling
problems
24
The Extremal Types Theorem

Define Mn = max. of n draws, and suppose
  
lim n P rn Mn  z  G(z) z
where each rn is a linear “rescaling function”.
Then G is either a point mass or a “generalized

extreme value distribution”:
1 

  z   s 
G(z)  exp1 s
 
     


for constants s, , and  > 0.

25

mstreeter_nescai_200.. - Carnegie Mellon University

Transcript mstreeter_nescai_200.. - Carnegie Mellon University

Directory