
Taming the monster:
A fast and simple algorithm for contextual bandits
PRESENTED BY Satyen Kale
Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire
Learning to interact: example #1
 Loop:
› 1. Patient arrives with symptoms, medical history, genome, …
› 2. Physician prescribes treatment.
› 3. Patient’s health responds (e.g., improves, worsens).
 Goal: prescribe treatments that yield good health outcomes.
Learning to interact: example #2
 Loop:
› 1. User visits website with profile, browsing history, …
› 2. Website operator chooses content/ads to display.
› 3. User reacts to content/ads (e.g., click, “like”).
 Goal: choose content/ads that yield desired user behavior.
Contextual bandit setting (i.i.d. version)
 Set X of contexts/features and K possible actions.
 For t = 1, 2, …, T:
› 0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
› 1. Observe context x_t. [e.g., user profile, browsing history]
› 2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
› 3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
 Goal: algorithm for choosing actions a_t that yield high reward.
 Contextual setting: use features x_t to choose good actions a_t.
 Bandit setting: r_t(a) for a ≠ a_t is not observed.
› Exploration vs. exploitation trade-off.
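A minimal sketch of this interaction protocol in Python; the context dimension, the reward distribution, and the uniform placeholder action choice are illustrative assumptions, not part of the talk:

```python
import numpy as np

# Illustrative sizes; K and T are as on the slide, everything else is made up.
K = 4      # number of actions
T = 1000   # number of rounds
rng = np.random.default_rng(0)

def draw_from_D():
    """Stand-in for nature drawing (x_t, r_t) i.i.d. from D over X x [0,1]^K."""
    x = rng.normal(size=5)              # context features (illustrative)
    r = rng.uniform(0.0, 1.0, size=K)   # full reward vector (never shown to the learner)
    return x, r

total_reward = 0.0
for t in range(T):
    x_t, r_t = draw_from_D()   # step 0
    # step 1: observe x_t; step 2: choose an action (uniform placeholder policy here)
    a_t = rng.integers(K)
    total_reward += r_t[a_t]   # step 3: only r_t(a_t) is revealed
```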
Learning objective and difficulties
 No single action is good in all situations – need to exploit context.
 Policy class Π: set of functions (“policies”) from X → [K]
(e.g., advice of experts, linear classifiers, neural networks).
 Regret (i.e., relative performance) with respect to policy class Π:
Regret(T) = max_{π ∈ Π} Σ_{t=1}^{T} r_t(π(x_t)) − Σ_{t=1}^{T} r_t(a_t)
… a strong benchmark if Π contains a policy with high reward.
 Difficulties: feedback on the chosen action only informs us about a subset of the policies;
explicit bookkeeping is computationally infeasible when Π is large.
Arg max oracle (AMO)
 Given fully-labeled data (x_1, r_1), …, (x_t, r_t), the AMO returns
argmax_{π ∈ Π} Σ_{τ=1}^{t} r_τ(π(x_τ)).
 Abstraction for efficient search of policy class Π.
 In practice: implement using standard heuristics for cost-sensitive multiclass learning (e.g., convex relaxations, backprop).
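As an illustration of what the oracle computes, here is a brute-force AMO over a small, explicitly enumerable policy class; this tiny setup (threshold policies on a scalar context) is hypothetical, and real implementations reduce to a cost-sensitive classification learner instead of enumerating Π:

```python
import numpy as np

def argmax_oracle(policies, xs, rs):
    """Brute-force AMO: return the policy maximizing total reward on
    fully-labeled data (x_1, r_1), ..., (x_t, r_t).

    policies: list of callables pi(x) -> action index in {0, ..., K-1}
    xs: list of contexts; rs: list of length-K reward vectors
    """
    return max(policies, key=lambda pi: sum(r[pi(x)] for x, r in zip(xs, rs)))

# Tiny example with K = 2 actions and threshold policies on a scalar context.
rng = np.random.default_rng(1)
policies = [lambda x, th=th: int(x > th) for th in np.linspace(-1.0, 1.0, 21)]
xs = list(rng.normal(size=50))
rs = [np.array([0.7, 0.1]) if x <= 0 else np.array([0.2, 0.8]) for x in xs]
best_policy = argmax_oracle(policies, xs, rs)
```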
Our results
 New fast and simple algorithm for contextual bandits:
› Optimal regret bound (up to log factors): Õ(√(K T ln |Π|)).
› Amortized Õ(√(K / (T ln |Π|))) calls to the argmax oracle (AMO) per round.
 Comparison to previous work:
› [Thompson’33]: no general analysis.
› [ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
› [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
› [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.
Note: Exp4 also works in the adversarial setting.
Rest of this talk
1. Action distributions, reward estimates via inverse probability weights. [oldies but goodies]
2. Algorithm for finding policy distributions that balance exploration/exploitation. [New]
3. Warm-start / epoch trick. [New]
Basic algorithm structure (same as Exp4)
 Start with initial distribution Q_1 over policies Π.
 For t = 1, 2, …, T:
› 0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
› 1. Observe context x_t.
› 2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
› 2b. Draw action a_t from p_t.
› 3. Collect reward r_t(a_t).
› 4. Compute new distribution Q_{t+1} over policies Π.
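A sketch of steps 2a–2b, assuming Π is small enough to keep Q_t explicitly as a vector of weights summing to 1; the uniform smoothing parameter mu is an assumption modeled on the usual minimum-exploration trick, not a detail stated on this slide:

```python
import numpy as np

def action_distribution(Q, policies, x, K, mu):
    """Step 2a: project a distribution Q over policies onto a distribution p over
    actions, then mix in uniform exploration so every action has probability >= mu.
    Assumes sum(Q) == 1 and 0 < mu <= 1/K."""
    p = np.zeros(K)
    for q, pi in zip(Q, policies):
        p[pi(x)] += q                  # mass of policies recommending each action
    return (1.0 - K * mu) * p + mu     # smoothed distribution (still sums to 1)

# Step 2b (usage): a_t = rng.choice(K, p=action_distribution(Q_t, policies, x_t, K, mu))
```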
Inverse probability weighting (old trick)
 Importance-weighted estimate of reward from round t:
r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t)
 Unbiased, and has range & variance bounded by 1/p_t(a).
 Can estimate the total reward and regret of any policy π:
R̂_t(π) = Σ_{τ=1}^{t} r̂_τ(π(x_τ)), and the estimated regret of π is max_{π′ ∈ Π} R̂_t(π′) − R̂_t(π).
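A small sketch of the importance-weighted estimate, with a Monte-Carlo check of its unbiasedness (all numbers here are illustrative):

```python
import numpy as np

def ipw_estimate(a_t, observed_reward, p_t):
    """r_hat_t(a) = r_t(a_t) * 1{a == a_t} / p_t(a_t): unbiased for r_t(a),
    with range and variance at action a bounded by 1/p_t(a)."""
    r_hat = np.zeros_like(p_t)
    r_hat[a_t] = observed_reward / p_t[a_t]
    return r_hat

# Unbiasedness check: averaging r_hat over draws a_t ~ p_t recovers r_t.
rng = np.random.default_rng(0)
r_t = np.array([0.2, 0.9, 0.5])
p_t = np.array([0.5, 0.3, 0.2])
draws = rng.choice(3, size=200_000, p=p_t)
estimate = np.mean([ipw_estimate(a, r_t[a], p_t) for a in draws], axis=0)
# estimate is close to r_t
```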
Constructing policy distributions
Optimization problem (OP):
Find a policy distribution Q such that:
› Low estimated regret (LR) – “exploitation”
› Low estimation variance (LV) – “exploration”
Theorem: If we obtain the policy distributions Q_t by solving (OP), then with high probability the regret after T rounds is at most Õ(√(K T ln |Π|)).
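Schematically (with constants suppressed, R̂eg_t(π) denoting the estimated regret of π from the previous slide, Q^{μ}(a|x) the μ-smoothed action distribution induced by Q, and μ_t a small smoothing parameter), the two constraints have roughly the following shape; this is a paraphrase of the paper's (OP), not its exact statement:

```latex
% (LR) low estimated regret -- "exploitation":
\sum_{\pi \in \Pi} Q(\pi)\, \widehat{\mathrm{Reg}}_t(\pi) \;\le\; O(K \mu_t)

% (LV) low estimation variance -- "exploration":
\forall \pi \in \Pi: \quad
\widehat{\mathbb{E}}_{x}\!\left[ \frac{1}{Q^{\mu_t}\!\big(\pi(x) \mid x\big)} \right]
\;\le\; O(K) + \frac{\widehat{\mathrm{Reg}}_t(\pi)}{O(\mu_t)}
```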
Feasibility
 Feasibility of (OP): implied by minimax argument.
 Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
Coordinate descent algorithm
 INPUT: Initial weights Q.
 LOOP:
› IF (LR) is violated, THEN replace Q by cQ.
› IF there is a policy π causing (LV) to be violated, THEN
• UPDATE Q(π) = Q(π) + α.
› ELSE
• RETURN Q.
Claim: can check the above conditions by making one AMO call per iteration.
Above, both 0 < c < 1 and α have closed-form expressions.
(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)
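A structural sketch of this loop in Python; the helper callables (lr_violated, shrink_factor, find_lv_violator, step_size) are hypothetical placeholders for the (LR)/(LV) checks and the closed-form c and α, and the single AMO call per iteration would live inside find_lv_violator:

```python
def coordinate_descent(Q, lr_violated, shrink_factor, find_lv_violator, step_size):
    """Sketch of the coordinate descent solver for (OP).

    Q: dict mapping policy -> weight (a sub-distribution; weights may sum to < 1).
    The four callables are placeholders for the checks/quantities named on the slide.
    """
    while True:
        if lr_violated(Q):                         # (LR) violated?
            c = shrink_factor(Q)                   # closed-form 0 < c < 1
            Q = {pi: c * w for pi, w in Q.items()}
        pi = find_lv_violator(Q)                   # (LV) check: one AMO call per iteration
        if pi is None:
            return Q                               # both constraints satisfied
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)  # closed-form alpha
```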
Iteration bound for coordinate descent
# steps of coordinate descent: bounded via a potential function argument.
Also gives a bound on the sparsity of Q.
Warm-start
 If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then only Õ(√(K T / ln |Π|)) coordinate descent iterations are needed over all T rounds.
 Caveat: need one AMO call/round just to check whether (OP) is solved.
Epoch trick
 Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption).
› Therefore the same Q_t can be used for O(t) more rounds!
 Epoching: Split T rounds into epochs, solve (OP) once per epoch.
 Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …
› Total of O(log T) updates, so overall # AMO calls unchanged (up to log factors).
 Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …
› Total of O(T^{1/2}) updates, each requiring Õ(√(K / ln |Π|)) AMO calls on average.
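A small sketch of the two update schedules (an illustrative helper, not code from the talk):

```python
def update_rounds(T, schedule):
    """Rounds at which (OP) is re-solved under the epoch trick.

    "doubling": rounds 2^1, 2^2, 2^3, ...  -> O(log T) updates.
    "squares":  rounds 1^2, 2^2, 3^2, ...  -> O(T^(1/2)) updates.
    """
    if schedule == "doubling":
        rounds, r = [], 2
        while r <= T:
            rounds.append(r)
            r *= 2
        return rounds
    if schedule == "squares":
        return [i * i for i in range(1, int(T ** 0.5) + 1)]
    raise ValueError(f"unknown schedule: {schedule}")

# update_rounds(100, "doubling") -> [2, 4, 8, 16, 32, 64]
# update_rounds(100, "squares")  -> [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```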
Experiments
Algorithm        Loss    Time (seconds)
Epsilon-greedy   0.095   22
Bagging          0.059   339
Linear UCB       0.128   212,000
“Online Cover”   0.053   17
[Supervised]     0.051   6.9
Bandit problem derived from classification task (RCV1).
Reporting progressive validation loss.
“Online Cover” = variant with stateful AMO.