Contextual Bandits - UVa CS
Contextual Bandits in a
Collaborative Environment
Qingyun Wu¹, Huazheng Wang¹, Quanquan Gu², Hongning Wang¹
¹Department of Computer Science
²Department of Systems & Information Engineering
University of Virginia
Running Example: News Recommendation
Exploitation-Exploration dilemma:
• Exploit: focus on information known to raise the user's interest
• Explore: try new information to improve the user experience in the long run
Exploitation vs. Exploration
• Three articles with unknown true expected rewards: θ1 = 0.6, θ2 = 0.4, θ3 = 0.8.
• A purely greedy learner tries each article once, then keeps selecting the one with the highest current estimate:

  t          1          2          3          4          5          6          …
  Selection  Article 1  Article 2  Article 3  Article 1  Article 1  Article 1  …  Article 1
  θ1 est.    0.3        0.3        0.3        0.4        0.5        0.55       …  0.6
  θ2 est.    –          0.2        0.2        0.2        0.2        0.2        …  0.2
  θ3 est.    –          –          0.1        0.1        0.1        0.1        …  0.1

• After one unlucky observation (0.1) from Article 3, its estimate never recovers because the article is never selected again: the learner locks onto Article 1 even though Article 3 has the highest true reward (0.8).
• Exploit too early and too much! A minimal simulation of this failure mode follows.
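The lock-in above is easy to reproduce. Below is a minimal simulation sketch (assumptions mine, not from the slides: Bernoulli rewards with the means shown above, and ε-greedy as the exploring alternative):

    import numpy as np

    def run(epsilon, T=5000, seed=7):
        rng = np.random.default_rng(seed)
        theta = np.array([0.6, 0.4, 0.8])       # true expected rewards (unknown to the learner)
        pulls = np.zeros(3)
        estimates = np.zeros(3)
        total = 0.0
        for t in range(T):
            if t < 3:                            # initialization: try each arm once
                a = t
            elif rng.random() < epsilon:         # explore uniformly at random
                a = int(rng.integers(3))
            else:                                # exploit the current best estimate
                a = int(np.argmax(estimates))
            r = float(rng.random() < theta[a])   # Bernoulli reward
            pulls[a] += 1
            estimates[a] += (r - estimates[a]) / pulls[a]   # running average
            total += r
        return total / T, estimates

    for eps in (0.0, 0.1):
        avg, est = run(eps)
        print(f"epsilon={eps}: average reward {avg:.3f}, estimates {np.round(est, 2)}")

With ε = 0 the learner often plateaus on whichever arm looked best in the first few rounds; even a small ε keeps every estimate converging to its true mean.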
Running Example: News Recommendation
Multi-armed Bandit formulation:
Player → Recommendation system
Arms → News articles (with side information)
Reward → User click
A Collaborative Environment
• Motivation:
  o Cold-start problem in personalized systems
  o Bandits are not independent
    -- Social influence among users: e.g., content and opinion sharing among friends in a social network
• Solution: model the dependency among users
Related work
• LinUCB [1,2]:
  -- Uses ridge regression and an Upper Confidence Bound (UCB) to balance exploitation and exploration, as sketched below.
  -- Assumes bandits/users are independent.
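For concreteness, LinUCB's arm-selection rule can be written as follows (standard notation, not taken from the slide: \hat{\theta}_u is user u's ridge-regression estimate, A_u its regularized Gram matrix, \alpha the exploration weight):

    a_t = \arg\max_{a \in \mathcal{A}} \; \mathbf{x}_a^{\top}\hat{\theta}_u + \alpha \sqrt{\mathbf{x}_a^{\top} A_u^{-1} \mathbf{x}_a},
    \qquad A_u = \lambda I + \sum_{\tau < t} \mathbf{x}_{a_\tau}\mathbf{x}_{a_\tau}^{\top}.

The first term exploits the current estimate; the second grows for arms whose contexts are poorly covered by past observations, which drives exploration.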
Related work
• GOB.Lin [5]:
  -- Uses ridge regression and an Upper Confidence Bound to balance exploitation and exploration (as in LinUCB).
  -- Adds graph-Laplacian-based regularization on top of ridge regression to model dependency (see the sketch below).
  -- Similar users are assumed to share similar parameters.
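A sketch of such network-based regularization (my reconstruction, not the slide's exact formula: θ_u is user u's parameter and E the set of social edges):

    \min_{\theta_1,\dots,\theta_N} \; \sum_{t} \big( r_t - \mathbf{x}_t^{\top}\theta_{u_t} \big)^2
    + \lambda \sum_{(u,v)\in E} \|\theta_u - \theta_v\|^2

The penalty term is the quadratic form induced by the graph Laplacian, so connected users are pulled toward similar parameters.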
Related work
• CLUB [4]:
  -- Adaptively clusters users into groups according to the closeness of their estimated bandit models.
  -- The threshold for deleting edges is based on the confidence balls of the users' models.
Model Assumption
• A collaborative assumption: a user's expected reward is a weighted average of her neighbors' expected rewards.
• The observation noise is R-sub-Gaussian.
Model Assumption
[Figure: a social network of users; each user's estimated reward on a news article is a weighted average of the expected rewards of her neighbors.]
Model Assumption
[Figure: the same assumption written with context vectors, which let the estimates generalize across arms.]
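A plausible rendering of this assumption (my notation, treat as a sketch: θ_v is user v's preference vector, W the user relation matrix with rows summing to one, x_a the context vector of arm a):

    \mathbb{E}\big[ r_{a,u} \big] = \mathbf{x}_a^{\top} \sum_{v=1}^{N} w_{uv}\,\theta_v,
    \qquad \sum_{v} w_{uv} = 1,

with the observed reward equal to this expectation plus R-sub-Gaussian noise. Because the reward is linear in the shared context vector x_a, one observation informs the estimates for every arm.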
Exploitation
• Objective function: ridge regression, with information propagation among users through the relational matrix W (a sketch follows).
• Parameter estimation: ridge regression (closed-form solution).
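A minimal sketch of such an objective (my reconstruction, assuming Θ = (θ_1, …, θ_N) stacks all user parameters and w_{u_t} is the relation vector of the user served at time t):

    \min_{\Theta} \; \sum_{t} \Big( r_{a_t,u_t} - \mathbf{x}_{a_t}^{\top}\, \Theta\, \mathbf{w}_{u_t} \Big)^2 + \lambda\, \|\Theta\|_F^2.

Because W mixes all users' parameters inside the loss, an observation from one user also updates her neighbors' estimates; this is the information propagation the slide refers to.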
Exploration
• UCB-type exploration strategy: every iteration, score each arm by an Upper Confidence Bound, i.e., the estimated reward (exploitation) plus a confidence interval (exploration).
Exploration
• It can be proved that, with probability at least 1 − δ, the parameter estimation error is bounded; the exploration parameter is set to this upper bound.
Arm selection strategy
• Exploitation vs. exploration: select the arm that maximizes the sum of the exploitation term (estimated reward) and the exploration term (confidence interval).
Algorithm: CoLin
[Figure: full CoLin pseudo-code]
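The slide presents the full pseudo-code; below is a minimal, hedged numpy sketch of a CoLin-style learner under the assumptions above (class and variable names are mine, not the paper's):

    import numpy as np

    class CoLinSketch:
        """Sketch of a collaborative linear UCB in the spirit of CoLin.

        Assumption (mine): user u's expected reward on context x is
        x^T sum_v W[u, v] * theta_v, i.e., a single linear model over
        the stacked feature phi = kron(W[u], x).
        """

        def __init__(self, n_users, dim, W, alpha=0.5, lam=1.0):
            self.W = W                    # (N, N) user relation matrix, rows sum to 1
            self.alpha = alpha            # exploration weight
            k = n_users * dim
            self.A = lam * np.eye(k)      # ridge-regularized Gram matrix
            self.b = np.zeros(k)          # accumulated reward-weighted features

        def _phi(self, u, x):
            # Collaborative feature: couples x with every user's block of
            # the stacked parameter vector, weighted by row u of W.
            return np.kron(self.W[u], x)

        def select(self, u, arms):
            """Return the index of the arm (context vector) with the highest UCB."""
            A_inv = np.linalg.inv(self.A)
            theta = A_inv @ self.b                    # ridge-regression estimate
            best, best_score = 0, -np.inf
            for i, x in enumerate(arms):
                phi = self._phi(u, x)
                exploit = phi @ theta                 # estimated reward
                explore = self.alpha * np.sqrt(phi @ A_inv @ phi)  # confidence width
                if exploit + explore > best_score:
                    best, best_score = i, exploit + explore
            return best

        def update(self, u, x, reward):
            """Rank-one update after observing (user, context, reward)."""
            phi = self._phi(u, x)
            self.A += np.outer(phi, phi)
            self.b += reward * phi

Inverting A costs O((Nd)^3) per round; a practical implementation would maintain A^{-1} incrementally via the Sherman-Morrison formula, which is presumably why the Future Work slide mentions decentralizing the computation over the sparse graph.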
Regret Analysis
• Regret: not the optimization target, but a standard performance metric.
• Regret at time t = optimal reward − received reward; the cumulated regret sums this over all t.
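In symbols (standard definition, with a_t* denoting the optimal arm at round t):

    R(T) = \sum_{t=1}^{T} \Big( \mathbb{E}\big[r_{a_t^{*},u_t}\big] - \mathbb{E}\big[r_{a_t,u_t}\big] \Big).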
Regret Analysis
• Lemma 1: For any δ > 0, with probability at least 1 − δ, the estimation error of the bandit parameters in CoLin satisfies an upper bound; define α_t as this upper bound.
• Theorem 1: With probability at least 1 − δ, the cumulated regret of CoLin satisfies the corresponding bound.
Regret Analysis
• Comparison with LinUCB (assume each user interacts with the learner T/N times):
  -- When W is the identity matrix, i.e., nothing is shared among users, CoLin reduces to LinUCB.
  -- When W is uniform, i.e., all users are uniformly connected and share observations, CoLin achieves a regret reduction over LinUCB.
Regret Analysis
• Comparison with GOB.Lin (the baseline from the related-work slide):
  -- Uses graph-Laplacian-based model regularization to model dependency.
  -- Assumes neighboring bandit parameters are close to each other.
  -- Only captures the connectivity of the network.
Regret Analysis
• Comparison with GOB.Lin:
  -- When W is the identity matrix, i.e., there is no dependency among users, the two algorithms' regret bounds coincide (both reduce to LinUCB).
  -- When there is dependency among users, GOB.Lin always incurs a worse, faster-growing regret than our CoLin algorithm.
Experimental Results
• Synthetic dataset:
  o Simulated 1000 articles and 100 users, each with a 5-dimensional feature vector.
  o W is constructed from the cosine similarity of user features.
[Figure: convergence of cumulated regret, with CoLin highlighted]
Experimental Results
• Synthetic dataset (same setup as above).
[Figure: accuracy of bandit parameter estimation, with CoLin highlighted]
Experimental Results
• Yahoo! Today Module:
  o 10 days of data from the Yahoo! Today Module.
  o 45,811,833 user visits.
  o Users are clustered into 160 groups.
  o The user relation matrix is constructed from the similarity of the user-group centroids.
[Figure: normalized CTR on the Yahoo dataset, with CoLin highlighted]
Experimental Results
• LastFM & Delicious:
  o LastFM: extracted from the music streaming service Last.fm.
  o Delicious: extracted from the social bookmark sharing service Delicious.
  o Both contain user friendship information (around 2000 users).
  o Users are clustered into 200 groups using graph-cut.
[Figure: results on the LastFM dataset, with CoLin highlighted]
[Figure: results on the Delicious dataset, with CoLin highlighted]
Experimental Results
• Effectiveness of collaboration:
  Rank user clusters by their number of observations.
  Group 1 (learning bucket): the top 50 user clusters.
  Group 2 (testing bucket): among the bottom 100 clusters, the 50 user clusters most connected to users in Group 1.
  Warm-start setting:
  -- 1. Run the algorithms on the learning bucket to estimate parameters for both groups of users.
  -- 2. Run and evaluate the algorithms on the testing bucket.
  Cold-start setting:
  -- Directly run and evaluate the bandit algorithms on the testing bucket.
Experimental Results
• Effectiveness of collaboration:
  o Compare the reward difference between the warm-start and cold-start settings: (warm − cold).
  o For LinUCB, (warm − cold) is always 0, since there is no collaboration.
[Figure: (warm − cold) reward on the LastFM dataset, with CoLin highlighted]
[Figure: (warm − cold) reward on the Delicious dataset, with CoLin highlighted]
Experimental Results
• Effectiveness of collaboration, user-based analysis:
  An "improved user" is one who receives better recommendations from the collaborative bandit algorithm than from isolated LinUCBs.
[Figure: user-based analysis on the Delicious dataset]
[Figure: user-based analysis on the LastFM dataset]
Future Work
• Dynamically estimate the graph structure in the collaborative bandit setting
• Decentralize the computation by exploiting the sparse graph structure
Acknowledgement
• We thank the conference for awarding the travel grant.
• This work was supported by NSF grant IIS-1553568.
References
• [1] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th WWW, pages 661–670. ACM, 2010.
• [2] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
• [3] Y. Abbasi-Yadkori, D. Pál, and Cs. Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.
• [4] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.
• [5] N. Cesa-Bianchi, C. Gentile, and G. Zappella. A gang of bandits. In Proc. NIPS, 2013.
• [6] J. Kawale, H. H. Bui, B. Kveton, L. Tran-Thanh, and S. Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In NIPS, pages 1297–1305, 2015.
Contextual Bandits
• In each round:
  (1) Observe the context vector x_a of each arm a ∈ A
  (2) Pick an arm a_t
  (3) Receive the corresponding reward r_{a_t,u}
  (4) Update the model
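To make this round structure concrete, here is a hedged usage sketch that drives the CoLinSketch class from the algorithm slide above (all names and the simulated reward model are mine, not the paper's):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 5, 5
    W = np.abs(rng.normal(size=(N, N)))
    W /= W.sum(axis=1, keepdims=True)        # row-stochastic user relation matrix
    Theta = rng.normal(size=(N, d))          # hidden ground-truth user preferences
    bandit = CoLinSketch(n_users=N, dim=d, W=W)

    for t in range(1000):
        u = int(rng.integers(N))                         # a user arrives
        arms = [rng.normal(size=d) for _ in range(10)]   # (1) observe contexts
        i = bandit.select(u, arms)                       # (2) pick an arm via UCB
        mean = arms[i] @ (Theta.T @ W[u])                # collaborative mean reward
        reward = mean + 0.1 * rng.normal()               # (3) noisy reward
        bandit.update(u, arms[i], reward)                # (4) update the model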