
BBM: Bayesian Browsing Model
from Petabyte-scale Data
Chao Liu, MSR-Redmond
Fan Guo, Carnegie Mellon University
Christos Faloutsos, Carnegie Mellon University
Massive Log Streams
• Search log
– 10+ terabytes each day (keeps increasing!)
– Involves billions of distinct (query, url) pairs
• Questions
– Can we infer user-perceived relevance for each
(query, url) pair?
– How many passes of the data are needed? Is one
enough?
– Can the inference be parallelized?
• Our answer: Yes, Yes, and Yes!
BBM: Bayesian Browsing Model
[Figure: the BBM graphical model for one query. Each of the top results URL1–URL4 has a snippet relevance variable S_i (Relevance), an examination variable E_i (Examine Snippet), and an observed click variable C_i (ClickThroughs).]
Dependencies in BBM
[Figure: dependency structure of BBM. Each click C_i depends on the examination E_i and the snippet relevance S_i; the examination of position i depends on the distance d_i = i − r_i, where r_i is the preceding click position before i.]
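The dependency structure can be sketched as a generative simulation of one search instance; the helper below is illustrative (the function name, the β dictionary layout, and its values are ours, not from the slides):

```python
import random

def simulate_clicks(S, beta, rng=random.Random(0)):
    """Sample one BBM search instance over M ranked snippets.

    S[i]         : relevance of the snippet at position i+1 (assumed known here).
    beta[(r, d)] : probability of examining a snippet at distance d from the
                   preceding click at position r (r = 0 means no click yet).
    """
    clicks, r = [], 0              # r tracks the preceding click position
    for i, s in enumerate(S, start=1):
        d = i - r                  # distance from the preceding click
        examined = rng.random() < beta[(r, d)]
        clicked = examined and rng.random() < s
        clicks.append(int(clicked))
        if clicked:
            r = i                  # this position becomes the preceding click
    return clicks
```

With all β = 1 and all relevances at 1, every position is examined and clicked, which matches the model's boundary behavior.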
Road Map
• Exact Model Inference
• Algorithms through an Example
• Experiments
• Conclusions
Notations
• For a given query
– Top-M positions, usually M=10
• Positional relevance S = (S_1, S_2, ..., S_M)
– M(M+1)/2 combinations of (r, d)'s
– n search instances C^k = (C_1^k, C_2^k, ..., C_M^k), k = 1, 2, ..., n
– N documents impressed in total: (d_1, d_2, ..., d_N)
• Document relevance R = (R_1, R_2, ..., R_N)
Model Inference
• Ultimate goal: the joint posterior of document relevance, p(R | C^{1:n})
• Observation: conditional independence across search instances:
  p(C^{1:n} | S) = ∏_{k=1}^{n} P(C^k | S)
• Likelihood of each search instance C^k: P(C^k | S), expanded by the chain rule over positions
• From S to R: the positional relevance S_i is the relevance R_j of the document impressed at position i
Putting things together
• Posterior: p(S | C^{1:n}) ∝ p(C^{1:n} | S) · p(S)
• Re-organized by the R_j's:
  p(R_j | C^{1:n}) ∝ R_j^{N_j} · ∏_{r,d} (1 − β_{r,d} R_j)^{N_{j,r,d}}
– N_j: how many times d_j was clicked
– N_{j,r,d}: how many times d_j was not clicked when it was at position (r + d) and the preceding click was on position r
What p(R | C^{1:n}) Tells Us
• Exact inference with joint posterior in closed form
• Joint posterior factorizes, hence the R_j's are mutually independent
• At most M(M+1)/2 + 1 numbers fully characterize each posterior
– Count vector: e = (e_0, e_1, e_2, ..., e_{M(M+1)/2})
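Given a count vector and the examination probabilities, each posterior can be evaluated pointwise; a minimal sketch (function name and β values are placeholders, not from the slides):

```python
def posterior_unnorm(R, e, beta):
    """Unnormalized posterior density p(R_j | C^{1:n}) from a count vector.

    e[0]    : number of times the document was clicked.
    e[i]    : number of times it was skipped under the i-th (r, d) combination.
    beta[i] : examination probability for the i-th (r, d) combination
              (hypothetical values here; estimated from the log in practice).
    """
    p = R ** e[0]                          # clicked terms: R_j^{e_0}
    for ei, bi in zip(e[1:], beta[1:]):    # skipped terms: (1 - beta_i R_j)^{e_i}
        p *= (1.0 - bi * R) ** ei
    return p
```

For example, with e = (1, 2) and β = 1 the density has the shape R(1 − R)^2.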
Road Map
• Exact Model Inference
• Algorithms through an Example
• Experiments
• Conclusions
LearnBBM: One-Pass Counting
• One pass over the log: for each impressed document, increment either its click count or the skip count for the observed (r, d)
• Find R_j: read off the posterior of each R_j from its count vector
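The one-pass counting step might look like this in Python; the record layout and function name are assumptions, and counts are keyed by (r, d) instead of a flat index for readability:

```python
from collections import defaultdict

def learn_bbm(instances):
    """One pass over search instances, accumulating per-(query, url) counts.

    instances : iterable of (query, urls, clicks), where urls lists the top-M
                results and clicks[i] is 1 if position i+1 was clicked.
    Returns   : counts[(query, url)] with key 0 for clicks and key (r, d) for
                skips at distance d after the preceding click at position r.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for query, urls, clicks in instances:
        r = 0                                        # preceding click position, 0 = none
        for i, (url, c) in enumerate(zip(urls, clicks), start=1):
            if c:
                counts[(query, url)][0] += 1         # document was clicked
                r = i
            else:
                counts[(query, url)][(r, i - r)] += 1  # skipped at (r, d)
    return counts
```

A single scan suffices because every term of the closed-form posterior is one of these counters.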
An Example
• Compute p(R_4 | C^{1:n})
• Count vector for R_4
[Table: counts for document d_4 — N_4 (clicks) and N_{4,r,d} (skips), indexed by preceding-click position r and distance d.]
LearnBBM on MapReduce
• Map: emit((q,u), idx)
• Reduce: construct the count vector
Example on MapReduce
Map outputs:
– Mapper 1: (U1, 0), (U2, 4), (U3, 0)
– Mapper 2: (U1, 1), (U3, 0), (U4, 7)
– Mapper 3: (U1, 1), (U3, 0), (U4, 0)
Reduce outputs:
– (U1, 0, 1, 1) → p(R_1) ∝ R_1 (1 − R_1)^2
– (U2, 4) → p(R_2) ∝ 1 − 0.98 R_2
– (U3, 0, 0, 0) → p(R_3) ∝ R_3^3
– (U4, 0, 7) → p(R_4) ∝ R_4 (1 − R_4)
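The reduce step of this example can be mimicked in a few lines; the emissions below copy the (url, idx) pairs produced by the mappers above, and the helper name is ours:

```python
from collections import defaultdict

def reduce_counts(emissions):
    """Group mapper output (url, idx) into per-url count vectors.
    idx = 0 marks a click; a nonzero idx encodes one (r, d) combination."""
    grouped = defaultdict(list)
    for url, idx in emissions:
        grouped[url].append(idx)
    return {url: sorted(idxs) for url, idxs in grouped.items()}

emitted = [("U1", 0), ("U2", 4), ("U3", 0),   # mapper 1
           ("U1", 1), ("U3", 0), ("U4", 7),   # mapper 2
           ("U1", 1), ("U3", 0), ("U4", 0)]   # mapper 3
print(reduce_counts(emitted))
# {'U1': [0, 1, 1], 'U2': [4], 'U3': [0, 0, 0], 'U4': [0, 7]}
```

Each grouped vector is exactly the count vector that characterizes one document's posterior.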
Road Map
• Exact Model Inference
• Algorithms through an Example
• Experiments
• Conclusions
Experiments
• Compare with the User Browsing Model (Dupret and
Piwowarski, SIGIR’08)
– The same dependence structure
– But point-estimation of document relevance rather than Bayesian
– Approximate inference through iterations
• Data:
– Collected from Aug and Sept 2008
– 10 algorithmic results only
– Split into training/test sets according to time stamps for each query
– 51 million search instances of 1.15 million distinct queries, 10X larger than the SIGIR'08 study
Overall Comparison on Log-Likelihood
• Experiments in 20 batches
ll ll
 1)*100%
• LL Improvement Ratio = (e
2
1
Comparison w.r.t. Frequency
• Intuition
– Hard to predict clicks for infrequent queries
– Easy for frequent ones
Model Comparison on Efficiency
• BBM is 57 times faster than the iterative baseline (UBM)
Petabyte-Scale Experiment
• Setup:
– 8 weeks data, 8 jobs
– Job k takes the first k weeks of data
• Experiment platform
– SCOPE: Easy and Efficient Parallel Processing of
Massive Data Sets [Chaiken et al, VLDB’08]
Scalability of BBM
• Increasing computation load
– more queries, more urls, more impressions
• Near-constant elapsed time
– about 3 hours per job
– scans 265 terabytes of data
– full posteriors for 1.15 billion (query, url) pairs
[Charts: computation load and elapsed time on SCOPE across the 8 jobs.]
Road Map
• Exact Model Inference
• Algorithms through an Example
• Experiments
• Conclusions
Conclusions
• Bayesian Browsing Model for Search streams
– Exact Bayesian inference
– Joint posterior in closed form
– A single pass suffices
– Map-Reducible for parallelism
– Amenable to incremental updates
– Perfect for mining click streams
• Models for other stream data
– Browsing, twittering, Web 2.0, etc?
Thanks!