
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval
Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping
Microsoft Research Cambridge, U.K.
University of Twente, The Netherlands
ACM SIGIR 2003
Session: Retrieval Models
Abstract
• Many smoothed estimators (including Laplace and Bayes-smoothing) are
approximations to the Bayesian predictive distribution
• Derive the full predictive distribution in a form amenable to
implementation by classical IR models, and compare it to other estimators
• The proposed model outperforms Bayes-smoothing, and its combination with
linear interpolation smoothing outperforms all other estimators
Introduction (1/2)
• Language Model
– Computes the relevance of a document d with
respect to a query q by estimating a factorized form
of the distribution P(q, d)
• Bayesian statistics
– useful concepts and tools for estimation
– powerful mathematical framework for data
modeling when the data is scarce and/or uncertain
Introduction (2/2)
• Bayes-smoothing or Dirichlet smoothing
– Among the best smoothing techniques used in Language Models today
– An approximation to the full Bayesian inference model: in fact, it is the
maximum posterior approximation to the predictive distribution
• In this paper
– we derive analytically the predictive distribution of
the most commonly used query Language Model
The Unigram Query Model (1/4)
• Unigram Query Model
– Consider a query q and a document collection C := \{d_l\}_{l=1,\ldots,N} of N
documents, represented as term-count vectors
  q := (q_1, \ldots, q_i, \ldots, q_V) \in \mathbb{N}^V
  d_l := (d_{l,1}, \ldots, d_{l,i}, \ldots, d_{l,V}) \in \mathbb{N}^V
– qi: # of times the term i appears in the query
– V: the size of the vocabulary
The Unigram Query Model (2/4)
– Consider a multinomial generation model for each document, parameterized by the vector
  \theta_l := (\theta_{l,1}, \ldots, \theta_{l,i}, \ldots, \theta_{l,V}) \in [0,1]^V, \qquad \sum_{i=1}^{V} \theta_{l,i} = 1
– Length of a query (n_q) and of a document (n_l)
• the sum of their components (e.g. n_q := \sum_i q_i, \; n_l := \sum_i d_{l,i})
– The probability of generating a particular query with counts q, or a document with counts d_l:
  P(q \mid \theta_l) = Z_q \prod_{i=1}^{V} (\theta_{l,i})^{q_i}, \qquad \text{where } Z_q = \binom{n_q}{q_1, \ldots, q_V}
  P(d_l \mid \theta_l) = Z_{d_l} \prod_{i=1}^{V} (\theta_{l,i})^{d_{l,i}}, \qquad \text{where } Z_{d_l} = \binom{n_l}{d_{l,1}, \ldots, d_{l,V}}
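As a concrete illustration of this generative model (not from the paper), here is a minimal Python sketch of the multinomial query likelihood; the toy vocabulary, counts, and function names are hypothetical:

```python
from math import factorial, prod

def multinomial_coefficient(counts):
    """Z_q = n_q! / (q_1! ... q_V!): the number of orderings of a count vector."""
    n = sum(counts)
    return factorial(n) // prod(factorial(c) for c in counts)

def query_likelihood(q, theta):
    """P(q | theta_l) = Z_q * prod_i theta_{l,i}^{q_i} under the unigram multinomial model."""
    assert abs(sum(theta) - 1.0) < 1e-9
    return multinomial_coefficient(q) * prod(t ** c for t, c in zip(theta, q))

# Toy vocabulary of V = 4 terms: q_i are query term counts, theta_l a document's model.
q = [1, 2, 0, 0]
theta_l = [0.4, 0.3, 0.2, 0.1]
print(query_likelihood(q, theta_l))   # P(q | theta_l) = 3 * 0.4 * 0.3^2 = 0.108
```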
The Unigram Query Model (3/4)
– The unigram query model postulates that the relevance
of a document to a query can be measured by the
probability that the query is generated by the document
– By this it is meant the likelihood of the query, P(q \mid \theta_l), when the
parameters \theta_l are estimated using d_l as a sample of the underlying distribution
– The central problem of this model is then the estimation of the parameters
\theta_{l,i} from the document counts d_l, the collection counts
\{cf_i := \sum_l d_{l,i}\}_{i=1,\ldots,V} and the size of the collection N
The Unigram Query Model (4/4)
• Given an infinite amount of data
– empirical estimate (maximum likelihood estimate)
  \hat{\theta}_{l,i} = \frac{d_{l,i}}{n_l}
– With little data available for estimating these parameters, the empirical
estimator is not good
• unseen words: query terms that never occur in the document receive zero probability
– Two smoothing techniques
• Maximum-posterior estimator
• Linearly-interpolated maximum likelihood estimator
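A short sketch (toy counts assumed) of why the empirical estimator needs smoothing: a single query term that never occurs in the document drives the whole query likelihood to zero.

```python
def ml_estimate(d_counts):
    """Empirical (maximum likelihood) estimate: theta_hat_{l,i} = d_{l,i} / n_l."""
    n_l = sum(d_counts)
    return [c / n_l for c in d_counts]

d_l = [5, 3, 0, 2]            # document term counts; term 3 never occurs
theta_hat = ml_estimate(d_l)  # [0.5, 0.3, 0.0, 0.2]

q = [1, 0, 1, 0]              # query uses term 3
likelihood = 1.0
for q_i, t in zip(q, theta_hat):
    likelihood *= t ** q_i
print(likelihood)             # 0.0 -- the unseen-word problem that smoothing addresses
```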
Bayesian Language Model (1/4)
• Bayesian techniques
– rather than find a single point estimate for the parameter
vector θl, a distribution over θl (posterior) is obtained by
Bayes’ rule
P(l ) P(d l | l )
P (l | d l ) 
P (d l )
– Predictive distribution (where we assume that the document and the query are
generated by the same distribution)
  P(q \mid d_l) = \int P(q \mid \theta_l) P(\theta_l \mid d_l) \, d\theta_l
                = \frac{1}{P(d_l)} \int P(q \mid \theta_l) P(d_l \mid \theta_l) P(\theta_l) \, d\theta_l
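To make the integral concrete, the sketch below approximates the predictive distribution by Monte Carlo, averaging the query likelihood over samples of \theta_l drawn from the posterior; it uses the Dirichlet conjugacy introduced on the next slides, and the toy counts, prior, and use of numpy are assumptions rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([0.5, 0.5, 0.5, 0.5])   # Dirichlet prior hyper-parameters
d_l   = np.array([5, 3, 0, 2])           # document term counts
q     = np.array([1, 0, 1, 0])           # query term counts

# By conjugacy, the posterior P(theta | d_l) is Dirichlet(d_l + alpha) (next slides).
samples = rng.dirichlet(d_l + alpha, size=100_000)

# P(q | d_l) ~= (1/S) * sum_s P(q | theta^(s)); the multinomial constant Z_q is
# omitted here, since it is the same for every sample and does not affect ranking.
likelihoods = np.prod(samples ** q, axis=1)
print(likelihoods.mean())
```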
Bayesian Language Model (2/4)
• Prior probability, P(θ)
– It’s central to Bayesian inference, especially for small
data samples
– In most cases, the only available choice for a prior is the
natural conjugate of the generating distribution
• The natural conjugate of a multinomial distribution is
the Dirichlet distribution
  P(\theta) = Z'_\alpha \prod_{i=1}^{V} (\theta_i)^{\alpha_i - 1}, \qquad \text{where } Z'_\alpha = \frac{\Gamma(n_\alpha)}{\prod_{i=1}^{V} \Gamma(\alpha_i)}, \quad n_\alpha := \sum_i \alpha_i
Bayesian Language Model (3/4)
• Under this prior
– the posterior distribution is Dirichlet as well
P( | d l ) 
(ndl  n )

V
i 1
v
 ( d l ,i   i )
 (i )
i 1
– the predictive distribution
V
q d
'
P(q | d l )  Z q Z d    (i )
i
l
 Zq

dl ,i  i 1
l ,i  i 1
i 1
i|qi  0


nq
j 1
qi
g 1
(dl ,i   i  g  1)
(ndl  n  j  1)
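Below is a sketch of the closed-form predictive probability above, computed with log-gamma functions for numerical stability; the toy counts and variable names are illustrative only, not from the paper.

```python
from math import lgamma, exp

def predictive_log_prob(q, d_l, alpha):
    """log P(q | d_l) for the Dirichlet-multinomial predictive distribution."""
    n_q, n_dl, n_a = sum(q), sum(d_l), sum(alpha)
    # log Z_q = log( n_q! / prod_i q_i! )
    log_zq = lgamma(n_q + 1) - sum(lgamma(q_i + 1) for q_i in q)
    # sum over query terms of log Gamma(d_{l,i}+alpha_i+q_i) - log Gamma(d_{l,i}+alpha_i)
    log_num = sum(lgamma(d + a + q_i) - lgamma(d + a)
                  for q_i, d, a in zip(q, d_l, alpha) if q_i > 0)
    # log Gamma(n_{d_l}+n_alpha) - log Gamma(n_q+n_{d_l}+n_alpha)
    log_den = lgamma(n_dl + n_a) - lgamma(n_q + n_dl + n_a)
    return log_zq + log_num + log_den

alpha = [0.5, 0.5, 0.5, 0.5]
print(exp(predictive_log_prob([1, 0, 1, 0], [5, 3, 0, 2], alpha)))
# Z_q times the Monte Carlo estimate sketched earlier (here Z_q = 2).
```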
Bayesian Language Model (4/4)
• New Document Scoring function
  \log P(q \mid d_l) = \sum_{i \mid (q_i, d_{l,i}) > 0} \; \sum_{g=1}^{q_i} \log\left(1 + \frac{d_{l,i}}{\alpha_i + g - 1}\right)
                     - \sum_{j=1}^{n_q} \log(n_{d_l} + n_\alpha + j - 1)
                     + \sum_{i \mid q_i > 0} \; \sum_{g=1}^{q_i} \log(\alpha_i + g - 1) + \log Z_q
where the last two terms can be dropped as they are
document independent
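A sketch of the resulting document score, keeping only the first two (document-dependent) terms; the inverted-index traversal over matching terms is written here as a plain loop over a toy vocabulary, and all data is hypothetical.

```python
from math import log

def bayes_predictive_score(q, d_l, alpha):
    """Document-dependent part of log P(q | d_l):
       sum_{i | q_i, d_{l,i} > 0} sum_{g=1}^{q_i} log(1 + d_{l,i} / (alpha_i + g - 1))
       - sum_{j=1}^{n_q} log(n_{d_l} + n_alpha + j - 1)"""
    n_q, n_dl, n_a = sum(q), sum(d_l), sum(alpha)
    score = 0.0
    for q_i, d_i, a_i in zip(q, d_l, alpha):
        if q_i > 0 and d_i > 0:   # only terms present in both query and document contribute
            score += sum(log(1 + d_i / (a_i + g - 1)) for g in range(1, q_i + 1))
    score -= sum(log(n_dl + n_a + j - 1) for j in range(1, n_q + 1))
    return score

alpha = [0.5, 0.5, 0.5, 0.5]
print(bayes_predictive_score([1, 0, 1, 0], [5, 3, 0, 2], alpha))
```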
Setting the hyper-parameter values (αi)
• One is to set all the αi to some constant
• A better option is to fit the prior distribution to the collection
statistics
– The average term count of term v_i is proportional to P(v_i \mid C) := \frac{\sum_l d_{l,i}}{\sum_l n_l}
– The mean of the prior distribution P(\theta) is known to be \frac{\alpha_i}{n_\alpha}
– Setting this mean equal to the average term count: \frac{\alpha_i}{n_\alpha} = P(v_i \mid C)
– Therefore, set \alpha_i = \mu P(v_i \mid C) and n_\alpha = \mu,
where \mu is a free parameter of the model
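A sketch of this fitting step on a hypothetical toy collection: alpha_i = mu * P(v_i | C), which makes n_alpha = mu and matches the prior mean to the collection statistics.

```python
def fit_dirichlet_prior(collection, mu):
    """Set alpha_i = mu * P(v_i | C), where P(v_i | C) = sum_l d_{l,i} / sum_l n_l."""
    total = sum(sum(d) for d in collection)                                  # sum_l n_l
    cf = [sum(d[i] for d in collection) for i in range(len(collection[0]))]  # collection counts cf_i
    return [mu * c / total for c in cf]

# Toy collection of N = 3 documents over V = 4 terms.
collection = [[5, 3, 0, 2], [1, 0, 4, 1], [2, 2, 2, 2]]
alpha = fit_dirichlet_prior(collection, mu=1000.0)
print(alpha, sum(alpha))   # sum(alpha) == n_alpha == mu
```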
Relationship to Other Smoothing Models (1/3)
• Maximum Posterior (MP) distribution
– A standard approximation to the Bayesian predictive
distribution
  P(q \mid d_l) \approx P(q \mid \theta_l^{MP}) = Z_q \prod_{i=1}^{V} (\theta_{l,i}^{MP})^{q_i}
• For a Dirichlet prior
  \theta_{l,i}^{MP} = \frac{d_{l,i} + \alpha_i - 1}{n_l + n_\alpha - V}
– \alpha_i = 1: obtain the maximum likelihood estimator \theta_{l,i}^{ML} = \frac{d_{l,i}}{n_l}
– \alpha_i = 2 (or \alpha_i = \lambda + 1): obtain the Laplace smoothing estimator
  \theta_{l,i}^{LA} = \frac{d_{l,i} + \lambda}{n_l + \lambda V}
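A sketch of the maximum-posterior estimator and its special cases (toy counts assumed): alpha_i = 1 recovers the maximum likelihood estimate, and alpha_i = lambda + 1 recovers Laplace smoothing.

```python
def mp_estimate(d_l, alpha):
    """theta^MP_{l,i} = (d_{l,i} + alpha_i - 1) / (n_l + n_alpha - V): the posterior mode."""
    n_l, n_a, V = sum(d_l), sum(alpha), len(d_l)
    return [(d + a - 1) / (n_l + n_a - V) for d, a in zip(d_l, alpha)]

d_l = [5, 3, 0, 2]
print(mp_estimate(d_l, [1.0] * 4))        # alpha_i = 1        -> d_{l,i} / n_l (maximum likelihood)
lam = 0.5
print(mp_estimate(d_l, [lam + 1.0] * 4))  # alpha_i = lambda+1 -> (d_{l,i}+lambda)/(n_l+lambda*V) (Laplace)
```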
Relationship to Other Smoothing Models (2/3)
– \alpha_i = \mu P(v_i \mid C): obtain the Bayes-smoothing estimator
  \theta_{l,i}^{BS} = \frac{d_{l,i} + \mu P(v_i \mid C)}{n_l + \mu}
– Linear interpolation (LI) smoothing
  \theta_{l,i}^{LI} = \lambda \theta_{l,i}^{ML} + (1 - \lambda) P(v_i \mid C)
• Scoring function resulting from these estimators (BS, LI): rewrite the unigram query model as
  \log P(q \mid \theta_l) \stackrel{rank}{=} \sum_{i \mid (q_i, d_{l,i}) > 0} q_i \log \frac{\theta_{l,i}}{\beta_l P(v_i \mid C)} + n_q \log \beta_l
where \beta_l is a document-dependent constant
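A sketch of the Bayes-smoothing and linear-interpolation estimators together with the rank-equivalent scoring form above. The particular beta_l values used (mu / (n_l + mu) for BS, 1 - lambda for LI) are the standard document-dependent constants that make the rewrite hold; they, and the toy data, are assumptions here rather than statements from the slides.

```python
from math import log

def bayes_smoothing(d_l, p_coll, mu):
    """theta^BS_{l,i} = (d_{l,i} + mu * P(v_i|C)) / (n_l + mu)."""
    n_l = sum(d_l)
    return [(d + mu * p) / (n_l + mu) for d, p in zip(d_l, p_coll)]

def linear_interpolation(d_l, p_coll, lam):
    """theta^LI_{l,i} = lam * d_{l,i}/n_l + (1 - lam) * P(v_i|C)."""
    n_l = sum(d_l)
    return [lam * d / n_l + (1 - lam) * p for d, p in zip(d_l, p_coll)]

def unigram_score(q, d_l, theta, p_coll, beta_l):
    """Rank-equivalent unigram score:
       sum_{i | q_i, d_{l,i} > 0} q_i * log(theta_{l,i} / (beta_l * P(v_i|C))) + n_q * log(beta_l)."""
    matching = sum(q_i * log(t / (beta_l * p))
                   for q_i, d_i, t, p in zip(q, d_l, theta, p_coll) if q_i > 0 and d_i > 0)
    return matching + sum(q) * log(beta_l)

d_l, q = [5, 3, 0, 2], [1, 0, 1, 0]
p_coll = [8 / 24, 5 / 24, 6 / 24, 5 / 24]   # P(v_i | C) from the toy collection above
mu, lam = 1000.0, 0.5
print(unigram_score(q, d_l, bayes_smoothing(d_l, p_coll, mu), p_coll, beta_l=mu / (sum(d_l) + mu)))
print(unigram_score(q, d_l, linear_interpolation(d_l, p_coll, lam), p_coll, beta_l=1 - lam))
```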
Relationship to Other Smoothing Models (3/3)
• General formulation of the unigram query model
– A fast inverted index can be used to retrieve the weights
needed to compute the first term
– # of operations to compute the first term depends only on
# of term-indices matching
– the cost of computing the second term is negligible
• Bayesian predictive model proposed in this paper
– # of operations to compute the first term is different
– the last term cannot be pre-computed
– Slightly more expensive, but it can still be implemented in a real-scale IR system
Empirical Evaluation (1/3)
• Data
– TREC-8 document collection
– TREC-6 and TREC-8 queries and query-relevance sets
• Data Pre-processing is standard
– Terms were stemmed with the Porter stemmer
– Stop words and words occurring fewer than 3 times are removed
– Queries are constructed from the title and description
Empirical Evaluation (2/3)
• Results
– For the Bayes predictive model the optimal parameter setting is roughly the same
– Linear interpolation smoothing yields better results than Bayes-smoothing
and the Bayes predictive model
Empirical Evaluation (3/3)
• Combination of the Bayes predictive model with linear interpolation
smoothing and with Bayes-smoothing
Conclusion
• Present a first Bayesian analysis of the unigram query Language
Model for ad hoc retrieval, and propose a new scoring function
derived from the Bayesian predictive distribution
• Work remains to be done
– Combine these two approaches
– Automatically adapt the μ scaling parameter
– The Bayesian inference framework could be applied to other Language Models
and extended to other tasks such as relevance feedback, query expansion and
adaptive filtering