Bayesian Extension to the Language Model for Ad Hoc Information Retrieval
Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping
Presented by Chen Yi-Ting
Outline
Introduction
The Unigram Query Model
Bayesian Language Model
Relationship to other smoothing models
Empirical evaluation
Conclusion
Introduction
To propose a Bayesian extension of the language model.
Bayesian statistics provide a powerful yet intuitive mathematical framework for data modeling when the data is scarce and/or uncertain and there is some prior knowledge about it.
To derive analytically the predictive distribution of the most commonly used query language model.
The Unigram Query Model
The specific form of language model considered: the query unigram model.
Consider a query q and a document collection of N documents C := {d_l}, l = 1, ..., N, both queries and documents being represented as vectors of indexed term counts:
$$q := (q_1, \ldots, q_i, \ldots, q_V) \in \mathbb{N}^V, \qquad d_l := (d_{l,1}, \ldots, d_{l,i}, \ldots, d_{l,V}) \in \mathbb{N}^V$$
Furthermore, consider a multinomial generation model for each document, with parameters:
$$\theta_l := (\theta_{l,1}, \ldots, \theta_{l,i}, \ldots, \theta_{l,V}) \in [0,1]^V, \qquad \sum_{i=1}^{V} \theta_{l,i} = 1$$
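As a concrete illustration of this representation, here is a minimal Python sketch (the vocabulary, texts, and whitespace tokenization are all hypothetical, not from the paper) mapping a query and a document to term-count vectors:

```python
from collections import Counter

# Hypothetical toy vocabulary; in practice V is the full indexed vocabulary.
vocab = ["bayesian", "language", "model", "retrieval", "smoothing"]

def count_vector(text, vocab):
    """Map a text to its vector of indexed term counts (q or d_l in the slides)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

q = count_vector("bayesian language model", vocab)                   # query counts q_i
d = count_vector("language model language model smoothing", vocab)   # document counts d_{l,i}
n_q, n_l = sum(q), sum(d)   # query length n_q and document length n_l
print(q, d, n_q, n_l)       # [1, 1, 1, 0, 0] [0, 2, 2, 0, 1] 3 5
```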
The Unigram Query Model
Finally, let us also define the length of a query (n_q) and of a document (n_l) as the sum of their components.
Under this model, the probability of generating a particular query with counts q is given by the product:
$$P(q \mid \theta_l) = Z_q^{-1} \prod_{i=1}^{V} (\theta_{l,i})^{q_i}, \qquad Z_q^{-1} = \binom{n_q}{q_1, \ldots, q_V} = \frac{n_q!}{\prod_{i=1}^{V} q_i!}$$
and similarly for documents:
$$P(d_l \mid \theta_l) = Z_{d_l}^{-1} \prod_{i=1}^{V} (\theta_{l,i})^{d_{l,i}}, \qquad Z_{d_l}^{-1} = \binom{n_l}{d_{l,1}, \ldots, d_{l,V}} = \frac{n_l!}{\prod_{i=1}^{V} d_{l,i}!}$$
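A small Python sketch of this likelihood (a direct transcription of the formula above, not code from the paper); `multinomial_log_likelihood` is a hypothetical helper that works in log space for numerical stability:

```python
import math

def multinomial_log_likelihood(counts, theta):
    """log P(x | theta) for the multinomial model above:
    log Z_x^{-1} + sum_i x_i * log(theta_i), with Z_x^{-1} = n_x! / prod_i x_i!."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    log_prod = 0.0
    for x, t in zip(counts, theta):
        if x > 0:
            if t == 0.0:
                return float("-inf")   # a zero-probability term makes the whole query impossible
            log_prod += x * math.log(t)
    return log_coef + log_prod
```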
The Unigram Query Model
The unigram query model postulates that the relevance of a document to a query can be measured by the probability that the query is generated by the document.
Given an infinite amount of data, the values of $\theta_l$ could be easily estimated by their empirical estimates:
$$\hat{\theta}_{l,i} = \frac{d_{l,i}}{n_l}$$
Problems:
Underestimating the probability of rare words and overestimating the probability of frequent ones.
Unseen words receive zero probability, which motivates smoothed estimates (illustrated in the sketch below).
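Continuing the hypothetical sketch above (reusing `q`, `d`, and `multinomial_log_likelihood` from the earlier snippets), the empirical estimate and the unseen-word problem in a few lines:

```python
def ml_estimate(d_counts):
    """Empirical (maximum-likelihood) estimate: theta_hat_{l,i} = d_{l,i} / n_l."""
    n_l = sum(d_counts)
    return [x / n_l for x in d_counts]

theta_ml = ml_estimate(d)
# "bayesian" occurs in the query but not in the document, so the query
# probability collapses to zero (-inf in log space) under the ML estimate.
print(multinomial_log_likelihood(q, theta_ml))   # -inf
```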
Bayesian Language Model
The problem of small data samples and resulting
parameter uncertainty suggests the use of
Bayesian techniques.
Bayes' rule:
$$P(\theta_l \mid d_l) = \frac{P(\theta_l)\, P(d_l \mid \theta_l)}{P(d_l)}$$
A more powerful approach is to take account of posterior uncertainty when evaluating the probability of q by computing the predictive distribution:
$$P(q \mid d_l) = \int P(q \mid \theta_l)\, P(\theta_l \mid d_l)\, d\theta_l \qquad (7)$$
In the case of abundant data, it can be seen that
$$P(q \mid d_l) \approx P(q \mid \hat{\theta}_l)$$
Bayesian Language Model
The choice of prior probability distributions is
central to Bayesian inference.
In most cases the only available choice for a prior
is the natural conjugate of the generating
distribution.
The natural conjugate of a multinomial distribution is the Dirichlet distribution:
$$P(\theta) = Z_\alpha \prod_{i=1}^{V} \theta_i^{\alpha_i - 1}, \qquad Z_\alpha = \frac{\Gamma(n_\alpha)}{\prod_{i=1}^{V} \Gamma(\alpha_i)}, \qquad n_\alpha := \sum_{i=1}^{V} \alpha_i$$
The posterior distribution is Dirichlet as well:
$$P(\theta_l \mid d_l) = Z_{d_l + \alpha} \prod_{i=1}^{V} \theta_{l,i}^{\,d_{l,i} + \alpha_i - 1}$$
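A brief Python sketch of the conjugate update (hypothetical helpers mirroring the formulas above): the posterior is Dirichlet with parameters $\alpha_i + d_{l,i}$, and its mean already gives a smoothed, non-zero estimate for every term.

```python
def dirichlet_posterior(alpha, d_counts):
    """Conjugate update: Dirichlet(alpha) prior + multinomial counts d_l
    yields a Dirichlet(alpha + d_l) posterior."""
    return [a + x for a, x in zip(alpha, d_counts)]

def posterior_mean(alpha_post):
    """Mean of Dirichlet(alpha'): alpha'_i / sum_j alpha'_j."""
    total = sum(alpha_post)
    return [a / total for a in alpha_post]

alpha_uniform = [1.0] * len(d)                 # hypothetical uniform prior, reusing d from above
alpha_post = dirichlet_posterior(alpha_uniform, d)
print(posterior_mean(alpha_post))              # (d_{l,i} + 1) / (n_l + V): no zero probabilities
```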
Bayesian Language Model
Finally, we can compute the predictive distribution in (7) in closed form.
The scoring function is then obtained by taking the log and separating out the terms in the query that do not appear in the document (as sketched below).
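As a sketch (not copied verbatim from the paper): for the multinomial/Dirichlet pair above, the integral in (7) evaluates to the standard Dirichlet-multinomial (Pólya) predictive, written here in the slides' notation:
$$P(q \mid d_l) \;=\; Z_q^{-1}\;\frac{\Gamma(n_l + n_\alpha)}{\Gamma(n_l + n_q + n_\alpha)}\;\prod_{i:\,q_i > 0}\frac{\Gamma(d_{l,i} + q_i + \alpha_i)}{\Gamma(d_{l,i} + \alpha_i)}$$
Taking the log, the factors with $d_{l,i} = 0$ reduce to $\Gamma(q_i + \alpha_i)/\Gamma(\alpha_i)$ and depend only on the query, which is what allows the query terms not appearing in the document to be separated out.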
Bayesian Language Model
Setting the prior parameters α:
A better option is to fit the prior distribution to the collection statistics, since these are known.
In the absence of any other information, we will assume that the documents resemble the average document.
The average term count for term $v_i$ across the collection is:
$$P(v_i \mid C) := \frac{\sum_l d_{l,i}}{\sum_l n_l}$$
The mean of the Dirichlet prior is known to be $\alpha_i / n_\alpha$; requiring it to equal $P(v_i \mid C)$ can be satisfied by setting $\alpha_i = \mu\, P(v_i \mid C)$ and $n_\alpha = \mu$, where $\mu > 0$ is a free prior-strength parameter.
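A short Python sketch of this prior choice (hypothetical helper; `mu` plays the role of the prior-strength parameter μ above, and `docs` reuses the toy vectors from the earlier snippets):

```python
def collection_prior(docs, mu):
    """Fit the Dirichlet prior to collection statistics:
    P(v_i | C) = sum_l d_{l,i} / sum_l n_l, then alpha_i = mu * P(v_i | C)."""
    totals = [sum(term_counts) for term_counts in zip(*docs)]   # per-term collection counts
    grand_total = sum(totals)                                   # sum_l n_l
    p_c = [t / grand_total for t in totals]
    alpha = [mu * p for p in p_c]
    return p_c, alpha

# Hypothetical two-document collection over the toy vocabulary above.
docs = [d, count_vector("bayesian retrieval model", vocab)]
p_c, alpha = collection_prior(docs, mu=2.0)
```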
Relationship to other smoothing models
A standard approximation to the Bayesian predictive distribution (7) is the so-called maximum posterior (MP) distribution:
$$P(q \mid d_l) \approx P(q \mid \theta_l^{MP}) = Z_q^{-1} \prod_{i=1}^{V} \left(\theta_{l,i}^{MP}\right)^{q_i}$$
For a Dirichlet prior, the maximum posterior distribution is known analytically:
$$\theta_{l,i}^{MP} = \frac{d_{l,i} + \alpha_i - 1}{n_l + n_\alpha - V}$$
Setting $\alpha_i = 1$ gives the maximum likelihood estimator:
$$\theta_{l,i}^{ML} = \frac{d_{l,i}}{n_l}$$
Relationship to other smoothing models
Setting $\alpha_i = 2$ (in the MP estimate), or $\alpha_i = 1$ in the posterior mean, gives the Laplace smoothing estimator:
$$\theta_{l,i}^{LA} = \frac{d_{l,i} + 1}{n_l + V}$$
Setting $\alpha_i = \mu\, P(v_i \mid C)$ gives the Bayes-smoothing estimate:
$$\theta_{l,i}^{BS} = \frac{d_{l,i} + \mu\, P(v_i \mid C)}{n_l + \mu}$$
These different smoothed estimators are approximations to the full Bayesian predictive estimator, obtained by 1) replacing the predictive distribution by the maximum posterior distribution, and 2) choosing a particular value of $\alpha$.
Relationship to other smoothing models
The linear interpolation (LI) smoothing:
$$\theta_{l,i}^{LI} = \lambda\, \theta_{l,i}^{ML} + (1 - \lambda)\, P(v_i \mid C)$$
We first note that in the case of BS and LI the probability of an unseen word can be written as $\lambda_l\, P(v_i \mid C)$, where $\lambda_l$ is a document-dependent constant and $P(v_i \mid C)$ was previously defined.
This allows us to rewrite the unigram query model as:
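As a sketch of that rewriting (assuming, as stated, that every unseen term is scored as $\lambda_l\, P(v_i \mid C)$; this is the usual matched-terms decomposition, not copied verbatim from the paper):
$$\log P(q \mid \theta_l) = \log Z_q^{-1} + \sum_{i:\,q_i>0,\; d_{l,i}>0} q_i \log\frac{\theta_{l,i}}{\lambda_l\, P(v_i \mid C)} + n_q \log \lambda_l + \sum_{i:\,q_i>0} q_i \log P(v_i \mid C)$$
The last sum does not depend on the document and can be dropped for ranking, so only terms occurring in both the query and the document require per-document work.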
Relationship to other smoothing models
The Bayesian predictive model proposed in this paper admits an analogous rewriting (see the sketch below):
The number of operations needed to compute the first term now depends on the number of matching terms.
The last term cannot be pre-computed, since it depends on the query, but its computational cost remains negligible.
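A minimal Python sketch of this score, assuming the Dirichlet-multinomial form sketched earlier (hypothetical helper; `q`, `d`, and `alpha` reuse the toy objects from previous snippets):

```python
from math import lgamma

def log_predictive(q_counts, d_counts, alpha):
    """log P(q | d_l) under the Dirichlet-multinomial predictive:
    log Z_q^{-1} + lgamma(n_l + n_a) - lgamma(n_l + n_q + n_a)
      + sum over query terms of lgamma(d_i + q_i + a_i) - lgamma(d_i + a_i)."""
    n_q, n_l, n_a = sum(q_counts), sum(d_counts), sum(alpha)
    score = lgamma(n_q + 1) - sum(lgamma(x + 1) for x in q_counts if x > 0)  # log Z_q^{-1}, query-only
    score += lgamma(n_l + n_a) - lgamma(n_l + n_q + n_a)                     # document-length correction
    for qi, di, ai in zip(q_counts, d_counts, alpha):
        if qi > 0:
            # When d_i = 0 this factor depends only on the query and alpha,
            # so per-document work is driven by the terms matching the document.
            score += lgamma(di + qi + ai) - lgamma(di + ai)
    return score

print(log_predictive(q, d, alpha))   # reuses q, d, alpha from the earlier sketches
```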
Empirical evaluation
TREC-8 document collection
TREC-6 and TREC-8 queries and query
relevance sets.
Data pre-processing was standard:
Terms were stemmed.
Stop words were removed, as well as words appearing fewer than 3 times.
Conclusion
We proposed a new scoring function derived
from the Bayesian predictive distribution,
which many smoothed estimators
approximate.
We have shown that its computational cost is
only slightly greater than that of other existing
estimators.
The new model performs better than Bayes-smoothing but not as well as linear interpolation smoothing.