Transcript (PPT)
Discriminative Models for
Information Retrieval
Ramesh Nallapati
UMass
SIGIR 2004
Abstract
Discriminative model vs. Generative model
Performance Comparison
Discriminative – attractive theoretical properties
Discriminative – Maximum Entropy, Support Vector Machine
Generative – Language Modeling
Experiment
Ad-hoc Retrieval
ME is worse than LM; SVMs are on par with LM
Home-Page Finding
Prefer SVM over LM
Introduction
Traditional IR
A problem of measuring the similarity between docs and
query, such as Vector Space Model
Shortcoming
Term-weights – empirically tuned
No theoretical basis for computing optimum weights
Binary independence retrieval (BIR)
Robertson and Sparck Jones (1976)
First model that viewed IR as a classification problem
This allows us to leverage many sophisticated techniques
developed in the ML domain
Discriminative models
Good success in many applications of ML
Discriminative and Generative Classifiers
Pattern Classification
The problem of classifying an example based on its vector of
features x into its class C through a posterior probability
P(C|x) or simply a confidence score g(C|x)
Discriminative models
Model the posterior directly or learn a direct map from
inputs x to the class labels C
Generative models
Model the class-conditional probability P(x|C) and the
prior probability P(C) and estimate the posterior through
the Bayes’ rule
$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)} \propto P(x \mid C)\,P(C)$$
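As a concrete illustration of the two routes to the posterior, here is a minimal Python sketch; it is not from the paper, and the naive-Bayes form on the generative side and the logistic form on the discriminative side are illustrative assumptions.

import math

# Generative route: estimate P(x|C) and the prior P(C), then apply Bayes' rule.
# Assumes binary features and conditional independence (naive Bayes style).
def generative_posterior(x, cond_probs, priors):
    # cond_probs[c][i] = P(x_i = 1 | C = c); priors[c] = P(C = c)
    scores = {}
    for c, prior in priors.items():
        log_p = math.log(prior)
        for i, xi in enumerate(x):
            p = cond_probs[c][i]
            log_p += math.log(p if xi else 1.0 - p)
        scores[c] = math.exp(log_p)
    z = sum(scores.values())          # P(x), the normalizer in Bayes' rule
    return {c: s / z for c, s in scores.items()}

# Discriminative route: model P(C|x) directly, here with a logistic form.
def discriminative_posterior(x, weights, bias):
    g = bias + sum(w * xi for w, xi in zip(weights, x))
    p1 = 1.0 / (1.0 + math.exp(-g))   # P(C = 1 | x)
    return {0: 1.0 - p1, 1: p1}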
Probabilistic IR models as Classifiers (1/3)
Binary Independence Retrieval (BIR) model
Ranking is done by the log-likelihood ratio of relevance:
$$\log\frac{P(R \mid D)}{P(\bar{R} \mid D)} = \log\frac{P(D \mid R)\,P(R)}{P(D \mid \bar{R})\,P(\bar{R})} = \sum_{i:\,x_i=1}\log\frac{P(x_i=1 \mid R)}{P(x_i=1 \mid \bar{R})} + \sum_{i:\,x_i=0}\log\frac{P(x_i=0 \mid R)}{P(x_i=0 \mid \bar{R})} + \log\frac{P(R)}{P(\bar{R})}$$
The model has not met with good empirical success owing
to the difficulty in estimating the class-conditional $P(x_i = 1 \mid R)$
Assume uniform probability distribution over the entire
vocabulary and update the probabilities as relevant docs are
provided by the user
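A minimal sketch of BIR scoring under the formula above; the probability tables are assumed given (e.g., the uniform initialization refined by user feedback), and all names are illustrative.

import math

def bir_score(doc_terms, p_rel, p_nonrel):
    # Log-likelihood ratio of relevance under the BIR model.
    # p_rel[w] = P(x_w = 1 | R), p_nonrel[w] = P(x_w = 1 | not-R);
    # the prior ratio log(P(R)/P(not-R)) is constant across documents,
    # so it is omitted for ranking purposes.
    score = 0.0
    for w in p_rel:
        if w in doc_terms:
            score += math.log(p_rel[w] / p_nonrel[w])
        else:
            score += math.log((1.0 - p_rel[w]) / (1.0 - p_nonrel[w]))
    return score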
Probabilistic IR models as Classifiers (2/3)
Two-Poisson model
Follows the same framework as the BIR model, but uses a
mixture of two Poisson distributions to model the class-conditionals
$P(D \mid R)$ and $P(D \mid \bar{R})$, where $E$ denotes term eliteness:
$$P(D \mid R) = \prod_{i=1}^{n}\Big(P(c(w_i)=k \mid E)\,P(E \mid R) + P(c(w_i)=k \mid \bar{E})\,P(\bar{E} \mid R)\Big) = \prod_{i=1}^{n}\Big(p\,\frac{e^{-l}\,l^{k}}{k!} + (1-p)\,\frac{e^{-m}\,m^{k}}{k!}\Big)$$
This also is a generative model
Similar to the BIR model, it also needs relevance feedback for
accurate parameter estimation
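To make the mixture concrete, a small sketch of the two-Poisson class-conditional; the parameter names p, l, m follow the formula above, and their estimation from relevance feedback is not shown.

import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def two_poisson_term_prob(k, p, l, m):
    # P(c(w) = k | R): mixture of an 'elite' Poisson (rate l, weight p)
    # and a 'non-elite' Poisson (rate m, weight 1 - p).
    return p * poisson_pmf(k, l) + (1.0 - p) * poisson_pmf(k, m)

def two_poisson_doc_loglik(term_counts, params):
    # log P(D | R) as a sum over query terms; params[w] = (p, l, m).
    return sum(math.log(two_poisson_term_prob(term_counts.get(w, 0), p, l, m))
               for w, (p, l, m) in params.items())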
Probabilistic IR models as Classifiers (3/3)
Language Models
Ponte and Croft (1998)
The ranking of a doc is given by the probability of
generation of the query from doc’s language model
$$P(Q \mid D) = P(q_1, q_2, \ldots, q_n \mid D) = \prod_{i=1}^{n} P(q_i \mid D)$$
This model circumvents the problem of estimating the model
of relevant documents that the BIR and Two-Poisson models
suffer from
LMs can be considered generative classifiers in a multi-class
classification sense
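A minimal query-likelihood scorer; the Jelinek-Mercer smoothing shown here is one common choice and is an assumption, not something the slide specifies (the smoothing parameter is exactly what the LM runs in the experiments section tune).

import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # log P(Q|D) = sum_i log P(q_i|D), smoothing the document model
    # with the collection model so unseen query terms get nonzero mass.
    log_p = 0.0
    for q in query_terms:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_coll = coll_tf.get(q, 0) / coll_len
        log_p += math.log(lam * p_doc + (1.0 - lam) * p_coll)
    return log_p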
The Case for Discriminative Models for IR (1/3)
Discriminative vs. Generative
One should solve the problem (classification) directly and
never solve a more general problem (estimating the
class-conditional) as an intermediate step (Vapnik's principle)
Model Assumptions
GM
Terms are conditionally independent
LMs assume documents obey a multinomial distribution of terms
DM
Typically make few assumptions and in a sense, let the
data speak for itself.
The Case for Discriminative Models for IR (2/3)
Expressiveness
GM - LMs are not expressive enough to incorporate many
features into the model
DM - can include arbitrary features effortlessly in a single
model
Learning arbitrary features
In view of the many query-dependent and query-independent
document features and user preferences that influence
relevance, we believe that a DM that learns all these
features is best suited for the generalized IR problem
The Case for Discriminative Models for IR (3/3)
Notion of Relevance
In LM, there is no explicit notion of relevance. There has
been considerable controversy on the missing relevance
variable in LM.
We believe that Robertson’s view of IR as a binary
classification problem of relevance is more realistic than the
implicit notion of relevance as it exists in LM.
Discriminative Models Used in Current Work (1/2)
Maximum Entropy Model
The principle of ME – model all that is known and assume
nothing about that which is unknown
The parametric form of the ME probability function can be
expressed by
$$P(R \mid D, Q) = \frac{1}{Z(D, Q)} \exp\Big(\sum_{i=1}^{n} \lambda_{i,R}\, f_i(D, Q)\Big)$$
The feature weights (λ) are learned from training data using
a fast gradient descent algorithm
As in Robertson’s BIR model, we use the log-likelihood ratio
as the scoring function for ranking, as shown below
$$\log\frac{P(R \mid D, Q)}{P(\bar{R} \mid D, Q)} = \sum_{i=1}^{n} (\lambda_{i,R} - \lambda_{i,\bar{R}})\, f_i(D, Q)$$
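A sketch of the ME scoring function above; note that the partition function Z(D,Q) cancels in the log-likelihood ratio, so ranking reduces to a weighted feature sum (feature extraction and weight training are assumed done elsewhere).

def me_rank_score(features, lam_rel, lam_nonrel):
    # sum_i (lambda_{i,R} - lambda_{i,not-R}) * f_i(D, Q)
    return sum((lr - ln) * f
               for f, lr, ln in zip(features, lam_rel, lam_nonrel))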
Discriminative Models Used in Current Work (2/2)
Support Vector Machines (SVM)
Basic idea – find the hyper-plane that separates the two classes of
training examples with the largest margin
If f(D,Q) is the vector of features, then the discriminant function
is given by
$$g(R \mid D, Q) = \mathbf{w} \cdot \phi(f(D, Q)) + b$$
The SVM is trained such that g(R|D,Q)>=1 for positive (relevant)
examples and g(R|D,Q) <= -1 for negative (non-relevant)
examples as long as the data is separable
Both DMs
Retain the basic framework of the BIR model, while avoiding
estimation of the class-conditional; instead, directly compute the
posterior P(R|D,Q) or the mapping function g(R|D,Q)
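For the SVM side, the discriminant above is just a margin score; a minimal sketch for the linear kernel reported in the experiments.

def svm_rank_score(feature_vec, weights, bias):
    # g(R | D, Q) = w . f(D, Q) + b with a linear kernel; documents are
    # ranked by this margin score. Training (e.g., with svm-light) pushes
    # g >= +1 for relevant and g <= -1 for non-relevant training examples.
    return sum(w * f for w, f in zip(weights, feature_vec)) + bias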
Other Modeling Issues
Out of Vocabulary words (OOV) problem
Test queries are almost always guaranteed to contain words
that are not seen in the training queries
The features are not based on words themselves, but on
query-based statistics of documents such as the total
frequency of occurrences or the sum-total of the idf-values
of the query terms
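A sketch of such word-independent, query-based features; the exact feature set here is illustrative, not the paper's list.

import math

def query_based_features(query_terms, doc_tf, doc_len, df, num_docs):
    # Features depend only on statistics of the query terms in the
    # document and collection, never on the word identities themselves,
    # so unseen query words at test time cause no OOV problem.
    tf_sum = sum(doc_tf.get(q, 0) for q in query_terms)
    norm_tf_sum = sum(doc_tf.get(q, 0) / doc_len for q in query_terms)
    idf_sum = sum(math.log(num_docs / (1 + df.get(q, 0))) for q in query_terms)
    return [tf_sum, norm_tf_sum, idf_sum]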
Unbalanced data
One class (non-relevant) accounts for a large portion of all the
examples, while the other (relevant) class has only a small
percentage of the examples
Over-sampling the minority class
Under-sampling the majority class
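Either remedy can be sketched in a few lines; the 1:1 target ratio here is an assumption, and the paper's actual sampling ratio may differ.

import random

def rebalance(examples, labels, mode="undersample", seed=0):
    rng = random.Random(seed)
    pos = [e for e, y in zip(examples, labels) if y == 1]   # relevant
    neg = [e for e, y in zip(examples, labels) if y == 0]   # non-relevant
    if mode == "undersample":
        neg = rng.sample(neg, len(pos))        # trim the majority class
    else:
        extra = len(neg) - len(pos)
        pos = pos + [rng.choice(pos) for _ in range(extra)]  # duplicate minority
    data = [(e, 1) for e in pos] + [(e, 0) for e in neg]
    rng.shuffle(data)
    return data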
Experiments and Results (1/8)
Ad-hoc retrieval
Data set
Preprocessing
K-stemming and stop-word removal
Only use title queries for retrieval
LM
Training the LM consists of learning the optimal value of
the smoothing parameter
All LM runs were performed using Lemur
Experiments and Results (2/8)
DM
Features
SVM
svm-light for SVM runs
A linear kernel gives the best performance on most
data sets (and converges rapidly)
ME
The ME toolkit of Zhang
Experiments and Results (3/8)
The comparison of performance of LM, SVM and ME
In 50% (8/16) of the runs the differences are statistically
indistinguishable; in 12.5% (2/16) SVM is statistically better
than LM; in 37.5% (6/16) LM is superior to SVM
Experiments and Results (4/8)
Discussion
Official TREC runs used query expansion
DM can improve the performance by including other features
such as proximity of query terms, occurrence of query terms
as noun-phrases, etc. Such features would not be easy to
incorporate into the LM framework
With the emergence of modern IR collections such as the web and
scientific literature, which are characterized by a diverse variety
of features, we will increasingly rely on models that can
automatically learn these features from examples
Experiments and Results (5/8)
Home-page finding on web collection
Choose the home-page finding task of TREC-10 where many
features such as title, anchor-text and link structure
influence relevance
Example – return the web page http://trec.nist.gov
when the query “Text Retrieval Conference” is issued
Corpus – WT10G
Queries
50 for training, 50 for development and 145 for testing
Evaluation Metrics
The mean reciprocal rank (MRR)
Success rate – an answer is found in top 10
Failure rate – no answer is returned in top 100
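The three metrics are straightforward to compute from the rank of the first correct home page per query; a minimal sketch, where ranks maps each query id to that rank, or None if no answer appears in the top 100.

def homepage_metrics(ranks):
    n = len(ranks)
    # MRR: mean of 1/rank over queries (0 contribution when unanswered).
    mrr = sum(1.0 / r for r in ranks.values() if r is not None) / n
    success_at_10 = sum(1 for r in ranks.values() if r is not None and r <= 10) / n
    failure_at_100 = sum(1 for r in ranks.values() if r is None) / n
    return mrr, success_at_10, failure_at_100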
Experiments and Results (6/8)
Three Indexes
A content index consisting of the textual content of the
documents with all the HTML tags removed
An index of the anchor text documents
An index of the titles of all documents
20 Features
Use the 6 features from the ad-hoc retrieval task on each of
the three indexes (18 features)
Two additional features
URL-depth – a home-page typically is at depth 1
Link-Factor
$$\text{Link-Factor} = \log\Big(1 + \frac{\text{num-links}(D)}{\text{Avg-num-links}}\Big)$$
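The two extra features are easy to compute; a short sketch, where the URL-parsing heuristic is an assumption.

import math

def url_depth(url):
    # Home pages typically sit at depth 1, e.g. http://trec.nist.gov/ -> 1.
    path = url.split("://", 1)[-1].rstrip("/")
    return path.count("/") + 1

def link_factor(num_links, avg_num_links):
    # log(1 + num-links(D) / Avg-num-links), per the formula above.
    return math.log(1.0 + num_links / avg_num_links)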
Experiments and Results (7/8)
Performance on development set
Performance on test set
Experiments and Results (8/8)
Discussion
SVMs leverage a variety of features and improve on the
baseline LM performance by 48.6% in MRR
The best run in TREC-10 achieved an MRR of 0.77 on the
test set; however, their feature weights were optimized
using empirical means while our models learn them
automatically.
Our experiments only demonstrate the learning ability of SVMs
We believe a lot more needs to be done in defining the right
kind of features, such as using PageRank for the link-factor feature.
Related Works
There have been a few attempts at applying discriminative models to IR
Cooper and Huizinga make a strong case for applying the
maximum entropy approach to the problems of information
retrieval
Kantor and Lee extend the analysis of the principle of maximum
entropy in the context of information retrieval
Greiff and Ponte showed that the classic binary independence
model and the maximum entropy approach are equivalent
Gey suggested the method of logistic regression, which is
equivalent to the method of maximum entropy used in our work
Conclusion and Future Work (1/2)
Treat IR as a problem of binary classification
Quantify relevance explicitly
Permits us to apply sophisticated pattern classification techniques
Explored applying SVMs and ME to IR
Their main utility to IR lies in their ability to learn
automatically a variety of features that influence relevance
Ad-hoc retrieval
SVMs perform as well as LMs
Home-page finding
SVMs outperform the baseline runs by about 50% in
MRR
Conclusion and Future Work (2/2)
Future Work
Further improvement through better feature engineering and
by leveraging the huge body of literature on SVMs and other
learning algorithms
Evaluate the performance of SVMs on ad-hoc retrieval task
with longer queries
Enhance features such as proximity of query terms,
synonyms, etc.
Study user modeling by incorporating user-preferences as
features in the SVMs