Transcript ppt

Discriminative Models for
Information Retrieval
Ramesh Nallapati
UMass
SIGIR 2004
Abstract

Discriminative model vs. Generative model


Performance Comparison



Discriminative – attractive theoretical properties
Discriminative – Maximum Entropy, Support Vector Machine
Generative – Language Modeling
Experiment


Ad-hoc Retrieval
 ME is worse than LM, SVM are on par with LM
Home-Page Finding
 Prefer SVM over LM
Introduction



Traditional IR
 A problem of measuring the similarity between docs and
query, such as Vector Space Model
 Shortcoming
 Term-weights – empirically tuned
 No theoretical basis for computing optimum weights
Binary independence retrieval (BIR)
 Robertson and Sparck Jones (1976)
 First model that viewed IR as a classification problem
 This allows us to leverage many sophisticated techniques
developed in ML domanin
Discriminative models
 Good success in many applications of ML
Discriminative and Generative Classifiers

Pattern Classification



The problem of classifying an example based on its vector of
features x into its class C through a posterior probability
P(C|x) or simply a confidence score g(C|x)
Discriminative models
 Model the posterior directly or learn a direct map from
inputs x to the class labels C
Generative models
 Model the class-conditional probability P(x|C) and the
prior probability P(C) and estimate the posterior through
the Bayes’ rule
P(C | x)  P( x | C ) P(C )
Probabilistic IR models as Classifiers (1/3)

Binary Independence Retrieval (BIR) model

Ranking is done by the log-likelihood ration of relevance
P( R | D)
P( D | R) P( R)
)  log(
)
P( R | D)
P( D | R) P( R)
P( xi  1| R)
P( xi  0 | R) P( R)
 log( 
)

i: xi 1 P ( xi  1| R ) i: xi  0 P ( xi  0 | R ) P ( R )
log(



Ranking is done by the log-likelihood ration of relevance
The model has not met with good empirical success owing
to the difficulty in estimating the class conditional P(xi=1|R)
Assume uniform probability distribution over the entire
vocabulary and update the probabilities as relevant docs are
provided by the user
Probabilistic IR models as Classifiers (2/3)

Two-Poisson model

Follow the same framework as that of BIR model, but they use
a mixture of two Poisson distributions to model the class
conditions P ( D | R ) and P ( D | R )
n
P( D | R)   ( P (c( wi )  k | E ) P ( E | R )  P (c ( wi )  k | E ) P ( E | R ))
i 1
exp(l )l k
exp( m)m k
 ( p
 (1  p)
)
k!
k!
i 1
n


This also is a generative model
Similar to the BIR model, it also needs relevance feedback for
accurate parameter estimation
Probabilistic IR models as Classifiers (3/3)

Language Models


Ponte and Croft (1998)
The ranking of a doc is given by the probability of
generation of the query from doc’s language model
n
P(Q | D)  P (q1 , q2 ,.., qn | D )   P (qi | D )
i 1


This model circumvent the problem of estimating the model
of relevant documents that the BIR model and Two-Poisson
suffer from
LM can be considered generative classifiers in a multi-class
classification sense
The Case for Discriminative Models for IR (1/3)

Discriminative vs. Generative


One should solve the problem (classification) directly and
never solve a more general problem (class-conditional) as an
intermediate step
Model Assumptions

GM



Terms are conditionally independent
LM assume docs obey a multinomial distribution of terms
DM

Typically make few assumptions and in a sense, let the
data speak for itself.
The Case for Discriminative Models for IR (2/3)

Expressiveness



GM - LM are not expressive enough to incorporate many
features into the model
DM - It can include all features effortlessly into a single
model
Learning arbitrary features

In view of the many query dependent and queryindependent doc features and user-preferences that
influence features, we believe that a DM that learns all the
features is best suited for the generalized IR problem
The Case for Discriminative Models for IR (3/3)

Notion of Relevance


In LM, there is no explicit notion of relevance. There has
been considerable controversy on the missing relevance
variable in LM.
We believe that Robertson’s view of IR as a binary
classification problem of relevance is more realistic than the
implicit notion of relevance as it exists in LM.
Discriminative Models Used in Current Work (1/2)

Maximum Entropy Model


The principle of ME – model all that is known and assume
nothing about that which is unknown
The parametric form of the ME probability function can be
expressed by
n
1
P ( R | D, Q ) 
exp( i , R fi ( D, Q))
Z (Q, D)
i 1


The feature weights (λ) are learned from training data using
a fast gradient descent algorithm
As in Robertson’s BIR model, we use the log-likelihood ratio
as the scoring function for ranking as shown follows
P( R | D, Q) n
log
  (i , R  i , R ) fi ( D, Q)
P( R | D, Q) i 1
Discriminative Models Used in Current Work (2/2)

Support Vector Machines (SVM)

Basic idea – find the hyper-plane that separate the two classes of
training examples with the largest margin

If f(D,Q) is the vector of features, then the discriminative function
is given by
g ( R | D, Q)  w ( f ( D, Q))  b
The SVM is trained such that g(R|D,Q)>=1 for positive (relevant)
examples and g(R|D,Q) <= -1 for negative (non-relevant)
examples as long as the data is separable
Both DMs

Retaining the basic framework of the BIR model, while avoid
estimating the class-conditional and instead directly compute the
posterior P(R|Q,D) or the mapping function g(R|R,Q)


Other Modeling Issues

Out of Vocabulary words (OOV) problem



Test queries are almost always guaranteed to contain words
that are not seen in the training queries
The features are not based on words themselves, but on
query-based statistics of documents such as the total
frequency of occurrences or the sum-total of the idf-values
of the query terms
Unbalanced data



The classes (non-relevant) is a large portion of all the
examples, while the other (relevant) class have only a small
percent of of the examples
Over-sampling the minority class
Under-sampling the majority class
Experiments and Results (1/8)

Ad-hoc retrieval
 Data set



Preprocessing
 K-stemmer and removing stop-words
Only use title queries for retrieval
LM
 Training the LM consists of learning the optimal value of
the smoothing parameter
 All LM runs were performed using Lemur
Experiments and Results (2/8)

DM



Features
SVM
 svm-light for SVM runs
 Linear kernel gives the best performance on most
data set (converge rapidly)
ME
 The toolkit of Zhang
Experiments and Results (3/8)


The comparison of performance of LM, SVM and ME
50% (8/16) is indistinguishable; 12.5% (2/16) is that SVM is
statistically better than LM; 37.5 (6/16) is that LM is superior
to SVM
Experiments and Results (4/8)

Discussion



Official TREC runs – query expansion
DM can improve the performance by including other features
such as proximity of query terms, occurrence of query terms
as noun-phrases, etc. Such features would not be easy to
incorporate into the LM framework
In the emergence of modern IR collections such as web and
scientific literature that are characterized by a diverse variety
of features, we will increasingly rely on models that can
automatically learn these features from examples
Experiments and Results (5/8)

Home-page finding on web collection
 Choose the home-page finding task of TREC-10 where many
features such as title, anchor-text and link structure
influence relevance
 Example – returned the web-page http://trec.nist.gov
where the query “Text Retrieval Conference” is issued
 Corpus – WT10G
 Queries
 50 for training, 50 for development and 145 for testing
 Evaluation Metrics
 The mean reciprocal rank (MRR)
 Success rate – an answer is found in top 10
 Failure rate – no answer is returned in top 100
Experiments and Results (6/8)

Three Indexes




A content index consisting of the textual content of the
documents with all the HTML tags removed
An index of the anchor text documents
An index of the titles of all documents
20 Features


Use 6 previous features from each of the three indexes
Two additional features
 URL-depth – a home-page typically is at depth 1

Link-Factor
log(1 
num-links( D)
)
Avg-num-links
Experiments and Results (7/8)

Performance on development set

Performance on test set
Experiments and Results (8/8)

Discussion




SVMs leverage a variety of features and improve on the
baseline LM performance by 48.6% in MRR
The best run in TREC-10 achieved an MRR of 0.77 on the
test set; however, their feature weights were optimized
using empirical means while our models learn them
automatically.
Only demonstrate the learning ability of SVMs
We believe there is a lot more that needs to be in defining
the right kind of features, such as PageRank for the linkfactor feature.
Related Works





A few attempts in applying discriminative models for IR
Cooper and Huizinga make a strong case for applying the
maximum entropy approach to the problems of information
retrieval
Kantor and Lee extend the analysis of the principle of maximum
entropy in the context of information retrieval
Greiff and Ponte showed that the classic binary independence
model and the maximum entropy approach are equivalent
Gey suggested the method of logistic regression, which is
equivalent to the method of maximum entropy used in our work
Conclusion and Future Work (1/2)

Treat IR as a problem of binary classification



Quantify relevance explicitly
Permit us apply sophisticated pattern classification technique
Explore SVMs and MEs to IR



Their main utility to IR lies in their ability to learn
automatically a variety of features that influence relevance
Ad-hoc retrieval
 SVMs perform as well as LMs
Home-page finding
 SVMs outperform the baseline runs by about 50% in
MRR
Conclusion and Future Work (2/2)

Future Work




Further improvement through better feature engineering and
by leveraging a huge body of literature on SMVs and other
learning algorithms
Evaluate the performance of SVMs on ad-hoc retrieval task
with longer queries
Enhance features such as proximity of query terms,
synonyms, etc.
Study user modeling by incorporating user-preferences as
features in the SVMs