
LM Approaches to Filtering
Richard Schwartz, BBN
LM/IR ARDA 2002
September 11-12, 2002
UMASS
Topics
• LM approach
  – What is it?
  – Why is it preferred?
• Controlling the filtering decision
What is LM Approach?
• We distinguish all ‘statistical’ approaches from ‘probabilistic’ approaches.
• The tf-idf metric computes various statistics of words and documents.
• By ‘probabilistic’ approaches, we (I) mean methods where we compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes
P(Doc is Relevant | Query, Document, Collection, etc.)
• If we use Bayes’ rule, we end up with the prior for each document, P(Doc is Relevant | Everything except Query), and the likelihood of the query, P(Q | Doc is Relevant).
• The LM approach is a solution to the second part of this.
• The prior probability component is also important.
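Spelled out, the Bayes’ rule decomposition above is (writing Q for the query, D for the document, and Rel for “Doc is Relevant”):

    P(Rel | Q, D) = P(Q | D, Rel) * P(Rel | D) / P(Q | D)

For a fixed document and query, the denominator P(Q | D) is the same whether or not the document is relevant, so the accept/reject decision depends only on the numerator: the query likelihood P(Q | D, Rel), which is what the LM approach models, times the document prior P(Rel | D).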
What it is not
• If we compute an LM for the query and a document and ask the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
• The LMs would not be expected to be the same even with long queries.
Issues in LM Approaches for Filtering
• We (ideally) have three sets of documents:
  – Positive documents
  – Negative documents
  – A large corpus of unknown (mostly negative) documents
• We can estimate a model for both positive and negative documents
  – We can find more positive documents in the large corpus
  – We use the large corpus to smooth the models from the positive and negative documents
• We compute the probability of each new document given each of the models
• The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative (a sketch follows below).
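A minimal sketch of this scoring step in Python, assuming smoothed unigram models; the names and the add-alpha smoothing here are illustrative assumptions, not details taken from the talk:

import math
from collections import Counter

def unigram_model(docs, vocab, alpha=1.0):
    # Add-alpha smoothed unigram distribution estimated from a set of documents.
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

def log_likelihood(doc, model):
    # Log probability of a tokenized document under a unigram model.
    return sum(math.log(model[w]) for w in doc)

def llr_score(doc, pos_model, neg_model):
    # Log-likelihood ratio: positive values suggest the document is on-topic.
    return log_likelihood(doc, pos_model) - log_likelihood(doc, neg_model)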
Language Modeling Choices
• We can model the probability of the document given the topic in many ways.
• A simple unigram mixture works surprisingly well.
  – Weighted mixture of distributions from the topic training and the full corpus
• We improve over the ‘naïve Bayes’ model significantly by using the Expectation-Maximization (EM) technique (a sketch follows below).
• We can extend the model in many ways:
  – N-gram model of words
  – Phrases: proper names, collocations
• Because we use a formal generative model, we know how to incorporate any effect we want.
  – E.g., probability of features of top-5 documents given some document is relevant
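As an illustration of the mixture and the EM step, consider P(w) = lam * P_topic(w) + (1 - lam) * P_corpus(w). The sketch below re-estimates the single mixture weight by standard two-component EM; the talk does not spell out BBN’s exact formulation, and the names are illustrative:

def em_mixture_weight(docs, p_topic, p_corpus, iters=20, lam=0.5):
    # Estimate lam in  P(w) = lam * p_topic[w] + (1 - lam) * p_corpus[w]  by EM.
    # Assumes p_corpus is smoothed so that p_corpus[w] > 0 for every token seen.
    for _ in range(iters):
        expected, total = 0.0, 0
        for doc in docs:
            for w in doc:
                pt = lam * p_topic.get(w, 0.0)
                pc = (1.0 - lam) * p_corpus[w]
                expected += pt / (pt + pc)  # E-step: posterior that w came from the topic model
                total += 1
        lam = expected / total              # M-step: re-estimate the mixture weight
    return lam

The mixed distribution can then stand in for pos_model in the llr_score sketch above.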
How to Set the Threshold
• For filtering, we are required to make a hard decision of whether to accept the document, rather than just rank the documents.
• Problems:
  – The score for a particular document depends on many factors that are not important for the decision:
    • Length of the document (see the sketch below)
    • Percentage of low-likelihood words
  – The range of scores depends on the particular topic.
• Would like to map the score for any document and topic into a real posterior probability
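As one illustration of removing the document-length factor (my example; the talk does not say how BBN handles it), the summed log-likelihood ratio can be averaged per word, reusing llr_score from the earlier sketch:

def length_normalized_score(doc, pos_model, neg_model):
    # The raw sum grows with document length; the per-word average does not.
    return llr_score(doc, pos_model, neg_model) / max(len(doc), 1)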
Score Normalization Techniques
• By using the relative score for two models, we remove some of the variance due to the particular document.
• We can normalize for the peculiarities of the topic by computing the distribution of scores for Off-Topic documents (a sketch follows below).
• Advantages of using Off-Topic documents:
  – We have a very large number of documents
  – We can fix the probability of false alarms
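A sketch of such a normalization, under the added assumption that Off-Topic scores are roughly Gaussian (the talk does not specify the exact method). A fixed threshold on the normalized score then fixes the false-alarm probability across topics:

import statistics
from math import erf, sqrt

def normalize_score(raw_score, off_topic_scores):
    # Z-normalize against the empirical distribution of Off-Topic document scores.
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    z = (raw_score - mu) / sigma
    # Under the Gaussian assumption, the probability that an Off-Topic
    # document scores at least this high (the false-alarm probability):
    p_false_alarm = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return z, p_false_alarm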
The Bottom Line
• For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-language text, cross-language text, speech recognition output, etc.
• Large improvements will come after multiple sites start using similar techniques.
Grand Challenges
• Tested in TDT:
  – Operating with small amounts of training data for each category
    • 1 to 4 documents per event
  – Robustness to changes over time
    • adaptation
  – Multi-lingual domains
  – How to set the threshold for filtering
  – Using a model of ‘eventness’
• Large hierarchical category sets
  – How to use the structure
• Effective use of prior knowledge
• Predicting performance and characterizing classes
• Need a task where both the discriminative and the LM approaches will be tested.
What do you really want?
• If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about:
  – Airplane crashes
  – Terrorism
  – Building fires
  – Injuries and Death
  – Some combination of the above
• In general, we need a way to clarify which combination of topics the user wants.
• In TDT, we predefine the task to mean we want more about this specific event (and not about some other terrorist airplane crash into a building).