LM Approaches to Filtering
Richard Schwartz, BBN
LM/IR ARDA 2002
September 11-12, 2002
UMASS
Topics
LM approach
– What is it?
– Why is it preferred?
Controlling the filtering decision
What is LM Approach?
We distinguish ‘statistical’ approaches in general from ‘probabilistic’ approaches.
The tf-idf metric, for example, computes various statistics of words and documents.
By ‘probabilistic’ approaches, we (I) mean methods where we
compute the probability of a document being relevant to a user’s
need, given the query, the document, and the rest of the world,
using a formula that arguably computes
P(Doc is Relevant | Query, Document, Collection, etc.)
If we use Bayes’ rule, we end up with a prior for each document,
P(Doc is Relevant | Everything except Query), and the likelihood of
the query, P(Query | Doc is Relevant).
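Written out in the slide’s own notation (a sketch, abbreviating ‘Doc is Relevant’ as Rel):

    P(Rel | Query, Document, Collection) ∝ P(Query | Rel, Document, Collection) × P(Rel | Document, Collection)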
The LM approach is a solution to the second of these, the query likelihood.
The prior probability component is also important.
What it is not
If we compute an LM for the query and an LM for a document and
ask the probability that the two underlying LMs are the
same, I would NOT call this a posterior probability
model.
The LMs would not be expected to be the same even
with long queries.
Issues in LM Approaches for Filtering
We (ideally) have three sets of documents:
– Positive documents
– Negative documents
– Large corpus of unknown (mostly negative) documents
We can estimate a model for both the positive and the
negative documents
– We can find more positive documents in the large corpus
– We use the large corpus to smooth the models estimated from
the positive and negative documents
We compute the probability of each new document given
each of the models.
The log of the ratio of these two likelihoods is a
score that indicates whether the document is
positive or negative.
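A minimal sketch of this scoring in Python, assuming documents arrive as lists of word tokens; the mixture weight lam and the add-one floor on corpus counts are illustrative choices, not values from the talk:

    import math
    from collections import Counter

    def unigram_model(training_docs, corpus_counts, corpus_total, lam=0.5):
        # Mix the class's word distribution with the large-corpus
        # distribution; the corpus term keeps every probability nonzero.
        counts = Counter(w for doc in training_docs for w in doc)
        total = sum(counts.values())
        def prob(w):
            p_class = counts[w] / total if total else 0.0
            p_corpus = (corpus_counts.get(w, 0) + 1) / (corpus_total + 1)
            return lam * p_class + (1.0 - lam) * p_corpus
        return prob

    def llr_score(doc, p_positive, p_negative):
        # Log of the ratio of the two document likelihoods; a positive
        # value indicates the positive (on-topic) model fits better.
        return sum(math.log(p_positive(w) / p_negative(w)) for w in doc)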
Language Modeling Choices
We can model the probability of the document given the
topic in many ways.
A simple unigram mixture works surprisingly well.
– Weighted mixture of distributions from the topic training and
the full corpus
We improve significantly over the ‘naïve Bayes’ model
by using the Expectation-Maximization (EM) technique
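One common instantiation (a sketch under the assumption that EM re-estimates the topic-versus-corpus mixture weight; the slide does not spell out which parameters are re-estimated). Here p_topic and p_corpus are word-probability functions like those in the earlier sketch:

    def em_mixture_weight(docs, p_topic, p_corpus, lam=0.5, iters=20):
        # E-step: posterior that each word token came from the topic
        # component; M-step: set the weight to the average posterior.
        tokens = [w for doc in docs for w in doc]
        for _ in range(iters):
            posterior_sum = 0.0
            for w in tokens:
                pt = lam * p_topic(w)
                pc = (1.0 - lam) * p_corpus(w)
                posterior_sum += pt / (pt + pc)
            lam = posterior_sum / len(tokens)
        return lam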
We can extend the model in many ways:
– N-gram model of words
– Phrases: proper names, collocations
Because we use a formal generative model, we know
how to incorporate any effect we want.
– E.g., probability of features of top-5 documents given some
document is relevant
How to Set the Threshold
For filtering, we are required to make a hard decision on
whether to accept each document, rather than just ranking
the documents.
Problems:
– The score for a particular document depends on many factors
that are not important for the decision
• Length of document
• Percentage of low-likelihood words
– The range of scores depends on the particular topic.
We would like to map the score for any document and topic
into a real posterior probability.
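One simple illustration of removing the length factor (an assumption on my part, not a technique stated on the slide): dividing the log score by the document length makes documents of different sizes comparable.

    def per_word_score(log_likelihood_ratio, num_words):
        # Per-word score removes the linear growth of the raw
        # log-likelihood-ratio score with document length.
        return log_likelihood_ratio / max(num_words, 1)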
Score Normalization Techniques
By using the relative score for two models, we remove
some of the variance due to the particular document.
We can normalize for the peculiarities of the topic by
computing the distribution of scores for Off-Topic
documents.
Advantages of using Off-Topic documents:
– We have a very large number of documents
– We can fix the probability of false alarms
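A minimal sketch of both ideas, assuming we normalize against the empirical score distribution of off-topic documents for the topic and then place the threshold at a chosen false-alarm rate; the 1% rate is illustrative:

    import statistics

    def z_normalize(score, off_topic_scores):
        # Express a raw score in standard-deviation units of the
        # off-topic score distribution for this topic.
        mu = statistics.mean(off_topic_scores)
        sigma = statistics.stdev(off_topic_scores)
        return (score - mu) / sigma

    def threshold_at_false_alarm_rate(off_topic_scores, fa_rate=0.01):
        # Empirical quantile: about fa_rate of off-topic documents
        # score above the returned threshold.
        s = sorted(off_topic_scores)
        k = int((1.0 - fa_rate) * len(s))
        return s[min(k, len(s) - 1)]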
The Bottom Line
For TDT tracking, the probabilistic approach to
modeling the document and to score normalization
results in better performance, whether for mono-language,
cross-language, speech recognition output, etc.
Larger improvements will come after multiple sites start
using similar techniques.
Grand Challenges
Tested in TDT
– Operating with small amounts of training data for each category
• 1 to 4 documents per event
– Robustness to changes over time
• adaptation
– Multi-lingual domains
– How to set threshold for filtering
– Using model of ‘eventness’
Large hierarchical category sets
– How to use the structure
Effective use of prior knowledge
Predicting performance and characterizing classes
Need a task where both the discriminative and the LM approaches
will be tested.
What do you really want?
If a user provides a document about the 9/11 World
Trade Center crash and says they want “more like this”,
do they want documents about:
– Airplane crashes
– Terrorism
– Building fires
– Injuries and Death
– Some combination of the above
In general, we need a way to clarify which combination
of topics the user wants
In TDT, we predefine the task to mean we want more
about this specific event (and not about some other
terrorist airplane crash into a building).