Maximum Likelihood Estimation for Information Thresholding
Download
Report
Transcript Maximum Likelihood Estimation for Information Thresholding
Maximum Likelihood Estimation for
Information Thresholding
Carnegie
Mellon
Yi Zhang & Jamie Callan
Carnegie Mellon University
{yiz,callan}@cs.cmu.edu
1
Overview
Adaptive filtering: definition and challenges
Threshold based on score distribution and the sampling
bias problem
Maximum likelihood estimation for score distribution
parameters
Results of Experiments
Conclusion
2
Adaptive Filtering
Given an initial description of information needs, a filtering system
sifts through a stream of documents,and delivers relevant documents to
a user as soon as the document arrives. Relevance feedback maybe
available for some of the delivered documents, thus user profiles can
be updated adaptively.
Filtering
System
3
Adaptive Filtering
Three major problems
Learning corpus statistics, such as idf
Learning user profile, such as adding or deleting key words and adjusting
term weights. (Scoring method)
Learning delivery threshold. (Binary judgment)
Evaluation Measures
Linear utility = r1*RR+r2*NR+r3*RN+r4*NN
Optimizing linear utility => Finding P(relevant|document)
In one dimension: P(relevant|document) = P(relevant|score)
F measure
( 1 β 2 )Precision * Recall
F
β 2*Precision Recall
4
A Model of Score Distribution: Assumptions
and Empirical Justification
120
16
Relevant:
1
P( score | R r )
e
2
Non-relevant:
P( score | R nr ) e
number of documents
100
14
( scoreu ) 2
2 2
number of documents
12
10
8
6
4
80
60
40
20
2
( x c )
0.40
0.42
0.44
0.46
0.48
0.50
0.52
0.43
0.46
0.47
0.48
0.49
0.50
0.48
0.49
0.50
160
140
50
number of documents
number of documents
According to other researchers,
this is generally true for various
statistical searching systems
(scoring methods, Manmatha’s
paper, Arampatzis’s paper)
0.45
document score
60
0.44
document score
0
40
30
20
10
0
0.34
120
100
80
60
40
20
0.36
0.38
0.40
0.42
0.44
document score
0.46
0.48
0.5
0
0.43
0.44
0.45
0.46
0.47
document score
Figure 1. Density of document scores: TREC9 OHSU5
Topic 3 and Topic 5
Optimize for Linear Utility Measure: from
Score Distribution to Probability of Relevancy
p: p(r) ratio of relevant documents
P ( score | r ) P (r )
P ( score )
P ( score | r ) P(r )
P ( score | r ) P (r ) P( score | nr ) P (nr )
P (r | score )
1
e
2
1
e
2
( scoreu ) 2
2 2
( scoreu ) 2
2 2
*p
* p e ( scorec ) * (1 p )
6
Optimize for F Measure: From Score
Distribution to Precision and Recall
If set threshold at θ:
non-relevant document
0.07
0.06
0.04
Recall ( ) norm(u, , x)dx
0.04
0.03
relevant document
0.02
0.01
0
Precision ( )
0.4
0.4
1
2
i
0.43
0.44
0.45
0.46
0.47
score
p norm (u , , x ) dx
p norm (u , , x) dx (1 p ) exp( , x) dx
2
(
1
)* P* R
* arg max F arg max
R 2 *P
7
What We Have Now?
A model for score distribution
Algorithms to find the optimal threshold for different
evaluation measures given the model
Learning task: find the parameters for the model?
8
Bias Problem for Parameter Estimation while
Filtering
We only receive feedback for
documents delivered
Parameter estimation based on
random sampling assumption is
biased
Sampling criteria depends on
threshold, which changes over
time
Solution: maximum likelihood
principle, which is guaranteed to
be unbiased
14
0
12
0
number of documents
Estimation
based on
documents
delivered
10
0
8
0
Estimation based on
6
0
all relevant documents
4
0
2
0
0.34 0.36
0.38
0.40
0.42
0.44
0.46
0.48
0.5
document score
Figure: Estimation of parameters for relevant
document scores of TREC9 OHSU Topic 3
with a fixed dissemination threshold 0.4435 9
Unbiased Estimation of Parameters Based on
Maximum Likelihood Principle (1)
ML: the best estimation of parameters is the one that
maximizes the probability of training data:
(u * , * , * , p * ) * arg max P ( D | )
N
arg max P ( Di | )
i 1
N
arg max log( P ( Di | ))
i 1
N
arg max log( P ( Score Scorei , Ri | , Score i ))
i 1
where
(u , , , p )
10
Unbiased Estimation of Parameters Based on
Maximum Likelihood Principle (2)
For each item inside the sum operation of the previous formula:
P( Score Scorei , Ri | , Score i )
P( Score Scorei , Score i , Ri | )
P( Score i | )
P( Score Scorei , Score i | Ri , ) PRi |
P( Score i | )
P( Score Scorei | Ri , ) P( Ri | )
P( Score i | )
11
Unbiased Estimation of Parameters Based on
Maximum Likelihood Principle (3)
Calculating the denominator:
g (u , , , p, i ) P ( Score i | )
P ( R r | ) P ( Score i | R r , )
P ( R nr | ) P ( Score i | R nr , )
non-relevant document
0.07
p
P( Score x | R
i
r ) dx
0.06
0.04
i
(1 p ) P ( Score x | Ri nr ) dx
0.04
i
p
i
p
i
1
e
2
1
e
2
( x u ) 2
2 2
dx (1 p ) e
0.03
relevant document
0.02
*( x c )
dx
0.01
i
( x u ) 2
2 2
0
dx (1 p )e
( i c )
0.4
0.4
1
2
i
0.43
0.44
0.45
0.46
0.47
score
12
Unbiased Estimation of Parameters Based on
Maximum Likelihood Principle (4)
N
(u , , , p ) arg max LPi
*
*
*
*
( u , , , p ) i 1
• For a relevant document delivered:
LPi
( Score i u ) 2
ln( p /( g (u, , , p, i )))
2
• For a non-relevant document delivered:
2
LPi ( Scorei c ) ln((1 p ) / g( u , , , p , i )))
13
Relationship to Arampatzis’s Estimation
If no threshold exists g (u, , , p,i ) P(Score i | ) 1
The previous formula becomes:
N
(u * , * , * , p * ) arg max LPi
•
•
( u , , , p ) i 1
For a relevant document delivered:
( Scorei u ) 2
LPi
2 2
For a non-relevant document delivered:
LPi ( Scorei c) ln(( 1 p) )
Corresponding result will be the same as Arampatzis’s
14
Unbiased Estimation of Parameters Based on
Maximum Likelihood Principle (5)
Optimization
using conjugate gradient
descent algorithm
Smoothing using conjugate prior:
1
2
p
(
1
p
)
Prior for p: beta distribution:
Prior
Set:
2
for variance:
2 2
1 0.001, 2 0.001, 0.005
15
Experimental Methodology (1)
Optimization goal (similar to the measure used by
TREC9):
T9U’=2*Relevant_Retrieved-Non_Relevant_Retrieved=2RR-NR
Corresponding rule: deliver if : P( R r | score ) 0.33
Dataset
OHSUMED data (348566 articles from 1887 to 1991. 63 OHSUMED queries and
500 MeSH headings to simulate user profiles)
FT data (210158 articles from Financial Times 1991 to 1994. TREC topics 351-400
to simulate user profiles)
Each profile begins with 2 relevant documents and an
initial user profile
No profile updating for simplicity.
16
Experimental Methodology (2)
Four
runs for each profile
Run1 : biased estimation of parameters because sampling bias was not considered
Run3 : maximum likelihood estimation.
Both runs will stop delivering documents if the threshold is set too high,
especially in the early stages of filtering. We introduced a minimum delivery
ratio: If a profile has not achieved the minimum delivery ratio, its threshold
will be decreased automatically:
Run 2: biased estimation + minimum delivery ratio
Run 4: maximum likelihood estimation + minimum delivery ratio
Time: 21 minutes for the whole process of 63 OHSU topics on 4
years of OHSUMED data (ML algorithm)
17
Results: OHSUMED Data
OHSU
topics
MESH
topics
Run 1: Biased
estimation
Run 2: Biased
estimation+ min.
delivery Ratio
Run 3:
Unbiased
estimation
Run4:Unbiased
estimation+min.
delivery ratio
T9U’ utility
1.84
3.25
2.7
8.17
Avg. docs. delivered per
profile
3.83
9.65
5.73
18.40
Precision
0.37
0.29
0.36
0.32
Recall
0.036
0.080
0.052
0.137
T9U’ utility
1.89
4.28
2.44
13.10
Avg. docs. delivered per
profile
3.51
11.82
6.22
27.91
Precision
0.42
0.39
0.40
0.34
Recall
0.018
0.046
0.025
0.068
18
Results: Financial Times
Run 1:
Biased
estimation
Run 2:
Biased
estimation +
min. delivery
ratio
Run 3:
Unbiased
estimation
Run 4:
Unbiased estimation +
min. delivery ratio
T9U’ utility
1.44
-0.209
0.65
0.84
Avg. docs.
Delivered per
profile
9.58
10.44
9.05
12.27
Precision
0.20
0.17
0.22
0.26
Recall
0.161
0.167
0.15
0.193
19
Result Analysis: Difference Between Run 4
and Run 2 on TREC9 OHSU Topics
Utility: ML - Biased
Docs delivered:ML -Biased
120
100
100
80
80
60
60
40
40
20
20
0
0
-20
-20
0
20
40
60
80
-40
Topics
•For some of the topics , ML (run 4) has a
much higher utility than Run 2, while they are
similar in most of the other topics
0
20
40
60
80
Topics
• For
most of the topics, ML (Run 4) delivered
more documents than Run 2
20
Conclusion
Score density distribution
Relevant documents: normal distribution
Non-relevant documents: exponential distribution
Bias problem due to non-random sampling can be solved
based on the maximum likelihood principle
Significant improvement in the TREC-9 filtering task.
Future work
Thresholding while updating profiles
Non-random sampling problem in other task
21