
Evaluation
LBSC 708A/CMSC 838L
Session 6, October 16, 2001
Philip Resnik and Douglas W. Oard
Agenda
• Philip Resnik
– Questions
– Evaluation fundamentals
– User-centered strategies
• Doug Oard
– System-centered strategies
– Lessons From TREC
Basis for Selection
• Visceral (9)
– “I’ll know it when I see it” (expected answer)
• Conscious (6)
– Might be true for known item retrieval
• Formalized (0)
– Used in the TREC interactive evaluations
• Compromised (3)
– I can’t imagine a case where this would be right
Muddiest Points
• Clustering (4)
– How are clusters formed? (what are heuristics?)
– What meaning do clusters have?
• How starfields and contours are used (3)
• Projection to fewer dimensions (2)
– How are the dimensions selected? [FastMap]
• Summarization (2)
– Distinction between salience and selectivity
• Why did we devote an entire class to HCI?
Single-Link Clustering
• Pick any document as the first cluster “seed”
• Add the most similar document to each cluster
– Cosine measure, Okapi BM25, etc.
– Start a new cluster if none are within a threshold
– Adding the same document joins two clusters
• Optionally check large clusters for splits
– Recluster largest clusters with a tighter threshold
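A minimal Python sketch of this greedy single-link procedure, assuming documents are represented as sparse term-weight dictionaries and compared with cosine similarity; the 0.2 threshold is illustrative, and the cluster-merging and re-splitting steps above are omitted:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def single_link(docs, threshold=0.2):
    """Each document joins the cluster containing its most similar
    already-clustered document, or seeds a new cluster if no similarity
    reaches the threshold."""
    clusters = []       # list of lists of document indices
    assignment = {}     # document index -> cluster index
    for i, doc in enumerate(docs):
        best_sim, best_j = 0.0, None
        for j in assignment:                  # compare to every clustered doc
            sim = cosine(doc, docs[j])
            if sim > best_sim:
                best_sim, best_j = sim, j
        if best_j is not None and best_sim >= threshold:
            assignment[i] = assignment[best_j]
            clusters[assignment[i]].append(i)
        else:
            assignment[i] = len(clusters)     # start a new cluster
            clusters.append([i])
    return clusters
```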
FastMap Projection
• Select any point (document) at random
• Find the furthest point (least similar document)
• Use coordinates of two points to define a line
– A line defines an “orthogonal hyperplane”
• Project all points to the orthogonal hyperplane
– Project to the line, then subtract from original point
• Repeat for the desired number of dimensions
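A rough sketch of this projection, assuming documents already have explicit vector coordinates (the original FastMap needs only pairwise distances; this simplified version works directly on vectors, and the pivot choice is random as described above):

```python
import numpy as np

def fastmap(points, k=2, seed=None):
    """Reduce row-vector points to k dimensions: pick a pivot pair, record
    each point's coordinate along the pivot line, then project everything
    onto the orthogonal hyperplane and repeat."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float).copy()
    coords = np.zeros((len(pts), k))
    for dim in range(k):
        a = pts[rng.integers(len(pts))]                        # random pivot
        b = pts[np.argmax(np.linalg.norm(pts - a, axis=1))]    # farthest point
        line = b - a
        norm = np.linalg.norm(line)
        if norm == 0:                 # all remaining points coincide
            break
        u = line / norm
        proj = (pts - a) @ u          # coordinate along the pivot line
        coords[:, dim] = proj
        pts = pts - np.outer(proj, u) # project onto the orthogonal hyperplane
    return coords
```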
Visualizing FastMap
Evaluation Criteria
• Effectiveness
– User-machine system, ranked list quality, …
• Efficiency
– Retrieval time, indexing time, index size
• Usability
– Learnability, novice use, expert use
IR Effectiveness Evaluation
• User-centered strategy
– Given several users, and at least 2 retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”
• System-centered strategy
– Given documents, queries, and relevance judgments
– Try several variations on the retrieval system
– Measure which ranks more good docs near the top
Measures of Effectiveness
• Good measures of effectiveness should
– Capture some aspect of what the user wants
– Have predictive value for other situations
• Different queries, different document collection
– Be easily replicated by other researchers
– Be expressed as a single number
• Allows two systems to be easily compared
• No IR measures of effectiveness are that good!
Comparing Alternative Approaches
• Achieve a meaningful improvement
– An application-specific judgment call
• Achieve reliable improvement in unseen cases
– Can be verified using statistical tests
Evolution of Evaluation in
Language Technology
• Evaluation by inspection of examples
• Evaluation by demonstration
• Evaluation by improvised demonstration
• Evaluation on data using a figure of merit
• Evaluation on test data
• Evaluation on common test data
• Evaluation on common, unseen test data
Resnik’s Ten Commandments of
Language Technology Evaluation
• Thou shalt define pertinent evaluation metrics
• Thou shalt define replicable evaluation metrics
• Thou shalt report all relevant system parameters
• Thou shalt establish upper bounds on performance
• Thou shalt establish lower bounds on performance
• Thou shalt test differences for statistical significance
• Thou shalt not mingle training data with test data
• Thou shalt not mingle training data with test data
• Thou shalt not mingle training data with test data
• Thou shalt not mingle training data with test data
Statistical Significance Tests
• How sure can you be that an observed
difference doesn’t simply result from the
particular queries you chose?
Experiment 1
Query     System A   System B
1         0.20       0.40
2         0.21       0.41
3         0.22       0.42
4         0.19       0.39
5         0.17       0.37
6         0.20       0.40
7         0.21       0.41
Average   0.20       0.40

Experiment 2
Query     System A   System B
1         0.02       0.76
2         0.39       0.07
3         0.16       0.37
4         0.58       0.21
5         0.04       0.02
6         0.09       0.91
7         0.12       0.46
Average   0.20       0.40
Estimating from Samples
Population: mean μ
Sample: mean X
• The sample mean X is a point estimator of μ
– unbiased: the expected value of X equals μ
– consistent: X gets closer to μ as N gets bigger
Sampling, continued
Population: mean μ
Sample:
mean X2
Sample:
mean X1
Sample:
mean X3
Problem: you can’t do all possible experiments.
Statistics is the science of inference from the experiments you do.
Sampling and Experimentation
Sample:
mean X2
Population: mean μ
Sample:
mean X1
Sample:
mean X3
Most experiments will produce a result somewhere near the true value.
Some won’t. So how can you infer anything from experimental results?
Hypothesis Testing Paradigm
• Choose some experimental variable X
• Establish a null hypothesis H0
• Identify distribution of X assuming H0 is true
• Do experiment to get an empirical value x
• Ask: “What’s the probability I would have gotten this value of x if H0 were true?”
• If this probability p is low, reject H0
Hypothesis Testing Example
• In a southern state, 50% of eligible voters are
African American.
• In a particular court case, 4 of 80 potential jurors
were African American.
• Was the pool of potential jurors chosen fairly?
Hypothesis Testing Example,
cont’d
• “Experiment”: choosing a jury
• X: the number of African American jurors
• H0 (null hypothesis): jury pool distribution reflects the population
• Expected distribution of X: binomial (each voter equally likely)
– Intuition: when N=80, μ (the expected value of X) is 40
• Probability of getting x ≤ 4 when N=80: 0.0000000000000000014
• Null hypothesis is vanishingly unlikely!
[Figure: the binomial distribution of X for N=80, centered at 40, with 95% of outcomes falling near the middle of the 0–80 range; x = 4 lies far out in the left tail]
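The probability on the slide can be checked directly against the binomial distribution; a small Python sketch:

```python
from math import comb

def prob_at_most(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# 4 or fewer African American jurors out of 80, if each is equally likely
print(prob_at_most(4, 80))   # ~1.4e-18, matching the slide
```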
Back to IR Evaluation!
• How sure can you be that an observed difference
doesn’t simply result from the particular queries you
chose?
Experiment 1
Query     System A   System B
1         0.20       0.40
2         0.21       0.41
3         0.22       0.42
4         0.19       0.39
5         0.17       0.37
6         0.20       0.40
7         0.21       0.41
Average   0.20       0.40

Experiment 2
Query     System A   System B
1         0.02       0.76
2         0.39       0.07
3         0.16       0.37
4         0.58       0.21
5         0.04       0.02
6         0.09       0.91
7         0.12       0.46
Average   0.20       0.40
Sign Test
• Compare some figure of merit for each topic
– Note which system produces the bigger value
• Null hypothesis: either condition is equally likely to
produce the bigger value for any query
• Value X to measure: how often B gives you a bigger value
than A.
• Compute the probability of the x you got
– Any statistics package contains the formula for this
• Treat p < 0.05 as “statistically significant”
– If p < 0.05, this result probably didn’t happen by chance
– But the results still need to pass the “meaningful” test!
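A minimal Python sketch of the sign test, here in a two-sided form with tied topics dropped; the per-topic scores are the Experiment 2 values from the earlier slide:

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-topic scores.
    Returns the probability, under H0 (p = 0.5), of an outcome
    at least as extreme as the one observed."""
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    n = sum(1 for a, b in zip(scores_a, scores_b) if a != b)   # drop ties
    def p_at_most(x):
        return sum(comb(n, k) for k in range(x + 1)) / 2**n
    tail = min(wins_b, n - wins_b)
    return min(1.0, 2 * p_at_most(tail))

# Experiment 2: same averages as Experiment 1 (0.20 vs 0.40),
# but B beats A on only 4 of the 7 topics -- nowhere near significant.
a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]
print(sign_test(a, b))
```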
Student’s t test
Red values:   0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21   (avg1 = 0.20)
Blue values:  0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41   (avg2 = 0.40)
Null hypothesis: the reds and the blues were sampled from
the same underlying distribution
Measurable value: t = look it up if you’re interested!
Note: this test is only valid if you assume the two distributions
are normal and have the same variance.
Also note: doesn’t assume the values are paired.
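For reference, a sketch of this unpaired two-sample test using SciPy on the Experiment 1 scores from the earlier slide (equal variances assumed, as noted above):

```python
from scipy import stats

a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]   # the "reds"
b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]   # the "blues"

# Two-sample t test assuming equal variances (no pairing assumed)
t, p = stats.ttest_ind(a, b, equal_var=True)
print(t, p)
```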
Paired t test
• More powerful than the sign test
– If the assumptions are satisfied
• Compute the figure of merit difference
– On a topic by topic basis for enough topics to
approximate a normal distribution
• Assume that the topics are independent
• Compute the probability of the outcome you got
– Microsoft Excel includes a TTEST function
• Treat p < 0.05 as “statistically significant”
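A minimal SciPy sketch of the paired test on the same per-topic scores (analogous to Excel's TTEST with the paired option); seven topics is far fewer than the slide recommends, so the numbers are purely illustrative:

```python
from scipy import stats

# Per-topic scores for systems A and B (Experiment 1 from the earlier slide)
a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]

# Paired t test on topic-by-topic differences
t, p = stats.ttest_rel(a, b)
print(p)          # treat p < 0.05 as statistically significant
```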
User Studies
• The only way to account for interface issues
– By studying the interface component
– By studying the complete system
• Formative tests
– Provide a basis for system development
• Summative tests
– Designed to assess performance
Questionnaires
• Collect demographic data
– For example, computer experience
– Basis for interpreting results
• Subjective self-assessment
– Which did they think was more effective?
– Often at variance with objective results!
• Preference
– Which interface did they prefer? Why?
Qualitative User Studies
• Observe user behavior
– Instrumented software, eye trackers, etc.
– Face and keyboard cameras
– Think-aloud protocols
– Interviews and focus groups
• Organize the data
– For example, group it into overlapping categories
• Look for patterns and themes
• Develop a “grounded theory”
Quantitative User Studies
• Select independent variable(s)
– e.g., what info to display in selection interface
• Select dependent variable(s)
– e.g., time to find a known relevant document
• Run subjects in different orders
– Average out learning and fatigue effects
• Compute statistical significance
– p<0.05 is sufficient to reject the null hypothesis
that the independent variable has no effect
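A minimal sketch of assigning counterbalanced presentation orders so that learning and fatigue effects average out across subjects; the interface names are hypothetical:

```python
from itertools import permutations

def counterbalanced_orders(conditions, n_subjects):
    """Cycle subjects through every ordering of the conditions."""
    orders = list(permutations(conditions))
    return [orders[i % len(orders)] for i in range(n_subjects)]

print(counterbalanced_orders(['interface A', 'interface B'], 4))
# [('interface A', 'interface B'), ('interface B', 'interface A'), ...]
```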
Agenda
• Philip Resnik
– Questions
– Evaluation fundamentals
– User-centered strategies
• Doug Oard
– System-centered strategies
– Lessons From TREC
Which is the Best Rank Order?
[Figure: eight example ranked lists, labeled a through h, with relevant documents marked “R” at different positions in each list]
IR Test Collections
• Documents
– Representative quantity
– Representative sources and topics
• Topics (formalized)
– Used to form queries (compromised)
• Relevance judgments
– For each document, with respect to each topic
– This is the expensive part!
Some Assumptions
• Unchanging, known queries
– The same queries are used by each system
• Binary relevance
– Every document is either relevant or it is not
• Unchanging, known relevance
– The relevance of each doc to each query is known
• But only used for evaluation, not retrieval!
• Focus on effectiveness, not efficiency
Set-Based MOE
• Precision
– How much of what was found is relevant?
• Often of interest, particularly for interactive searching
• Recall
– How much of what is relevant was found?
• Particularly important for law and patent searches
• Fallout
– How much of what was irrelevant was rejected?
• Useful when different size collections are compared
The Contingency Table
                    Action
Doc                 Retrieved              Not Retrieved
Relevant            Relevant Retrieved     Relevant Rejected
Not relevant        Irrelevant Retrieved   Irrelevant Rejected

Precision = Relevant Retrieved / Retrieved
Recall = Relevant Retrieved / Relevant
Fallout = Irrelevant Rejected / Not Relevant
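A small Python sketch that computes these three measures from sets of document ids, following the slide's definitions (including its "irrelevant rejected" form of fallout); the example counts are made up:

```python
def set_measures(retrieved, relevant, collection_size):
    """Precision, recall, and fallout from the contingency table above."""
    rel_ret = len(retrieved & relevant)                     # Relevant Retrieved
    irr_rej = collection_size - len(retrieved | relevant)   # Irrelevant Rejected
    precision = rel_ret / len(retrieved) if retrieved else 0.0
    recall = rel_ret / len(relevant) if relevant else 0.0
    fallout = irr_rej / (collection_size - len(relevant))   # as defined on the slide
    return precision, recall, fallout

# Example: 10-document collection, 4 relevant, 5 retrieved of which 3 are relevant
print(set_measures({1, 2, 3, 4, 5}, {1, 2, 3, 9}, 10))      # (0.6, 0.75, 0.667)
```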
Single-Figure Set-Based Measures
• Balanced F-measure
– Harmonic mean of recall and precision
F = 1 / (0.5/P + 0.5/R)
– Weakness: What if no relevant documents exist?
• Utility functions
– Reward relevant retrieved, Penalize non-relevant
• For example, 3R+ - 2N+
– Weakness: Hard to normalize, so hard to average
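A sketch of both single-figure measures; the helper names and example inputs are illustrative:

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall;
    alpha = 0.5 gives the balanced F shown above."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

def utility(rel_retrieved, nonrel_retrieved, reward=3, penalty=2):
    """Linear utility: reward relevant retrieved, penalize non-relevant
    retrieved (the 3R+ - 2N+ example above)."""
    return reward * rel_retrieved - penalty * nonrel_retrieved

print(f_measure(0.6, 0.75))   # 2PR/(P+R) = 0.667
print(utility(3, 2))          # 3*3 - 2*2 = 5
```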
MOE for Ranked Retrieval
• Start with the first relevant doc on the list
– Compute recall and precision at that point
• Move down the list, 1 relevant doc at a time
– Computing recall and precision at each point
• Plot precision for every value of recall
– Interpolate with a nonincreasing step function
• Repeat for several queries
• Average the plots at every point
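A minimal sketch of this procedure for one query, using hypothetical document ids (d1 … d10) that echo the 10-document, 4-relevant example in the following figure:

```python
def recall_precision_points(ranked_list, relevant):
    """(recall, precision) at each relevant document in a ranked list."""
    points, rel_seen = [], 0
    for rank, doc in enumerate(ranked_list, start=1):
        if doc in relevant:
            rel_seen += 1
            points.append((rel_seen / len(relevant), rel_seen / rank))
    return points

def interpolated_precision(points, recall_level):
    """Nonincreasing step interpolation: the maximum precision at any
    recall >= the requested level (0.0 if none)."""
    return max((p for r, p in points if r >= recall_level), default=0.0)

ranked = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10']
relevant = {'d1', 'd4', 'd5', 'd9'}          # 10 docs, 4 relevant
pts = recall_precision_points(ranked, relevant)
print(pts)
print([interpolated_precision(pts, r / 10) for r in range(11)])
```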
The Precision-Recall Curve
[Figure: a worked example with 10 documents, 4 of them relevant; one panel shows precision at each rank from 1 to 10, and another shows the interpolated precision-recall curve as a nonincreasing step function]
Single-Number Measures
• Mean precision at a fixed number of documents
– Precision at 10 docs is the “AltaVista measure”
• Mean precision at a fixed recall level
– Adjusts for the total number of relevant docs
• Mean Average Precision (MAP)
– Average of precision at recall=0.0, 0.1, …, 1.0
– Area under the precision/recall curve
• Mean breakeven point
– Value at which precision = recall
[Figure: a recall-precision curve annotated with the four measures above: precision at 10 docs, precision at recall=0.1, the breakeven point, and average precision]
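Sketches of two of these single-number measures for one topic; note that the slide describes MAP via recall levels 0.0–1.0, while this sketch shows the uninterpolated, per-relevant-document form commonly used at TREC:

```python
def precision_at_k(ranked_list, relevant, k=10):
    """Mean of this over topics is the 'AltaVista measure' above."""
    return sum(1 for doc in ranked_list[:k] if doc in relevant) / k

def average_precision(ranked_list, relevant):
    """Mean of the precision values at the rank of each relevant document;
    unretrieved relevant documents contribute zero."""
    total, rel_seen = 0.0, 0
    for rank, doc in enumerate(ranked_list, start=1):
        if doc in relevant:
            rel_seen += 1
            total += rel_seen / rank
    return total / len(relevant)

ranked = ['d1', 'd2', 'd3', 'd4', 'd5']
relevant = {'d1', 'd4'}
print(precision_at_k(ranked, relevant, k=5))   # 0.4
print(average_precision(ranked, relevant))     # (1/1 + 2/4) / 2 = 0.75
# Mean Average Precision is then just the mean of average_precision over topics.
```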
Single-Figure MOE Weaknesses
• Mean precision at 10 documents
– Averaged over very different numbers of relevant docs
• Mean precision at constant recall
– A specific recall fraction is rarely the user’s goal
• Mean breakeven point
– Nobody ever searches at the breakeven point
• Mean Average Precision
– Users typically operate near an extreme of the curve
• So the average is not very informative
Why Mean Average Precision?
• It is easy to trade between recall and precision
– Adding related query terms improves recall
• But naive query expansion techniques kill precision
– Limiting matches by part-of-speech helps precision
• But it almost always hurts recall
• Comparisons should give some weight to both
– Average precision is a principled way to do this
• Rewards improvements in either factor
Visualizing Average Precision
[Figure: recall-precision plot for a single topic illustrating average precision; the average precision for this topic is 0.353]
Visualizing Mean Average Precision
[Figure: bar chart of average precision (0.0–1.0) by topic; mean average precision averages these per-topic values]
How Much is Enough?
• The maximum average precision is 1.0
– But inter-rater reliability is 0.8 or less
– So 0.8 is a practical upper bound at every point
• Precision 0.8 is sometimes seen at low recall
• Improvement should be meaningful
– Rule of thumb: 5% is noticeable, 10% is important
• Is the improvement statistically significant?
– Typically requires at least 25 topics (50 is better)
Obtaining Relevance Judgments
• Exhaustive assessment can be too expensive
– TREC has 50 queries for >1 million docs each year
• Random sampling won’t work either
– If relevant docs are rare, none may be found!
• IR systems can help focus the sample
– Each system finds some relevant documents
– Different systems find different relevant documents
– Together, enough systems will find most of them
Pooled Assessment Methodology
• Each system submits top 1000 documents
• Top 100 documents for each are judged
– All are placed in a single pool
– Duplicates are eliminated
– Placed in an arbitrary order to avoid bias
• Evaluated by the person that wrote the query
• Assume unevaluated documents not relevant
– Overlap evaluation shows diminishing returns
• Compute average precision over all 1000 docs
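A minimal sketch of pool construction under these assumptions (fixed judging depth, deduplication, and shuffling into an arbitrary order); the depth of 100 matches the slide:

```python
import random

def build_pool(runs, depth=100, seed=0):
    """Form an assessment pool from several systems' ranked lists:
    take the top `depth` documents from each run, merge, drop duplicates,
    and shuffle so the assessor sees the pool in an arbitrary order."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:depth])
    pool = list(pool)
    random.Random(seed).shuffle(pool)
    return pool
```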
TREC Overview
• Documents are typically distributed in April
• Topics are distributed in early June
– Queries are formed from topics using standard rules
• Top 1000 selections are due in mid-August
• Relevance judgments available in October
• Results presented in late November each year
Lessons From TREC
• Incomplete judgments are useful
– If sample is unbiased with respect to systems tested
• Different relevance judgments change absolute score
– But rarely change comparative advantages when averaged
• Evaluation technology is predictive
– Results transfer to operational settings
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999
Effects of Incomplete Judgments
• Additional relevant documents are:
– roughly uniform across systems
– highly skewed across topics
• Systems that don’t contribute to pool get
comparable results
Inter-Judge Agreement
[Figure: number of relevant documents per topic (0–500) as judged by the primary assessor and two additional assessors, A and B]

Overlap of relevance judgments:
Primary & A    .421
Primary & B    .494
A & B          .426
All 3          .301
Effect of Different Judgments
[Figure: average precision for each system computed with the original, union, and intersection relevance judgment sets, plus the mean; absolute scores shift, but the systems rise and fall together]
Net Effect of Different Judges
• Mean Kendall τ between system rankings produced from different qrel sets: .938
• Similar results held for
– Different query sets
– Different evaluation measures
– Different assessor types
– Single opinion vs. group opinion judgments
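A sketch of how such a Kendall τ between two system rankings can be computed with SciPy; the two orderings below are hypothetical, not taken from the TREC data:

```python
from scipy.stats import kendalltau

# Hypothetical orderings of the same six runs under two different qrel sets
ranking_by_original = ['ok7ax', 'att98atdc', 'INQ502', 'bbn1', 'mds98td', 'ok7am']
ranking_by_union    = ['ok7ax', 'INQ502', 'att98atdc', 'bbn1', 'ok7am', 'mds98td']

# Convert each ordering to the rank position it assigns to every run
runs = sorted(ranking_by_original)
r1 = [ranking_by_original.index(run) for run in runs]
r2 = [ranking_by_union.index(run) for run in runs]

tau, p = kendalltau(r1, r2)
print(tau)    # 1.0 would mean identical orderings; the slide reports a mean of .938
```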
Query-Averaged Recall-Precision
[Figure: recall-precision curves for run1, run2, and run3, averaged over queries]
Single-Topic Recall Precision
[Figure: recall-precision curves for run1, run2, and run3 on a single topic]
What Query Averaging Hides
[Figure: per-topic recall-precision curves, showing wide variation across topics that the query-averaged curve hides]
Single Number Summary Scores
• Precision (X): # rel in top X / X
• Recall (Y): # rel in top Y / R
• Average precision: Avg_r (Prec(rank of r)), averaged over relevant documents r
• R-Precision: Prec(R)
• Recall at .5 precision
– use Prec(10) if precision < .5 in top 10
• Rank of first relevant (expected search length)
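Sketches of two more of these scores, using the same ranked-list-plus-relevant-set representation as the earlier examples:

```python
def r_precision(ranked_list, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranked_list[:r] if doc in relevant) / r

def rank_of_first_relevant(ranked_list, relevant):
    """Rank of the first relevant document (related to expected search length);
    None if no relevant document is retrieved."""
    for rank, doc in enumerate(ranked_list, start=1):
        if doc in relevant:
            return rank
    return None
```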
Document Level Measures
• Example: precision at 10 documents
• Easily interpreted
• Less sensitive to ranking
– Only differences that cross cut-off affect results
• May better reflect actual applications
• But less useful for tuning a system
• Don’t average well
– Working in different parts of recall-precision curve
– Some queries don’t have that many documents
TREC Topic
39
39
39
39
39
38
38
38
38
38
37
37
37
37
37
36
36
36
36
36
35
35
35
35
35
9
7
5
3
1
9
7
5
3
1
9
7
5
3
1
9
7
5
3
1
9
7
5
3
1
Known Relevant Documents
Example:
Number Relevant at 30 Documents
400
350
300
250
200
150
100
50
0
Average Precision
• Less easily interpreted
• Sensitive
– Changing one rank will change average precision
– Documents closest to the top have the greatest effect
• Stable
– Small change in rank makes small change in score
Runs Ranked by Different Measures
[Table: TREC-7 ad hoc runs (ok7ax, att98atdc, att98atde, INQ501, INQ502, INQ503, bbn1, mds98td, ok7am, ok7as, tno7exp1, nect’chall, nect’chdes, Cor7A3rrf, pirc8Aa2, acsys7al, and others) listed in rank order under eight different measures: P(10), P(30), R-Prec, Ave Prec, Recall at .5 Prec, Recall (1000), Total Rel, and Rank of 1st Rel]
Ranked by measure averaged over 50 topics
Correlations Between Rankings

              P(30)   R Prec  Ave Prec  R at .5 P  Recall(1000)  Total Rels  Rank 1st Rel
P(10)         .8851   .8151   .7899     .7855      .7817         .7718       .6378
P(30)                 .8676   .8446     .8238      .7959         .7915       .6213
R Prec                        .9245     .8654      .8342         .8320       .5896
Ave Prec                                .8840      .8473         .8495       .5612
R at .5 P                                          .7707         .7762       .5349
Recall(1000)                                                     .9212       .5891
Total Rels                                                                   .5880

Kendall’s τ computed between pairs of rankings
Two Minute Papers
• If I demonstrated a new retrieval technique that
achieved a statistically significant improvement
in average precision on the TREC collection,
what would be the most serious limitation to
consider when interpreting that result?
• What was the muddiest point in today’s lecture?