notes on image quality

Statistics and Image Evaluation
Oleh Tretiak
Medical Imaging Systems
Fall, 2002
1
Which Image is Better?
Case A
Case B
2
Method
• Rating on a 1 to 5 scale (5 is best)
• Rating performed by 21 subjects
• Statistics:
– Average, maximum, minimum, standard
deviation, standard error for Case A, Case B
– Difference per viewer between Case A and
Case B, and above statistics on difference
3
Observations
Case A:          4 3 3 4 4 5 4 4 4 5 4 4 3 4 4 5 5 4 4 5 5
Case B:          2 2 2 2 2 1 1 2 3 1 2 2 1 1 1 1 1 1 1 1 2
Case A - Case B: 2 1 1 2 2 4 3 2 1 4 2 2 2 3 3 4 4 3 3 4 3
4
Statistics
                   Case A   Case B   Case A - Case B
average             4.14     1.52     2.62
maximum             5.00     3.00     4.00
minimum             3.00     1.00     1.00
sample st. dev.     0.65     0.60     1.02
standard error      0.14     0.13     0.22
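The statistics above can be recomputed from the 21 ratings on the Observations slide. The following is a minimal sketch using only the Python standard library; the helper name `summarize` is ours, not part of the lecture.

```python
import math
import statistics

# Ratings copied from the Observations slide (21 viewers, scale 1 to 5).
case_a = [4, 3, 3, 4, 4, 5, 4, 4, 4, 5, 4, 4, 3, 4, 4, 5, 5, 4, 4, 5, 5]
case_b = [2, 2, 2, 2, 2, 1, 1, 2, 3, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2]
# Per-viewer difference between Case A and Case B.
diff = [a - b for a, b in zip(case_a, case_b)]


def summarize(ratings):
    """Return the five statistics listed on the Statistics slide."""
    n = len(ratings)
    sd = statistics.stdev(ratings)  # sample st. dev. (divides by n - 1)
    return {
        "average": statistics.mean(ratings),
        "maximum": max(ratings),
        "minimum": min(ratings),
        "st_dev": sd,
        "std_err": sd / math.sqrt(n),  # standard error of the mean
    }


summary = {
    "A": summarize(case_a),
    "B": summarize(case_b),
    "A-B": summarize(diff),
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the printed values should agree with the table.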
5
Conclusions
• The image in Case A has a higher average rating than
that in Case B.
• The highest rating for B is the same as the
lowest rating for A. In all other cases, the
ratings for B are lower than those for A.
• Consider the difference (rightmost column on the
previous slide). The ratio of the average to the
standard error (the z value) is 2.62/0.22 ≈ 12. This
value of z is extremely unlikely if the means are
the same.
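To put a number on "extremely unlikely", the z value can be converted to a two-sided tail probability. This is a rough sketch that assumes the normal approximation used on the slide; the variable names are ours.

```python
import math

mean_diff = 2.62  # average of the per-viewer differences (from the slide)
std_err = 0.22    # standard error of that average (from the slide)

# z value: how many standard errors the mean difference lies from zero.
z = mean_diff / std_err

# Two-sided tail probability under a standard normal:
# 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))

print(f"z = {z:.1f}, two-sided p = {p_two_sided:.1e}")
```

The probability is many orders of magnitude smaller than any commonly used significance level.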
6
Experimental Design
• How many observers should we use to test differences
between pictures?
• We expect the difference between two kinds of pictures will be
0.5 rating units. We expect the standard deviation of the
difference measurement to be 1.0 (see experiment above).
We would like to determine this reliably. We therefore
want the confidence interval on the mean to be [mean - 0.5,
mean + 0.5] at 99% confidence. How many observers
should we use?
• Answer: z_0.005 = 2.6. Standard error must be 0.5/2.6 = 0.19.
Std. err. = std. dev./sqrt(n). Therefore n = (1.0/0.19)^2 = 28
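The same calculation can be sketched in code. The function name `observers_needed` is hypothetical. Note that this version uses the unrounded normal quantile (about 2.576), so it gives a slightly smaller n than the hand calculation above, which rounds z to 2.6 and the standard error to 0.19.

```python
import math
from statistics import NormalDist


def observers_needed(st_dev, half_width, confidence):
    """Smallest n whose confidence interval on the mean is within +/- half_width."""
    tail = (1.0 - confidence) / 2.0        # probability in each tail
    z = NormalDist().inv_cdf(1.0 - tail)   # e.g. z_0.005 for 99% confidence
    # Require z * st_dev / sqrt(n) <= half_width and solve for n.
    return math.ceil((z * st_dev / half_width) ** 2)


n = observers_needed(st_dev=1.0, half_width=0.5, confidence=0.99)
print(n)
```

Either way, the answer is somewhat under 30 observers; halving the interval width would require about four times as many.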
7
Today’s Lecture
• Hypothesis testing
• Two kinds of errors
• ROC analysis
• Visibility of blobs
• Quantitative quality measures
8
Hypothesis Testing Example
256x256
128x128
9
Question: Which is better?
• Testing method
– Quality rating by multiple viewers
– Compute per-viewer difference in quality
– Find mean and standard deviation of the
difference
– Compute the z score (mean/std. error)
– How to interpret?
10
Null Hypothesis (H0)
• Assume that the mean is zero (no
difference)
• Find a range of z that would occur when the
mean is zero.
• Accept the null hypothesis if z is in this
range (no difference)
• Reject null hypothesis if z falls outside the
range
11
[Figure: standard normal probability density, plotted for values from -4 to 4 (vertical axis from 0 to 0.45), with the central region shaded]
We show the normal distribution with 0 mean and σ = 1. The shaded area
has probability 0.95, and the two white areas have each probability 0.025.
If we observe Gaussian variables with mean zero, 95% of the observations
will have value between -1.96 and 1.96. The area outside this interval
(0.05 in this case) is called the significance level of the test.
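The acceptance range for z follows directly from the significance level. A minimal sketch, with a hypothetical helper `z_test`:

```python
from statistics import NormalDist


def z_test(z, significance=0.05):
    """Return (critical value, True if the null hypothesis is rejected)."""
    # Two-sided test: split the significance level between the two tails.
    critical = NormalDist().inv_cdf(1.0 - significance / 2.0)
    return critical, abs(z) > critical


critical, reject = z_test(11.9)        # z value from the rating experiment
print(round(critical, 2), reject)
print(z_test(1.0)[1])                  # a z value inside the acceptance range
```

A smaller significance level widens the acceptance range, so a larger |z| is needed to reject the null hypothesis.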
12
Two Kinds of Errors
• In a decision task with two alternatives,
there are two kinds of errors
• Suppose the alternatives are ‘healthy’ (H0) and
‘sick’ (H1)
– Type I error: say sick if healthy
– Type II error: say healthy if sick
13
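[Figure: probability densities under H0 and H1 with the decision threshold marked; α is the area under the H0 density above the threshold, β is the area under the H1 density below it]
• X - observation, t - threshold
α = Pr[X > t | H0] (Type I error)
β = Pr[X < t | H1] (Type II error)
Choosing t, we can trade off between the
two types of errors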
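The tradeoff can be illustrated numerically. The sketch below assumes, purely for illustration, that X is normal with mean 0 and st. dev. 1 under H0 and normal with mean 2 and st. dev. 1 under H1; the function name and these parameter values are not from the lecture.

```python
from statistics import NormalDist


def error_rates(t, mean1=2.0, sd1=1.0):
    """Type I and Type II error probabilities for threshold t.

    Assumed model: X ~ N(0, 1) under H0, X ~ N(mean1, sd1) under H1.
    """
    alpha = 1.0 - NormalDist(0.0, 1.0).cdf(t)  # Pr[X > t | H0]
    beta = NormalDist(mean1, sd1).cdf(t)       # Pr[X < t | H1]
    return alpha, beta


for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    alpha, beta = error_rates(t)
    print(f"t = {t:.1f}  Type I = {alpha:.3f}  Type II = {beta:.3f}")
```

Raising the threshold lowers the Type I error and raises the Type II error; no choice of t reduces both.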
14
Examples of Threshold
Measurement
• Show blobs and noise.
15
Examples
• Measurement of psychophysical threshold
– Detectible flicker, detectible contrast
• Medical diagnosis
– Negative (healthy), positive (sick)
• Home security
– Friend or terrorist
16
Probability of Error
• Pe = P0 α + P1 β (P0, P1: prior probabilities of H0, H1)
• Why bother with two types of error, why
not just Pe?
• In many cases, P1 << P0!
• Two types of error are typically of different
consequence. We therefore don’t want to
mix them together.
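A small numerical example shows why Pe alone can mislead when P1 << P0. All numbers below are hypothetical, chosen only to illustrate the point.

```python
def probability_of_error(p1, type1, type2):
    """Pe = P0 * (Type I error rate) + P1 * (Type II error rate), with P0 = 1 - P1."""
    p0 = 1.0 - p1
    return p0 * type1 + p1 * type2


p1 = 0.01  # hypothetical prevalence of the positive condition

# A reasonable test: 5% false alarms, 20% missed positives.
pe_test = probability_of_error(p1, type1=0.05, type2=0.20)

# A useless rule that always says "negative": no false alarms, misses everything.
pe_always_negative = probability_of_error(p1, type1=0.0, type2=1.0)

print(pe_test, pe_always_negative)
```

Under these assumed numbers the useless rule has the lower Pe, even though it never detects a positive case, which is one reason to report the two error types separately.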
17
ROC Terminology
• ROC — receiver operating characteristic
• H0 — friend, negative; H1 — enemy,
positive
• Pr[X > t | H0] = probability of false alarm =
probability of false positive = PFP = α
• Pr[X > t | H1] = probability of detection =
probability of true positive = PTP = 1 - β
18
The ROC
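[Figure: densities under H0 and H1 with the decision threshold marked, showing PFP and PTP; ROC plot of PTP versus PFP, both axes from 0 to 1, with the area under the curve labeled Az]
• The ROC shows the tradeoff between PFP and PTP as the
threshold is varied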
19
How Do We Estimate the ROC?
• Radiological diagnosis setting
– Positive and negative cases
– The true diagnosis must be evaluated by a
reliable method
• Cases are evaluated by radiologist(s), who
report the data on a discrete scale
– 1 = definitely negative, 5 = definitely positive
20
Binormal Model
• Negative: Normal, mean = 0, st. dev. = 1
$f_N(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$
• Positive: Normal, mean = a, st. dev. = b
$f_P(x) = \frac{1}{\sqrt{2\pi b^2}} \exp\left[-\frac{(x - a)^2}{2b^2}\right]$
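Under the binormal model, each threshold t gives one point on the ROC: PFP = 1 - Φ(t) and PTP = 1 - Φ((t - a)/b), where Φ is the standard normal cumulative distribution. A minimal sketch (function names are ours):

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal, mean 0 and st. dev. 1


def roc_point(t, a, b):
    """Return (PFP, PTP) for threshold t under the binormal model."""
    pfp = 1.0 - PHI.cdf(t)             # Pr[X > t | negative], X ~ N(0, 1)
    ptp = 1.0 - PHI.cdf((t - a) / b)   # Pr[X > t | positive], X ~ N(a, b)
    return pfp, ptp


def roc_curve(a, b, thresholds):
    """ROC points for a list of thresholds, from lowest to highest PFP."""
    return [roc_point(t, a, b) for t in sorted(thresholds, reverse=True)]


for pfp, ptp in roc_curve(a=1.0, b=1.0, thresholds=[-2, -1, 0, 1, 2, 3]):
    print(f"PFP = {pfp:.3f}  PTP = {ptp:.3f}")
```

Sweeping t from large to small traces the curve from near (0, 0) toward (1, 1).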
21
Some Binormal Plots
[Figure: binormal plots in panels labeled "b = 1, a = 1, 2, 3", "b = 2, a = 1, 2, 3", and "b = 0.5, a = 1, 2, 3"]
Az: area under the ROC curve
22
Az formula
$A_z = \Phi\left( \frac{a}{\sqrt{1 + b^2}} \right)$

            a = 1    a = 2    a = 3
b = 0.5     0.8145   0.9632   0.9964
b = 1       0.7602   0.9214   0.9831
b = 2       0.6726   0.8145   0.9101
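The tabulated values follow from Az = Φ(a / sqrt(1 + b^2)), where Φ is the standard normal cumulative distribution. A short sketch that regenerates the table (the function name `az` is ours):

```python
import math
from statistics import NormalDist


def az(a, b):
    """Area under the binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    return NormalDist().cdf(a / math.sqrt(1.0 + b * b))


# One row per value of b, one column per value of a, as in the table.
for b in (0.5, 1.0, 2.0):
    row = "  ".join(f"{az(a, b):.4f}" for a in (1.0, 2.0, 3.0))
    print(f"b = {b}: {row}")
```

Az grows with the separation a and shrinks as the spread b of the positive distribution grows.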
23
Experimental Framework
• Set of positive and negative cases
– Need reliable diagnosis
• Radiologist interprets cases
– Radiologist reports on a scale
• Certainly Negative, Probably Negative, Unclear, Probably
Positive, Certainly Positive
• Estimate ROC, Az
• Compare results in studies with conventional images and
with image processing
24
Statistical Estimation
• Result of experiment is a sample
• If N is very large, estimate is the same as theory
• For practical N, the estimate is true ± error
[Figure: standard deviations of estimates of a, b, and Az for varying numbers of observations. Horizontal axis: number of positive and negative observations, from 10 to 1000; vertical axis from 0 to 0.6. Top curve: σa; middle curve: σb. Trials were with a = b = 1.]
25
Another Approach: Nonparametric
Model
• Ordinal Dominance Graph
• Donald Bamber, Area above the Ordinal Dominance Graph and the
Area below the Receiver Operating Characteristic Graph, J. of Math.
Psych. 12: 387-415 (1975).
• Method: compute frequencies of occurrence for different threshold
levels from the sample, plot on a probability scale.
[Figure: ordinal dominance graph from a Monte Carlo trial with a = 1, b = 1, 10 positive and 10 negative cases; both axes run from 0 to 1]
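Bamber's paper relates the area under the empirical ROC to the Mann-Whitney statistic: the fraction of (negative, positive) pairs in which the positive case scores higher, counting ties as one half. The sketch below computes that area for a simulated sample like the one in the figure; the function names and the random seed are our own choices.

```python
import random


def empirical_auc(negatives, positives):
    """Fraction of (negative, positive) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for y in positives:
        for x in negatives:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(negatives) * len(positives))


def simulate_auc(a=1.0, b=1.0, n_neg=10, n_pos=10, seed=0):
    """One Monte Carlo trial: negatives ~ N(0, 1), positives ~ N(a, b)."""
    rng = random.Random(seed)
    negatives = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    positives = [rng.gauss(a, b) for _ in range(n_pos)]
    return empirical_auc(negatives, positives)


# Repeating the trial with different seeds shows how much the estimate
# scatters when only 10 positive and 10 negative cases are available.
print([round(simulate_auc(seed=s), 2) for s in range(5)])
```

No distributional model is needed for the estimate itself; the normal distributions appear only in the simulation.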
26
Ordinal Dominance - examples
[Figure: four ordinal dominance graphs, in panels labeled (10, 10), (20, 20), (40, 40), and (100, 100); both axes run from 0 to 1]
27
Theory
• Area asymptotically normal
$$\sigma_A^2 = \frac{P(X \ne Y) + (N_X - 1) B_{XXY} + (N_Y - 1) B_{YYX} - 4 (N_X + N_Y - 1)(A - 1/2)^2}{4 N_X N_Y}$$
$$B_{YYX} = P(Y_1, Y_2 < X) + P(X < Y_1, Y_2) - P(Y_1 < X < Y_2) - P(Y_2 < X < Y_1)$$
Worst case
$$\sigma_{\max}^2 = \frac{A(1 - A)}{\min(N_X, N_Y)}$$
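The worst-case expression gives a quick, conservative estimate of how uncertain an area estimate is for a given number of cases. A minimal sketch (the function name and the example values are ours):

```python
import math


def worst_case_sd(area, n_x, n_y):
    """Square root of the worst-case variance A(1 - A) / min(N_X, N_Y)."""
    return math.sqrt(area * (1.0 - area) / min(n_x, n_y))


# Example: area 0.76 (the binormal Az for a = b = 1) at several sample sizes.
for n in (10, 20, 40, 100):
    print(n, round(worst_case_sd(0.76, n, n), 3))
```

Because the bound depends on min(N_X, N_Y), adding cases to only one of the two groups does not tighten it.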
28
Metz
• University of Chicago ROC project:
• http://wwwradiology.uchicago.edu/krl/toppage11.htm
• Software for estimating Az, also sample st.
dev. and confidence intervals.
• Versatile
29
Example
• Compare image processing with
conventional
• Design:
– Should we use same cases for both?
– Yes, better comparison
• Now results from two studies are
correlated!
– Metz software can handle this
30
Design Parameters
(1) unpaired (uncorrelated) test results. The two "conditions" are applied
to independent case samples -- for example, from two different
diagnostic tests performed on the different patients, from two different
radiologists who make probability judgments concerning the presence
of a specified disease in different images, etc.;
(2) fully paired (correlated) test results, in which data from both of two
conditions are available for each case in a single case sample. The two
"conditions" in each test-result pair could correspond, for example, to
two different diagnostic tests performed on the same patient, to two
different radiologists who make probability judgments concerning the
presence of a specified disease in the same image, etc.; and
(3) partially-paired test results -- for example, two different diagnostic
tests performed on the same patient sample and on some additional
patients who received only one of the diagnostic tests.
31
Summary: ROC
• Compare modalities, evaluate effectiveness of a modality
• Need to know the truth
• Issue: two kinds of error
– Specificity, Sensitivity
– Scalar comparison not suitable
• Statistical problem
– More data, better answer
• ROC methodology
– Metz methods and software allow computation of confidence
intervals, significance for tests with practical design parameters
32
Recent Work
• Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance
models for random-effects ROC analysis: The case of
unequal variance structures across modalities. Academic Radiol. 8:
605-615, 2001.
• Gefen S, Tretiak OJ, Piccoli CW, Donohue KD, Petropulu AP, Shankar
PM, Dumane VA, Huang L, Kutay MA, Genis V, Forsberg F, Reid JM,
Goldberg BB. ROC Analysis of Ultrasound Tissue Characterization
Classifiers For Breast Cancer Diagnosis. IEEE Trans. Med. Im. In
press.
35