#### Transcript notes on image quality

**Slide 1.** Statistics and Image Evaluation. Oleh Tretiak, Medical Imaging Systems, Fall 2002.

**Slide 2. Which Image is Better?** (Two images shown: Case A and Case B.)

**Slide 3. Method**

- Rating on a 1 to 5 scale (5 is best)
- Rating performed by 21 subjects
- Statistics:
  - Average, maximum, minimum, standard deviation, and standard error for Case A and for Case B
  - Per-viewer difference between Case A and Case B, and the same statistics on the difference

**Slide 4. Observations**

Ratings by the 21 viewers, with the per-viewer difference in the rightmost column:

- Case A: 4 3 3 4 4 5 4 4 4 5 4 4 3 4 4 5 5 4 4 5 5
- Case B: 2 2 2 2 2 1 1 2 3 1 2 2 1 1 1 1 1 1 1 1 2
- A - B: 2 1 1 2 2 4 3 2 1 4 2 2 2 3 3 4 4 3 3 4 3

**Slide 5. Statistics**

| Statistic | Case A | Case B | A - B |
| --- | --- | --- | --- |
| Average | 4.14 | 1.52 | 2.62 |
| Maximum | 5.00 | 3.00 | 4.00 |
| Minimum | 3.00 | 1.00 | 1.00 |
| Sample st. dev. | 0.65 | 0.60 | 1.02 |
| Standard error | 0.14 | 0.13 | 0.22 |

**Slide 6. Conclusions**

- The image in Case A has a higher average ranking than the one in Case B.
- The highest ranking for B is the same as the lowest ranking for A; in all other cases, the rankings for B are lower than those for A.
- Consider the difference (rightmost column on the previous slide). The ratio of the average to the standard error (the z value) is 2.62/0.22 ≈ 12. Such a value of z is extremely unlikely if the means are the same.

**Slide 7. Experimental Design**

- How many observers should we use to test differences between pictures?
- We expect the difference between the two kinds of pictures to be 0.5 ranking units, and we expect the standard deviation of the difference measurement to be 1.0 (see the experiment above). We would like to determine this reliably, so we want the confidence interval on the mean to be [mean - 0.5, mean + 0.5] at 99% confidence. How many observers should we use?
- Answer: z_0.005 = 2.6, so the standard error must be 0.5/2.6 = 0.19. Since std. err. = std. dev./sqrt(n), n = (1.0/0.19)^2 ≈ 28.

**Slide 8. Today's Lecture**

- Hypothesis testing
- Two kinds of errors
- ROC analysis
- Visibility of blobs
- Quantitative quality measures

**Slide 9. Hypothesis Testing Example** (Two versions of an image shown: 256x256 and 128x128.) Question: which is better?
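The z computation on slides 5-6 and the sample-size rule on slide 7 can be sketched in a few lines of Python, using the per-viewer differences from the slide 4 table:

```python
import math

# Per-viewer differences (Case A minus Case B) from the slide 4 table.
diffs = [2, 1, 1, 2, 2, 4, 3, 2, 1, 4, 2, 2, 2, 3, 3, 4, 4, 3, 3, 4, 3]

n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
std_err = math.sqrt(var) / math.sqrt(n)               # standard error of the mean
z = mean / std_err                                    # z value against "no difference"
print(f"mean={mean:.2f}  st.dev={math.sqrt(var):.2f}  std.err={std_err:.2f}  z={z:.1f}")
# Reproduces the slide statistics: mean 2.62, st. dev. 1.02, std. err. 0.22, z ~ 12.

# Slide 7 sample-size rule: confidence half-width 0.5 at 99% confidence,
# assumed st. dev. of the difference = 1.0, z_{0.005} = 2.576.
z_crit, sigma, half_width = 2.576, 1.0, 0.5
n_needed = math.ceil((z_crit * sigma / half_width) ** 2)
print("observers needed:", n_needed)
# The exact critical value gives 27; the slide's rounded numbers (2.6, 0.19) give 28.
```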
**Slide 10. Testing Method**

- Quality rating by multiple viewers
- Compute the per-viewer difference in quality
- Find the mean and standard deviation of the difference
- Compute the z score (mean/std. error)
- How do we interpret it?

**Slide 11. Null Hypothesis (H0)**

- Assume that the mean is zero (no difference)
- Find the range of z that would occur when the mean is zero
- Accept the null hypothesis if z is in this range (no difference)
- Reject the null hypothesis if z falls outside the range

**Slide 12.** (Plot: normal density with mean 0 and s = 1; the central shaded area has probability 0.95, and the two white tails each have probability 0.025.) If we observe Gaussian variables with mean zero, 95% of the observations will have values between -1.96 and 1.96. The probability outside this interval (0.05 in this case) is called the significance level of the test.

**Slide 13. Two Kinds of Errors**

- In a decision task with two alternatives, there are two kinds of errors.
- Suppose the alternatives are 'healthy' (H0) and 'sick' (H1):
  - Type I error: say sick when healthy
  - Type II error: say healthy when sick

**Slide 14.** (Plot: densities under H0 and H1 separated by a decision threshold; the tail area of H0 above the threshold is a, the tail area of H1 below it is b.)

- X is the observation, t is the threshold.
- a = Pr[X > t | H0] (Type I error)
- b = Pr[X < t | H1] (Type II error)
- By choosing t, we can trade off between the two types of errors.

**Slide 15. Examples of Threshold Measurement**

- Show blobs and noise.

**Slide 16. Examples**

- Measurement of a psychophysical threshold: detectable flicker, detectable contrast
- Medical diagnosis: negative (healthy), positive (sick)
- Home security: friend or terrorist

**Slide 17. Probability of Error**

- Pe = P0·a + P1·b
- Why bother with two types of error, why not just Pe?
- In many cases, P1 << P0.
- The two types of error typically have different consequences, so we don't want to mix them together.
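The threshold tradeoff on slide 14 can be made concrete with a small numerical sketch. The distributions below are assumptions for illustration (H0 ~ N(0, 1) for healthy, H1 ~ N(2, 1) for sick; the separation of 2 is a made-up value, not from the slides):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF, computed with the error function (no SciPy needed)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu0, mu1, sigma = 0.0, 2.0, 1.0   # assumed healthy (H0) and sick (H1) score distributions
for t in (0.5, 1.0, 1.5):
    alpha = 1.0 - norm_cdf(t, mu0, sigma)   # Type I:  X > t under H0 (call a healthy case sick)
    beta = norm_cdf(t, mu1, sigma)          # Type II: X < t under H1 (call a sick case healthy)
    print(f"t={t:.1f}  alpha={alpha:.3f}  beta={beta:.3f}")
```

Raising t shrinks alpha and grows beta, and vice versa, which is exactly the tradeoff the slide describes.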
**Slide 18. ROC Terminology**

- ROC: receiver operating characteristic
- H0: friend, negative; H1: enemy, positive
- Pr[X > t | H0] = probability of false alarm = probability of a false positive = PFP = a
- Pr[X > t | H1] = probability of detection = probability of a true positive = PTP = 1 - b

**Slide 19. The ROC** (Plot: PTP versus PFP as the decision threshold is varied; Az is the area under the curve.)

- The ROC shows the tradeoff between PFP and PTP as the threshold is varied.

**Slide 20. How Do We Estimate the ROC?**

- Radiological diagnosis setting:
  - Positive and negative cases
  - The true diagnosis must be established by a reliable method
- Cases are evaluated by radiologist(s), who report the data on a discrete scale: 1 = definitely negative, 5 = definitely positive

**Slide 21. Binormal Model**

- Negative: normal, mean = 0, st. dev. = 1:
  f_N(x) = (1/sqrt(2·pi)) · exp(-x^2/2)
- Positive: normal, mean = a, st. dev. = b:
  f_P(x) = (1/(b·sqrt(2·pi))) · exp(-(x - a)^2/(2·b^2))

**Slide 22. Some Binormal Plots** (Plots: binormal ROC curves for b = 1, a = 1, 2, 3; b = 2, a = 1, 2, 3; and b = 0.5, a = 1, 2, 3. Az is the area under the ROC curve.)

**Slide 23. Az Formula**

Az = Phi(a / sqrt(1 + b^2)), where Phi is the standard normal CDF:

| b \ a | 1 | 2 | 3 |
| --- | --- | --- | --- |
| 0.5 | 0.8145 | 0.9632 | 0.9964 |
| 1 | 0.7602 | 0.9214 | 0.9831 |
| 2 | 0.6726 | 0.8145 | 0.9101 |

**Slide 24. Experimental Framework**

- Set of positive and negative cases (a reliable diagnosis is needed)
- A radiologist interprets the cases and reports on a scale: certainly negative, probably negative, unclear, probably positive, certainly positive
- Estimate the ROC and Az
- Compare the results of studies with conventional and image-processed images

**Slide 25. Statistical Estimation**

- The result of an experiment is a sample.
- If N is very large, the estimate is the same as the theory.
- For practical N, the estimate is the true value ± error.

(Plot: standard deviations of the estimates of a, b, and Az for varying numbers of observations. Horizontal axis: number of positive and negative observations. Top curve: s_a; middle curve: s_b. Trials were with a = b = 1.)
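The Az formula on slide 23, Az = Phi(a / sqrt(1 + b^2)), is easy to check numerically; a minimal sketch using only the standard library, reproducing the slide's table:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def az_binormal(a, b):
    """Area under the binormal ROC curve: Az = Phi(a / sqrt(1 + b^2))."""
    return norm_cdf(a / math.sqrt(1.0 + b * b))

# Reproduce the slide 23 table (rows b = 0.5, 1, 2; columns a = 1, 2, 3).
for b in (0.5, 1.0, 2.0):
    print(b, [round(az_binormal(a, b), 4) for a in (1, 2, 3)])
```

Note that a = 0 gives Az = 0.5 (chance) for any b, and Az grows toward 1 as the separation a increases, matching the binormal plots on slide 22.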
**Slide 26. Another Approach: a Nonparametric Model**

- Ordinal dominance graph.
- Donald Bamber, "Area above the ordinal dominance graph and the area below the receiver operating characteristic graph," J. Math. Psych. 12: 387-415 (1975).
- Method: compute the frequencies of occurrence at the different threshold levels from the sample and plot them on a probability scale.

(Plot: Monte Carlo ordinal dominance graph for a = 1, b = 1 with 10 positive and 10 negative cases.)

**Slide 27. Ordinal Dominance Examples** (Plots: ordinal dominance graphs for sample sizes (10, 10), (20, 20), (40, 40), and (100, 100).)

**Slide 28. Theory**

- The area estimate is asymptotically normal, with variance

  s_A^2 = [P(X ≠ Y) + (N_X - 1)·B_XXY + (N_Y - 1)·B_YYX - 4(N_X + N_Y - 1)(A - 1/2)^2] / (4·N_X·N_Y)

  where B_YYX = P(Y1, Y2 < X) + P(X < Y1, Y2) - P(Y1 < X < Y2) - P(Y2 < X < Y1), and B_XXY is defined analogously with the roles of X and Y exchanged.

- Worst case: s_max^2 = A(1 - A) / min(N_X, N_Y)

**Slide 29. Metz**

- University of Chicago ROC project: http://wwwradiology.uchicago.edu/krl/toppage11.htm
- Software for estimating Az, along with sample standard deviations and confidence intervals.
- Versatile.

**Slide 30. Example**

- Compare image processing with conventional imaging.
- Design: should we use the same cases for both? Yes, for a better comparison.
- But then the results from the two studies are correlated. The Metz software can handle this.

**Slide 31. Design Parameters**

1. Unpaired (uncorrelated) test results. The two "conditions" are applied to independent case samples; for example, two different diagnostic tests performed on different patients, two different radiologists who make probability judgments about the presence of a specified disease in different images, etc.
2. Fully paired (correlated) test results, in which data from both conditions are available for each case in a single case sample.
The two "conditions" in each test-result pair could correspond, for example, to two different diagnostic tests performed on the same patient, or to two different radiologists who make probability judgments about the presence of a specified disease in the same image.

3. Partially paired test results; for example, two different diagnostic tests performed on the same patient sample plus some additional patients who received only one of the diagnostic tests.

**Slide 32. Summary: ROC**

- Compare modalities; evaluate the effectiveness of a modality.
- We need to know the truth.
- Issue: two kinds of error (specificity and sensitivity); a single scalar comparison is not suitable.
- Statistical problem: more data gives a better answer.
- ROC methodology: the Metz methods and software allow computation of confidence intervals and significance for tests with practical design parameters.

**Slide 35. Recent Work**

- Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: the case of unequal variance structures across modalities. Academic Radiol. 8: 605-615, 2001.
- Gefen S, Tretiak OJ, Piccoli CW, Donohue KD, Petropulu AP, Shankar PM, Dumane VA, Huang L, Kutay MA, Genis V, Forsberg F, Reid JM, Goldberg BB. ROC analysis of ultrasound tissue characterization classifiers for breast cancer diagnosis. IEEE Trans. Med. Im., in press.
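Returning to the nonparametric model of slide 26: the area under the ordinal dominance graph is simply the fraction of (negative, positive) score pairs that are correctly ordered, with ties counted as one half (the Mann-Whitney statistic of Bamber's paper). A minimal sketch, with simulated binormal scores for the a = 1, b = 1 case (true Az ≈ 0.7602):

```python
import random

def auc_nonparametric(neg, pos):
    """Area under the ordinal dominance graph: fraction of (negative,
    positive) score pairs ranked in the correct order, ties counting 1/2."""
    wins = sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in neg for y in pos)
    return wins / (len(neg) * len(pos))

# Simulated binormal scores: negatives ~ N(0, 1), positives ~ N(1, 1).
random.seed(0)
neg = [random.gauss(0.0, 1.0) for _ in range(2000)]
pos = [random.gauss(1.0, 1.0) for _ in range(2000)]
print(round(auc_nonparametric(neg, pos), 3))   # should land near 0.76
```

With only 10 cases per class, as in the slide's Monte Carlo plots, the estimate scatters widely, which is what the worst-case bound s_max^2 = A(1 - A)/min(N_X, N_Y) on slide 28 quantifies.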