Transcript: T 1 - American Statistical Association
Comparing Diagnostic
Accuracies of Two Tests in
Studies with Verification Bias
Marina Kondratovich, Ph.D.
Division of Biostatistics,
Center for Devices and Radiological Health,
U.S. Food and Drug Administration.
No official support or endorsement by the Food and Drug Administration of
this presentation is intended or should be inferred.
September, 2005
Outline
Introduction: examples, diagnostic accuracy, verification bias
I. Ratio of true positive rates and ratio of false positive rates
II. Multiple imputation
III. Types of missingness in subsets
Summary
Comparison of two qualitative tests, T1 and T2, or combinations of them

All subjects (N):
                 T1 Pos   T1 Neg
      T2 Pos        A        B
      T2 Neg        C        D

Examples:
• Cervical cancer:
  T1 – Pap test (categorical values), T2 – HPV test (qualitative test);
  Reference method – colposcopy/biopsy
• Prostate cancer:
  T1 – DRE (qualitative test), T2 – PSA (quantitative test with cutoff of 4 ng/mL);
  Reference method – biopsy
• Abnormal cells on a Pap slide:
  T1 – manual reading of a Pap slide; T2 – computer-aided reading of a Pap slide;
  Reference method – reading of the slide by an Adjudication Committee
Diagnostic Accuracy of a Medical Test

[Figure: ROC space, Se (vertical axis) vs. 1 - Sp (horizontal axis), with test T1 at the point (x1, y1) and two lines through it at angles θ1 and θ2.]

Pair: Sensitivity = TPR, Specificity = TNR.
Test T1 corresponds to the point (x1, y1), where x1 = FPR = 1 - Sp1 and y1 = TPR = Se1.

Pair of likelihood ratios:
PLR1 = Se1/(1 - Sp1) = y1/x1 = tangent of θ1 (slope of the line from the origin through T1); related to PPV.
NLR1 = (1 - Se1)/Sp1 = (1 - y1)/(1 - x1) = tangent of θ2 (slope of the line from T1 toward the point (1, 1)); related to NPV.
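As a concrete illustration of these pairs of measures (not part of the original slides), the short Python sketch below computes PLR, NLR, and the predictive values implied by an assumed prevalence; the values of se, sp, and prevalence are hypothetical.

```python
# Illustrative sketch (hypothetical numbers): accuracy measures for one qualitative test.

def accuracy_measures(se, sp, prevalence):
    """Return PLR, NLR, PPV, NPV for a test with sensitivity se and specificity sp."""
    fpr = 1.0 - sp                       # x1 = 1 - Sp
    plr = se / fpr                       # PLR = Se / (1 - Sp), slope of line from the origin
    nlr = (1.0 - se) / sp                # NLR = (1 - Se) / Sp, slope of line toward (1, 1)
    # predictive values via Bayes' theorem for the given prevalence
    ppv = se * prevalence / (se * prevalence + fpr * (1 - prevalence))
    npv = sp * (1 - prevalence) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return plr, nlr, ppv, npv

plr, nlr, ppv, npv = accuracy_measures(se=0.80, sp=0.90, prevalence=0.05)
print(f"PLR={plr:.2f}  NLR={nlr:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```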
Boolean Combinations
"OR" and "AND" of T1 and a Random Test

Random Test: positive with probability α, negative with probability 1 - α.

[Figure: ROC space with T1 at (x1, y1); the combination "T1 OR Random Test" moves along the line from T1 toward (1, 1): y - y1 = NLR1 * (x - x1) = (1 - y1)/(1 - x1) * (x - x1).]

Combination OR:
SeOR = Se1 + (1 - Se1)*α = y1 + (1 - y1)*α
SpOR = Sp1*(1 - α) = (1 - x1)*(1 - α)
NLR(T1 OR Random Test) = (1 - y1)/(1 - x1) = NLR1 (unchanged)
Boolean Combinations
"OR" and "AND" of T1 and a Random Test (cont.)

Random Test: positive with probability α, negative with probability 1 - α.

[Figure: ROC space with T1 at (x1, y1); the combination "T1 AND Random Test" moves along the line from the origin through T1: y - y1 = PLR1 * (x - x1) = y1/x1 * (x - x1).]

Combination AND:
SeAND = Se1*α = y1*α
SpAND = Sp1 + (1 - Sp1)*(1 - α) = (1 - x1) + x1*(1 - α)
PLR(T1 AND Random Test) = y1/x1 = PLR1 (unchanged)
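A small simulation (not from the slides; se1, sp1, and α are hypothetical values) can be used to check the OR/AND combination formulas above against empirical rates.

```python
# Sketch: verify the "OR"/"AND with a random test" formulas by simulation.
import random

def simulate(se1, sp1, alpha, n=200_000, seed=0):
    rng = random.Random(seed)
    tp_or = fp_or = tp_and = fp_and = 0
    for _ in range(n):                    # diseased subjects
        t1 = rng.random() < se1           # T1 positive with probability Se1
        r = rng.random() < alpha          # random test positive with probability alpha
        tp_or += (t1 or r)
        tp_and += (t1 and r)
    for _ in range(n):                    # non-diseased subjects
        t1 = rng.random() < (1 - sp1)     # false positive with probability 1 - Sp1
        r = rng.random() < alpha
        fp_or += (t1 or r)
        fp_and += (t1 and r)
    return tp_or / n, fp_or / n, tp_and / n, fp_and / n

se1, sp1, alpha = 0.70, 0.85, 0.30
se_or, fpr_or, se_and, fpr_and = simulate(se1, sp1, alpha)
print("OR :", se_or,  "~", se1 + (1 - se1) * alpha, "|", fpr_or,  "~", 1 - sp1 * (1 - alpha))
print("AND:", se_and, "~", se1 * alpha,             "|", fpr_and, "~", (1 - sp1) * alpha)
```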
Comparing Medical Tests

[Figure: ROC space with T1 at (x1, y1). The two lines through T1 (the PLR line from the origin and the NLR line toward (1, 1)) divide the plane into four regions for a candidate Test2:
  above both lines: PPV > PPV1 and NPV > NPV1;
  above the NLR line but below the PLR line: PPV < PPV1, NPV > NPV1;
  above the PLR line but below the NLR line: PPV > PPV1, NPV < NPV1;
  below both lines: PPV < PPV1 and NPV < NPV1.]

More detail in: Biggerstaff, B.J. Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 2000; 19:649-663.
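In the spirit of the Biggerstaff (2000) graphic, the sketch below (not from the slides; the operating points are hypothetical) classifies a second test relative to T1 by comparing likelihood ratios.

```python
# Sketch: does Test2 improve PPV and/or NPV relative to Test1?
# (se1, sp1) and (se2, sp2) are hypothetical operating points.

def compare(se1, sp1, se2, sp2):
    plr1, nlr1 = se1 / (1 - sp1), (1 - se1) / sp1
    plr2, nlr2 = se2 / (1 - sp2), (1 - se2) / sp2
    ppv_up = plr2 > plr1            # higher PLR -> higher PPV at any prevalence
    npv_up = nlr2 < nlr1            # lower NLR  -> higher NPV at any prevalence
    return ppv_up, npv_up

print(compare(se1=0.70, sp1=0.90, se2=0.85, sp2=0.80))   # -> (False, True)
```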
Formal Model: prospective study, comparison of two qualitative tests, T1 and T2, or combinations of them

All subjects (N):
                 T1 Pos   T1 Neg
      T2 Pos        A        B
      T2 Neg        C        D

Disease D+ (N1 subjects):
                 T1 Pos   T1 Neg
      T2 Pos       a1       b1
      T2 Neg       c1       d1

Non-Disease D- (N0 subjects):
                 T1 Pos   T1 Neg
      T2 Pos       a0       b0
      T2 Neg       c0       d0

a1 + a0 = A;  b1 + b0 = B;  c1 + c0 = C;  d1 + d0 = D;  N1 + N0 = N
Example: condition of interest – cervical disease,
T1 – Pap test, T2 – biomarker, Reference – colposcopy/biopsy

All subjects (N = 7,000):
                 Pap Pos   Pap Neg
      T2 Pos        43       285
      T2 Neg        71     6,601

Disease D+ (verified):
                 Pap Pos   Pap Neg   Total
      T2 Pos        13        15       28
      T2 Neg         1
      Total         14

Non-Disease D- (verified):
                 Pap Pos   Pap Neg   Total
      T2 Pos        30       270      300
      T2 Neg        70
      Total        100
Verification Bias
In studies for the evaluation of diagnostic devices,
sometimes the reference (“gold”) standard is not applied to all
study subjects.
If the process by which subjects were selected for verification depends on the results of the medical tests, then estimates of the accuracies of these tests computed without proper correction are biased.

This bias is often referred to as verification bias (or variants of it: work-up bias, referral bias, and validation bias).
I. Ratio of True Positive Rates and Ratio of False Positive Rates

Not all subjects (or none) with both negative results were verified by the Reference method. Estimates of sensitivities and specificities based only on verified results are biased. The ratio of sensitivities and the ratio of false positive rates are unbiased [2].

All subjects (N):
                 T1 Pos   T1 Neg
      T2 Pos        A        B
      T2 Neg        C        D

Disease D+ ([N1]):
                 T1 Pos   T1 Neg
      T2 Pos       a1       b1
      T2 Neg       c1      [d1]

Non-Disease D- ([N0]):
                 T1 Pos   T1 Neg
      T2 Pos       a0       b0
      T2 Neg       c0      [d0]

(Bracketed quantities are not observed when the both-negative subjects are not verified.)

Ratio estimates from the verified cells:
Se(T2) / Se(T1) = (a1 + b1) / (a1 + c1)
[1 - Sp(T2)] / [1 - Sp(T1)] = (a0 + b0) / (a0 + c0)

[2] Schatzkin, A., Connor, R.J., Taylor, P.R., and Bunnag, B. "Comparing new and old screening tests when a reference procedure cannot be performed on all screeners." American Journal of Epidemiology 1987; 125(4):672-678.
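As a worked check (not part of the original slides), the sketch below computes these ratio estimates from the verified cells of the cervical-disease example and reproduces the values 2.0 and 3.0 shown later in the deck.

```python
# Sketch: ratio-of-rates estimates (Schatzkin et al., 1987) from verified cells only.

def ratio_estimates(a1, b1, c1, a0, b0, c0):
    """Ratios that remain unbiased when the both-negative cell is unverified."""
    se_ratio  = (a1 + b1) / (a1 + c1)    # Se(T2) / Se(T1)
    fpr_ratio = (a0 + b0) / (a0 + c0)    # [1 - Sp(T2)] / [1 - Sp(T1)]
    return se_ratio, fpr_ratio

# Verified diseased cells: a1=13, b1=15, c1=1; verified non-diseased: a0=30, b0=270, c0=70
print(ratio_estimates(13, 15, 1, 30, 270, 70))    # -> (2.0, 3.0)
```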
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
Statement of the problem:
Se2/Se1 = y2/y1 = Ry
(1 - Sp2)/(1 - Sp1) = x2/x1 = Rx

Can we draw conclusions about the effectiveness of Test2 if we know only the ratio of True Positive rates and the ratio of False Positive rates between Test1 and Test2?

For the sake of simplicity, assume that Test2 has higher theoretical sensitivity, Se2/Se1 = Ry > 1 (true parameters, not estimates).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
A) Se2/Se1 = Ry > 1 (increase in sensitivity);
   (1 - Sp2)/(1 - Sp1) = Rx < 1 (decrease in false positive rate).

[Figure: ROC space with T1 at (x1, y1); Test2 lies above and to the left of T1.]

For any Test1, Test2 is effective (superior to Test1).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
B) Se2/Se1 = Ry > 1 (increase in sensitivity);
   (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rate);
   Ry >= Rx > 1.

[Figure: ROC space with T1 at (x1, y1); Test2 lies above the PLR line through T1.]

It is easy to show that PLR2 = Se2/(1 - Sp2) = (Ry/Rx)*PLR1, and then PLR2 >= PLR1.

For any Test1, Test2 is effective (superior to Test1, because the PPV and NPV of Test2 are higher than those of Test1).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
Example: condition of interest – cervical disease,
T1 – Pap test, T2 – biomarker, Reference – colposcopy/biopsy

All subjects (N = 7,000):
                 Pap Pos   Pap Neg
      T2 Pos        43       285
      T2 Neg        71     6,601

Disease D+ (verified):
                 Pap Pos   Pap Neg   Total
      T2 Pos        13        15       28
      T2 Neg         1
      Total         14

Non-Disease D- (verified):
                 Pap Pos   Pap Neg   Total
      T2 Pos        30       270      300
      T2 Neg        70
      Total        100

Estimated Se(T2) / Se(T1) = 28/14 = 2.0
Estimated [1 - Sp(T2)] / [1 - Sp(T1)] = 300/100 = 3.0
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
C) Se2/Se1 = Ry > 1 (increase in sensitivity);
   (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rate);
   Ry < Rx: the increase in the false positive rate is larger than the increase in the true positive rate.

[Figure: ROC space with T1 at (x1, y1) and the line of the combination "T1 OR Random Test".]

Can we draw conclusions about the effectiveness of Test2?
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
[Figure: ROC space with T1 at (x1, y1) and the line of the combination "T1 OR Random Test".]

Theorem: Test2 is above the line of the combination "T1 OR Random Test" if
(Rx - 1)/(Ry - 1) < PLR1/NLR1.

Example: Ry = 2 and Rx = 3, so (Rx - 1)/(Ry - 1) = (3 - 1)/(2 - 1) = 2.
The conclusion depends on the accuracy of T1:
if PLR1/NLR1 > 2, then T2 is superior for confirming absence of disease (NPV↑, PPV↓);
if PLR1/NLR1 < 2, then T2 is inferior overall (NPV↓, PPV↓).
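The criterion in the theorem is easy to evaluate for a given Test1. The sketch below (not from the slides; the Test1 accuracies are hypothetical) applies it with the slide's values Ry = 2, Rx = 3.

```python
# Sketch: apply the criterion for case C (Ry > 1, Rx > Ry).
# Test2 lies above the "T1 OR Random Test" line iff (Rx - 1)/(Ry - 1) < PLR1/NLR1.

def test2_above_or_line(se1, sp1, ry, rx):
    plr1 = se1 / (1 - sp1)
    nlr1 = (1 - se1) / sp1
    return (rx - 1) / (ry - 1) < plr1 / nlr1

# Slide example: Ry = 2, Rx = 3, so the threshold is (3 - 1)/(2 - 1) = 2
print(test2_above_or_line(se1=0.50, sp1=0.90, ry=2, rx=3))   # PLR1/NLR1 = 9.0  -> True
print(test2_above_or_line(se1=0.30, sp1=0.60, ry=2, rx=3))   # PLR1/NLR1 ~ 0.64 -> False
```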
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
For situation C:
C) Se2/Se1 = Ry > 1 (increase in sensitivity);
   (1 - Sp2)/(1 - Sp1) = Rx > 1 (increase in false positive rate);
   Ry < Rx (the increase in FPR is larger than the increase in TPR).

In order to draw conclusions about the effectiveness of Test2, we should have information about the diagnostic accuracy of Test1.
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
[Figure: ROC space for T1, with Se1 restricted to (0, 1/Ry] and 1 - Sp1 restricted to (0, 1/Rx]. The boundary between the green and red areas is the hyperbola
      y1 = (Rx - 1)*x1 / [(Ry - 1) + (Rx - Ry)*x1],
obtained by setting (Rx - 1)/(Ry - 1) = PLR1/NLR1.]

Se2/Se1 = Ry > 1 implies Se1 <= 1/Ry;
(1 - Sp2)/(1 - Sp1) = Rx > 1 implies (1 - Sp1) <= 1/Rx.

If T1 is in the green area (above the hyperbola), then T2 is superior for confirming absence of Disease (higher NPV and lower PPV).
If T1 is in the red area (below the hyperbola), then T2 is inferior overall (lower NPV and lower PPV).
I. Ratio of TP Rates and Ratio of FP Rates (cont.)
Summary:
If, in a clinical study comparing the accuracies of two tests, Test2 and Test1, the increase in the TP rate of Test2 is anticipated to be statistically larger than the increase in the FP rate, then conclusions about the effectiveness of Test2 can be made without information about the diagnostic accuracy of Test1.

In most practical situations, when the increase in the FP rate of Test2 is anticipated to be larger than the increase in the TP rate (or the sample size is not large enough to demonstrate that the increase in TP rate is statistically larger than the increase in FP rate), information about the diagnostic accuracy of Test1 is needed in order to draw conclusions about the effectiveness of Test2.
II. Verification Bias: Subjects Negative on Both Tests

If a random sample of the subjects with both negative test results is verified by the reference standard, then unbiased estimates of the sensitivities and specificities of Test1 and Test2 can be constructed.

All subjects (N):
                 T1 Pos   T1 Neg
      T2 Pos        A        B
      T2 Neg        C        D

Disease D+ ([N1]):
                 T1 Pos   T1 Neg
      T2 Pos       a1       b1
      T2 Neg       c1      [d1]

Non-Disease D- ([N0]):
                 T1 Pos   T1 Neg
      T2 Pos       a0       b0
      T2 Neg       c0      [d0]
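One simple way to use such a random verification sample, sketched below (not part of the original slides, and all counts are hypothetical): estimate the disease probability in the both-negative cell from the verified subsample, scale it up to the full cell, and then compute Se and Sp for each test.

```python
# Sketch: recover Se and Sp when only a random fraction of the both-negative (cell D)
# subjects is verified. The verified-cell counts reuse the cervical example; the
# subsample of cell D (200 subjects, 1 diseased) is hypothetical.

def corrected_se_sp(a1, b1, c1, a0, b0, c0, D, d1_sampled, n_sampled):
    p_dpos = d1_sampled / n_sampled        # P(D+ | T1-, T2-) from the random subsample
    d1_hat = D * p_dpos                    # estimated diseased among all cell-D subjects
    d0_hat = D * (1 - p_dpos)
    n1_hat = a1 + b1 + c1 + d1_hat         # estimated total diseased
    n0_hat = a0 + b0 + c0 + d0_hat         # estimated total non-diseased
    se_t1, se_t2 = (a1 + c1) / n1_hat, (a1 + b1) / n1_hat
    sp_t1, sp_t2 = 1 - (a0 + c0) / n0_hat, 1 - (a0 + b0) / n0_hat
    return (se_t1, sp_t1), (se_t2, sp_t2)

print(corrected_se_sp(13, 15, 1, 30, 270, 70, D=6_601, d1_sampled=1, n_sampled=200))
```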
II. Verification Bias: Bias Correction
Verification Bias Correction Procedures:
1. Begg, C.B., Greenes, R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39, 207-215.
2. Hawkins, D.M., Garrett, J.A., Stephenson, B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20, 1987-2001.
Multiple Imputation
• The absence of verified disease status for some subjects can be considered a missing-data problem.
• Multiple imputation is a Monte Carlo technique in which the missing disease statuses are replaced by plausible values simulated from the observed data; each imputed dataset is analyzed separately and the diagnostic accuracies of the tests are evaluated. The results are then combined to produce estimates and confidence intervals that incorporate the uncertainty related to the missing verified disease status of these subjects.
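A minimal multiple-imputation sketch is shown below. It is not the presentation's procedure; it assumes, for illustration, that a random subsample of the both-negative subjects was verified, so the disease probability in that cell can be drawn from a Beta posterior and used to impute the missing statuses. All counts and the number of imputations M are hypothetical.

```python
# Minimal multiple-imputation sketch (illustrative only; hypothetical counts).
import random
import statistics

random.seed(1)
a1, b1, c1 = 13, 15, 1             # verified diseased cells (T2+/T1+, T2+/T1-, T2-/T1+)
D_unverified = 6_401               # both-negative subjects never verified (6,601 minus 200)
d1_obs, n_obs = 1, 200             # hypothetical verified random subsample of the D cell

M = 20
se_t2_draws = []
for _ in range(M):
    # draw the cell-D disease probability from a Beta(d1_obs + 1, n_obs - d1_obs + 1) posterior
    p = random.betavariate(d1_obs + 1, n_obs - d1_obs + 1)
    # impute how many unverified both-negative subjects are diseased
    d1_imp = sum(random.random() < p for _ in range(D_unverified))
    n1 = a1 + b1 + c1 + d1_obs + d1_imp          # diseased total in this completed dataset
    se_t2_draws.append((a1 + b1) / n1)           # Se(T2) for this imputation

# combine across imputations (pooled point estimate is the mean of the per-imputation estimates)
print("Se(T2) pooled estimate:", round(statistics.mean(se_t2_draws), 3),
      "| between-imputation SD:", round(statistics.stdev(se_t2_draws), 3))
```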
II. Verification Bias: Subjects Negative on Both Tests (cont.)

Usually, according to the study protocol, all subjects in subsets A, B, and C should have a verified disease status, so the verification bias is related to the subjects for whom both test results are negative:

                 T1 Pos   T1 Neg
      T2 Pos        A        B
      T2 Neg        C        D

In practice, however, not all subjects in subsets A, B, and C may comply with disease verification:

                 T1 Pos              T1 Neg
      T2 Pos     A (70% verified)    B (50% verified)
      T2 Neg     C (30% verified)    D

Verification Bias!
III. Different Types of Missingness
In order to correctly adjust for verification bias, the type of
missingness should be investigated.
Missing data mechanisms:
Missing Completely At Random (MCAR) – missingness is unrelated to the values of any variables (whether the disease status or other observed variables);

Missing At Random (MAR) – missingness is unrelated to the disease status but may be related to the observed values of other variables.

For details, see Little, R.J.A. and Rubin, D. (1987) Statistical Analysis with Missing Data. New York: John Wiley.
III. Different Types of Missingness
Example: prospective study for prostate cancer. 5,000 men were screened with a digital rectal exam (DRE) and a prostate-specific antigen (PSA) assay.
Results of DRE are Positive or Negative.
PSA, a quantitative test, is dichotomized by a threshold of 4 ng/mL: Positive (PSA > 4), Negative (PSA ≤ 4).
D+ = prostate cancer; D- = no prostate cancer (reference standard = biopsy).

All 5,000 subjects:
                 DRE+                         DRE-
      PSA+       150 (105 biopsies, 70%)      750 (375 biopsies, 50%)
      PSA-       250 (75 biopsies, 30%)       3,850 (no biopsies)
All Subjects:
                 DRE+                         DRE-
      PSA+       150 (105 biopsies, 70%)      750 (375 biopsies, 50%)
      PSA-       250 (75 biopsies, 30%)       3,850 (no biopsies)

Subjects with Verified Disease Status

D+ (Positive Biopsy):
                 DRE+    DRE-
      PSA+        60     110
      PSA-        25     n/a

D- (Negative Biopsy):
                 DRE+    DRE-
      PSA+        45     265
      PSA-        50     n/a
III. Different Types of Missingness (cont.)
• Do the subjects without biopsies differ from the subjects with biopsies?

Propensity score = conditional probability that the subject underwent verification of disease status (biopsy in this example) given a collection of observed covariates (the quantitative value of the PSA test, age, race, and so on).

The relationship between membership in the group of verified subjects and the covariates can be modeled by logistic regression (see the sketch below):
   outcome – underwent verification (biopsy): yes/no;
   predictors – quantitative PSA value and other covariates.
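The sketch below is one way such a propensity model could be fit; it is not from the presentation. The data frame, column names (psa, age, race, biopsied), and the verification mechanism used to generate them are all hypothetical.

```python
# Sketch: propensity-score model for verification (hypothetical data and column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "psa": rng.lognormal(mean=1.0, sigma=0.6, size=n),   # quantitative PSA value
    "age": rng.integers(50, 80, size=n),
    "race": rng.choice(["A", "B", "C"], size=n),
})
# hypothetical verification mechanism: higher PSA -> higher chance of biopsy
logit_p = -2.0 + 0.5 * df["psa"]
df["biopsied"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.logit("biopsied ~ psa + age + C(race)", data=df).fit(disp=False)
print(model.summary())                  # a significant PSA coefficient suggests MAR, not MCAR
df["propensity"] = model.predict(df)    # estimated P(verified | covariates)
```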
III. Different Types of Missingness (cont.)
[Biopsy-rate table as above: A (PSA+, DRE+) 70% biopsied; B (PSA+, DRE-) 50%; C (PSA-, DRE+) 30%; D (PSA-, DRE-) no biopsies.]

For subgroup A (PSA+, DRE+), the probability that a subject has a missing biopsy does not appear to depend on either the PSA value or the observed covariates (age, race).
Type of missingness: Missing Completely At Random.
Similarly for subgroup B (PSA+, DRE-).
III. Different Types of Missingness (cont.)
[Biopsy-rate table as above: A 70% biopsied; B 50%; C 30%; D no biopsies.]

For subgroup C (PSA-, DRE+), the probability that a subject has a missing biopsy does depend on the quantitative value of PSA. The PSA value is a significant predictor of biopsy missingness in this subgroup (the larger the PSA value, the lower the probability of a missing biopsy).
Type of missingness: Missing At Random.
III. Different Types of Missingness (cont.)

Counts adjusted for verification bias (subgroup D remains unverified; in subgroup C, the first number is the adjustment that accounts for the type of missingness and the bracketed number is the biased naive adjustment):

D+:
                 DRE+                DRE-
      PSA+        86                 220
      PSA-        50 [83, biased]    n/a

D-:
                 DRE+                DRE-
      PSA+        64                 530
      PSA-        200 [167, biased]  n/a

Adjustment for verification without proper investigation of the type of missingness (biased estimates):
   Se(PSA)/Se(DRE) = 306/169 = 1.81
   [1 - Sp(PSA)]/[1 - Sp(DRE)] = 594/231 = 2.57

Adjustment for verification taking into account the different types of missingness (unbiased estimates):
   Se(PSA)/Se(DRE) = 306/136 = 2.25
   [1 - Sp(PSA)]/[1 - Sp(DRE)] = 594/264 = 2.25
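The arithmetic behind these two adjustments can be reproduced directly from the verified counts shown earlier, as in the sketch below (not from the presentation). The naive scale-up by the inverse sampling fraction is valid only under MCAR; for subgroup C the MAR-adjusted counts (50 diseased, 200 non-diseased) are taken from the slide, since recomputing them would require the subject-level PSA values that are not shown.

```python
# Sketch reproducing the slide's arithmetic: MCAR (naive) vs MAR-based adjustment.

verified = {                       # (D+, D-) among biopsied subjects
    "A (PSA+, DRE+)": (60, 45),    # 105 of 150 biopsied (70%)
    "B (PSA+, DRE-)": (110, 265),  # 375 of 750 biopsied (50%)
    "C (PSA-, DRE+)": (25, 50),    # 75 of 250 biopsied (30%)
}
totals = {"A (PSA+, DRE+)": 150, "B (PSA+, DRE-)": 750, "C (PSA-, DRE+)": 250}

def scale_mcar(cell):
    """Naive scale-up by the inverse sampling fraction (valid only under MCAR)."""
    dpos, dneg = verified[cell]
    f = (dpos + dneg) / totals[cell]
    return dpos / f, dneg / f

a1, a0 = scale_mcar("A (PSA+, DRE+)")             # ~ (86, 64)
b1, b0 = scale_mcar("B (PSA+, DRE-)")             # (220, 530)
c1_mcar, c0_mcar = scale_mcar("C (PSA-, DRE+)")   # ~ (83, 167): biased
c1_mar, c0_mar = 50, 200                          # slide's MAR-based adjustment for C

for label, (c1, c0) in [("MCAR (biased)", (c1_mcar, c0_mcar)),
                        ("MAR (proper)", (c1_mar, c0_mar))]:
    se_ratio = (a1 + b1) / (a1 + c1)              # Se(PSA) / Se(DRE)
    fpr_ratio = (a0 + b0) / (a0 + c0)             # FPR(PSA) / FPR(DRE)
    print(f"{label}: Se ratio = {se_ratio:.2f}, FPR ratio = {fpr_ratio:.2f}")
# -> MCAR: 1.81 and 2.57;  MAR: 2.25 and 2.25 (matching the slide)
```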
III. Different Types of Missingness (cont.)

Correct adjustment for verification bias produces estimates showing that the increase in the FP rate for the new test (PSA) is about the same as the increase in the TP rate, while incorrect adjustment for verification bias showed that the increase in the FP rate was larger than the increase in the TP rate.

So, naïve estimation of the risk for subgroup C, based on the assumption that the missing biopsy results were Missing Completely At Random, produces a biased estimate of the performance of the new PSA test (underestimation of its performance).

For proper adjustment, information on the distribution of test results in the subjects who are not selected for verification should be available.
Summary
In most practical situations, estimation of only the ratios of the True Positive and False Positive rates does not allow one to draw conclusions about the effectiveness of a test.

The absence of verified disease status can be treated as a missing-data problem. The multiple imputation technique can be used to correct for verification bias. Information on the distribution of test results in the subjects who are not selected for verification should be available.

The type of missingness should be investigated in order to obtain unbiased estimates of the performance of the medical tests. All subsets of subjects should be checked for missing disease status.

Precision of the estimated diagnostic accuracies depends primarily on the number of verified cases available for statistical analysis.
References
1. Begg, C.B. and Greenes, R.A. (1983) Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 39, 207-215.
2. Biggerstaff, B.J. (2000) Comparing diagnostic tests: a simple graphic using likelihood ratios. Statistics in Medicine 19, 649-663.
3. Hawkins, D.M., Garrett, J.A., and Stephenson, B. (2001) Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20, 1987-2001.
4. Kondratovich, M.V. (2003) Verification bias in the evaluation of diagnostic tests. Proceedings of the 2003 Joint Statistical Meeting, Biopharmaceutical Section, San Francisco, CA.
5. Ransohoff, D.F. and Feinstein, A.R. (1978) Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine 299, 926-930.
6. Schatzkin, A., Connor, R.J., Taylor, P.R., and Bunnag, B. (1987) Comparing new and old screening tests when a reference procedure cannot be performed on all screeners. American Journal of Epidemiology 125(4), 672-678.
7. Zhou, X. (1994) Effect of verification bias on positive and negative predictive values. Statistics in Medicine 13, 1737-1745.
8. Zhou, X. (1998) Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research 7, 337-353.
9. http://www.fda.gov/cdrh/pdf/p930027s004b.pdf