
Advanced Statistical Analysis in Epidemiology:
Inter-rater Reliability, Diagnostic Cutpoints, Test Comparison,
Discrepant Analysis, Polychotomous Logistic Regression,
and Generalized Estimating Equations

Jeffrey J. Kopicko, MSPH
Tulane University School of Public Health and Tropical Medicine
Diagnostic Statistics

Diagnostic statistics typically assess a 2 x 2 contingency table taking the form:

             True +    True -    Total
  Test +       a         b        a+b
  Test -       c         d        c+d
  Total       a+c       b+d     a+b+c+d
Inter-rater Reliability

Suppose that two different tests exist for the diagnosis of a specific disease. We are interested in determining whether the new test is as reliable in diagnosing the disease as the old test (the “gold standard”).
Inter-rater Reliability continued

In 1960, Cohen proposed a statistic that would provide a measure of reliability between the ratings of two different radiologists in the interpretation of x-rays. He called it the Kappa coefficient.
Inter-rater Reliability continued

Cohen’s Kappa can be used to assess the reliability between two raters or diagnostic tests. Based on the previous contingency table, it has the following form and interpretation:
Inter-rater Reliability continued

Cohen’s Kappa:

$$\kappa = \frac{2(ad - bc)}{(a+c)(c+d) + (b+d)(a+b)}$$

where

  K > 0.75           excellent reproducibility
  0.4 <= K <= 0.75   good reproducibility
  0 <= K < 0.4       marginal reproducibility

*Rosner, 1986
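
As a quick check of the formula, here is a minimal Python sketch (the function name is illustrative; the example counts are taken from the Pap/DNA screening table later in this deck):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa for a 2x2 table with cells a, b, c, d as above."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (b + d) * (a + b))

# Example with the Pap/DNA counts from the screening example below:
print(cohens_kappa(50, 35, 5, 410))   # ~0.67: good reproducibility
```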
Inter-rater Reliability continued

Cohen’s Kappa is appropriately used when the prevalence of the disease is not extreme and the marginal totals of the contingency table are distributed evenly. When these conditions do not hold, Cohen’s Kappa will be erroneously low.
Inter-rater Reliability continued

Byrt, et al. proposed a solution to these possible biases in 1994. They called their solution the “Prevalence-Adjusted Bias-Adjusted Kappa,” or PABAK. It has the same interpretation as Cohen’s Kappa and the following form:
Inter-rater Reliability continued

1. Take the mean of b and c:

   $$m = \frac{b + c}{2}$$

2. Take the mean of a and d:

   $$n = \frac{a + d}{2}$$

3. Compute PABAK by applying the original Cohen’s Kappa formula to the adjusted table below.

            Yes   No
     Yes     n     m
     No      m     n
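
A minimal sketch of the three steps, using the same cell labels as before. Note that applying Cohen’s formula to the adjusted table simplifies algebraically to twice the observed agreement minus one:

```python
def pabak(a, b, c, d):
    """Prevalence-Adjusted Bias-Adjusted Kappa (Byrt et al., 1994)."""
    m = (b + c) / 2   # step 1: mean of the discordant cells
    n = (a + d) / 2   # step 2: mean of the concordant cells
    # Step 3: Cohen's Kappa on the adjusted table (n, m / m, n),
    # which reduces to (n - m) / (n + m) = 2 * observed agreement - 1.
    return (n - m) / (n + m)
```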
Inter-rater Reliability continued

• PABAK is preferable in all instances, regardless of the prevalence or the potential bias between raters.
• More meaningful statistics regarding the diagnostic value of a test can be computed, however.
Diagnostic Measures

• Prevalence
• Sensitivity
• Specificity
• Predictive Value Positive
• Predictive Value Negative
Diagnostic Measures continued

Prevalence

Definition: Prevalence quantifies the proportion of individuals in a population who have the disease at a specific instant and provides an estimate of the probability (risk) that an individual will be ill at a point in time.

Formula:

$$\text{prevalence} = \frac{a + c}{a + b + c + d}$$
Diagnostic Measures continued

Sensitivity

Definition: Sensitivity is defined as the probability of testing positive if the disease is truly present.

Formula:

$$\text{sensitivity} = \frac{a}{a + c}$$
Diagnostic Measures continued

Specificity

Definition: Specificity is defined as the probability of testing negative if the disease is truly absent.

Formula:

$$\text{specificity} = \frac{d}{b + d}$$
Diagnostic Measures continued

Predictive Value Positive

Definition: Predictive Value Positive (PV+) is defined as the probability that a person actually has the disease given that he or she tests positive.

Formula:

$$PV^{+} = \frac{a}{a + b}$$
Diagnostic Measures continued

Predictive Value Negative

Definition: Predictive Value Negative (PV-) is defined as the probability that a person is actually disease-free given that he or she tests negative.

Formula:

$$PV^{-} = \frac{d}{c + d}$$
Example: Cervical Cancer Screening

The standard of care for cervical cancer/dysplasia detection is the Pap smear. We want to assess a new serum DNA detection test for the human papillomavirus (HPV).

            Pap +   Pap -   Total
  DNA +       50      35      85
  DNA -        5     410     415
  Total       55     445     500

Prevalence  = 55/500  = 0.110
Sensitivity = 50/55   = 0.909
Specificity = 410/445 = 0.921
PV+         = 50/85   = 0.588
PV-         = 410/415 = 0.988
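
The five measures for this table can be reproduced in a few lines of Python (variable names are illustrative):

```python
# Cells of the table above: Pap smear as truth, DNA test as the new test.
a, b, c, d = 50, 35, 5, 410
n = a + b + c + d                # 500

prevalence  = (a + c) / n        # 55/500  = 0.110
sensitivity = a / (a + c)        # 50/55   = 0.909
specificity = d / (b + d)        # 410/445 = 0.921
pv_pos      = a / (a + b)        # 50/85   = 0.588
pv_neg      = d / (c + d)        # 410/415 = 0.988
```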
Receiver Operating Characteristic (ROC) Curves

Sensitivities and specificities are used to:
1. Determine the diagnostic value of a test.
2. Determine the appropriate cutpoint for continuous data.
3. Compare the diagnostic values of two or more tests.
ROC Curves continued

1. For every gap between adjacent values of the continuous data, the midpoint is taken as a candidate cutoff. These are the points at which the contingency table distribution changes.
2. At each new cutpoint, the sensitivity and specificity are calculated.
3. The sensitivity is graphed versus 1 - specificity.
ROC Curves continued

4. Since the sensitivity and specificity are proportions, the total area of the graph is 1.0 units.
5. The area under the curve is the statistic of interest.
6. The area under a curve produced by chance alone is 0.50 units.
ROC Curves continued

7. If the area under the diagnostic test curve is significantly above 0.50, then the test is a good predictor of disease.
8. If the area under the diagnostic test curve is significantly below 0.50, then the test is an inverse predictor of disease.
9. If the area under the diagnostic test curve is not significantly different from 0.50, then the test is a poor predictor of disease.
ROC Curves continued

10. An individual curve can be compared to 0.50 using the N(0, 1) distribution.
11. Two or more diagnostic tests can also be compared using the N(0, 1) distribution.
12. A diagnostic cutpoint can be determined for tests with continuous outcomes in order to maximize the sensitivity and specificity of the test.
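
Steps 1 through 5 can be sketched in a few lines of Python. This is one plausible reading of the procedure (names are illustrative, and higher scores are assumed to indicate disease):

```python
import numpy as np

def empirical_roc(scores, truth):
    """Empirical ROC points and area for a continuous test (steps 1-5).

    scores: continuous test values; truth: 1 = diseased, 0 = disease-free.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth).astype(bool)
    vals = np.unique(scores)
    # Step 1: the midpoint of every gap between adjacent observed values,
    # plus cutoffs below the minimum and above the maximum so the curve
    # runs from (1, 1) down to (0, 0).
    cuts = np.concatenate(([vals[0] - 1], (vals[:-1] + vals[1:]) / 2,
                           [vals[-1] + 1]))
    fpr, tpr = [], []
    for cut in cuts:                                        # step 2
        pos = scores > cut
        tpr.append(np.sum(pos & truth) / np.sum(truth))     # sensitivity
        fpr.append(np.sum(pos & ~truth) / np.sum(~truth))   # 1 - specificity
    # Steps 3-5: sensitivity versus 1 - specificity; area by trapezoids.
    order = np.argsort(fpr)
    x = np.asarray(fpr)[order]
    y = np.asarray(tpr)[order]
    auc = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))
    return x, y, auc
```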
[Figure: Nonparametric Receiver Operating Characteristic (ROC) Plot of Serum RCP Level. True Positive Fraction (Sensitivity) versus False Positive Fraction (1 - Specificity), with cutpoints 0.9, 1.0, and 1.1 labeled on the curve.]
ROC Curves continued

Determining Diagnostic Cutpoints

The optimum cutpoint is the one achieving

$$\sup_{\text{cutpoints}} \left( \text{sensitivity} \times \text{specificity} \right)$$
ROC Curves continued

Determining Diagnostic Cutpoints

  Cut Point   Sensitivity   Specificity   Sens*Spec
     0.4       1.000000        0.00        0.000000
     0.5       1.000000        0.04        0.040000
     0.6       1.000000        0.20        0.200000
     0.7       1.000000        0.40        0.400000
     0.8       0.980769        0.68        0.666923
     0.9       0.923077        0.82        0.756923
     1.0       0.923077        0.88        0.812308
     1.1       0.865385        0.96        0.830770
     1.2       0.846154        0.96        0.812308
     1.3       0.807692        0.98        0.791538
     1.4       0.788462        1.00        0.788462
     1.5       0.788462        1.00        0.788462
     1.6       0.730769        1.00        0.730769
     1.7       0.730769        1.00        0.730769
     1.8       0.730769        1.00        0.730769
[Figure (repeated): Nonparametric ROC Plot of Serum RCP Level, as above, highlighting the cutpoint at 1.1.]
ROC Curves continued

Diagnostic Value of a Test

$$z = \frac{a_1 - a_0}{se_{a_1 - a_0}} \sim N(0, 1)$$

where a1 = area under the diagnostic test curve, a0 = 0.50, se_a1 is the standard error of the area, and se_a0 = 0.00, so that se_(a1 - a0) reduces to se_a1.
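
A small sketch of this test. Pooling the two standard errors as sqrt(se1^2 + se2^2) assumes independent curves and is my reading of the formula; with se_a0 = 0 it reduces to the expression above:

```python
from math import sqrt
from statistics import NormalDist

def auc_z_test(a1, se_a1, a0=0.50, se_a0=0.0):
    """z-test of an ROC area against a reference area.

    Defaults compare against chance (a0 = 0.50, se_a0 = 0); passing a
    second curve's area and standard error instead gives the two-curve
    comparison described later in this deck.
    """
    z = (a1 - a0) / sqrt(se_a1 ** 2 + se_a0 ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return z, p
```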
ROC Curves continued

Diagnostic Value of a Test

For the RCP example, the area under the curve is 0.987, with a p-value of <0.001. The optimal cutpoint for this test is 1.1 ng/ml.
ROC Curves continued

Comparing the areas under 2 or more curves

In order to compare the areas under two or more ROC curves, use the same formula, substituting the values for the second curve for those previously defined for chance alone.
[Figure: Nonparametric Receiver Operating Characteristic Curves for Different Tests of CMV Retinitis. Sensitivity versus 1-Specificity for the Antigenemia, Digene, and Amplicor tests, with the chance diagonal for reference.]
ROC Curves continued

Comparing the areas under 2 or more curves

For the CMV retinitis example, the Digene test had the largest area (although not significantly greater than antigenemia). The cutpoint was determined to be 1,400 cells/cc. The sensitivity was 0.85 and the specificity was 0.84. Bonferroni adjustments must be made for >2 comparisons.
ROC Curves continued

Another Application?

Remember when Cohen’s Kappa was unstable at extreme prevalence and/or when there was bias among the raters? What about using ROC curves to assess inter-rater reliability?
ROC Curves continued

Another limitation of K is that it provides only a measure of agreement, regardless of whether the raters correctly classify the state of the items. K can be high, indicating excellent reliability, even though both raters incorrectly assess the items.
ROC Curves continued

The two areas under the curves may be compared as a measure of overall inter-rater reliability. This comparison is made by applying the following formula:

$$d_{roc} = 1 - |Area_1 - Area_2|$$

By subtracting the absolute difference in areas from one, droc is on a similar scale as K, ranging from 0 to 1.
ROC Curves continued

If both raters correctly classify the objects at the same rate, their sensitivities and specificities will be equal, resulting in a droc of 1.

If one rater correctly classifies all the objects, and the second rater misclassifies all the objects, droc will equal 0.
Statistics for Figure 1 (N=20):

  Rater One:                    Rater Two:
  % Correct = 80%               % Correct = 55%
  sensitivity = 0.80            sensitivity = 0.60
  specificity = 0.80            specificity = 0.533
  Area under ROC = 0.80         Area under ROC = 0.567

  droc = 0.7667
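
These numbers can be reproduced directly: a single binary rater yields a one-point ROC curve whose trapezoidal area is (sensitivity + specificity) / 2, so droc follows immediately. A minimal sketch:

```python
def binary_rater_auc(sens, spec):
    # One-point ROC curve: trapezoidal area = (sens + spec) / 2.
    return (sens + spec) / 2

area_one = binary_rater_auc(0.80, 0.80)    # Rater One:  0.800
area_two = binary_rater_auc(0.60, 0.533)   # Rater Two: ~0.567
d_roc = 1 - abs(area_one - area_two)       # ~0.7667, as in Figure 1
```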
Monte Carlo Simulation

Several different levels of disease prevalence, sample size, and rater error rate were assessed using Monte Carlo methods.

Total sample sizes of 20, 50, and 100 were generated, each for disease prevalences of 5, 15, 25, 50, 75, and 90 percent. Two raters were used in this study. Rater One was fixed at a 5 percent probability of misclassifying the true state of the disease, while Rater Two was allowed varying levels of percent misclassification.

For each condition of disease prevalence, rater error, and sample size, 1000 valid samples were generated and analyzed using SAS PROC IML.
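
The original analysis was run in SAS PROC IML; the sketch below re-creates the design in Python under stated assumptions (independent misclassification for each rater, degenerate tables skipped as invalid; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def rater_area(ratings, truth):
    sens = np.sum(ratings & truth) / max(np.sum(truth), 1)
    spec = np.sum(~ratings & ~truth) / max(np.sum(~truth), 1)
    return (sens + spec) / 2

def simulate(n, prevalence, err_two, err_one=0.05, reps=1000):
    """Mean kappa, PABAK, and droc over `reps` valid simulated samples."""
    kappas, pabaks, drocs = [], [], []
    while len(kappas) < reps:
        truth = rng.random(n) < prevalence
        r1 = truth ^ (rng.random(n) < err_one)   # Rater One: 5% error
        r2 = truth ^ (rng.random(n) < err_two)   # Rater Two: varied error
        a = np.sum(r1 & r2); b = np.sum(r1 & ~r2)
        c = np.sum(~r1 & r2); d = np.sum(~r1 & ~r2)
        denom = (a + c) * (c + d) + (b + d) * (a + b)
        if denom == 0:
            continue                              # not a valid sample
        kappas.append(2 * (a * d - b * c) / denom)
        pabaks.append((a + d - b - c) / n)
        drocs.append(1 - abs(rater_area(r1, truth) - rater_area(r2, truth)))
    return np.mean(kappas), np.mean(pabaks), np.mean(drocs)

# e.g. one design cell: N = 50, prevalence 15%, Rater Two error 25%.
print(simulate(50, 0.15, 0.25))
```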
ROC Curves continued

[Figure Two: Actual Percent Agreement (N=50), plotted against Rater Two error probabilities of 0.05, 0.15, 0.25, 0.5, and 0.75, with separate curves for prevalences of 0.05, 0.15, 0.25, 0.5, 0.75, and 0.9.]
ROC Curves continued

[Figure Four: Cohen's Kappa Coefficient (N=50), plotted against Rater Two error probabilities of 0.05, 0.15, 0.25, 0.5, and 0.75, with separate curves for prevalences of 0.05, 0.15, 0.25, 0.5, 0.75, and 0.9.]
ROC Curves continued

[Figure Five: PABAK Coefficient (N=50), plotted against Rater Two error probabilities of 0.05, 0.15, 0.25, 0.5, and 0.75, with separate curves for prevalences of 0.05, 0.15, 0.25, 0.5, 0.75, and 0.9.]
ROC Curves continued

[Figure Three: Difference in ROC Curves (N=50), plotted against Rater Two error probabilities of 0.05, 0.15, 0.25, 0.5, and 0.75, with separate curves for prevalences of 0.05, 0.15, 0.25, 0.5, 0.75, and 0.9.]
ROC Curves continued

Based on the above results, it appears that the difference in two ROC curves may be a more stable estimate of inter-rater agreement than K. Based on the metric used to assess K, a similar metric can be formed for the difference in two ROC curves. We propose the following:

  0.95 < droc <= 1.0   excellent reliability
  0.8 < droc <= 0.95   good reliability
  0 < droc <= 0.8      marginal reliability
ROC Curves continued

From the example data provided with Figure 1, it can be seen that droc behaves similarly to K. The droc from these data is 0.7667, while K is 0.30. Both result in a decision of marginal inter-rater reliability.

However, from the ROC plot and the percent correct for each rater, it is seen that Rater One is much more accurate in his observations than Rater Two, with percents correct of 80% and 55%, respectively.
ROC Curves continued

Without the individual calculation of the sensitivities and specificities, information about the correctness of the raters would have remained obscure. Additionally, with the large differential rater error, K may have been underestimated.

The difference in ROC curves offers many advantages over K, but only when the true state of the objects being rated is known. Finally, with very little adaptation, these methods may be extended to more than two raters and to continuous outcome data.
So, we now know how to assess whether a test is a good predictor of disease, how to compare two or more tests, and how to determine cutpoints. But:

What if there is no established “gold standard?”
Discrepant Analysis

Discrepant Analysis (DA) is a commonly used (and commonly misused) technique for estimating the sensitivity and specificity of diagnostic tests when the available reference tests are imperfect “gold standards.” This technique often results in “upwardly biased” estimates of the diagnostic statistics.
Discrepant Analysis continued

Example:

Chlamydia trachomatis is a common STI that has been diagnosed using cervical swab culture for years. Often, though, patients only present for screening when they are symptomatic, and symptoms may be closely associated with organism load. Therefore, culture diagnosis may miss carriers and patients with low organism loads.
Discrepant Analysis continued

Example continued:

GenProbe testing has also been used to capture some cases that are not captured by culture. New polymerase chain reaction (PCR) and ligase chain reaction (LCR) DNA assays may be better diagnostic tests. But there is obviously no good “gold standard.”
Discrepant Analysis continued

Example continued:

1. Culture vs. PCR
2. Culture + GenProbe vs. PCR
3. Culture vs. LCR
4. Culture + GenProbe vs. LCR
…and many other combinations.
Discrepant Analysis continued

Example continued:

The goal is to maximize the sensitivity and specificity of the new tests, since we think that the new tests are probably more accurate. The major limitation is that this is often seen as a “fishing expedition,” with a great possibility of Type I error and inflation of the diagnostic statistics.
Polychotomous Logistic Regression

Simple logistic regression is useful when the outcome of interest is binomial (i.e., yes/no, male/female, etc.).

Linear regression is useful when the outcome of interest is continuous (i.e., age, blood pressure, etc.).

But what if the outcome is categorical with more than two levels?
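
For illustration, a polychotomous (multinomial) logistic regression can be fit with statsmodels' MNLogit; the data here are random placeholders, not from the talk:

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: a 3-level categorical outcome and two covariates.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.integers(0, 3, size=200)

# One set of log-odds coefficients per non-reference outcome level.
result = sm.MNLogit(y, X).fit(disp=False)
print(result.summary())
```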
Generalized Estimating Equations

GEE is used when there are repeated measures on continuous, ordinal, or categorical outcomes, and there are different numbers of measurements on each subject. It is useful in that it accounts for missing data at different time points.

The coefficients of a GEE model are interpreted in the same way as those of the corresponding ordinary regression.
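
A minimal GEE sketch using statsmodels; the variables and data are placeholders, and an exchangeable working correlation within subjects is one common choice:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder repeated-measures data: 4 scheduled visits per subject,
# with ~10% of visits dropped to mimic missing measurements.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(50), 4),
    "visit": np.tile(np.arange(4), 50),
    "x": rng.normal(size=200),
    "y": rng.integers(0, 2, size=200),
})
df = df.sample(frac=0.9, random_state=0)

# Binary outcome with an exchangeable working correlation; the
# coefficients are read as in ordinary logistic regression.
model = sm.GEE.from_formula("y ~ x + visit", groups="subject", data=df,
                            family=sm.families.Binomial(),
                            cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```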