Ewout Steyerberg ppt, part II - for Clinical Prediction Models
Relationship between performance measures:
From statistical evaluations to decision-analysis
Ewout Steyerberg
Dept of Public Health, Erasmus MC,
Rotterdam, the Netherlands
[email protected]
Chicago, October 23, 2011
General issues
Usefulness / Clinical utility: what do we mean exactly?
Evaluation of predictions
Evaluation of decisions
Adding a marker to a model
Statistical significance?
Testing β enough (no need to test increase in R2, AUC, IDI, …)
Clinical relevance: measurement worth the costs?
(patient and physician burden, financial costs)
Overview
Case study: residual masses in testicular cancer
Model development
Evaluation approach
Performance evaluation
Statistical
Overall
Calibration and discrimination
Decision-analytic
Utility-weighted measures
www.clinicalpredictionmodels.org
Prediction approach
Outcome: malignant or benign tissue
Predictors:
primary histology
3 tumor markers
tumor size (postchemotherapy, and reduction)
Model:
logistic regression
544 patients, 299 malignant tissue
Internal validation by bootstrapping
External validation in 273 patients, 197 malignant tissue
Logistic regression results
Characteristic | Without LDH | With LDH
Primary tumor teratoma-positive? | 2.7 [1.8 – 4.0] | 2.5 [1.6 – 3.8]
Prechemotherapy AFP elevated? | 2.4 [1.5 – 3.7] | 2.5 [1.6 – 3.9]
Prechemotherapy HCG elevated? | 1.7 [1.1 – 2.7] | 2.2 [1.4 – 3.4]
Square root of postchemotherapy mass size (mm) | 1.08 [0.95 – 1.23] | 1.34 [1.14 – 1.57]
Reduction in mass size per 10% | 0.77 [0.70 – 0.85] | 0.85 [0.77 – 0.95]
Ln of standardised prechemotherapy LDH (LDH/upper limit of local normal value) | – | 0.37 [0.25 – 0.56]
Evaluation approach: graphical assessment
[Validation graphs: observed frequency versus predicted probability for the development set (n=544) and the external validation set (n=273), with the distribution of predictions shown separately for necrosis and tumor.]
Lessons
1. Plot observed versus expected outcome with distribution of predictions
by outcome (‘Validation graph’)
2. Performance should be assessed in validation sets, since apparent
performance is optimistic (model developed in the same data set as
used for evaluation)
Preferably external validation
At least internal validation, e.g. by bootstrap cross-validation
Performance evaluation
Statistical criteria: predictions close to observed outcomes?
Overall; consider residuals y – ŷ, or y – p
Discrimination: separate low risk from high risk
Calibration: e.g. 70% predicted = 70% observed
Clinical usefulness: better decision-making?
One cut-off, defined by expected utility / relative weight of errors
Consecutive cut-offs: decision curve analysis
Predictions close to observed outcomes? Penalty functions
Logarithmic score: –[(1 – Y)*log(1 – p) + Y*log(p)]
Quadratic score: Y*(1 – p)^2 + (1 – Y)*p^2
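A minimal sketch of these two penalty scores (Python with numpy assumed; the logarithmic score is written as a negative log-likelihood so that larger values mean a worse prediction, matching the penalty interpretation):

```python
import numpy as np

def logarithmic_score(y, p):
    """Logarithmic penalty: -[(1 - y) * log(1 - p) + y * log(p)]."""
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -((1 - y) * np.log(1 - p) + y * np.log(p))

def quadratic_score(y, p):
    """Quadratic (Brier-type) penalty: y * (1 - p)^2 + (1 - y) * p^2."""
    return y * (1 - p) ** 2 + (1 - y) * p ** 2

# A prediction of 25% for a patient who has the outcome (y = 1):
print(logarithmic_score(1, 0.25), quadratic_score(1, 0.25))   # ~1.39 and ~0.56
```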
[Figure: behavior of the logarithmic and quadratic error scores as a function of the predicted probability (%), shown separately for y=0 and y=1.]
Overall performance measures
R2: explained variation
Logistic / Cox model: Nagelkerke’s R2
Brier score: Y*(1 – p)^2 + (1 – Y)*p^2
Brierscaled = 1 – Brier / Briermax
Briermax = mean(p) x (1 – mean(p))^2 + (1 – mean(p)) x mean(p)^2
Brierscaled very similar to Pearson R2 for binary outcomes
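A minimal sketch of the Brier score and its scaled version (Python with numpy assumed; y is a 0/1 outcome vector, p the predicted probabilities):

```python
import numpy as np

def brier(y, p):
    """Brier score: mean of y*(1-p)^2 + (1-y)*p^2, i.e. mean squared error for 0/1 outcomes."""
    return np.mean((y - p) ** 2)

def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max, with Brier_max from the mean prediction."""
    m = np.mean(p)
    brier_max = m * (1 - m) ** 2 + (1 - m) * m ** 2
    return 1 - brier(y, p) / brier_max

# Development set of the case study: Brier 0.174 and Brier_max 0.248,
# so the scaled Brier score is 1 - 0.174/0.248 = 0.298.
```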
Overall performance in case study
Measure | Development | Internal validation | External validation
R2 | 38.9% | 37.6% | 26.7%
Brier | 0.174 | 0.178 | 0.161
Briermax | 0.248 | 0.248 | 0.201
Brierscaled | 29.8% | 28.2% | 20.0%
Measures for discrimination
Concordance statistic, or area under the ROC curve
Discrimination slope
Lorenz curve
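A minimal sketch of the first two measures (Python with numpy assumed; y is a 0/1 outcome vector, p the predicted probabilities):

```python
import numpy as np

def c_statistic(y, p):
    """Concordance statistic: proportion of (event, non-event) pairs in which the
    event patient has the higher prediction; ties count as 1/2."""
    p1, p0 = p[y == 1], p[y == 0]
    diff = p1[:, None] - p0[None, :]          # all pairs (can be large: n1 x n0)
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(p1) * len(p0))

def discrimination_slope(y, p):
    """Difference in mean predicted risk between patients with and without the outcome."""
    return p[y == 1].mean() - p[y == 0].mean()
```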
ROC curves for case study
[ROC curves for the development (n=544) and validation (n=273) sets: true positive rate versus false positive rate, with risk cut-offs of 0%, 20%, 30%, and 40% marked on the curves.]
Box plots with discrimination slope for case study
[Figure: predicted risks (with and without LDH) by tumor outcome (0/1) at development and at validation; discrimination slopes of 0.30, 0.34, and 0.24 are shown.]
Lorenz concentration curves: general pattern
[Figure: proportion with the outcome versus cumulative proportion of patients classified as negative.]
Lorenz concentration curves: case study
[Lorenz concentration curves for the development (n=544) and validation (n=273) sets: fraction with unresected tumor versus fraction NOT undergoing resection.]
Discriminative ability of testicular cancer model
Measure | Development (n=544, 245 necrosis) | Internal validation | External validation (n=273, 76 necrosis)
C statistic [95% CI] | 0.818 [0.783 – 0.852] | 0.812 [0.777 – 0.847]** | 0.785 [0.726 – 0.844]
Discrimination slope [95% CI] | 0.301 [0.235 – 0.367]# | 0.294 [0.228 – 0.360]** | 0.237 [0.178 – 0.296]#
Lorenz curve: tumors missed at p25 | 9% | – | 13%
Lorenz curve: tumors missed at p75 | 58% | – | 65%
Characteristics of measures for discrimination
Concordance statistic
Calculation: rank order statistic
Visualization: ROC curve
Pros: insensitive to outcome incidence; interpretable for pairs of patients with and without the outcome
Cons: interpretation artificial

Discrimination slope
Calculation: difference in mean of predictions between outcomes
Visualization: box plot
Pros: easy interpretation, nice visualization
Cons: depends on the incidence of the outcome

Lorenz curve
Calculation: shows concentration of outcomes missed by cumulative proportion of negative classifications
Visualization: concentration curve
Pros: shows balance between finding true positive subjects versus total classified as positive
Cons: depends on the incidence of the outcome
Measures for calibration
Graphical assessments
Cox recalibration framework (1958)
Tests for miscalibration
Cox; Hosmer-Lemeshow; Goeman–Le Cessie
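A sketch of the recalibration framework (Python, assuming statsmodels is available): refit a logistic model with the logit of the predictions as the only covariate; the coefficient estimates the calibration slope, and a model with the logit as offset estimates calibration-in-the-large:

```python
import numpy as np
import statsmodels.api as sm

def recalibration(y, p):
    """Fit logit(P(y=1)) = a + b * lp, with lp the linear predictor (log odds) of the
    predictions; returns (a, b) = (intercept, calibration slope)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    lp = np.log(p / (1 - p))
    fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    return fit.params[0], fit.params[1]

def calibration_in_the_large(y, p):
    """Intercept a with the calibration slope fixed at 1 (lp entered as offset)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    lp = np.log(p / (1 - p))
    X = np.ones((len(y), 1))
    fit = sm.GLM(y, X, family=sm.families.Binomial(), offset=lp).fit()
    return fit.params[0]
```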
Calibration: general principle
[Figure: fraction with actual outcome versus predicted probability, with the ideal line, a nonparametric calibration curve, and grouped observations.]
Calibration: case study
[Calibration plots for the development (n=544) and validation (n=273) sets: observed frequency versus predicted probability, with distributions of predictions for necrosis and tumor.]
Calibration tests
Calibration-in-the-large | H0: a = 0 (given b = 1) | H1: a ≠ 0 (given b = 1) | df = 1
Calibration slope | H0: b = 1 | H1: b ≠ 1 | df = 1
Recalibration (overall miscalibration) | H0: a = 0 and b = 1 | H1: a ≠ 0 or b ≠ 1 | df = 2

Results | Development | Internal validation | External validation
Calibration-in-the-large (a) | 0 | 0 | –0.03
Calibration slope (b) | 1 | 0.97** | 0.74
Overall miscalibration test | p=1 | – | p=0.13
Hosmer-Lemeshow test | p=0.66 | – | p=0.42
Goeman–Le Cessie test# | p=0.63 | – | p=0.94
Hosmer-Lemeshow test for testicular cancer model
Development (n=544)
Decile | P | N | Predicted | Observed
1 | <7.3% | 56 | 2.4 | 1
2 | 7.3-16.5% | 53 | 6.3 | 4
3 | 16.6-26.5% | 55 | 11.6 | 13
4 | 26.6-34.7% | 54 | 16.4 | 15
5 | 34.8-43.6% | 54 | 21.0 | 25
6 | 43.7-54.0% | 58 | 28.5 | 33
7 | 54.1-63.5% | 52 | 31.0 | 31
8 | 63.6-73.8% | 54 | 36.9 | 36
9 | 73.9-85.0% | 54 | 42.8 | 40
10 | >85.0% | 54 | 48.0 | 47
Total | | 544 | 245 | 245
Chi-square=5.9, df=8, p=0.66

Validation (n=273)
Decile | P | N | Predicted | Observed
1 | <1.8% | 31 | 0.2 | 1
2 | 1.8-7.3% | 25 | 1.1 | 1
3 | 7.4-11.1% | 31 | 2.6 | 4
4 | 11.2-17.5% | 30 | 4.4 | 5
5 | 17.6-24.3% | 27 | 5.6 | 7
6 | 24.4-31.0% | 30 | 8.1 | 6
7 | 31.1-37.2% | 20 | 6.7 | 9
8 | 37.3-54.6% | 38 | 17.2 | 18
9 | 54.7-64.7% | 15 | 8.8 | 8
10 | >64.7% | 26 | 20.3 | 17
Total | | 273 | 74.9 | 76
Chi-square=9.2, df=9, p=0.42
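A rough sketch of such a Hosmer-Lemeshow calculation (Python with numpy/scipy assumed; y is a 0/1 outcome vector, p the predicted probabilities; the degrees-of-freedom convention differs between development and validation, as in the tables above):

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, groups=10, df=None):
    """Group patients by quantiles of predicted risk and compare observed with
    expected numbers of events per group."""
    edges = np.quantile(p, np.linspace(0, 1, groups + 1))
    idx = np.clip(np.searchsorted(edges, p, side="right") - 1, 0, groups - 1)
    chi2 = 0.0
    for g in range(groups):
        in_g = idx == g
        n, obs, exp = in_g.sum(), y[in_g].sum(), p[in_g].sum()
        chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n))
    if df is None:
        df = groups - 2            # groups - 1 is often used at external validation
    return chi2, stats.chi2.sf(chi2, df)
```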
Some calibration and goodness-of-fit tests
Calibration-in-the-large
Calculation: compare mean(y) versus mean(ŷ)
Visualization: calibration graph
Pros: key issue in validation; statistical testing possible
Cons: by definition OK in the model development setting

Calibration slope
Calculation: regression slope of the linear predictor
Visualization: calibration graph
Pros: key issue in validation; statistical testing possible
Cons: by definition OK in the model development setting

Calibration test
Calculation: joint test of calibration-in-the-large and calibration slope
Visualization: calibration graph
Pros: efficient test of 2 key issues in calibration
Cons: insensitive to more subtle miscalibration

Harrell's E statistic
Calculation: absolute difference between smoothed y versus ŷ
Visualization: calibration graph
Pros: conceptually easy; summarizes miscalibration over the whole curve
Cons: depends on the smoothing algorithm

Hosmer-Lemeshow test
Calculation: compare observed versus predicted in grouped patients
Visualization: calibration graph or table
Pros: conceptually easy
Cons: interpretation difficult; low power in small samples

Goeman–Le Cessie test
Calculation: consider correlation between residuals
Visualization: –
Pros: overall statistical test; supplementary to calibration graph
Cons: very general

Subgroup calibration
Calculation: compare observed versus predicted in subgroups
Visualization: table
Pros: conceptually easy
Cons: not sensitive to various miscalibration patterns
Lessons
1. Visual inspection of calibration important at external validation,
combined with test for calibration-in-the-large and calibration slope
Clinical usefulness: making decisions
Diagnostic work-up
Test ordering
Starting treatment
Therapeutic decision-making
Surgery
Intensity of treatment
Decision curve analysis
Andrew Vickers
Departments of Epidemiology and
Biostatistics
Memorial Sloan-Kettering Cancer Center
How to evaluate predictions?
Prediction models are
wonderful!
How do you know that
they do more good than
harm?
Overview of talk
• Traditional statistical and decision analytic
methods for evaluating predictions
• Theory of decision curve analysis
Illustrative example
• Men with raised PSA are referred for
prostate biopsy
• In the USA, ~25% of men with raised PSA
have positive biopsy
• ~750,000 unnecessary biopsies / year in US
• Could a new molecular marker help predict
prostate cancer?
Molecular markers for prostate
cancer detection
• Assess a marker in men undergoing
prostate biopsy for elevated PSA
• Create “base” model:
– Logistic regression: biopsy result as
dependent variable; PSA, free PSA, age as
predictors
• Create “marker” model
– Add marker(s) as predictor to the base model
• Compare “base” and “marker” model
How to evaluate models?
• Biostatistical approach (ROC’ers)
– P values
– Accuracy (area-under-the-curve: AUC)
• Decision analytic approach (VOI’ers)
– Decision tree
– Preferences / outcomes
PSA velocity
P value for PSAv in multivariable model <0.001
PSAv an “independent” predictor
AUC:
Base model = 0.609
Marker model = 0.626
AUCs and p values
• I have no idea whether to use the
model or not
– Is an AUC of 0.626 high enough?
– Is an increase in AUC of 0.017 enough
to make measuring velocity worth it?
Decision analysis
• Identify every possible decision
• Identify every possible consequence
– Identify probability of each
– Identify value of each
[Decision tree: 'Apply model' branches to biopsy (cancer with probability p1, value a; no cancer with probability p2, value b) and no biopsy (cancer p3, value c; no cancer 1 – (p1 + p2 + p3), value d). The alternative strategies biopsy all and no biopsy face cancer with probability p1 + p3, with values a and b, or c and d, respectively.]
Optimal decision
• Use model
– p1·a + p2·b + p3·c + (1 – p1 – p2 – p3)·d
• Treat all
– (p1 + p3)·a + (1 – (p1 + p3))·b
• Treat none
– (p1 + p3)·c + (1 – (p1 + p3))·d
• Which gives highest value?
Drawbacks of traditional
decision analysis
• p’s require a cut-point to be chosen
[Decision tree repeated from the previous slide.]
Problems with traditional
decision analysis
• p’s require a cut-point to be chosen
• Extra data needed on health values
outcomes (a – d)
– Harms of biopsy
– Harms of delayed diagnosis
– Harms may vary between patients
[Decision tree repeated from the previous slide.]
Evaluating values of health outcomes
1. Obtain data from the literature on:
• Benefit of detecting cancer (compared to missed / delayed cancer)
• Harms of unnecessary prostate biopsy (compared to no biopsy)
• Burden: pain and inconvenience
• Cost of biopsy
Evaluating values of health outcomes
2. Obtain data from the individual patient:
• What are your views on having a biopsy?
• How important is it for you to find a cancer?
Either way
• Investigator: “here is a data set, is
my model or marker of value?”
• Analyst: “I can’t tell you, you have to
go away and do a literature search
first. Also, you have to ask each and
every patient.”
ROCkers and VOIers
• ROCkers’ methods are simple and
elegant but useless
• VOIers’ methods are useful, but
complex and difficult to apply
Solving the decision tree
[Decision tree: Treatment → disease (probability p, value a) or no disease (1 – p, value b); No treatment → disease (p, value c) or no disease (1 – p, value d).]
Threshold probability
Probability of disease is p̂
Define a threshold probability of disease as pt
Patient accepts treatment if p̂ ≥ pt
Solve the decision tree
• pt, cut-point for choosing whether to treat or not
• Harm:Benefit ratio defines pt
– Harm: d – b (FP)
– Benefit: a – c (TP)
• pt / (1 – pt) = H:B
If P(D=1) = pt, treatment and no treatment have equal expected value, so:
pt / (1 – pt) = (d – b) / (a – c)
[Decision tree at the threshold: Treatment → disease (pt, value a) or no disease (1 – pt, value b); No treatment → disease (pt, value c) or no disease (1 – pt, value d).]
Intuitively
• The threshold probability at which a
patient will opt for treatment is
informative of how a patient weighs the
relative harms of false-positive and
false-negative results.
Nothing new so far
• Equation has been used to set
threshold for positive diagnostic test
• Work out true harms and benefits of
treatment and disease
– E.g. if disease is 4 times worse than
treatment, treat all patients with
probability of disease >20%.
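A small sketch of this threshold relation (plain Python; harm = d – b and benefit = a – c are the utility differences from the decision tree):

```python
def threshold_probability(harm, benefit):
    """Threshold at which treating and not treating have equal expected value:
    pt / (1 - pt) = harm / benefit, so pt = harm / (harm + benefit)."""
    return harm / (harm + benefit)

# Slide example: untreated disease is 4 times worse than unnecessary treatment,
# i.e. benefit = 4 * harm, so the treatment threshold is 1 / (1 + 4) = 20%.
print(threshold_probability(harm=1, benefit=4))   # 0.2
```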
A simple decision analysis
1. Select a pt
2. Positive test defined as p̂ ≥ pt
3. Count true positives (benefit), false positives (harm)
4. Calculate “Clinical Net Benefit” as:
Net Benefit = TruePositiveCount/n – (FalsePositiveCount/n) × pt / (1 – pt)
Long history: Peirce 1884
Worked example at pt = 20% (N = 2742)

Strategy | True positive | False positive | Negative | Net benefit calculation | Net benefit
Biopsy if risk ≥ 20% | 653 | 1743 | 346 | [653 – 1743 × (0.2 ÷ 0.8)] / 2742 | 0.079
Biopsy all men | 710 | 2032 | 0 | [710 – 2032 × (0.2 ÷ 0.8)] / 2742 | 0.074
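A minimal sketch reproducing the calculation in the table (plain Python, counts taken from the example):

```python
def net_benefit(tp, fp, n, pt):
    """Clinical net benefit at threshold pt: TP/n - (FP/n) * pt / (1 - pt)."""
    return tp / n - (fp / n) * pt / (1 - pt)

print(net_benefit(tp=653, fp=1743, n=2742, pt=0.20))   # ~0.079, biopsy if risk >= 20%
print(net_benefit(tp=710, fp=2032, n=2742, pt=0.20))   # ~0.074, biopsy all men
```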
Net benefit has simple clinical interpretation
• Net benefit of 0.079 at pt of 20%
• Using the model is equivalent to a strategy that identified 7.9 cancers per 100 patients with no unnecessary biopsies
Net benefit has simple clinical interpretation
• Difference between model and treat all at pt of 20%:
– 5/1000 more TPs for an equal number of FPs
• Divide by the weighting: 0.005 / 0.25 = 0.02
– 20/1000 fewer FPs for an equal number of TPs (= 20/1000 fewer unnecessary biopsies with no missed cancers)
Decision curve analysis
1. Select a pt
2. Positive test defined as p̂ ≥ pt
3. Calculate “Clinical Net Benefit” as:
Net Benefit = TruePositiveCount/n – (FalsePositiveCount/n) × pt / (1 – pt)
4. Vary pt over an appropriate range
Vickers & Elkin Med Decis Making 2006;26:565–574
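A compact sketch of the decision-curve computation (Python with numpy assumed; y is a 0/1 outcome vector, p the model's predicted probabilities):

```python
import numpy as np

def decision_curve(y, p, thresholds=np.arange(0.01, 1.00, 0.01)):
    """Net benefit of the model, of 'treat all' and of 'treat none' over a range
    of threshold probabilities pt."""
    n, prevalence = len(y), y.mean()
    curve = []
    for pt in thresholds:
        w = pt / (1 - pt)
        tp = np.sum((p >= pt) & (y == 1))
        fp = np.sum((p >= pt) & (y == 0))
        nb_model = tp / n - w * fp / n
        nb_all = prevalence - w * (1 - prevalence)   # classify every patient as positive
        curve.append((pt, nb_model, nb_all, 0.0))    # net benefit of 'treat none' is 0
    return curve
```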
Decision curve: theory
[Figures: net benefit versus threshold probability (0–100%), built up over three plots: treat none, treat all (outcome prevalence 50%), and decisions based on the model.]
Points in Decision Curves
• If treat none, NB = ..
• If treat all, and threshold = 0%,
NB = …
• If cut-off is incidence of end point
– NB(treat none) = NB(treat all) = …
Decision curve analysis
• Decision curve analysis tells us about the
clinical value of a model where accuracy
metrics do not
• Decision curve analysis does not require
either:
– Additional data
– Individualized assessment
• Simple-to-use software is available to implement decision curve analysis
www.decisioncurveanalysis.org
Decision analysis in the
medical research literature
• Only a moderate number of papers
devoted to decision analysis
• Many thousands of papers analyzed
without reference to decision making
(ROC curves, p values)
Decision Curve Analysis
• With thanks to….
– Elena Elkin
– Mike Kattan
– Daniel Sargent
– Stuart Baker
– Barry Kramer
– Ewout Steyerberg
Illustrations
Clinical usefulness of testicular cancer model
Cut-off 70% necrosis / 30% malignant, motivated by decision analysis; current practice: ≈ 65%
Net benefit calculations

Strategy | Development (n=544) | Validation (n=273)
Resect all | NB = (299 – 3/7∙245)/544 = 0.357 | 0.602
Resect none | NB = (0 – 0)/544 = 0 | 0
Model | NB = (275 – 3/7∙143)/544 = 0.393 | 0.602
Difference model – resect all | 0.036 | 0
More resections of tumor at the same number of unnecessary resections of necrosis | 3.6/100 | 0
Decision curves for testicular cancer model
Comparison of performance measures
Lessons
1. Clinical usefulness may be limited despite reasonable discrimination
and calibration
Which performance measure when?
It depends …
Evaluation of usefulness requires weighting and consideration of
outcome incidence
Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not
exist. Stat Med. 2000;19(4):431-40.
Summary indices vs graphs
(e.g. area vs ROC curve, validation graphs, decision curves,
reclassification table vs predictiveness curve)
Which performance measure when?
1. Discrimination: if poor, usefulness unlikely, but NB >= 0
2. Calibration: if poor in new setting, risk of NB<0
Conclusions
Statistical evaluations important, but may be at odds
with evaluation of clinical usefulness; ROC 0.8 good?
0.6 always poor? NO!
Decision-analytic based performance measures,
such as decision curves, are important to consider in
the evaluation of the potential of a prediction model
to support individualized decision making
References
Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer, 2009.
Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26:565-74, 2006.
Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Making 28:146, 2008.
Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 30:11-21, 2011.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21:128-38, 2010.
Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest, 2011.
Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and markers: evaluation of predictions and classifications. Rev Esp Cardiol 64:788-794, 2011.
Evaluation of incremental value of markers
Case study: CVD prediction
Cohort: 3264 participants in Framingham Heart Study
Age 30 to 74 years
183 developed CHD (10 year risk: 5.6%)
Data as used in
Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of
a new marker: from area under the ROC curve to reclassification and beyond.
Stat Med 27:157-172, 2008
Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and
markers: evaluation of predictions and classifications Rev Esp Cardiol 64:788-794, 2011
Analysis
Cox proportional hazards models
Time to event data
Reference model:
Dichotomous: Sex, diabetes, smoking
Continuous: age, systolic blood pressure (SBP), total cholesterol
All hazard ratios statistically significant
Add high-density lipoprotein (HDL) cholesterol
Continuous predictor, highly significant (hazard ratio = 0.65, P-value < .001)
How good are these models?
Performance of reference model
Incremental value of HDL
Performance criteria
Steyerberg EW, Van Calster B, Pencina MJ. Medidas del rendimiento de modelos de predicción y marcadores pronósticos: evaluación de las predicciones y clasificaciones. Rev Esp Cardiol. 2011. doi:10.1016/j.recesp.2011.04.017
Case study: quality of predictions
Discrimination
Area: 0.762 without HDL vs 0.774 with HDL
Calibration
Internal: quite good
External: more relevant
Performance
Full range of predictions
ROC
R2
..
Classifications / decisions
Cut-off to define low vs high risk
Determine a cut-off for classification
Data-driven cut-off
Youden’s index: sensitivity + specificity – 1
E.g. sens 80%, spec 80% Youden = …
E.g. sens 90%, spec 80% Youden = …
E.g. sens 80%, spec 90% Youden = …
E.g. sens 40%, spec 60% Youden = …
E.g. sens 100%, spec 100% Youden = …
Youden’s index maximized: upper left corner ROC curve
If predictions perfectly calibrated
Upper left corner: cut-off = incidence of the outcome
Incidence = 183/3264 = 5.6%
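A small sketch of the data-driven approach (Python with numpy assumed; the cut-off grid is illustrative):

```python
import numpy as np

def youden(sens, spec):
    """Youden's index J = sensitivity + specificity - 1."""
    return sens + spec - 1

def best_cutoff_by_youden(y, p, cutoffs=np.arange(0.01, 1.00, 0.01)):
    """Cut-off that maximizes Youden's index, i.e. the ROC point with the
    largest sens + spec (closest to the upper-left corner)."""
    def j(c):
        sens = (p[y == 1] >= c).mean()
        spec = (p[y == 0] < c).mean()
        return youden(sens, spec)
    return max(cutoffs, key=j)
```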
Determine a cut-off for classification
Data-driven cut-off
Youden’s index: sensitivity + specificity – 1
Decision-analytic
Cut-off determined by clinical context
Relative importance (‘utility’) of the consequence of a true or false
classification
True-positive classification: correct treatment
False-positive classification: overtreatment
True-negative classification: no treatment
False-negative classification: undertreatment
Harm: net overtreatment (FP-TN)
Benefit: net correct treatment (TP-FN)
Odds of the cut-off = H:B ratio
Evaluation of performance
Youden index: “science of the method”
Net Benefit: “utility of the method”
References:
Peirce, Science 1884
Vergouwe, Semin Urol Oncol 2002
Vickers, MDM 2006
Net Benefit
Net Benefit = (TP – w FP) / N
w = cut-off/ (1 – cut-off)
e.g.: cut-off 50%: w = .5/.5=1;
cut-off 20%: w=.2/.8=1/4
w = H : B ratio
“Number of true-positive classifications,
penalized for false-positive classifications”
Increase in AUC
5.6% cut-off: AUC 0.696 → 0.719
20% cut-off: AUC 0.550 → 0.579
Continuous variant
Area: 0.762 → 0.774
Addition of a marker to a model
Typically small improvement in discriminative ability according to AUC
(or c statistic)
c stat blamed for being insensitive
Study ‘Reclassification’
Net Reclassification Index:
improvement in sensitivity + improvement in specificity
= (move up | event – move down | event) +
(move down | non-event – move up | non-event )
[Reclassification table (figure): 22/183 = 12% among those with events; –1/3081 = –0.03% among those without.]
NRI for 5.6% cut-off?
NRI for CHD: 7/183 = 3.8%
NRI for No CHD: 24/3081 = 0.8%
NRI = 4.6%
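A minimal sketch of this NRI calculation (plain Python; the 7 and 24 are taken as the net numbers reclassified up among events and down among non-events, as on the slide):

```python
def nri(net_up_events, n_events, net_down_nonevents, n_nonevents):
    """Net Reclassification Index for one cut-off:
    (net proportion of events moving up) + (net proportion of non-events moving down)."""
    return net_up_events / n_events + net_down_nonevents / n_nonevents

print(nri(net_up_events=7, n_events=183,
          net_down_nonevents=24, n_nonevents=3081))   # ~0.046, i.e. NRI = 4.6%
```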
NRI and sens/spec
NRI = delta sens + delta spec
Sens without HDL = 135/183 = 73.8%
Sens with HDL = 142/183 = 77.6%
NRI better than delta AUC?
NRI = delta(sens) + delta(spec)
AUC for binary classification = (sens + spec) / 2
NRI and delta AUC
NRI = delta(sens) + delta(spec)
AUC for binary classification = (sens + spec) / 2
Delta AUC = (delta(sens) + delta(spec)) / 2
NRI = 2 x delta(AUC)
Delta(Youden) = delta(sens) + delta(spec)
NRI = delta(Youden)
NRI has ‘absurd’ weighting?
Decision-analytic performance: NB
Net Benefit = (TP – w FP) / N
No HDL model:
TP = 3+132 = 135
FP = 166 + 901= 1067
w = 0.056/0.944 = 0.059
N = 3264
NB = (135 – 0.059 x 1067) / 3264 = 2.21%
With HDL model:
NB = (142 – 0.059 x 1043) / 3264 = 2.47%
Delta(NB)
Increase in TP: 10 – 3 = 7
Decrease in FP: 166 – 142 = 24
Increase in NB: (7 + 0.059 x 24) / 3264 = 0.26%
Interpretation:
“2.6 more true CHD events identified per 1000 subjects, at the same number of FP
classifications.”
“ HDL has to be measured in 1/0.26% = 385 subjects to identify one more TP”
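A sketch reproducing these figures (plain Python; TP/FP counts and the 5.6% cut-off as given above; small differences from the slide come from rounding w):

```python
def net_benefit(tp, fp, n, cutoff):
    """Net benefit = (TP - w * FP) / N, with w = cutoff / (1 - cutoff)."""
    w = cutoff / (1 - cutoff)
    return (tp - w * fp) / n

nb_base = net_benefit(tp=135, fp=1067, n=3264, cutoff=0.056)   # ~0.022 (no HDL)
nb_hdl  = net_benefit(tp=142, fp=1043, n=3264, cutoff=0.056)   # ~0.025 (with HDL)
delta = nb_hdl - nb_base                                       # ~0.0026
print(delta, 1 / delta)   # about 2.6 extra true positives per 1000;
                          # roughly 1 per ~390 subjects measured (slide: 385, with w rounded to 0.059)
```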
Application to FHS
Continuous NRI: no categories
All cut-offs; information similar to AUC and Decision Curve