Ewout Steyerberg ppt, part II - for Clinical Prediction Models


Relationship between performance measures:
From statistical evaluations to decision-analysis
Ewout Steyerberg
Dept of Public Health, Erasmus MC,
Rotterdam, the Netherlands
[email protected]
Chicago, October 23, 2011
General issues
 Usefulness / Clinical utility: what do we mean exactly?
 Evaluation of predictions
 Evaluation of decisions
 Adding a marker to a model
 Statistical significance?
Testing β enough (no need to test increase in R2, AUC, IDI, …)
 Clinical relevance: measurement worth the costs?
(patient and physician burden, financial costs)
Overview
 Case study: residual masses in testicular cancer
 Model development
 Evaluation approach
 Performance evaluation
 Statistical
 Overall
 Calibration and discrimination
 Decision-analytic
 Utility-weighted measures
 www.clinicalpredictionmodels.org
Prediction approach
 Outcome: malignant or benign tissue
 Predictors:
 primary histology
 3 tumor markers
 tumor size (postchemotherapy, and reduction)
 Model:
 logistic regression
 544 patients, 299 malignant tissue
 Internal validation by bootstrapping
 External validation in 273 patients, 197 malignant tissue
Logistic regression results
Values are odds ratios [95% CI].

Characteristic                                        Without LDH          With LDH
Primary tumor teratoma-positive?                      2.7 [1.8 – 4.0]      2.5 [1.6 – 3.8]
Prechemotherapy AFP elevated?                         2.4 [1.5 – 3.7]      2.5 [1.6 – 3.9]
Prechemotherapy HCG elevated?                         1.7 [1.1 – 2.7]      2.2 [1.4 – 3.4]
Square root of postchemotherapy mass size (mm)        1.08 [0.95 – 1.23]   1.34 [1.14 – 1.57]
Reduction in mass size per 10%                        0.77 [0.70 – 0.85]   0.85 [0.77 – 0.95]
Ln of standardised prechemotherapy LDH
  (LDH / upper limit of local normal value)           -                    0.37 [0.25 – 0.56]
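A minimal sketch of how such a model could be developed and internally validated, assuming a hypothetical data frame `df` with the listed predictors and a 0/1 `malignant` outcome (an illustration, not the authors' code):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

# hypothetical column names for the predictors listed above
predictors = ["teratoma", "afp_elevated", "hcg_elevated",
              "sqrt_postchemo_size", "reduction_per_10pct", "ln_std_ldh"]

def fit_and_auc(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Fit a logistic regression on `train`, return the c statistic on `test`."""
    X = sm.add_constant(train[predictors])
    fit = sm.Logit(train["malignant"], X).fit(disp=0)
    p = fit.predict(sm.add_constant(test[predictors]))
    return roc_auc_score(test["malignant"], p)

def bootstrap_corrected_auc(df: pd.DataFrame, n_boot: int = 200, seed: int = 1) -> float:
    """Internal validation by bootstrapping: apparent AUC minus average optimism."""
    rng = np.random.default_rng(seed)
    apparent = fit_and_auc(df, df)
    optimism = []
    for _ in range(n_boot):
        boot = df.sample(n=len(df), replace=True,
                         random_state=int(rng.integers(10**9)))
        # performance in the bootstrap sample minus performance in the original data
        optimism.append(fit_and_auc(boot, boot) - fit_and_auc(boot, df))
    return apparent - float(np.mean(optimism))
```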
Evaluation approach: graphical assessment
[Validation graphs: observed frequency versus predicted probability in the development set (n=544) and the external validation set (n=273), with the distributions of predictions shown separately for necrosis and tumor]
Lessons
1. Plot observed versus expected outcome with distribution of predictions
by outcome (‘Validation graph’)
2. Performance should be assessed in validation sets, since apparent
performance is optimistic (model developed in the same data set as
used for evaluation)
 Preferably external validation
 At least internal validation, e.g. by bootstrap cross-validation
Performance evaluation
 Statistical criteria: predictions close to observed outcomes?
 Overall; consider residuals y – ŷ, or y – p
 Discrimination: separate low risk from high risk
 Calibration: e.g. 70% predicted = 70% observed
 Clinical usefulness: better decision-making?
 One cut-off, defined by expected utility / relative weight of errors
 Consecutive cut-offs: decision curve analysis
Predictions close to observed outcomes? Penalty functions
 Logarithmic score: (1 – Y)*(log(1 – p)) + Y*log(p)
 Quadratic score: Y*(1 – p)^2 + (1 – Y)*p^2
[Figure: behavior of the logarithmic and quadratic error scores as a function of the predicted probability (%), shown separately for y = 0 and y = 1]
Overall performance measures
 R2: explained variation
 Logistic / Cox model: Nagelkerke’s R2
 Brier score: Y*(1 – p)^2 + (1 – Y)*p^2
 Brierscaled = 1 – Brier / Briermax
 Briermax = mean(p) x (1 – mean(p))^2 + (1 – mean(p)) x mean(p)^2
 Brierscaled very similar to Pearson R2 for binary outcomes
Overall performance in case study
                 Development   Internal validation   External validation
R2               38.9%         37.6%                 26.7%
Brier            0.174         0.178                 0.161
Brier_max        0.248         0.248                 0.201
Brier_scaled     29.8%         28.2%                 20.0%
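A small sketch (my own illustration, not from the slides) of how these overall measures can be computed from 0/1 outcomes y and predicted probabilities p:

```python
import numpy as np

def overall_performance(y: np.ndarray, p: np.ndarray) -> dict:
    """Logarithmic score, Brier score, and scaled Brier score."""
    log_score = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    brier = np.mean((y - p) ** 2)
    p_mean = p.mean()
    # maximum Brier score as defined on the previous slide
    brier_max = p_mean * (1 - p_mean) ** 2 + (1 - p_mean) * p_mean ** 2
    return {"log_score": log_score,
            "Brier": brier,
            "Brier_max": brier_max,
            "Brier_scaled": 1 - brier / brier_max}
```

As a check against the table above, the development figures give Brier_scaled = 1 – 0.174 / 0.248 ≈ 29.8%.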
Measures for discrimination
 Concordance statistic, or area under the ROC curve
 Discrimination slope
 Lorenz curve
ROC curves for case study
[ROC curves for the development (n=544) and validation (n=273) sets: true positive rate versus false positive rate, with risk thresholds of 0%, 20%, 30% and 40% marked on the curves]

Box plots with discrimination slope for case study
[Box plots of predicted risk by outcome (tumor 0/1): predicted risk without and with LDH in the development data and without LDH at validation; discrimination slopes of 0.3, 0.34 and 0.24]
Lorenz concentration curves: general pattern
[General pattern of a Lorenz concentration curve: proportion with the outcome versus cumulative proportion of patients]
Lorenz concentration curves: case study
[Lorenz curves for the development (n=544) and validation (n=273) sets: fraction with unresected tumor versus fraction NOT undergoing resection]
Discriminative ability of testicular cancer model
                               Development             Internal validation     External validation
                               n=544, 245 necrosis                             n=273, 76 necrosis
C statistic                    0.818                   0.812                   0.785
  [95% CI]                     [0.783 – 0.852]         [0.777 – 0.847]**       [0.726 – 0.844]
Discrimination slope           0.301                   0.294                   0.237
  [95% CI]                     [0.235 – 0.367]#        [0.228 – 0.360]**       [0.178 – 0.296]#
Lorenz curve, tumors missed
  at p25                       9%                                              13%
  at p75                       58%                                             65%
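As an illustration (not the authors' code), the two main quantities in this table can be computed as follows:

```python
import numpy as np

def c_statistic(y: np.ndarray, p: np.ndarray) -> float:
    """Concordance statistic: chance that a random patient with the outcome
    gets a higher prediction than a random patient without it (ties count 0.5)."""
    p1, p0 = p[y == 1], p[y == 0]
    comparisons = (p1[:, None] > p0[None, :]) + 0.5 * (p1[:, None] == p0[None, :])
    return comparisons.mean()

def discrimination_slope(y: np.ndarray, p: np.ndarray) -> float:
    """Difference in mean predicted probability between outcome groups."""
    return p[y == 1].mean() - p[y == 0].mean()
```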
Characteristics of measures for discrimination
Measure: Concordance statistic
  Calculation: rank order statistic
  Visualization: ROC curve
  Pros: insensitive to outcome incidence; interpretable for pairs of patients with and without the outcome
  Cons: interpretation artificial

Measure: Discrimination slope
  Calculation: difference in mean of predictions between outcomes
  Visualization: box plot
  Pros: easy interpretation, nice visualization
  Cons: depends on the incidence of the outcome

Measure: Lorenz curve
  Calculation: shows concentration of outcomes missed by cumulative proportion of negative classifications
  Visualization: concentration curve
  Pros: shows balance between finding true positive subjects versus total classified as positive
  Cons: depends on the incidence of the outcome
Measures for calibration
 Graphical assessments
 Cox recalibration framework (1958)
 Tests for miscalibration
 Cox; Hosmer-Lemeshow; Goeman - LeCessie
Calibration: general principle
[Calibration plot: fraction with the actual outcome versus predicted probability, showing the ideal 45-degree line, a nonparametric (smoothed) calibration curve, and grouped observations]
Calibration: case study
[Calibration plots for the case study: observed frequency versus predicted probability in the development (n=544) and validation (n=273) sets, with distributions of predictions shown separately for necrosis and tumor]
Calibration tests
Estimates are shown for calibration-in-the-large and calibration slope; p-values for the tests.

Test                        H0                      H1                      df   Development   Internal validation   External validation
Calibration-in-the-large    a = 0 | b_overall = 1   a ≠ 0 | b_overall = 1   1    0             0                     –0.03
Calibration slope           b_overall = 1           b_overall ≠ 1           1    1             0.97**                0.74
Recalibration test          a = 0 and b_overall = 1 a ≠ 0 or b_overall ≠ 1  2    p=1           –                     p=0.13
Overall miscalibration
  Hosmer-Lemeshow                                                                p=0.66                              p=0.42
  Goeman – Le Cessie#                                                            p=0.63                              p=0.94
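A sketch of how the calibration-in-the-large and calibration slope estimates in this table can be obtained in a validation sample (an assumed implementation using statsmodels, not the authors' software):

```python
import numpy as np
import statsmodels.api as sm

def calibration_intercept_slope(y: np.ndarray, p: np.ndarray) -> tuple:
    """y: observed 0/1 outcomes; p: predicted probabilities in the new data."""
    lp = np.log(p / (1 - p))  # linear predictor (logit of the predictions)
    # calibration slope: logistic regression of y on the linear predictor
    slope = sm.GLM(y, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit().params[1]
    # calibration-in-the-large: intercept with the linear predictor as offset
    intercept = sm.GLM(y, np.ones(len(y)), offset=lp,
                       family=sm.families.Binomial()).fit().params[0]
    return intercept, slope
```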
Hosmer-Lemeshow test for testicular cancer model
Development (n=544)

Decile   P (predicted risk)   N     Predicted   Observed
1        <7.3%                56    2.4         1
2        7.3-16.5%            53    6.3         4
3        16.6-26.5%           55    11.6        13
4        26.6-34.7%           54    16.4        15
5        34.8-43.6%           54    21.0        25
6        43.7-54.0%           58    28.5        33
7        54.1-63.5%           52    31.0        31
8        63.6-73.8%           54    36.9        36
9        73.9-85.0%           54    42.8        40
10       >85.0%               54    48.0        47
Total                         544   245         245
Chi-square=5.9, df=8, p=0.66

Validation (n=273)

Decile   P (predicted risk)   N     Predicted   Observed
1        <1.8%                31    0.2         1
2        1.8-7.3%             25    1.1         1
3        7.4-11.1%            31    2.6         4
4        11.2-17.5%           30    4.4         5
5        17.6-24.3%           27    5.6         7
6        24.4-31.0%           30    8.1         6
7        31.1-37.2%           20    6.7         9
8        37.3-54.6%           38    17.2        18
9        54.7-64.7%           15    8.8         8
10       >64.7%               26    20.3        17
Total                         273   74.9        76
Chi-square=9.2, df=9, p=0.42
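A compact sketch of the Hosmer-Lemeshow statistic behind these tables (a hypothetical implementation; the slides use df = 8 at development and df = 9 at validation, so the degrees of freedom are left as an argument):

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y: np.ndarray, p: np.ndarray, groups: int = 10, dof: int = 8):
    """Group by deciles of predicted risk and compare observed with predicted events."""
    order = np.argsort(p)
    y, p = y[order], p[order]
    chisq = 0.0
    for g in np.array_split(np.arange(len(p)), groups):
        obs, exp, n = y[g].sum(), p[g].sum(), len(g)
        chisq += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    return chisq, dof, 1 - chi2.cdf(chisq, dof)
```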
Some calibration and goodness-of-fit tests
Calibration-in-the-large
  Calculation: compare mean(y) versus mean(ŷ)
  Visualization: calibration graph
  Pros: key issue in validation; statistical testing possible
  Cons: by definition OK in model development setting

Calibration slope
  Calculation: regression slope of linear predictor
  Visualization: calibration graph
  Pros: key issue in validation; statistical testing possible
  Cons: by definition OK in model development setting

Calibration test
  Calculation: joint test of calibration-in-the-large and calibration slope
  Visualization: calibration graph
  Pros: efficient test of 2 key issues in calibration
  Cons: insensitive to more subtle miscalibration

Harrell's E statistic
  Calculation: absolute difference between smoothed y versus ŷ
  Visualization: calibration graph
  Pros: conceptually easy, summarizes miscalibration over whole curve
  Cons: depends on smoothing algorithm

Hosmer-Lemeshow test
  Calculation: compare observed versus predicted in grouped patients
  Visualization: calibration graph or table
  Pros: conceptually easy
  Cons: interpretation difficult; low power in small samples

Goeman – Le Cessie test
  Calculation: consider correlation between residuals
  Visualization: -
  Pros: overall statistical test; supplementary to calibration graph
  Cons: very general

Subgroup calibration
  Calculation: compare observed versus predicted in subgroups
  Visualization: table
  Pros: conceptually easy
  Cons: not sensitive to various miscalibration patterns
Lessons
1. Visual inspection of calibration important at external validation,
combined with test for calibration-in-the-large and calibration slope
Clinical usefulness: making decisions
 Diagnostic work-up
 Test ordering
 Starting treatment
 Therapeutic decision-making
 Surgery
 Intensity of treatment
Decision curve analysis
Andrew Vickers
Departments of Epidemiology and
Biostatistics
Memorial Sloan-Kettering Cancer Center
How to evaluate predictions?
Prediction models are wonderful!
How do you know that they do more good than harm?
Overview of talk
• Traditional statistical and decision analytic
methods for evaluating predictions
• Theory of decision curve analysis
Illustrative example
• Men with raised PSA are referred for
prostate biopsy
• In the USA, ~25% of men with raised PSA
have positive biopsy
• ~750,000 unnecessary biopsies / year in US
• Could a new molecular marker help predict
prostate cancer?
Molecular markers for prostate
cancer detection
• Assess a marker in men undergoing
prostate biopsy for elevated PSA
• Create “base” model:
– Logistic regression: biopsy result as
dependent variable; PSA, free PSA, age as
predictors
• Create “marker” model
– Add marker(s) as predictor to the base model
• Compare “base” and “marker” model
How to evaluate models?
• Biostatistical approach (ROC’ers)
– P values
– Accuracy (area-under-the-curve: AUC)
• Decision analytic approach (VOI’ers)
– Decision tree
– Preferences / outcomes
PSA velocity
• P value for PSAv in multivariable model < 0.001
• PSAv an "independent" predictor
• AUC: base model = 0.609; marker model = 0.626
AUCs and p values
• I have no idea whether to use the
model or not
– Is an AUC of 0.626 high enough?
– Is an increase in AUC of 0.017 enough
to make measuring velocity worth it?
Decision analysis
• Identify every possible decision
• Identify every possible consequence
– Identify probability of each
– Identify value of each
[Decision tree: apply model → biopsy if predicted risk high (cancer, p1; no cancer, p2) or no biopsy if low (cancer, p3; no cancer, 1 – (p1+p2+p3)); alternative strategies: biopsy all (cancer, p1+p3; no cancer, 1 – (p1+p3)) and biopsy none; outcome values a (biopsy, cancer), b (biopsy, no cancer), c (no biopsy, cancer), d (no biopsy, no cancer)]
Optimal decision
• Use model
– p1 a + p2 b + p3 c + (1 - p1 - p2 - p3 )d
• Treat all
– (p1 + p3 )a + (1- (p1 + p3 ))b
• Treat none
– (p1 + p3 )c + (1- (p1 + p3 ))d
• Which gives highest value?
Drawbacks of traditional
decision analysis
• p’s require a cut-point to be chosen
Problems with traditional
decision analysis
• p’s require a cut-point to be chosen
• Extra data needed on values of health outcomes (a – d)
– Harms of biopsy
– Harms of delayed diagnosis
– Harms may vary between patients
Evaluating values of health outcomes
1. Obtain data from the literature on:
• Benefit of detecting cancer (compared to a missed / delayed cancer)
• Harms of unnecessary prostate biopsy (compared to no biopsy)
  – Burden: pain and inconvenience
  – Cost of biopsy
Evaluating values of health outcomes
2. Obtain data from the individual patient:
• What are your views on having a biopsy?
• How important is it for you to find a cancer?
Either way
• Investigator: “here is a data set, is
my model or marker of value?”
• Analyst: “I can’t tell you, you have to
go away and do a literature search
first. Also, you have to ask each and
every patient.”
ROCkers and VOIers
• ROCkers’ methods are simple and
elegant but useless
• VOIers’ methods are useful, but
complex and difficult to apply
Solving the decision tree
Solving the decision tree
[Decision tree: treatment versus no treatment; disease occurs with probability p; outcome values a (treatment, disease), b (treatment, no disease), c (no treatment, disease), d (no treatment, no disease)]
Threshold probability
Probability of disease is p̂
Define a threshold probability of disease as pt
Patient accepts treatment if p̂ ≥ pt
Solve the decision tree
• pt, cut-point for choosing whether to
treat or not
• Harm:Benefit ratio defines pt
– Harm: d – b (FP)
– Benefit: a – c (TP)
• pt / (1-pt) = H:B
If P(D=1) = pt, treatment and no treatment have equal expected value:

pt / (1 – pt) = (d – b) / (a – c)

[Decision tree: treatment versus no treatment at disease probability pt, with outcome values a, b, c, d]
Intuitively
• The threshold probability at which a
patient will opt for treatment is
informative of how a patient weighs the
relative harms of false-positive and
false-negative results.
Nothing new so far
• Equation has been used to set
threshold for positive diagnostic test
• Work out true harms and benefits of
treatment and disease
– E.g. if disease is 4 times worse than
treatment, treat all patients with
probability of disease >20%.
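In symbols (my own restatement of the example above): if missing the disease is judged 4 times worse than unnecessary treatment, the harm:benefit ratio is 1:4 and

```latex
\frac{p_t}{1-p_t} = \frac{d-b}{a-c} = \frac{1}{4}
\quad\Longrightarrow\quad
p_t = \frac{1}{1+4} = 20\%.
```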
A simple decision analysis
1. Select a pt
2. Positive test defined as p̂ ≥ pt
3. Count true positives (benefit), false positives (harm)
4. Calculate “Clinical Net Benefit” as:
Net Benefit = TruePositiveCount / n – (FalsePositiveCount / n) × pt / (1 – pt)
Long history: Peirce 1884
Worked example at pt = 20%
N = 2742

                       Negative   True positive   False positive   Net benefit calculation             Net benefit
Biopsy if risk ≥ 20%   346        653             1743             (653 – 1743 × (0.2 ÷ 0.8)) / 2742   0.079
Biopsy all men         0          710             2032             (710 – 2032 × (0.2 ÷ 0.8)) / 2742   0.074
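The arithmetic in this table can be checked with a two-line helper (my own sketch):

```python
def net_benefit(tp: int, fp: int, n: int, pt: float) -> float:
    """Net benefit = TP/n - (FP/n) * pt/(1 - pt)."""
    return tp / n - (fp / n) * pt / (1 - pt)

print(net_benefit(tp=653, fp=1743, n=2742, pt=0.20))   # model at 20%: ~0.079
print(net_benefit(tp=710, fp=2032, n=2742, pt=0.20))   # biopsy all:   ~0.074
```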
Net benefit has simple
clinical interpretation
• Net benefit of 0.079 at pt of 20%
• Using the model is equivalent to a strategy that identified 7.9 cancers per 100 patients, with no unnecessary biopsies
Net benefit has simple
clinical interpretation
• Difference between model and treat all at pt of 20%:
– 5/1000 more TPs for an equal number of FPs
• Divide by the weighting: 0.005 / 0.25 = 0.02
– 20/1000 fewer FPs for an equal number of TPs (= 20/1000 fewer unnecessary biopsies with no missed cancers)
Decision curve analysis
1. Select a pt
2. Positive test defined as p̂ ≥ pt
3. Calculate "Clinical Net Benefit" as:
Net Benefit = TruePositiveCount / n – (FalsePositiveCount / n) × pt / (1 – pt)
4. Vary pt over an appropriate range
Vickers & Elkin Med Decis Making 2006;26:565–574
Decision curve: theory
[Decision curves: net benefit versus threshold probability (%), built up in three panels: the 'treat none' line at net benefit 0; adding the 'treat all' line for an outcome incidence of 50%; and adding the curve for decisions based on the model]
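A sketch of how such decision curves can be produced from the steps above (a hypothetical helper, not the decisioncurveanalysis.org software): compute the net benefit of the model, of treating all, and of treating none over a range of thresholds.

```python
import numpy as np

def decision_curve(y: np.ndarray, p: np.ndarray,
                   thresholds=np.arange(0.01, 0.81, 0.01)):
    """Returns (threshold, NB_model, NB_treat_all, NB_treat_none) tuples."""
    n, prev = len(y), y.mean()
    curve = []
    for pt in thresholds:
        w = pt / (1 - pt)
        pos = p >= pt                               # classified positive by the model
        tp = np.sum((y == 1) & pos)
        fp = np.sum((y == 0) & pos)
        nb_model = tp / n - w * fp / n
        nb_all = prev - w * (1 - prev)              # treat everyone
        curve.append((pt, nb_model, nb_all, 0.0))   # treat none: net benefit 0
    return curve
```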
Points in Decision Curves
• If treat none, NB = ..
• If treat all, and threshold = 0%,
NB = …
• If cut-off is incidence of end point
– NBtreat none = NBtreat all = …
Decision curve analysis
• Decision curve analysis tells us about the
clinical value of a model where accuracy
metrics do not
• Decision curve analysis does not require
either:
– Additional data
– Individualized assessment
• Simple to use software is available to
implement decision curve analysis
www.decisioncurveanalysis.org
Decision analysis in the
medical research literature
• Only a moderate number of papers
devoted to decision analysis
• Many thousands of papers analyzed
without reference to decision making
(ROC curves, p values)
Decision Curve Analysis
• With thanks to….
– Elena Elkin
– Mike Kattan
– Daniel Sargent
– Stuart Baker
– Barry Kramer
– Ewout Steyerberg
Illustrations
Clinical usefulness of testicular cancer model
 Cutoff 70% necrosis / 30% malignant, motivated by
 Decision analysis
 Current practice: ≈ 65%
Net benefit calculations
(Formulas shown for the development set.)

                                                   Development (n=544)   Validation (n=273)
Resect all:   NB = (299 – 3/7 × 245) / 544 =       0.357                 0.602
Resect none:  NB = (0 – 0) / 544 =                 0                     0
Model:        NB = (275 – 3/7 × 143) / 544 =       0.393                 0.602
Difference model – resect all:                     0.036                 0
  = 3.6/100 more resections of tumor (0 at validation),
    at the same number of unnecessary resections of necrosis
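Re-using the net_benefit helper sketched earlier (my own check of the table above, with the 30% cut-off giving w = 3/7):

```python
print(net_benefit(tp=299, fp=245, n=544, pt=0.30))   # resect all:  ~0.357
print(net_benefit(tp=275, fp=143, n=544, pt=0.30))   # model-based: ~0.393
```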
Decision curves for testicular cancer model
Comparison of performance measures
Lessons
1. Clinical usefulness may be limited despite reasonable discrimination
and calibration
Which performance measure when?

It depends …

Evaluation of usefulness requires weighting and consideration of
outcome incidence
Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not
exist. Stat Med. 2000;19(4):431-40.

Summary indices vs graphs
(e.g. area vs ROC curve, validation graphs, decision curves,
reclassification table vs predictiveness curve)
Which performance measure when?
1. Discrimination: if poor, usefulness unlikely, but NB >= 0
2. Calibration: if poor in new setting, risk of NB<0
Conclusions
 Statistical evaluations important, but may be at odds
with evaluation of clinical usefulness; ROC 0.8 good?
0.6 always poor? NO!
 Decision-analytic based performance measures,
such as decision curves, are important to consider in
the evaluation of the potential of a prediction model
to support individualized decision making
References
• Steyerberg EW. Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer, 2009.
• Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 26:565-74, 2006.
• Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Making 28:146, 2008.
• Pencina MJ, D'Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med 30:11-21, 2011.
• Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21:128-38, 2010.
• Steyerberg EW, Pencina MJ, Lingsma HF, Kattan MW, Vickers AJ, Van Calster B. Assessing the incremental value of diagnostic and prognostic markers: a review and illustration. Eur J Clin Invest, 2011.
• Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and markers: evaluation of predictions and classifications. Rev Esp Cardiol 64:788-794, 2011.
Evaluation of incremental value of markers
Case study: CVD prediction
 Cohort: 3264 participants in Framingham Heart Study
 Age 30 to 74 years
 183 developed CHD (10 year risk: 5.6%)
 Data as used in

Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of
a new marker: from area under the ROC curve to reclassification and beyond.
Stat Med 27:157-172, 2008

Steyerberg EW, Van Calster B, Pencina MJ. Performance measures for prediction models and
markers: evaluation of predictions and classifications Rev Esp Cardiol 64:788-794, 2011
Analysis
 Cox proportional hazards models
 Time to event data
 Reference model:
 Dichotomous: Sex, diabetes, smoking
 Continuous: age, systolic blood pressure (SBP), total cholesterol
as continuous
• All hazard ratios statistically significant
• Add high-density lipoprotein (HDL) cholesterol
• continuous predictor, highly significant
(hazard ratio = 0.65, P-value < .001)
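A minimal sketch of this analysis with the lifelines package, assuming a hypothetical data frame `fhs` with follow-up time, a CHD event indicator, and the predictors named above:

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_models(fhs: pd.DataFrame):
    """Fit the reference Cox model and the model with HDL added."""
    base = ["sex", "diabetes", "smoking", "age", "sbp", "total_chol"]
    cols = ["time", "chd"] + base
    reference = CoxPHFitter().fit(fhs[cols], duration_col="time", event_col="chd")
    with_hdl = CoxPHFitter().fit(fhs[cols + ["hdl"]], duration_col="time", event_col="chd")
    print(with_hdl.summary[["exp(coef)", "p"]])   # hazard ratios and p-values
    return reference, with_hdl
```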
How good are these models?
 Performance of reference model
 Incremental value of HDL
Performance criteria

Steyerberg EW, Van Calster B, Pencina MJ. Medidas del rendimiento de modelos de predicción y marcadores pronósticos: evaluación de las predicciones y clasificaciones. Rev Esp Cardiol. 2011. doi:10.1016/j.recesp.2011.04.017
Case study: quality of predictions
Discrimination
Area: 0.762 without HDL vs 0.774 with HDL
Calibration
 Internal: quite good
 External: more relevant
Performance
 Full range of predictions
 ROC
 R2
 ..
 Classifications / decisions
 Cut-off to define low vs high risk
Determine a cut-off for classification
 Data-driven cut-off
 Youden’s index: sensitivity + specificity – 1
 E.g. sens 80%, spec 80%  Youden = …
 E.g. sens 90%, spec 80%  Youden = …
 E.g. sens 80%, spec 90%  Youden = …
 E.g. sens 40%, spec 60%  Youden = …
 E.g. sens 100%, spec 100%  Youden = …
 Youden’s index maximized: upper left corner ROC curve
 If predictions perfectly calibrated
 Upper left corner: cut-off = incidence of the outcome
 Incidence = 183/3264 = 5.6%
Determine a cut-off for classification
 Data-driven cut-off
 Youden’s index: sensitivity + specificity – 1
 Decision-analytic
 Cut-off determined by clinical context
 Relative importance (‘utility’) of the consequence of a true or false
classification
 True-positive classification: correct treatment
 False-positive classification: overtreatment
 True-negative classification: no treatment
 False-negative classification: undertreatment
 Harm: net overtreatment (FP-TN)
 Benefit: net correct treatment (TP-FN)
 Odds of the cut-off = H:B ratio
Evaluation of performance
 Youden index: “science of the method”
 Net Benefit: “utility of the method”
 References:
 Peirce, Science 1884
 Vergouwe, Semin Urol Oncol 2002
 Vickers, MDM 2006
Net Benefit
 Net Benefit = (TP – w FP) / N
w = cut-off/ (1 – cut-off)
 e.g.: cut-off 50%: w = .5/.5=1;
cut-off 20%: w=.2/.8=1/4
 w = H : B ratio
 “Number of true-positive classifications,
penalized for false-positive classifications”
Increase in AUC
• 5.6% cut-off: AUC 0.696 → 0.719
• 20% cut-off: AUC 0.550 → 0.579
Continuous variant
• Area: 0.762 → 0.774
Addition of a marker to a model
 Typically small improvement in discriminative ability according to AUC
(or c statistic)
 c stat blamed for being insensitive
 Study ‘Reclassification’
 Net Reclassification Index:
 improvement in sensitivity + improvement in specificity
= (move up | event – move down | event) +
(move down | non-event – move up | non-event )
[Reclassification table (figure): among events 22/183 = 12%, among non-events –1/3081 = –.03%]
NRI for 5.6% cut-off?
 NRI for CHD: 7/183 = 3.8%
 NRI for No CHD: 24/3081 = 0.8%
 NRI = 4.6%
NRI and sens/spec
 NRI = delta sens + delta spec
 Sens w/out
= 135/183 = 73.8%
 Sens with HDL = 142/183 = 77.6%
NRI better than delta AUC?
 NRI = delta(sens) + delta(spec)
 AUC for binary classification = (sens + spec) / 2
NRI and delta AUC
 NRI = delta(sens) + delta(spec)
 AUC for binary classification = (sens + spec) / 2
 Delta AUC = (delta(sens) + delta(spec)) / 2
 NRI = 2 x delta(AUC)
 Delta(Youden) = delta(sens) + delta(spec)
 NRI = delta(Youden)
NRI has ‘absurd’ weighting?
Decision-analytic performance: NB
 Net Benefit = (TP – w FP) / N
 No HDL model:
 TP = 3+132 = 135
 FP = 166 + 901= 1067
 w = 0.056/0.944 = 0.059
 N = 3264
 NB = (135 – 0.059 x 1067) / 3264 = 2.21%
 With HDL model:
 NB = (142 – 0.059 x 1043) / 3264 = 2.47%
 Delta(NB)
 Increase in TP: 10 – 3 = 7
 Decrease in FP: 166 – 142 = 24
 Increase in NB: (7 + 0.059 x 24) / 3264 = 0.26%
 Interpretation:
 “2.6 more true CHD events identified per 1000 subjects, at the same number of FP
classifications.”
 “ HDL has to be measured in 1/0.26% = 385 subjects to identify one more TP”
Application to FHS
Continuous NRI: no categories
 All cut-offs; information similar to AUC and Decision Curve