models with continuous
covariates
with a practical emphasis on fractional
polynomials and applications in clinical
epidemiology
Professor Patrick Royston,
MRC Clinical Trials Unit, London.
Berlin, April 2005.
8/4/2005
1
The problem …
“Quantifying epidemiologic risk factors
using non-parametric regression: model
selection remains the greatest challenge”
Rosenberg PS et al, Statistics in Medicine 2003; 22:3369-3381
Trivial nowadays to fit almost any model
To choose a good model is much harder
Overview
• Context and motivation
• Introduction to fractional polynomials for the
univariate smoothing problem
• Extension to multivariable models
• More on spline models
• Stability analysis
• Stata aspects
• Conclusions
Motivation
• Often have continuous risk factors in epidemiology
and clinical studies – how to model them?
• Linear model may describe a dose-response
relationship badly
‘Linear’ = straight line = $\beta_0 + \beta_1 X$ + … throughout talk
• Using cut-points has several problems
• Splines recommended by some – but are not ideal
Lack a well-defined approach to model selection
‘Black box’
Robustness issues
Problems of cut-points
• Step-function is a poor approximation to true
relationship
Almost always fits data less well than a suitable
continuous function
• ‘Optimal’ cut-points have several difficulties
Biased effect estimates
Inflated P-values
Not reproducible in other studies
Example datasets
1. Epidemiology
• Whitehall 1
17,370 male Civil Servants aged 40-64 years
Measurements include: age, cigarette smoking,
BP, cholesterol, height, weight, job grade
Outcomes of interest: coronary heart disease, all-cause mortality → logistic regression
Interested in risk as function of covariates
Several continuous covariates
 Some may have no influence in multivariable context
Example datasets
2. Clinical studies
• German breast cancer study group (BMFT-2)
Prognostic factors in primary breast cancer
Age, menopausal status, tumour size, grade, no. of
positive lymph nodes, hormone receptor status
Recurrence-free survival time → Cox regression
686 patients, 299 events
Several continuous covariates
Interested in prognostic model and effect of
individual variables
Example:
Systolic blood pressure vs. age
[Figure: Whitehall 1: BP vs age; systolic BP (axis 50 to 300) against age, years (40 to 65)]
(Systolic BP and age – not
linear)
[Figure: Whitehall 1: BP vs age; BP (axis 125 to 150) against age, years (40 to 65); legend: 95% CI, linear function, FP1 function, running line]
Empirical curve fitting: Aims
• Smoothing
• Visualise relationship of Y with X
• Provide and/or suggest functional form
Some approaches
• ‘Non-parametric’ (local-influence) models
Locally weighted (kernel) fits (e.g. lowess)
Regression splines
Smoothing splines (used in generalized additive models)
• Parametric (non-local influence) models
Polynomials
Non-linear curves
Fractional polynomials
 Intermediate between polynomials and non-linear curves
Local regression models
• Advantages
Flexible – because local!
May reveal ‘true’ curve shape (?)
• Disadvantages
Unstable – because local!
No concise form for models
Therefore, hard for others to use – publication, comparison of results with
those from other models
Curves not necessarily smooth
‘Black box’ approach
Many approaches – which one(s) to use?
Polynomial models
• Do not have the disadvantages of local
regression models, but do have others:
• Lack of flexibility (low order)
• Artefacts in fitted curves (high order)
• Cannot have asymptotes
Fractional polynomial models
• Describe for one covariate, X
multiple regression later
• Fractional polynomial of degree m for X with powers
$p_1, \dots, p_m$ is given by
$\mathrm{FP}m(X) = \beta_1 X^{p_1} + \dots + \beta_m X^{p_m}$
• Powers $p_1, \dots, p_m$ are taken from a special set
$\{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}$
• Usually m = 1 or m = 2 is sufficient for a good fit
FP1 and FP2 models
• FP1 models are simple power transformations
• $1/X^2$, $1/X$, $1/\sqrt{X}$, $\log X$, $\sqrt{X}$, $X$, $X^2$, $X^3$
8 models
• FP2 models are combinations of these
For example $\beta_1 (1/X) + \beta_2 X^2$
28 models
• Note ‘repeated powers’ models
For example $\beta_1 (1/X) + \beta_2 (1/X) \log X$
8 models
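
The model counts above can be checked by enumeration. The following is a minimal Python sketch, not the Stata implementation; the names FP_POWERS and fp_basis are hypothetical. It assumes X > 0, that power 0 stands for log X, and that a repeated power multiplies the previous term by log X.

```python
import math
from itertools import combinations, combinations_with_replacement

# The special set of FP powers; power 0 is taken to mean log(X).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_basis(x, powers):
    """Return the FP terms for a single positive value x.

    A repeated power contributes the previous term multiplied by log(x),
    e.g. powers (-1, -1) give 1/x and (1/x) * log(x).
    """
    terms = []
    prev_power = None
    term = None
    for p in sorted(powers):
        if p == prev_power:
            term = term * math.log(x)
        else:
            term = math.log(x) if p == 0 else x ** p
        terms.append(term)
        prev_power = p
    return terms


fp1_models = [(p,) for p in FP_POWERS]                          # 8 single powers
fp2_distinct = list(combinations(FP_POWERS, 2))                 # 28 pairs of different powers
fp2_repeated = [(p, p) for p in FP_POWERS]                      # 8 repeated-power pairs
fp2_models = list(combinations_with_replacement(FP_POWERS, 2))  # 28 + 8 = 36

print(len(fp1_models), len(fp2_distinct), len(fp2_repeated), len(fp2_models))
```

The 28 distinct pairs plus the 8 repeated-power pairs give the 36 FP2 candidates that are fitted when searching for the best FP2 model.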
FP1 and FP2 models:
some properties
• Many useful curves
• A variety of features are available:
Monotonic
Can have asymptote
Non-monotonic (single maximum or minimum)
Single turning-point
• Get better fit than with conventional
polynomials, even of higher degree
Examples of FP2 curves
- varying powers
[Figure: FP2 curves for powers (-2, 1), (-2, 2), (-2, -2), (-2, -1)]
- single power, different
coefficients
[Figure: FP2 curves for powers (-2, 2) with different coefficients; Y (axis -4 to 4) against x (10 to 50)]
A philosophy of function
selection
• Prefer simple (linear) model
• Use more complex (non-linear) FP1 or FP2
model if indicated by the data
• Contrast to local regression modelling
Already starts with a complex model
Estimation and significance
testing for FP models
• Fit model with each combination of powers
FP1: 8 single powers
FP2: 36 combinations of powers
• Choose model with lowest deviance (MLE)
• Comparing FPm with FP(m − 1):
compare deviance difference with $\chi^2$ on 2 d.f.
1 d.f. for power, 1 d.f. for regression coefficient
supported by simulations; slightly conservative
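
The "fit each power, keep the lowest deviance" step can be sketched for an FP1 model as follows. This is an illustrative Python sketch, not the Stata routine. It assumes a normal-errors model, for which the power with the smallest residual sum of squares is also the one with the lowest deviance; the names fp_term, rss_simple_regression and best_fp1 are hypothetical.

```python
import math

FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """FP power transformation of a positive x; power 0 means log(x)."""
    return math.log(x) if p == 0 else x ** p


def rss_simple_regression(t, y):
    """Residual sum of squares from least-squares regression of y on t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)
    # Guard against tiny negative values caused by rounding.
    return max(syy - sxy ** 2 / sxx, 0.0)


def best_fp1(x, y):
    """Fit all 8 FP1 models and return (best power, RSS for every power)."""
    rss = {}
    for p in FP_POWERS:
        t = [fp_term(xi, p) for xi in x]
        rss[p] = rss_simple_regression(t, y)
    return min(rss, key=rss.get), rss


# Toy data (an assumption for illustration): an exact logarithmic relationship.
x = [float(i) for i in range(1, 21)]
y = [2.0 + 3.0 * math.log(xi) for xi in x]
best_power, rss = best_fp1(x, y)
print(best_power)
```

For logistic or Cox models the same loop applies, with the deviance of each fitted model in place of the residual sum of squares.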
Selection of FP function
• Has flavour of a closed test procedure
• Use $\chi^2$ approximations to get P-values
• Define nominal P-value for all tests (often 5%)
• Fit linear and best FP1 and FP2 models
• Test FP2 vs. null – test of any effect of X ($\chi^2$ on 4 df)
• Test FP2 vs. linear – test of non-linearity ($\chi^2$ on 3 df)
• Test FP2 vs. FP1 – test of more complex function against simpler one ($\chi^2$ on 2 df)
Example: Systolic BP and age

Model           d.f.   Deviance difference   P-value
FP2 v Null      4      944.57                0.000
FP2 v Linear    3      29.95                 0.000
FP2 v FP1       2      3.29                  0.2

Reminder:
FP1 had power 3: $\beta_1 X^3$
FP2 had powers (1,1): $\beta_1 X + \beta_2 X \log X$
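
The three tests can be strung together and applied to the deviance differences of the systolic BP example. The sketch below is Python using only the standard library, with the $\chi^2$ tail probability written in closed form for 1 to 4 d.f. The stopping rule (stop at the first non-significant test and keep the simpler model) is an assumption about how the sequence is used, and the function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable, for df = 1, 2, 3 or 4."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(h))
    if df == 2:
        return math.exp(-h)
    if df == 3:
        return math.erfc(math.sqrt(h)) + math.sqrt(2.0 * x / math.pi) * math.exp(-h)
    if df == 4:
        return math.exp(-h) * (1.0 + h)
    raise ValueError("only df = 1, 2, 3, 4 are supported in this sketch")


def select_fp(dev_null, dev_linear, dev_fp1, dev_fp2, alpha=0.05):
    """Choose among null, linear, FP1 and FP2 from the four model deviances."""
    if chi2_sf(dev_null - dev_fp2, 4) >= alpha:
        return "null"    # no evidence of any effect of X
    if chi2_sf(dev_linear - dev_fp2, 3) >= alpha:
        return "linear"  # no evidence of non-linearity
    if chi2_sf(dev_fp1 - dev_fp2, 2) >= alpha:
        return "FP1"     # FP2 not significantly better than FP1
    return "FP2"


# Deviance differences from the table, expressed relative to the FP2 deviance
# (setting the FP2 deviance to 0 is only a convenience for this illustration).
choice = select_fp(dev_null=944.57, dev_linear=29.95, dev_fp1=3.29, dev_fp2=0.0)
print(choice, round(chi2_sf(3.29, 2), 2))
```

With these numbers the FP2 v FP1 test gives P of about 0.19 (reported as 0.2 in the table), so the FP1 function is retained.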
Aside: FP versus spline
• Why care about FPs when splines are more
flexible?
• More flexible → more unstable
More chance of ‘over-fitting’
• In epidemiology, dose-response relationships
are often simple
• Illustrate by small simulation example
FP versus spline (continued)
• Logarithmic relationships are common in practice
• Simulate regression model $y = \beta_0 + \beta_1 \log(X) + \text{error}$
• Error is normally distributed $N(0, \sigma^2)$
• Take $\beta_0 = 0$, $\beta_1 = 1$; X has lognormal distribution
• Vary $\sigma = \{1, 0.5, 0.25, 0.125\}$
• Fit FP1, FP2 and spline with 2, 4, 6 d.f.
• Compute mean square error
• Compare with mean square error for true model
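
The data-generating step and the reference fit of the true model can be sketched as below. This is Python for illustration only. The sample size, the lognormal parameters (log X taken as standard normal) and the definition of mean square error (mean squared difference between fitted and true curve at the observed X) are assumptions, since the slides do not state them. The FP and spline fits are not reproduced here.

```python
import math
import random


def simulate(n, sigma, rng):
    """Draw (x, y) with y = 0 + 1 * log(x) + N(0, sigma^2) error, x lognormal."""
    x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    y = [math.log(xi) + rng.gauss(0.0, sigma) for xi in x]
    return x, y


def fit_log_model(x, y):
    """Least-squares fit of the true model form y = b0 + b1 * log(x)."""
    t = [math.log(xi) for xi in x]
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    return b0, b1


def mse_vs_truth(x, b0, b1):
    """Mean squared difference between fitted curve and true curve log(x)."""
    return sum((b0 + b1 * math.log(xi) - math.log(xi)) ** 2 for xi in x) / len(x)


rng = random.Random(2005)  # fixed seed so the sketch is repeatable
results = {}
for sigma in (1, 0.5, 0.25, 0.125):
    x, y = simulate(200, sigma, rng)  # n = 200 is an assumed sample size
    b0, b1 = fit_log_model(x, y)
    results[sigma] = mse_vs_truth(x, b0, b1)
```

The values in results play the role of the "true model" benchmark against which the errors of FP1, FP2 and the spline fits would be compared.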
FP vs. spline (continued)
[Figure: simulated data, y against x, in four panels: Sigma = 1, Sigma = 0.5, Sigma = 0.25, Sigma = 0.125]
FP vs. spline (continued)
[Figure: FP1 and spline with 2 df, four panels. Solid: FP1; dashed: spline 2 df]
FP vs. spline (continued)
[Figure: FP2 and spline with 4 df, four panels]
FP vs. spline (continued)
[Figure: FP vs. spline: prediction error (axis 0 to .12) against sigma (.125, .25, .5, 1) for True, FP1, FP2, Spline 2df, Spline 4df, Spline 6df]
FP vs. spline (continued)
• In this example, spline usually less accurate
than FP
• FP2 less accurate than FP1 (over-fitting)
• FP1 and FP2 more accurate than splines
• Splines often had non-monotonic fitted curves
Could be medically implausible
• Of course, this is a special example
Multivariable FP (MFP) models
• Assume have k > 1 continuous covariates and
perhaps some categoric or binary covariates
• Allow dropping of non-significant variables
• Wish to find best multivariable FP model for
all X’s
• Impractical to try all combinations of powers
• Require iterative fitting procedure
Fitting multivariable FP models
(MFP algorithm)
• Combine backward elimination of weak
variables with search for best FP functions
• Determine fitting order from linear model
• Apply FP model selection procedure to each X
in turn
fixing functions (but not $\beta$’s) for other X’s
• Cycle until FP functions (i.e. powers) and
variables selected do not change
Example: Prognostic factors in
breast cancer
• Aim to develop a prognostic index for risk of
tumour recurrence or death
• Have 7 prognostic factors
4 continuous, 3 categorical
• Select variables and functions using 5%
significance level
Univariate linear analysis

Variable   Name                            $\chi^2$
X1         Age                             0.58
X2         Menopausal status               0.28
X3         Tumour size                     15.68
X4a        Grade 2 or 3                    19.92
X4b        Grade 3                         8.19
X5         No. of positive lymph nodes     50.02
X6         Progesterone receptor status    34.04
X7         Oestrogen receptor status       4.70
Univariate FP2 analysis

Variable    Powers      $\chi^2$   d.f.   P         Gain
X1 age      (2, 0.5)    17.61      4      0.001     17.03
X3 size     (1, 3)      19.81      4      0.001     4.13
X5 nodes    (1, 2)      81.36      4      < 0.001   31.34
X6 PgR      (0.5, 0)    52.73      4      < 0.001   18.69
X7 ER       (2, 1)      23.07      4      < 0.001   18.37

Gain compares FP2 with linear on 3 d.f.
All factors except for X3 have a non-linear effect
Multivariable FP analysis

Variable        FP etc.     $\chi^2$   d.f.   P
X1 age          (2, 0.5)    19.33      4      0.001
X3 size         Out         5.31       4      0.3
X5 nodes        (2, 1)      74.14      4      <0.001
X6 PgR          0.5         32.70      4      <0.001
X7 ER           Out         2.15       4      0.7
X2 mens.        Out         0.21       1      0.6
X4a grad 2/3    In          4.59       1      0.03
X4b grad 3      Out         0.15       1      0.7
Comments on analysis
• Conventional backwards elimination at 5%
level selects X4a, X5, X6, and X1 is excluded
• FP analysis picks up same variables as
backward elimination, and additionally X1
• Note considerable non-linearity of X1 and X5
• X1 has no linear influence on risk of
recurrence
• FP model detects more structure in the data
than the linear model
Plots of fitted FP functions
[Figure: Breast cancer: Fitted FP functions. Three panels of log relative hazard: Age (against age, years), Nodes (against no. of positive lymph nodes), Progesterone receptor (against progesterone receptor status)]
Survival by risk groups
[Figure: Prognostic classification scheme; recurrence-free survival (0.00 to 1.00) against time, yr (0 to 8), for low, medium and high risk groups]
Robustness of FP functions
• Breast cancer example showed non-robust
functions for nodes – not medically sensible
• Situation can be improved by performing
covariate transformation before FP analysis
• Can be done systematically (work in progress)
• Sauerbrei & Royston (1999) used negative
exponential transformation of nodes
exp(–0.12 * number of nodes)
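
A short Python sketch of this pre-transformation follows, to show why it tames extreme node counts. The function name transform_nodes is hypothetical; the rate 0.12 is the value quoted above.

```python
import math


def transform_nodes(n_nodes, rate=0.12):
    """Negative exponential transformation exp(-rate * number of nodes)."""
    return math.exp(-rate * n_nodes)


# The transformed value lies in (0, 1]: 0 nodes maps to 1, and large counts
# are squeezed towards 0, so a few extreme counts have little leverage.
for n in (0, 1, 5, 10, 30, 50):
    print(n, round(transform_nodes(n), 3))
```

The FP analysis is then carried out on the transformed variable instead of the raw node count.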
Making the function for lymph
nodes more robust
[Figure: fitted function for lymph nodes (axis -.5 to 1.5) against no. of positive lymph nodes (0 to 50); curves: Original, Exponential transformation]
2nd example: Whitehall 1
MFP analysis

Covariate            FP etc.
Age                  Linear
Cigarettes           0.5
Systolic BP          -1, -0.5
Total cholesterol    Linear
Height               Linear
Weight               -2, 3
Job grade            In

No variables were eliminated by the MFP algorithm
Weight is eliminated by linear backward elimination
Plots of FP functions
[Figure: Whitehall 1: multivariable FP analysis. Six panels of probability of death: Age (against age at entry), Cigarettes (against cigarettes/day), Systolic BP (against systolic BP), Total cholesterol (against cholesterol, mmol/l), Height (against height, cm), Weight (against weight, kg)]
A new multivariable regression
algorithm with spline functions
• Inspired by closed test procedure for selecting an FP
function
• Start with predefined number of knots
Determines maximum complexity of function
• Use predetermined knot positions
E.g. at fixed percentile positions of distn. of x
• Simplest function (default) is linear
• Closed test procedure to reduce the knot set if some
knots are not significant
• Apply backfitting procedure as in mfp
• Implemented in Stata as new command mrsnb
Splines: Breast cancer example
• Selects variables similar to mfp
Grade 2/3 omitted, otherwise selected variables
are identical
• Knots: age(46, 53); transformed nodes(linear);
PgR(7, 132)
• Deviance of selected model almost identical to
mfp model
Plots of fitted FP functions
[Figure: three panels of log HR against age, years; no. of positive lymph nodes; progesterone receptor status. Solid lines, FP; dashed lines, spline]
Improving the robustness of
spline models
• Often have covariates with positively skew
distributions – can produce curve artefacts
• Simple approach is to log-transform covariates
with a skew distribution – e.g. skewness $\gamma_1 > 0.5$
• Then fit the spline model
• In the breast cancer example, this approach
gives a more satisfactory log function for PgR
Stability of FP models
• Models (variables, FP functions) selected by
statistical criteria – cut-off on P-value
• Approach has several advantages …
• … and also is known to have problems
Omission bias
Selection bias
Unstable – many models may fit equally well
Stability investigation
• Instability may be studied by bootstrap resampling
(sampling with replacement)
Take bootstrap sample B times
Select model by chosen procedure
Count how many times each variable is selected
Summarise inclusion frequencies & their dependencies
Study fitted functions for each covariate
• May lead to choosing several possible models, or a
model different from the original one
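
The resampling loop can be sketched in Python as below. It is not the mfpboot command. The selection procedure is passed in as a function, and toy_select is a hypothetical stand-in for a real procedure such as MFP. Only variable inclusion is counted here; studying the fitted functions would need the fitted models as well.

```python
import random
from collections import Counter


def bootstrap_inclusion(rows, select_variables, n_boot=200, seed=0):
    """Bootstrap inclusion frequencies for a model selection procedure.

    rows: list of observations; select_variables: function taking a sample
    and returning the names of the variables it selects.
    """
    rng = random.Random(seed)
    n = len(rows)
    var_counts = Counter()
    model_counts = Counter()
    for _ in range(n_boot):
        # Sample n observations with replacement.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        selected = frozenset(select_variables(sample))
        var_counts.update(selected)
        model_counts[selected] += 1
    freqs = {v: c / n_boot for v, c in var_counts.items()}
    return freqs, model_counts


def toy_select(sample):
    """Hypothetical selector: always keeps x1, keeps x2 if the sample mean > 0.5."""
    selected = {"x1"}
    if sum(sample) / len(sample) > 0.5:
        selected.add("x2")
    return selected


rows = [0.0, 1.0] * 10
freqs, model_counts = bootstrap_inclusion(rows, toy_select, n_boot=200, seed=1)
print(freqs, len(model_counts))
```

The number of keys in model_counts corresponds to the number of different models found across replications; freqs gives the inclusion frequency of each variable.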
Bootstrap stability analysis of
the breast cancer dataset
• 5000 bootstrap samples taken (!)
• MFP algorithm with Cox model applied to
each sample
• Resulted in 1222 different models (!!)
• Nevertheless, could identify stable subset
consisting of 60% of replications
Judged by similarity of functions selected
Bootstrap stability analysis of
the breast cancer dataset

Variable                  Model selected   % bootstraps model selected
Age                       FP1              16
                          FP2              76
Menopausal status         -                20
Tumour size               FP1              34
                          FP2              6
Grade 2/3                 -                58
Grade 3                   -                9
Lymph nodes               FP1              100
Progesterone receptors    FP1              95
                          FP2              4
Oestrogen receptors       FP1              13
                          FP2              6
of fitted curves from stable
subset
[Figure: four panels of log relative hazard against age, years; tumour size, mm; PgR, fmol/L; number of positive lymph nodes]
Presentation of models for
continuous covariates
• The function + 95% CI gives the whole story
• Functions for important covariates should
always be plotted
• In epidemiology, sometimes useful to give a
more conventional table of results in
categories
• This can be done from the fitted function
Example: Cigarette smoking and
all-cause mortality (Whitehall 1)

Cigarettes per day            Number               OR (model based)
Range           Ref. point    At risk    Dying     Estimate    95% CI
0 (referent)    0             10103      690       1.00        -
1-10            5             2254       243       1.69        1.59, 1.80
11-20           15            3448       494       2.25        2.04, 2.49
21-30           25            1117       185       2.60        2.31, 2.91
31-40           35            283        48        2.86        2.52, 3.24
41-50           45            43         8         3.07        2.68, 3.52
51-60           55            12         2         3.25        2.82, 3.75
Other issues (1)
• Handling continuous confounders
May use a larger P-value for selection e.g. 0.2
Not so concerned about functional form here
• Binary/continuous covariate interactions
Can be modelled using FPs (Royston & Sauerbrei
2004)
Adjust for other factors using MFP
Other issues (2)
• Time-varying effects in survival analysis
Can be modelled using FP functions of time
(Berger; also Sauerbrei & Royston, in progress)
• Checking adequacy of FP functions
May be done by using splines
Fit FP function and see if spline function adds
anything, adjusting for the fitted FP function
Stata aspects
• Command mfp is part of Stata 8
• Example of use:
    mfp stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
• Command mrsnb is available from PR
• Example of use:
    mrsnb stcox x1 x2 x3 x4a x4b x5 x6 x7 hormon, select(0.05, hormon:1)
• Command mfpboot is available from PR
Does bootstrap stability analysis of MFP models
Concluding remarks (1)
• FP method in general
No reason (other than convention) why regression models
should include only positive integer powers of covariates
FP is a simple extension of an existing method
Simple to program and simple to explain
Parametric, so can easily get predicted values
FP usually gives better fit than standard polynomials
Cannot do worse, since standard polynomials are included
Concluding remarks (2)
• Multivariable FP modelling
Many applications in general context of multiple
regression modelling
Well-defined procedure based on standard
principles for selecting variables and functions
Aspects of robustness and stability have been
investigated (and methods are available)
Much experience gained so far suggests that
method is very useful in clinical epidemiology
Some references
• Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 43: 429-467.
• Royston P, Altman DG (1997) Approximating statistical functions by using fractional polynomial regression. The Statistician 46: 1-12.
• Sauerbrei W, Royston P (1999) Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. JRSS(A) 162: 71-94. Corrigendum JRSS(A) 165: 399-400, 2002.
• Royston P, Ambler G, Sauerbrei W (1999) The use of fractional polynomials to model continuous risk variables in epidemiology. International Journal of Epidemiology 28: 964-974.
• Royston P, Sauerbrei W (2004) A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Statistics in Medicine 23: 2509-2525.
• Royston P, Sauerbrei W (2003) Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Statistics in Medicine 22: 639-659.
• Armitage P, Berry G, Matthews JNS (2002) Statistical Methods in Medical Research. Oxford, Blackwell.