Regression_checking the model

Statistics for Health Research
Regression:
Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of session
• Recognise the need to check fit of
the model
• Carry out checks of assumptions in
SPSS for simple linear regression
• Understand predictive model
• Understand residuals
How is the fitted line
obtained?
Use method of least squares (LS)
Seek to minimise the sum of squared vertical
differences between each point and the
fitted line
Results in parameter estimates or
regression coefficients of slope (b)
and intercept (a): y = a + bx
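The least squares estimates have a closed form for a single predictor. The sketch below shows one way to compute them; the function name `least_squares` and the toy data are hypothetical and are not taken from the LDL data set.

```python
from statistics import mean

def least_squares(x, y):
    """Return (a, b) for the line y = a + bx that minimises the
    sum of squared vertical differences to the points."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical toy data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares(x, y)
print(f"fitted line: y = {a:.3f} + {b:.3f}x")
```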
[Figure: fitted line y = a + bx, with Dependent (y) on the vertical axis, Explanatory (x) on the horizontal axis, and the intercept a marked]
Consider the regression of minimum LDL
cholesterol achieved on age
• Select Regression
Linear….
• Dependent (y) – Min LDL achieved
• Independent (x) - Age_Base
Output from SPSS linear
regression
Coefficients(a)
                   Unstandardized Coefficients   Standardized Coefficients
Model 1            B        Std. Error           Beta                        t        Sig.
(Constant)         2.024    .105                                             19.340   .000
Age at baseline    -.008    .002                 -.121                       -4.546   .000
a. Dependent Variable: Min LDL achieved
N.B. -0.008 may look very small but
represents:
The DECREASE in LDL achieved for each
one-unit increase in age, i.e. ONE year
H0 : slope b = 0
Test t = slope/se = -0.008/0.002 = -4.546 (SPSS computes this from the unrounded estimates) with
p<0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 × Age
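The test of the slope can be checked by hand. The sketch below is an illustration only: it uses the rounded table values, and it approximates the t distribution by the standard normal, which is an assumption made here on the grounds that the sample is large. The function name `two_sided_p_normal` is hypothetical.

```python
import math

def two_sided_p_normal(t):
    """Two-sided p-value using the standard normal as an
    approximation to the t distribution (large-sample assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

slope, se = -0.008, 0.002   # rounded values from the Coefficients table
t_rounded = slope / se      # what the rounded figures give
t_spss = -4.546             # t as reported in the SPSS output

p_value = two_sided_p_normal(t_spss)
print(f"t from rounded values: {t_rounded:.1f}")
print(f"approximate p-value:   {p_value:.2e}")
print("significant at 0.001: ", p_value < 0.001)
```

The rounded figures give a t of about -4, while SPSS reports -4.546 because it works with more decimal places; either way the p-value is well below 0.001.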
Prediction Equation from linear
regression
Predicted LDL achieved = 2.024 - 0.008 × Age
So for a man aged 65 the predicted LDL
achieved = 2.024 - 0.008 × 65 = 1.504
Age    Predicted Min LDL
45     1.664
55     1.584
65     1.504
75     1.424
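The prediction table can be reproduced directly from the fitted equation. This is a minimal sketch; the function name `predicted_min_ldl` is hypothetical, and the coefficients are the rounded values from the SPSS output.

```python
def predicted_min_ldl(age, intercept=2.024, slope=-0.008):
    """Predicted minimum LDL achieved for a given age at baseline,
    using the rounded coefficients from the SPSS output."""
    return intercept + slope * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
```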
Assumptions of Regression
1. Relationship is linear
2. Outcome variable and hence
residuals or error terms are approx.
Normally distributed
Use Graphs and Scatterplot to
obtain the Lowess line of fit
1. Create Scatterplot and then
double-click to enter chart
editor
2. Choose icon ‘Add fit line at
total’
3. Then select type of fit such
as Lowess
Linear assumption: Fitted
lowess smoothed line
Lowess smoothed line (red) gives a good visual check of the
linear assumption when compared with the fitted straight line (green)
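To show what a lowess-type smoother does, the sketch below fits a weighted straight line around each point using tricube weights. It is a simplified illustration, not the SPSS implementation: the function name `lowess`, the `frac` parameter, the absence of robustness iterations, and the data are all assumptions made here.

```python
def lowess(x, y, frac=0.5):
    """Simplified lowess: at each x, fit a weighted straight line to the
    nearest points (tricube weights) and return the fitted values."""
    n = len(x)
    k = max(2, int(round(frac * n)))        # points in each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12           # bandwidth: k-th nearest distance
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
        yw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
        sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xw) * (yi - yw)
                  for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx if sxx > 0 else 0.0   # local slope
        fitted.append(yw + b * (x0 - xw))
    return fitted

# Hypothetical data: a curved relationship that a straight line would miss
ages = [40, 45, 50, 55, 60, 65, 70, 75, 80]
ldl = [1.9, 1.7, 1.6, 1.52, 1.5, 1.52, 1.6, 1.7, 1.9]
smooth = lowess(ages, ldl, frac=0.5)
print([round(v, 2) for v in smooth])
```

If the smoothed values follow the straight-line fit closely, the linear assumption looks reasonable; a clear bend, as in this made-up example, suggests another functional form.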
Definition of a residual
A residual is the difference between
the actual value and the predicted value (fitted line),
i.e. the unexplained variation
r_i = y_i – E(y_i)
Or
r_i = y_i – (a + b x_i)
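As a small worked example of the definition, the sketch below computes residuals as observed minus predicted. The intercept and slope are the rounded values from the SPSS output; the three observations and the function name `residuals` are hypothetical.

```python
def residuals(x, y, a, b):
    """Residual for each point: observed y minus the fitted value a + b*x."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

a, b = 2.024, -0.008            # rounded coefficients from the SPSS output
ages = [45, 55, 65]             # hypothetical subjects
observed = [1.9, 1.5, 1.2]      # hypothetical min LDL values

res = residuals(ages, observed, a, b)
for age, r in zip(ages, res):
    print(age, round(r, 3))
```

A positive residual means the subject's LDL is above the fitted line and a negative residual means it is below.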
Residuals
To assess the residuals in SPSS
linear regression, select Plots…
• Plot the normalised (standardised) residual against the normalised (standardised) predicted value of LDL
• Select histogram of residuals and normal probability plot
In SPSS linear regression, select
Statistics…
• Model fit
• Select confidence intervals for regression coefficients
• Select Durbin-Watson for serial correlation, and casewise diagnostics for identification of outliers
Output:
Scatterplot of residuals vs. predicted
Note
1) Mean of residuals = 0
2) Most of data lie within ±3 SDs of mean
Output:
Histogram of standardised residuals
Plot of residuals with normal curve superimposed
Output:
Cumulative probability plot
Look for deviation from diagonal line to indicate non-normality
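The idea behind a cumulative probability (P-P) plot can be sketched as follows: each standardised residual's observed cumulative proportion is paired with the cumulative probability expected under a normal distribution. The plotting position (i + 0.5)/n is an assumption chosen here for simplicity and may differ from the SPSS default; the function name `pp_points` and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def pp_points(resid):
    """Return (observed, expected) cumulative probabilities for a
    normal P-P plot of the residuals."""
    m, s = mean(resid), stdev(resid)
    z = sorted((r - m) / s for r in resid)          # standardise and sort
    n = len(z)
    observed = [(i + 0.5) / n for i in range(n)]    # assumed plotting position
    expected = [NormalDist().cdf(zi) for zi in z]   # normal cumulative prob.
    return list(zip(observed, expected))

# Hypothetical residuals (illustration only)
points = pp_points([-2.0, -1.0, 0.0, 1.0, 2.0])
for obs, exp in points:
    print(round(obs, 2), round(exp, 2))
```

Points close to the diagonal (observed roughly equal to expected) are consistent with normality; systematic departures are not.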
Output:
Description of residuals
Descriptive statistics for residuals
Residuals Statistics(a)
                       Minimum     Maximum     Mean        Std. Deviation   N
Predicted Value        1.314867    1.843205    1.556478    .0878548         1383
Residual               -1.65389    4.0658469   .0000000    .7181448         1383
Std. Predicted Value   -2.750      3.264       .000        1.000            1383
Std. Residual          -2.302      5.660       .000        1.000            1383
a. Dependent Variable: Min LDL achieved
Worth
investigation?
Subjects with standardised
residuals > 3
Casewise Diagnostics(a)
Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933
a. Dependent Variable: Min LDL achieved
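The standardised residuals in the casewise output can be reproduced by dividing each raw residual by the standard error of the estimate (0.7184048 in the Model Summary output). The sketch below does this for the first three listed cases and flags those beyond 3; the function and variable names are hypothetical.

```python
SE_ESTIMATE = 0.7184048   # "Std. Error of the Estimate" from the Model Summary

# Case number -> raw residual, first three rows of the casewise output
raw_residuals = {164: 4.0658471, 209: 3.1573148, 250: 2.2581750}

def standardised(residual, se=SE_ESTIMATE):
    """Standardised residual: raw residual divided by the SE of the estimate."""
    return residual / se

flagged = {case: standardised(r)
           for case, r in raw_residuals.items()
           if abs(standardised(r)) > 3}

for case, z in flagged.items():
    print(case, round(z, 3))
```

The rounded results agree with the Std. Residual column, which suggests the listed cases are simply those more than 3 standard errors from the fitted line.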
Output:
Model fit and serial correlation
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline
R: correlation between min LDL achieved and Age at
baseline, here 0.121 (reported as a magnitude; the association is negative, as the slope shows)
R²: % variation explained, here 1.5%, not particularly
high
Durbin-Watson statistic: measures serial correlation of residuals;
should be approximately 2 if no serial correlation (here 2.034)
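The two summary quantities above can be illustrated numerically. The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, and with one predictor R² is the square of R. The residual sequences below are hypothetical and the function name `durbin_watson` is chosen for illustration.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic for residuals taken in their given order."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Hypothetical residual sequences (illustration only)
alternating = [1.0, -1.0, 1.0, -1.0]   # negative serial correlation
trending = [-2.0, -1.0, 1.0, 2.0]      # positive serial correlation

dw_alt = durbin_watson(alternating)
dw_trend = durbin_watson(trending)
print(dw_alt, dw_trend)

# R squared from R in the Model Summary (single predictor)
r = 0.121
r_squared = r ** 2
print(round(r_squared, 3))
```

Values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation, so the reported 2.034 gives no cause for concern.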
Summary
After fitting any regression model check
assumptions:
• Functional form: linearity is the default but is
often not the best fit; consider quadratic…
• Check residuals for approx. normality
• Check residuals for outliers (> 3 SDs)
• All accomplished within SPSS
Practical on Model Checking
Read in ‘LDL Data.sav’
1) Fit an age squared term in the min LDL model and
check fit of the model compared with the linear fit
(Hint: Use Transform/Compute to create the age
squared term and fit age and age²)
2) Fit separate linear regressions of min
Chol achieved on each of the predictors a) baseline
Chol b) APOE_lin c) adherence
Check assumptions and interpret results
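For readers working outside SPSS, the first exercise can be sketched as follows: create the squared term, fit the model with and without it, and compare R². This is an illustration under stated assumptions, not a solution to the practical: the data are made up (constructed to be exactly quadratic), and the helper names `solve`, `fit` and `r_squared` are hypothetical.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def fit(columns, y):
    """Least squares coefficients for y on an intercept plus the columns."""
    X = [[1.0] + list(row) for row in zip(*columns)]
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def r_squared(columns, y, coef):
    """Proportion of variation in y explained by the fitted model."""
    y_bar = sum(y) / len(y)
    fitted = [coef[0] + sum(c * v for c, v in zip(coef[1:], row))
              for row in zip(*columns)]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data, constructed to follow an exact quadratic in age
age = [40, 50, 60, 70, 80]
ldl = [1.9, 1.6, 1.5, 1.6, 1.9]
age_sq = [a ** 2 for a in age]          # the "compute" step

linear = fit([age], ldl)
quadratic = fit([age, age_sq], ldl)
r2_linear = r_squared([age], ldl, linear)
r2_quadratic = r_squared([age, age_sq], ldl, quadratic)
print(round(r2_linear, 3), round(r2_quadratic, 3))
```

In this made-up example the linear model explains essentially none of the variation while the quadratic model explains all of it; with real data the gain from the squared term would be judged alongside the residual checks described above.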