Name der Abteilung/Unit maximal 3 Zeilen können hier plaziert
Download
Report
Transcript Name der Abteilung/Unit maximal 3 Zeilen können hier plaziert
Department of Epidemiology and Public Health
Unit of Biostatistics and Computational Sciences
Ordinary linear regression
PD Dr. C. Schindler
Swiss Tropical and Public Health Institute
University of Basel
[email protected]
Annual meeting of the Swiss Societies of Clinical Neurophysiology
and of Neurology, Lugano, May 3rd 2012
Example:
Association between
blood volume
and
body weight
in women
Question:
How does the mean of blood volume depend on
body weight in women?
The regression line
y = 893 + 45.7 · x
y = 893 + 45.7 · 70 = 4092
In this example, the regression line describes the
mean of blood volume of women as a function of
weight.
In general:
The regression line describes the mean of the
dependent variable Y* as a function of the
independent variable X**.
* syn. outcome variable
** syn. explanatory or predictor variable
Regression equation and regression parameters
y
y=a+b·x
Dy
b = Dy / Dx
Dx
a
0
0
x
Regression parameters
a = intercept = y-value of the line at x = 0
b = slope of the line = change in y, if x increases by one unit
The values of the parameters must be determined
from empirical data.
They are estimates of the respective true parameter values
at the population level.
Therefore, they are referred to as parameter estimates.
Interpretation of parameter estimates
a^ = estimated intercept = 893 ml:
for a weight of 0 kg, a blood volume of 893 ml would be
predicted.
Of course, this interpretation does not make sense, since
valid predictions can only be made for values of weight
between 50 and 80 kg (range of observed values)
b^= estimated slope = 45.7 ml/kg:
According to this model, the mean of blood volume in women
is supposed to increase by 45.7 ml with each additional kg of
weight.
Note: a and b denote the parameters of the true regression line
at the population level.
Residuals and predicted values
Residual = deviation of the
observed value of the dependent
variable (here: blood volume) from
the value which the model predicts
for the respective value of the
independent variable (here: weight)
(-> predicted value).
Residual plot
Definition and properties of the regression line
1. Among all possible lines, the regression line
stands out as the one with the smallest possible
variance of the residuals.
2. The regression line always runs through the point
(mean of X, mean of Y)
i.e., for the mean of the independent variable,
the regression line always predicts the mean of the
dependent variable.
Regression output of a statistics program (SPSS)
Coefficientsa
Unstandardized Coefficients
B1
Model
1
Standardized
Coefficients
Std. Error
(Constant) 2
893.253
369.827
Gewicht ) 3
45.682
5.846
Beta
t
.777
Sig.
2.415
.020
7.815
.000
a. Dependent Variable: blutvol
The rightmost column (Sig) contains the p-values of the two parameter
estimates. They refer to the deviation of these estimates from 0.
The t-value (4. column) equals the ratio between the parameter estimate
(B) and its standard error (Std. Error). The standardized coefficient
equals Pearson’s correlation coefficient.
1parameter
estimate, 2intercept a, 3slope b
Slope
With a p-value < 0.0001, the deviation of the estimated slope
from 0 is highly significant.
The hypothesis, that the slope of the true regression line be 0 can
therefore be rejected at the usual significance level of 0.05 (in
fact even at a significance level of 0.0001).
*If the true slope is 0 then two situations are possible:
a) the mean of Y does not depend on X at all
or
b) the mean of Y depends on X in a specific non-linear way
(see next slide)
y = 0.1 · x2
y
y=0·x+8
x
Here b = 0. The mirror symmetriy of the curve with respect
to the vertical axis at x = 0 forces the regression line to run
horizontally.
Intercept
With a p-value < 0.05, the deviation of the estimated intercept
from 0 is statistically significant as well at the usual level of 0.05.
Therefore, the hypothesis that the true regression line pass through
the origin of the coordinate system, can also be rejected at the
usual level of 0.05.
Approximate 95%-confidence interval of the slope
(Parameter estimate ± 2 standard error)
45.7 ± 2 · 5.8 = (34.1, 57.3)
We can be 95% confident that the slope of the regression line
at the population level lies between 34.1 and 57.3 ml/kg.
It is thus quite certain that the true regression slope is higher
than 30 and lower than 60 ml/kg.
Other important parameters of a regression model (SPSS)
Proportion of variance of Y, which is explained by the model
Model Summary
Model
1
R
R Square
.777a
Adjusted R
Square
.604
Std. Error of the
Estimate
.594
308.45008
a. Predictors: (Constant), Gewicht
Standard deviation of residuals
Coefficientsa
Standardized
Unstandardized Coefficients Coefficients
Model
1
B
(Constant)
Gewicht
a. Dependent Variable: blutvol
Std. Error
893.253
369.827
45.682
5.846
Beta
t
.777
Sig.
2.415
.020
7.815
.000
Decomposition of total variance
Total variance = variance of predicted values + variance of residuals
explained variance
unexplained variance
Total variance = sum of squared deviations of the individual
values of Y from their mean value.
Variance of residuals = sum of squared residuals
(“residual sum of squares”)
Variance of predicted values = sum of squared deviations of
the predicted values of Y from the
sample mean of Y.
R2-value (or measure of determination) of the model
explained variance* =
total variance*
total variance* - unexplained variance*
total variance*
* of Y
Note:
R2 = 1
The data are completely explained by the model,
i.e., all the points lie on the regression line.
R2 = 0
slope of the regression line = 0.
Regression line with 95%-confidence intervals
Confidence intervals of predicted values become wider with
increasing distance from the center.
Power considerations
SE (b) =
sresiduals
n - 1 sX
(standard error of the slope)
SE(b) is proportional to the standard deviation of residuals
-> the residuals should be as small as possible
SE(b) is inversely proportional to the square root of n-1
-> n should be sufficiently large
SE(b) is inversely proportional to the standard deviation of X
-> the range of X should be as large as possible
Conditions for the validity of a regression model
a)
The residual plot should display a
horizontal point cloud (no banana or
wave shape).
-> validity of parameter estimates,
confidence intervals and p-values)
b)
The (vertical) variability of the residuals
should be more or less constant across the
whole range of the independent variable
(condition of homoscedasticity).
-> validity of confidence intervals and
p-values
Q uantile der Standardnorm alverteilung
c)
3
2
the distribution of residuals should be
approximately normal (visual assessment by normal probability plot).
1
0
-1
-> validity of confidence intervals and
p-values
-2
-3
-1000
-500
0
Residuen (ml)
500
1000
d) 1. Each observational unit should only occupy one row of the data table
2.
(i.e., each subject should contribute one observation to the analysis).
If the individual observational units can be grouped into clusters (families,
hospitals, etc.) then the cluster means of residuals must not vary systematically between the clusters (i.e., cluster means of residuals should differ
from 0 only by chance*).
-> validity of confidence intervals and p-values
*If they don’t, one should introduce the cluster variable as additional fixed or
random factor into the regression model.
Beware: Not all relations can be well described by a regression line. Very often,
relation(s) between dependent and independent variable(s) are non-linear.
Non-linear association
y = -1.6 + 4.26 · x – 0.039 · x2
Linear association
y = -22.6 + 2.3 · x
Multiple regression models
(illustration based on concrete example)
Association
between
systolic blood pressure,
gender, age and overweight
Different purposes of regression models
1. Prediction models
ex. Prediction of blood volume based on weight.
Prediction of clinical outcome after t years.
2. Reference models
ex. Growth curves, reference values for functional parameters
as a function of sex, age, etc.
3. Explanatory models*
describe the parallel influences of different predictor variables on
a given outcome variable.
e.g., Influence of sex, age and obesity on systolic blood pressure.
* also serve to “protect” effect estimates against confounding.
Aim 1: Reference model for adult systolic blood pressure (SBP)
in Lugano as a function of sex and age.
Sample used: SAPALDIA-subjects from Lugano with normal weight
(i.e., BMI < 25 kg/m2)
SAPALDIA study
(Swiss Cohort Study on Air Pollution and Lung and Heart Diseases
in Adults)
8 study areas (Basel, Geneva, Lugano, Aarau, Wald, Payerne,
Davos, Montana)
Study subjects were between 18 and 60 years old in 1991 and
had to be resident in the respective area for at least 3 years.
1st survey (1991): n = 9651 lung health (symptoms/lung function)
+ allergies
2nd survey (2002): n 6500 lung health + allergies +
cardiovascular health
(blood pressure, 24hr – ECG)
SAPALDIA: Study areas
Statistical method
Ordinary linear regression (quantitative outcome)
Simple model:
E[SBP | sex, age] = b0 + b1 · female + b2 · age_50
E[SBP | sex, age] = mean of SBP (as a function of sex and age)
predicted value of Y (
“
)
expected value of Y (
“
)
female = binary variable with 1 in women and 0 in men.
age_50 = age – 50 (age centered at 50 yrs)
Result of regression model (program STATA)
Source |
SS
df
MS
-------------+-----------------------------Model | 37845.7847
2 18922.8923
Residual | 130361.613
477 273.294787
-------------+-----------------------------Total | 168207.398
479
351.16367
Number of obs =
F( 2,
477) =
Prob > F
=
R-squared
=
Adj R-squared =
Root MSE
=
480
69.24
0.0000
0.2250
0.2217
16.532
-----------------------------------------------------------------------------bpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------female | -13.41972
1.593982
-8.42
0.000
-16.55181
-10.28762
age_50 |
.5503499
.0650948
8.45
0.000
.4224418
.678258
_cons |
128.5214
1.296175
99.15
0.000
125.9745
131.0683
------------------------------------------------------------------------------
1. The age-adjusted mean of systolic blood pressure was significantly lower among
women (i.e., by 13.4 mm Hg).
2. The gender-adjusted mean of SBP showed a mean increase of 0.55 mm Hg per year.
3. The value of the intercept parameter, 128.5 mm Hg, is the estimated mean of SBP
in 50 year old men (they have female = 0 and age_50 = 0).
-50
0
Residuals
50
100
Normal probability plot (QQ-plot)
-50
0
Inverse Normal
50
Point line is slightly curved -> distribution of residuals is slightly skewed
50
100
Residual plot
Residuals
x-axis:
predicted values
-50
0
y-axis:
residuals
100
110
120
Fitted values
130
140
(vertical) variability of residuals increases from left to right
If the distribution of residuals is left skewed and their (vertical)
variability gets larger with increasing predicted values,
then a logarithmic transformation of the data often helps.
We will thus consider the new outcome variable
Y = ln(SBP)
Statistical method
Ordinary linear regression (quantitative outcome)
Alternative model:
E[ln(SBP) | sex, age] = b0 + b1 · female + b2 · age_50
E[ln(Y) | sex, age] = mean of ln(Y) as a function of sex and age
exp{E[ln(Y) | sex, age]} = geometric mean of Y as a function of
sex and age.
≈ median of Y as a function of sex and age
E[ln(Y) | sex, age]
(if residuals are symmetrically distributed)
e
.2
.4
Residual plot
Residuals
0
x-axis:
predicted values
-.4
-.2
y-axis:
residuals
-.4
-.2
0
Inverse Normal
.2
.4
Point line is almost linear -> distribution of residuals close to normal
.4
Residual plot
Residuals
0
.2
x-axis:
predicted values
-.4
-.2
y-axis:
residuals
4.6
4.7
4.8
Fitted values
4.9
5
(vertical) variability of residuals increases less strongly from left to right
Source |
SS
df
MS
-------------+-----------------------------Model | 2.49178871
2 1.24589436
Residual | 8.70727941
477 .018254255
-------------+-----------------------------Total | 11.1990681
479
.0233801
Number of obs =
F( 2,
477) =
Prob > F
=
R-squared
=
Adj R-squared =
Root MSE
=
480
68.25
0.0000
0.2225
0.2192
.13511
-----------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------female | -.1109775
.0130272
-8.52
0.000
-.1365753
-.0853798
age_50 |
.0043791
.000532
8.23
0.000
.0033337
.0054244
_cons |
4.846471
.0105933
457.50
0.000
4.825656
4.867287
------------------------------------------------------------------------------
1. The age-adjusted mean of ln(SBP) was lower by 0.11 in women.
The geometric mean ratio of SBP between women and men was exp(-0.11) = 0.90.
The geometric mean of SBP was lower in women by 10%.
2. On average, the geom. mean of SBP increased by a factor of exp(0.0043) = 1.0043,
i.e., by 0.43% per year of age.
3. The estimated geometric mean of SBP in 50 year old men is exp(4.846) = 127.2.
Is the relation between ln(SBP) and age linear?
may be assessed by adding the square of age_50:
Source |
SS
df
MS
-------------+-----------------------------Model |
2.5071076
3 .835702532
Residual | 8.69196052
476 .018260421
-------------+-----------------------------Total | 11.1990681
479
.0233801
age_50squared = age_502
Number of obs =
F( 3,
476)
Prob > F
R-squared
Adj R-squared
Root MSE
480
=
45.77
= 0.0000
= 0.2239
= 0.2190
= .13513
------------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
--------------+---------------------------------------------------------------female | -.1103769
.0130459
-8.46
0.000
-.1360115
-.0847423
age_50 |
.0043318
.0005346
8.10
0.000
.0032814
.0053823
age_50squared |
.0000422
.000046
0.92
0.360
-.0000483
.0001326
_cons |
4.840393
.0125019
387.17
0.000
4.815827
4.864959
-------------------------------------------------------------------------------
The square term is clearly not significant with a p-value of 0.36.
Is the relation between ln(SBP) and age independent of gender?
may be assessed by adding the interaction term: female_age_50 = female*age_50
Source |
SS
df
MS
-------------+-----------------------------Model | 2.59024374
3
.86341458
Residual | 8.60882438
476 .018085766
-------------+-----------------------------Total | 11.1990681
479
.0233801
Number of obs =
F( 3,
476)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
480
47.74
0.0000
0.2313
0.2264
.13448
------------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
--------------+---------------------------------------------------------------female | -.1139236
.0130282
-8.74
0.000
-.1395236
-.0883236
age_50 |
.0027492
.0008766
3.14
0.002
.0010267
.0044716
female_age_50 |
.0025665
.0011
2.33
0.020
.000405
.0047279
_cons |
4.847934
.0105629
458.96
0.000
4.827179
4.86869
-------------------------------------------------------------------------------
The interaction term is statistically significant with a p-value of 0.02.
The slope between ln(SBP) and age is higher in women (i.e., 0.0027+0.0026 = 0.0053)
than in men (i.e., 0.0027).
5
4.6
4.8
men
women
4.4
ln(SBP)
5.2
5.4
Graphical representation of the model on the log-scale
30
40
50
60
age
70
100
150
men
women
50
SBP
200
250
Graphical representation of the model on the original scale:
30
40
50
age
60
70
Variable selection strategies in prediction / reference models
1. Between two models select the one which is more significant.
2. Between two models select the one with the lower AIC-value
(AIC = Akaike information criterion).
3. Between two models select the one with the lower BIC-value
(BIC = Bayesian information criterion).
2) and 3) are better than 1), because they estimate performance of
the model in new data. They are strongly linked to cross-validation.
3) is stricter than 2) and is preferable if parsimony of the model is
an important criterion.
Aim 2: Assessment of the association between adult systolic blood
pressure (SBP) in Lugano and overweight.
We consider variable „overweight“ with values:
0 in persons with BMI 25kg/m2
1 in persons with BMI > 25 kg/m2
Regression model:
ln(SBP) = b0 + b1 · overweight
Source |
SS
df
MS
-------------+-----------------------------Model | 2.14124857
1 2.14124857
Residual | 20.0169737
922 .021710384
-------------+-----------------------------Total | 22.1582222
923 .024006741
Number of obs =
F( 1,
922)
Prob > F
R-squared
Adj R-squared
Root MSE
=
=
=
=
=
924
98.63
0.0000
0.0966
0.0957
.14734
-----------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
-------------+---------------------------------------------------------------overweight |
.0963513
.0097019
9.93
0.000
.0773109
.1153917
_cons |
4.779094
.0067253
710.61
0.000
4.765896
4.792293
------------------------------------------------------------------------------
1. The mean of ln(SBP) was higher by 0.096 in overweight persons compared to
persons of normal weight. The geometric mean ratio of SBP between overweight
and normal weight persons was exp(0.096) = 1.10.
The geometric mean of SBP was higher by 10% in overweight persons.
2. The estimated geometric mean of SBP in normal weight persons is exp(4.779) =
119.0.
Adjustment for gender and age:
Source |
SS
df
MS
-------------+-----------------------------Model | 6.89527577
4 1.72381894
Residual | 15.2629465
919 .016608212
-------------+-----------------------------Total | 22.1582222
923 .024006741
Number of obs =
F( 4,
919)
Prob > F
R-squared
Adj R-squared
Root MSE
924
= 103.79
= 0.0000
= 0.3112
= 0.3082
= .12887
------------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
--------------+---------------------------------------------------------------female |
-.103432
.0091201
-11.34
0.000
-.1213307
-.0855334
age_50 |
.0036905
.0005486
6.73
0.000
.0026138
.0047672
female_age_50 |
.0022885
.000742
3.08
0.002
.0008323
.0037447
overweight |
.054128
.0088583
6.11
0.000
.0367431
.071513
_cons |
4.840025
.0083731
578.05
0.000
4.823592
4.856457
-------------------------------------------------------------------------------
The gender and age-adjusted mean of ln(SBP) was higher by 0.054 in overweight
persons compared to persons of normal weight. The adjusted geometric mean ratio
of SBP between overweight and normal weight persons was exp(0.054) = 1.055.
The adjusted geometric mean of SBP was higher by 5.5% in overweight persons.
Arithmetic of confounding
OW
SBP
+
+
age
OW
SBP
-
-
female
association between OW and age = +
association between OW and F = -
association between SBP and age = +
association between SBP and F = -
Confounding of association between
SBP and OW by age = + + = + .
Confounding of association between
SBP and OW by sex = - - = + .
Both, age and sex are positive confounders of the association between SBP and OW.
=> If age and sex are included in the model, the slope between SBP and OW decreases.
Adjustment for clustering of data
Example: multi-center studies
If clustering is ignored, then this may lead to
a) a loss of power (RCT‘s with randomisation stratified by center)
b) confounding (observational studies with different study areas)
Remedy: Introduce study center as a fixed factor into the regression
model or use mixed linear model with random effects for
the different centers.
SAPALDIA-example (fixed area effects)
Source |
SS
df
MS
-------------+-----------------------------Model | 23.5304364
11 2.13913058
Residual | 49.2344884 3231 .015238158
-------------+-----------------------------Total | 72.7649248 3242 .022444456
Number of obs =
F( 11, 3231)
Prob > F
R-squared
Adj R-squared
Root MSE
3243
= 140.38
= 0.0000
= 0.3234
= 0.3211
= .12344
------------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
t
P>|t|
[95% Conf. Interval]
--------------+---------------------------------------------------------------female | -.0887425
.0045394
-19.55
0.000
-.0976429
-.079842
age_50 |
.0031324
.0002716
11.53
0.000
.0025999
.0036649
female_age_50 |
.0031068
.000379
8.20
0.000
.0023637
.0038499
overweight |
.0600691
.0045947
13.07
0.000
.0510602
.0690779
_Iarea_161 | -.0076996
.0078966
-0.98
0.330
-.0231825
.0077832
_Iarea_162 |
.0201633
.0098355
2.05
0.040
.0008788
.0394477
_Iarea_163 | -.0084698
.0084272
-1.01
0.315
-.0249929
.0080534
_Iarea_164 | -.0076526
.0093321
-0.82
0.412
-.02595
.0106449
_Iarea_165 | -.0411928
.0086567
-4.76
0.000
-.058166
-.0242196
_Iarea_166 | -.0126127
.0083109
-1.52
0.129
-.0289078
.0036825
_Iarea_167 | -.0260132
.0095668
-2.72
0.007
-.0447708
-.0072556
_cons |
4.841831
.0071065
681.32
0.000
4.827897
4.855765
All but one study area gets a parameter estimate, expressing its difference to the
one area which serves as the reference (here: area 160).
SAPALDIA-example (mixed linear model with random area effects)
Random-effects ML regression
Group variable: area
Number of obs
Number of groups
=
=
3243
8
Random effects u_i ~ Gaussian
Obs per group: min =
avg =
max =
259
405.4
624
Log likelihood
=
2176.8602
LR chi2(4)
Prob > chi2
=
=
1227.18
0.0000
------------------------------------------------------------------------------lnbpsys |
Coef.
Std. Err.
z
P>|z|
[95% Conf. Interval]
--------------+---------------------------------------------------------------female | -.0889006
.0045361
-19.60
0.000
-.0977913
-.08001
age_50 |
.0031431
.0002713
11.58
0.000
.0026113
.003675
female_age_50 |
.0031
.0003787
8.19
0.000
.0023578
.0038422
overweight |
.0597827
.0045912
13.02
0.000
.0507841
.0687812
_cons |
4.831426
.0068285
707.54
0.000
4.818043
4.84481
--------------+---------------------------------------------------------------/sigma_u |
.0150779
.0045517
.0083441
.0272459
/sigma_e |
.1233706
.0015339
.1204006
.1264139
rho |
.014717
.0087657
.0041602
.0430383
Random area effects u are viewed as independent outcomes of a normal distribution
with mu = 0 and su = 0.015 (residual standard deviation within areas = 0.123).
Thank you for your attention!