Lecture 19 Slides (Apr 4)
April 4
• Logistic Regression
– Lee Chapter 9
– Cody and Smith 9:F
HRT Use and Polyps

              Case (Polyps)   Control (No Polyps)   Total
HRT Use            72               175              247
No HRT Use        102               114              216
Total             174               289              463

RO HRT Use (Case v Control)
RO = (72/102) / (175/114) = 0.46

χ² = (463) (72*114 - 175*102)² / [ (174) (289) (247) (216) ] = 16.04
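The relative odds and chi-square above can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the cell labels assume the reading implied by the RO formula (72 and 102 are cases with and without HRT use, 175 and 114 are controls with and without HRT use).

```python
# Cell counts as they appear in the RO formula (names are illustrative)
a, b = 72, 175    # HRT use: cases, controls
c, d = 102, 114   # no HRT use: cases, controls
n = a + b + c + d

# Relative odds of HRT use, cases versus controls
relative_odds = (a / c) / (b / d)

# Chi-square for a 2x2 table, without continuity correction
row1, row2 = a + b, c + d
col1, col2 = a + c, b + d
chi_square = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

print(round(relative_odds, 2), round(chi_square, 2))
```

The chi-square here uses the cross-product difference (ad - bc) in the numerator, which is what reproduces the 16.04 shown on the slide.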
Inference for binary data
• Relative risk, odds ratios, 2x2 tables are
limited
– Can’t adjust for many confounders
– Limited to categorical predictors
– Can’t look at multiple variables simultaneously
• Logistic regression
– Adjust for many confounders
– Study continuous predictors
– Model interactions
Linear regression model
Y = b0 + b1X1 + b2X2 + ... + bpXp
Y = dependent variable
Xi = independent variables
Y is continuous, normally distributed
Model the mean response (Y) based on the predictors
b0 is the mean of Y when all Xs are 0
bi is the increase in the mean of Y for a 1 unit increase in Xi, holding the other Xs fixed
New regression model?
Y ?= b0 + b1X1 + b2X2 + ... + bpXp
Y = binary outcome (0 or 1)
Xi = independent variables
Would like to use this type of model for a binary
outcome variable
Draw a line?
What if you had multiple observations at each Score (or you grouped scores)?

Score     Proportion Dying
< 10      1/10 = 0.10
11-20     4/15 = 0.27
21-30     5/15 = 0.33
31-40     8/16 = 0.50
Possibilities for Y
Y ?= b0 + b1X1 + b2X2 + ... + bpXp
Y = probability of Y = 1 (Problem: a probability is bounded between 0 and 1)
Y = odds of Y = 1
Y = log (odds of Y = 1) – Has good properties
Probability, Odds, Log Odds

p        Odds (p/(1-p))   Log (Odds)
0.01          0.01          -4.60
0.10          0.11          -2.20
0.20          0.25          -1.39
0.30          0.43          -0.85
0.40          0.67          -0.41
0.50          1.00           0.00
0.60          1.50           0.41
0.70          2.33           0.85
0.80          4.00           1.39
0.90          9.00           2.20
0.99         99.00           4.60

p: bounded between 0 and 1
Odds: extreme values
Log (Odds): less extreme values and symmetric about p = 0.5

Nearly a straight line for middle values of p
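The probability, odds, and log odds columns can be regenerated directly from p. The sketch below is only an illustration; the function names `odds` and `log_odds` are chosen here and are not part of the lecture.

```python
import math

probabilities = [0.01, 0.10, 0.20, 0.30, 0.40, 0.50,
                 0.60, 0.70, 0.80, 0.90, 0.99]

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the logit of p)."""
    return math.log(odds(p))

# Print one row per probability: p, odds, log odds
for p in probabilities:
    print(f"{p:4.2f}  {odds(p):6.2f}  {log_odds(p):6.2f}")
```

The symmetry about p = 0.5 shows up as log_odds(p) = -log_odds(1 - p), for example at p = 0.2 and p = 0.8.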
Logistic regression equation
The model is:
log(p / (1 - p)) = b0 + b1x1 + b2x2 + ...
Model log odds of outcome as a linear function of one
or more variables
Xi = predictors, independent variables
A Little Math
• The natural LOG and exponential (EXP) functions are inverse functions of each other
– LOG (a) = b          EXP (b) = a
– LOG (1) = 0          EXP (0) = 1
– LOG (.5) = -0.693    EXP (-.693) = .5
– LOG (1.5) = .405     EXP (.405) = 1.5
The LOG values will be logistic regression betas; the EXP values will be the odds ratios
Note: Calculators and Excel use LN for natural logarithm
A Little Math
• LOG function
– Takes values in (0, +infinity) and returns values in (-infinity, +infinity)
• EXP function
– Takes values in (-infinity, +infinity) and returns values in (0, +infinity)
A Little Math
• Properties of LOG function
– log (a*b) = log (a) + log (b)
– log (a/b) = log (a) – log (b)
• Properties of EXP function
– exp (a+b) = exp(a) * exp(b)
– exp (a-b) = exp(a)/exp(b)
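These LOG and EXP facts are easy to confirm numerically. The following sketch uses arbitrary illustrative values for a and b; it is a check of the identities, not part of the lecture material.

```python
import math

a, b = 1.5, 0.5  # arbitrary positive values chosen for illustration

# LOG and EXP undo each other
round_trip = math.exp(math.log(a))

# The log of a ratio is a difference of logs ...
log_of_ratio = math.log(a / b)
difference_of_logs = math.log(a) - math.log(b)

# ... and the exp of a difference is a ratio of exps
exp_of_difference = math.exp(a - b)
ratio_of_exps = math.exp(a) / math.exp(b)

print(round(math.log(0.5), 3), round(math.log(1.5), 3))
print(math.isclose(round_trip, a),
      math.isclose(log_of_ratio, difference_of_logs),
      math.isclose(exp_of_difference, ratio_of_exps))
```

The second pair of identities is what turns a difference in log odds into a ratio of odds.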
Odds Ratios and Differences in Log Odds
Ratios of odds: these will be the odds ratios
Differences in log odds: these will be typical betas from the logistic regression model
Logistic regression – single binary covariate
The model is:
log(p / (1 - p)) = b0 + b1x
We need to use a dummy variable to code for men and
women
x = 1 for women, 0 for men
What do the betas mean? What is odds ratio, women
versus men?
Odds for Men and Women
For men:
log(p / (1 - p)) = b0 + b1x = b0 + b1(0) = b0
For women:
log(p / (1 - p)) = b0 + b1x = b0 + b1(1) = b0 + b1
b1 is the difference in log odds between women and men
After some algebra, the odds ratio is equal to:
(odds for women) / (odds for men) = exp(b1)
Example - risk of CVD for men vs. women
log(odds) = b0 + b1x
= -2.5504 - 1.0527*x
For females; log(odds) = -2.5504 - 1.0527(1) = -3.6031
For males; log(odds) = -2.5504 - 1.0527(0) = -2.5504
Dif = -1.0527
exp(b1) = odds ratio for women vs. men
Here, exp(b1) = exp(-1.0527) = 0.35
Women have 65% lower odds of the outcome than men (OR < 1)
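The arithmetic in this example can be reproduced as follows. This is a small sketch using the slide's estimates; the helper names `log_odds` and `probability` are illustrative, and the conversion to probabilities is an extra step not shown on the slide.

```python
import math

b0, b1 = -2.5504, -1.0527  # estimates from the slide

def log_odds(x):
    """Log odds of CVD; x = 1 for women, 0 for men."""
    return b0 + b1 * x

def probability(x):
    """Convert the log odds back to a probability."""
    return 1 / (1 + math.exp(-log_odds(x)))

# The odds ratio is exp of the difference in log odds, which is exp(b1)
odds_ratio = math.exp(log_odds(1) - log_odds(0))

print(round(log_odds(1), 4), round(log_odds(0), 4), round(odds_ratio, 2))
print(round(probability(0), 3), round(probability(1), 3))
```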
Note
• Odds ratio from 2 x 2 table
• EXP (b) from logistic regression for binary risk factor
• These will be equal
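A quick way to see why they are equal: with a single binary predictor, the fitted model reproduces the observed log odds in each group, so b0 and b1 have closed forms. The sketch below uses the HRT and polyps counts from the earlier slide; the cell labels are an assumed reading of that table, and x = 1 for HRT use is a coding chosen here for illustration.

```python
import math

# Counts from the HRT and polyps table (cell labels are an assumed reading)
cases_hrt, controls_hrt = 72, 175
cases_no_hrt, controls_no_hrt = 102, 114

# Logistic model for polyps with x = 1 for HRT use, 0 for no HRT use:
# b0 is the log odds of being a case when x = 0,
# b1 is the difference in log odds between x = 1 and x = 0
b0 = math.log(cases_no_hrt / controls_no_hrt)
b1 = math.log(cases_hrt / controls_hrt) - b0

# Odds ratio straight from the 2x2 table (cross-product ratio)
table_odds_ratio = (cases_hrt * controls_no_hrt) / (controls_hrt * cases_no_hrt)

print(round(math.exp(b1), 2), round(table_odds_ratio, 2))
```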
Multiple logistic regression model
log(odds) = b0 + b1X1 + b2X2 + ... + bpXp
log(odds) = logarithm of the odds for the outcome, dependent variable
Xi = predictors, independent variables
bi = log(OR) associated with either
• exposure (for categorical predictors)
• a 1 unit increase in predictor (for continuous)
OR adjusted for other variables in model
Interpretation of coefficients - continuous predictors
Example - effect of age on risk of death in 10 years
log(odds) = -8.2784+ 0.1026*age
b0 = -8.2784, b1 = 0.1026
exp(b1) = exp(0.1026) = 1.108
A one year increase in age is associated with an odds ratio of death
of 1.108 (assumption that this is true for any 2 consecutive ages)
This is an increase of approximately 11% in the odds (= 1.108 - 1)
Interpretation of coefficients - continuous predictors
What about a 5 year increase in age?
Multiply coefficient by the change you want to look at;
exp(5*b1) = exp(5*0.1026) = 1.67
A five year increase in age is associated with an odds ratio of death
of 1.67
This is an increase of 67% in the odds
Note: exp(5*b1) does not equal 5*exp(b1)
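The one year and five year odds ratios can be computed with a single helper. This is a minimal sketch; the function name `odds_ratio` is illustrative.

```python
import math

b1 = 0.1026  # age coefficient from the slide

def odds_ratio(years):
    """Odds ratio of death for an age difference of the given number of years."""
    return math.exp(years * b1)

one_year = odds_ratio(1)
five_year = odds_ratio(5)

print(round(one_year, 3), round(five_year, 2))

# exp(5*b1) is exp(b1) raised to the 5th power, not 5 times exp(b1)
print(math.isclose(five_year, one_year ** 5), round(5 * one_year, 2))
```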
Parameter Estimation
• How do we come up with estimates for bi?
• Can’t use least squares since outcome is not
continuous
• Use Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation
• Choose parameter estimates that maximize the
probability of observing the data you observed.
• Example: estimating a proportion p
– Observe 7/10 have characteristic
– p = 0.70 is the estimate of p
– p = 0.70 is the MLE of p (Why?)
– Which value of p maximizes the probability of getting 7 of 10?
– Answer: 0.70
MLE Simple Example
• Wish to estimate a proportion p
• Sample n = 2
– Observe 1 of 2 have characteristic
– L = p (1-p)
– What value of p maximizes L?
– Answer: p = 0.5, which is the observed proportion 1/2
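Both MLE examples can be checked by brute force: evaluate the likelihood over a grid of candidate values of p and keep the one that gives the largest value. The sketch below is an illustration only; the grid of 0.01 steps and the function name `binomial_likelihood` are assumptions made here. It includes the binomial coefficient, which the slide's L = p(1-p) omits; that constant does not change where the maximum occurs.

```python
from math import comb

def binomial_likelihood(p, successes, n):
    """Probability of observing `successes` out of n when the true proportion is p."""
    return comb(n, successes) * p ** successes * (1 - p) ** (n - successes)

# Candidate values 0.01, 0.02, ..., 0.99
grid = [i / 100 for i in range(1, 100)]

mle_7_of_10 = max(grid, key=lambda p: binomial_likelihood(p, 7, 10))
mle_1_of_2 = max(grid, key=lambda p: binomial_likelihood(p, 1, 2))

print(mle_7_of_10, mle_1_of_2)
```

Logistic regression estimates are found the same way in principle, except that the search is over the betas and is done by an iterative algorithm rather than a grid.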
Fitted regression line
Curve based on:
log(p / (1 - p)) = b0 + b1x
b0 affects location
b1 affects curvature
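To see the fitted curve numerically, the log odds can be converted back to a probability with p = 1 / (1 + exp(-(b0 + b1x))). The sketch below reuses the age coefficients from the earlier 10 year death example; the ages evaluated and the function name `fitted_probability` are illustrative choices.

```python
import math

b0, b1 = -8.2784, 0.1026  # age example from the earlier slide

def fitted_probability(age):
    """Probability implied by log(p / (1 - p)) = b0 + b1 * age."""
    return 1 / (1 + math.exp(-(b0 + b1 * age)))

# The curve crosses p = 0.5 where the log odds equal 0
midpoint_age = -b0 / b1

for age in (40, 60, 80, 100):
    print(age, round(fitted_probability(age), 3))
print(round(midpoint_age, 1))
```

Changing b0 shifts the point where the curve crosses 0.5, while a larger b1 makes the rise from 0 to 1 steeper.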
Inference for multiple logistic regression
• Collect data, choose model, estimate b0 and the bi
• Describe odds ratios, exp(bi), in statistical terms.
– How confident are we of our estimate?
– Is the odds ratio different from one only due to chance?
Not interested in inference for b0 (related to overall probability
of outcome)
Confidence Intervals for logistic regression
coefficients
• General form of 95% CI: Estimate ± 1.96*SE
– bi is the estimate, provided by SAS
– SE is complicated, provided by SAS
• Related to variability of our data and sample
size
95% Confidence Intervals for the odds ratio
• Based on transforming the 95% confidence interval for the parameter estimates:
(exp(bi - 1.96*SE), exp(bi + 1.96*SE))
• Supplied automatically by SAS
• Look to see if interval contains 1. If it does not:
“We have a statistically significant association between the predictor and
the outcome controlling for all other covariates”
• Equivalent to a hypothesis test; reject Ho: OR = 1 at alpha = 0.05
based on whether or not 1 is in the interval
Hypothesis test for individual logistic
regression coefficient
• Null and alternative hypotheses
– Ho: bi = 0, Ha: bi ≠ 0
• Test statistic: χ² = (bi/SE)², supplied by SAS
• p-values are supplied by SAS
• If p<0.05, “there is a statistically significant association
between the predictor and outcome variable controlling for all
other covariates” at alpha = 0.05
PROC LOGISTIC
PROC LOGISTIC DATA = dataset ;
MODEL outcome = list of x variables;
RUN;
• CLASS statement allows for categorical variables with many
groups (>2)
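DATA temp;
  INPUT apache death @@ ;   /* @@ reads several observations from each data line */
  /* PROC LOGISTIC models the lower ordered response value by default, */
  /* so recode the outcome so that death is level 1 and survival is level 2 */
  xdeath = 2;
  if death = 1 then xdeath = 1;
DATALINES;
0 0 2 0 3 0 4 0 5 0
6 0 7 0 8 0 9 0 10 0
11 0 12 0 13 0 14 0 15 0
16 0 17 1 18 1 19 0 20 0
21 1 22 1 23 0 24 1 25 1
26 1 27 0 28 1 29 1 30 1
31 1 32 1 33 1 34 1 35 1
36 1 37 1 38 1 41 0
;
PROC LOGISTIC DATA=temp;
  MODEL xdeath = apache;
RUN;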
The LOGISTIC Procedure

Model Information
Data Set                     WORK.TEMP
Response Variable            xdeath
Number of Response Levels    2
Number of Observations       39
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered Value    xdeath    Total Frequency
1                1         18
2                2         21

Probability modeled is xdeath=1.
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     -4.3861     1.3687            10.2686            0.0014
apache       1      0.2034     0.0605            11.3093            0.0008
Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
apache    1.226             1.089    1.380

Point estimate = EXP(0.2034)
Lower limit = EXP(0.2034 – 1.96*.0605)
Upper limit = EXP(0.2034 + 1.96*.0605)
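The SAS results for the apache data can be cross-checked outside SAS. The sketch below is written in Python rather than SAS and is only an illustration: it fits the same one-predictor model by Newton-Raphson (which coincides with Fisher scoring for the logit model), then rebuilds the odds ratio, its 95% Wald limits, the Wald chi-square, and the p-value. The function name `fit_logistic`, the starting values of 0, and the fixed 25 iterations are assumptions made here, not a description of how SAS works internally.

```python
import math

# (apache, death) pairs transcribed from the DATALINES block
pairs = [
    (0, 0), (2, 0), (3, 0), (4, 0), (5, 0),
    (6, 0), (7, 0), (8, 0), (9, 0), (10, 0),
    (11, 0), (12, 0), (13, 0), (14, 0), (15, 0),
    (16, 0), (17, 1), (18, 1), (19, 0), (20, 0),
    (21, 1), (22, 1), (23, 0), (24, 1), (25, 1),
    (26, 1), (27, 0), (28, 1), (29, 1), (30, 1),
    (31, 1), (32, 1), (33, 1), (34, 1), (35, 1),
    (36, 1), (37, 1), (38, 1), (41, 0),
]

def fit_logistic(pairs, iterations=25):
    """Maximum likelihood fit of log(p/(1-p)) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0    # entries of the information matrix
        for x, y in pairs:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: add (inverse information) times gradient
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Standard errors from the inverse information of the final pass,
    # which matches the information at the estimates once the steps have converged
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

b0, b1, se0, se1 = fit_logistic(pairs)

odds_ratio = math.exp(b1)
lower = math.exp(b1 - 1.96 * se1)
upper = math.exp(b1 + 1.96 * se1)

wald_chi_square = (b1 / se1) ** 2
# For 1 degree of freedom, P(chi-square > c) = erfc(sqrt(c / 2))
p_value = math.erfc(math.sqrt(wald_chi_square / 2))

print(round(b0, 4), round(b1, 4), round(se1, 4))
print(round(odds_ratio, 3), round(lower, 3), round(upper, 3))
print(round(wald_chi_square, 2), round(p_value, 4))
```

Small differences in the last digit from the SAS listing are possible, since SAS may use a more precise multiplier than 1.96 and different convergence criteria.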
TOMHS – bpstudy sas dataset
• Variable CLINICAL (1=yes, 0=no) indicates whether patient had a CVD
event
• Run logistic regression separately for age and gender to determine if:
– Age is related to CVD
• What is the odds ratio associated with a 1 year increase in age?
• What is the odds ratio associated with a 5 year increase in age?
– Gender is related to CVD
• What is the odds ratio of CVD (women versus men)?
• Run logistic regression for age and gender together
• Note: Download dataset from web-page or use dataset on SATURN