Logistic Regression
Logistic Regression
When?
Just like multiple regression, but when the dependent variable is dichotomous, e.g. improved or not improved; successful or not successful.
Why?
Logistic regression can be used for classification purposes (it includes a χ² test).
It gives the probability of an outcome and evaluates the risk (odds).
Why not perform a discriminant analysis?
Discriminant analysis can produce a probability of success that lies outside [0, 1].
Discriminant analysis requires normality.
Why not perform a multiple regression?
Multiple regression can produce a probability of success that lies outside [0, 1].
Multiple regression requires homoscedasticity.
Multiple regression requires normality.
Logistic Regression
Example:
Suppose we want to predict whether someone has a coronary disease (DV) using age in years (IV).
It is customary to code a binary DV as either 0 or 1.
Logistic Regression
The logistic curve

$u = b_0 + b_1 x_1 + \dots + b_p x_p$   (linear part)

$\hat{y} = \frac{1}{1 + e^{-u}}$   (nonlinear part)
Logistic Regression
The logistic curve

$u = b_0 + b_1 x_1 + \dots + b_p x_p$

$\hat{y} = \frac{e^u}{1 + e^u} = \frac{1}{1 + e^{-u}}$
where ŷ is the probability of a 1, e is the base of the natural logarithm (about 2.718), and the b are the parameters of the model. The value of b₀ determines ŷ when X is zero, and b₁ adjusts how quickly the probability changes when X changes by a single unit (we can have standardized and unstandardized b in logistic regression, just as in ordinary linear regression). Because the relation between X and ŷ is nonlinear, b does not have a straightforward interpretation in this model as it does in ordinary linear regression.
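As a quick illustration, here is a minimal Python/NumPy sketch of the logistic curve; the coefficient values are taken from the worked example later in these slides.

```python
import numpy as np

def logistic(x, b0, b1):
    """Logistic curve: P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    u = b0 + b1 * x                    # linear part
    return 1.0 / (1.0 + np.exp(-u))    # nonlinear part

# Coefficients from the coronary-disease example later in the slides.
print(logistic(np.array([30, 40, 50, 60, 70]), -5.30945, 0.110921))
```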
Logistic Regression
(Where does it come from?)
Suppose we only know a person's age and we want to predict whether that person has a coronary disease or not. We can talk about the probability of having the disease, or we can talk about the odds of having the disease. Let's say that the probability of not having the disease for a given age is .95. Then the odds of not having the disease are

$\text{odds} = \frac{\hat{y}}{1 - \hat{y}} = \frac{0.95}{1 - 0.95} = 19$

Now the odds of having the disease would be .05/.95 = 1/19 ≈ 0.0526. This asymmetry is unappealing, because the odds of having the disease should be the opposite of the odds of not having the disease.
Logistic Regression
(Where does it come from?)
We can take care of this asymmetry through the natural logarithm, ln. The natural log of 19 is 2.9444 (ln(0.95/0.05) = 2.9444). The natural log of 1/19 is −2.9444 (ln(0.05/0.95) = −2.9444), so the log odds of having a coronary disease are exactly opposite to the log odds of not having the disease.

In terms of odds:

$\log(\text{odds}) = \log\left(\frac{\hat{y}}{1 - \hat{y}}\right) = b_0 + b_1 x_1 + \dots + b_p x_p = u$

Solving for ŷ gives, in terms of probability:

$\frac{\hat{y}}{1 - \hat{y}} = e^u \;\Rightarrow\; \hat{y} = \frac{e^u}{1 + e^u} = \frac{1}{1 + e^{-u}}$
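A two-line numerical check of this symmetry (Python with NumPy assumed):

```python
import numpy as np

p = 0.95                          # probability of NOT having the disease
print(p / (1 - p), (1 - p) / p)   # 19.0 and 0.0526: asymmetric around 1

# On the log scale the two odds are exact opposites:
print(np.log(p / (1 - p)), np.log((1 - p) / p))   # 2.9444, -2.9444
```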
Logistic Regression
Finding the regression weights
In multiple regression, we wanted to minimize the residual sum of squares. This yields the formula

$\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

With the logistic curve, there is no closed-form solution that will produce least squares estimates of the parameters. We will use instead the maximum (log) likelihood. A likelihood is a conditional probability, P(ŷ | X): the probability of ŷ given X. The idea is to choose the regression weights that give the maximum (log) likelihood between the data and the logistic curve.

Maximum likelihood:

$L(\mathbf{b}) = \prod_{i=1}^{n} \hat{y}_i^{\,y_i}\,(1 - \hat{y}_i)^{(1 - y_i)}, \qquad \mathbf{b} = (b_0, b_1, \dots, b_p)$

Maximum log likelihood:

$LL(\mathbf{b}) = \sum_{i=1}^{n} y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i)$
Logistic Regression
Finding the regression weights
The maximum of this expression can then be found numerically using an optimization algorithm.
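A minimal sketch of this numerical maximization, assuming Python with NumPy and SciPy; the data here are simulated for illustration only, not the coronary-disease data from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(b, X, y):
    """Negative log likelihood of the logistic model (X includes a column of 1s)."""
    u = X @ b
    y_hat = 1.0 / (1.0 + np.exp(-u))
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Simulated data: age (IV) and disease status (DV), for illustration only.
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 100)
p = 1.0 / (1.0 + np.exp(-(-5.3 + 0.11 * age)))
y = (rng.random(100) < p).astype(float)

X = np.column_stack([np.ones_like(age), age])
fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y), method="BFGS")
print(fit.x)   # estimated (b0, b1)
```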
Logistic Regression
Hypothesis testing
The idea is to compare the full model with the constant-only model using a chi-square test:

$\chi^2 = 2\,(LL(\mathbf{b}) - LL(0)) = 2\,(-53.6765 - (-68.3315)) = 29.3099$

$\chi^2(1) = 29.3099,\; p < 0.01$   (1 degree of freedom, since there is only 1 predictor)

This indicates that age can reliably distinguish people having a coronary disease from those who do not.
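Using the log likelihoods reported on this slide, the test can be reproduced as follows (SciPy assumed):

```python
from scipy.stats import chi2

ll_full = -53.6765   # LL(b): model with age
ll_null = -68.3315   # LL(0): constant-only model

chi_sq = 2 * (ll_full - ll_null)   # 29.31
p_value = chi2.sf(chi_sq, df=1)    # 1 predictor -> 1 degree of freedom
print(chi_sq, p_value)             # p < .01
```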
Logistic Regression
Hypothesis testing
We can use the same idea to build a regression model:

$\chi^2 = 2\,(LL(\text{bigger model}) - LL(\text{smaller model}))$

Also, the Wald statistic can be used (a z test; sometimes the values are squared and a chi-square test is used instead):

$z_i = \frac{b_i}{SE_{ii}}$

where the standard errors come from the inverse of the Fisher information matrix:

$SE = \sqrt{\operatorname{diag}\,\mathbf{I}(\mathbf{b})^{-1}}, \qquad \mathbf{I}(\mathbf{b}) = \mathbf{X}^T\mathbf{W}\mathbf{X}, \qquad \mathbf{W} = \operatorname{diag}\!\left(\hat{y}_i\,(1 - \hat{y}_i)\right)$
Logistic Regression
Hypothesis testing
Also, the Wald statistic can be used, with $SE = \sqrt{\operatorname{diag}\,\mathbf{I}(\mathbf{b})^{-1}}$ and $\mathbf{I}(\mathbf{b}) = \mathbf{X}^T\mathbf{W}\mathbf{X}$. For the coronary-disease model:

Constant:  b = −5.30945,  SE = 1.13365,  z = −4.68348
Age (IV):  b = 0.110921,  SE = 0.0240598,  z = 4.61022
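A sketch of the Wald computation under these definitions; `wald_z` is a hypothetical helper name, and X is assumed to include a leading column of ones. Applied to the original data with the fitted weights, it should reproduce the z values above.

```python
import numpy as np

def wald_z(b, X):
    """Wald z statistics: z_i = b_i / SE_i, with SE from the inverse
    Fisher information I(b) = X^T W X, W = diag(y_hat_i * (1 - y_hat_i))."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ b)))
    W = np.diag(y_hat * (1 - y_hat))
    info = X.T @ W @ X                            # Fisher information
    se = np.sqrt(np.diag(np.linalg.inv(info)))    # SE = sqrt(diag(I(b)^-1))
    return b / se, se
```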
Logistic Regression
Explained variability
There are three popular measures that approximate the variance interpretation found in linear regression (R²).

1. McFadden's:  $\rho^2 = 1 - \frac{LL(\mathbf{b})}{LL(0)} = 1 - \frac{-53.6765}{-68.3315} = 0.214468$

2. Cox & Snell:  $R^2_{CS} = 1 - e^{-\frac{2}{n}(LL(\mathbf{b}) - LL(0))} = 1 - e^{-\frac{2}{n}(-53.6765 - (-68.3315))} = 0.254052$

3. Nagelkerke:  $R^2_N = \frac{R^2_{CS}}{R^2_{Max}} = \frac{0.254052}{0.745035} = 0.340993$, where $R^2_{Max} = 1 - e^{\frac{2}{n}LL(0)} = 1 - e^{\frac{2}{n}(-68.3315)} = 0.745035$

(Here n = 100, the sample size.)
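The three measures can be reproduced from the reported log likelihoods; n = 100 is inferred from the slide's arithmetic rather than stated explicitly.

```python
import numpy as np

ll_b, ll_0, n = -53.6765, -68.3315, 100   # n inferred from the slide's numbers

mcfadden   = 1 - ll_b / ll_0                       # 0.214468
cox_snell  = 1 - np.exp(2 * (ll_0 - ll_b) / n)     # 0.254052
r2_max     = 1 - np.exp(2 * ll_0 / n)              # 0.745035
nagelkerke = cox_snell / r2_max                    # 0.340993
print(mcfadden, cox_snell, nagelkerke)
```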
Logistic Regression
Odds Ratio (OR)
The odds ratio is the increase (or decrease) in the odds of being in one outcome category when the value of the predictor increases by one unit.
If the odds are the same across groups, then OR = 1.
If the OR is greater than 1, there is an increased probability of being classified into the category.
If the OR is smaller than 1, there is a decreased probability of being classified into the given category.

$OR_i = e^{b_i}$

$OR_{\text{Disease}} = e^{b_{\text{Disease}}} = e^{0.110921} = 1.11731$

Thus, at each of my birthdays I multiply my odds of having a coronary disease by 1.12. In other words, each year I increase the risk of developing a coronary disease by about 12 percent.
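A quick check of the odds-ratio arithmetic:

```python
import numpy as np

odds_ratio = np.exp(0.110921)     # 1.11731: odds multiply by ~1.12 per year
print(100 * (odds_ratio - 1))     # ~11.7% increase in the odds per year of age
```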
Logistic Regression
Odds Ratio (OR)
For a 5-year age difference, say, the increase is $e^{5b} = 1.11731^5 \approx 1.74$, or a 74% increase.

Classification table (cutoff = 0.5):
Constant only: total correct percentage = 57
All predictors: total correct percentage = 74
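A minimal sketch of how such a table's accuracy could be computed, assuming arrays y (observed 0/1) and y_hat (predicted probabilities); `correct_pct` is a hypothetical helper. Applied to the slides' data it should give 57% for the constant-only model and 74% with age.

```python
import numpy as np

def correct_pct(y, y_hat, cutoff=0.5):
    """Total correct percentage at a given classification cutoff."""
    predicted = (y_hat >= cutoff).astype(int)
    return 100.0 * np.mean(predicted == y)
```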
Logistic Regression
Prediction
If I am (x =) 50 years old, what is my probability of having a coronary disease?

$\hat{y} = \frac{1}{1 + e^{-(-5.30945 + 0.110921\,x)}} = \frac{1}{1 + e^{-(-5.30945 + 0.110921 \times 50)}} = \frac{1}{1 + e^{-0.236604}} = 0.558877$
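A numerical check of this prediction:

```python
import numpy as np

u = -5.30945 + 0.110921 * 50    # 0.236604
print(1 / (1 + np.exp(-u)))     # 0.558877
```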
Logistic Regression
Confidence intervals
CI = 0.95.

$SE(u(\mathbf{x})) = \sqrt{\mathbf{x}^T \mathbf{I}(\mathbf{b})^{-1}\mathbf{x}} = \sqrt{\mathbf{x}^T \operatorname{Var}(\mathbf{b})\,\mathbf{x}}$

With $\mathbf{x}^T = (1, 50)$:

$SE(u)^2 = \begin{pmatrix} 1 & 50 \end{pmatrix} \begin{pmatrix} 1.28517 & -0.026677 \\ -0.026677 & 0.000578876 \end{pmatrix} \begin{pmatrix} 1 \\ 50 \end{pmatrix} = 0.0646601, \qquad SE(u) = 0.254284$

$u(\mathbf{x}) \pm Z_{\alpha/2}\,SE(u(\mathbf{x})) = 0.236604 \pm 1.96 \times 0.254284 = (-0.261783,\; 0.73499)$

Converting back to the probability scale with $\hat{y} = 1/(1 + e^{-u(\mathbf{x})})$:

$(-0.261783,\; 0.73499) \;\rightarrow\; (0.434925,\; 0.675899)$
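Reproducing the interval with the variance matrix shown above:

```python
import numpy as np

# Var(b) = I(b)^(-1), copied from the slide.
var_b = np.array([[ 1.28517,   -0.026677  ],
                  [-0.026677,   0.000578876]])
x = np.array([1.0, 50.0])

u  = -5.30945 + 0.110921 * 50             # 0.236604
se = np.sqrt(x @ var_b @ x)               # 0.254284
lo, hi = u - 1.96 * se, u + 1.96 * se     # -0.2618, 0.7350

# Transform the interval on u back to the probability scale:
print(1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi)))   # 0.4349, 0.6759
```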
Logistic Regression
Confidence bands
CI = 0.95. Applying the same formulas over the whole range of x traces the confidence bands around the fitted logistic curve:

$SE(u(\mathbf{x})) = \sqrt{\mathbf{x}^T \mathbf{I}(\mathbf{b})^{-1}\mathbf{x}} = \sqrt{\mathbf{x}^T \operatorname{Var}(\mathbf{b})\,\mathbf{x}}, \qquad u(\mathbf{x}) \pm Z_{\alpha/2}\,SE(u(\mathbf{x})), \qquad \hat{y} = \frac{1}{1 + e^{-u(\mathbf{x})}}$

[Figure: fitted logistic curve with 95% confidence bands.]
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Contingency table:

              Age ≥ 55   Age < 55
Disease           21         22
No disease         6         51
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Regression weights (the slide also reports their Wald tests):

$\hat{y} = \frac{1}{1 + e^{-(-0.840783 + 2.09355\,x_1)}}$
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Explained variability:
1. McFadden's $\rho^2 = 0.136861$
2. $R^2_{CS} = 0.170588$
3. $R^2_N = 0.228967$
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Classification table (cutoff = 0.5):
Constant only: total correct percentage = 57
All predictors: total correct percentage = 72
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Odds ratio:

$OR_{\text{Disease}} = e^{b_{\text{Disease}}} = e^{2.09355} = 8.11364$

If I am 55 years old or older, my odds of having a coronary disease are about 8 times greater. The same value follows directly from the contingency table:

$OR_{\text{Disease}} = \frac{21/6}{22/51} = \frac{21 \times 51}{22 \times 6} = 8.11364$
Logistic Regression
Recoding a continuous variable into a dichotomous variable
Cutoff at 55

Confidence intervals:

$e^{\,b_i \pm Z_{\alpha/2} SE_{b_i}} = (2.87956,\; 22.8615)$

The CI (0.95) is asymmetric. It suggests that a coronary disease is 2.9 to 22.9 times more likely to occur if I am 55 years old or older.
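A check of both the table-based odds ratio and the interval; the SE of b (≈ 0.5285) is an assumption back-computed from the slide's reported end points, since the slides give only the resulting interval.

```python
import numpy as np

# 2x2 contingency table (cutoff at 55), cells from the slides:
#               disease   no disease
# age >= 55        21          6
# age <  55        22         51
odds_ratio = (21 * 51) / (22 * 6)   # 8.11364, matches exp(2.09355)

b = 2.09355
se = 0.5285   # ASSUMED: back-computed from the reported (2.87956, 22.8615)
ci = np.exp([b - 1.96 * se, b + 1.96 * se])
print(odds_ratio, ci)               # ~8.11, ~(2.88, 22.86)
```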