Logistic Regression


Notes on Logistic Regression
STAT 4330/8330
1
Introduction
Previously, you learned about odds
ratios (OR’s).
We now transition and begin discussion
of binary logistic regression. We will see
that OR’s play an important role in the
results of binary logistic models.
2
Binary Logistic Regression
Binary logistic regression is appropriate when:
1. The response variable is categorical with 2 categories (binary,
dichotomous, etc.). The response categories are often
generically labeled “success” and “failure”.
2. One or more explanatory variables are involved. These can
be quantitative, categorical, or a mixture of both.
3. One is interested in assessing the relationship between the
binary response and the explanatory variables and/or
predicting the response category based on the value(s) of the
explanatory variable(s).
3
The Model Equation
4
The Model Equation
A few points:
1. E(y) can never fall below 0 or above 1
(Remember: it is a probability!).
2. The model is not a linear function of the β
parameters. This is a type of nonlinear
regression model.
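For reference, the standard form of the binary logistic model, which is presumably what the slide displays, can be written as follows. The number of predictors k and the names x_1, ..., x_k are assumptions for illustration.

```latex
E(y) = P(y = 1) =
  \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
       {1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
```

The numerator is positive and always smaller than the denominator, so E(y) stays strictly between 0 and 1.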
5
The Model Function
6
The Model Equation
Alternatively, the equation can be
transformed to show that it models the
natural logarithm of the odds of y = 1.
7
The Model Equation
The left side is called the “logit”
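A sketch of the transformed equation in its standard form, assuming k predictors x_1, ..., x_k (names chosen for illustration):

```latex
\ln\!\left( \frac{P(y = 1)}{1 - P(y = 1)} \right)
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```

Written this way, the log-odds (the logit) is a linear function of the β parameters, even though the probability itself is not.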
8
The Model Equation
In general, bi estimates the change in the log-odds when xi is increased by 1 unit, holding all
other x’s in the model fixed.
Therefore, exp(bi) estimates the OR of a success for
each additional 1-unit increase in xi.
Furthermore, (exp(bi)-1)*100 gives the percent
increase in the odds of a success for each 1-unit
increase in xi.
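The step from the coefficient to the odds ratio can be sketched as a short derivation, assuming all other x’s are held fixed. The notation odds(x_i) is introduced here for illustration only.

```latex
\begin{aligned}
\ln \text{odds}(x_i + 1) - \ln \text{odds}(x_i) &= b_i \\
\frac{\text{odds}(x_i + 1)}{\text{odds}(x_i)} &= \exp(b_i) \\
\text{percent change in odds} &= (\exp(b_i) - 1) \times 100
\end{aligned}
```

A negative b_i gives exp(b_i) < 1, in which case the same formula yields a percent decrease in the odds.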
9
Example: The Outbreak Data
The Outbreak data contain a sample of N =
196 persons in 2 neighborhoods (sectors) of a
large city during a disease outbreak.
Can we predict whether or not a person
contracts the disease?
We will begin with a simple binary logit
model (with 1 predictor = age).
10
Example: The Outbreak Data.
Through SAS PROC LOGISTIC, we
find that b1 = .0285.
Therefore, OR = exp(.0285) = 1.029,
indicating that a person’s odds of
contracting the disease are multiplied
by 1.029 for each additional year of age.
11
Example: The Outbreak Data.
Furthermore, we can state that the
odds of contracting the disease
increase by 2.89% with each additional
year in age.
(exp(.0285)-1)*100% = 2.89%.
12
Example: The Outbreak Data.
We can transform these results to
discuss the increase in odds in 5 & 10
year increments by the following:
exp(cbi) = the OR when there is a
difference of c units.
13
Example: The Outbreak Data.
Therefore:
(exp(5*.0285)-1)*100% = 15.32%
(exp(10*.0285)-1)*100% = 32.98%
As a result then, a person’s odds of
getting the disease increase by 15.32%
for every additional 5 years in age.
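The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the SAS code used in the course; the function names are made up for illustration, and only the coefficient .0285 comes from the notes.

```python
import math

def odds_ratio(b, c=1.0):
    """Estimated odds ratio for a c-unit increase in a predictor with coefficient b."""
    return math.exp(c * b)

def percent_change_in_odds(b, c=1.0):
    """Percent change in the odds for a c-unit increase in the predictor."""
    return (odds_ratio(b, c) - 1.0) * 100.0

b1 = 0.0285  # age coefficient reported in the notes

# Odds ratio and percent change for 1-, 5-, and 10-year differences in age
for years in (1, 5, 10):
    print(years,
          round(odds_ratio(b1, years), 3),
          round(percent_change_in_odds(b1, years), 2))
```

Note that the 5-year odds ratio is exp(5*b1), not 5 times the 1-year odds ratio: the effect is multiplicative on the odds scale.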
14
Model Fit
We ended last session fitting a simple (1-predictor) binary logit model to the
Outbreak data using SAS.
We will now continue covering the SAS
PROC LOGISTIC output.
15
Model Fit Statistics
All of these statistics assess model fit,
that is, how well the model explains
the sample data.
16
Model Fit Statistics
-2 Log L: The -2 Log-Likelihood is a
transformation of the
Likelihood function (L). L is a
quantification of how well the
model fits the sample data.
17
Model Fit Statistics
Both AIC & SC are variants of the -2 Log
L that penalize for model complexity
(the number of predictor variables).
18
Model Fit Statistics
AIC: Akaike Information Criterion. Used
to compare non-nested models.
Smaller is better. AIC is only
meaningful in relation to another
model’s AIC value.
19
Model Fit Statistics
SC: Schwarz Criterion. Very much like
AIC; however, the penalty is
different. SC tends to favor simpler
models than AIC.
20
Model Fit Statistics
Choose either AIC or SC (not both) and
use the values under the heading
‘Intercept and Covariates’ to compare to
competing models.
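How the two penalties differ can be sketched in Python using the usual definitions, AIC = -2 Log L + 2p and SC = -2 Log L + p ln(n), where p counts the estimated parameters including the intercept. The -2 Log L values below are hypothetical and are not taken from the Outbreak output.

```python
import math

def aic(neg2_log_l, n_params):
    """Akaike Information Criterion: -2 Log L plus 2 per estimated parameter."""
    return neg2_log_l + 2 * n_params

def sc(neg2_log_l, n_params, n_obs):
    """Schwarz Criterion: -2 Log L plus ln(n) per estimated parameter."""
    return neg2_log_l + n_params * math.log(n_obs)

n_obs = 196  # sample size used in the notes

# Hypothetical competing models: (label, -2 Log L, number of parameters)
models = [("1 predictor", 230.0, 2), ("3 predictors", 227.5, 4)]

for label, neg2_log_l, n_params in models:
    print(label,
          round(aic(neg2_log_l, n_params), 2),
          round(sc(neg2_log_l, n_params, n_obs), 2))
```

With n = 196, ln(n) is about 5.28, so SC charges more than twice as much per parameter as AIC, which is why it tends to favor the simpler model.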
21
The model equation.
22
Inference: The Coefficients.
Instead of a t-test for the significance of a
coefficient (like in linear regression), we
have a Wald Chi-Squared test.
23
Inference: The Coefficients.
Remember, typically we do not evaluate
the intercept, but rather focus on the test
for each predictor.
24
Inference: The Coefficients.
In this case, age is a statistically
significant predictor of disease status at
the α = .05 level, χ²(1) = 11.53,
p = .0007.
25
Inference: The Coefficients.
One can also obtain CI’s for the
parameter estimates using the CL option in
the MODEL statement of PROC
LOGISTIC.
26
Inference: The Coefficients.
As we found in linear regression, we can
conclude that a given predictor is
statistically significant at the α = .05 level if the
95% CI does not include the null value of
0.
27
Inference: The Coefficients.
Therefore, our best estimate of the
change in the log-odds for each additional
year of age is 0.0285. We are 95% confident
that this change lies between 0.0120 and 0.0449
for the population.
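The link between the Wald test and the confidence interval can be sketched in Python. The standard error is not shown in these notes, so the sketch backs it out from the reported limits, assuming they are Wald limits of the form b1 ± 1.96*SE. That assumption, and all variable names, are illustrative.

```python
import math

b1 = 0.0285                       # age coefficient reported in the notes
ci_low, ci_high = 0.0120, 0.0449  # 95% limits reported in the notes
z_crit = 1.959964                 # 97.5th percentile of the standard normal

# Assumption: the limits are b1 +/- z_crit * se, so se is half the width over z_crit
se = (ci_high - ci_low) / (2 * z_crit)

# Wald chi-squared statistic with 1 degree of freedom
wald_chi_sq = (b1 / se) ** 2

# Upper-tail probability of a chi-squared(1) variable via the complementary error function
p_value = math.erfc(math.sqrt(wald_chi_sq / 2))

print(round(se, 4), round(wald_chi_sq, 2), round(p_value, 4))
```

Because the inputs are rounded, the recovered statistic only approximately matches the SAS output, but it illustrates that the test and the interval use the same estimate and standard error.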
28
Inference: The Coefficients.
Furthermore: exp(.0285) = 1.029
exp(.0120) = 1.012
exp(.0449) = 1.046
Therefore, we estimate that a person’s odds
of contracting the disease are multiplied
by 1.029 for each additional year of age, and
we are 95% confident that this multiplier
lies in (1.012, 1.046) for the population.
29
Inference: The Coefficients.
Of course, we no longer have to compute
these odds ratio estimates by hand,
because SAS provides them for us.
30
Inference: The Coefficients.
Furthermore:
(exp(.0285)-1)*100% = 2.89%
(exp(.0120)-1)*100% = 1.21%
(exp(.0449)-1)*100% = 4.59%
We can state that the odds of contracting
the disease increase by 2.89% with each
additional year in age, and we are 95%
confident that this increase lies
in (1.21%, 4.59%) for the population.
31
Final Note: Model Fitting
Realize that in order to estimate the
model parameters, the data must contain
a substantial number of observations in each response
category. For example, one will not be
able to estimate the risk of contracting a
disease if the data set does not contain
any individuals who have been
diagnosed with the disease.
32
Final Note: Model Fitting
Essentially, then, in order to estimate the
probability of either a success or failure,
the data set must contain a substantial
number (> 30 is best) of observations that
experienced a success and a substantial
number that experienced a failure.
33
More about output.
PROC LOGISTIC provides more
information concerning how the model
fits the sample data.
34
More about Model Fit
Percent Concordant
A pair of observations with
different observed responses is
considered concordant if the
observation with the lower ordered
response value has a lower
predicted value than the
observation with a higher ordered
response value.
35
More about Model Fit
Percent Discordant
A pair with different observed responses
is considered discordant if the
observation with the lower ordered
response value has a higher
predicted value than the observation
with the higher ordered response value.
36
More about Model Fit
Percent Tied
A pair with different responses is
considered tied if it is neither
concordant nor discordant.
37
More about Model Fit
Somers’ D, Gamma, & Tau-a
These are statistics that measure the
strength and direction of the
relationship between pairs.
38
More about Model Fit
Somers’ D & Tau-a
Like r, these vary between -1.0 (all
pairs discordant) & +1.0 (all pairs
concordant).
Somers’ D = (% concordant − %
discordant) / 100.
39
More about Model Fit
Gamma
Gamma is a similar statistic: its
values also range between -1.0 &
+1.0; however, it is computed
differently. Tied pairs are excluded, so
Gamma = (concordant − discordant) /
(concordant + discordant). A value of 0 =
no association & ±1.0 = perfect
association.
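The pair-based statistics above can be sketched in Python for a binary response. This is a minimal illustration using the usual definitions and exact comparison of predicted values, which is an assumption of the sketch; the data and function name are hypothetical.

```python
from itertools import product

def pair_statistics(y, p):
    """y: observed 0/1 responses; p: predicted probabilities of y = 1."""
    p0 = [pi for yi, pi in zip(y, p) if yi == 0]
    p1 = [pi for yi, pi in zip(y, p) if yi == 1]

    concordant = discordant = tied = 0
    # Only pairs with different observed responses are compared
    for low, high in product(p0, p1):
        if low < high:
            concordant += 1
        elif low > high:
            discordant += 1
        else:
            tied += 1

    pairs = len(p0) * len(p1)        # pairs with different responses
    all_pairs = len(y) * (len(y) - 1) / 2
    diff = concordant - discordant
    return {
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "percent_tied": 100.0 * tied / pairs,
        "somers_d": diff / pairs,
        "gamma": diff / (concordant + discordant),
        "tau_a": diff / all_pairs,
    }

# Hypothetical data: 3 failures and 2 successes
stats = pair_statistics([0, 0, 0, 1, 1], [0.2, 0.5, 0.7, 0.5, 0.9])
for name, value in stats.items():
    print(name, round(value, 3))
```

The three statistics share the numerator (concordant minus discordant) and differ only in the denominator, which is why Gamma is never smaller in magnitude than Somers’ D.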
40
Predicted Values
The output of a logit model is the
predicted probability of a success for
each observation.
41
Predicted Values
These are obtained and stored in a
separate SAS data set using the OUTPUT
statement (see the following code).
42
Predicted Values
PROC LOGISTIC outputs the predicted
values and 95% CI limits to an output
data set that also contains the original
raw data.
43
Predicted Values
Use the PREDPROBS = I option of the
OUTPUT statement in order
to obtain the predicted category (which
is saved in the _INTO_ variable).
44
Predicted Values
_FROM_ = The observed response
category = The same value as the
response variable.
45
Predicted Values
_INTO_ = The predicted response
category.
46
Predicted Values
IP_1 = The Individual Probability of a
response of 1.
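What these output variables represent can be sketched in Python for the one-predictor model. The intercept below is hypothetical (it is not reported in these notes), and the 0.5 cutoff is an assumption that corresponds to picking the category with the larger predicted probability.

```python
import math

def predicted_probability(b0, b1, x):
    """Predicted probability of a response of 1 (analogous to IP_1)."""
    eta = b0 + b1 * x               # linear predictor: the log-odds
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_category(prob, cutoff=0.5):
    """Predicted response category (analogous to _INTO_)."""
    return 1 if prob >= cutoff else 0

b0 = -2.0    # hypothetical intercept, for illustration only
b1 = 0.0285  # age coefficient reported in the notes

for age in (20, 40, 60):
    ip_1 = predicted_probability(b0, b1, age)
    print(age, round(ip_1, 3), predicted_category(ip_1))
```

Scoring a new observation amounts to plugging its x values into the same function.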
47
Scoring Observations in SAS
Obtaining predicted probabilities and/or
predicted outcomes (categories) for new
observations (i.e., scoring new
observations) is done in logit modeling
using the same procedure we used in
scoring new observations in linear
regression.
48
Scoring Observations in SAS
1. Create a new data set with the desired
values of the x variables and the y
variable set to missing.
2. Merge the new data set with the
original data set.
3. Refit the final model with PROC
LOGISTIC, including the OUTPUT
statement.
49
Classification Table & Rates
A Classification Table is used to
summarize the results of the predictions
and to ultimately evaluate the fit of
the model.
Obtain a classification table using PROC
FREQ.
50
Classification Table & Rates
The observed (or actual) response is in
rows and the predicted response is in
columns.
51
Classification Table & Rates
Correct classifications are summarized
on the main diagonal.
52
Classification Table & Rates
The total number of correct
classifications (i.e., ‘hits’) is the sum of
the main diagonal frequencies.
O = 130+9 = 139
53
Classification Table & Rates
The total-group hit rate is the ratio of O
and N.
HR = 139/196 = .709
54
Classification Table & Rates
Individual group hit rates can also be
calculated. These are essentially the row
percents on the main diagonal.
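The classification table and hit rates can be sketched in Python as follows. The observed and predicted values are hypothetical toy data, not the Outbreak results; only the final line reuses the diagonal counts (130 and 9) and N = 196 quoted in the notes.

```python
from collections import Counter

def classification_table(observed, predicted):
    """Cell counts keyed by (observed, predicted): rows observed, columns predicted."""
    return Counter(zip(observed, predicted))

def hit_rates(table, categories=(0, 1)):
    """Total-group hit rate and individual group hit rates (row percents on the diagonal)."""
    n = sum(table.values())
    hits = sum(table[(c, c)] for c in categories)
    by_group = {
        c: table[(c, c)] / sum(table[(c, k)] for k in categories)
        for c in categories
    }
    return hits / n, by_group

# Hypothetical toy data: 6 observed failures (0) and 4 observed successes (1)
observed = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

table = classification_table(observed, predicted)
total_rate, group_rates = hit_rates(table)
print(total_rate, group_rates)

# Total-group hit rate from the counts quoted in the notes
outbreak_hit_rate = (130 + 9) / 196
print(round(outbreak_hit_rate, 3))
```

A high total-group hit rate can hide a poor hit rate for the smaller group, so the individual group rates are worth reporting alongside it.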
55