PROC FREQ and PROC LOGISTIC


Simple Logistic Regression
An introduction to PROC FREQ and PROC LOGISTIC
Introduction to Logistic Regression
Logistic Regression is used when the outcome
variable of interest is categorical, rather than
continuous. Examples include: death vs. no
death, recovery vs. no recovery, obese vs. not
obese, etc. All of the examples you will see in
this class have binary outcomes, meaning there
are only two possible outcomes.
Simple Logistic Regression has only one predictor
variable. When that predictor is also binary, the
model reproduces a quantity you may already be
familiar with: the odds ratio.
Simple Logistic Regression: An example
Imagine you are interested in investigating
whether there is a relationship between
race and party identification. Race (Black
or White) is the independent variable, and
Party Identification (Democrat or
Republican) is the dependent variable.
Consider the following table:
Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.
Race x Party Identification
         Democrat   Republican
Black       103         11
White       341        405
The odds ratio (OR) for being a Democrat, Black
vs. White, is:
• OR = (103/11)/(341/405) =
(103×405)/(341×11) = 11.12
• Blacks have 11.12 times the odds of
being a Democrat that Whites have.
The odds ratio for being a Republican, Black
vs. White, is:
• OR = (11/103)/(405/341) = (11×341)/(405×103) =
0.09
• Blacks have 91% (1 − 0.09) lower odds of
being a Republican than Whites.
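The hand calculations above can be double-checked outside SAS. The following is a minimal Python sketch, not part of the SAS workflow in this lesson; the function name odds_ratio and the variable names are made up for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as
              outcome   no outcome
    group 1      a          b
    group 2      c          d
    """
    return (a / b) / (c / d)

# Black: 103 Democrats, 11 Republicans; White: 341 Democrats, 405 Republicans
or_democrat = odds_ratio(103, 11, 341, 405)    # Democrat, Black vs. White
or_republican = odds_ratio(11, 103, 405, 341)  # Republican, Black vs. White

print(f"{or_democrat:.2f}")    # 11.12
print(f"{or_republican:.2f}")  # 0.09
```

The two odds ratios are reciprocals of each other, so switching the outcome of interest flips the OR around 1.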
Odds Ratios in SAS
Copy the following code into SAS:
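/* Enter the 2x2 table, one row per cell. count is the cell frequency. */
DATA partyid;
   INPUT race $ party $ count;
   DATALINES;
B D 103
B R 11
W D 341
W R 405
;
RUN;
/* Print the data set to check that it was read correctly. */
PROC PRINT DATA = partyid;
RUN;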
Odds Ratios with PROC FREQ
There are two ways to get Odds Ratios in SAS
when there is one predictor and one outcome
variable. The first is with PROC FREQ. Type
the following code into SAS:
PROC FREQ DATA = partyid;
   WEIGHT count;
   TABLES race * party / chisq relrisk;
RUN;
Notes about the SAS code:
• The WEIGHT statement tells SAS that each row
of the data set stands for as many subjects
as the value of the variable you specify.
When you have a table you want to enter
into SAS, it is often easier to use a “count”
variable than to list each subject
individually. Because the table describes
860 subjects, we would have to type out
860 separate datalines if we did not use
the “count” variable and the “weight count”
statement.
• TABLES tells SAS to construct a table with
the two specified variables (in this case,
race and party).
• The chisq option requests all Chi-Square
statistics.
• The relrisk option gives you estimates of
the odds ratio and relative risks for the two
columns.
Output from PROC FREQ
Reading the Table
• Each cell has four numbers: count,
percent, row %, and column %
• There are 103 Black Democrats, which is
11.98% of the total sample.
• 90.35% of Blacks are Democrats.
• 23.20% of Democrats are Black (103 of 444).
Compare this to 2.64% of Republicans
who are Black.
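The four numbers in each cell can be reproduced from the counts alone. Below is a small Python sketch for checking them by hand; it is not SAS output, and the names counts and cell_summary are hypothetical.

```python
# Cell counts from the Race x Party Identification table
counts = {
    ("Black", "Democrat"): 103, ("Black", "Republican"): 11,
    ("White", "Democrat"): 341, ("White", "Republican"): 405,
}

total = sum(counts.values())

def cell_summary(race, party):
    """Return (count, percent of total, row percent, column percent)."""
    n = counts[(race, party)]
    row_total = sum(v for (r, _), v in counts.items() if r == race)
    col_total = sum(v for (_, p), v in counts.items() if p == party)
    return (n, 100 * n / total, 100 * n / row_total, 100 * n / col_total)

n, pct, row_pct, col_pct = cell_summary("Black", "Democrat")
print(n, f"{pct:.2f}", f"{row_pct:.2f}", f"{col_pct:.2f}")
```

Row percent divides by the number of people of that race; column percent divides by the number of people in that party.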
Interpreting Chi-Square Statistic
The Chi-Square (χ²) test statistic tests the
null hypothesis that two variables are
independent versus the alternative that
they are not independent (that is, related).
Ho: race and party identification are
independent
Ha: race and party identification are
associated
χ² = 78.9082, p-value < 0.0001.
Reject Ho. Conclude that race and party
identification are associated.
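To see where the Chi-Square statistic comes from, here is a short Python sketch of the Pearson calculation. It assumes no continuity correction. The p-value line relies on the fact that a Chi-Square with 1 degree of freedom is a squared standard normal. The variable names are illustrative only.

```python
import math

# Observed counts: rows are Black, White; columns are Democrat, Republican
observed = [[103, 11], [341, 405]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

# For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"{chi_square:.4f}")
print(p_value < 0.0001)
```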
Output of Odds Ratio
Interpreting the Odds Ratio
You can find the OR in the SAS output under
“Case-Control (Odds Ratio).”
The odds ratio is 11.12 with a 95%
Confidence Interval of [5.87, 21.05].
Because this C.I. does not contain 1, we
know that the OR is statistically significant.
Blacks have 11.12 times the odds of
being Democratic that Whites have.
A note about the PROC FREQ table:
Notice the way the table is set up in SAS:
         Dem    Rep
Black    103     11
White    341    405
When calculating the OR in PROC FREQ, SAS
will alphabetize the table, and this affects the
OR it will calculate. SAS is calculating the odds
of being a Democrat for Blacks versus Whites
(or the odds of being Black for Democrats
versus Republicans). If you wanted the odds of
being Democratic for Whites versus Blacks, you
would have to either calculate this by hand or
use PROC LOGISTIC.
Odds Ratio with PROC LOGISTIC
To simplify our data set, we will change our
variables to have values of 1 and 0, rather than
B/W and D/R. If someone is Black, s/he will have
a value of “1” for the variable “race2.” Whites will
have a value of “0.” If someone is a Democrat,
s/he will have a value of “1” for “party2.”
Republicans will have a value of “0.” Type the
following code into SAS, which creates a new
data set called “partyid2”:
DATA partyid2;
   SET partyid;
   /* race2 = 1 for Black, 0 for White */
   IF race = "B" THEN race2 = 1; ELSE race2 = 0;
   /* party2 = 1 for Democrat, 0 for Republican */
   IF party = "D" THEN party2 = 1; ELSE party2 = 0;
RUN;
PROC LOGISTIC
Once you have created the new data set, do regression
analysis on the data, using PROC LOGISTIC (notice the
format is similar to that of linear regression, with the
model statement y = x):
PROC LOGISTIC DESCENDING DATA = partyid2;
   WEIGHT count;
   MODEL party2 = race2;
RUN;
• The DESCENDING option tells SAS to model the probability that
“party2” = 1 (Democratic). If you did not include the
DESCENDING option, SAS would model the probability
that “party2” = 0 (Republican). All subsequent
interpretations will be in terms of the odds of being
Democratic, not Republican.
PROC LOGISTIC Output
Interpreting the Output
From PROC LOGISTIC, we now have an
equation for our log(odds):
Log(odds) = β0 + β1x
Log(odds) = -0.1720 + 2.4088x
where x = 1 if the person is Black and x = 0
if the person is White.
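With a single binary predictor, the fitted equation reproduces the observed odds in each group, so both coefficients can be recovered from the table. The Python sketch below illustrates this by hand; it is a check on the arithmetic, not a substitute for PROC LOGISTIC, and the variable names are made up.

```python
import math

# Cell counts from the Race x Party Identification table
black_dem, black_rep = 103, 11
white_dem, white_rep = 341, 405

# Intercept: log(odds) of being a Democrat when x = 0 (White)
beta0 = math.log(white_dem / white_rep)

# Slope: change in log(odds) when x goes from 0 to 1 (White to Black),
# which is the log of the odds ratio
beta1 = math.log(black_dem / black_rep) - beta0

print(f"{beta0:.4f} {beta1:.4f}")  # -0.1720 2.4088
```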
Calculating the Odds Ratio
Suppose we wanted to know the odds ratio for being a
Democrat for Blacks vs. Whites.
• The log(odds) of being Democratic for Blacks is:
β0 + β1(1) = β0 + β1
• The log(odds) of being Democratic for Whites is:
β0 + β1(0) = β0.
• To calculate the OR, take the log(odds) for Blacks minus
the log(odds) for Whites:
β0 + β1 – (β0) = β1
• Then exponentiate this value:
exp(β1) = exp(2.4088) = 11.12
This is the same OR calculated earlier using PROC FREQ.
In addition, it is given to you in the PROC LOGISTIC
output under “Odds Ratio Estimates” with the 95% C.I.
Calculating the OR, cont.
Suppose we wanted to know the odds ratio for
being a Democrat for Whites vs. Blacks.
• To calculate the OR, take the log(odds) for
Whites minus the log(odds) for Blacks:
β0 – (β0 + β1) = -β1
• Then exponentiate this value:
exp(-β1) = exp(-2.4088) = 0.0899
Whites have 91% (1 − 0.0899) lower
odds of being Democratic than Blacks.
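Both odds ratios come from exponentiating the same coefficient with opposite signs. A quick Python check, using the slope 2.4088 quoted above (variable names are illustrative):

```python
import math

beta1 = 2.4088  # slope reported by PROC LOGISTIC in the text

or_black_vs_white = math.exp(beta1)   # Democrat, Black vs. White
or_white_vs_black = math.exp(-beta1)  # Democrat, White vs. Black

print(f"{or_black_vs_white:.2f}")  # 11.12
print(f"{or_white_vs_black:.4f}")  # 0.0899
print(f"{100 * (1 - or_white_vs_black):.0f}% lower odds")  # 91% lower odds
```

Because exp(-β1) = 1/exp(β1), reversing the comparison group always gives the reciprocal of the original OR.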
Significance Testing
Testing the significance of a parameter
estimate can be done by constructing a
confidence interval around that parameter
estimate.
If the C.I. for an estimate (or log(OR))
contains 0, the variable is not significantly
associated with the outcome.
If the C.I. for an OR contains 1, the variable
is not significantly associated with the
outcome.
The Wald Chi-Square statistic tests whether
the parameter estimate equals zero, that is
Ho: β1 = 0 vs. Ha: β1 ≠ 0.
From the output, we see that the p-value of
this test is < 0.0001, so we reject Ho and
conclude that race is significantly related
to party identification.
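The Wald Chi-Square is the squared ratio of the estimate to its standard error. The Python sketch below recomputes it from the rounded values quoted in this lesson (β1 = 2.4088, se = 0.3256), so it may differ from the SAS output in the last digits. The variable names are hypothetical.

```python
import math

beta1 = 2.4088     # estimate quoted in the text
se_beta1 = 0.3256  # its standard error, also quoted in the text

# Wald chi-square: squared ratio of the estimate to its standard error
wald_chi_square = (beta1 / se_beta1) ** 2

# Upper-tail probability for a chi-square with 1 degree of freedom
p_value = math.erfc(math.sqrt(wald_chi_square / 2))

print(f"{wald_chi_square:.2f}")
print(p_value < 0.0001)
```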
Confidence Interval Construction
Confidence interval construction is similar to what
you have seen for linear regression, except that
it is now on the log(odds) scale:
95% C.I. for β1 = β1 +/- 1.96*se(β1)
= 2.4088 +/- 1.96*(0.3256)
= [1.77, 3.05]. This C.I. does not contain 0.
Exponentiating both endpoints gives the C.I. for the OR:
[5.875, 21.052]. This C.I. does not contain 1.
Notice that [5.875, 21.052] is also the 95% C.I. for
the OR given in the SAS output.
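The interval arithmetic can be sketched in a few lines of Python. The last step is an added assumption, not something shown in the SAS output: for a 2x2 table, the standard error of the log(OR) equals the square root of the sum of the reciprocals of the four cell counts. Variable names are illustrative.

```python
import math

beta1 = 2.4088     # slope from the PROC LOGISTIC output
se_beta1 = 0.3256  # its standard error
z = 1.96           # normal quantile for a 95% interval

# Interval on the log(odds ratio) scale
lower = beta1 - z * se_beta1
upper = beta1 + z * se_beta1

# Exponentiate the endpoints to get the interval for the odds ratio
or_lower, or_upper = math.exp(lower), math.exp(upper)

# Assumption: for a 2x2 table the standard error equals
# sqrt(1/a + 1/b + 1/c + 1/d) over the four cell counts
se_from_counts = math.sqrt(1/103 + 1/11 + 1/341 + 1/405)

print(f"[{lower:.2f}, {upper:.2f}]")        # [1.77, 3.05]
print(f"[{or_lower:.2f}, {or_upper:.2f}]")  # [5.87, 21.05]
print(f"{se_from_counts:.4f}")              # 0.3256
```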
Calculating the Probability
If you were asked to calculate the probability
that someone is a Democrat, given that the
person is Black, you would use the following
formula:
π (probability) = exp(log(odds))/[1 +
exp(log(odds))]
π = exp(-0.1720 + 2.4088)/[1 + exp(-0.1720 + 2.4088)] = 0.9035
A Black person has a 90.35% chance of
being a Democrat.
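The same formula works for either group by changing x. Here is a minimal Python sketch using the estimates quoted in this lesson; the function name probability is made up for illustration.

```python
import math

def probability(beta0, beta1, x):
    """Convert log(odds) = beta0 + beta1 * x into a probability."""
    log_odds = beta0 + beta1 * x
    return math.exp(log_odds) / (1 + math.exp(log_odds))

beta0, beta1 = -0.1720, 2.4088  # estimates from the PROC LOGISTIC output

p_black = probability(beta0, beta1, 1)  # x = 1: Black
p_white = probability(beta0, beta1, 0)  # x = 0: White

print(f"{p_black:.4f}")  # 0.9035
print(f"{p_white:.4f}")  # 0.4571
```

These agree with the row percentages in the table: 103/114 of Blacks and 341/746 of Whites are Democrats.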
Summary
This has been an introduction to calculating
odds ratios in PROC FREQ and PROC
LOGISTIC. The next section will introduce
you to multiple predictors in logistic
regression, including interactions.