17 An Introduction to Logistic Regression


An Introduction to Logistic Regression
GV917: For Categorical Dependent Variables
What do we do when the dependent variable
in a regression is a dummy variable?





Suppose we have the dummy variable turnout:
1 – if a survey respondent turns out to vote
0 – if they don’t vote
One thing we could do is simply run an ordinary least
squares regression and interpret the above variable as a
probability – where 1 means that something is certain to
happen and 0 means that it is certain not to happen
The problem with doing this is that we observe only the 1-or-0 outcomes of an
underlying, unmeasured scale on which a particular individual might have, say,
a 0.55 probability of voting
The Data (N = 30)

Case   Turnout   Interest
 1      1.00      4.00
 2       .00      1.00
 3      1.00      4.00
 4      1.00      3.00
 5      1.00      3.00
 6       .00      1.00
 7      1.00      3.00
 8      1.00      2.00
 9      1.00      3.00
10       .00      1.00
11       .00      2.00
12      1.00      2.00
13      1.00      2.00
14       .00      1.00
15      1.00      2.00
16      1.00      4.00
17       .00      2.00
18      1.00      2.00
19       .00      1.00
20      1.00      1.00
21       .00      2.00
22      1.00      4.00
23      1.00      1.00
24      1.00      2.00
25      1.00      1.00
26       .00      2.00
27      1.00      3.00
28      1.00      3.00
29      1.00      4.00
30      1.00      3.00

Turnout: 1 = yes, 0 = no
Interest in the Election: 1 = not at all interested, 2 = not very interested,
3 = fairly interested, 4 = very interested
OLS Regression of Turnout on Interest – The
Linear Probability Model
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .540   .291       .266                .39930
a. Predictors: (Constant), interest

ANOVA
Model 1        Sum of Squares   df   Mean Square   F        Sig.
Regression     1.836             1   1.836         11.513   .002
Residual       4.464            28    .159
Total          6.300            29
a. Predictors: (Constant), interest
b. Dependent Variable: turnout

Coefficients
Model 1       B      Std. Error   Beta   t       Sig.
(Constant)    .152   .177                 .856   .399
interest      .238   .070         .540   3.393   .002
a. Dependent Variable: turnout
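These estimates can be checked outside SPSS. As a rough illustration (not part of the original slides), here is a minimal Python sketch, assuming numpy and statsmodels are available, that rebuilds the 30 cases from the counts in the data table above and fits the same linear probability model; row order does not affect the fit, so building the cases from the counts gives the same estimates as the raw listing.

```python
import numpy as np
import statsmodels.api as sm

# Rebuild the 30 cases from the data table: (interest level, voters, non-voters).
counts = [(1, 3, 5), (2, 6, 4), (3, 7, 0), (4, 5, 0)]
interest, turnout = [], []
for level, voters, nonvoters in counts:
    interest += [level] * (voters + nonvoters)
    turnout += [1] * voters + [0] * nonvoters

# OLS of turnout on interest: the linear probability model.
X = sm.add_constant(np.array(interest, dtype=float))
lpm = sm.OLS(np.array(turnout, dtype=float), X).fit()

print(lpm.params)               # roughly (.152, .238), matching the coefficients table
print(lpm.fittedvalues.max())   # about 1.105 - already a "probability" above one
```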
The Residuals of the OLS Turnout Regression
Casewise Diagnostics (Dependent Variable: turnout)

Case   Std. Residual   turnout   Predicted Value   Residual
 1       -.264          1.00        1.1053          -.10526
 2       -.977           .00         .3901          -.39009
 3       -.264          1.00        1.1053          -.10526
 4        .333          1.00         .8669           .13313
 5        .333          1.00         .8669           .13313
 6       -.977           .00         .3901          -.39009
 7        .333          1.00         .8669           .13313
 8        .930          1.00         .6285           .37152
 9        .333          1.00         .8669           .13313
10       -.977           .00         .3901          -.39009
11      -1.574           .00         .6285          -.62848
12        .930          1.00         .6285           .37152
13        .930          1.00         .6285           .37152
14       -.977           .00         .3901          -.39009
15        .930          1.00         .6285           .37152
16       -.264          1.00        1.1053          -.10526
17      -1.574           .00         .6285          -.62848
18        .930          1.00         .6285           .37152
19       -.977           .00         .3901          -.39009
20       1.527          1.00         .3901           .60991
21      -1.574           .00         .6285          -.62848
22       -.264          1.00        1.1053          -.10526
23       1.527          1.00         .3901           .60991
24        .930          1.00         .6285           .37152
25       1.527          1.00         .3901           .60991
26      -1.574           .00         .6285          -.62848
27        .333          1.00         .8669           .13313
28        .333          1.00         .8669           .13313
29       -.264          1.00        1.1053          -.10526
30        .333          1.00         .8669           .13313
What’s Wrong?

The model produces predicted probabilities that exceed 1.0 (the fitted value
for the most interested respondents is 1.1053), which makes no sense.

The t and F test statistics are not valid, because the residuals from a binary
dependent variable are heteroscedastic and so violate the assumptions those
tests require.

We can correct for the heteroscedasticity, but a better option is to use a
logistic regression model.
Some Preliminaries needed for Logistic Regression

Odds Ratios

These are defined as the probability of an event occurring divided by the
probability of it not occurring. Thus if p is the probability of an event:

Odds = p / (1 - p)

For example, in the 2005 British Election Study face-to-face survey 48.2 per
cent of the sample were men and 51.8 per cent were women, so the odds of being
a man were 0.482 / 0.518 = 0.93 and the odds of being a woman were
0.518 / 0.482 = 1.07.

Note that if the odds were 1.00 it would mean that women were as likely as men
to appear in the survey.
Log Odds in the Logistic Regression Model

The natural logarithm of a number is the power to which we must raise e
(2.718) to give the number in question.

So the natural logarithm of 100 is 4.605, because 100 = e^4.605.
This can be written 100 = exp(4.605).

Similarly, the anti-log of 4.605 is 100, because exp(4.605) = 100.

In the 2005 BES study 70.5 per cent of men and 72.9 per cent of women voted.

The odds of men voting were 0.705/0.295 = 2.39, and the log odds were
ln(2.39) = 0.8712.

The odds of women voting were 0.729/0.271 = 2.69, and the log odds were
ln(2.69) = 0.9896.

Note that ln(1.0) = 0, so that when the odds ratio is 1.0 the log odds ratio
is zero.
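As a quick check on the arithmetic, a few lines of Python (an editorial illustration, not part of the slides) reproduce these odds and log odds from the percentages quoted above:

```python
import math

# Odds of being a man / a woman in the 2005 BES sample
print(0.482 / 0.518, 0.518 / 0.482)                # about 0.93 and 1.07

# Odds and log odds of voting, men and women
odds_men, odds_women = 0.705 / 0.295, 0.729 / 0.271
print(odds_men, odds_women)                        # about 2.39 and 2.69
print(math.log(odds_men), math.log(odds_women))    # about 0.87 and 0.99

print(math.log(1.0))                               # odds of exactly 1 give log odds of 0
```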
Why Use Logarithms?

They have three advantages:

Odds vary from 0 to ∞, whereas log odds vary from -∞ to +∞ and are centered
on 0. Odds less than 1 have negative log odds, and odds greater than 1 have
positive log odds. This accords better with the real number line, which runs
from -∞ to +∞.

If we take any two numbers and multiply them together, that is equivalent to
adding their logs. Logs therefore make it possible to convert multiplicative
models into additive models, a useful property because logistic regression is
a non-linear, multiplicative model when not expressed in logs.

A useful statistic for evaluating the fit of models is -2*loglikelihood (also
known as the deviance). The model has to be expressed in logarithms for this
to work.
Logistic Regression

ln[ p̂(y) / (1 - p̂(y)) ] = a + bXi

where p̂(y) is the predicted probability of being a voter and 1 - p̂(y) is the
predicted probability of not being a voter.

If we express this in terms of anti-logs, or odds ratios, then

p̂(y) / (1 - p̂(y)) = exp(a + bXi)

and

p̂(y) = exp(a + bXi) / (1 + exp(a + bXi))
The Logistic Function

The logistic function can never be greater than one (or less than zero), so
there are no impossible probabilities.

It also corrects the problems with the test statistics.
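A tiny sketch in plain Python (an illustration, not something the slides use) makes the bounding property concrete:

```python
import math

def logistic(z):
    """The logistic function: squeezes any real number into the (0, 1) interval."""
    return math.exp(z) / (1 + math.exp(z))

for z in (-6, -2, 0, 2, 6):
    print(z, round(logistic(z), 4))   # values approach 0 and 1 but never pass them
```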
Estimating a Logistic Regression

In OLS regression the least squares solution can be defined
analytically – there are equations called the Normal Equations which
we use to find the values of a and b. In logistic regression there are
no such equations. The solutions are derived iteratively – by a
process of trial and error.

Doing this involves identifying a likelihood function. A likelihood is a
measure of how typical a sample is of a given population. For
example we can calculate how typical the ages of the students in
this class are in comparison with students in the university as a
whole. Applied to our regression problem we are working out how
likely individuals are to be voters given their level of interest in the
election and given values for the a and b coefficients.


We ‘try out’ different values of a and b and the maximum likelihood
estimation identifies the values which are most likely to reproduce
the distribution of voters and non-voters we see in the sample, given
their levels of interest in the election.
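To give a feel for what this iterative search looks like, here is a minimal sketch (in Python with numpy, which the lecture does not use; SPSS does the equivalent internally) of Newton-Raphson maximisation of the logistic likelihood for the turnout data, with the cases rebuilt from the counts in the data table above:

```python
import numpy as np

# Rebuild the 30 cases: (interest level, voters, non-voters) read off the data table.
counts = [(1, 3, 5), (2, 6, 4), (3, 7, 0), (4, 5, 0)]
x, y = [], []
for level, voters, nonvoters in counts:
    x += [level] * (voters + nonvoters)
    y += [1] * voters + [0] * nonvoters
x, y = np.array(x, float), np.array(y, float)

X = np.column_stack([np.ones_like(x), x])        # design matrix: constant + interest
beta = np.zeros(2)                               # starting guess for (a, b)

for iteration in range(25):
    p = 1 / (1 + np.exp(-X @ beta))              # predicted probabilities at current guess
    score = X.T @ (y - p)                        # gradient of the log likelihood
    info = X.T @ (X * (p * (1 - p))[:, None])    # information matrix
    step = np.linalg.solve(info, score)
    beta = beta + step
    if np.abs(step).max() < 1e-6:                # stop when the estimates barely change,
        break                                    # much as SPSS stops at iteration 6

print(beta)   # should settle near (-2.582, 1.742), the values in the SPSS output later
```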
A Note on Maximum Likelihood Estimation

Define the probability of getting a head when tossing a fair coin as
p(H) = 0.5, so that the probability of a tail is 1 - p(H) = 0.5. The
probability of two heads followed by a tail is therefore:

P(H, H, T) = (0.5)(0.5)(0.5) = 0.125

We can get this outcome in 3 different ways (the tail can come first, second
or third), so the probability of getting 2 heads and a tail in any order is
0.125(3) = 0.375.

But suppose we did not know the value of p(H). We could ‘try out’ different
values and see how well they fitted an experiment consisting of repeated sets
of three coin tosses. For example, if we thought p(H) = 0.4, then two heads
and a tail would give (0.4)(0.4)(0.6)(3) = 0.288. If we thought it was 0.3 we
would get (0.3)(0.3)(0.7)(3) = 0.189.

If the coin really is fair, repeated experiments would produce 2 heads and a
tail 0.375 of the time, so assuming the probability was 0.288 or 0.189 would
be wrong. Maximum likelihood estimation means finding the best-fitting value
by trying out different values of p(H).
Maximum Likelihood in General

More generally, we can write a likelihood function for this exercise:

LF = 3 * p^2 * (1 - p)

where p is the probability of getting a head and the multiplier 3 is the
number of orderings in which the sequence (two heads and a tail) can occur.

Among the values tried above (0.5, 0.4 and 0.3) the likelihood is highest at
p = 0.5; over all possible values the function peaks at p = 2/3, which is
therefore the maximum likelihood estimate of p for the observed two heads and
a tail.
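A few lines of Python (an editorial illustration, not part of the slides) evaluate this likelihood function at the values tried above and at its maximum:

```python
def likelihood(p):
    # L(p) = 3 * p^2 * (1 - p): two heads and a tail, with 3 possible orderings
    return 3 * p ** 2 * (1 - p)

for p in (0.3, 0.4, 0.5, 2 / 3):
    print(round(p, 3), round(likelihood(p), 3))   # 0.189, 0.288, 0.375 and the peak ~0.444
```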
Explaining Variance

In OLS regression we defined the following expression:

Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²

or

Total Variation = Explained Variation + Residual Variation

In logistic regression measures of the
Deviance replace the sum of squares as the
building blocks of measures of fit and
statistical tests.
Deviance

Deviance measures are built from the maximum likelihoods of different models.
For example, suppose we fit a model with no slope coefficient (b) but an
intercept coefficient (a). We can call this model zero because it has no
predictors. We then fit a second model, called model one, which has both a
slope and an intercept. We can then form the ratio of the maximum likelihoods
of these models:

Likelihood ratio = maximum likelihood of model zero / maximum likelihood of model one

Expressed in logs this becomes:

Log likelihood ratio = ln(maximum likelihood of model zero) - ln(maximum likelihood of model one)

Note that squaring the likelihood ratio is equivalent to doubling the log
likelihood ratio.

The Deviance is defined as -2(log likelihood ratio)
What does this mean?
The maximum likelihood of model zero is
analogous to the total variation in OLS and the
maximum likelihood of model one is analogous
to the explained variation. If the maximum
likelihoods of models zero and one were the
same, then the likelihood ratio would be 1 and
the log likelihood ratio 0.
This would mean that model one was no better
than model zero in accounting for turnout, so the
deviance captures how much we improve things
by taking into account interest in the election.
The bigger the deviance the more the
improvement
SPSS Output from the Logistic Regression of Turnout

Omnibus Tests of Model Coefficients
Step 1     Chi-square   df   Sig.
  Step     10.757        1   .001
  Block    10.757        1   .001
  Model    10.757        1   .001

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      25.894              .301                   .427
a. Estimation terminated at iteration number 6 because parameter estimates
changed by less than .001.
Classification Table
                      Predicted turnout
Observed turnout      .00    1.00    Percentage Correct
  .00                  5      4      55.6
  1.00                 3     18      85.7
Overall Percentage                   76.7
a. The cut value is .500

Variables in the Equation
Step 1      B        S.E.    Wald    df   Sig.   Exp(B)
interest    1.742     .697   6.251    1   .012   5.708
Constant   -2.582    1.302   3.934    1   .047    .076
a. Variable(s) entered on step 1: interest.
The Meaning of the Omnibus Test

Omnibus Tests of Model Coefficients
Step 1     Chi-square   df   Sig.
  Step     10.757        1   .001
  Block    10.757        1   .001
  Model    10.757        1   .001

SPSS starts by fitting what it calls Block 0, which is the model containing the constant
term and no predictor variables. It then proceeds to Block 1 which fits the model with
the predictor variable and gives us another estimate of the likelihood function. These
two can then be compared and the table shows a chi-square statistical test of the
improvement in the model achieved by adding interest in the election to the equation.
This chi-square statistic is significant at the 0.001 level. In a multiple logistic
regression this table tells us how much all of the predictor variables improve things
compared with model zero.

We have significantly improved on the baseline model by adding the variable
interest to the equation
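The chi-square of 10.757 is simply the drop in -2 log likelihood between model zero and model one. A short back-of-the-envelope check in Python (an illustration, using the 21 voters and 9 non-voters in the sample and the -2 log likelihood reported in the Model Summary):

```python
import math

voters, nonvoters = 21, 9                 # observed turnout in the 30-case sample
p0 = voters / (voters + nonvoters)        # model zero: constant only, p = 0.7 for everyone
dev0 = -2 * (voters * math.log(p0) + nonvoters * math.log(1 - p0))   # about 36.65
dev1 = 25.894                             # -2 log likelihood of model one (SPSS output)

print(dev0 - dev1)                        # about 10.76, the omnibus chi-square
```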
The Model Summary Table

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      25.894              .301                   .427
a. Estimation terminated at iteration number 6 because parameter estimates
changed by less than .001.

The -2 loglikelihood statistic for our two variable model appears in
the table, but it is only really meaningful for comparing different
models. The Cox and Snell and the Nagelkerke R Squares are
different ways of approximating the percentage of variance
explained (R square) in multiple regression. The Cox and Snell
statistic is problematic because it has a maximum value of 0.75.
The Nagelkerke R square corrects this and has a maximum value of
1.0, so it is often the preferred measure.
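Both pseudo R-squares can be reproduced from the two -2 log likelihoods. A minimal sketch, assuming the standard Cox & Snell and Nagelkerke formulas and the figures quoted above:

```python
import math

n = 30
dev0, dev1 = 36.652, 25.894               # -2 log likelihoods of model zero and model one

cox_snell = 1 - math.exp(-(dev0 - dev1) / n)         # about .301
nagelkerke = cox_snell / (1 - math.exp(-dev0 / n))   # rescaled so the maximum is 1; about .427
print(round(cox_snell, 3), round(nagelkerke, 3))
```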
The Classification Table

Classification Table
                      Predicted turnout
Observed turnout      .00    1.00    Percentage Correct
  .00                  5      4      55.6
  1.00                 3     18      85.7
Overall Percentage                   76.7
a. The cut value is .500

The classification table tells us the extent to which the
model correctly predicts the actual turnout, so it is
another goodness-of-fit measure. The main diagonal
(top left to bottom right) contains the cases predicted
correctly (23), whereas the off-diagonal cells contain the
cases predicted incorrectly (7). So overall 76.7 per cent
of the cases are predicted correctly.
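The whole table can be rebuilt from the fitted equation: predict a probability for each case, call anyone above the 0.5 cut a predicted voter, and cross-tabulate against observed turnout. A sketch in Python, using the coefficients reported on the next slide and the counts from the data table (an illustration, not SPSS's own procedure):

```python
import math

a, b = -2.582, 1.742                                   # coefficients from the fitted model
counts = {1: (3, 5), 2: (6, 4), 3: (7, 0), 4: (5, 0)}  # interest: (voters, non-voters)

table = {(obs, pred): 0 for obs in (0, 1) for pred in (0, 1)}
for level, (voters, nonvoters) in counts.items():
    p = math.exp(a + b * level) / (1 + math.exp(a + b * level))
    pred = 1 if p >= 0.5 else 0                        # the cut value is .500
    table[(1, pred)] += voters
    table[(0, pred)] += nonvoters

print(table)                                 # {(0,0): 5, (0,1): 4, (1,0): 3, (1,1): 18}
print((table[(0, 0)] + table[(1, 1)]) / 30)  # about 0.767 of cases classified correctly
```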
Interpreting the Coefficients

Variables in the Equation
Step 1      B        S.E.    Wald    df   Sig.   Exp(B)
interest    1.742     .697   6.251    1   .012   5.708
Constant   -2.582    1.302   3.934    1   .047    .076
a. Variable(s) entered on step 1: interest.

The B column gives the coefficients of the logistic regression model: a unit
change in the level of interest in the election increases the log odds of
voting by 1.742. The standard error appears in the next column (0.697) and the
Wald statistic in the third. The Wald statistic is the square of the t
statistic (the coefficient divided by its standard error), here 6.251, and as
we can see it is significant at the 0.012 level. Finally, Exp(B) is the
anti-log of the B column, so that e^1.742 = 5.708; this is the odds ratio for
the effect, that is, the effect on the odds of voting of a one-unit increase
in interest in the election. Since odds ratios are rather easier to understand
than log odds, effects are often reported using these coefficients. The
coefficient says that a unit increase in interest in the election multiplies
the odds of voting relative to not voting by nearly 6.0, a big effect.
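For comparison, the same model can be fitted outside SPSS. A minimal sketch using Python's statsmodels (an assumption about the reader's toolkit, not something the lecture uses) that should reproduce B and Exp(B) up to rounding:

```python
import numpy as np
import statsmodels.api as sm

# Rebuild the 30 cases from the counts in the data table.
counts = [(1, 3, 5), (2, 6, 4), (3, 7, 0), (4, 5, 0)]   # (interest, voters, non-voters)
interest, turnout = [], []
for level, voters, nonvoters in counts:
    interest += [level] * (voters + nonvoters)
    turnout += [1] * voters + [0] * nonvoters

X = sm.add_constant(np.array(interest, dtype=float))
fit = sm.Logit(np.array(turnout, dtype=float), X).fit()

print(fit.params)           # roughly (-2.582, 1.742): the Constant and interest rows of B
print(np.exp(fit.params))   # roughly (.076, 5.708): the Exp(B) column
```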
Making Sense of the Coefficients

ln[ p̂(y) / (1 - p̂(y)) ] = -2.582 + 1.742Xi

so that

p̂(y) = exp(-2.582 + 1.742Xi) / (1 + exp(-2.582 + 1.742Xi))
Translating into Probabilities

Suppose a person scores 4 on the interest in the election variable (they are
very interested). Then according to the model the probability that they will
vote is:

p̂(y) = exp(-2.582 + 1.742(4)) / (1 + exp(-2.582 + 1.742(4)))
     = exp(4.386) / (1 + exp(4.386)) = 0.99

If they are not at all interested and score 1, then:

p̂(y) = exp(-2.582 + 1.742(1)) / (1 + exp(-2.582 + 1.742(1)))
     = exp(-0.84) / (1 + exp(-0.84)) = 0.30

Consequently a change from being not at all interested to being very
interested increases the probability of voting by 0.99 - 0.30 = 0.69
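The same arithmetic applied to every level of interest gives the table that follows. A short Python check (coefficients as reported above; an illustration rather than part of the lecture):

```python
import math

a, b = -2.582, 1.742                         # coefficients from the logistic regression
for interest in (1, 2, 3, 4):
    z = a + b * interest
    p = math.exp(z) / (1 + math.exp(z))      # predicted probability of voting
    print(interest, round(p, 2))             # roughly 0.30, 0.71, 0.93, 0.99
```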
Probabilities of voting at different levels of interest in the election

Level of Interest   Probability of Voting
1                   0.30
2                   0.71
3                   0.93
4                   0.99
Obviously being interested has a big effect on
whether an individual votes.
Conclusions



Logistic regression allows us to model
relationships when the dependent variable is
a dummy variable.
It can be extended to multinomial logistic
regression in which there are several
categories – and this produces several sets
of coefficients
The results are more reliable than if we had
just used ordinary least squares regression