Topic 1: Binary Logit Models


Variables in the social sciences are often dichotomous:
• Employed vs. unemployed;
• Married vs. unmarried;
• Guilty vs. innocent;
• Voted vs. didn't vote.


• Social scientists frequently wish to estimate regression models with a dichotomous dependent variable.
• Most researchers are aware that there is something wrong with using OLS for a dichotomous dependent variable, but they do not know:
  • what makes a dichotomous dependent variable problematic in standard linear regression; and
  • which other methods are superior.



• The focus of this topic is logit analysis (or logistic regression analysis) for a dichotomous dependent variable.
• Logit models have many similarities to OLS regression models.
• We first examine why OLS regression runs into problems when the dependent variable is dichotomous.
Example

Dataset: penalty.txt
• Comprises 147 penalty cases in the state of New Jersey.
• In all cases the defendant was convicted of first-degree murder, with a recommendation by the prosecutor that a death sentence be imposed.
• A penalty trial is conducted to determine whether the defendant should receive the death penalty or life imprisonment.

The dataset comprises the following variables:
DEATH = 1 for a death sentence, 0 for a life sentence;
BLACKD = 1 if the defendant was black, 0 otherwise;
WHITVIC = 1 if the victim was white, 0 otherwise;
SERIOUS = an average rating of the seriousness of the crime evaluated by a panel of judges, ranging from 1 (least serious) to 15 (most serious).
/* OLS regression of the 0/1 outcome DEATH (linear probability model) */
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC REG DATA=PENALTY;
MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;

Remarks on OLS regression output:
• The coefficient for SERIOUS is positive and highly significant.
• Neither of the two racial variables is significantly different from zero.
• $R^2$ is very low.
• The F-test indicates overall significance of the model.
Should we trust these results?

Assumptions of the linear regression model:
1. $y_i = a + b x_i + \varepsilon_i$
2. $E(\varepsilon_i) = 0$
3. $\mathrm{var}(\varepsilon_i) = \sigma^2$ (homoscedasticity)
4. $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ (absence of autocorrelation)
5. The $x_i$'s are treated as fixed
6. $\varepsilon_i \sim$ Normal
• If assumptions 1-5 are satisfied, then the OLS estimators of a and b are B.L.U. (best linear unbiased).
• If all assumptions are satisfied, then the OLS estimators of a and b are M.V.U. (minimum variance unbiased).



• Now, what if y is a dichotomy with possible values of 1 or 0?
• It is still possible to claim that assumptions 1, 2, 4 and 5 are true.
• But if 1 and 2 are true, then 3 and 6 are necessarily false!

Consider assumption 6.
• Note that $\varepsilon_i = y_i - a - b x_i$, so
  if $y_i = 1$, $\varepsilon_i = 1 - a - b x_i$;
  if $y_i = 0$, $\varepsilon_i = -a - b x_i$.
• Because $\varepsilon_i$ can take only two values, it cannot be normally distributed. In fact, it follows a (shifted) binomial distribution with a single trial.
• So $\hat{a}$ and $\hat{b}$ are also not normally distributed in finite samples, and standard inference procedures are no longer valid as a consequence.
• But in large samples, the binomial distribution tends towards the normal distribution.


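Consider assumption 3.
Note that
$$\begin{aligned}
E(y_i) &= 1 \cdot \Pr(y_i = 1) + 0 \cdot \Pr(y_i = 0) \\
&= 1 \cdot p_i + 0 \cdot (1 - p_i) \\
&= p_i
\end{aligned}$$
But from assumptions 1 and 2,
$$\begin{aligned}
E(y_i) &= E(a + b x_i + \varepsilon_i) \\
&= a + b x_i + E(\varepsilon_i) \\
&= a + b x_i
\end{aligned}$$
Therefore, $p_i = a + b x_i$
=> Linear probability model (LPM)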

Accordingly, from our previous output, a 1-point increase in the SERIOUS scale is associated with a 0.038 increase in the probability of a death sentence; the probability of a death sentence for black defendants is 0.12 higher than for non-black defendants, ceteris paribus.

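$$\begin{aligned}
\mathrm{var}(\varepsilon_i) &= E[(\varepsilon_i - E(\varepsilon_i))^2] \\
&= E(\varepsilon_i^2) \\
&= (-a - b x_i)^2 (1 - p_i) + (1 - a - b x_i)^2 p_i \\
&= (a + b x_i)^2 (1 - a - b x_i) + (1 - a - b x_i)^2 (a + b x_i) \\
&= (a + b x_i)(1 - a - b x_i) \\
&= p_i (1 - p_i)
\end{aligned}$$
So $\varepsilon_i$ must be heteroscedastic: its variance changes with $x_i$ through $p_i$. The disturbance variance is at a maximum when $p_i = 0.5$.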
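The variance result above can be checked numerically. The following is a minimal Python sketch; the function name `lpm_error_variance` and the probability grid are illustrative choices, not part of the original material.

```python
def lpm_error_variance(p):
    """Variance of the LPM disturbance when Pr(y = 1) = p, i.e. p(1 - p)."""
    return p * (1.0 - p)

# Evaluate the variance on a grid of probabilities 0.1, 0.2, ..., 0.9
grid = [k / 10 for k in range(1, 10)]
variances = {p: lpm_error_variance(p) for p in grid}

# The grid point at which the variance is largest
p_at_max = max(variances, key=variances.get)

for p in grid:
    print(f"p = {p:.1f}   var = {variances[p]:.2f}")
print("variance is largest at p =", p_at_max)
```

Because the variance depends on $p_i$, and $p_i$ depends on $x_i$, the disturbances cannot share a common variance unless b = 0.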
So, what are the consequences?
• Violation of assumptions 3 and 6 does not lead to biased estimation by OLS (only assumptions 1 and 2 are required for OLS to yield unbiased estimators).
• If the sample size is large enough, the estimators will be approximately normal even when the $\varepsilon_i$'s are not normally distributed.
• Violation of the homoscedasticity assumption means the OLS estimators are no longer efficient. In addition, the estimated standard errors are biased.
Also, the model $p_i = a + b x_i$ is implausible, because $p_i$ is a linear function of $x_i$ and therefore has no upper or lower bound. But it is impossible for the true values (which are probabilities) to be greater than 1 or less than 0!

Odds of an event: the ratio of the expected number of times that an event will occur to the expected number of times it will not occur. For example, odds of 4 mean we expect 4 times as many occurrences as non-occurrences; odds of 5/2 (or 5 to 2) mean we expect 5 occurrences to 2 non-occurrences.

Let p be the probability of an event and o the odds of the event, then
$$o = \frac{p}{1 - p} \quad \text{or} \quad p = \frac{o}{1 + o}$$

Relationship between Odds and Probability

Probability   Odds
0.1           0.11
0.2           0.25
0.3           0.43
0.4           0.67
0.5           1.00
0.6           1.50
0.7           2.33
0.8           4.00
0.9           9.00

• o < 1 => p < 0.5
• o > 1 => p > 0.5
• 0 < o < ∞
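The two conversion formulas can be used to regenerate the table of odds and probabilities. This is a small Python sketch; the helper names `odds_from_prob` and `prob_from_odds` are illustrative.

```python
def odds_from_prob(p):
    """Odds o = p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Probability p = o / (1 + o) for odds o >= 0."""
    return o / (1.0 + o)

# Reproduce the probability/odds table for p = 0.1, ..., 0.9
for k in range(1, 10):
    p = k / 10
    print(f"{p:.1f}   {odds_from_prob(p):.2f}")
```

Applying `prob_from_odds` to the output of `odds_from_prob` returns the original probability, which confirms that the two formulas are inverses of each other.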
Death Sentence by Race of Defendant for 147 Penalty Trials

              Death   Life   Total
Blacks          28     45      73
Non-blacks      22     52      74
Total           50     97     147

Overall odds of a death sentence: $O_D = 50/97 = 0.52$
Odds for black defendants: $O_{D|B} = 28/45 = 0.62$
Odds for non-black defendants: $O_{D|NB} = 22/52 = 0.42$
∴ The ratio of black odds to non-black odds is $0.62/0.42 = 1.476$
=> The odds of a death sentence for blacks are 47.6% higher than for non-blacks, or equivalently the odds of a death sentence for non-blacks are about 0.68 times the corresponding odds for blacks.
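The odds and the odds ratio can be recomputed directly from the cell counts of the table. A minimal Python sketch follows; the variable names are illustrative. Working from unrounded odds gives a ratio of about 1.47, whereas the 1.476 quoted above comes from dividing the rounded odds 0.62 and 0.42.

```python
# Cell counts from the table: (death sentences, life sentences)
black = (28, 45)
non_black = (22, 52)

odds_black = black[0] / black[1]              # 28/45
odds_non_black = non_black[0] / non_black[1]  # 22/52

# Ratio of black odds to non-black odds, and its reciprocal
odds_ratio = odds_black / odds_non_black
inverse_ratio = odds_non_black / odds_black

print(f"odds (black)     = {odds_black:.2f}")
print(f"odds (non-black) = {odds_non_black:.2f}")
print(f"odds ratio       = {odds_ratio:.3f}")
print(f"inverse ratio    = {inverse_ratio:.2f}")
```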

Logit model:
$$p_i = \frac{1}{1 + e^{-(a + b x_i)}}$$
which is the cumulative logistic distribution function.
Let $Z_i = a + b x_i$, then
$$p_i = \frac{1}{1 + e^{-Z_i}} = F(a + b x_i)$$
Notes:
• As $Z_i$ ranges from -∞ to +∞, $p_i$ ranges between 0 and 1;
• $p_i$ is non-linearly related to $Z_i$.

Also,
$$\frac{p_i}{1 - p_i} = e^{Z_i}$$
(the odds of the event)
Let
$$L_i = \ln\left(\frac{p_i}{1 - p_i}\right) = Z_i = a + b x_i$$
Although $L_i$ is linear in $x_i$, the probabilities themselves are not. This is in contrast to the LPM.
Graph of logit model for a single explanatory variable
[Figure: $p_i$ (vertical axis, 0 to 1.0) plotted against $x_i$ (horizontal axis, -4 to 4), using a = 0 and b = 1]
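The points of the curve described above (a = 0, b = 1, x from -4 to 4) can be tabulated with a few lines of Python. This is only a sketch of the values behind the graph; the function name `logistic` is illustrative and no plotting library is assumed.

```python
import math

def logistic(z):
    """Cumulative logistic distribution function F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

a, b = 0.0, 1.0          # parameter values suggested for the graph
xs = list(range(-4, 5))  # horizontal axis: -4, -3, ..., 4

# (x, p) pairs that trace out the S-shaped curve
curve = [(x, logistic(a + b * x)) for x in xs]

for x, p in curve:
    print(f"x = {x:+d}   p = {p:.3f}")
```

The values rise monotonically, pass through 0.5 at x = 0, and stay strictly between 0 and 1, unlike the fitted values of the linear probability model.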

Now $p_i = F(a + b x_i)$, so
$$\frac{\partial p_i}{\partial x_i} = \frac{\partial F(a + b x_i)}{\partial x_i} = F'(a + b x_i)\, b = f(a + b x_i)\, b$$
As $f(a + b x_i)$ is always positive, the sign of b indicates the direction of the relationship between $p_i$ and $x_i$.

For the LOGIT model,
$$f(a + b x_i) = \frac{e^{-Z_i}}{(1 + e^{-Z_i})^2} = F(a + b x_i)[1 - F(a + b x_i)] = p_i (1 - p_i)$$
∴
$$\frac{\partial p_i}{\partial x_i} = b\, p_i (1 - p_i)$$
In other words, a 1-unit change in $x_i$ does not produce a constant effect on $p_i$.
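To see how the effect of $x_i$ varies with the level of $p_i$, the derivative $b\,p_i(1-p_i)$ can be evaluated at a few probabilities. In this Python sketch the slope b = 0.5 is a hypothetical value chosen only for illustration.

```python
def marginal_effect(b, p):
    """Derivative of p with respect to x in the logit model: b * p * (1 - p)."""
    return b * p * (1.0 - p)

b = 0.5  # hypothetical slope, for illustration only

# Effect of a small change in x at low, middle and high probabilities
effects = {p: marginal_effect(b, p) for p in (0.1, 0.5, 0.9)}

for p, effect in effects.items():
    print(f"p = {p:.1f}   dp/dx = {effect:.3f}")
```

The effect is largest at p = 0.5 (where it equals b/4) and shrinks symmetrically as p approaches 0 or 1.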

• Note that $y_i$ only takes on the values 0 and 1, so the logit $L_i$ cannot be computed from the observed data (it would require $\ln(0)$ or $\ln(1/0)$). Therefore, OLS is not an appropriate estimation technique. Maximum Likelihood (ML) estimation is usually undertaken.
• ML basic principle: choose as estimates those parameter values which maximize the probability of what we have in fact observed.
Steps:
• Write down an expression for the probability of the data as a function of the unknown parameters [construction of the likelihood function].
• Find the values of the unknown parameters that make the value of this expression as large as possible.
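The log-likelihood is
$$\begin{aligned}
\log L &= \sum_i y_i \log\left(\frac{p_i}{1 - p_i}\right) + \sum_i \log(1 - p_i) \\
&= \sum_i y_i \log\left(e^{a + b x_i}\right) + \sum_i \log\left(\frac{e^{-(a + b x_i)}}{1 + e^{-(a + b x_i)}}\right) \\
&= a \sum_i y_i + b \sum_i x_i y_i - \sum_i \log\left(1 + e^{a + b x_i}\right)
\end{aligned}$$
Taking the derivatives of log L and setting them to zero gives:
$$\frac{\partial \log L}{\partial \hat{a}} = \sum_i y_i - \sum_i \left(1 + e^{-(\hat{a} + \hat{b} x_i)}\right)^{-1} = \sum_i y_i - \sum_i \hat{y}_i = 0$$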
$$\frac{\partial \log L}{\partial \hat{b}} = \sum_i x_i y_i - \sum_i x_i \left(1 + e^{-(\hat{a} + \hat{b} x_i)}\right)^{-1} = \sum_i x_i y_i - \sum_i x_i \hat{y}_i = 0$$
The first-order conditions are non-linear in $\hat{a}$ and $\hat{b}$, so solutions are typically obtained by iterative methods.
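The log-likelihood can be evaluated directly for a small illustration. The Python sketch below rebuilds individual-level data from the death-sentence table by race of defendant (x = BLACKD, y = DEATH) and uses a model with BLACKD as the only regressor, which is a simplification adopted here and not the three-variable model estimated in these notes. With a single 0/1 regressor the ML estimates have a closed form (log odds of the reference group and log odds ratio), which the sketch uses as a check.

```python
import math

def log_likelihood(a, b, x, y):
    """log L = a*sum(y) + b*sum(x*y) - sum(log(1 + exp(a + b*x)))."""
    return (a * sum(y)
            + b * sum(xi * yi for xi, yi in zip(x, y))
            - sum(math.log(1.0 + math.exp(a + b * xi)) for xi in x))

# Individual-level data rebuilt from the 2x2 table of counts
x = [1] * 73 + [0] * 74                        # BLACKD
y = [1] * 28 + [0] * 45 + [1] * 22 + [0] * 52  # DEATH

# Log-likelihood at the starting values a = b = 0 (every p_i = 0.5)
ll_zero = log_likelihood(0.0, 0.0, x, y)

# Closed-form ML estimates for a single 0/1 regressor
a_hat = math.log(22 / 52)
b_hat = math.log((28 / 45) / (22 / 52))
ll_hat = log_likelihood(a_hat, b_hat, x, y)

print(f"log L at (0, 0)         = {ll_zero:.3f}")
print(f"log L at (a_hat, b_hat) = {ll_hat:.3f}")
```

Moving either parameter away from the closed-form estimates lowers the log-likelihood, which is what the first-order conditions above express.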
Newton-Raphson algorithm
Let U(a,b) be the vector of first derivatives of log L with respect to a and b, and let H(a,b) be the corresponding matrix of second derivatives, i.e.
$$U(a, b) = \begin{pmatrix} \sum_i y_i - \sum_i \hat{y}_i \\ \sum_i x_i y_i - \sum_i x_i \hat{y}_i \end{pmatrix}$$
(gradient or score vector)
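$$H(a, b) = \begin{pmatrix} \dfrac{\partial^2 \log L}{\partial a\, \partial a} & \dfrac{\partial^2 \log L}{\partial a\, \partial b} \\ \dfrac{\partial^2 \log L}{\partial a\, \partial b} & \dfrac{\partial^2 \log L}{\partial b\, \partial b} \end{pmatrix} = \begin{pmatrix} -\sum_i \hat{y}_i (1 - \hat{y}_i) & -\sum_i x_i \hat{y}_i (1 - \hat{y}_i) \\ -\sum_i x_i \hat{y}_i (1 - \hat{y}_i) & -\sum_i x_i^2 \hat{y}_i (1 - \hat{y}_i) \end{pmatrix}$$
(Hessian matrix)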

The Newton-Raphson algorithm derives new estimates based on
$$\begin{pmatrix} \hat{a}_{j+1} \\ \hat{b}_{j+1} \end{pmatrix} = \begin{pmatrix} \hat{a}_j \\ \hat{b}_j \end{pmatrix} - H^{-1}(\hat{a}_j, \hat{b}_j)\, U(\hat{a}_j, \hat{b}_j)$$
where $H^{-1}(a, b)$ is the inverse of $H(a, b)$.

In practice, we need a set of starting values. [PROC LOGISTIC starts with all coefficients equal to zero]

The process is repeated until the maximum change in each parameter estimate from one step to the next is less than some criterion u, i.e.
$$|\hat{a}_{j+1} - \hat{a}_j| < u \quad \text{and} \quad |\hat{b}_{j+1} - \hat{b}_j| < u$$

Note that
$$\mathrm{cov}(\hat{a}, \hat{b}) = -H^{-1}(\hat{a}, \hat{b})$$
This variance-covariance matrix can be obtained using the COVB option in the MODEL statement in SAS.
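The iteration described above can be sketched in plain Python for the two-parameter case. This is an illustrative implementation under stated assumptions, not the code used by PROC LOGISTIC: the function name `fit_logit_newton`, the tolerance and the iteration cap are choices made here, and the data are rebuilt from the death-sentence table by race of defendant with BLACKD as the only regressor.

```python
import math

def fit_logit_newton(x, y, tol=1e-8, max_iter=50):
    """Newton-Raphson for p_i = 1 / (1 + exp(-(a + b*x_i))), starting at a = b = 0."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score vector U(a, b)
        u_a = sum(yi - pi for yi, pi in zip(y, p))
        u_b = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Hessian H(a, b): all entries carry a minus sign
        w = [pi * (1.0 - pi) for pi in p]
        h_aa = -sum(w)
        h_ab = -sum(xi * wi for xi, wi in zip(x, w))
        h_bb = -sum(xi * xi * wi for xi, wi in zip(x, w))
        det = h_aa * h_bb - h_ab * h_ab
        # Step H^{-1} U, using the explicit inverse of a 2x2 matrix
        step_a = (h_bb * u_a - h_ab * u_b) / det
        step_b = (-h_ab * u_a + h_aa * u_b) / det
        a, b = a - step_a, b - step_b
        if max(abs(step_a), abs(step_b)) < tol:
            break
    # cov = -H^{-1}, using the Hessian from the last iteration
    # (evaluated one step before the final estimates; the gap is below tol)
    cov = [[-h_bb / det, h_ab / det], [h_ab / det, -h_aa / det]]
    return a, b, cov

x = [1] * 73 + [0] * 74                        # BLACKD
y = [1] * 28 + [0] * 45 + [1] * 22 + [0] * 52  # DEATH

a_hat, b_hat, cov = fit_logit_newton(x, y)
print(f"a_hat = {a_hat:.4f}, b_hat = {b_hat:.4f}, se(b_hat) = {math.sqrt(cov[1][1]):.4f}")
```

For this one-regressor case the result can be checked against the closed forms: $\hat{b}$ equals the log of the sample odds ratio, and its variance equals the sum of the reciprocals of the four cell counts.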
SAS Program
/* Logit model; DESCENDING makes PROC LOGISTIC model Pr(DEATH = 1) */
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC LOGISTIC DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD WHITVIC SERIOUS;
RUN;
Interpretation of results

• Rather than a t-statistic, SAS reports a Wald chi-square value, which is the square of the usual t-statistic.
• Reason: the t-statistic is valid only asymptotically and has an asymptotic N(0,1) distribution under the null. The square of an N(0,1) variable is a chi-square random variable with one d.f.

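Test of overall significance
$H_0: b_1 = b_2 = \dots = b_k = 0$
$H_1$: otherwise
(R and UR denote the restricted and unrestricted estimates.)
1. Likelihood-ratio test:
$$LR = -2\{\ln L(\hat{b}_R) - \ln L(\hat{b}_{UR})\} \sim \chi^2_k$$
2. Score (Lagrange multiplier) test:
$$LM = [U(\hat{b}_R)]' [-H^{-1}(\hat{b}_R)] [U(\hat{b}_R)] \sim \chi^2_k$$
3. Wald test:
$$W = \hat{b}_{UR}' [-H(\hat{b}_{UR})] \hat{b}_{UR} \sim \chi^2_k$$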
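As an illustration of the likelihood-ratio test, the Python sketch below tests the single restriction b = 0 in a model with BLACKD as the only regressor, using the counts from the death-sentence table by race of defendant. That one-regressor model is a simplification adopted here; for it, both maximised log-likelihoods have closed forms, so no iteration is needed. The helper name `binomial_loglik` is illustrative.

```python
import math

def binomial_loglik(successes, failures):
    """Maximised log-likelihood of a group that shares one fitted probability."""
    n = successes + failures
    return successes * math.log(successes / n) + failures * math.log(failures / n)

# Restricted model (intercept only): one probability for all 147 cases
ll_r = binomial_loglik(50, 97)
# Unrestricted model (intercept + BLACKD): separate probabilities by race
ll_ur = binomial_loglik(28, 45) + binomial_loglik(22, 52)

lr = -2.0 * (ll_r - ll_ur)

# Upper-tail probability of a chi-square with 1 d.f.:
# P(Z^2 > lr) = P(|Z| > sqrt(lr)) = erfc(sqrt(lr / 2))
p_value = math.erfc(math.sqrt(lr / 2.0))

print(f"LR = {lr:.3f}, p-value = {p_value:.3f}")
```

The statistic falls well below 3.84, the 5% critical value of a chi-square with 1 d.f., so on its own the race of the defendant does not significantly improve on the intercept-only model.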
Model Selection Criteria
1. Akaike's Information Criterion
AIC = -2 ln L + 2(k+1)
2. Schwarz Criterion
SC = -2 ln L + (k+1) ln(n)
3. Generalized $R^2 = 1 - \exp\left(-\frac{LR}{n}\right)$, analogous to the conventional $R^2$ used in linear regression
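The three criteria are simple functions of the maximised log-likelihoods. In the Python sketch below, the function name `model_selection_criteria` is illustrative and the log-likelihood values passed to it are hypothetical numbers chosen only to show the arithmetic; they are not taken from the SAS output.

```python
import math

def model_selection_criteria(ll_model, ll_null, k, n):
    """Return (AIC, SC, generalized R^2) for a model with k slopes fitted to n cases."""
    aic = -2.0 * ll_model + 2.0 * (k + 1)
    sc = -2.0 * ll_model + (k + 1) * math.log(n)
    lr = -2.0 * (ll_null - ll_model)   # likelihood-ratio statistic
    gen_r2 = 1.0 - math.exp(-lr / n)
    return aic, sc, gen_r2

# Hypothetical log-likelihoods, for illustration only
aic, sc, gen_r2 = model_selection_criteria(ll_model=-90.0, ll_null=-94.0, k=3, n=147)
print(f"AIC = {aic:.2f}, SC = {sc:.2f}, generalized R2 = {gen_r2:.4f}")
```

Since ln(147) is about 5, SC penalises each additional parameter more heavily than AIC does for a sample of this size.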
Optimization Technique

Fisher's scoring (iteratively reweighted least squares), which for the logit model is equivalent to the Newton-Raphson algorithm.



• Odds ratio = $e^b$
• The (predicted) odds ratio of 1.813 indicates that the odds of a death sentence for black defendants are 81% higher than the odds for other defendants.
• The (predicted) odds of death are about 29% higher when the victim is white. (But note that the coefficient is insignificant.)

• A 1-unit increase in the SERIOUS scale is associated with a 21% increase in the predicted odds of a death sentence.
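The step from a logit coefficient to a percentage change in the odds can be written out explicitly. In this Python sketch the coefficient for BLACKD is backed out from the reported odds ratio of 1.813 rather than read from the SAS output, and the function names are illustrative.

```python
import math

def odds_ratio(b):
    """Odds ratio implied by a logit coefficient: exp(b)."""
    return math.exp(b)

def percent_change_in_odds(b):
    """Percentage change in the odds for a 1-unit increase in the regressor."""
    return 100.0 * (math.exp(b) - 1.0)

# Coefficient implied by the reported odds ratio for BLACKD
b_blackd = math.log(1.813)

print(f"b = {b_blackd:.3f}")
print(f"odds ratio = {odds_ratio(b_blackd):.3f}")
print(f"change in odds = {percent_change_in_odds(b_blackd):.0f}%")
```

A negative coefficient gives an odds ratio below 1 and therefore a percentage decrease in the odds.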
Association of predicted probabilities and observed responses
Example: For the 147 observations in the sample, there are $\binom{147}{2} = 10731$ ways to pair them up (without pairing an observation with itself). Of these, 5881 pairs have either both 1's on the dependent variable or both 0's. These we ignore, leaving 4850 pairs in which one case has a 1 and the other case has a 0. For each pair, we ask the question "Does the case with a 1 have a higher predicted value (based on the model) than the case with a 0?"
If yes, we call that pair concordant; if no, we
call that pair discordant; if the two cases have
the same predicted value, we call it a tie.
Let C = number of concordant pairs;
D = number of discordant pairs;
T = number of ties
N = total number of pairs (before
eliminating any)
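$$\text{Tau-}a = \frac{C - D}{N}$$
$$\text{Gamma} = \frac{C - D}{C + D}$$
$$\text{Somers' } D = \frac{C - D}{C + D + T}$$
$$c = 0.5 \times (1 + \text{Somers' } D)$$
When the model ranks cases no worse than chance (C ≥ D), all 4 measures lie between 0 and 1, with larger values corresponding to stronger associations between the predicted and observed values.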
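The pair-counting procedure and the four measures can be sketched in Python as follows. The function name `association_measures` and the five-case data set are made up for illustration. Ties are counted only when two predicted values are exactly equal, which may differ from how a statistical package groups predicted probabilities.

```python
from itertools import combinations

def association_measures(y, p_hat):
    """Count concordant, discordant and tied pairs and derive the four measures."""
    n = len(y)
    total_pairs = n * (n - 1) // 2          # N: all pairs, before eliminating any
    c = d = t = 0
    for i, j in combinations(range(n), 2):
        if y[i] == y[j]:
            continue                        # both 1's or both 0's: ignored
        one, zero = (i, j) if y[i] == 1 else (j, i)
        if p_hat[one] > p_hat[zero]:
            c += 1                          # concordant
        elif p_hat[one] < p_hat[zero]:
            d += 1                          # discordant
        else:
            t += 1                          # tie
    somers_d = (c - d) / (c + d + t)
    return {
        "C": c, "D": d, "T": t, "N": total_pairs,
        "tau_a": (c - d) / total_pairs,
        "gamma": (c - d) / (c + d),
        "somers_d": somers_d,
        "c": 0.5 * (1.0 + somers_d),
    }

# Made-up outcomes and predicted probabilities for five cases
y = [1, 1, 0, 0, 0]
p_hat = [0.9, 0.4, 0.4, 0.3, 0.8]
result = association_measures(y, p_hat)
print(result)
```

In this toy example there are 2 × 3 = 6 pairs with different outcomes, of which 4 are concordant, 1 is discordant and 1 is tied.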
An illustrative example of the LOGIT model
Table 12.4 of Ramanathan (1995) presents information on the acceptance or rejection to medical school for a sample of 60 applicants, along with a number of their characteristics. The variables are as follows:
ACCEPT = 1 if granted an acceptance, 0 otherwise;
GPA = cumulative undergraduate grade point average;
BIO = score in the biology portion of the Medical College Admission Test (MCAT);
CHEM = score in the chemistry portion of the MCAT;
PHY = score in the physics portion of the MCAT;
RED = score in the reading portion of the MCAT;
PRB = score in the problem portion of the MCAT;
QNT = score in the quantitative portion of the MCAT;
AGE = age of applicant;
GENDER = 1 if male, 0 if female.

1. Estimate a LOGIT model for the probability of acceptance into medical school.
2. Predict the probability of success of an individual with the following characteristics:
GPA = 2.96, BIO = 7, CHEM = 7, PHY = 8, RED = 5, PRB = 7, QNT = 5, AGE = 25, GENDER = 0
3. Calculate Cragg and Uhler's pseudo $R^2$ for the above model. How well does the model appear to fit the data?
4. AGE and GENDER represent personal characteristics. Test the hypothesis that AGE and GENDER jointly have no impact on the probability of success.
DATA UNI;
INFILE 'D:\TEACHING\MS4225\MEDICAL.TXT';
INPUT ACCEPT GPA BIO CHEM PHY RED PRB QNT
AGE GENDER;
PROC LOGISTIC DATA=UNI DESCENDING;
MODEL ACCEPT=GPA BIO CHEM PHY RED PRB QNT
AGE GENDER;
RUN;
PROC LOGISTIC DATA=UNI DESCENDING;
MODEL ACCEPT=GPA BIO CHEM PHY RED PRB QNT;
RUN;
LOGIT alternative estimation technique
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD WHITVIC SERIOUS/D=B;
RUN;
Advantages of PROC GENMOD

Class variable
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
CLASS CULP;
MODEL DEATH=BLACKD WHITVIC CULP/D=B;
RUN;
• Variable CULP takes the integer values 1 to 5 (5 denotes high culpability and 1 denotes low culpability).
• The CLASS statement treats this variable as a set of categories by creating 4 dummy variables, one for each of the values 1 through 4 (the default in GENMOD is to take the highest value as the omitted category).


• Multiplicative terms in the MODEL statement (to capture interaction effects)
• For example, some people may argue that black defendants who kill white victims may be especially likely to receive a death sentence. We can test this hypothesis by including the term BLACKD*WHITVIC in the MODEL statement.
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP SERIOUS2;
PROC GENMOD DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD WHITVIC CULP BLACKD*WHITVIC/D=B;
RUN;
Other features of PROC GENMOD

• Deviance = -2 ln L (for individual data)
• The deviance involves a comparison between the model of interest and the maximal (or saturated) model, which always fits the data at least as well. The question is whether the difference in fit is statistically significant.
• With individual data, the saturated model has one parameter for every predicted probability and therefore gives a perfect fit and a likelihood value of 1.



• Unfortunately, with individual-level data the deviance does not have a chi-square distribution, because the number of parameters increases with the sample size, thereby violating a condition of asymptotic theory.
• SCALE variable (can be ignored for binary regression models unless one is working with grouped data and wants to allow for overdispersion)
• Pearson chi-square test (to be considered at a later stage)
Disadvantages of PROC GENMOD

• Does not provide the odds ratio estimates.
• Does not report a global test for the overall significance of the model.
Hosmer-Lemeshow (HL) statistic
DATA PENALTY;
INFILE 'D:\TEACHING\MS4225\PENALTY.TXT';
INPUT DEATH BLACKD WHITVIC SERIOUS CULP
SERIOUS2;
PROC LOGISTIC DATA=PENALTY DESCENDING;
MODEL DEATH=BLACKD WHITVIC CULP/LACKFIT;
RUN;

The HL statistic is calculated in the following way:
• Based on the estimated model, predicted probabilities are generated for all observations. These are sorted by size, then grouped into approximately 10 intervals.
• Within each interval, the expected frequency is obtained by adding up the predicted probabilities.
• Expected frequencies are compared with the observed frequencies by the conventional Pearson chi-square statistic. The d.o.f. is the number of intervals minus 2.
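The steps above can be sketched in Python as follows. This is a simplified illustration, not the grouping rule used by the LACKFIT option: the function name `hosmer_lemeshow` is hypothetical, groups are formed by cutting the sorted cases into equal-sized blocks, and the six-case data set is made up. The sketch assumes every group has expected frequencies strictly between 0 and the group size.

```python
def hosmer_lemeshow(y, p_hat, groups=10):
    """Return (HL statistic, degrees of freedom) using equal-sized groups."""
    order = sorted(range(len(y)), key=lambda i: p_hat[i])  # sort cases by predicted probability
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        exp1 = sum(p_hat[i] for i in idx)   # expected number of 1's
        obs1 = sum(y[i] for i in idx)       # observed number of 1's
        exp0 = len(idx) - exp1              # expected number of 0's
        obs0 = len(idx) - obs1              # observed number of 0's
        # Pearson chi-square contribution from both cells of the group
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return stat, groups - 2

# Made-up outcomes and predicted probabilities, split into 3 groups of 2
y = [0, 1, 0, 1, 1, 1]
p_hat = [0.2, 0.2, 0.5, 0.5, 0.8, 0.8]
hl_stat, hl_df = hosmer_lemeshow(y, p_hat, groups=3)
print(f"HL = {hl_stat:.3f} with {hl_df} d.f.")
```

A large value of the statistic relative to the chi-square distribution with the stated d.o.f. indicates that the predicted probabilities do not match the observed frequencies well.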