Binary Dependent Variables

9. Binary Dependent Variables
• 9.1 Homogeneous models
– Logit, probit models
– Inference
– Tax preparers
• 9.2 Random effects models
• 9.3 Fixed effects models
• 9.4 Marginal models and GEE
• Appendix 9A - Likelihood calculations
9.1 Homogeneous models
• The response of interest, yit, now may be only a 0 or a 1, a
binary dependent variable.
– Typically indicates whether the ith subject possesses an
attribute at time t.
• Suppose that the probability that the response equals 1 is
denoted by Prob(yit = 1) = pit.
– Then, we may interpret the mean response to be the
probability that the response equals 1, that is,
E yit = 0 × Prob(yit = 0) + 1 × Prob(yit = 1) = pit .
– Further, straightforward calculations show that the
variance is related to the mean through the expression
Var yit = pit (1 - pit ) .
Inadequacy of linear models
• Homogeneous means that we will not incorporate subject-specific terms that account for heterogeneity.
• Linear models of the form yit = xit′β + εit are inadequate
because:
– The expected response is a probability and thus must vary
between 0 and 1 although the linear combination, xit′β,
may vary between negative and positive infinity.
– Linear models assume homoscedasticity (constant
variance) yet the variance of the response depends on the
mean which varies over observations.
– The response must be either a 0 or 1 although the
distribution of the error term is typically regarded as
continuous.
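The first two objections can be seen in a small simulation. The sketch below is illustrative only: the data are simulated, and the logit coefficients (-1.0, 2.0) and the range of x are assumptions, not values from these notes. It fits a straight line by least squares to a 0/1 response and evaluates the fitted "probabilities" at the ends of the x range.

```python
import math
import random

random.seed(0)

def true_prob(x):
    # Hypothetical logit probability; the coefficients -1.0 and 2.0 are
    # illustrative assumptions, not values from the text.
    return 1.0 / (1.0 + math.exp(-(-1.0 + 2.0 * x)))

# Simulate n binary responses with one explanatory variable on [-3, 3].
n = 5000
xs = [random.uniform(-3.0, 3.0) for _ in range(n)]
ys = [1 if random.random() < true_prob(x) else 0 for x in xs]

# Ordinary least squares for y = b0 + b1 * x (a linear model for a 0/1 response).
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Fitted values at the ends of the x range fall outside [0, 1].
fitted_low = b0 + b1 * (-3.0)
fitted_high = b0 + b1 * 3.0
print(fitted_low, fitted_high)

# The response variance p(1 - p) changes with the mean, so it is not constant.
for x in (-3.0, 0.5, 3.0):
    p = true_prob(x)
    print(x, round(p, 3), round(p * (1.0 - p), 3))
```

With these assumed values the fitted line dips below 0 at the low end and rises above 1 at the high end, while the variance p(1 - p) is largest near p = 0.5 and close to zero in the tails.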
Using nonlinear functions of explanatory variables
• In lieu of linear, or additive, functions, we express the
probability of the response being 1 as a nonlinear function
of explanatory variables
pit = π(xit′β).
• Two special cases are:
– the logit case,
$$ \pi(z) = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{e^{z} + 1} $$
– π(z) = Φ(z), a cumulative standard normal distribution
function, the probit case.
• These two functions are similar. I focus on the logit case
because it permits closed-form expressions unlike the
cumulative normal distribution function.
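A quick numerical comparison shows how similar the two functions are once the logistic argument is rescaled. This is a sketch, not part of the original notes: the function names are hypothetical, and the constant 1.702 is a commonly used rescaling factor introduced here as an assumption.

```python
import math

def logit_cdf(z):
    """Logistic distribution function, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rescaling constant (an assumption of this sketch) that puts the two
# curves on a comparable scale.
SCALE = 1.702

# Largest gap between the two curves on a grid from -5 to 5.
grid = [k / 100.0 for k in range(-500, 501)]
max_gap = max(abs(logit_cdf(SCALE * z) - probit_cdf(z)) for z in grid)
print(round(max_gap, 4))
```

On this grid the two curves differ by no more than about 0.01, which is why logit and probit fits usually tell the same story after rescaling the coefficients.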
Threshold interpretation
• Suppose that there exists an underlying linear model,
yit* = xit′β + εit*.
– The response is interpreted to be the “propensity” to
possess a characteristic.
– We do not observe the propensity but we do observe
when the propensity crosses a threshold, say 0.
– We observe
$$ y_{it} = \begin{cases} 0 & y_{it}^* \le 0 \\ 1 & y_{it}^* > 0 \end{cases} $$
• Using the logit distribution function,
Prob(εit* ≤ a) = 1 / (1 + exp(−a)).
• Note that, because this distribution is symmetric about zero, Prob(−εit* ≤ xit′β) = Prob(εit* ≤ xit′β). Thus,
$$ \mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(y_{it}^* > 0) = \mathrm{Prob}(\epsilon_{it}^* \le x_{it}'\beta) = \frac{1}{1 + \exp(-x_{it}'\beta)} = \pi(x_{it}'\beta) $$
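The threshold argument can be checked by simulation. In this sketch the value 0.8 for the linear combination xit′β is a hypothetical choice, and the helper names are illustrative. It draws logistic errors by inverting the distribution function, forms the latent propensity, and compares the share of propensities above the threshold 0 with π(xit′β).

```python
import math
import random

random.seed(1)

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def draw_logistic():
    """Draw from the logistic distribution by inverting its distribution function."""
    u = random.random()
    while u == 0.0:  # guard against log(0)
        u = random.random()
    return math.log(u / (1.0 - u))

xb = 0.8      # hypothetical value of the linear combination x'beta
n = 100000

# y = 1 exactly when the latent propensity y* = x'beta + error exceeds 0.
hits = sum(1 for _ in range(n) if xb + draw_logistic() > 0.0)
freq = hits / n
print(round(freq, 3), round(pi(xb), 3))
```

The simulated frequency of yit = 1 should sit close to π(0.8), up to sampling error.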
Random utility interpretation
• In economics applications, we think of an individual choosing
among c categories.
– Preferences among categories are indexed by an
unobserved utility function.
– We model utility as a function of an underlying value plus
random noise, that is, Uitj = uit(Vitj + εitj), j = 0,1.
– If Uit1 > Uit0 , then denote this choice as yit = 1.
– Assuming that uit is a strictly increasing function, we have
$$ \mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(U_{it0} < U_{it1}) = \mathrm{Prob}\left( u_{it}(V_{it0} + \epsilon_{it0}) < u_{it}(V_{it1} + \epsilon_{it1}) \right) = \mathrm{Prob}(\epsilon_{it0} - \epsilon_{it1} < V_{it1} - V_{it0}) $$
• Parameterize the problem by taking Vit0 = 0 and Vit1 = xit′β.
• We may take the difference in the errors, εit0 − εit1 , to be
normal or logistic, corresponding to the probit and logit cases.
Logistic regression
• This is another phrase used to describe the logit case.
• Using p = π(z), the inverse of π can be calculated as
z = π^(-1)(p) = ln ( p/(1-p) ) .
– Define logit (p) = ln ( p/(1-p) ) to be the logit function.
– Here, p/(1-p) is known as the odds ratio (strictly speaking, the odds of the event). It has a
convenient economic interpretation in terms of fair
games.
• That is, suppose that p = 0.25. Then, the odds ratio is 0.333.
• The odds of winning are 0.333 to 1, or 1 to 3 (equivalently, 3 to 1 against). If we bet $1,
then in a fair game we should win $3.
• The logistic regression models the linear combination of
explanatory variables as the logarithm of the odds ratio,
xit′β = ln ( pit/(1-pit) ) .
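The fair-game arithmetic and the inverse relationship between π and the logit function can be reproduced in a few lines. The function names below are illustrative, not from the notes.

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Logit function: the logarithm of the odds."""
    return math.log(odds(p))

def inv_logit(z):
    """Inverse of the logit function, pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.25
print(round(odds(p), 3))      # odds of winning when p = 0.25

# In a fair game the expected gain is zero: p * win - (1 - p) * stake = 0,
# so the winnings per $1 staked are (1 - p) / p.
fair_win = (1.0 - p) / p
print(fair_win)

# Applying pi to logit(p) recovers p, up to rounding error.
print(inv_logit(logit(p)))
```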
Parameter interpretation
• To interpret β = (β1, β2, …, βK)′, we begin by assuming that
the jth explanatory variable, xitj, is either 0 or 1.
• Then, with the notation xit = (xit1 … xitj … xitK)′, we may interpret
$$ \beta_j = (x_{it1} \cdots 1 \cdots x_{itK})\,\beta - (x_{it1} \cdots 0 \cdots x_{itK})\,\beta = \ln\left( \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)} \right) - \ln\left( \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)} \right) $$
• Thus,
$$ e^{\beta_j} = \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1) \,/\, \left(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)\right)}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0) \,/\, \left(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)\right)} $$
• To illustrate, if βj = 0.693, then exp(βj) = 2.
– The odds (for y = 1) are twice as great for xj = 1 as for
xj = 0.
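A small check of this interpretation: the ratio of the odds at xj = 1 to the odds at xj = 0 equals exp(βj) whatever the values of the other explanatory variables. In the sketch below the two-variable model and the coefficient 0.5 on the other variable are hypothetical; only βj = 0.693 comes from the illustration above.

```python
import math

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: beta[0] multiplies another variable x1,
# beta[1] = 0.693 multiplies the binary variable xj.
beta = [0.5, 0.693]

def prob(x1, xj):
    """Prob(y = 1) under the logit model for the given explanatory variables."""
    return pi(beta[0] * x1 + beta[1] * xj)

# Odds at xj = 1 relative to odds at xj = 0, for several values of x1.
ratios = [odds(prob(x1, 1)) / odds(prob(x1, 0)) for x1 in (-1.0, 0.0, 2.0)]
print([round(r, 3) for r in ratios])
print(round(math.exp(beta[1]), 3))
```

Each ratio matches exp(0.693), roughly 2, even though the underlying probabilities differ across the values of x1.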
More parameter interpretation
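• Similarly, assuming that the jth explanatory variable is
continuous, we have
$$ \beta_j = \frac{d}{dx_{itj}}\, x_{it}'\beta = \frac{d}{dx_{itj}} \ln\left( \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj})}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})} \right) = \frac{\dfrac{d}{dx_{itj}} \left[ \mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \left(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\right) \right]}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \left(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\right)} $$
• Thus, we may interpret βj as the proportional change in the
odds ratio, known as an elasticity in economics.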
Parameter estimation
• The customary estimation method is maximum likelihood.
• The log likelihood of a single observation is
$$ y_{it} \ln \pi(x_{it}'\beta) + (1 - y_{it}) \ln(1 - \pi(x_{it}'\beta)) = \begin{cases} \ln(1 - \pi(x_{it}'\beta)) & \text{if } y_{it} = 0 \\ \ln \pi(x_{it}'\beta) & \text{if } y_{it} = 1 \end{cases} $$
• The log likelihood of the data set is
$$ L(\beta) = \sum_{it} \left\{ y_{it} \ln \pi(x_{it}'\beta) + (1 - y_{it}) \ln(1 - \pi(x_{it}'\beta)) \right\} $$
• Taking partial derivatives with respect to β yields the score equations
$$ \sum_{it} x_{it} \left( y_{it} - \pi(x_{it}'\beta) \right) \frac{\pi'(x_{it}'\beta)}{\pi(x_{it}'\beta)\left(1 - \pi(x_{it}'\beta)\right)} = 0 $$
– The solution of these equations, say bMLE, yields the maximum
likelihood estimate.
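• The score equations can also be expressed as a generalized estimating
equation:
$$ \sum_{it} \left( \frac{\partial}{\partial\beta} E\,y_{it} \right) \left( \mathrm{Var}\,y_{it} \right)^{-1} \left( y_{it} - E\,y_{it} \right) = 0 $$
• where
$$ \frac{\partial}{\partial\beta} E\,y_{it} = x_{it}\,\pi'(x_{it}'\beta), \qquad E\,y_{it} = \pi(x_{it}'\beta), \qquad \mathrm{Var}\,y_{it} = \pi(x_{it}'\beta)\left(1 - \pi(x_{it}'\beta)\right) $$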
For the logit function
• The normal equations are:
$$ \sum_{it} x_{it} \left( y_{it} - \pi(x_{it}'\beta) \right) = 0 $$
– This follows from the score equations because, for the logit function, π′(z) = π(z)(1 − π(z)).
– The solution depends on the responses yit only through the vector of
statistics Σit xit yit .
• The solution of these equations, say bMLE, is the
maximum likelihood estimate.
• This method can be extended to provide standard errors for the
estimates.
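The logit normal equations have no closed-form solution, so they are solved iteratively. The sketch below uses Newton-Raphson for a model with an intercept and one slope (K = 2), written out by hand to stay within the standard library. The simulated data, the true coefficients (-0.5, 1.0) and all function names are assumptions for illustration. Standard errors are taken from the inverse of the information matrix Σit π(1 − π) xit xit′.

```python
import math
import random

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def score_and_info(b0, b1, xs, ys):
    """Score vector sum x (y - pi) and information matrix sum pi (1 - pi) x x'."""
    s0 = s1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = pi(b0 + b1 * x)
        s0 += y - p
        s1 += x * (y - p)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (s0, s1), (h00, h01, h11)

def fit_logit(xs, ys, iterations=25):
    """Newton-Raphson: b <- b + information^{-1} * score, using a 2 x 2 inverse."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        (s0, s1), (h00, h01, h11) = score_and_info(b0, b1, xs, ys)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    # Standard errors: square roots of the diagonal of the inverse information.
    _, (h00, h01, h11) = score_and_info(b0, b1, xs, ys)
    det = h00 * h11 - h01 * h01
    return (b0, b1), (math.sqrt(h11 / det), math.sqrt(h00 / det))

# Simulated data with hypothetical true coefficients.
random.seed(4)
true_b0, true_b1 = -0.5, 1.0
xs = [random.uniform(-2.0, 2.0) for _ in range(2000)]
ys = [1 if random.random() < pi(true_b0 + true_b1 * x) else 0 for x in xs]

(b0_mle, b1_mle), (se0, se1) = fit_logit(xs, ys)
final_score, _ = score_and_info(b0_mle, b1_mle, xs, ys)
print(b0_mle, b1_mle, se0, se1)
print(final_score)  # both components should be essentially zero
```

At the solution the score is zero to rounding error, which is exactly the statement that bMLE solves the normal equations.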
9.2 Random effects models
• We accommodate heterogeneity by incorporating subject-specific variables of the form:
pit = π(αi + xit′β).
– We assume that the intercepts are realizations of random
variables from a common distribution.
• We estimate the parameters of the {αi} distribution and the
K slope parameters β.
• By using the random effects specification, we dramatically
reduce the number of parameters to be estimated
compared to the Section 9.3 fixed effects set-up.
– This is similar to the linear model case.
• This model is computationally difficult to evaluate.
Commonly used distributions
• We assume that subject-specific effects are independent and
come from a common distribution.
– It is customary to assume that the subject-specific effects are
normally distributed.
• We assume, conditional on subject-specific effects, that the
responses are independent. Thus, there is no serial correlation.
• There are two commonly used specifications of the conditional
distributions in the random effects panel data model.
– 1. A logistic model for the conditional distribution of a response.
That is,
$$ \mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + x_{it}'\beta) = \frac{1}{1 + \exp\left(-(\alpha_i + x_{it}'\beta)\right)} $$
– 2. A normal model for the conditional distribution of a
response. That is,
$$ \mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \Phi(\alpha_i + x_{it}'\beta) $$
– where Φ is the standard normal distribution function.
Likelihood
• Let Prob(yit = 1 | αi) = π(αi + xit′β) denote the conditional
probability for both the logistic and normal models.
• Conditional on αi, the likelihood for the it th observation is:
$$ \pi(\alpha_i + x_{it}'\beta)^{y_{it}} \left(1 - \pi(\alpha_i + x_{it}'\beta)\right)^{1 - y_{it}} = \begin{cases} 1 - \pi(\alpha_i + x_{it}'\beta) & \text{if } y_{it} = 0 \\ \pi(\alpha_i + x_{it}'\beta) & \text{if } y_{it} = 1 \end{cases} $$
• Conditional on αi, the likelihood for the ith subject is:
$$ \prod_{t=1}^{T_i} \pi(\alpha_i + x_{it}'\beta)^{y_{it}} \left(1 - \pi(\alpha_i + x_{it}'\beta)\right)^{1 - y_{it}} $$
• Thus, the (unconditional) likelihood for the ith subject is:
$$ l_i = \int \prod_{t=1}^{T_i} \pi(a + x_{it}'\beta)^{y_{it}} \left(1 - \pi(a + x_{it}'\beta)\right)^{1 - y_{it}} \phi(a)\,da $$
– Here, φ is the standard normal density function.
• Hence, the total log-likelihood is Σi ln li .
– Note: lots of evaluations of a numerical integral….
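To make the numerical integral concrete, here is a minimal sketch that evaluates li for the logit case with Simpson's rule. The integration range [-8, 8], the number of steps, the scalar β = 0.5 and the two-subject data set are all assumptions chosen for illustration; production code would more likely use Gauss-Hermite or adaptive quadrature.

```python
import math
from itertools import product

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def subject_likelihood(y, xb, lo=-8.0, hi=8.0, steps=400):
    """Approximate l_i, the integral over a of prod_t p^y (1 - p)^(1 - y) phi(a).

    y  : list of 0/1 responses for one subject
    xb : list of linear combinations x_it' beta, one per time point
    Uses composite Simpson's rule, so steps must be even.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        a = lo + k * h
        f = phi(a)
        for yt, xbt in zip(y, xb):
            p = pi(a + xbt)
            f *= p if yt == 1 else 1.0 - p
        weight = 1 if k in (0, steps) else (4 if k % 2 == 1 else 2)
        total += weight * f
    return total * h / 3.0

# Total log-likelihood for a tiny hypothetical data set with scalar beta.
beta = 0.5
data = [([1, 0, 1], [0.2, -0.4, 1.0]),
        ([0, 0, 0], [-1.0, 0.3, 0.1])]
total_loglik = sum(
    math.log(subject_likelihood(y, [beta * x for x in xs])) for y, xs in data
)
print(total_loglik)

# Sanity check: the likelihoods of all 2^T response patterns sum to one.
xb = [0.1, -0.3, 0.7]
total_prob = sum(subject_likelihood(list(y), xb) for y in product([0, 1], repeat=3))
print(total_prob)
```

Every evaluation of the log-likelihood needs one such integral per subject, and a maximizer calls it many times, which is the computational burden the note refers to.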
Comparing logit to probit specification
• There are no important advantages or disadvantages when
choosing the conditional probability π to be:
– logit function (logit model)
– standard normal (probit model)
• The likelihood involves roughly the same amount of work
to evaluate and maximize, although the logit function is
slightly easier to evaluate than the standard normal
distribution function.
• The probit model is slightly easier to interpret because
unconditional probabilities can be expressed in terms of the
standard normal distribution function.
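• That is,
$$ \mathrm{Prob}(y_{it} = 1) = E\,\Phi(\alpha_i + x_{it}'\beta) = \Phi\left( \frac{x_{it}'\beta}{\sqrt{1 + \tau^2}} \right) $$
– Here, τ² denotes the variance of the normally distributed αi.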
9.3 Fixed effects models
• As with homogeneous models, we express the probability
of the response being 1 as a nonlinear function of linear
combinations of explanatory variables.
• To accommodate heterogeneity, we incorporate subject-specific variables of the form:
pit = π(αi + xit′β).
– Here, the subject-specific effects account only for the
intercepts and do not include other variables.
– We assume that {αi} are fixed effects in this section.
• In this chapter, we assume that responses are serially
uncorrelated.
• Important point: In these nonlinear models, estimating the subject-specific intercepts with dummy variables yields
inconsistent parameter estimates.
Maximum likelihood estimation
• Unlike random effects models, maximum likelihood estimators
are inconsistent in fixed effects models.
– The log likelihood of the data set is
$$ L = \sum_{it} \left\{ y_{it} \ln \pi(\alpha_i + x_{it}'\beta) + (1 - y_{it}) \ln\left(1 - \pi(\alpha_i + x_{it}'\beta)\right) \right\} $$
– This log likelihood can still be maximized to yield maximum
likelihood estimators.
– However, as the number of subjects n tends to infinity, the number of
parameters also tends to infinity.
• Intuitively, our ability to estimate β is corrupted by our
inability to estimate consistently the subject-specific effects
{αi}.
– In the linear case, we had that the maximum likelihood estimates are
equivalent to the least squares estimates.
• The least squares estimates of β were consistent.
• The least squares procedure “swept out” intercept estimators
when producing estimates of β.
Maximum likelihood estimation is inconsistent
• Example 9.2 (Chamberlain, 1978, Hsiao 1986).
– Let Ti = 2, K = 1, xi1 = 0 and xi2 = 1.
– Take derivatives of the likelihood function to get the
score functions – these are in display (9.8).
– From (9.8), the score functions are
$$ \frac{\partial L}{\partial \alpha_i} = y_{i1} + y_{i2} - \frac{e^{\alpha_i}}{1 + e^{\alpha_i}} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} = 0 $$
– and
$$ \frac{\partial L}{\partial \beta} = \sum_i \left( y_{i2} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} \right) = 0 $$
– Appendix 9A.1
• Maximize this to get bmle
• Show that the probability limit of bmle is 2β, and hence is an
inconsistent estimator of β.
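The inconsistency in this example can be seen by simulation. Solving the two score equations for this design gives αi = −β/2 for subjects with yi1 + yi2 = 1 and hence bmle = 2 ln(n01/n10), where n01 counts subjects with (yi1, yi2) = (0, 1) and n10 counts those with (1, 0). This closed form is a standard result for the two-period case and is stated here without the appendix's derivation. In the sketch, β = 1, the sample size and the N(0, 1) distribution used to generate the αi are assumptions.

```python
import math
import random

random.seed(2)

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

beta = 1.0     # hypothetical true slope
n = 200000     # number of subjects, each observed at T = 2 time points

n01 = n10 = 0
for _ in range(n):
    alpha = random.gauss(0.0, 1.0)                         # assumed intercept distribution
    y1 = 1 if random.random() < pi(alpha) else 0           # x_i1 = 0
    y2 = 1 if random.random() < pi(alpha + beta) else 0    # x_i2 = 1
    if (y1, y2) == (0, 1):
        n01 += 1
    elif (y1, y2) == (1, 0):
        n10 += 1

# Fixed effects maximum likelihood estimate for this two-period design.
b_mle = 2.0 * math.log(n01 / n10)
# Conditional maximum likelihood estimate, for comparison.
b_cond = math.log(n01 / n10)
print(round(b_mle, 3), round(b_cond, 3))
```

With a large number of subjects bmle settles near 2β rather than β, while the conditional estimate settles near β.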
Conditional maximum likelihood estimation
• This estimation technique provides consistent estimates of
the beta coefficients.
– It is due to Chamberlain (1980) in the context of fixed
effects panel data models.
• Let’s consider the logit specification of π, so that
$$ p_{it} = \pi(\alpha_i + x_{it}'\beta) = \frac{1}{1 + \exp\left(-(\alpha_i + x_{it}'\beta)\right)} $$
• Big idea: With this specification, it turns out that Σt yit is a
sufficient statistic for αi.
– Thus, if we condition on Σt yit , then the distribution of
the responses will not depend on αi.
Example of the sufficiency
• To illustrate how to separate the intercept from the slope
effects, consider the case Ti = 2.
– Suppose that the sum, Σt yit = yi1 + yi2, equals either 0 or 2.
• If sum equals 0, then Prob (yi1 = 0, yi2 = 0 | yi1 + yi2 = sum) = 1.
• If sum equals 2, then Prob (yi1 = 1, yi2 = 1 | yi1 + yi2 = sum) = 1.
• Neither conditional probability depends on αi .
• Both conditional events are certain and will contribute nothing
to a conditional likelihood.
– If sum equals 1,
$$ \mathrm{Prob}(y_{i1} + y_{i2} = 1) = \mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1) + \mathrm{Prob}(y_{i1} = 1)\,\mathrm{Prob}(y_{i2} = 0) = \frac{\exp(\alpha_i + x_{i1}'\beta) + \exp(\alpha_i + x_{i2}'\beta)}{\left(1 + \exp(\alpha_i + x_{i1}'\beta)\right)\left(1 + \exp(\alpha_i + x_{i2}'\beta)\right)} $$
Example of the sufficiency
• Thus,
$$ \mathrm{Prob}(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1) = \frac{\mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1)}{\mathrm{Prob}(y_{i1} + y_{i2} = 1)} = \frac{\exp(\alpha_i + x_{i2}'\beta)}{\exp(\alpha_i + x_{i1}'\beta) + \exp(\alpha_i + x_{i2}'\beta)} = \frac{\exp(x_{i2}'\beta)}{\exp(x_{i1}'\beta) + \exp(x_{i2}'\beta)} $$
• This does not depend on αi .
– Note that if an explanatory variable xitj is time-constant
(xi2j = xi1j ), then the corresponding parameter βj
disappears from the conditional likelihood.
Conditional likelihood estimation
• Let Si be the random variable representing Σt yit and let sumi be
the realization of Σt yit .
• The conditional likelihood of the data set is
$$ \prod_{i=1}^{n} \left\{ \frac{\prod_{t=1}^{T_i} p_{it}^{y_{it}} (1 - p_{it})^{1 - y_{it}}}{\mathrm{Prob}(S_i = \mathrm{sum}_i)} \right\} $$
– Note that the ratio equals one when sumi equals 0 or Ti.
– The distribution of Si is messy and is difficult to compute
for moderate size data sets with T more than 10.
• This provides a fix for the problem of “infinitely many
nuisance parameters.”
– Drawbacks: computationally difficult, hard to extend to more complex
models, hard to explain to consumers
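One subject's contribution to the conditional likelihood can be computed by brute force for small T, which also shows why the intercept drops out. The sketch below enumerates every response pattern with the same sum to obtain Prob(Si = sumi); the function name and the numerical inputs are hypothetical. The number of patterns grows combinatorially with T, which is the computational difficulty noted above.

```python
import math
from itertools import combinations

def pi(z):
    """Logit function pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def conditional_prob(y, xb, alpha):
    """Prob(y_i1, ..., y_iT | S_i = sum of y) under the fixed effects logit model.

    y     : list of 0/1 responses for one subject
    xb    : list of linear combinations x_it' beta
    alpha : subject-specific intercept (the result should not depend on it)
    """
    T = len(y)
    p = [pi(alpha + v) for v in xb]

    def joint(pattern):
        # Product of p^y (1 - p)^(1 - y) over time points.
        out = 1.0
        for yt, pt in zip(pattern, p):
            out *= pt if yt == 1 else 1.0 - pt
        return out

    # Prob(S_i = sum): add up the joint probability of every pattern
    # that has the same number of ones as y.
    denom = 0.0
    for ones in combinations(range(T), sum(y)):
        pattern = [1 if t in ones else 0 for t in range(T)]
        denom += joint(pattern)
    return joint(y) / denom

y = [1, 0, 1, 0]
xb = [0.2, -0.5, 1.0, 0.3]
c1 = conditional_prob(y, xb, alpha=0.3)
c2 = conditional_prob(y, xb, alpha=-2.0)
print(c1, c2)                                       # same value for both intercepts
print(conditional_prob([0, 0, 0, 0], xb, alpha=0.3))  # ratio is one when the sum is 0
```

For T = 2 and a sum of 1 the function reproduces the closed form exp(xi2′β) / (exp(xi1′β) + exp(xi2′β)).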
9.4 Marginal models and GEE
• Marginal models, also known as “population-averaged” models,
only require specification of the first two moments
– Means, variances and covariances
– Not a true probability model
– Ideal for moment estimation (GEE, GMM)
• Begin in the context of the random effects binary dependent
variable model
– The mean is
$$ E\,y_{it} = \mu_{it} = \mu_{it}(\beta, \tau) = \int \pi(a + x_{it}'\beta)\,dF_\alpha(a) $$
– The variance is Var yit = μit (1 − μit ).
– The covariance, for r ≠ s, is
$$ \mathrm{Cov}(y_{ir}, y_{is}) = \int \pi(a + x_{ir}'\beta)\,\pi(a + x_{is}'\beta)\,dF_\alpha(a) - \mu_{ir}\,\mu_{is} $$
GEE – generalized estimating equations
• This is a method of moments procedure
– Essentially the same as generalized method of moments
– One matches theoretical moments to sample moments, with
appropriate weighting.
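• Idea – find the values of the parameters that satisfy
$$ 0_K = \sum_{i=1}^{n} G_\mu(b_{EE}, \tau_{EE})\, V_i(b_{EE}, \tau_{EE})^{-1} \left( y_i - \mu_i(b_{EE}, \tau_{EE}) \right) $$
– We have already specified the variance matrix.
– We also use a K x Ti matrix of derivatives
$$ G_\mu(\beta, \tau) = \frac{\partial \mu_i(\beta, \tau)'}{\partial \beta} = \left( \frac{\partial \mu_{i1}}{\partial \beta} \;\cdots\; \frac{\partial \mu_{iT_i}}{\partial \beta} \right) $$
– For binary variables, we have
$$ \frac{\partial \mu_{it}}{\partial \beta} = x_{it} \int \pi'(a + x_{it}'\beta)\,dF_\alpha(a) $$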
Marginal Model
• Choose the mean function to be μit = Φ(xit′β).
– Motivated by probit specification
$$ \mathrm{Prob}(y_{it} = 1) = E\,\Phi(\alpha_i + x_{it}'\beta) = \Phi\left( \frac{x_{it}'\beta}{\sqrt{1 + \tau^2}} \right) $$
• For the variance function, consider Var yit = φ μit (1 − μit).
• Let Corr(yir, yis) denote the correlation between yir and yis.
– This is known as a working correlation.
• Use the exchangeable correlation structure specified as
$$ \mathrm{Corr}(y_{ir}, y_{is}) = \begin{cases} 1 & \text{for } r = s \\ \rho & \text{for } r \ne s \end{cases} $$
• Here, the motivation is that the latent variable αi is common to
all observations within a subject, thus inducing a common
correlation.
• The parameters τ = (φ, ρ)′ constitute the variance components.
Robust Standard Errors
• Model-based standard errors are taken from the square root of
the diagonal elements of
$$ \left( \sum_{i=1}^{n} G_\mu(b_{EE}, \tau_{EE})\, V_i(b_{EE}, \tau_{EE})^{-1}\, G_\mu(b_{EE}, \tau_{EE})' \right)^{-1} $$
• As an alternative, robust or empirical standard errors are
taken from the square root of the diagonal elements of
$$ \left( \sum_{i=1}^{n} G_\mu V_i^{-1} G_\mu' \right)^{-1} \left( \sum_{i=1}^{n} G_\mu V_i^{-1} (y_i - \mu_i)(y_i - \mu_i)' V_i^{-1} G_\mu' \right) \left( \sum_{i=1}^{n} G_\mu V_i^{-1} G_\mu' \right)^{-1} $$
• These are robust to misspecified heteroscedasticity as well as
time series correlation.
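The estimating equation and both kinds of standard errors can be sketched for the simplest case of a single slope (K = 1), a probit mean function μit = Φ(xit β) and an exchangeable working correlation. Several simplifications here are assumptions of the sketch, not the procedure in the notes: the dispersion φ is fixed at 1, the working correlation ρ is fixed at 0.3 rather than estimated, the data are simulated from a random effects probit model, and all function names are hypothetical. The inverse of the exchangeable correlation matrix is applied in closed form, (I − cJ)/(1 − ρ) with c = ρ/(1 + (T − 1)ρ).

```python
import math
import random

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def v_inverse_times(u, sd, rho):
    """Compute V^{-1} u for V = diag(sd) R diag(sd), R exchangeable with correlation rho."""
    T = len(u)
    w = [ut / st for ut, st in zip(u, sd)]
    c = rho / (1.0 + (T - 1) * rho)
    sw = sum(w)
    return [(wt - c * sw) / ((1.0 - rho) * st) for wt, st in zip(w, sd)]

def gee_pieces(b, subjects, rho):
    """Return U = sum G V^{-1}(y - mu), B = sum G V^{-1} G', M = sum (G V^{-1}(y - mu))^2."""
    U = B = M = 0.0
    for xs, ys in subjects:
        mu = [min(max(Phi(x * b), 1e-10), 1.0 - 1e-10) for x in xs]
        g = [x * phi(x * b) for x in xs]              # derivative of mu w.r.t. b
        sd = [math.sqrt(m * (1.0 - m)) for m in mu]   # dispersion fixed at 1
        resid = [y - m for y, m in zip(ys, mu)]
        u_i = sum(gt * vt for gt, vt in zip(g, v_inverse_times(resid, sd, rho)))
        U += u_i
        B += sum(gt * vt for gt, vt in zip(g, v_inverse_times(g, sd, rho)))
        M += u_i * u_i
    return U, B, M

def fit_gee(subjects, rho, iterations=50):
    """Solve U(b) = 0 by Fisher scoring; return estimate, model-based SE, robust SE."""
    b = 0.0
    for _ in range(iterations):
        U, B, _ = gee_pieces(b, subjects, rho)
        b += U / B
    U, B, M = gee_pieces(b, subjects, rho)
    return b, math.sqrt(1.0 / B), math.sqrt(M) / B

# Simulate from a random effects probit model (all values hypothetical).
random.seed(3)
beta_c, tau, n, T = 1.0, 1.0, 2000, 4
subjects = []
for _ in range(n):
    alpha = random.gauss(0.0, tau)
    xs = [random.gauss(0.0, 1.0) for _ in range(T)]
    ys = [1 if alpha + beta_c * x + random.gauss(0.0, 1.0) > 0.0 else 0 for x in xs]
    subjects.append((xs, ys))

b_ee, se_model, se_robust = fit_gee(subjects, rho=0.3)
U_final = gee_pieces(b_ee, subjects, 0.3)[0]
print(b_ee, se_model, se_robust)
print(beta_c / math.sqrt(1.0 + tau ** 2))  # population-averaged slope implied by the simulation
```

The estimate targets the population-averaged slope β/√(1 + τ²), not the subject-specific slope used to generate the data, which is the distinction the marginal model is built around. The robust standard error remains valid even though the working correlation and dispersion were simply fixed.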