Binary Dependent Variables
9. Binary Dependent Variables
• 9.1 Homogeneous models
– Logit, probit models
– Inference
– Tax preparers
• 9.2 Random effects models
• 9.3 Fixed effects models
• 9.4 Marginal models and GEE
• Appendix 9A - Likelihood calculations
9.1 Homogeneous models
• The response of interest, yit, now may be only a 0 or a 1, a
binary dependent variable.
– Typically indicates whether the ith subject possesses an
attribute at time t.
• Suppose that the probability that the response equals 1 is
denoted by Prob(yit = 1) = pit.
– Then, we may interpret the mean response to be the
probability that the response equals 1 , that is,
\[ \mathrm{E}\, y_{it} = 0 \cdot \mathrm{Prob}(y_{it} = 0) + 1 \cdot \mathrm{Prob}(y_{it} = 1) = p_{it}. \]
– Further, straightforward calculations show that the
variance is related to the mean through the expression
Var yit = pit (1 - pit ) .
Inadequacy of linear models
• Homogeneous means that we do not incorporate subject-specific terms that account for heterogeneity.
• Linear models of the form \(y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}\) are inadequate
because:
– The expected response is a probability and thus must lie
between 0 and 1, although the linear combination, \(\mathbf{x}_{it}'\boldsymbol{\beta}\),
may vary between negative and positive infinity.
– Linear models assume homoscedasticity (constant
variance) yet the variance of the response depends on the
mean which varies over observations.
– The response must be either a 0 or 1 although the
distribution of the error term is typically regarded as
continuous.
Using nonlinear functions of explanatory variables
• In lieu of linear, or additive, functions, we express the
probability of the response being 1 as a nonlinear function
of explanatory variables
\[ p_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}). \]
• Two special cases are:
– \(\pi(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{e^{z}}{e^{z} + 1}\), the logit case
– \(\pi(z) = \Phi(z)\), a cumulative standard normal distribution
function, the probit case.
• These two functions are similar. I focus on the logit case
because it permits closed-form expressions unlike the
cumulative normal distribution function.
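To see how similar the two functions are, the sketch below tabulates both and measures the largest gap between them on a grid. The function names and the rescaling constant 1.702 (a commonly quoted value for matching the logistic curve to the normal one) are assumptions introduced here, not part of the slides.

```python
import math

def logit_cdf(z):
    """Logistic distribution function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_gap(scale=1.702, lo=-6.0, hi=6.0, steps=1200):
    """Largest |Phi(z) - logistic(scale * z)| over an evenly spaced grid."""
    gap = 0.0
    for k in range(steps + 1):
        z = lo + (hi - lo) * k / steps
        gap = max(gap, abs(probit_cdf(z) - logit_cdf(scale * z)))
    return gap

if __name__ == "__main__":
    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(z, round(logit_cdf(z), 4), round(probit_cdf(z), 4))
    print("largest gap after rescaling:", round(max_gap(), 4))
```

After rescaling the argument, the two curves differ by less than one percentage point everywhere on the grid, which is why the choice between them rarely matters in practice.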
Threshold interpretation
• Suppose that there exists an underlying linear model,
\(y_{it}^{*} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}^{*}\).
– The response is interpreted to be the “propensity” to
possess a characteristic.
– We do not observe the propensity but we do observe
when the propensity crosses a threshold, say 0.
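– We observe
\[ y_{it} = \begin{cases} 0 & \text{if } y_{it}^{*} \le 0 \\ 1 & \text{if } y_{it}^{*} > 0 \end{cases} \]
• Using the logit distribution function,
\[ \mathrm{Prob}(\varepsilon_{it}^{*} \le a) = \frac{1}{1 + \exp(-a)}. \]
• Because this distribution is symmetric about zero, \(\mathrm{Prob}(-\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \mathrm{Prob}(\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta})\). Thus,
\[ \mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(y_{it}^{*} > 0) = \mathrm{Prob}(-\varepsilon_{it}^{*} < \mathbf{x}_{it}'\boldsymbol{\beta}) = \mathrm{Prob}(\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp(-\mathbf{x}_{it}'\boldsymbol{\beta})} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}). \]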
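The threshold interpretation can be checked by simulation. The sketch below draws logistic errors by inverting the distribution function, forms the latent propensity, and compares the share of propensities above the threshold 0 with the logit probability. The function names, the value 0.8 for the linear combination, the seed and the number of draws are all illustrative assumptions.

```python
import math
import random

def logit_prob(xb):
    """Logit probability pi(xb) = 1 / (1 + exp(-xb))."""
    return 1.0 / (1.0 + math.exp(-xb))

def simulate_threshold(xb, n_draws=100_000, seed=1):
    """Share of latent propensities y* = xb + eps that exceed the threshold 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                  # avoid log(0) in the inverse cdf
            u = rng.random()
        eps = math.log(u / (1.0 - u))    # standard logistic draw
        if xb + eps > 0.0:               # propensity crosses the threshold
            hits += 1
    return hits / n_draws

if __name__ == "__main__":
    xb = 0.8  # hypothetical value of the linear combination
    print("simulated:", round(simulate_threshold(xb), 3))
    print("logit formula:", round(logit_prob(xb), 3))
```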
Random utility interpretation
• In economics applications, we think of an individual choosing
among c categories.
– Preferences among categories are indexed by an
unobserved utility function.
– We model utility as a function of an underlying value plus
random noise, that is, \(U_{itj} = u_{it}(V_{itj} + \varepsilon_{itj})\), \(j = 0, 1\).
– If Uit1 > Uit0 , then denote this choice as yit = 1.
– Assuming that uit is a strictly increasing function, we have
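\[ \mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(U_{it0} < U_{it1}) = \mathrm{Prob}\big( u_{it}(V_{it0} + \varepsilon_{it0}) < u_{it}(V_{it1} + \varepsilon_{it1}) \big) = \mathrm{Prob}(\varepsilon_{it0} - \varepsilon_{it1} < V_{it1} - V_{it0}). \]
• Parameterize the problem by taking \(V_{it0} = 0\) and \(V_{it1} = \mathbf{x}_{it}'\boldsymbol{\beta}\).
• We may take the difference in the errors, \(\varepsilon_{it0} - \varepsilon_{it1}\), to be
normal or logistic, corresponding to the probit and logit cases.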
Logistic regression
• This is another phrase used to describe the logit case.
• Using \(p = \pi(z)\), the inverse of \(\pi\) can be calculated as
\(z = \pi^{-1}(p) = \ln ( p/(1-p) )\).
– Define logit (p) = ln ( p/(1-p) ) to be the logit function.
– Here, p/(1-p) is known as the odds (often loosely called the odds ratio). It has a
convenient economic interpretation in terms of fair
games.
• That is, suppose that p = 0.25. Then, the odds are 0.25/0.75 = 0.333.
• The odds of winning are 0.333 to 1, or 1 to 3 (that is, 3 to 1 against). If we bet $1,
then in a fair game we should win $3.
• The logistic regression models the linear combination of
explanatory variables as the logarithm of the odds,
\(\mathbf{x}_{it}'\boldsymbol{\beta} = \ln ( p_{it}/(1-p_{it}) )\).
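The fair-game arithmetic can be reproduced in a few lines. This is a minimal sketch; the helper names are introduced here for illustration only.

```python
import math

def odds(p):
    """Odds in favor of the event: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds, the inverse of the logistic function."""
    return math.log(odds(p))

def inv_logit(z):
    """Logistic function, mapping a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    p = 0.25
    print("odds:", round(odds(p), 3))
    print("fair winnings on a $1 bet:", round(1.0 / odds(p), 3))
    print("round trip:", round(inv_logit(logit(p)), 3))
```

In a fair game the expected gain is zero: with winnings w on a $1 stake, p w = (1 - p), so w = (1 - p)/p, the reciprocal of the odds.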
Parameter interpretation
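• To interpret \(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_K)'\), we begin by assuming that
the jth explanatory variable, \(x_{itj}\), is either 0 or 1.
• Then, with the notation \(\mathbf{x}_{it} = (x_{it1}, \ldots, x_{itj}, \ldots, x_{itK})'\), we may interpret
\[ \beta_j = (x_{it1}, \ldots, 1, \ldots, x_{itK})\boldsymbol{\beta} - (x_{it1}, \ldots, 0, \ldots, x_{itK})\boldsymbol{\beta} = \ln \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)} - \ln \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}. \]
• Thus,
\[ e^{\beta_j} = \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)\big)}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)\big)}. \]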
• To illustrate, if \(\beta_j = 0.693\), then \(\exp(\beta_j) \approx 2\).
– The odds (for y = 1) are twice as great for \(x_j = 1\) as for
\(x_j = 0\).
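A quick numerical check of this interpretation: the ratio of the odds at a value of 1 versus 0 for the binary variable equals the exponentiated coefficient, whatever the other explanatory variables are. The coefficient vector and covariate values below are hypothetical.

```python
import math

def prob_one(x, beta):
    """Logit probability that y = 1 given covariates x and coefficients beta."""
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-xb))

def odds(p):
    return p / (1.0 - p)

def odds_ratio(beta, j, x):
    """Odds at x_j = 1 divided by odds at x_j = 0, other covariates held fixed."""
    x_on = list(x)
    x_off = list(x)
    x_on[j], x_off[j] = 1.0, 0.0
    return odds(prob_one(x_on, beta)) / odds(prob_one(x_off, beta))

if __name__ == "__main__":
    beta = [-1.0, 0.693, 0.4]        # hypothetical: intercept, binary variable, continuous variable
    for other in (0.0, 2.5, -3.0):   # the ratio does not change with the other covariate
        print(other, round(odds_ratio(beta, 1, [1.0, 0.0, other]), 4))
```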
More parameter interpretation
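• Similarly, assuming that the jth explanatory variable is
continuous, we have
\[ \beta_j = \frac{d}{dx_{itj}} \mathbf{x}_{it}'\boldsymbol{\beta} = \frac{d}{dx_{itj}} \ln \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj})}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})} = \frac{\dfrac{d}{dx_{itj}} \Big[ \mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big) \Big]}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big)}. \]
• Thus, we may interpret \(\beta_j\) as the proportional change in the
odds ratio, known as an elasticity in economics.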
Parameter estimation
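• The customary estimation method is maximum likelihood.
• The log likelihood of a single observation is
\[ y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) = \begin{cases} \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) & \text{if } y_{it} = 0 \\ \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1 \end{cases} \]
• The log likelihood of the data set is
\[ \sum_{it} \Big\{ y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) \Big\}. \]
• Taking partial derivatives with respect to \(\boldsymbol{\beta}\) yields the score equations
\[ \sum_{it} \mathbf{x}_{it} \big( y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) \big) \frac{\pi'(\mathbf{x}_{it}'\boldsymbol{\beta})}{\pi(\mathbf{x}_{it}'\boldsymbol{\beta}) \big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big)} = \mathbf{0}. \]
– The solution of these equations, say \(\mathbf{b}_{MLE}\), yields the maximum
likelihood estimate.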
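• The score equations can also be expressed as a generalized estimating
equation:
\[ \sum_{it} \left( \frac{\partial}{\partial \boldsymbol{\beta}} \mathrm{E}\, y_{it} \right) (\mathrm{Var}\, y_{it})^{-1} (y_{it} - \mathrm{E}\, y_{it}) = \mathbf{0}, \]
• where
\[ \mathrm{E}\, y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \frac{\partial}{\partial \boldsymbol{\beta}} \mathrm{E}\, y_{it} = \mathbf{x}_{it}\, \pi'(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \mathrm{Var}\, y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) \big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big). \]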
For the logit function
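• For the logit function, \(\pi'(z) = \pi(z)(1 - \pi(z))\), so the ratio in the score equations equals one. The normal equations are:
\[ \sum_{it} \mathbf{x}_{it} \big( y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) \big) = \mathbf{0}. \]
– The solution depends on the responses \(y_{it}\) only through the vector of
statistics \(\sum_{it} \mathbf{x}_{it} y_{it}\).
• The solution of these equations yields the
maximum likelihood estimate \(\mathbf{b}_{MLE}\).
• This method can be extended to provide standard errors for the
estimates.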
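One way to solve the logit normal equations is Newton's method. The sketch below handles an intercept plus a single explanatory variable so that the 2 x 2 information matrix can be inverted by hand. The data XS and YS, the function names and the fixed iteration count are hypothetical choices made for illustration, not the slides' own procedure.

```python
import math

# Hypothetical data: one explanatory variable and a binary response.
XS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
YS = [0, 0, 1, 0, 1, 0, 1, 1]

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_info(beta, xs, ys):
    """Score sum x (y - pi) and information sum pi (1 - pi) x x' for x = (1, x)."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        p = pi(beta[0] + beta[1] * x)
        r, w = y - p, p * (1.0 - p)
        s0 += r
        s1 += x * r
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (s0, s1), (i00, i01, i11)

def fit_logit(xs, ys, n_iter=25):
    """Newton iterations beta <- beta + information^{-1} score, starting at zero."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        (s0, s1), (i00, i01, i11) = score_and_info(beta, xs, ys)
        det = i00 * i11 - i01 * i01
        beta[0] += (i11 * s0 - i01 * s1) / det
        beta[1] += (i00 * s1 - i01 * s0) / det
    return beta

def standard_errors(beta, xs, ys):
    """Square roots of the diagonal of the inverse information matrix."""
    _, (i00, i01, i11) = score_and_info(beta, xs, ys)
    det = i00 * i11 - i01 * i01
    return math.sqrt(i11 / det), math.sqrt(i00 / det)

if __name__ == "__main__":
    b = fit_logit(XS, YS)
    print("estimates:", [round(v, 4) for v in b])
    print("standard errors:", [round(v, 4) for v in standard_errors(b, XS, YS)])
```

The inverse information matrix evaluated at the solution is the usual route to the standard errors mentioned above.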
9.2 Random effects models
• We accommodate heterogeneity by incorporating subject-specific variables of the form:
\(p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\).
– We assume that the intercepts are realizations of random
variables from a common distribution.
• We estimate the parameters of the \(\{\alpha_i\}\) distribution and the
K slope parameters \(\boldsymbol{\beta}\).
• By using the random effects specification, we dramatically
reduce the number of parameters to be estimated
compared to the Section 9.3 fixed effects set-up.
– This is similar to the linear model case.
• This model is computationally difficult to evaluate.
Commonly used distributions
• We assume that subject-specific effects are independent and
come from a common distribution.
– It is customary to assume that the subject-specific effects are
normally distributed.
• We assume, conditional on subject-specific effects, that the
responses are independent. Thus, there is no serial correlation.
• There are two commonly used specifications of the conditional
distributions in the random effects panel data model.
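– 1. A logistic model for the conditional distribution of a response. That is,
\[ \mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}. \]
– 2. A normal model for the conditional distribution of a
response. That is,
\[ \mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}), \]
– where \(\Phi\) is the standard normal distribution function.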
Likelihood
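• Let \(\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\) denote the conditional
probability for both the logistic and normal models.
• Conditional on \(\alpha_i\), the likelihood for the it-th observation is:
\[ \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} = \begin{cases} 1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 0 \\ \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1 \end{cases} \]
• Conditional on \(\alpha_i\), the likelihood for the ith subject is:
\[ \prod_{t=1}^{T_i} \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} \]
• Thus, the (unconditional) likelihood for the ith subject is:
\[ l_i = \int \left\{ \prod_{t=1}^{T_i} \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} \right\} \phi(a)\, da. \]
– Here, \(\phi\) is the standard normal density function.
• Hence, the total log-likelihood is \(\sum_i \ln l_i\).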
– Note: lots of evaluations of a numerical integral….
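To make the numerical integral concrete, the sketch below approximates the likelihood of one subject with the trapezoid rule, using the logit form for the conditional probability and the standard normal density as the mixing distribution. The integration limits, grid size, function names and the linear-combination values in the example are assumptions. Quadrature rules designed for normal weights would usually need far fewer evaluations.

```python
import math
from itertools import product

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def subject_likelihood(ys, xbs, lo=-8.0, hi=8.0, n=400):
    """Trapezoid approximation of the integral over a of
    prod_t pi(a + xb_t)^y_t (1 - pi(a + xb_t))^(1 - y_t) phi(a)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        a = lo + k * h
        cond = 1.0
        for y, xb in zip(ys, xbs):
            p = pi(a + xb)
            cond *= p if y == 1 else 1.0 - p
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * cond * phi(a)
    return total * h

def total_over_patterns(xbs):
    """Sum of the subject likelihood over every possible 0/1 response pattern."""
    return sum(subject_likelihood(ys, xbs) for ys in product((0, 1), repeat=len(xbs)))

if __name__ == "__main__":
    xbs = [0.2, -0.5, 1.0]   # hypothetical values of the linear combination for T_i = 3
    print("l_i for (1, 0, 1):", round(subject_likelihood((1, 0, 1), xbs), 5))
    print("sum over all patterns:", round(total_over_patterns(xbs), 8))
```

The likelihoods of all possible response patterns should sum to one, which gives a cheap check on the integration.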
Comparing logit to probit specification
• There are no important advantages or disadvantages when
choosing the conditional probability to be:
– logit function (logit model)
– standard normal (probit model)
• The likelihood involves roughly the same amount of work
to evaluate and maximize, although the logit function is
slightly easier to evaluate than the standard normal
distribution function.
• The probit model is slightly easier to interpret because
unconditional probabilities can be expressed in terms of the
standard normal distribution function.
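• That is, for \(\alpha_i\) normally distributed with mean zero and variance \(\tau^2\),
\[ \mathrm{Prob}(y_{it} = 1) = \mathrm{E}\, \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\!\left( \frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}} \right). \]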
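The closed form for the unconditional probit probability can be verified numerically. The sketch assumes the subject-specific effect is normal with mean zero and standard deviation tau, and compares a trapezoid-rule expectation with the closed-form expression. The function names and the example values of the linear combination and tau are illustrative.

```python
import math

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def marginal_prob_numeric(xb, tau, lo=-8.0, hi=8.0, n=800):
    """E Phi(alpha + xb) with alpha = tau * a and a standard normal (trapezoid rule)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        a = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * Phi(tau * a + xb) * phi(a)
    return total * h

def marginal_prob_closed(xb, tau):
    """Closed form Phi(xb / sqrt(1 + tau^2))."""
    return Phi(xb / math.sqrt(1.0 + tau * tau))

if __name__ == "__main__":
    xb, tau = 0.7, 1.5   # hypothetical values
    print(round(marginal_prob_numeric(xb, tau), 6), round(marginal_prob_closed(xb, tau), 6))
```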
9.3 Fixed effects models
• As with homogeneous models, we express the probability
of the response being 1 as a nonlinear function of linear
combinations of explanatory variables.
• To accommodate heterogeneity, we incorporate subject-specific variables of the form:
\(p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\).
– Here, the subject-specific effects account only for the
intercepts and do not include other variables.
– We assume that \(\{\alpha_i\}\) are fixed effects in this section.
• In this chapter, we assume that responses are serially
uncorrelated.
• Important point: Panel data with dummy variables provide
inconsistent parameter estimates….
Maximum likelihood estimation
• Unlike random effect models, maximum likelihood estimators
are inconsistent in fixed effects models.
– The log likelihood of the data set is
\[ \sum_{it} \Big\{ y_{it} \ln \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big) \Big\}. \]
– This log likelihood can still be maximized to yield maximum
likelihood estimators.
– However, as the number of subjects n tends to infinity, the number of
parameters also tends to infinity.
• Intuitively, our ability to estimate \(\boldsymbol{\beta}\) is corrupted by our
inability to estimate consistently the subject-specific effects
\(\{\alpha_i\}\).
– In the linear case, we had that the maximum likelihood estimates are
equivalent to the least squares estimates.
• The least squares estimates of \(\boldsymbol{\beta}\) were consistent.
• The least squares procedure “swept out” intercept estimators
when producing estimates of \(\boldsymbol{\beta}\).
Maximum likelihood estimation is inconsistent
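• Example 9.2 (Chamberlain, 1978; Hsiao, 1986).
– Let \(T_i = 2\), \(K = 1\), \(x_{i1} = 0\) and \(x_{i2} = 1\).
– Take derivatives of the log-likelihood function to get the
score functions, which are in display (9.8).
– From (9.8), the score functions are
\[ \frac{\partial L}{\partial \alpha_i} = y_{i1} + y_{i2} - \frac{e^{\alpha_i}}{1 + e^{\alpha_i}} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} = 0 \]
– and
\[ \frac{\partial L}{\partial \beta} = \sum_i \left( y_{i2} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} \right) = 0. \]
– Appendix 9A.1
• Maximize this to get \(b_{MLE}\).
• Show that the probability limit of \(b_{MLE}\) is \(2\beta\), and hence that \(b_{MLE}\) is an
inconsistent estimator of \(\beta\).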
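The inconsistency can be seen in a simulation. Solving the two score equations of this example by hand gives a closed form: for subjects with \(y_{i1} + y_{i2} = 1\), the first equation gives \(\hat{\alpha}_i = -b/2\); subjects with a sum of 0 or 2 drop out of the second equation; and what remains yields \(b_{MLE} = 2 \ln(n_{01}/n_{10})\), where \(n_{01}\) and \(n_{10}\) count the subjects with responses (0, 1) and (1, 0). This closed form is a derivation added here, and the standard normal distribution used to generate the intercepts, the seed and the sample size are assumptions made only to produce data.

```python
import math
import random

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_counts(beta, n, seed=7):
    """Counts of subjects with (y1, y2) = (0, 1) and (1, 0) when x1 = 0 and x2 = 1."""
    rng = random.Random(seed)
    n01 = n10 = 0
    for _ in range(n):
        alpha = rng.gauss(0.0, 1.0)          # hypothetical intercept distribution
        y1 = 1 if rng.random() < pi(alpha) else 0
        y2 = 1 if rng.random() < pi(alpha + beta) else 0
        if (y1, y2) == (0, 1):
            n01 += 1
        elif (y1, y2) == (1, 0):
            n10 += 1
    return n01, n10

def b_mle(n01, n10):
    """Fixed effects MLE of beta implied by the two score equations of the example."""
    return 2.0 * math.log(n01 / n10)

if __name__ == "__main__":
    true_beta = 1.0
    n01, n10 = simulate_counts(true_beta, n=50_000)
    print("true beta:", true_beta, " fixed effects MLE:", round(b_mle(n01, n10), 3))
```

With many subjects the estimate settles near twice the true slope rather than near the slope itself.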
Conditional maximum likelihood estimation
• This estimation technique provides consistent estimates of
the beta coefficients.
– It is due to Chamberlain (1980) in the context of fixed
effects panel data models.
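• Let’s consider the logit specification of \(\pi\), so that
\[ p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}. \]
• Big idea: With this specification, it turns out that \(\sum_t y_{it}\) is a
sufficient statistic for \(\alpha_i\).
– Thus, if we condition on \(\sum_t y_{it}\), then the distribution of
the responses will not depend on \(\alpha_i\).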
Example of the sufficiency
• To illustrate how to separate the intercept from the slope
effects, consider the case Ti = 2.
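– Suppose that the sum, \(\sum_t y_{it} = y_{i1} + y_{i2}\), equals either 0 or 2.
• If sum equals 0, then Prob (yi1 = 0, yi2 = 0 |yi1 + yi2 = sum) = 1.
• If sum equals 2, then Prob (yi1 = 1, yi2 = 1 |yi1 + yi2 = sum) = 1.
• Neither conditional probability depends on \(\alpha_i\).
• Both conditional events are certain and will contribute nothing
to a conditional likelihood.
– If sum equals 1,
\[ \mathrm{Prob}(y_{i1} + y_{i2} = 1) = \mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1) + \mathrm{Prob}(y_{i1} = 1)\,\mathrm{Prob}(y_{i2} = 0) = \frac{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\big(1 + \exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})\big)\big(1 + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})\big)}. \]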
Example of the sufficiency
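• Thus,
\[ \mathrm{Prob}(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1) = \frac{\mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1)}{\mathrm{Prob}(y_{i1} + y_{i2} = 1)} = \frac{\exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})} = \frac{\exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}. \]
• This does not depend on \(\alpha_i\).
– Note that if an explanatory variable \(x_{ij}\) is time-constant
(\(x_{i2,j} = x_{i1,j}\)), then the corresponding parameter \(\beta_j\)
disappears from the conditional likelihood.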
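A small numerical check of the sufficiency argument for two time periods: the conditional probability computed from the full logit probabilities is the same for any value of the subject-specific intercept, and it matches the expression that is free of the intercept. The function names and the values used for the two linear combinations are hypothetical.

```python
import math

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def cond_prob_01(alpha, xb1, xb2):
    """Prob(y1 = 0, y2 = 1 | y1 + y2 = 1), computed from the logit probabilities."""
    p1, p2 = pi(alpha + xb1), pi(alpha + xb2)
    joint01 = (1.0 - p1) * p2
    joint10 = p1 * (1.0 - p2)
    return joint01 / (joint01 + joint10)

def cond_prob_01_free(xb1, xb2):
    """The same probability written without the intercept."""
    return math.exp(xb2) / (math.exp(xb1) + math.exp(xb2))

if __name__ == "__main__":
    xb1, xb2 = 0.3, 1.1   # hypothetical values of the linear combination at t = 1, 2
    for alpha in (-3.0, 0.0, 4.0):
        print(alpha, round(cond_prob_01(alpha, xb1, xb2), 6))
    print("intercept-free form:", round(cond_prob_01_free(xb1, xb2), 6))
```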
Conditional likelihood estimation
• Let \(S_i\) be the random variable representing \(\sum_t y_{it}\) and let \(\mathrm{sum}_i\) be
the realization of \(\sum_t y_{it}\).
• The conditional likelihood of the data set is
\[ \prod_{i=1}^{n} \frac{\prod_{t=1}^{T_i} p_{it}^{y_{it}} (1 - p_{it})^{1 - y_{it}}}{\mathrm{Prob}(S_i = \mathrm{sum}_i)}. \]
– Note that the ratio equals one when \(\mathrm{sum}_i\) equals 0 or \(T_i\).
– The distribution of \(S_i\) is messy and is difficult to compute
for moderate-size data sets with T greater than 10.
• This provides a fix for the problem of “infinitely many
nuisance parameters.”
– Computationally difficult, hard to extend to more complex
models, hard to explain to consumers
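For a general number of time periods, one subject's contribution to the conditional likelihood needs the distribution of the sum of independent but non-identical binary responses. The sketch below builds that distribution with a generic dynamic-programming recursion, which is a technique chosen here for illustration and not necessarily what the slides have in mind. The function names and input values are hypothetical.

```python
import math

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def sum_distribution(ps):
    """Distribution of S = sum of independent Bernoulli(p_t) variables.
    Entry s of the returned list is Prob(S = s)."""
    dist = [1.0]
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1.0 - p)     # this period's response is 0
            new[s + 1] += mass * p         # this period's response is 1
        dist = new
    return dist

def conditional_contribution(ys, alpha, xbs):
    """Joint probability of the observed responses divided by Prob(S_i = sum_i)."""
    ps = [pi(alpha + xb) for xb in xbs]
    joint = 1.0
    for y, p in zip(ys, ps):
        joint *= p if y == 1 else 1.0 - p
    return joint / sum_distribution(ps)[sum(ys)]

if __name__ == "__main__":
    xbs = [0.3, -0.4, 1.1]   # hypothetical linear combinations for T_i = 3
    for alpha in (-2.0, 0.0, 3.0):
        print(alpha, round(conditional_contribution([1, 0, 1], alpha, xbs), 6))
```

The printed contribution is the same for every intercept value, as the sufficiency argument predicts, and the recursion needs on the order of T squared operations per subject.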
9.4 Marginal models and GEE
• Marginal models, also known as “population-averaged” models,
only require specification of the first two moments
– Means, variances and covariances
– Not a true probability model
– Ideal for moment estimation (GEE, GMM)
• Begin in the context of the random effects binary dependent
variable model
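– The mean is \(\mathrm{E}\, y_{it} = \mu_{it} = \mu_{it}(\boldsymbol{\beta}, \tau) = \int \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\, dF_{\alpha}(a)\).
– The variance is \(\mathrm{Var}\, y_{it} = \mu_{it}(1 - \mu_{it})\).
– The covariance is \(\mathrm{Cov}(y_{ir}, y_{is}) = \int \pi(a + \mathbf{x}_{ir}'\boldsymbol{\beta})\, \pi(a + \mathbf{x}_{is}'\boldsymbol{\beta})\, dF_{\alpha}(a) - \mu_{ir}\, \mu_{is}\).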
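These three moments can be evaluated numerically once a distribution for the subject-specific effect is chosen. The sketch assumes a logit form for the conditional probability and a normal distribution with mean zero and standard deviation tau for the effect; both choices, along with the function names and example values, are assumptions for illustration.

```python
import math

def pi(z):
    return 1.0 / (1.0 + math.exp(-z))

def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def expect(f, lo=-8.0, hi=8.0, n=800):
    """Trapezoid approximation of E f(a) for a standard normal a."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        a = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * f(a) * phi(a)
    return total * h

def marginal_mean(xb, tau):
    """mu = E pi(alpha + xb) with alpha = tau * a."""
    return expect(lambda a: pi(tau * a + xb))

def marginal_var(xb, tau):
    mu = marginal_mean(xb, tau)
    return mu * (1.0 - mu)

def marginal_cov(xb_r, xb_s, tau):
    """E[pi(alpha + xb_r) pi(alpha + xb_s)] minus the product of the two means."""
    cross = expect(lambda a: pi(tau * a + xb_r) * pi(tau * a + xb_s))
    return cross - marginal_mean(xb_r, tau) * marginal_mean(xb_s, tau)

if __name__ == "__main__":
    xb_r, xb_s, tau = 0.2, -0.3, 1.0   # hypothetical values
    print("means:", round(marginal_mean(xb_r, tau), 4), round(marginal_mean(xb_s, tau), 4))
    print("covariance:", round(marginal_cov(xb_r, xb_s, tau), 4))
```

The covariance is positive whenever tau is positive, because the shared effect moves both probabilities in the same direction, and it vanishes when tau is zero.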
GEE – generalized estimating equations
• This is a method of moments procedure
– Essentially the same as generalized method of moments
– One matches theoretical moments to sample moments, with
appropriate weighting.
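• Idea – find the values of the parameters that satisfy
\[ \mathbf{0}_K = \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \tau_{EE})\, \mathbf{V}_i(\mathbf{b}_{EE}, \tau_{EE})^{-1} \big( \mathbf{y}_i - \boldsymbol{\mu}_i(\mathbf{b}_{EE}, \tau_{EE}) \big). \]
– We have already specified the variance matrix.
– We also use a \(K \times T_i\) matrix of derivatives
\[ \mathbf{G}_{\mu}(\boldsymbol{\beta}, \tau) = \frac{\partial \boldsymbol{\mu}_i(\boldsymbol{\beta}, \tau)'}{\partial \boldsymbol{\beta}} = \left( \frac{\partial \mu_{i1}}{\partial \boldsymbol{\beta}} \;\cdots\; \frac{\partial \mu_{iT_i}}{\partial \boldsymbol{\beta}} \right). \]
– For binary variables, we have
\[ \frac{\partial \mu_{it}}{\partial \boldsymbol{\beta}} = \mathbf{x}_{it} \int \pi'(a + \mathbf{x}_{it}'\boldsymbol{\beta})\, dF_{\alpha}(a). \]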
Marginal Model
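• Choose the mean function to be \(\mu_{it} = \Phi(\mathbf{x}_{it}'\boldsymbol{\beta})\).
– Motivated by the probit specification
\[ \mathrm{Prob}(y_{it} = 1) = \mathrm{E}\, \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\!\left( \frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}} \right). \]
• For the variance function, consider \(\mathrm{Var}\, y_{it} = \mu_{it}(1 - \mu_{it})\).
• Let \(\mathrm{Corr}(y_{ir}, y_{is})\) denote the correlation between \(y_{ir}\) and \(y_{is}\).
– This is known as a working correlation.
• Use the exchangeable correlation structure specified as
\[ \mathrm{Corr}(y_{ir}, y_{is}) = \begin{cases} 1 & \text{for } r = s \\ \rho & \text{for } r \ne s \end{cases} \]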
• Here, the motivation is that the latent variable \(\alpha_i\) is common to
all observations within a subject, thus inducing a common
correlation.
• The parameters τ = (, ) constitute the variance components.
Robust Standard Errors
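• Model-based standard errors are taken from the square root of
the diagonal elements of
\[ \left( \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \tau_{EE})\, \mathbf{V}_i(\mathbf{b}_{EE}, \tau_{EE})^{-1}\, \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \tau_{EE})' \right)^{-1}. \]
• As an alternative, robust or empirical standard errors are taken from
the square root of the diagonal elements of
\[ \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1} \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_i)(\mathbf{y}_i - \boldsymbol{\mu}_i)' \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right) \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1}. \]
• These are robust to misspecified heteroscedasticity as well as
time series correlation.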
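The difference between the two kinds of standard errors is easiest to see in the simplest case. The sketch below assumes a single parameter (K = 1) and a working independence structure, so that each variance matrix is diagonal and the matrix expressions reduce to sums. The input layout, function name and numbers are hypothetical.

```python
import math

def sandwich_scalar(subjects):
    """Model-based and robust standard errors for a scalar parameter.

    subjects: list of (g, v, r) triples, one per subject, where for each time t
      g[t] = derivative of the mean with respect to the parameter,
      v[t] = working variance (diagonal working covariance assumed),
      r[t] = residual y_it - mu_it.
    """
    bread = 0.0   # sum over subjects of G V^{-1} G'
    meat = 0.0    # sum over subjects of (G V^{-1} r)^2
    for g, v, r in subjects:
        bread += sum(gt * gt / vt for gt, vt in zip(g, v))
        u = sum(gt * rt / vt for gt, vt, rt in zip(g, v, r))
        meat += u * u
    model_se = math.sqrt(1.0 / bread)
    robust_se = math.sqrt(meat / (bread * bread))
    return model_se, robust_se

if __name__ == "__main__":
    # Hypothetical subjects whose two residuals move together within a subject.
    subjects = [([1.0, 1.0], [0.25, 0.25], [0.5, 0.5]),
                ([1.0, 1.0], [0.25, 0.25], [-0.5, -0.5])] * 5
    model_se, robust_se = sandwich_scalar(subjects)
    print("model-based:", round(model_se, 4), " robust:", round(robust_se, 4))
```

When residuals within a subject move together, the robust standard error exceeds the model-based one, reflecting correlation that the working independence structure ignores.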