Maximum Likelihood Estimation
Methods of Economic Investigation
Lecture 17
Last Time

IV estimation issues
Heterogeneous Treatment Effects
  The assumptions
  LATE interpretation
Weak Instruments
  Bias in finite samples
  F-statistic test
Today’s Class

Maximum Likelihood Estimators
  You’ve seen this in the context of OLS
  Can make other assumptions on the form of the likelihood function
  This is how we estimate discrete choice models like probit and logit

This is a very useful form of estimation
  Has nice properties
  Can be very robust to mis-specification
Our Standard OLS

Standard OLS: Y_i = X_i'β + ε_i

Focus on minimizing the mean squared error, with the assumption that ε_i | X_i ~ N(0, σ²)
Another way to motivate linear models

“Extremum estimators”: maximize/minimize some objective function
  OLS minimizes the mean squared error
  Could also imagine optimizing some other types of functions
  We often use a “likelihood function”

This approach is more general, allowing us to deal with more complex nonlinear models
Useful properties in terms of consistency and asymptotic convergence
What is a likelihood function?

Suppose we have independent and identically distributed random variables {Z_1, . . . , Z_N} drawn from a density function f(z; θ). Then the likelihood function given a sample is

$$L(\theta) = \prod_{i=1}^{N} f(z_i; \theta)$$

Because it is sometimes convenient, we often use this in logarithmic form:

$$\log L(\theta) = \sum_{i=1}^{N} \log f(z_i; \theta)$$
Consistency - 1

Consider the population likelihood function with the “true” parameter θ0:

$$L_0(\theta) = E_0[\log f(z;\theta)] = \int_z [\log f(z;\theta)]\, f(z;\theta_0)\, dz$$

Think of L_0 as the population average and log L as the sample estimate, so that in the usual way

$$\frac{1}{N}\log L(\theta; z_1,\dots,z_N) \xrightarrow{\;p\;} L_0(\theta)$$
Consistency - 2

The population likelihood function L_0(θ) is maximized at the true value, θ0. Why?

  Think of the sample likelihood function as telling us how likely it is that one would observe the sample if the parameter value θ were really the true parameter value.
  Similarly, the population likelihood function L_0(θ) will be largest at the value of θ that makes it most likely to “observe the population.”
  That value is the true parameter value, i.e. θ0 = argmax L_0(θ).
Consistency - 3

We now know that the population likelihood L_0(θ) is maximized at θ0
  This follows from applying Jensen’s inequality to the log function
The sample (average) log-likelihood log L(θ; z) gets closer to L_0(θ) as N increases
  i.e. log L will start having the same shape as L_0
For large N, the sample likelihood will therefore be maximized near θ0:

$$\hat\theta_{MLE} - \theta_0 \xrightarrow[N \to \infty]{\;p\;} 0$$
Information Matrix Equality

An additional useful property of the MLE comes from the information matrix equality:

$$E\left[\frac{\partial^2 \ln f(z;\theta)}{\partial\theta\,\partial\theta'}\right] = -\,E\left[\frac{\partial \ln f(z;\theta)}{\partial\theta}\,\frac{\partial \ln f(z;\theta)}{\partial\theta'}\right]$$

Define the score function as the vector of derivatives of the log likelihood function:

$$S(z;\theta) = \frac{\partial \ln f(z;\theta)}{\partial\theta}$$

Define the Hessian as the matrix of second derivatives of the log likelihood function:

$$H(z;\theta) = \frac{\partial^2 \ln f(z;\theta)}{\partial\theta\,\partial\theta'}$$
Asymptotic Distribution

Define the following:

$$I(\theta_0) = -E[H(z;\theta_0)] = -E\left[\frac{\partial^2 \ln f(z;\theta)}{\partial\theta\,\partial\theta'}\bigg|_{\theta=\theta_0}\right]$$

Then the MLE estimate will converge in distribution to:

$$\sqrt{N}\,(\hat\theta_{MLE} - \theta_0) \xrightarrow{\;d\;} N\!\left(0,\, I(\theta_0)^{-1}\right)$$

where the information matrix I(θ) has the property that Var(θ̂) ≥ I(θ0)^{-1}, i.e. there does not exist a consistent estimator of θ with a smaller asymptotic variance.
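Continuing the hypothetical normal example, a rough sketch of using this result in practice: the inverse Hessian of the total negative log-likelihood at the maximum approximates the variance of θ̂. Here BFGS’s built-in quasi-Newton approximation to that inverse Hessian is used, so the standard errors are only approximate.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
z = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_loglik(theta, z):
    mu, log_sigma = theta                  # parameterize with log(sigma) so sigma > 0
    return -np.sum(stats.norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(z,), method="BFGS")

# res.hess_inv is BFGS's approximation to the inverse Hessian of -log L at the maximum,
# i.e. roughly [N * I(theta_hat)]^{-1}, the asymptotic variance of (mu_hat, log_sigma_hat)
std_errors = np.sqrt(np.diag(res.hess_inv))
print(res.x)          # (mu_hat, log_sigma_hat)
print(std_errors)     # approximate standard errors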
Computation

Can be quite complex because we need to maximize numerically

General procedure (a rough sketch follows below):
  Re-scale variables so they have roughly similar variances
  Choose some starting value and estimate the maximum in that area
  Do this over and over across different starting values / grids
  Get an approximation of the underlying objective function
  If this converges to a single maximum, you’re done
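A minimal multi-start sketch in that spirit, using the same hypothetical normal log-likelihood as before; the grid of starting values is arbitrary and only for illustration.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)
z = rng.normal(loc=2.0, scale=1.5, size=500)
z_scaled = (z - z.mean()) / z.std()        # re-scale so the data have roughly unit variance

def neg_loglik(theta, z):
    mu, log_sigma = theta
    return -np.sum(stats.norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

# Try several starting values and keep the best local maximum found
starts = [np.array([m, s]) for m in (-1.0, 0.0, 1.0) for s in (-0.5, 0.0, 0.5)]
results = [optimize.minimize(neg_loglik, x0=x0, args=(z_scaled,)) for x0 in starts]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)                    # if all starts agree on the same maximum, we are done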
Test Statistics

Define our likelihood function L(z; θ0, θ1)

Suppose we want to test H0: θ0 = 0 against the alternative HA: θ0 ≠ 0

We could estimate a restricted and an unrestricted likelihood function:

$$(\hat\theta_{0r}^{MLE}, \hat\theta_{1r}^{MLE}) = \left(0,\; \arg\max_{\theta_1} \log L(z; 0, \theta_1)\right)$$

$$(\hat\theta_{0u}^{MLE}, \hat\theta_{1u}^{MLE}) = \arg\max_{\theta_0,\,\theta_1} \log L(z; \theta_0, \theta_1)$$
Test Statistics - 1

We can test how “close” our restricted and unrestricted models might be:

$$LR = 2\left[\log L(\hat\theta_u^{MLE}) - \log L(\hat\theta_r^{MLE})\right] \sim \chi^2\!\left(\dim(\theta_0)\right)$$

We could also test whether the restricted log likelihood function is maximized at θ0 = 0: the derivative of the log likelihood function with respect to θ0 at that point should be close to zero.

$$LM = \frac{1}{N}\left[\sum_{i=1}^{N} S(z_i; \hat\theta_r^{MLE})\right]'\, I^{-1}\left[\sum_{i=1}^{N} S(z_i; \hat\theta_r^{MLE})\right]$$
Test Statistics - 2
The restricted and unrestricted estimates of θ should be close together if the null hypothesis is correct

Partition the inverse information matrix as follows:

$$I^{-1} = \begin{bmatrix} I^{00} & I^{01} \\ I^{10} & I^{11} \end{bmatrix}$$

Define the Wald test as:

$$W = N\,(\hat\theta_r^{MLE} - \hat\theta_u^{MLE})'\,(I^{00})^{-1}\,(\hat\theta_r^{MLE} - \hat\theta_u^{MLE})$$
Comparing test statistics

In large samples, these test statistics should converge in probability
  In finite samples, the three will tend to generate somewhat different test statistics, but will generally come to the same conclusion

The difference between the tests is how they go about answering the question:
  The LR test requires estimates of both of the models
  The W and LM tests approximate the LR test but require that only one model be estimated

When the model is linear, the three test statistics have the following relationship: W ≥ LR ≥ LM
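For illustration, a minimal LR-test sketch (again with the hypothetical normal-mean model; the restriction tested is μ = 0, so there is one degree of freedom):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
z = rng.normal(loc=0.3, scale=1.5, size=200)

def loglik(mu, sigma, z):
    return np.sum(stats.norm.logpdf(z, loc=mu, scale=sigma))

# Unrestricted MLE: mu_hat = sample mean, sigma_hat = MLE standard deviation
loglik_u = loglik(z.mean(), z.std(), z)
# Restricted MLE under H0: mu = 0; the variance MLE is then the mean of z**2
loglik_r = loglik(0.0, np.sqrt(np.mean(z**2)), z)

LR = 2 * (loglik_u - loglik_r)                     # ~ chi-squared with 1 degree of freedom
p_value = stats.chi2.sf(LR, df=1)
print(LR, p_value)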
OLS in the MLE context

Linear model log likelihood function (written out below)

Choose the parameter values which maximize it; under the normality assumption for ε this reproduces the OLS estimator of β
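For reference, with the earlier assumption that ε_i | X_i ~ N(0, σ²), the linear model log-likelihood takes the standard form:

$$\log L(\beta, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(Y_i - X_i'\beta\right)^2$$

Maximizing over β is the same as minimizing the sum of squared residuals, so the MLE of β coincides with OLS; the MLE of σ² is the average squared residual.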
Example 1: Discrete choice

Latent Variable Model:
  True variable of interest is: Y* = X'β + ε
  We don't observe Y*, but we can observe Y = 1[Y* > 0]
  Pr[Y=1] = Pr[Y* > 0] = Pr[ε < X'β] (for ε symmetric about zero)

What to assume about ε?
  Linear Probability Model: Pr[Y=1] = X'β
  Probit Model: Pr[Y=1] = Φ(X'β)
  Logit Model: Pr[Y=1] = exp(X'β) / [1 + exp(X'β)]
Likelihood Functions

Probit and Logit (written out below)
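For reference, the standard binary-outcome log-likelihoods implied by these two models, for an i.i.d. sample {(Y_i, X_i)}, are:

Probit:
$$\log L(\beta) = \sum_{i=1}^{N}\left[Y_i \log \Phi(X_i'\beta) + (1-Y_i)\log\left(1-\Phi(X_i'\beta)\right)\right]$$

Logit:
$$\log L(\beta) = \sum_{i=1}^{N}\left[Y_i \log \Lambda(X_i'\beta) + (1-Y_i)\log\left(1-\Lambda(X_i'\beta)\right)\right], \qquad \Lambda(u) = \frac{e^{u}}{1+e^{u}}$$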
Marginal Effects
In the linear probability model we can interpret our coefficients as the change in Pr[Y=1] with respect to the relevant variable, i.e. ∂Pr[Y=1]/∂X = β

In non-linear models, things are a bit trickier:
  We get the parameter estimate of β
  But we want (for the probit): ∂Pr[Y=1]/∂X = φ(X'β)·β

These are the “marginal effects” and are typically evaluated at the mean values of X
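As a sketch (not from the slides): fitting a probit by MLE on simulated data and computing marginal effects at the mean of X; all variable names and the data-generating process are illustrative.

import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(6)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])       # constant + one regressor
beta_true = np.array([0.2, 0.8])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)   # probit data-generating process

def neg_loglik(beta, y, X):
    p = stats.norm.cdf(X @ beta)                             # Pr[Y=1] = Phi(X'beta)
    p = np.clip(p, 1e-10, 1 - 1e-10)                         # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = optimize.minimize(neg_loglik, x0=np.zeros(2), args=(y, X)).x

# Marginal effects at the mean: d Pr[Y=1]/dX = phi(x_bar' beta_hat) * beta_hat
x_bar = X.mean(axis=0)
marginal_effects = stats.norm.pdf(x_bar @ beta_hat) * beta_hat
print(beta_hat, marginal_effects)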
Next Time

Time Series Processes
  AR
  MA
  ARMA

Model Selection
  Return to MLE
  Various Criteria for Model Choice