Transcript Lecture 7

Computational statistics 2009
The basic idea
Assume a particular model with unknown
parameters.
Determine how the likelihood of a given event varies with the model parameters.
Choose the parameter values that maximize the
likelihood of the observed event
A general mathematical formulation
Consider a sample (X1, ..., Xn) drawn from a probability distribution P(X|θ), where θ are the parameters.
If the Xs are independent, with probability density function P(Xi|θ), then the joint probability of the whole set is

P(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^{n} P(X_i \mid \theta)

Find the parameters that maximize this function
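As a minimal sketch (an illustrative assumption, not part of the lecture), the joint likelihood of an i.i.d. sample can be evaluated as a sum of log-densities and maximized over a grid of parameter values; the exponential distribution and all variable names below are assumed for illustration.

import numpy as np

# Illustrative example: log-likelihood of an i.i.d. exponential sample,
# log L(theta) = sum_i log p(x_i | theta) with p(x | theta) = theta * exp(-theta * x)
def log_likelihood(theta, x):
    return np.sum(np.log(theta) - theta * x)

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100)          # true theta = 1/scale = 0.5

# Evaluate the log-likelihood on a grid and pick the maximizing theta
grid = np.linspace(0.01, 2.0, 1000)
loglik = np.array([log_likelihood(t, x) for t in grid])
theta_hat = grid[np.argmax(loglik)]
print(theta_hat, 1.0 / x.mean())                  # grid MLE vs the analytic MLE 1/xbar
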
The likelihood function for the general non-linear model
Assume that
Y = f(X, \theta) + e, \qquad e \sim N(0, \Sigma)

Then the likelihood function is

L(\theta, \Sigma) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2} \exp\!\left[ -\tfrac{1}{2}\,\big(Y - f(X,\theta)\big)'\,\Sigma^{-1}\,\big(Y - f(X,\theta)\big) \right]

Note that the ML estimator of θ is identical to the least-squares estimator if Σ = σ²I, where I is the identity matrix.
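A minimal sketch under assumed details (the model f(x, θ) = exp(θx), the simulated data, and the use of scipy are all illustrative choices, not the lecture's): with Σ = σ²I the Gaussian log-likelihood depends on θ only through the residual sum of squares, so maximizing it gives the same θ̂ as nonlinear least squares.

import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative nonlinear model (an assumption): f(x, theta) = exp(theta * x)
def f(x, theta):
    return np.exp(theta * x)

def neg_log_likelihood(theta, x, y, sigma2=1.0):
    # Gaussian log-likelihood with Sigma = sigma^2 * I:
    # log L = -n/2 * log(2*pi*sigma2) - RSS / (2*sigma2)
    resid = y - f(x, theta)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

def rss(theta, x, y):
    resid = y - f(x, theta)
    return resid @ resid

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = f(x, 0.7) + rng.normal(scale=0.1, size=x.size)

ml = minimize_scalar(neg_log_likelihood, bounds=(0, 2), args=(x, y), method="bounded")
ls = minimize_scalar(rss, bounds=(0, 2), args=(x, y), method="bounded")
print(ml.x, ls.x)   # the two estimates of theta coincide (up to numerical tolerance)
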
Large sample properties of ML estimators
Consistency: As the sample size increases, the ML estimator converges to the true parameter value.
Invariance: If f(θ) is a function of the unknown parameters of the distribution, then the ML estimator of f(θ) is f(θ̂).
Asymptotic normality: As the sample size increases, the sampling distribution of an ML estimator converges to a normal distribution.
Variance: For large sample sizes, the variance of an ML estimator (assuming a single unknown parameter) is approximately the negative of the reciprocal of the second derivative of the log-likelihood function evaluated at the ML estimate:
1
2
  L( | x)


Var (ˆ)  
|
2
 ˆ 
 

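As an illustrative check (an assumed example, not lecture material), this variance approximation can be computed with a finite-difference second derivative of the log-likelihood at the ML estimate; the exponential sample below is an assumption.

import numpy as np

# Log-likelihood of an i.i.d. exponential sample (illustrative choice)
def log_lik(theta, x):
    return np.sum(np.log(theta) - theta * x)

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=500)
theta_hat = 1.0 / x.mean()                      # analytic MLE

# Finite-difference second derivative of log L at the MLE
h = 1e-4
d2 = (log_lik(theta_hat + h, x) - 2 * log_lik(theta_hat, x)
      + log_lik(theta_hat - h, x)) / h**2

var_approx = -1.0 / d2                          # minus the reciprocal of the second derivative
print(var_approx, theta_hat**2 / len(x))        # compare with the asymptotic variance theta^2 / n
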
The information matrix (Hessian)
The matrix

I(\theta) = E\left[ -\frac{\partial^2 \log L(\theta)}{\partial \theta \, \partial \theta'} \right]

is a measure of how 'pointy' (peaked) the likelihood function is.
The variance of the ML estimator is given by the inverse of this matrix:

\mathrm{Var}(\hat{\theta}_{ML}) = [I(\theta)]^{-1}
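A minimal multi-parameter sketch (an assumed example): the information matrix can be approximated by the negative of a finite-difference Hessian of the log-likelihood at the ML estimate, and its inverse then gives approximate variances; here for the mean and variance of a normal sample.

import numpy as np

# Log-likelihood of i.i.d. N(mu, v) data; params = (mu, v)   (illustrative example)
def log_lik(params, x):
    mu, v = params
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)

def num_hessian(fun, p, x, h=1e-4):
    # Central finite-difference Hessian of fun at p
    k = len(p)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (fun(pp, x) - fun(pm, x) - fun(mp, x) + fun(mm, x)) / (4 * h**2)
    return H

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=1000)
mle = np.array([x.mean(), x.var()])             # analytic MLEs of (mu, v)

info = -num_hessian(log_lik, mle, x)            # observed information matrix
print(np.linalg.inv(info))                      # approximate Var(mu_hat), Var(v_hat) on the diagonal
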
The Cramér-Rao lower bound
The Cramér-Rao lower bound is the smallest variance that can theoretically be achieved by an (unbiased) estimator of θ.
ML achieves this bound, so any other estimation technique can at best only equal it.
If θ* is another estimator of θ, then

\mathrm{Var}(\theta^{*}) \geq [I(\theta)]^{-1}
Do we need estimators other than ML estimators?
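For concreteness, a standard worked example (not taken from these slides): for an i.i.d. N(μ, σ²) sample with σ² known,

\log L(\mu \mid x) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
\qquad\Rightarrow\qquad
I(\mu) = E\!\left[ -\frac{\partial^2 \log L}{\partial \mu^2} \right] = \frac{n}{\sigma^2}

so the bound is Var(μ̂) ≥ σ²/n, which the sample mean (the ML estimator of μ) attains exactly.
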
ML estimators for dynamic models
A general decomposition technique for the log likelihood function
allows us to extend standard ML procedures to dynamic models
(time series models).
From the basic definition of conditional probability
\Pr(A, B) = \Pr(A \mid B)\,\Pr(B)
This may be applied directly to the likelihood function
Prediction error decomposition
Consider the decomposition
\log L(Y_1, Y_2, \ldots, Y_{T-1}, Y_T) = \log L(Y_T \mid Y_1, Y_2, \ldots, Y_{T-1}) + \log L(Y_1, Y_2, \ldots, Y_{T-1})

The first term is the conditional probability of Y_T given all past values.
We can then decompose the second term in the same way, and so on, to give

\log L(Y_1, \ldots, Y_T) = \sum_{i=0}^{T-2} \log L(Y_{T-i} \mid Y_1, \ldots, Y_{T-i-1}) + \log L(Y_1)

that is, a series of one-step-ahead prediction errors conditional on actual lagged values of Y.
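A minimal sketch (an assumed example, not from the slides): for a Gaussian AR(1) model Y_t = φY_{t-1} + ε_t, the conditional terms in the decomposition are one-step-ahead prediction densities, so the log-likelihood (here conditional on the first observation, for simplicity) can be accumulated term by term.

import numpy as np

def ar1_conditional_loglik(phi, sigma2, y):
    # Prediction error decomposition for a Gaussian AR(1), conditioning on y[0]:
    # log L = sum_t log p(y_t | y_{t-1}), with one-step-ahead prediction phi * y_{t-1}
    pred_err = y[1:] - phi * y[:-1]
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * pred_err**2 / sigma2)

rng = np.random.default_rng(4)
T, phi_true = 500, 0.8
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Profile the log-likelihood over phi on a grid (sigma2 fixed at 1 for simplicity)
grid = np.linspace(-0.99, 0.99, 399)
phi_hat = grid[np.argmax([ar1_conditional_loglik(p, 1.0, y) for p in grid])]
print(phi_hat)
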
Numerical optimization
In simple cases (e.g. OLS) we can calculate the maximum likelihood
estimates analytically.
But in many cases we cannot; we then resort to numerical optimisation of the likelihood function.
This amounts to hill climbing in parameter space.
1. Set an arbitrary initial set of parameters.
2. Determine a direction of movement.
3. Determine a step length to move.
4. Examine some termination criteria and either stop or go back to 2.
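A minimal sketch of this loop (an illustrative assumption, not the lecture's code): simple gradient ascent on a log-likelihood with a fixed step length and a tolerance-based stopping rule.

import numpy as np

def hill_climb(log_lik, grad, theta0, step=1e-3, tol=1e-8, max_iter=10_000):
    # Steps 1-4: initial parameters, direction from the gradient,
    # fixed step length, stop when the improvement is negligible.
    theta = theta0
    for _ in range(max_iter):
        direction = grad(theta)                               # step 2: direction of movement
        new_theta = theta + step * direction                  # step 3: move
        if abs(log_lik(new_theta) - log_lik(theta)) < tol:    # step 4: termination criterion
            return new_theta
        theta = new_theta
    return theta

# Example: MLE of the mean of N(mu, 1) data (illustrative)
rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, size=200)
log_lik = lambda mu: -0.5 * np.sum((x - mu) ** 2)
grad = lambda mu: np.sum(x - mu)
print(hill_climb(log_lik, grad, theta0=0.0), x.mean())
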
[Figure: sketch of the likelihood function L(θ) plotted against θ, with parameter values θ1 and θ2 and the maximizing value θ* marked on the axis]
Gradient methods for determining the maximum of a function
These methods base the direction of movement on the first
derivatives of the likelihood function with respect to the parameters.
Often the step length is also determined by (an approximation to) the
second derivatives. So
\theta_{i+1} = \theta_i + \left[ -\frac{\partial^2 L}{\partial \theta \, \partial \theta'} \right]^{-1} \frac{\partial L}{\partial \theta}

The class of gradient methods includes Newton, quasi-Newton, steepest descent, etc.
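A minimal Newton-type sketch (illustrative, not the lecture's code): for a single parameter the update uses the first and second derivatives of the log-likelihood; here for the rate of an exponential sample, where both derivatives are available in closed form.

import numpy as np

def newton_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    # Newton-Raphson: theta_{i+1} = theta_i - hessian(theta_i)^{-1} * score(theta_i)
    theta = theta0
    for _ in range(max_iter):
        step = -score(theta) / hessian(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Exponential(theta) sample: log L = n*log(theta) - theta*sum(x)   (illustrative)
rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=300)
n, s = len(x), x.sum()
score = lambda t: n / t - s            # d log L / d theta
hessian = lambda t: -n / t**2          # d^2 log L / d theta^2

theta0 = 0.5 / x.mean()                # rough starting value below the MLE
print(newton_mle(score, hessian, theta0), n / s)   # Newton estimate vs analytic MLE
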
Qualitative response models
Assume that we have a quantitative model
Y_t = \beta X_t + u_t
but we only observe certain limited information, e.g.
z_t = 1 if Y_t > 0
z_t = 0 if Y_t ≤ 0
Then we can group the data into two groups and form a likelihood
function with the following form
L = \prod_{z=0} F(-\beta X_t) \, \prod_{z=1} \left[ 1 - F(-\beta X_t) \right]

where F is the cumulative distribution function of the error term u_t.
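A minimal sketch under assumed details (a probit link, simulated data, and scipy for the normal CDF and the optimizer are all illustrative choices): the grouped likelihood above becomes a sum of log-probabilities over the z = 0 and z = 1 observations.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def neg_log_lik(beta, x, z):
    # L = prod_{z=0} F(-beta*x_t) * prod_{z=1} [1 - F(-beta*x_t)], F = standard normal CDF
    p0 = norm.cdf(-beta * x)                     # P(z_t = 0)
    return -(np.sum(np.log(p0[z == 0])) + np.sum(np.log(1.0 - p0[z == 1])))

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y_star = 1.5 * x + rng.normal(size=500)          # latent variable with true beta = 1.5
z = (y_star > 0).astype(int)                     # observed binary response

res = minimize_scalar(neg_log_lik, bounds=(-5, 5), args=(x, z), method="bounded")
print(res.x)                                     # ML estimate of beta
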