Transcript Slide 1

Estimating parameters from data
Gil McVean, Department of Statistics
Thursday 13th February 2009
1
Questions to ask…
• How can I estimate model parameters from data?
• What should I worry about when choosing between estimators?
• Is there some optimal way of estimating parameters from data?
• How can I compare different parameter values?
• How should I make statements about certainty regarding estimates and hypotheses?
2
Motivating example I
• I conduct an experiment where I measure the weight of 100 mice that were exposed to a normal diet and 50 mice exposed to a high-energy diet
• I want to estimate the expected gain in weight due to the change in diet
[Figure: weight distributions for the normal-diet and high-calorie-diet groups]
3
Motivating example II
• I observe the co-segregation of two traits (e.g. a visible trait and a genetic marker) in a cross
• I want to estimate the recombination rate between the two markers

Bateson and Punnett experiment

Phenotype and genotype | Observed | Expected from 9:3:3:1 ratio
Purple, long (P_L_)    | 284      | 216
Purple, round (P_ll)   | 21       | 72
Red, long (ppL_)       | 21       | 72
Red, round (ppll)      | 55       | 24
4
Parameter estimation
• We can formulate most questions in statistics in terms of making statements about underlying parameters
• We want to devise a framework for estimating those parameters and making statements about our certainty
• In this lecture we will look at several different approaches to making such statements
– Moment estimators
– Likelihood
– Bayesian estimation
5
Moment estimation
• You have already come across one way of estimating parameter values – moment methods
• In such techniques parameter values are found that match sample moments (mean, variance, etc.) to those expected
• E.g. for random variables X1, X2, … sampled from a N(μ, σ²) distribution (see the sketch below):

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2$

$E(\bar{X}) = \mu, \qquad E(s^2) = \frac{n-1}{n}\sigma^2$

$\hat{\mu} = \bar{X}, \qquad \hat{\sigma}^2 = \frac{n}{n-1}\, s^2$
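As a quick illustration (not part of the original slides), here is a minimal Python sketch of these moment calculations; it assumes numpy is available and uses simulated data purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # simulated N(mu = 5, sigma^2 = 4) sample

n = len(x)
xbar = x.mean()                      # sample mean; E(Xbar) = mu
s2 = (x ** 2).mean() - xbar ** 2     # moment estimator; E(s^2) = (n - 1)/n * sigma^2
sigma2_hat = n / (n - 1) * s2        # bias-corrected estimate of sigma^2

print(f"mu_hat = {xbar:.3f}, s^2 = {s2:.3f}, sigma^2_hat = {sigma2_hat:.3f}")
```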
6
Example: fitting a gamma distribution
• The gamma distribution is parameterised by a shape parameter, α, and a rate (inverse scale) parameter, β:

$f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$

• The mean of the distribution is α/β and the variance is α/β²
• We can fit a gamma distribution by looking at the first two sample moments (see the sketch below):

$\hat{\beta} = \frac{\bar{X}}{\frac{1}{n-1}\sum_i (X_i - \bar{X})^2}, \qquad \hat{\alpha} = \bar{X}\hat{\beta}$

[Figure: alkaline phosphatase measurements in 2019 mice, with fitted values α̂ = 4.03, β̂ = 0.14]
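A minimal sketch of this moment fit in Python, assuming numpy; the data are simulated here rather than the alkaline phosphatase measurements shown on the slide, and the simulation parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
# note: numpy parameterises the gamma by scale = 1/beta
x = rng.gamma(shape=4.0, scale=1.0 / 0.14, size=2019)   # simulated stand-in data

xbar = x.mean()
var_hat = x.var(ddof=1)          # (1/(n-1)) * sum (X_i - Xbar)^2

beta_hat = xbar / var_hat        # from mean = alpha/beta and variance = alpha/beta^2
alpha_hat = xbar * beta_hat

print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.3f}")
```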
7
Bias
• Although the moment method looks sensible, it can lead to biased estimators
• In the previous example, estimates of both parameters are upwardly biased
• Bias is measured by the difference between the expected value of the estimator and the truth:

$\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$

• However, bias is not the only thing to worry about
– For example, the value of the first observation is an unbiased estimator of the mean of a Normal distribution; however, it is a rubbish estimator
• We also need to worry about the variance of an estimator (see the sketch below)
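A small simulation sketch (assuming numpy; the sample size and parameter values are arbitrary choices) illustrating the point about the first observation: both estimators of the mean are unbiased, but their variances differ enormously.

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu, sigma, reps = 20, 10.0, 3.0, 50_000
samples = rng.normal(mu, sigma, size=(reps, n))

first_obs = samples[:, 0]            # X_1 used as an estimator of mu
sample_mean = samples.mean(axis=1)   # Xbar used as an estimator of mu

print("bias:    ", first_obs.mean() - mu, sample_mean.mean() - mu)  # both roughly zero
print("variance:", first_obs.var(), sample_mean.var())              # sigma^2 vs sigma^2 / n
```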
8
Example: estimating the population mutation rate
• In population genetics, a parameter of interest is the population-scaled mutation rate
• There are two common estimators for this parameter
– The average number of differences between two sequences
– The total number of polymorphic sites in the sample divided by a constant that is approximately the log of the sample size
• Which is better?
• The first estimator has larger variance than the second – suggesting that it is an inferior estimator
• It is actually worse than this – it is not even guaranteed to converge on the truth as the sample size gets infinitely large
– That is, it lacks a property called consistency
9
The bias-variance trade off
• Some estimators may be biased
• Some estimators may have large variance
• Which is better?
• A simple way of combining both metrics is to consider the mean-squared error of an estimator $\hat{\theta}$:

$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \left[E(\hat{\theta} - \theta)\right]^2$
10
Example
• Consider two ways of estimating the variance of a Normal distribution from the sample variance:

$s^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad \hat{\sigma}^2_A = s^2, \qquad \hat{\sigma}^2_B = \frac{n}{n-1}\, s^2$

• The second estimator is unbiased, but the first estimator has lower MSE
• Actually, there is a third estimator, which is even more biased than the first, but which has even lower MSE (see the sketch below):

$\hat{\sigma}^2_C = \frac{n}{n+1}\, s^2$
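A simulation sketch (assuming numpy; the sample size and true variance are arbitrary choices) comparing the bias and MSE of the three estimators:

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma2, reps = 10, 4.0, 100_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1)                              # divisor n: estimator A (the mle)
estimators = {"A: 1/n": s2,
              "B: 1/(n-1)": n / (n - 1) * s2,   # unbiased
              "C: 1/(n+1)": n / (n + 1) * s2}   # most biased, lowest MSE

for name, est in estimators.items():
    bias = est.mean() - sigma2
    mse = ((est - sigma2) ** 2).mean()
    print(f"{name}: bias = {bias:+.3f}, MSE = {mse:.3f}")
```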
11
Least squares estimation
• A commonly-used approach to fitting models to data is called least squares estimation
• This attempts to minimise the sum of the squares of the residuals
– A residual is the difference between an observed and a fitted value
• An important point to remember is that minimising the sum of squared residuals is not the only thing to worry about when fitting a model (see the sketch below)
– Over-fitting
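A minimal least-squares sketch in Python, assuming numpy; the straight-line model and the simulated data are illustrative only, not an example from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)   # simulated straight-line data

# Least squares for y = a + b*x: choose (a, b) to minimise the sum of squared residuals
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ coef
print("intercept and slope:", coef)
print("sum of squared residuals:", (residuals ** 2).sum())
```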
12
Problems with moment estimation
• It is not always possible to exactly match sample moments with their expectations
• It is not clear, when using moment methods, how much of the information in the data about the parameters is being used
– Often not much…
• Why should MSE be the best way of measuring the value of an estimator?
13
Is there an optimal way to estimate parameters?
• For any model, the maximum information about model parameters is obtained by considering the likelihood function
• The likelihood function is proportional to the probability of observing the data given a specified parameter value
• One natural choice for point estimation of parameters is the maximum likelihood estimate: the parameter value(s) that maximise the probability of observing the data
• The maximum likelihood estimate (mle) has some useful properties (though it is not always optimal in every sense)
14
An intuitive view on likelihood
[Figure: Normal probability density curves with μ = 2, σ² = 1; μ = 0, σ² = 1; and μ = 0, σ² = 4]
15
An example
• Suppose we have data generated from a Poisson distribution and we want to estimate the parameter of the distribution
• The probability of observing a particular value is

$P(X; \mu) = \frac{e^{-\mu} \mu^{X}}{X!}$

• If we have observed a series of iid Poisson RVs, we obtain the joint likelihood by multiplying the individual probabilities together (see the sketch below):

$P(X_1, X_2, \ldots, X_n; \mu) = \frac{e^{-\mu}\mu^{X_1}}{X_1!} \cdot \frac{e^{-\mu}\mu^{X_2}}{X_2!} \cdots \frac{e^{-\mu}\mu^{X_n}}{X_n!}$

$L(\mu; \mathbf{X}) \propto \prod_i e^{-\mu}\mu^{X_i} = e^{-n\mu}\mu^{n\bar{X}}$
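A small sketch (assuming numpy and scipy) checking that the product of Poisson probabilities and the simplified form e^(-nμ) μ^(nX̄) differ only by a constant factor, the product of the factorials, so they carry the same information about μ:

```python
import numpy as np
from scipy.stats import poisson

x = np.array([12, 22, 14, 8])          # counts used as an example later in the slides
n, xbar = len(x), x.mean()

for mu in (10.0, 14.0, 18.0):
    full = poisson.pmf(x, mu).prod()                # product of the individual probabilities
    kernel = np.exp(-n * mu) * mu ** (n * xbar)     # e^{-n mu} * mu^{n xbar}, factorials dropped
    print(mu, full / kernel)                        # same ratio for every mu: 1 / prod(x_i!)
```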
16
Comments
• Note that in the likelihood function the factorials have disappeared. This is because they contribute a constant that does not influence the relative likelihood of different values of the parameter
• It is usual to work with the log likelihood rather than the likelihood. Note that maximising the log likelihood is equivalent to maximising the likelihood
• We can find the mle of the parameter analytically: take the natural log of the likelihood function, then find where the derivative of the log likelihood is zero (a numerical check is sketched below)

$L(\mu; \mathbf{X}) = e^{-n\mu}\mu^{n\bar{X}}$

$\ell(\mu; \mathbf{X}) = -n\mu + n\bar{X}\log\mu$

$\frac{d\ell}{d\mu} = -n + \frac{n\bar{X}}{\mu} = 0 \quad\Rightarrow\quad \hat{\mu} = \bar{X}$

Note that here the mle is the same as the moment estimator.
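A sketch (assuming numpy and scipy) that confirms numerically that the log likelihood is maximised at the sample mean; the counts are arbitrary illustrative data.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([3, 7, 4, 6, 5, 2, 8])            # illustrative Poisson counts

def neg_loglik(mu):
    return -poisson.logpmf(x, mu).sum()        # minimise the negative log likelihood

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print("numerical mle:", res.x, "   sample mean:", x.mean())   # the two agree
```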
17
Sufficient statistics
• In this example we could write the likelihood as a function of a simple summary of the data – the mean
• This is an example of a sufficient statistic. These are statistics that contain all the information about the parameter(s) under the specified model
• For example, suppose we have a series of iid Normal RVs (a sketch checking this is given below):

$P(X_1, X_2, \ldots, X_n; \mu, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(X_i - \mu)^2 / 2\sigma^2}$

$L(\mu, \sigma^2; \mathbf{X}) \propto \sigma^{-n} e^{-\frac{1}{2\sigma^2}\sum_i (X_i - \mu)^2}$

$\ell(\mu, \sigma^2; \mathbf{X}) = -n\log\sigma - \frac{n}{2\sigma^2}\left(\overline{X^2} - 2\mu\bar{X} + \mu^2\right)$

where $\overline{X^2}$ is the mean square and $\bar{X}$ is the mean of the observations.
18
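A sketch (assuming numpy) checking the sufficiency claim above: the log likelihood computed from the raw observations and the version computed only from the mean and mean square give identical values (both drop the same constant in 2π).

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(1.0, 2.0, size=100)
n, xbar, meansq = len(x), x.mean(), (x ** 2).mean()

def loglik_raw(mu, sigma):
    # log likelihood from the raw data (dropping the constant in 2*pi)
    return -n * np.log(sigma) - ((x - mu) ** 2).sum() / (2 * sigma ** 2)

def loglik_sufficient(mu, sigma):
    # the same quantity computed only from the mean and the mean square
    return -n * np.log(sigma) - n * (meansq - 2 * mu * xbar + mu ** 2) / (2 * sigma ** 2)

for mu, sigma in [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]:
    print(loglik_raw(mu, sigma), loglik_sufficient(mu, sigma))   # identical values
```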
Properties of the maximum likelihood estimate
• The maximum likelihood estimate can be found either analytically or by numerical maximisation
• The mle is consistent, in that it converges to the truth as the sample size gets infinitely large
• The mle is asymptotically efficient, in that it achieves the minimum possible variance (the Cramér-Rao lower bound) as n→∞
• However, the mle is often biased for finite sample sizes
– For example, the mle for the variance parameter of a normal distribution is the sample variance (with divisor n), which is downwardly biased
19
Comparing parameter estimates
• Obtaining a point estimate of a parameter is just one problem in statistical inference
• We might also like to ask how good different parameter values are
• One way of comparing parameters is through relative likelihood
• For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process
• The maximum likelihood estimate is 14. The relative likelihood is given by (see the sketch below)

$\frac{L(\mu; \mathbf{X})}{L(\hat{\mu}; \mathbf{X})} = e^{-n(\mu - \hat{\mu})}\left(\frac{\mu}{\hat{\mu}}\right)^{n\bar{X}}$
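A sketch (assuming numpy) that evaluates this relative likelihood for the observed counts; plotting it over a grid of μ would reproduce the surfaces on the next slide.

```python
import numpy as np

x = np.array([12, 22, 14, 8])
n, xbar = len(x), x.mean()
mu_hat = xbar                                    # mle = 14

def relative_likelihood(mu):
    return np.exp(-n * (mu - mu_hat)) * (mu / mu_hat) ** (n * xbar)

for mu in (10.0, 12.0, 14.0, 16.0, 18.0):
    print(mu, relative_likelihood(mu))           # equals 1 at mu = 14, smaller elsewhere
```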
20
Using relative likelihood
• The relative likelihood and log-likelihood surfaces are shown below
[Figure: relative likelihood and relative log-likelihood curves as a function of μ for the Poisson example]
21
Interval estimation
• In most cases the chance that the point estimate you obtain for a parameter is actually the correct one is zero
• We can generalise the idea of point estimation to interval estimation
• Here, rather than estimating a single value of a parameter, we estimate a region of parameter space
– We make the inference that the parameter of interest lies within the defined region
• The coverage of an interval estimator is the fraction of times the parameter actually lies within the interval
• The idea of interval estimation is intimately linked to the notion of confidence intervals
22
Example
• Suppose I'm interested in estimating the mean of a normal distribution with known variance of 1 from a sample of 10 observations
• I construct an interval estimator

$\hat{\mu} = \left(\bar{X} - a,\ \bar{X} + a\right)$

• The chart below shows how the coverage properties of this estimator vary with a; if I choose a to be 0.62, I would have coverage of 95% (a simulation sketch is given below)
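A simulation sketch (assuming numpy) estimating the coverage of this interval estimator for n = 10 and known σ = 1; a = 0.62 gives roughly 95% coverage because 1.96/√10 ≈ 0.62.

```python
import numpy as np

rng = np.random.default_rng(7)
n, mu, a, reps = 10, 0.0, 0.62, 100_000
xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

covered = (xbar - a <= mu) & (mu <= xbar + a)
print("estimated coverage:", covered.mean())    # close to 0.95, since 1.96/sqrt(10) ~ 0.62
```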
23
Confidence intervals
• It is a short step from here to the notion of confidence intervals
• We find an interval estimator of the parameter that, for any value of the parameter that might be possible, has the desired coverage properties
• We then apply this interval estimator to our observed data to get a confidence interval
• We can guarantee that among repeat performances of the same experiment the true value of the parameter would be in this interval 95% of the time
• We cannot say "There is a 95% chance of the true parameter being in this interval"
24
Example – confidence intervals for normal distribution
• Creating confidence intervals for the mean of a normal distribution is relatively easy because the coverage properties of interval estimators do not depend on the mean (for a fixed variance)
• For example, the interval estimator below has 95% coverage for any mean:

$\hat{\mu} = \left(\bar{X} - a,\ \bar{X} + a\right), \qquad a = 1.96\,\frac{\sigma}{\sqrt{n}}$

• As you'll see later, there is an intimate link between confidence intervals and hypothesis testing
25
Example: confidence intervals for exponential distribution
• For most distributions, the coverage properties of an estimator will depend on the true underlying parameter
• However, we can make use of the CLT to make confidence intervals for means
• For example, for the exponential distribution with different means, the graph shows the coverage properties of the interval estimator below with n = 100 (a simulation sketch is also given below)

$\hat{\mu} = \left(\bar{X} - a,\ \bar{X} + a\right), \qquad a = 1.96\left(\frac{\frac{1}{n-1}\sum_i (X_i - \bar{X})^2}{n}\right)^{1/2}$
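A simulation sketch (assuming numpy) of the coverage of this CLT-based interval for exponential data with n = 100 and one illustrative choice of mean; repeating it over a range of means would reproduce the graph described above.

```python
import numpy as np

rng = np.random.default_rng(8)
n, mean, reps = 100, 2.0, 50_000
x = rng.exponential(scale=mean, size=(reps, n))

xbar = x.mean(axis=1)
a = 1.96 * np.sqrt(x.var(axis=1, ddof=1) / n)   # 1.96 * estimated standard error of the mean

covered = (xbar - a <= mean) & (mean <= xbar + a)
print("estimated coverage:", covered.mean())    # close to, but not exactly, 0.95
```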
26
Confidence intervals and likelihood
• Thanks to the CLT there is another useful result that allows us to define confidence intervals from the log-likelihood surface
• Specifically, the set of parameter values for which the log-likelihood is no more than 1.92 less than its maximum defines an approximate 95% confidence interval (a sketch is given below)
– 1.92 is half of 3.84, the 95% point of the chi-squared distribution with one degree of freedom; in the limit of large sample size the likelihood-ratio test statistic is approximately chi-squared distributed under the null
• This is a very useful result, but it shouldn't be assumed to hold
– i.e. check with simulation
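A sketch (assuming numpy and scipy) of this rule for the Poisson counts used earlier: scan a grid of μ values and keep those whose log likelihood lies within 1.92 of the maximum.

```python
import numpy as np
from scipy.stats import poisson

x = np.array([12, 22, 14, 8])                    # the Poisson counts from earlier slides
mu_grid = np.linspace(5.0, 30.0, 2001)
loglik = np.array([poisson.logpmf(x, mu).sum() for mu in mu_grid])

keep = loglik >= loglik.max() - 1.92             # 1.92 = 3.84 / 2 (chi-squared_1, 95% point)
print("approximate 95% CI for mu: (%.2f, %.2f)" % (mu_grid[keep].min(), mu_grid[keep].max()))
```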
27
Bayesian estimators
• As you may have noticed, the notion of a confidence interval is very hard to grasp and has remarkably little connection to the data that you have collected
• It seems much more natural to attempt to make statements about which parameter values are likely given the data you have collected
• To put this on a rigorous probabilistic footing, we want to make statements about the probability (density) of any particular parameter value given our data
• We use Bayes theorem (a grid-based sketch is given below):

$P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{P(D)}$

where $P(\theta)$ is the prior, $P(D \mid \theta)$ is the likelihood, $P(\theta \mid D)$ is the posterior and $P(D)$ is the normalising constant.
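A grid-approximation sketch of Bayes theorem (assuming numpy and scipy), applied to a Poisson rate for the counts used earlier; the Exponential prior with mean 10 is an arbitrary illustrative choice, not one from the slides.

```python
import numpy as np
from scipy.stats import expon, poisson

data = np.array([12, 22, 14, 8])                 # the Poisson counts from earlier slides
mu = np.linspace(0.01, 40.0, 4000)               # grid over the parameter

prior = expon.pdf(mu, scale=10.0)                # illustrative prior with mean 10
lik = np.exp([poisson.logpmf(data, m).sum() for m in mu])
post = prior * lik
post /= np.trapz(post, mu)                       # divide by the normalising constant P(D)

print("posterior mean:", np.trapz(mu * post, mu))
```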
28
Bayes estimators
• The single most important conceptual difference between Bayesian statistics and frequentist statistics is the notion that the parameters you are interested in are themselves random variables
• This notion is encapsulated in the use of a subjective prior for your parameters
• Remember that to construct a confidence interval we have to define the set of possible parameter values
• A prior does the same thing, but also gives a weight to different values
29
Example: coin tossing
• I toss a coin twice and observe two heads
• I want to perform inference about the probability of obtaining a head on a single throw for the coin in question
• The point estimate/MLE for the probability is 1.0 – yet I have a very strong prior belief that the answer is 0.5
• Bayesian statistics forces the researcher to be explicit about prior beliefs but, in return, can be very specific about what information has been gained by performing the experiment
30
The posterior
• Bayesian inference about parameters is contained in the posterior distribution
• The posterior can be summarised in various ways, for example by the posterior mean or a credible interval (a sketch for the coin-tossing example is given below)
[Figure: prior and posterior densities, with the posterior mean and a credible interval indicated]
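A conjugate-prior sketch (assuming scipy) for the coin-tossing example: with a Beta(a, b) prior on the head probability and two heads in two tosses, the posterior is Beta(a + 2, b); the Beta(10, 10) prior is an illustrative choice representing a strong belief that the coin is roughly fair, not a prior specified in the slides.

```python
from scipy.stats import beta

a, b = 10.0, 10.0                  # illustrative Beta prior: strong belief the coin is near-fair
heads, tails = 2, 0                # the data: two heads in two tosses

posterior = beta(a + heads, b + tails)           # conjugate update of the Beta prior

print("posterior mean:", posterior.mean())               # pulled only slightly above 0.5
print("95% credible interval:", posterior.interval(0.95))
```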
31
Bayesian inference and the notion of shrinkage
• The notion of shrinkage is that you can obtain better estimates by assuming a certain degree of similarity among the things you want to estimate
• Practically, this means two things
– Borrowing information across observations
– Penalising inferences that are very different from anything else
• The notion of shrinkage is implicit in the use of priors in Bayesian statistics
• There are also forms of frequentist inference where shrinkage is used
– But NOT MLE
32