Transcript Chap 8

Chap 8: Estimation of parameters
& Fitting of Probability
Distributions
Section 8.1: INTRODUCTION
The values of unknown parameters must be estimated before probability laws can be fitted to data.
Section 8.2: Fitting the Poisson
Distribution to Emissions of Alpha
Particles (classical example)
Recall: The Probability Mass Function of a Poisson random
variable X is given by:
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots
From the observed data, we must estimate a value \hat{\lambda} for
the parameter \lambda = E(X) = V(X).
What if the experiment is repeated?
The estimate of \lambda will be viewed as a random
variable \hat{\lambda} which has a probability dist’n
referred to as its sampling distribution.
The spread of the sampling distribution reflects
the variability of the estimate.
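To make the idea of a sampling distribution concrete, here is a minimal simulation sketch (a hedged illustration in Python with numpy, assuming a hypothetical true rate \lambda = 0.8 and sample size n = 100, not the actual emission data): the experiment is repeated many times and \hat{\lambda} (the sample mean) is recomputed each time.

```python
import numpy as np

rng = np.random.default_rng(0)
true_lam, n, reps = 0.8, 100, 10_000   # hypothetical values, for illustration only

# Repeat the "experiment" many times; each run yields one estimate lambda-hat
lam_hats = np.array([rng.poisson(true_lam, size=n).mean() for _ in range(reps)])

# The spread of these estimates is the sampling distribution of lambda-hat
print("mean of lambda-hat estimates:", lam_hats.mean())
print("spread (standard error) of lambda-hat:", lam_hats.std())
```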
Chap 8 is about fitting the model to data.
Chap 9 deals with testing such a fit.
Assessing Goodness of Fit (GOF):
Example: Fit a Poisson dist’n to counts (p. 240).
Informally, GOF is assessed by comparing the
Observed (O) and the Expected (E) counts, which
are grouped (at least 5 expected in each) into the 16 cells.
Formally, use a measure of discrepancy such as
Pearson’s chi-square statistic
X^2 = \sum_{\text{all cells}} \frac{(O_i - E_i)^2}{E_i}
to quantify the comparison of the O and E counts.
In this example, X^2 = 8.99.
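As a hedged illustration of how the statistic is computed (the counts below are hypothetical, not the actual data on p. 240), in Python with numpy:

```python
import numpy as np

# Hypothetical observed and expected counts per cell (illustration only)
observed = np.array([18, 28, 56, 105, 126, 146, 164, 161])
expected = np.array([12.2, 27.0, 56.5, 94.3, 118.1, 146.0, 170.8, 160.0])

chi_sq = np.sum((observed - expected) ** 2 / expected)   # Pearson's X^2
print("X^2 =", chi_sq)
```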
Null dist’n:
X^2 is a random variable (as a function of the random
counts) whose probability dist’n is called its null
distribution. It can be shown that the null
dist’n of X^2 is approximately the chi-square
dist’n with degrees of freedom df = no. of cells - no. of
independent parameters fitted - 1.
Here, df = 16 (cells) - 1 (parameter \lambda) - 1 = 14.
Notation:
X^2 \sim \chi^2_{df}, \quad \text{where } df = \#\text{cells} - \#\text{indep. params fitted} - 1
The larger the value of X^2, the worse the fit.
p-value:
Figure 8.1 on page 242 gives a nice sense of what
a p-value is. The p-value measures the
degree of evidence against the statement that the model
fits the data well (i.e., that the Poisson is the true model).
The smaller the p-value, the worse the fit, i.e., the
more evidence against the model.
A small p-value therefore means rejecting the null, i.e.,
concluding that the model does NOT fit the data well.
How small is small?
Reject when the p-value \le \alpha,
where \alpha is the significance level.
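A sketch of the p-value calculation for this example, assuming scipy is available (using X^2 = 8.99 and df = 14 from above, and \alpha = 0.05 as an illustrative significance level):

```python
from scipy.stats import chi2

x2, df = 8.99, 14
p_value = chi2.sf(x2, df)      # P(chi^2_14 >= 8.99), the upper-tail probability
print("p-value =", round(p_value, 3))

alpha = 0.05                   # illustrative significance level
print("reject the Poisson fit?", p_value <= alpha)
```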
8.3: Parameter Estimation:
MOM & MLE
Let the observed data be a random sample, i.e., a
sequence of I.I.D. random variables X_1, X_2, \ldots, X_n
whose joint distribution depends on an unknown
parameter \theta (scalar or vector).
An estimate \hat{\theta} of \theta will be a random variable
(a function of X_1, X_2, \ldots, X_n) whose dist’n is
known as its sampling dist’n.
The standard deviation of the sampling dist’n is
termed its standard error.
8.4: The Method of Moments
Definition: the k-th (pop’n) moment of a random variable X
is denoted by \mu_k = E(X^k), and its k-th (sample) moment is
\hat{\mu}_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k
which is viewed as an estimate of \mu_k = E(X^k).
Algorithm: MOM estimates parameter(s) by
finding expressions for them in terms of
the lowest possible (pop’n) moments and
then substituting (sample) moments into
the expressions.
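As a hedged sketch of the MOM algorithm (using a gamma(\alpha, \lambda) model as an illustrative choice, with simulated data): since \mu_1 = \alpha/\lambda and \mu_2 - \mu_1^2 = \alpha/\lambda^2, solving gives \hat{\lambda} = \hat{\mu}_1 / \hat{\sigma}^2 and \hat{\alpha} = \hat{\mu}_1^2 / \hat{\sigma}^2, where \hat{\sigma}^2 = \hat{\mu}_2 - \hat{\mu}_1^2.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1 / 1.5, size=500)   # simulated sample (illustration)

# Sample moments
m1 = x.mean()             # first sample moment
m2 = np.mean(x ** 2)      # second sample moment
s2 = m2 - m1 ** 2         # variance expressed through the first two moments

# MOM: express alpha and lambda in terms of the lowest moments, then plug in
alpha_hat = m1 ** 2 / s2
lambda_hat = m1 / s2
print("MOM estimates: alpha =", alpha_hat, ", lambda =", lambda_hat)
```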
8.5: The Method
of Maximum Likelihood
Algorithm: Let X_1, X_2, \ldots, X_n be a sequence of I.I.D.
random variables.
• The likelihood function is
lik(\theta) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \quad \text{(if IID)}
• The MLE \hat{\theta}_{mle} of \theta is the value of \theta that maximizes
the likelihood function, or equivalently its natural
logarithm (since the logarithm is a monotonic function).
• The log-likelihood function
l(\theta) = \log lik(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)
is then to be maximized to get the MLE.
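A minimal numerical sketch of the MLE algorithm, assuming scipy is available: the negative log-likelihood of a Poisson sample is minimized over \lambda and compared with the closed-form MLE \hat{\lambda} = \bar{X} (the data are simulated for illustration).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(2)
x = rng.poisson(0.8, size=200)            # simulated Poisson sample (illustration)

def neg_log_lik(lam):
    # l(lambda) = sum_i log f(x_i | lambda); we minimize its negative
    return -np.sum(poisson.logpmf(x, lam))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
print("numerical MLE:", res.x)
print("closed-form MLE (sample mean):", x.mean())
```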
8.5.1: MLEs
of Multinomial Cell Probabilities
Suppose that X_1, X_2, \ldots, X_m, the counts in cells 1, 2, \ldots, m,
follow a multinomial distribution with total count n
and cell probabilities p_1, p_2, \ldots, p_m with \sum_{i=1}^{m} p_i = 1.
Caution: the marginal dist’n of each X_i is binomial(n, p_i),
BUT the X_i are not INDEPENDENT, i.e., their joint
PMF is not the product of the marginal PMFs. The
good news is that the MLE method still applies.
Problem: Estimate the p’s from the x’s.
8.5.1a: MLEs of Multinomial Cell
Probabilities (cont’d)
To answer the question, we assume n is given and
we wish to estimate p_1, p_2, \ldots, p_m with \sum_{i=1}^{m} p_i = 1.
From the joint PMF
f(x_1, \ldots, x_m \mid p_1, \ldots, p_m) = \frac{n!}{\prod_{i=1}^{m} x_i!} \prod_{i=1}^{m} p_i^{x_i}
the log-likelihood becomes:
l(p_1, \ldots, p_m) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i
To maximize this log-likelihood subject to the
constraint p_1 + \cdots + p_m = 1, we use a Lagrange
multiplier: maximizing
l(p_1, \ldots, p_m, \lambda) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i + \lambda \left( \sum_{i=1}^{m} p_i - 1 \right)
yields \hat{p}_j = \frac{x_j}{n}.
8.5.1b: MLEs of Multinomial Cell
Probabilities (cont’d)
Deja vu: note that the sampling dist’n of \hat{p}_j = \frac{x_j}{n}
is determined by the binomial dist’n of x_j.
Hardy-Weinberg Equilibrium: GENETICS
Here the multinomial cell probabilities are
functions of another unknown parameter \theta; that is,
p_i = p_i(\theta), \quad l(\theta) = \log n! - \sum_{i=1}^{m} \log x_i! + \sum_{i=1}^{m} x_i \log p_i(\theta)
Read example A on pages 260-261.
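As a hedged sketch of this idea (under the usual Hardy-Weinberg parametrization, where the three genotype cell probabilities are p_1 = (1 - \theta)^2, p_2 = 2\theta(1 - \theta), p_3 = \theta^2), the multinomial log-likelihood can be maximized numerically over \theta; the counts below are hypothetical, not those of example A.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([342, 500, 187])             # hypothetical genotype counts
n = x.sum()

def neg_log_lik(theta):
    p = np.array([(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2])
    return -np.sum(x * np.log(p))         # constants log n! - sum log x_i! omitted

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE of theta:", res.x)
print("closed-form check, (x_2 + 2 x_3) / (2n):", (x[1] + 2 * x[2]) / (2 * n))
```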
8.5.2: Large Sample Theory
for MLEs
Let \hat{\theta}_n be an estimate of a parameter \theta based on X_1, \ldots, X_n.
The variance of the sampling dist’n of many estimators
decreases as the sample size n increases.
An estimate is said to be a consistent estimate of a parameter
if \hat{\theta}_n approaches \theta as the sample size n approaches infinity.
Consistency is a limiting property that does not require any
behavior of the estimator for a finite sample size.
\hat{\theta}_n is consistent in probability if \hat{\theta}_n \xrightarrow{P} \theta
(read: \hat{\theta}_n converges in probability to \theta).
That is, for any \epsilon > 0, \lim_{n \to \infty} P\left( \left| \hat{\theta}_n - \theta \right| > \epsilon \right) = 0.
8.5.2: Large Sample Theory
for MLEs (cont’d)
Theorem: Under appropriate smoothness conditions
on f, the MLE from an I.I.D. sample is consistent,
and the probability dist’n of \sqrt{n I(\theta_0)}\,(\hat{\theta}_{mle} - \theta_0) tends to
N(0, 1). In other words, the large-sample
distribution of the MLE is approximately normal with
mean \theta_0 (that is, the MLE is asymptotically unbiased)
and its asymptotic variance is \frac{1}{n I(\theta_0)},
where the information about the parameter is:
I(\theta) = E\left[ \frac{\partial}{\partial \theta} \log f(X \mid \theta) \right]^2 = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta) \right]
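For instance, a short worked check of the information formula for a single Poisson(\lambda) observation (not part of the original slides):
\log f(X \mid \lambda) = X \log \lambda - \lambda - \log X!, \quad \frac{\partial^2}{\partial \lambda^2} \log f(X \mid \lambda) = -\frac{X}{\lambda^2}
so that
I(\lambda) = -E\left[ -\frac{X}{\lambda^2} \right] = \frac{E(X)}{\lambda^2} = \frac{1}{\lambda}
and the asymptotic variance of \hat{\lambda}_{mle} = \bar{X} is \frac{1}{n I(\lambda)} = \frac{\lambda}{n}.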
8.5.3: Confidence Intervals
for MLEs
Recall that a confidence interval (as seen in
Chap.7) is a random interval containing
the parameter of interest with some
specific probability.
Three (3) methods to get CIs for MLEs are:
• Exact CIs
• Approximate CIs based on the large-sample theory of Section 8.5.2
• Bootstrap CIs
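A minimal sketch of a parametric bootstrap CI, assuming numpy is available and reusing the Poisson example (the data are simulated; the 95% level and B = 1000 resamples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(0.8, size=200)                  # observed sample (simulated here)
lam_hat = x.mean()                              # MLE of lambda

# Parametric bootstrap: resample from Poisson(lam_hat), re-estimate each time
B = 1000
boot = np.array([rng.poisson(lam_hat, size=x.size).mean() for _ in range(B)])

# Percentile-style 95% CI from the bootstrap sampling distribution
lo, hi = np.percentile(boot, [2.5, 97.5])
print("approximate 95% CI for lambda:", (lo, hi))
```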
8.6: Efficiency &
Cramer-Rao Lower Bound
Problem: Given a variety of possible estimates,
the best one to choose should have its
sampling distribution highly concentrated
about the true parameter.
Because of its analytic simplicity, the mean
square error, MSE, will be used as a measure
of such a concentration.
MSE(\hat{\theta}) = E\left[ (\hat{\theta} - \theta_0)^2 \right] = Var(\hat{\theta}) + \left[ Bias(\hat{\theta}) \right]^2
where Bias(\hat{\theta}) = E(\hat{\theta}) - \theta_0
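The decomposition follows by adding and subtracting E(\hat{\theta}) inside the square (a one-line check):
E\left[ (\hat{\theta} - \theta_0)^2 \right] = E\left[ (\hat{\theta} - E\hat{\theta})^2 \right] + 2\,E\left[ \hat{\theta} - E\hat{\theta} \right] (E\hat{\theta} - \theta_0) + (E\hat{\theta} - \theta_0)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2
since the cross term vanishes, E[\hat{\theta} - E\hat{\theta}] = 0.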
8.6: Efficiency &
Cramer-Rao Lower Bound (cont’d)
Unbiasedness means MSE(\hat{\theta}) = Var(\hat{\theta}).
Definition: Given two estimates, \hat{\theta} and \tilde{\theta}, of a
parameter \theta, the efficiency of \hat{\theta} relative to \tilde{\theta} is
defined to be:
eff(\hat{\theta}, \tilde{\theta}) = \frac{Var(\tilde{\theta})}{Var(\hat{\theta})}
Theorem: (Cramer-Rao Inequality)
Under smoothness assumptions on the density f(x \mid \theta)
of the IID sequence X_1, \ldots, X_n, if T = t(X_1, \ldots, X_n)
is an unbiased estimate of \theta, we get the lower bound:
Var(T) \ge \frac{1}{n I(\theta)}
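A quick check of the bound in the Poisson case (a worked example, not part of the original slides): \bar{X} is an unbiased estimate of \lambda with
Var(\bar{X}) = \frac{\lambda}{n} = \frac{1}{n I(\lambda)} \quad \text{since } I(\lambda) = \frac{1}{\lambda},
so the sample mean attains the Cramer-Rao lower bound.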
8.7: Sufficiency
Is there a function T(X_1, \ldots, X_n) containing all the
information in the sample about the parameter \theta?
If so, without loss of information the original data
may be reduced to this statistic T(X_1, \ldots, X_n).
Definition: a statistic T(X_1, \ldots, X_n) is said to be
sufficient for \theta if the conditional dist’n of X_1, \ldots, X_n,
given T = t, does not depend on \theta for any value of t.
In other words, given the value of T, which is called
a sufficient statistic, one can gain no more
knowledge about the parameter \theta from further
investigation of the sample dist’n.
8.7.1: A Factorization Theorem
How to get a sufficient statistic?
Theorem A: a necessary and sufficient condition
for T(X_1, \ldots, X_n) to be sufficient for a
parameter \theta is that the joint PDF or PMF
factors in the form:
f(x_1, \ldots, x_n \mid \theta) = g\left( T(x_1, \ldots, x_n), \theta \right) h(x_1, \ldots, x_n)
Corollary A: if T is sufficient for \theta, then the MLE is
a function of T.
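For instance (a worked example, not part of the original slides), for an I.I.D. Poisson(\lambda) sample the joint PMF factors as
f(x_1, \ldots, x_n \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \left( \lambda^{\sum x_i} e^{-n\lambda} \right) \cdot \frac{1}{\prod_{i=1}^{n} x_i!}
with g(T, \lambda) = \lambda^{T} e^{-n\lambda} and h(x_1, \ldots, x_n) = 1 / \prod_{i=1}^{n} x_i!, so T = \sum_{i=1}^{n} X_i is sufficient for \lambda; indeed, the MLE \hat{\lambda} = T/n is a function of T.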
8.7.2: The Rao-Blackwell thm
The following theorem gives a quantitative rationale
for basing an estimator of a parameter  on an
existing sufficient statistic.
Theorem: Rao-Blackwell Theorem
Let \hat{\theta} be an estimator of \theta with E(\hat{\theta}^2) < \infty for all \theta.
Suppose that T is sufficient for \theta,
and let \tilde{\theta} = E(\hat{\theta} \mid T).
Then, for all \theta,
E\left[ (\tilde{\theta} - \theta)^2 \right] \le E\left[ (\hat{\theta} - \theta)^2 \right]
The inequality is strict unless \hat{\theta} = \tilde{\theta}.
8.8: Conclusion
Some key ideas from Chap. 7, such as sampling
distributions and confidence intervals, were revisited.
The MOM and the MLE were presented, along with
large-sample approximations to their sampling distributions.
Theoretical concepts of efficiency, the Cramer-Rao
lower bound, and sufficiency were discussed.
Finally, some light was shed on parametric
bootstrapping.