
CHAPTER 8
More About Estimation
8.1 Bayesian Estimation
In this chapter we introduce further concepts related to estimation, beginning with Bayesian estimates, which are also based upon sufficient statistics when the latter exist. We shall now describe the Bayesian approach to the problem of estimation. This approach takes into account any prior knowledge of the experiment that the statistician has, and it is one application of a principle of statistical inference that may be called Bayesian statistics.
Consider a random variable X that has a distribution of probability that depends upon the symbol θ, where θ is an element of a well-defined set Ω. We use the following notation:

Θ: a random variable that has a distribution of probability over the set Ω;
x: a possible value of the random variable X;
θ: a possible value of the random variable Θ.
The distribution of X depends upon θ, an experimental value of the random variable Θ. We shall denote the p.d.f. of Θ by h(θ), and we take h(θ) = 0 when θ is not an element of Ω. Moreover, we now denote the p.d.f. of X by f(x | θ), since we think of it as the conditional p.d.f. of X, given Θ = θ. Say X₁, X₂, ..., Xₙ is a random sample from this conditional distribution of X. Thus we can write the joint conditional p.d.f. of X₁, X₂, ..., Xₙ, given Θ = θ, as f(x₁ | θ)f(x₂ | θ)···f(xₙ | θ). Thus the joint p.d.f. of X₁, X₂, ..., Xₙ and Θ is

g(x_1, x_2, \ldots, x_n, \theta) = f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta)\, h(\theta).
If Θ is a random variable of the continuous type, the joint marginal p.d.f. of X₁, X₂, ..., Xₙ is given by

g_1(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{\infty} g(x_1, x_2, \ldots, x_n, \theta)\, d\theta.

If Θ is a random variable of the discrete type, integration would be replaced by summation. In either case the conditional p.d.f. of Θ, given X₁ = x₁, ..., Xₙ = xₙ, is

k(\theta \mid x_1, \ldots, x_n) = \frac{g(x_1, x_2, \ldots, x_n, \theta)}{g_1(x_1, x_2, \ldots, x_n)} = \frac{f(x_1 \mid \theta) f(x_2 \mid \theta) \cdots f(x_n \mid \theta)\, h(\theta)}{g_1(x_1, x_2, \ldots, x_n)}.
This relationship is another form of Bayes' formula.
Example 1. Let X₁, X₂, ..., Xₙ be a random sample from a Poisson distribution with mean θ, where θ is the observed value of a random variable Θ having a gamma distribution with known parameters α and β. Thus

g(x_1, \ldots, x_n, \theta) = \frac{\theta^{x_1} e^{-\theta}}{x_1!} \cdots \frac{\theta^{x_n} e^{-\theta}}{x_n!} \cdot \frac{\theta^{\alpha - 1} e^{-\theta/\beta}}{\Gamma(\alpha)\, \beta^{\alpha}},

provided that xᵢ = 0, 1, 2, 3, ..., i = 1, 2, ..., n, and 0 < θ < ∞, and is equal to zero elsewhere. Then

g_1(x_1, \ldots, x_n) = \int_0^{\infty} \frac{\theta^{\sum x_i + \alpha - 1}\, e^{-\theta(n + 1/\beta)}}{x_1! \cdots x_n!\, \Gamma(\alpha)\, \beta^{\alpha}}\, d\theta = \frac{\Gamma\left(\sum x_i + \alpha\right)}{x_1! \cdots x_n!\, \Gamma(\alpha)\, \beta^{\alpha}\, (n + 1/\beta)^{\sum x_i + \alpha}}.
Finally, the conditional p.d.f. of Θ, given X₁ = x₁, ..., Xₙ = xₙ, is

k(\theta \mid x_1, \ldots, x_n) = \frac{g(x_1, \ldots, x_n, \theta)}{g_1(x_1, \ldots, x_n)} = \frac{\theta^{\sum x_i + \alpha - 1}\, e^{-\theta(n\beta + 1)/\beta}}{\Gamma\left(\sum x_i + \alpha\right) \left[\beta/(n\beta + 1)\right]^{\sum x_i + \alpha}},

provided that 0 < θ < ∞, and is equal to zero elsewhere. This conditional p.d.f. is one of the gamma type with parameters α* = Σ xᵢ + α and β* = β/(nβ + 1).
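To make this conjugate update concrete, here is a minimal numerical sketch in Python; the counts in x and the prior parameters α and β are invented purely for illustration.

```python
from scipy.stats import gamma

# Hypothetical data and prior: six Poisson counts, gamma prior with alpha=2, beta=1.5
x = [3, 1, 4, 2, 0, 3]
alpha, beta = 2.0, 1.5

n, s = len(x), sum(x)
alpha_post = s + alpha              # alpha* = sum(x_i) + alpha
beta_post = beta / (n * beta + 1)   # beta*  = beta / (n*beta + 1)

# The posterior is gamma with shape alpha* and scale beta*
posterior = gamma(a=alpha_post, scale=beta_post)
print("posterior mean:", posterior.mean())  # equals alpha* * beta*
```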
Bayesian statisticians frequently write that k(θ | x₁, ..., xₙ) is proportional to g(x₁, x₂, ..., xₙ, θ); that is,

k(\theta \mid x_1, \ldots, x_n) \propto f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, h(\theta).

In Example 1, the Bayesian statistician would simply write

k(\theta \mid x_1, \ldots, x_n) \propto \theta^{\sum x_i} e^{-n\theta}\, \theta^{\alpha - 1} e^{-\theta/\beta}

or, equivalently,

k(\theta \mid x_1, \ldots, x_n) \propto \theta^{\sum x_i + \alpha - 1}\, e^{-\theta(n\beta + 1)/\beta},

0 < θ < ∞, and is equal to zero elsewhere.
In Bayesian statistics, the p.d.f. h(θ) is called the prior p.d.f. of Θ, and the conditional p.d.f. k(θ | y) is called the posterior p.d.f. of Θ; here y denotes the observed value of a statistic Y (often a sufficient statistic), so that k(θ | y) plays the role of k(θ | x₁, ..., xₙ) above. Suppose that we want a point estimate of θ. This really amounts to selecting a decision function δ, so that δ(y) is a predicted value of θ when the experimental value y and k(θ | y) are known. We use the following notation:

W: an experimental value of any random variable;
E(W): the mean of the distribution of W;
ℒ[θ, δ(y)]: the loss function.
A Bayes' solution is a decision function δ that minimizes

E\{\mathcal{L}[\Theta, \delta(y)] \mid Y = y\} = \int_{-\infty}^{\infty} \mathcal{L}[\theta, \delta(y)]\, k(\theta \mid y)\, d\theta

for each y. Minimizing this conditional expectation of the loss for every y also minimizes the overall expected loss

\int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} \mathcal{L}[\theta, \delta(y)]\, k(\theta \mid y)\, d\theta \right] g_1(y)\, dy = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} \mathcal{L}[\theta, \delta(y)]\, g(y \mid \theta)\, dy \right] h(\theta)\, d\theta,

where g₁(y) is the marginal p.d.f. of Y and g(y | θ) is the conditional p.d.f. of Y, given Θ = θ.
If an interval estimate of θ is desired, we can find two functions u(y) and v(y) so that the conditional probability

\Pr[u(y) < \Theta < v(y) \mid Y = y] = \int_{u(y)}^{v(y)} k(\theta \mid y)\, d\theta

is large, say 0.95.
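For the Poisson-gamma posterior of Example 1, such an interval can be read directly off the gamma quantile function. A sketch reusing the hypothetical data above; the equal-tailed choice of u(y) and v(y) is one convention among many.

```python
from scipy.stats import gamma

# Hypothetical counts and prior, as in the earlier sketch
x = [3, 1, 4, 2, 0, 3]
alpha, beta = 2.0, 1.5
n, s = len(x), sum(x)

posterior = gamma(a=s + alpha, scale=beta / (n * beta + 1))

# Equal-tailed interval (u, v) with posterior probability 0.95
u, v = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"Pr({u:.3f} < Theta < {v:.3f} | data) = 0.95")
```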
8.2 Fisher Information and the Rao-Cramér Inequality
Let X be a random variable with p.d.f. f(x; θ), θ ∈ Ω, where the parameter space Ω is an interval. We consider only special cases, sometimes called regular cases, of probability density functions, as we wish to differentiate under an integral sign.
We have that

\int_{-\infty}^{\infty} f(x; \theta)\, dx = 1

and, by taking the derivative with respect to θ,

\int_{-\infty}^{\infty} \frac{\partial f(x; \theta)}{\partial \theta}\, dx = 0. \quad (1)

The latter expression can be rewritten as

\int_{-\infty}^{\infty} \frac{\partial f(x; \theta)/\partial \theta}{f(x; \theta)}\, f(x; \theta)\, dx = 0

or, equivalently,

\int_{-\infty}^{\infty} \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx = 0.

If we differentiate again, it follows that

\int_{-\infty}^{\infty} \frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\, dx + \int_{-\infty}^{\infty} \frac{\partial \ln f(x; \theta)}{\partial \theta}\, \frac{\partial f(x; \theta)}{\partial \theta}\, dx = 0. \quad (2)
We rewrite the second term of the left-hand member of this equation as

\int_{-\infty}^{\infty} \frac{\partial \ln f(x; \theta)}{\partial \theta}\, \frac{\partial f(x; \theta)/\partial \theta}{f(x; \theta)}\, f(x; \theta)\, dx = \int_{-\infty}^{\infty} \left[\frac{\partial \ln f(x; \theta)}{\partial \theta}\right]^2 f(x; \theta)\, dx.

This integral is called Fisher information and is denoted by I(θ). That is,

I(\theta) = \int_{-\infty}^{\infty} \left[\frac{\partial \ln f(x; \theta)}{\partial \theta}\right]^2 f(x; \theta)\, dx;

but, from Equation (2), we see that I(θ) can also be computed from

I(\theta) = -\int_{-\infty}^{\infty} \frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\, dx.

Sometimes one expression is easier to compute than the other, but often we prefer the second.
Example 1. Let X be binomial b(1, θ). Thus

\ln f(x; \theta) = x \ln \theta + (1 - x) \ln(1 - \theta),

\frac{\partial \ln f(x; \theta)}{\partial \theta} = \frac{x}{\theta} - \frac{1 - x}{1 - \theta},

and

\frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2} = -\frac{x}{\theta^2} - \frac{1 - x}{(1 - \theta)^2}.

Clearly,

I(\theta) = -E\left[-\frac{X}{\theta^2} - \frac{1 - X}{(1 - \theta)^2}\right] = \frac{\theta}{\theta^2} + \frac{1 - \theta}{(1 - \theta)^2} = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)},

which is larger for θ values close to zero or 1.
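Since X takes only the values 0 and 1, both expressions for I(θ) reduce to two-term sums, which makes the closed form easy to verify numerically. A small sketch (the grid of θ values is arbitrary):

```python
import numpy as np

def fisher_info_bernoulli(theta):
    # I(theta) = E{[d ln f(X; theta)/d theta]^2}, expectation over x in {0, 1}
    total = 0.0
    for x, p in [(0, 1 - theta), (1, theta)]:
        score = x / theta - (1 - x) / (1 - theta)
        total += score ** 2 * p
    return total

for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # agrees with the closed form 1/[theta(1 - theta)]
    assert np.isclose(fisher_info_bernoulli(theta), 1 / (theta * (1 - theta)))
    print(theta, fisher_info_bernoulli(theta))
```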
Now consider a random sample X₁, X₂, ..., Xₙ from such a distribution. The likelihood function is

L(\theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta),

so that

\ln L(\theta) = \ln f(x_1; \theta) + \ln f(x_2; \theta) + \cdots + \ln f(x_n; \theta)

and

\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{\partial \ln f(x_1; \theta)}{\partial \theta} + \frac{\partial \ln f(x_2; \theta)}{\partial \theta} + \cdots + \frac{\partial \ln f(x_n; \theta)}{\partial \theta}.

The Fisher information in the sample is

I_n(\theta) = E\left\{\left[\sum_{i=1}^{n} \frac{\partial \ln f(X_i; \theta)}{\partial \theta}\right]^2\right\} = n I(\theta),

since the n summands are independent, each with mean zero and variance I(θ). The Rao-Cramér inequality states that if the statistic Y = u(X₁, ..., Xₙ) has mean E(Y) = k(θ), then

\sigma_Y^2 \geq \frac{[k'(\theta)]^2}{n I(\theta)}.
Definition 1. Let Y be an unbiased estimator of a parameter θ in such a case of point estimation. The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains the Rao-Cramér lower bound.
Definition 2. In cases in which we can differentiate with respect to a parameter under an integral or summation symbol, the ratio of the Rao-Cramér lower bound to the actual variance of any unbiased estimator of a parameter is called the efficiency of that estimator.
Example 2. Let X₁, X₂, ..., Xₙ denote a random sample from a Poisson distribution that has the mean θ > 0. It is known that X̄ is an m.l.e. of θ; we shall show that it is also an efficient estimator of θ. We have

\frac{\partial \ln f(x; \theta)}{\partial \theta} = \frac{\partial}{\partial \theta}\left(x \ln \theta - \theta - \ln x!\right) = \frac{x}{\theta} - 1 = \frac{x - \theta}{\theta}.

Accordingly,

E\left\{\left[\frac{\partial \ln f(X; \theta)}{\partial \theta}\right]^2\right\} = \frac{E[(X - \theta)^2]}{\theta^2} = \frac{\sigma^2}{\theta^2} = \frac{\theta}{\theta^2} = \frac{1}{\theta}.

The Rao-Cramér lower bound in this case is 1/[n(1/θ)] = θ/n. But θ/n is the variance of X̄. Hence X̄ is an efficient estimator of θ.
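A quick simulation check of this efficiency claim (the values of θ, n, and the replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 4.0, 25, 100_000   # arbitrary values for the experiment

# Each row is one Poisson sample of size n; compute X-bar per sample
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)

print("simulated var(X-bar):", xbar.var())  # close to theta/n
print("Rao-Cramer bound    :", theta / n)
```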
8.3 Limiting Distributions of Maximum Likelihood Estimators
We can differentiate under the integral sign, so that

Z = \frac{\partial \ln L(\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \ln f(X_i; \theta)}{\partial \theta}

has mean zero and variance nI(θ). In addition, we want to be able to find the maximum likelihood estimator θ̂, which maximizes L(θ̂) = f(X₁; θ̂)···f(Xₙ; θ̂), by solving

\frac{\partial \ln L(\theta)}{\partial \theta} = 0.
Expanding ∂ ln L(θ̂)/∂θ = 0 about θ by Taylor's formula gives, approximately,

0 = \frac{\partial \ln L(\theta)}{\partial \theta} + (\hat{\theta} - \theta)\, \frac{\partial^2 \ln L(\theta)}{\partial \theta^2},

so that

\hat{\theta} - \theta = \frac{Z}{-\partial^2 \ln L(\theta)/\partial \theta^2}.

This equation can be rewritten as

\sqrt{n I(\theta)}\,(\hat{\theta} - \theta) = \frac{Z / \sqrt{n I(\theta)}}{-\dfrac{1}{n}\, \dfrac{\partial^2 \ln L(\theta)}{\partial \theta^2} \Big/ I(\theta)}. \quad (1)
Since Z is the sum of the i.i.d. random variables

\frac{\partial \ln f(X_i; \theta)}{\partial \theta}, \quad i = 1, 2, \ldots, n,

each with mean zero and variance I(θ), the numerator of the right-hand member of Equation (1) has a limiting N(0, 1) distribution by the central limit theorem. Moreover, since −(1/n) ∂² ln L(θ)/∂θ² converges in probability to I(θ) by the law of large numbers, the denominator converges in probability to 1. Hence √(nI(θ))(θ̂ − θ) has a limiting N(0, 1) distribution; that is, θ̂ has an approximate N(θ, 1/[nI(θ)]) distribution.
Example. Suppose that the random sample arises from a distribution with p.d.f.

f(x; \theta) = \theta x^{\theta - 1}, \quad 0 < x < 1, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\},

zero elsewhere. We have

\ln f(x; \theta) = \ln \theta + (\theta - 1) \ln x,

\frac{\partial \ln f(x; \theta)}{\partial \theta} = \frac{1}{\theta} + \ln x,

and

\frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2} = -\frac{1}{\theta^2}.

Since I(θ) = −E(−1/θ²) = 1/θ², the lower bound of the variance of every unbiased estimator of θ is θ²/n. Moreover, the maximum likelihood estimator

\hat{\theta} = -\frac{n}{\ln \prod_{i=1}^{n} X_i} = -\frac{n}{\sum_{i=1}^{n} \ln X_i}

has an approximate normal distribution with mean θ and variance θ²/n. Thus, in a limiting sense, θ̂ is the unbiased minimum variance estimator of θ; that is, θ̂ is asymptotically efficient.
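This limiting behavior can be checked by simulation; since F(x) = x^θ on (0, 1), draws are obtained as U^(1/θ) from uniform U. A sketch with arbitrary θ and n:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 200, 50_000   # arbitrary values

# Inverse-CDF sampling: F(x) = x^theta on (0, 1), so X = U^(1/theta)
x = rng.uniform(size=(reps, n)) ** (1 / theta)

theta_hat = -n / np.log(x).sum(axis=1)  # the m.l.e. for each replicate
print("mean of theta-hat:", theta_hat.mean())  # close to theta
print("var of theta-hat :", theta_hat.var())   # close to theta^2/n
print("theta^2 / n      :", theta ** 2 / n)
```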
8.4 Robust M-Estimation
We have found the m.l.e. of the center θ of the Cauchy distribution with p.d.f.

f(x; \theta) = \frac{1}{\pi\left[1 + (x - \theta)^2\right]}, \quad -\infty < x < \infty,

where −∞ < θ < ∞. The logarithm of the likelihood function of a random sample X₁, X₂, ..., Xₙ from this distribution is

\ln L(\theta) = -n \ln \pi - \sum_{i=1}^{n} \ln\left[1 + (x_i - \theta)^2\right].

To maximize, we set the derivative equal to zero:

\frac{d \ln L(\theta)}{d \theta} = \sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0.
The equation can be solved by some iterative process. We use the weight function

w(x - \hat{\theta}_0) = \frac{2}{1 + (x - \hat{\theta}_0)^2},

where θ̂₀ is a first estimate of θ (such as the sample median).
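One standard way to turn these weights into an iteration (a sketch, not necessarily the book's exact scheme) is to note that the likelihood equation can be written Σ w(xᵢ − θ)(xᵢ − θ) = 0, and to hold the weights fixed at the current estimate so that each update is a weighted mean; at a fixed point, the weighted-mean condition is exactly the likelihood equation above. The sample below is invented for illustration.

```python
import numpy as np

def cauchy_center(x, n_iter=50):
    # Iteratively reweighted estimation of the Cauchy center theta:
    # solve sum w(x_i - theta)(x_i - theta) = 0, with w(x) = 2/(1 + x^2)
    theta = np.median(x)  # robust starting value
    for _ in range(n_iter):
        w = 2.0 / (1.0 + (x - theta) ** 2)
        theta = np.sum(w * x) / np.sum(w)  # weighted-mean update
    return theta

rng = np.random.default_rng(2)
x = rng.standard_cauchy(100) + 3.0  # hypothetical sample centered at 3
print(cauchy_center(x))             # close to 3
```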
More generally, write

\ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i - \theta) = -\sum_{i=1}^{n} \rho(x_i - \theta),

where ρ(x) = −ln f(x), and

\frac{d \ln L(\theta)}{d \theta} = -\sum_{i=1}^{n} \frac{f'(x_i - \theta)}{f(x_i - \theta)} = \sum_{i=1}^{n} \psi(x_i - \theta),

where ψ(x) = ρ'(x). In the Cauchy case we have

\rho(x) = \ln \pi + \ln(1 + x^2)

and

\psi(x) = \frac{2x}{1 + x^2}.

In addition, we define a weight function as

w(x) = \frac{\psi(x)}{x},

which equals 2/(1 + x²) in the Cauchy case.
Definition 1. An estimator that is fairly good (small variance, say) for a wide variety of distributions (not necessarily the best for any one of them) is called a robust estimator.
Definition 2. Estimators associated with the solution θ̂ of the equation

\sum_{i=1}^{n} \psi(x_i - \theta) = 0

are frequently called robust M-estimators (denoted by θ̂) because they can be thought of as maximum likelihood type estimators.
Huber's ψ function:

\psi(x) =
\begin{cases}
-k, & x < -k, \\
x, & -k \le x \le k, \\
k, & k < x,
\end{cases}

with weight w(x) = 1 for |x| ≤ k, and w(x) = k/|x| provided that k < |x|.
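A direct transcription of Huber's ψ and its weight function; the tuning constant k = 1.5 is an assumed illustrative choice, not one fixed by the text.

```python
import numpy as np

def huber_psi(x, k=1.5):
    # psi(x) = -k for x < -k, x for -k <= x <= k, k for k < x
    return np.clip(x, -k, k)

def huber_weight(x, k=1.5):
    # w(x) = psi(x)/x: 1 for |x| <= k, k/|x| for |x| > k (and 1 at x = 0)
    ax = np.abs(x)
    return np.where(ax <= k, 1.0, k / np.where(ax > 0, ax, 1.0))

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(huber_psi(z))     # [-1.5 -0.5  0.   1.5 ]
print(huber_weight(z))  # [ 0.5  1.   1.   0.75]
```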
With Huber's ψ function, another problem arises: if we double each of X₁, X₂, ..., Xₙ, estimators such as X̄ and median(Xᵢ) also double. This is not at all true with the solution of the equation

\sum_{i=1}^{n} \psi(x_i - \theta) = 0,

where the ψ function is that of Huber; the solution does not scale along with the data. A common remedy is to introduce a scale estimate d and instead solve

\sum_{i=1}^{n} \psi\left(\frac{x_i - \theta}{d}\right) = 0. \quad (1)

A popular d to use is

d = \frac{\text{median}\,|x_i - \text{median}(x_i)|}{0.6745}

(the denominator 0.6745 makes d approximately equal to the standard deviation when the sample arises from a normal distribution).
The scheme of selecting d also provides us with a clue for selecting k: we want most of the items to satisfy the inequality

\left|\frac{x_i - \theta}{d}\right| \le k,

because then

\psi\left(\frac{x_i - \theta}{d}\right) = \frac{x_i - \theta}{d}.

If all the values satisfy this inequality, then Equation (1) becomes

\sum_{i=1}^{n} \psi\left(\frac{x_i - \theta}{d}\right) = \sum_{i=1}^{n} \frac{x_i - \theta}{d} = 0.

This has the solution x̄, which of course is most desirable with normal distributions.
To solve Equation (1), we can use Newton's method. Let θ̂₀ be a first estimate of θ, such as θ̂₀ = median(xᵢ). Approximating the left-hand member of Equation (1) by two terms of Taylor's expansion about θ̂₀ gives

\sum_{i=1}^{n} \psi\left(\frac{x_i - \hat{\theta}_0}{d}\right) + (\theta - \hat{\theta}_0) \sum_{i=1}^{n} \psi'\left(\frac{x_i - \hat{\theta}_0}{d}\right)\left(-\frac{1}{d}\right) = 0,

approximately, which has the root

\hat{\theta}_1 = \hat{\theta}_0 + d\, \frac{\sum_{i=1}^{n} \psi\left[(x_i - \hat{\theta}_0)/d\right]}{\sum_{i=1}^{n} \psi'\left[(x_i - \hat{\theta}_0)/d\right]},

the one-step M-estimate of θ. If we use θ̂₁ in place of θ̂₀, we obtain θ̂₂, the two-step M-estimate of θ.
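Putting the pieces together, a sketch of the one-step Huber M-estimate, with d taken as the scaled median absolute deviation above and an assumed k = 1.5; the contaminated sample is invented to show the robustness.

```python
import numpy as np

def one_step_huber(x, k=1.5):
    # theta_1 = theta_0 + d * sum(psi(z_i)) / sum(psi'(z_i)),
    # with theta_0 = median(x) and d = median|x_i - median(x)| / 0.6745
    x = np.asarray(x, dtype=float)
    theta0 = np.median(x)
    d = np.median(np.abs(x - theta0)) / 0.6745
    z = (x - theta0) / d
    psi = np.clip(z, -k, k)                     # Huber psi
    psi_prime = (np.abs(z) <= k).astype(float)  # psi' is 1 on [-k, k], else 0
    return theta0 + d * psi.sum() / psi_prime.sum()

rng = np.random.default_rng(3)
# 95 well-behaved observations plus 5 wild ones, all centered at 10
x = np.concatenate([rng.normal(10.0, 1.0, 95), rng.normal(10.0, 20.0, 5)])
print(one_step_huber(x))  # close to 10 despite the contamination
```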
Suppose now that the scale parameter d is known rather than estimated from the data. Two terms of Taylor's expansion of

\sum_{i=1}^{n} \psi\left(\frac{X_i - \hat{\theta}}{d}\right) = 0

about θ provide the approximation

\sum_{i=1}^{n} \psi\left(\frac{X_i - \theta}{d}\right) + (\hat{\theta} - \theta) \sum_{i=1}^{n} \psi'\left(\frac{X_i - \theta}{d}\right)\left(-\frac{1}{d}\right) = 0.

This can be rewritten as

\hat{\theta} - \theta = \frac{d \sum_{i=1}^{n} \psi\left[(X_i - \theta)/d\right]}{\sum_{i=1}^{n} \psi'\left[(X_i - \theta)/d\right]}. \quad (2)
We have considered ψ so that

E\left[\psi\left(\frac{X - \theta}{d}\right)\right] = 0.

Clearly,

\mathrm{var}\left[\psi\left(\frac{X - \theta}{d}\right)\right] = E\left[\psi^2\left(\frac{X - \theta}{d}\right)\right].

Thus Equation (2) can be rewritten as

\frac{\sqrt{n}\,(\hat{\theta} - \theta)}{d \sqrt{E\{\psi^2[(X - \theta)/d]\}} \,\big/\, E\{\psi'[(X - \theta)/d]\}} = \frac{\sum_{i=1}^{n} \psi\left[(X_i - \theta)/d\right] \,\Big/\, \sqrt{n\, E\{\psi^2[(X - \theta)/d]\}}}{\dfrac{1}{n} \sum_{i=1}^{n} \psi'\left[(X_i - \theta)/d\right] \,\Big/\, E\{\psi'[(X - \theta)/d]\}}. \quad (3)

The numerator of the right-hand member of Equation (3) has a limiting N(0, 1) distribution by the central limit theorem, and the denominator converges in probability to 1. Hence θ̂ has an approximate normal distribution with mean θ and variance

\frac{d^2\, E\{\psi^2[(X - \theta)/d]\}}{n \left(E\{\psi'[(X - \theta)/d]\}\right)^2}.
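The two expectations in Equation (3) are easy to evaluate numerically for a given model. As an illustration under an assumed standard normal model (θ = 0, d = 1, Huber k = 1.5), the asymptotic variance of √n(θ̂ − θ) works out to about 1.037, only a few percent above the corresponding variance for X̄:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

k = 1.5  # assumed Huber tuning constant

def psi(x):
    return np.clip(x, -k, k)

# Under a standard normal model with theta = 0 and d = 1:
# E[psi^2(X)] by numerical integration; E[psi'(X)] = Pr(|X| <= k) in closed form
e_psi2, _ = quad(lambda t: psi(t) ** 2 * norm.pdf(t), -np.inf, np.inf)
e_psi_prime = 2 * norm.cdf(k) - 1

print("asymptotic variance of sqrt(n)(theta-hat - theta):",
      e_psi2 / e_psi_prime ** 2)   # about 1.037, versus 1.0 for X-bar
```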