Ch15: Decision Theory & Bayesian Inference

15.1: Intro
We are back to some theoretical statistics:
1. Decision Theory
– Make decisions in the presence of uncertainty
2. Bayesian Inference
– An alternative to the traditional (“frequentist”) method
15.2: Decision Theory
New Terminology:
(true) state of nature = parameter $\theta \in \Theta$
action $a \in A$: a choice based on the observation of data, or of a random variable $X$ whose CDF depends on $\theta$
(statistical) decision function: $d: \mathcal{X} \to A$ (data space $\to$ action space)
loss function: $l: \Theta \times A \to \mathbb{R}$
risk function = expected loss: $R(\theta, d) = E[\,l(\theta, d(X))\,]$
Example of a quadratic loss function:
$l(\theta, d(X)) = [\tau(\theta) - d(X)]^2$, where $d(X)$ estimates $\tau(\theta)$
$\Rightarrow R(\theta, d) = E[\tau(\theta) - d(X)]^2$ = Mean Square Error
Example: Game Theory
A = manager of an oil company vs. B = opponent (Nature)
Situation: Is there any oil at a given location?
Each of the players A and B has the choice of 2 moves:
A has the choice between actions $a_1, a_2 \in A$: to continue or to stop drilling.
B controls the choice between parameters $\theta_1, \theta_2 \in \Theta$: whether there is oil or not.
Loss function $l(a_i, \theta_j)$: if A chooses action $a_i$ and B chooses parameter $\theta_j$, then A pays the amount $l(a_i, \theta_j)$ to B.
The four possible losses are $l(a_1, \theta_1)$, $l(a_1, \theta_2)$, $l(a_2, \theta_1)$, $l(a_2, \theta_2)$, as in the sketch below.
15.2.1: Bayes & Minimax Rules
A “good decision” is one with smaller risk.
What if $R(\theta_1, d_1) < R(\theta_1, d_2)$ but $R(\theta_2, d_1) > R(\theta_2, d_2)$? TROUBLE: neither rule dominates the other.
To get around this, use either a Minimax or a Bayes rule:
• Minimax Rule (minimize the maximum risk):
$d^* = \arg\min_d \max_{\theta} R(\theta, d)$
• Bayes Rule (minimize the Bayes risk):
$d^* = \arg\min_d E_{\Theta}[R(\Theta, d)]$, where $E_{\Theta}[R(\Theta, d)]$ is the Bayes risk, i.e., the risk averaged over the prior distribution on $\Theta$.
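To make the two criteria concrete, here is a small Python sketch with made-up risk values (not from the slides) that chooses between two decision rules by the minimax criterion and by the Bayes criterion under an assumed prior.

```python
# Hypothetical risks R(theta, d) for two decision rules d1, d2 at two states
# theta1, theta2 (values are made up to illustrate the "neither dominates" case).
risk = {
    "d1": {"theta1": 1.0, "theta2": 4.0},
    "d2": {"theta1": 3.0, "theta2": 2.0},
}
prior = {"theta1": 0.3, "theta2": 0.7}   # assumed prior g(theta) for the Bayes rule

# Minimax rule: minimize the maximum risk over theta
minimax_rule = min(risk, key=lambda d: max(risk[d].values()))

# Bayes rule: minimize the Bayes risk, the prior-weighted average of R(theta, d)
bayes_risk = {d: sum(prior[t] * r for t, r in risk[d].items()) for d in risk}
bayes_rule = min(bayes_risk, key=bayes_risk.get)

print(minimax_rule, bayes_rule, bayes_risk)
# minimax picks d2 (max risk 3.0 < 4.0); Bayes risks are d1: 3.1, d2: 2.3, so Bayes also picks d2
```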






Classical Stat. vs Bayesian Stat.
Classical (or Frequentist): unknown but fixed parameters are to be estimated from the data.
Data → Model → Inference and/or Prediction
Bayesian: parameters are random variables.
The data and the prior are combined to estimate the posterior.
The same picture as above, with some Prior Information added:
Data + Prior Information → Model → Inference and/or Prediction
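As a minimal illustration of the two workflows, the sketch below analyzes the same made-up binomial data both ways: the frequentist point estimate is the sample proportion, while the Bayesian answer combines a uniform prior with the likelihood and reports the posterior mean. The observed counts and the uniform prior are illustrative assumptions, not from the slides.

```python
from scipy.integrate import quad

# Made-up data: x = 7 successes in n = 10 trials, success probability theta unknown.
n, x = 10, 7

# Classical/frequentist: theta is fixed; the MLE is the sample proportion.
theta_mle = x / n

# Bayesian: theta is a random variable; combine a uniform prior g(theta) = 1
# with the likelihood f(x | theta) proportional to theta^x (1 - theta)^(n - x),
# then report the posterior mean (the Bayes estimate under squared error loss).
likelihood = lambda t: t**x * (1 - t) ** (n - x)
norm_const, _ = quad(likelihood, 0, 1)                       # normalizing constant
post_mean, _ = quad(lambda t: t * likelihood(t) / norm_const, 0, 1)

print(theta_mle, post_mean)   # 0.7 vs about 0.667 = (x + 1)/(n + 2)
```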
15.2.2: Posterior Analysis
Bayesians look at the parameter $\theta$ as a random variable $\Theta$ with prior distribution $g(\theta)$ and posterior distribution
$$h(\theta \mid X = x) \;\propto\; f(X = x \mid \theta)\, g(\theta)$$
Theorem A: If $d_0(x) = \arg\min_d E[\,l(\Theta, d(x))\,]$, where
$E[\,l(\Theta, d(x))\,] = \sum_{\theta} l(\theta, d(x))\, h(\theta \mid x)$ (discrete case)
or $E[\,l(\Theta, d(x))\,] = \int l(\theta, d(x))\, h(\theta \mid x)\, d\theta$ (continuous case)
is the posterior risk of the action $a = d(x)$, i.e., its expected loss under the posterior,
then $d_0(x)$ is the Bayes rule.
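A minimal numerical sketch of Theorem A, assuming a Poisson likelihood, a flat prior on a grid of parameter values, and one made-up observation (all illustrative choices): the action minimizing the posterior risk under squared error loss comes out at essentially the posterior mean.

```python
import numpy as np
from math import factorial

# Posterior h(lambda | x) proportional to f(x | lambda) g(lambda) on a grid,
# then the action that minimizes the posterior risk under squared error loss.
lam_grid = np.linspace(0.1, 15.0, 300)          # discretized parameter values
prior = np.ones_like(lam_grid)                  # flat prior, up to a constant
x = 4                                           # hypothetical observed count

likelihood = lam_grid**x * np.exp(-lam_grid) / factorial(x)
posterior = likelihood * prior
posterior /= posterior.sum()                    # normalize the grid weights

# Posterior risk of each candidate action a: E[(Lambda - a)^2 | X = x]
post_risk = np.array([(posterior * (lam_grid - a) ** 2).sum() for a in lam_grid])
best_action = lam_grid[post_risk.argmin()]
print(best_action, (posterior * lam_grid).sum())   # both about 5, the posterior mean
```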
15.2.3: Classification &
Hypothesis Testing
Wish: classify an element as belonging to one of the
classes partitioning a population of interest.
e.g. an utterance will be classified by a computer as one
of the words in its dictionary via sound measurements.
Hypothesis testing can be seen as a classification problem with a constraint on the probability of misclassification (the probability of a type I error).
Neyman–Pearson Lemma: Let $d^*$ be a test of $H_0: X \sim f(x \mid \theta_1)$ vs $H_1: X \sim f(x \mid \theta_2)$ with acceptance region $\dfrac{f(x \mid \theta_2)}{f(x \mid \theta_1)} < c$ and significance level $\alpha$.
Let $d$ be another test with significance level $\alpha' \le \alpha$.
Then the power of $d$ is less than or equal to the power of $d^*$.
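Here is a hedged Python sketch of a likelihood ratio test for two simple normal hypotheses; the means (N(0,1) vs N(1,1)), the level 0.05, and the Monte Carlo sizes are illustrative choices, not from the slides. For these densities the rejection region "large likelihood ratio" reduces to "x above a cutoff".

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha = 0.05
cutoff = norm.ppf(1 - alpha)   # chosen so that P(reject | H0) = alpha

def reject(x):
    # The likelihood ratio f(x | theta2) / f(x | theta1) is increasing in x here,
    # so "ratio above a constant" is the same as "x above the cutoff".
    return x > cutoff

x0 = rng.normal(0.0, 1.0, size=100_000)   # data generated under H0
x1 = rng.normal(1.0, 1.0, size=100_000)   # data generated under H1
print(reject(x0).mean())   # estimated level, about 0.05
print(reject(x1).mean())   # estimated power of the most powerful level-alpha test, about 0.26
```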
15.2.4: Estimation
Theorem A: The Bayes (rule) estimate is the mean of the posterior distribution.
Proof: Recall that $l(\theta, d) = (\theta - \hat{\theta})^2$ is the squared error loss, where $\hat{\theta} = d(x)$ estimates $\theta$.
$\Rightarrow$ The posterior risk is:
$$E[(\Theta - \hat{\theta})^2 \mid X = x] = \mathrm{Var}(\Theta \mid X = x) + [\,E(\Theta \mid X = x) - \hat{\theta}\,]^2$$
The first term is the posterior variance, which does not depend on $\hat{\theta}$; the second is the squared bias, which is minimized by $\hat{\theta} = E(\Theta \mid X = x)$.
Thus the Bayes rule $\hat{\theta} = E(\Theta \mid X = x)$ is the mean of the posterior distribution.
Continuous case: $\hat{\theta} = \int \theta\, h(\theta \mid x)\, d\theta$
Discrete case: $\hat{\theta} = \sum_{\theta} \theta\, h(\theta \mid x)$
15.2.4: Estimation (example)
Example: A biased coin is thrown once. Let $\theta$ be the probability of heads. What is $\hat{\theta}$?
To reflect the fact that we have no idea how biased the coin is, we put a uniform prior distribution on $\theta$: $g(\theta) = 1$, $0 \le \theta \le 1$.
Let $X = 1$ (if a head appears) and $X = 0$ (if a tail appears).
$\Rightarrow$ The distribution of $X$ given $\theta$ is $f(x \mid \theta) = \theta$ if $x = 1$, and $1 - \theta$ if $x = 0$,
and the posterior distribution is
$$h(\theta \mid x) = \frac{f(x \mid \theta) \cdot 1}{\int_0^1 f(x \mid \theta)\, d\theta}$$
$$\Rightarrow\; h(\theta \mid X = 1) = \frac{\theta}{\int_0^1 \theta\, d\theta} = 2\theta \quad\text{and}\quad h(\theta \mid X = 0) = \frac{1 - \theta}{\int_0^1 (1 - \theta)\, d\theta} = 2(1 - \theta)$$
Finally, the Bayes estimate of $\theta$ is
$$\hat{\theta} = \int_0^1 \theta\, h(\theta \mid X = x)\, d\theta = \begin{cases} 2/3, & \text{if } x = 1 \\ 1/3, & \text{if } x = 0 \end{cases}$$
Caution: The classical MLEs are 1 (if $x = 1$) and 0 (if $x = 0$), which differ from the Bayes estimates.
15.3.1: Bayesian Inference for the
Normal Distribution
Theorem A: Assume $\Theta \sim N(\mu_0, \sigma_0^2)$ and $X \mid \Theta = \mu \sim N(\mu, \sigma^2)$.
$\Rightarrow$ The posterior distribution of $\Theta$ is:
$$\Theta \mid X = x \;\sim\; N\!\left( \frac{\mu_0/\sigma_0^2 + x/\sigma^2}{1/\sigma_0^2 + 1/\sigma^2},\; \frac{1}{1/\sigma_0^2 + 1/\sigma^2} \right)$$
Note: The posterior mean is a weighted average of the prior mean and the data.
If the experiment (the observation of $X$) is much more informative than the prior distribution, that is $\sigma_0^2 \gg \sigma^2$ (so $1/\sigma_0^2 \approx 0$), then $\Theta \mid X = x \approx N(x, \sigma^2)$.
Proof: read page 589 of the textbook.
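A minimal sketch of the single-observation update in Theorem A; the prior parameters, the sampling standard deviation, and the observed value are arbitrary illustrative numbers.

```python
# Normal-normal update for one observation (all numbers are illustrative).
mu0, sigma0 = 0.0, 2.0     # prior mean and standard deviation
sigma = 1.0                # known sampling standard deviation
x = 3.0                    # observed value

precision = 1 / sigma0**2 + 1 / sigma**2                 # reciprocal of the posterior variance
post_mean = (mu0 / sigma0**2 + x / sigma**2) / precision
post_var = 1 / precision
print(post_mean, post_var)  # 2.4 and 0.8: the mean is pulled from x = 3.0 toward mu0 = 0.0
```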
How is the prior distribution altered by a random sample?
Theorem A (extended to a random sample): Assume $\Theta \sim N(\mu_0, \sigma_0^2)$ and $X_1, \ldots, X_n \mid \Theta = \mu \;\overset{iid}{\sim}\; N(\mu, \sigma^2)$.
$\Rightarrow$ The posterior distribution of $\Theta$ is:
$$\Theta \mid X_1 = x_1, \ldots, X_n = x_n \;\sim\; N\!\left( \frac{\mu_0/\sigma_0^2 + n\bar{x}/\sigma^2}{1/\sigma_0^2 + n/\sigma^2},\; \frac{1}{1/\sigma_0^2 + n/\sigma^2} \right)$$


15.3.2: The Beta Dist’n is a conjugate prior to the Binomial
Definition: The probability density function of the Beta distribution with parameters $a > 0$ and $b > 0$ is:
$$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\, \Gamma(b)}\, x^{a-1} (1 - x)^{b-1}, \quad 0 \le x \le 1$$
Theorem: $Y \sim \mathrm{Beta}(a, b) \;\Rightarrow\; E(Y) = \dfrac{a}{a + b}$ and $\mathrm{Var}(Y) = \dfrac{ab}{(a + b)^2 (a + b + 1)}$
Application: Assume $p \sim \mathrm{Beta}(a, b)$ and $X \mid p \sim \mathrm{Bin}(n, p)$.
$\Rightarrow$ The posterior distribution of $p$ is $p \mid X = x \sim \mathrm{Beta}(a + x,\; b + n - x)$.
Since $\mu_{\text{prior}} = E(p) = \dfrac{a}{a + b}$ and $\mu_{\text{posterior}} = E(p \mid X = x) = \dfrac{a + x}{a + b + n}$, we get
$$\mu_{\text{posterior}} = \frac{a + b}{a + b + n}\, \mu_{\text{prior}} + \frac{1}{a + b + n}\, x$$
(the posterior mean is a weighted average of the prior mean and the data).
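A short sketch of the Beta–Binomial update and a check of the weighted-average form of the posterior mean; the values of a, b, n, and x are illustrative.

```python
# Beta-Binomial conjugate update (a, b, n, and x are illustrative values).
a, b = 2.0, 3.0          # prior Beta(a, b), so the prior mean is a/(a+b) = 0.4
n, x = 10, 7             # hypothetical binomial data

post_a, post_b = a + x, b + n - x            # posterior is Beta(a + x, b + n - x)
prior_mean = a / (a + b)
post_mean = post_a / (post_a + post_b)       # (a + x) / (a + b + n)

# Weighted-average form of the posterior mean
weighted = (a + b) / (a + b + n) * prior_mean + x / (a + b + n)
print(post_mean, weighted)   # both equal 9/15 = 0.6
```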