Ch15: Decision Theory & Bayesian Inference

15.1: Intro
We return to theoretical statistics:
1. Decision Theory
– making decisions in the presence of uncertainty
2. Bayesian Inference
– an alternative to the traditional ("frequentist") approach
15.2: Decision Theory
New terminology:
- (true) state of nature = parameter $\theta$
- action $a \in \mathcal{A}$: a choice based on the observation of data, or of a random variable $X$ whose CDF depends on $\theta$
- (statistical) decision function $d : \mathcal{X} \to \mathcal{A}$ (data space $\to$ action space)
- loss function $l : \Theta \times \mathcal{A} \to \mathbb{R}$
- risk function = expected loss: $R(\theta, d) = E\,l(\theta, d(X))$

Example of a quadratic loss function:
$l(\theta, d(X)) = [\tau(\theta) - d(X)]^2$, where $d(X)$ estimates $\tau(\theta)$, so that
$R(\theta, d) = E\,[\tau(\theta) - d(X)]^2$ is the Mean Square Error.
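The risk of a rule can be estimated by simulation: average the loss over many replicated data sets. A minimal sketch (the rules, sample size, and $\theta$ value are hypothetical choices, not from the text), comparing the sample mean with a shrinkage rule under squared error loss:

```python
import numpy as np

# Monte Carlo estimate of the risk R(theta, d) = E[(theta - d(X))^2]
# for X_1..X_n ~ N(theta, 1). Numbers below are illustrative only.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 100_000

X = rng.normal(theta, 1.0, size=(reps, n))
d1 = X.mean(axis=1)           # decision rule 1: sample mean
d2 = 0.9 * X.mean(axis=1)     # decision rule 2: a (hypothetical) shrinkage rule

risk1 = np.mean((theta - d1) ** 2)   # analytic value: 1/n = 0.1
risk2 = np.mean((theta - d2) ** 2)   # analytic value: 0.81/n + (0.1*theta)^2 = 0.121
print(risk1, risk2)
```

At this particular $\theta$ the sample mean wins; for $\theta$ near 0 the shrinkage rule would have smaller risk, which is exactly why rules must be compared as functions of $\theta$.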
Example: Game Theory
A = manager of an oil company vs. B = opponent (Nature).
Situation: is there any oil at a given location?
Each of the players A and B has the choice of 2 moves:
- A chooses between actions $a_1, a_2 \in \mathcal{A}$: to continue or to stop drilling.
- B controls the choice between parameters $\theta_1, \theta_2$: whether there is oil or not.
If A chooses action $a_i$ and B chooses parameter $\theta_j$, then A pays B the amount $l(a_i, \theta_j)$ given by the loss function:

            $\theta_1$         $\theta_2$
  $a_1$   $l(a_1, \theta_1)$   $l(a_1, \theta_2)$
  $a_2$   $l(a_2, \theta_1)$   $l(a_2, \theta_2)$
15.2.1: Bayes & Minimax Rules
A "good decision" is one with smaller risk.
But what if $R(\theta_1, d_1) < R(\theta_1, d_2)$ while $R(\theta_2, d_1) > R(\theta_2, d_2)$? Trouble: neither rule dominates the other.
To get around this, use either a Minimax or a Bayes rule:
• Minimax rule (minimize the maximum risk): $d^* = \arg\min_d \max_\theta R(\theta, d)$
• Bayes rule (minimize the Bayes risk): $d^* = \arg\min_d E\,R(\Theta, d)$, where $E\,R(\Theta, d)$, the risk averaged over a prior distribution on $\Theta$, is the Bayes risk.
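For a finite problem both rules reduce to simple operations on the risk matrix. A sketch with hypothetical risk numbers and a hypothetical prior (none of these values come from the text):

```python
import numpy as np

# Rows = decision rules d1, d2; columns = states theta1, theta2.
# Entries are risks R(theta, d); the numbers are illustrative only.
R = np.array([[1.0, 4.0],    # d1
              [2.0, 2.5]])   # d2

# Minimax rule: minimize the worst-case (max over theta) risk.
minimax_idx = np.argmin(R.max(axis=1))   # d2: max risk 2.5 beats d1's 4.0

# Bayes rule: minimize the prior-weighted average risk.
prior = np.array([0.8, 0.2])             # P(theta1), P(theta2) -- hypothetical prior
bayes_risk = R @ prior                   # [1.6, 2.1]
bayes_idx = np.argmin(bayes_risk)        # d1 under this prior
print(minimax_idx, bayes_idx)
```

Note that the two criteria can disagree, as here: the minimax rule guards against the worst state, while the Bayes rule exploits the prior's belief that $\theta_1$ is far more likely.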
Classical Stat. vs. Bayesian Stat.
Classical (or frequentist): parameters are unknown but fixed, to be estimated from the data.
Data + Model → Inference and/or Prediction
Bayesian: parameters are random variables; the data and a prior are combined to estimate the posterior.
The same picture as above, with some prior information added:
Data + Model + Prior Information → Inference and/or Prediction
15.2.2: Posterior Analysis
Bayesians look at the parameter as a random variable $\Theta$ with prior distribution $g(\theta)$ and a posterior distribution
$$h(\theta \mid X = x) \propto f(X = x \mid \theta)\, g(\theta)$$

Theorem A: If $d_0(x) = \arg\min_d E\,l(\Theta, d(x))$, where
$$E\,l(\Theta, d(x)) = \sum_\theta l(\theta, d(x))\, h(\theta \mid x) \quad \text{(discrete case)}$$
$$\text{or} \quad E\,l(\Theta, d(x)) = \int l(\theta, d(x))\, h(\theta \mid x)\, d\theta \quad \text{(continuous case)}$$
is the posterior risk of the action $a = d(x)$, i.e. its expected loss under the posterior, then $d_0(x)$ is the Bayes rule.
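The discrete case of Theorem A is a few lines of arithmetic: form the posterior by normalizing prior times likelihood, then pick the action with the smallest posterior expected loss. A sketch with two states, two actions, and 0-1 loss (the prior and likelihood values are hypothetical):

```python
import numpy as np

# Posterior: h(theta | x) proportional to f(x | theta) * g(theta).
prior = np.array([0.7, 0.3])     # g(theta1), g(theta2) -- hypothetical
like = np.array([0.2, 0.6])      # f(x | theta1), f(x | theta2) for the observed x

post = prior * like
post /= post.sum()               # normalize: [0.4375, 0.5625]

# Loss l(theta, a): rows = states, columns = actions a1, a2 (0-1 loss).
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])

posterior_risk = post @ L        # expected loss of each action under the posterior
bayes_action = np.argmin(posterior_risk)   # a2: risk 0.4375 < 0.5625
print(post, bayes_action)
```

With 0-1 loss the Bayes rule simply picks the state with the highest posterior probability, which is why this machinery also drives classification (next section).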
15.2.3: Classification & Hypothesis Testing
Wish: classify an element as belonging to one of the classes partitioning a population of interest. E.g., an utterance is classified by a computer as one of the words in its dictionary via sound measurements.
Hypothesis testing can be seen as a classification problem with a constraint on the probability of misclassification (the probability of a type I error).
Neyman-Pearson Lemma: Let $d$ be a test of $H_0 : X \sim f(x \mid \theta_1)$ vs. $H_1 : X \sim f(x \mid \theta_2)$ with acceptance region $\{x : f(x \mid \theta_2)/f(x \mid \theta_1) < c\}$ and significance level $\alpha$. Let $d^*$ be another test at a significance level $\alpha^* \le \alpha$. Then the power of $d^*$ is less than or equal to the power of $d$.
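The lemma is easy to see on a small discrete model: among all tests of a given level, rejecting where the likelihood ratio is largest maximizes the power. A sketch with hypothetical pmfs on four points, where several non-randomized tests happen to share the same level:

```python
import numpy as np

# Hypothetical pmfs on x = 0, 1, 2, 3 (not from the text).
f0 = np.array([0.4, 0.2, 0.2, 0.2])   # f(x | theta1), the null
f1 = np.array([0.1, 0.2, 0.3, 0.4])   # f(x | theta2), the alternative

ratio = f1 / f0                        # [0.25, 1.0, 1.5, 2.0]

# Likelihood ratio test: reject where the ratio is largest, i.e. {x = 3}.
lrt_level = f0[3]                      # P(reject | H0) = 0.2
lrt_power = f1[3]                      # P(reject | H1) = 0.4

# Competing tests at the same level 0.2: reject on {x = 1} or {x = 2}.
for xr in (1, 2):
    assert f0[xr] == lrt_level         # same significance level...
    assert f1[xr] <= lrt_power         # ...but power never exceeds the LRT's
print(lrt_level, lrt_power)
```

Rejecting on $\{x = 2\}$ gives power 0.3 and on $\{x = 1\}$ only 0.2, both below the likelihood ratio test's 0.4, exactly as the lemma promises.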
15.2.4: Estimation
Theorem A: Under squared error loss, the Bayes (rule) estimate is the mean of the posterior distribution.
Proof: Recall that $l(\theta, d) = (\theta - \hat\theta)^2$ is the squared error loss, where $\hat\theta$ estimates $\theta$. The posterior risk is
$$E[(\Theta - \hat\theta)^2 \mid X = x] = \mathrm{Var}(\Theta \mid X = x) + [E(\Theta \mid X = x) - \hat\theta]^2,$$
the sum of the posterior variance, which does not depend on $\hat\theta$, and a squared bias term, which is minimized by $\hat\theta = E(\Theta \mid X = x)$.
Thus the Bayes rule $\hat\theta = E(\Theta \mid X = x)$ is the mean of the posterior distribution:
Continuous case: $\hat\theta = \int \theta\, h(\theta \mid x)\, d\theta$
Discrete case: $\hat\theta = \sum_\theta \theta\, h(\theta \mid x)$
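The minimization in the proof can be checked numerically: discretize a posterior, sweep candidate estimates, and confirm the posterior expected squared loss bottoms out at the posterior mean. A sketch using a symmetric (hypothetical, unnormalized) posterior on a grid:

```python
import numpy as np

# Discretized posterior on [0, 1]; the shape theta*(1-theta) is an
# arbitrary illustrative choice, symmetric about 0.5.
theta = np.linspace(0.0, 1.0, 1001)
h = theta * (1.0 - theta)
h /= h.sum()                              # normalize to a discrete posterior

post_mean = np.sum(theta * h)             # = 0.5 by symmetry

# Sweep candidate estimates c and compute posterior expected squared loss.
candidates = np.linspace(0.0, 1.0, 501)
risk = [np.sum((theta - c) ** 2 * h) for c in candidates]
best = candidates[int(np.argmin(risk))]   # minimizer of the posterior risk
print(post_mean, best)                    # both 0.5
```

The minimizer of the swept risk coincides with the posterior mean, as Theorem A asserts; swapping in absolute error loss instead would move the minimizer to the posterior median.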
15.2.4: Estimation (example)
Example: A biased coin is thrown once. Let $\theta$ be the probability of heads. What is $\hat\theta$? To reflect the fact that we have no idea how biased the coin is, we put a uniform prior distribution on $\theta$: $g(\theta) = 1$, $0 \le \theta \le 1$.
Let $X = 1$ if a head appears and $X = 0$ if a tail appears; the distribution of $X$ given $\theta$ is
$$f(x \mid \theta) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases}$$
and the posterior distribution is $h(\theta \mid x) = \dfrac{f(x \mid \theta) \cdot 1}{\int_0^1 f(x \mid \theta)\, d\theta}$, so that
$$h(\theta \mid X = 1) = \frac{\theta}{\int_0^1 \theta\, d\theta} = 2\theta \quad \text{and} \quad h(\theta \mid X = 0) = \frac{1 - \theta}{\int_0^1 (1 - \theta)\, d\theta} = 2(1 - \theta).$$
Finally, the Bayes estimate of $\theta$ is
$$\hat\theta = \int_0^1 \theta\, h(\theta \mid X = x)\, d\theta = \begin{cases} 2/3, & \text{if } x = 1 \\ 1/3, & \text{if } x = 0 \end{cases}$$
Caution: the classical MLEs are $1$ and $0$ respectively, and are different from the Bayes estimates.
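The two integrals in the coin example are simple enough to check numerically on a grid, which is also a useful template when the posterior has no closed form. A minimal sketch (the grid size is an arbitrary choice):

```python
import numpy as np

# Verify the coin example: uniform prior on theta, one Bernoulli(theta)
# observation; the posterior mean should be 2/3 for x = 1 and 1/3 for x = 0.
theta = np.linspace(0.0, 1.0, 100_001)

for x, expected in ((1, 2 / 3), (0, 1 / 3)):
    like = theta if x == 1 else 1.0 - theta   # f(x | theta); prior g = 1 cancels
    post = like / like.sum()                  # discretized posterior h(theta | x)
    theta_hat = np.sum(theta * post)          # Bayes estimate = posterior mean
    assert abs(theta_hat - expected) < 1e-4
    print(x, theta_hat)
```

The same loop with a non-uniform prior (multiply `like` by `g(theta)` before normalizing) handles any prior on a bounded grid.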
15.3.1: Bayesian Inference for the Normal Distribution
Theorem A: Assume $\Theta \sim N(\theta_0, \sigma_0^2)$ and $X \mid \theta \sim N(\theta, \sigma^2)$. The posterior distribution of $\Theta$ is
$$\Theta \mid X = x \;\sim\; N\!\left( \frac{\dfrac{\theta_0}{\sigma_0^2} + \dfrac{x}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{1}{\sigma^2}},\; \left( \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2} \right)^{-1} \right)$$
Note: the posterior mean is a weighted average of the prior mean and the data.
If the experiment (the observation of $X$) is much more informative than the prior distribution, that is $\sigma_0^2 \gg \sigma^2$, then $\Theta \mid X = x \approx N(x, \sigma^2)$.
Proof: read page 589 of the textbook.
How is the prior distribution altered by a random sample?
Theorem A (extended to a random sample): Assume $\Theta \sim N(\theta_0, \sigma_0^2)$ and $X_1, \ldots, X_n \mid \theta \overset{iid}{\sim} N(\theta, \sigma^2)$. The posterior distribution of $\Theta$ is
$$\Theta \mid X_1 = x_1, \ldots, X_n = x_n \;\sim\; N\!\left( \frac{\dfrac{\theta_0}{\sigma_0^2} + \dfrac{n\bar{x}}{\sigma^2}}{\dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma^2}},\; \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right)$$
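As $n$ grows, the data term $n\bar{x}/\sigma^2$ dominates and the posterior mean converges to $\bar{x}$: the prior's influence fades at rate $1/n$. A sketch with hypothetical numbers showing that pull:

```python
# Posterior mean of the n-sample normal-normal model for growing n.
# theta0, sig0, sig, xbar are illustrative values, not from the text.
theta0, sig0, sig, xbar = 0.0, 1.0, 1.0, 5.0

for n in (1, 10, 1000):
    prec = 1 / sig0**2 + n / sig**2
    post_mean = (theta0 / sig0**2 + n * xbar / sig**2) / prec
    print(n, round(post_mean, 3))
```

The posterior mean moves from 2.5 (prior and a single observation split the weight evenly here, since the two variances are equal) through about 4.545 at $n = 10$ to nearly $\bar{x} = 5$ at $n = 1000$.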
15.3.2: The Beta Dist'n is a Conjugate Prior to the Binomial
Definition: the probability density function of the Beta distribution with parameters $a > 0$ and $b > 0$ is
$$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}, \quad 0 \le x \le 1.$$
Theorem: $Y \sim \mathrm{Beta}(a, b) \;\Rightarrow\; E(Y) = \dfrac{a}{a + b}$ and $\mathrm{Var}(Y) = \dfrac{ab}{(a + b)^2 (a + b + 1)}$.
Application: assume $p \sim \mathrm{Beta}(a, b)$ and $X \mid p \sim \mathrm{Bin}(n, p)$. The posterior distribution of $p$ is
$$p \mid X = x \;\sim\; \mathrm{Beta}(a + x,\; b + n - x).$$
Since the prior mean is $E(p) = \dfrac{a}{a + b}$, the posterior mean can be written
$$E(p \mid X = x) = \frac{a + x}{a + b + n} = \frac{a + b}{a + b + n} \cdot \frac{a}{a + b} \;+\; \frac{n}{a + b + n} \cdot \frac{x}{n}$$
(the posterior mean is a weighted average of the prior mean and the data).
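The conjugate update and the weighted-average identity are one-liners to verify. A sketch with hypothetical prior parameters and data:

```python
# Beta(a, b) prior + Binomial(n, p) likelihood -> Beta(a + x, b + n - x) posterior.
# The values of a, b, n, x below are illustrative, not from the text.
a, b, n, x = 2.0, 3.0, 20, 14

prior_mean = a / (a + b)              # 0.4
post_mean = (a + x) / (a + b + n)     # 16 / 25 = 0.64

# Weighted-average identity: weight on the prior mean shrinks like 1/n.
w = (a + b) / (a + b + n)             # 0.2
assert abs(post_mean - (w * prior_mean + (1 - w) * x / n)) < 1e-12
print(prior_mean, post_mean)
```

The prior acts like $a + b$ "pseudo-observations": here 5 pseudo-counts against 20 real ones, so the posterior mean 0.64 lies four-fifths of the way from the prior mean 0.4 toward the sample proportion $x/n = 0.7$.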