Bayesian Model Comparison and Occam's Razor


Bayesian Model Comparison and Occam's Razor
Lecture 2
A Picture of Occam’s Razor
Occam’s razor
• "All things being equal, the simplest solution
tends to be the best one," or alternately, "the
simplest explanation tends to be the right one."
In other words, when multiple competing
theories are equal in other respects, the
principle recommends selecting the theory that
introduces the fewest assumptions and
postulates the fewest hypothetical entities. It is in
this sense that Occam's razor is usually
understood.
•
Wikipedia
Copernican versus Ptolemaic View of the Universe
• Copernicus proposed a model of the solar system in which the earth revolved around the sun. Ptolemy (roughly 1,400 years earlier) had proposed a theory of the universe in which planetary bodies revolved around the earth; he used 'epicycles' to explain his theory.
• Copernicus's theory 'won' because it was a simpler framework from which to explain astronomical motion. Epicycles also 'explained' astronomical motion, but they employed an unnecessarily complex framework which could not properly predict such things.
Occam's Razor: Choose Simple Models when Possible
Boxes behind a tree
• In the figure, is there one box or two boxes behind the tree? The one-box theory does not assume complicated coincidences, such as two boxes happening to be of identical height, and it explains the data as we see it. The two-box theory also explains the data as we see it, but only by assuming an unlikely coincidence.
Statistical Models
• Statistical models are designed to describe data X = (x1,...,xn) by postulating that X follows a density f(X|Θ) in a class {f(X|Θ): Θ} (possibly nonparametric). For a given parameter Θ0, we can compare the likelihood of data values X01 vs. X02 via the ratio f(X01|Θ0)/f(X02|Θ0). If this ratio is >1, the first datum is more likely; if <1, the second is more likely.
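
As a quick illustration, here is a minimal sketch of such a likelihood ratio, assuming a standard normal model for f (a hypothetical choice; any density would do):

```python
# Likelihood ratio of two data values under one fixed parameter value.
# Hypothetical model: f(x|theta0) is a normal density with mean theta0 and sd 1.
from scipy.stats import norm

theta0 = 0.0
x01, x02 = 0.5, 2.0  # two candidate data values

ratio = norm.pdf(x01, loc=theta0) / norm.pdf(x02, loc=theta0)
print(ratio)  # about 6.5 (>1), so the first datum is more likely under theta0
```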
Bayesian Model Comparison
• We evaluate statistical models via:
P(M|X) = P(X|M) P(M) / P(X)
• The term ‘P(X|M)’ is the likelihood; the term P(M)
is the prior; the term ‘P(X)’ is the marginal
density of the data.
• When comparing two models M1 and M2, we
need only look at the ratio,
BF(1,2) = [P(X|M1) P(M1)] / [P(X|M2) P(M2)]    (*)
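
In code, the ratio (*) is a one-liner; the sketch below uses made-up numbers in the spirit of the boxes-behind-a-tree example above (both likelihoods equal, so the priors decide):

```python
# Posterior odds (*) for two models; all numbers are hypothetical.
p_x_m1, p_x_m2 = 0.9, 0.9  # P(X|M1), P(X|M2): both explain the data equally well
p_m1, p_m2 = 0.99, 0.01    # priors: two identically tall boxes are a priori unlikely

bf12 = (p_x_m1 * p_m1) / (p_x_m2 * p_m2)
print(bf12)  # 99.0: the one-box model is strongly preferred
```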
Bayesian Model Comparison (continued)
• In comparing the two models, the term P(X|M) measures how well model M explains the data. We see in our tree example that both the one-box and two-box theories explain the data well, so the likelihoods don't help us decide between the one-box and two-box models. But the prior probability P(M1) for the one-box theory is much larger than the prior probability P(M2) for the two-box theory, so we prefer the one-box to the two-box theory. Note that quantities like the MLE have no preference regarding the one-box versus two-box theory.
Model Comparison when Parameters are Present
• If parameters Θ are present, we want to use:
P(X|M) = ∫ P(X|M,Θ) P(Θ|M) dΘ
• This is the average score of the data.
• Calculus shows (see the appendix) that,
∫ P(X|M,Θ) P(Θ|M) dΘ ≈ P(X|M,Θ̂) P(Θ̂|M) (2π)^(d/2) |Σ|^(1/2)

where Θ̂ is the MLE, d is the dimension of Θ, Σ = I^(-1), and
I = −∂² log P(X|M,Θ)/∂Θ² evaluated at Θ = Θ̂.
The Occam factor
• Now, suppose two models M1 and M2 explain the data equally well, so the 'likelihood' scores P(X|M,Θ̂) are similar for the two models. The comparison is then decided by the 'Occam factor' P(Θ̂|M)|Σ|^(1/2), which measures how much of the prior parameter space remains plausible once the data are seen. A model that must tune its parameters finely to explain the data, or that carries extra parameters, receives the smaller Occam factor, and we prefer the other model.
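
To make the approximation concrete, here is a minimal numerical sketch for a one-parameter model, assuming x_i ~ N(Θ, 1) with a standard normal prior on Θ (hypothetical choices); it compares the Laplace formula above against direct quadrature:

```python
# Laplace approximation to P(X|M) versus numerical integration.
# Hypothetical 1-D model: x_i ~ N(theta, 1), prior theta ~ N(0, 1).
import numpy as np
from scipy import integrate
from scipy.stats import norm

x = np.array([0.5, 1.2, 0.8])

def integrand(theta):
    return np.prod(norm.pdf(x, loc=theta)) * norm.pdf(theta)  # P(X|M,theta) P(theta|M)

exact, _ = integrate.quad(integrand, -10, 10)  # the 'average score' by quadrature

# Laplace: the MLE is the sample mean; the observed information is I = n, so Sigma = 1/n.
theta_hat, info = x.mean(), len(x)
laplace = (np.prod(norm.pdf(x, loc=theta_hat)) * norm.pdf(theta_hat)
           * np.sqrt(2 * np.pi) * info ** -0.5)

print(exact, laplace)  # the two agree to within a few percent here
```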
Example of Model Comparison when Parameters are Present
• Say we want to choose between two regression models for a set of bivariate data. The first is a linear model and the second is a polynomial model involving terms up to the fourth power. The second always does a better job of fitting the data than the first. But the Occam factor of the second tends to be smaller than that of the first, because the presence of additional parameters adds posterior uncertainty. Note that classical statistics, comparing the maximized likelihoods alone, always views the second as better than the first.
An example
• The data: (-8,8), (-2,10), (6,11) (see the Regression Example slide below)
• The models under comparison:
  H0: y = β0 + ε
  H1: y = β0 + β1x + ε
• Parameters have simple gaussian priors and σe = 1.
• Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23
• Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/[3σX]) = 0.71×10^-24
• Score[1]/Score[0] = 0.71/15 ≈ 0.05
Example Explained H0
• Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23
• Ȳ is the average of the Y's; φ is the standard gaussian density.
• φ{√3 σY} is the likelihood under the null model (with MLE assignment).
• φ{Ȳ} is the prior under the null model (with MLE assignment).
• (1/√3) is the inverse of the square root of the information.
Example explained H1
• Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/[3σX])
• φ{√3 σY √(1-ρ²)} is the likelihood under the alternative model (under MLE assignment).
• φ{b0} φ{b1} is the prior under the alternative model (under MLE assignment).
• b0, b1 are the usual regression coefficient estimates.
• (1/[3σX]) is the inverse of the square root of the information.
Regression Example
[Figure: the three data points plotted with the fitted regression line]
Classical Statistics falls short
• Comparing the likelihoods (under MLEs) without regard to the Occam factor gives:
• Classical Null Score = φ{√3 σY} = 0.012
• Classical Alt Score = φ{√3 σY √(1-ρ²)} = 0.3146
• On this comparison, the alternative model is to be preferred. But we can see from the picture that its fit isn't too good, and it adds complexity which doesn't serve a good purpose.
Stats for the linear model
• mean(x) = -1.33; mean(y) = 9.66
• σx = 7.02; σy = 1.53
• b0 = 9.9459; b1 = 0.2095
• BINT (coefficient confidence intervals):
  b0: [5.5683, 14.3236]
  b1: [-0.5340, 0.9530]
• R (residuals): -0.2703, 0.4730, -0.2027
• RINT (residual confidence intervals):
  [-3.7044, 3.1638]
  [-5.5367, 6.4827]
  [-2.7783, 2.3729]
• STATS: R² = 0.9276; F = 12.8133; p0 = 0.1734; p1 = 0.3378
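
These numbers, and the two Bayesian scores, can be reproduced from the raw data; a minimal sketch assuming numpy/scipy:

```python
# Reproduce the linear-model statistics and the Bayesian scores above.
import numpy as np
from scipy.stats import norm

x = np.array([-8.0, -2.0, 6.0])
y = np.array([8.0, 10.0, 11.0])
n = len(x)

sx, sy = x.std(ddof=1), y.std(ddof=1)  # 7.02, 1.53
b1 = np.cov(x, y)[0, 1] / sx**2        # 0.2095
b0 = y.mean() - b1 * x.mean()          # 9.9459
r2 = np.corrcoef(x, y)[0, 1] ** 2      # R^2 = 0.9276

phi = norm.pdf  # standard gaussian density

score0 = phi(np.sqrt(n) * sy) * phi(y.mean()) / np.sqrt(n)
score1 = (phi(np.sqrt(n) * sy * np.sqrt(1 - r2)) * phi(b0) * phi(b1)
          / (n * sx))
print(score0, score1, score1 / score0)  # ~1.5e-23, ~0.7e-24, ~0.05
```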
Dice Example
• We roll a die 30 times, getting the face counts [4,4,3,3,7,9]. Is it a fair die? Would you be willing to gamble using it?
• H0: p1 = ... = p6 = 1/6;  H1: p ~ Dir(1,...,1)
• What does chi-squared goodness of fit say? The chi-square p-value is 31%, so we would never reject the null in this case.
• What does Bayes theory say? The score under H0 is
  (30 choose 4,4,3,3,7,9) (1/6)^30 ≈ 7.3×10^-7
Dice Example (continued)
• Under the alternative, the Laplace approximation gives:
  (30 choose 4,4,3,3,7,9) (4/30)^4 (4/30)^4 (3/30)^3 (3/30)^3 (7/30)^7 (9/30)^9 × (7×10^-5 × 5!) ≈ 4×10^-6
  Here 5! = Γ(6) is the Dir(1,...,1) prior density at the MLE, and 7×10^-5 ≈ (2π)^(5/2)|Σ|^(1/2).
• In this case, the Laplace approximation is slightly off. The real answer is 3×10^-6.
• So, roughly, the alternative is about 10 times as likely as the null. This is in accord with our intuition.
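
Under H1 the marginal has a closed Dirichlet-multinomial form, so the 'real answer' can be computed exactly; a minimal sketch assuming scipy:

```python
# Exact scores for the dice example via the Dirichlet-multinomial form.
import numpy as np
from scipy.special import gammaln

counts = np.array([4, 4, 3, 3, 7, 9])
n, k = counts.sum(), len(counts)

# log multinomial coefficient 30!/(4!4!3!3!7!9!)
log_coef = gammaln(n + 1) - gammaln(counts + 1).sum()

# P(X|H0): multinomial probability with every p_i = 1/6
log_h0 = log_coef + n * np.log(1.0 / k)

# P(X|H1): multinomial likelihood integrated against the Dir(1,...,1) prior
log_h1 = log_coef + gammaln(k) + gammaln(counts + 1).sum() - gammaln(n + k)

print(np.exp(log_h1))           # ~3e-6, the exact answer quoted above
print(np.exp(log_h1 - log_h0))  # exact Bayes factor for H1 over H0
```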
Possible Project
• Possible Project: Construct or otherwise obtain bivariate data which are essentially linearly related with noise. Assume the linear and higher-power models have equal prior probability. Calculate the average score for the linear and higher-order models, and show the average score for the linear model is best. (A starting sketch follows.)
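
One possible starting point: with gaussian noise and gaussian priors on the coefficients, the average score P(X|M) has a closed form (y is marginally gaussian), so no integration is needed. A minimal sketch, assuming unit noise and prior variances (hypothetical choices):

```python
# Compare average scores of polynomial regression models of varying degree.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 25)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)  # linear truth plus noise

def avg_score(degree, sigma2=1.0, tau2=1.0):
    # For y = Xb + e with b ~ N(0, tau2*I) and e ~ N(0, sigma2*I),
    # marginally y ~ N(0, sigma2*I + tau2*X X'), a gaussian density.
    X = np.vander(x, degree + 1)  # polynomial design matrix
    cov = sigma2 * np.eye(x.size) + tau2 * (X @ X.T)
    return multivariate_normal.logpdf(y, mean=np.zeros(x.size), cov=cov)

for d in (1, 2, 3, 4):
    print(d, avg_score(d))  # the linear model (d=1) should score highest
```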
Another Possible Project
Generate multinomial data from a distribution with equal p's. For the generated data, determine the chi-squared p-value and compare it to the Bayes factor favoring the null (true) hypothesis; determine how the chi-squared values differ from their Bayes factor counterparts over many simulations. (A starting sketch follows.)
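
A minimal sketch of the simulation, reusing the exact Dirichlet-multinomial marginal from the dice example (scipy assumed):

```python
# Compare chi-squared p-values with exact Bayes factors over many simulations.
import numpy as np
from scipy.stats import chisquare
from scipy.special import gammaln

rng = np.random.default_rng(1)
k, n, sims = 6, 30, 1000

def log_bf_null(counts):
    # log P(X|H0) - log P(X|H1), with H1: p ~ Dir(1,...,1);
    # the multinomial coefficient cancels in the difference.
    log_h0 = counts.sum() * np.log(1.0 / k)
    log_h1 = gammaln(k) + gammaln(counts + 1).sum() - gammaln(counts.sum() + k)
    return log_h0 - log_h1

pvals, log_bfs = [], []
for _ in range(sims):
    counts = rng.multinomial(n, np.full(k, 1.0 / k))
    pvals.append(chisquare(counts).pvalue)
    log_bfs.append(log_bf_null(counts))

# How do the two measures of evidence co-vary?
print(np.corrcoef(pvals, log_bfs)[0, 1])
```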
Appendix: Laplace approximation
• In the usual setting,
m(X) = ∫ f(X|θ) π(θ) dθ = ∫ exp{log f(X|θ)} π(θ) dθ

     ≈ ∫ exp{log f(X|θ̂) − (1/2)(θ−θ̂)′ I (θ−θ̂)} π(θ̂) dθ

     = f(X|θ̂) π(θ̂) ∫ exp{−(1/2)(θ−θ̂)′ I (θ−θ̂)} dθ

     = f(X|θ̂) π(θ̂) (2π)^(d/2) |I|^(−1/2)

The second line is a second-order Taylor expansion of the log likelihood about the MLE θ̂, with I = −∂² log f(X|θ)/∂θ² evaluated at θ̂. The last step follows by multiplying and dividing by (2π)^(d/2)|I|^(−1/2), so that the remaining integrand is a N(θ̂, I^(-1)) density integrating to 1.

Possible Project
• Fill in the mathematical steps in the calculation of the marginal distribution of the data and compare it to the Laplace approximation.