Bayesian Statistics, Gibbs Sampling and the EM Algorithm

Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm
The Burglar Alarm Problem

A burglar alarm is sensitive to both burglaries and earthquakes. In California, earthquakes happen fairly frequently.

You are at a conference far from California but in phone contact with the alarm: you observe the alarm ring.


A = Alarm rings
Aᶜ = Alarm does not ring
Alarm problem (continued)

The alarm could be due to a burglary or an earthquake (these are assumed not to be observed):

b = 1 if a burglary takes place, 0 otherwise
e = 1 if there is an earthquake, 0 otherwise
Likelihood


The likelihood is concerned with what is observed; here we observe whether or not the alarm goes off:

P(A | b=1, e=1) = .607 (the chance that the alarm goes off given a burglary and an earthquake)
P(A | b=0, e=1) = .135 (the chance that the alarm goes off given no burglary but an earthquake)
P(A | b=1, e=0) = .368 (the chance that the alarm goes off given a burglary but no earthquake)
P(A | b=0, e=0) = .001 (the chance that the alarm goes off given no burglary and no earthquake)
PRIOR
The prior specifies probability distributions over the presence or absence of burglaries and earthquakes:

P(b=1) = .1, P(e=1) = .1;
b and e are mutually independent.

The priors characterize the information available about burglaries and earthquakes.
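
For the calculations and sketches that follow, here is one way to encode the likelihood table and the priors in Python (a minimal sketch; the dictionary layout and variable names are my own, not the slides'):

```python
# P(A | b, e): probability that the alarm rings, indexed by (burglary, earthquake).
p_alarm = {
    (1, 1): 0.607,  # burglary and earthquake
    (1, 0): 0.368,  # burglary, no earthquake
    (0, 1): 0.135,  # earthquake, no burglary
    (0, 0): 0.001,  # neither
}

# Priors: b and e are independent Bernoulli variables.
p_b1 = 0.1  # P(b = 1)
p_e1 = 0.1  # P(e = 1)

# Joint prior P(b, e) under independence, e.g. P(b = 1, e = 1) = .01.
prior = {(b, e): (p_b1 if b else 1 - p_b1) * (p_e1 if e else 1 - p_e1)
         for b in (0, 1) for e in (0, 1)}
print(prior)
```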

Bayes Theorem

Bayes Rule and related results:

P(C | D) = P(C and D) / P(D)   (Conditional Probability Rule)
P(C and D) = P(D) P(C | D)   (Multiplication Rule)

Bayes Theorem (a consequence of the above) serves to combine prior expertise with the likelihood. Suppose 'D' stands for the data and Θ for the parameters. Then:

P(Θ | D) = P(D | Θ) P(Θ) / [ P(D | Θ) P(Θ) + P(D | Θᶜ) P(Θᶜ) ]

where Θᶜ denotes the alternative value of the parameter.
Bayes Theorem
We use Bayes’ theorem to find the probability that there was a
burglary given that the alarm went off and the probability
that there was an earthquake given that the alarm went off.
To do this, we need to make use of two quantities:

a) the likelihood: the probability that the alarm went
off given that the burglary did/didn’t take place and/or the
earthquake did or did not take place.

b) the prior: the probability that the burglary
did/didn’t take place and/or the earthquake did or didn’t
take place.
Bayes Theorem for burglaries

We first update the likelihood relative to
earthquakes and then use Bayes rule to
calculate the probability of interest:
P(A | b=1) = P(A | b=1, e=0) P(e=0) + P(A | b=1, e=1) P(e=1) = .9 * .368 + .1 * .607 = .3919
P(A | b=0) = P(A | b=0, e=0) P(e=0) + P(A | b=0, e=1) P(e=1) = .9 * .001 + .1 * .135 = .0144

P(b=1 | A) = P(A | b=1) P(b=1) / [ P(A | b=1) P(b=1) + P(A | b=0) P(b=0) ]
           = (.3919 * .1) / (.3919 * .1 + .0144 * .9) = .751

So, about 75% of the time when the alarm
goes off, there is a burglary.
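
To reproduce this arithmetic, here is a short Python sketch of the two steps above (using the same self-chosen encoding of the tables as in the earlier sketch):

```python
# Sketch: P(b = 1 | A), following the two steps above.
p_alarm = {(1, 1): 0.607, (1, 0): 0.368, (0, 1): 0.135, (0, 0): 0.001}  # P(A | b, e)
p_b1, p_e1 = 0.1, 0.1  # priors

# Update the likelihood relative to earthquakes: P(A | b) = sum_e P(A | b, e) P(e).
p_A_b1 = p_alarm[(1, 0)] * (1 - p_e1) + p_alarm[(1, 1)] * p_e1  # .3919
p_A_b0 = p_alarm[(0, 0)] * (1 - p_e1) + p_alarm[(0, 1)] * p_e1  # .0144

# Bayes' rule for the burglary indicator.
p_b1_A = p_A_b1 * p_b1 / (p_A_b1 * p_b1 + p_A_b0 * (1 - p_b1))
print(round(p_b1_A, 3))  # 0.751
```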
Updating likelihood relative to earthquakes

[Diagram: the prior probabilities for earthquakes are combined with the likelihood probabilities to give the new likelihood probabilities]
P(A | b=1) = .1 * .607 + .9 * .368 = .3919

Bayes' Law

P(b=1 and A) = .1 * .3919 = .03919
P(b=0 and A) = .9 * .0144 = .01296
P(A) = .03919 + .01296 = .05215

P(b=1 | A) = P(b=1 and A) / P(A) = .03919 / .05215 = .751
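
The same number can be reproduced by forming the joint probabilities first, as on this slide; a brief sketch (the updated likelihoods are taken from the previous slide):

```python
# Sketch: the joint-probability route to P(b = 1 | A).
p_A_b1, p_A_b0 = 0.3919, 0.0144  # P(A | b = 1), P(A | b = 0) from the previous slide
p_b1 = 0.1

joint_b1 = p_A_b1 * p_b1          # P(b = 1 and A) = .03919
joint_b0 = p_A_b0 * (1 - p_b1)    # P(b = 0 and A) = .01296
p_A = joint_b1 + joint_b0         # P(A) = .05215

print(round(joint_b1 / p_A, 3))   # P(b = 1 | A) ~ 0.751
```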
Bayes Theorem for earthquakes

We first update the likelihood relative to
burglaries and then calculate the probability
of interest:
P(A | e=1) = P(A | b=1, e=1) P(b=1) + P(A | b=0, e=1) P(b=0) = .1 * .607 + .9 * .135 = .1822
P(A | e=0) = P(A | b=0, e=0) P(b=0) + P(A | b=1, e=0) P(b=1) = .9 * .001 + .1 * .368 = .0377

P(e=1 | A) = P(A | e=1) P(e=1) / [ P(A | e=1) P(e=1) + P(A | e=0) P(e=0) ]
           = (.1822 * .1) / (.1822 * .1 + .0377 * .9) = .349

So, about 35% of the time when the alarm
goes off, there is an earthquake.
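
The analogous earthquake calculation can be checked the same way (again a sketch, using the same self-chosen encoding):

```python
# Sketch: P(e = 1 | A), mirroring the burglary calculation.
p_alarm = {(1, 1): 0.607, (1, 0): 0.368, (0, 1): 0.135, (0, 0): 0.001}  # P(A | b, e)
p_b1, p_e1 = 0.1, 0.1  # priors

# Update the likelihood relative to burglaries: P(A | e) = sum_b P(A | b, e) P(b).
p_A_e1 = p_alarm[(1, 1)] * p_b1 + p_alarm[(0, 1)] * (1 - p_b1)  # .1822
p_A_e0 = p_alarm[(1, 0)] * p_b1 + p_alarm[(0, 0)] * (1 - p_b1)  # .0377

# Bayes' rule for the earthquake indicator.
p_e1_A = p_A_e1 * p_e1 / (p_A_e1 * p_e1 + p_A_e0 * (1 - p_e1))
print(round(p_e1_A, 3))  # 0.349
```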
Expectation Maximization Algorithm

The EM algorithm concerns how to make
inference about parameters in the
presence of latent variables. Such
variables are used to indicate the state of
a process to which an observation
belongs. Inference is based on
estimating these parameters.
EM algorithm for the alarm problem

Let b, e be latent variables. We now represent the prior on these latent variables by

π(b, e) = ξ₁^b (1 - ξ₁)^(1-b) ξ₂^e (1 - ξ₂)^(1-e)

The EM algorithm yields estimates of the parameters by maximizing:

Q = (log ξ₁) P(b=1 | A) + (log(1 - ξ₁)) P(b=0 | A) + (log ξ₂) P(e=1 | A) + (log(1 - ξ₂)) P(e=0 | A)
EM estimate

The EM estimates are

ξ₁ = P(b=1 | A) = .751 and ξ₂ = P(e=1 | A) = .349.
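
As a numerical sanity check, the sketch below (not from the slides; the grid search is just one convenient way to maximize Q) confirms that the maximizer of Q is exactly this pair of posterior probabilities:

```python
import numpy as np

# Posterior probabilities from the Bayes-rule slides, used as fixed weights in Q.
p_b1_A, p_e1_A = 0.751, 0.349

def Q(xi1, xi2):
    """The Q function from the previous slide."""
    return (np.log(xi1) * p_b1_A + np.log(1 - xi1) * (1 - p_b1_A)
            + np.log(xi2) * p_e1_A + np.log(1 - xi2) * (1 - p_e1_A))

# Q separates in xi1 and xi2, so a coordinate-wise grid search recovers the
# Bernoulli maximizers xi1 = P(b = 1 | A) and xi2 = P(e = 1 | A).
grid = np.linspace(0.001, 0.999, 999)
xi1_hat = grid[np.argmax(Q(grid, 0.5))]
xi2_hat = grid[np.argmax(Q(0.5, grid))]
print(round(float(xi1_hat), 3), round(float(xi2_hat), 3))  # 0.751 0.349
```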
MCMC

Gibbs sampling is one example of Markov Chain Monte Carlo (MCMC). The idea behind MCMC is to simulate a posterior distribution by visiting the values of the parameters in proportion to their posterior probability. In Gibbs sampling, visits depend entirely on the conditional posterior probabilities. Another form of MCMC is Metropolis-Hastings (MH); in MH, visits depend on a Markov kernel.
Gibbs Sampling (MCMC)

Gibbs sampling is an iterative algorithm which
successively visits the parameter values of b
and e in proportion to their posterior
probabilities.
GIBBS SAMPLER

1. Simulate b, e according to their priors.

2. For the given b, simulate e as a Bernoulli variable with probability P(e=1 | b, A), where

P(e=1 | b, A) = P(A | b, e=1) P(e=1) / [ P(A | b, e=1) P(e=1) + P(A | b, e=0) P(e=0) ]
             = .155 if b = 1
             = .938 if b = 0
Gibbs sampling (continued)

3. For the e obtained in step 2, simulate b from P(b=1 | e, A), where

P(b=1 | e, A) = P(A | b=1, e) P(b=1) / [ P(A | b=1, e) P(b=1) + P(A | b=0, e) P(b=0) ]
             = .333 if e = 1
             = .976 if e = 0

4. Iterate steps 2 and 3. The proportion of times that b=1 in the chain approximates P(b=1 | A); the proportion of times that e=1 approximates P(e=1 | A).
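
Here is a minimal Python sketch of steps 1 through 4 (the 100,000-iteration run length and all names are my own choices); it also recomputes the conditional probabilities .155, .938, .333, and .976 from the likelihood and prior tables:

```python
import random

random.seed(0)

# Likelihood P(A | b, e) and priors from the earlier slides.
p_alarm = {(1, 1): 0.607, (1, 0): 0.368, (0, 1): 0.135, (0, 0): 0.001}
p_b1, p_e1 = 0.1, 0.1

def p_e_given_b(b):
    """Step 2 conditional P(e = 1 | b, A): ~.155 for b = 1, ~.938 for b = 0."""
    num = p_alarm[(b, 1)] * p_e1
    return num / (num + p_alarm[(b, 0)] * (1 - p_e1))

def p_b_given_e(e):
    """Step 3 conditional P(b = 1 | e, A): ~.333 for e = 1, ~.976 for e = 0."""
    num = p_alarm[(1, e)] * p_b1
    return num / (num + p_alarm[(0, e)] * (1 - p_b1))

# Step 1: initialize b and e from their priors.
b = int(random.random() < p_b1)
e = int(random.random() < p_e1)

# Steps 2-4: alternate the conditional draws and track how often each equals 1.
n_iter = 100_000
b_hits = e_hits = 0
for _ in range(n_iter):
    e = int(random.random() < p_e_given_b(b))  # step 2
    b = int(random.random() < p_b_given_e(e))  # step 3
    b_hits += b
    e_hits += e

print(b_hits / n_iter, e_hits / n_iter)  # roughly .751 and .349
```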
Gibbs sampler convergence: cumulative proportion of times that b=1 (burglary) in the sampler

Gibbs sampler convergence: cumulative proportion of times that e=1 (earthquake) in the sampler
Appendix: Derivation of EM

For data D, latent variables Z, and parameters Θ, Bayes theorem shows that

π(Θ | Z, D) π(Z | D) = π(Z | Θ, D) π(Θ | D)

Solving for π(Θ | D) and taking logs, we get:

log π(Θ | D) = log π(Θ | D, Z) + log π(Z | D) - log π(Z | Θ, D)

Integrating both sides against π(Z | D, Θ#), we get:

log π(Θ | D) = ∫ log π(Θ | D, Z) π(Z | D, Θ#) dZ + ∫ log π(Z | D) π(Z | D, Θ#) dZ - ∫ log π(Z | Θ, D) π(Z | D, Θ#) dZ
EM (continued)
It follows that log π(Θ | D) is improved in Θ whenever the first term on the right is improved: the second term does not depend on Θ, and by Jensen's inequality the third term cannot decrease when Θ moves away from Θ#.

The resulting quantity to be maximized in Θ is:

Q(Θ, Θ#) = ∫ log π(Θ | D, Z) π(Z | D, Θ#) dZ
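
Spelling out that step as a worked inequality (my addition, following the standard EM monotonicity argument rather than the slides):

```latex
% From the integrated decomposition above (expectations w.r.t. \pi(Z \mid D, \Theta^{\#})):
\log \pi(\Theta \mid D)
  = Q(\Theta, \Theta^{\#})
  + \int \log \pi(Z \mid D)\, \pi(Z \mid D, \Theta^{\#})\, dZ
  - \int \log \pi(Z \mid \Theta, D)\, \pi(Z \mid D, \Theta^{\#})\, dZ .
% The middle term is free of \Theta, and by Jensen's (Gibbs') inequality the last
% term is minimized at \Theta = \Theta^{\#}; subtracting the identity evaluated at
% \Theta = \Theta^{\#} therefore gives
\log \pi(\Theta \mid D) - \log \pi(\Theta^{\#} \mid D)
  \;\ge\; Q(\Theta, \Theta^{\#}) - Q(\Theta^{\#}, \Theta^{\#}),
% so any \Theta that increases Q cannot decrease the log posterior.
```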