Inference IV: Approximate Inference
Approximate Inference
Slides by Nir Friedman
When can we hope to approximate?
Two situations:
Highly stochastic distributions: "far" evidence is discarded
"Peaked" distributions: improbable values are ignored
Stochasticity & Approximations
Consider a chain X1 → X2 → X3 → … → Xn+1 with
P(Xi+1 = t | Xi = t) = 1 - ε
P(Xi+1 = f | Xi = f) = 1 - ε
Computing the probability of Xn+1 given X1, we get

Even number of flips:
P(X_{n+1} = t \mid X_1 = t) = \sum_{k=0}^{\lfloor n/2 \rfloor} \binom{n}{2k} \varepsilon^{2k} (1-\varepsilon)^{n-2k}

Odd number of flips:
P(X_{n+1} = f \mid X_1 = t) = \sum_{k=0}^{\lfloor (n-1)/2 \rfloor} \binom{n}{2k+1} \varepsilon^{2k+1} (1-\varepsilon)^{n-2k-1}
[Plot of P(Xn = t | X1 = t) as a function of ε (x-axis from 0 to 0.5, y-axis from 0.5 to 1), for n = 5, 10, 20.]
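The decay toward 0.5 can be reproduced numerically. Below is a minimal Python sketch (the function name and the grid of ε values are illustrative, not part of the slides) that evaluates the even-number-of-flips sum above:

```python
from math import comb

def p_stay_true(n, eps):
    """P(X_{n+1} = t | X_1 = t): an even number of the n transitions flips the value."""
    return sum(comb(n, 2 * k) * eps ** (2 * k) * (1 - eps) ** (n - 2 * k)
               for k in range(n // 2 + 1))

for n in (5, 10, 20):
    print(n, [round(p_stay_true(n, eps), 3) for eps in (0.05, 0.1, 0.25, 0.5)])
```

As the plot suggests, for a fixed ε the value approaches 0.5 as n grows, so the influence of X1 fades.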
Stochastic Processes
This behavior of a chain (a Markov process) is called mixing.
In general Bayesian networks there is similar behavior:
if probabilities are far from 0 and 1, then the effect of "far" evidence vanishes (and so it can be discarded in approximations).
“Peaked” distributions
If the distribution is "peaked", then most of the mass is on a few instances.
If we can focus on these instances, we can ignore the rest.
Global conditioning
[Figure: a network over variables A, B, C, D, E, I, J, K, L, M; fixing the values of A and B to a and b.]

P(m) = \sum_a \sum_b \sum_{c, d, e, \ldots, l} P(a, b, c, d, e, \ldots, l, m)

Fixing values at the beginning of the summation can shrink the tables formed by variable elimination; in this way space is traded for time. Special case: choose to fix a set of nodes that "breaks all loops". This method is called cutset conditioning. A toy sketch of the decomposition follows.
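As a toy illustration (the network, CPT numbers, and function names below are invented for this sketch, not taken from the slides): fixing a loop-breaking set {A, B} lets P(m) be computed as a sum of cheap exact computations, one per assignment (a, b).

```python
import itertools

# Invented toy CPTs: roots A and B, C depends on both, M depends on C.
P_A = {0: 0.7, 1: 0.3}
P_B = {0: 0.6, 1: 0.4}
P_C = {(a, b): {1: 0.1 + 0.3 * (a + b), 0: 0.9 - 0.3 * (a + b)}
       for a in (0, 1) for b in (0, 1)}            # P(C | A, B)
P_M = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}   # P(M | C)

def p_m_given(m, a, b):
    """Exact P(m | a, b): with A and B fixed, only C remains to be summed out."""
    return sum(P_C[(a, b)][c] * P_M[c][m] for c in (0, 1))

# Global conditioning: P(m) = sum_{a,b} P(a) P(b) P(m | a, b)
p_m1 = sum(P_A[a] * P_B[b] * p_m_given(1, a, b)
           for a, b in itertools.product((0, 1), (0, 1)))
print(p_m1)
```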
Bounded conditioning
[Figure: fixing the values of A and B.]
By examining only the probable assignments of A and B, we perform several simple computations instead of one complex computation.
Bounded conditioning
P(Y = y, e) ≈ \sum_{(a,b)\ \text{probable}} P(Y = y, e \mid a, b)\, P(a, b)

Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.
Search for highly probable assignments to A, B:
Option 1: select (a, b) with high P(a, b).
Option 2: select (a, b) with high P(a, b | e).
We need to search for such high-mass values, and that can be hard.
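A minimal sketch of this idea (all numbers and the stand-in function are invented for illustration): enumerate cutset assignments (a, b) in order of prior mass, accumulate their contributions, and stop once enough mass has been examined.

```python
# Invented prior over the cutset (A, B) and a stand-in for the easy exact computation.
P_AB = {(0, 0): 0.55, (0, 1): 0.25, (1, 0): 0.15, (1, 1): 0.05}   # P(a, b)

def p_y_e_given(a, b):
    """Stand-in for computing P(Y = y, e | a, b) exactly (e.g., on a cut network)."""
    return 0.3 + 0.1 * a + 0.2 * b

estimate, examined = 0.0, 0.0
for (a, b), p_ab in sorted(P_AB.items(), key=lambda kv: -kv[1]):
    if examined >= 0.9:                 # examine only the most probable assignments
        break
    estimate += p_y_e_given(a, b) * p_ab
    examined += p_ab

print(f"estimate of P(Y = y, e): {estimate:.3f}, unexamined mass <= {1 - examined:.3f}")
```

The unexamined mass 1 - Σ P(a, b) is exactly the quantity the next slide uses for error bars.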
Bounded Conditioning
Advantages:
Combines exact inference within an approximation.
Continuous: more time can be used to examine more cases.
Bounds: the unexamined mass, 1 - \sum_{(a,b)\ \text{probable}} P(a, b), can be used to compute error bars.
Possible problems:
P(a, b) is prior mass, not posterior mass.
If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.
Network Simplifications
In these approaches, we try to replace the original network with a simpler one.
The resulting network allows fast exact methods.
Network Simplifications
Typical simplifications:
Remove parts of the network
Remove edges
Reduce the number of values (value abstraction)
Replace a sub-network with a simpler one
(model abstraction)
These simplifications are often made with respect to the particular evidence and query.
Stochastic Simulation
Suppose our goal is to compute the likelihood of the evidence
P(e) where e is an assignment to some variables in
{X1,…,Xn}.
Assume that we can sample instances <x1,…,xn> according
to the distribution P(x1,…,xn).
What is then the probability that a random sample <x1,…,xn>
satisfies e?
Answer: simply P(e), which is what we wish to compute.
Each sample simulates the tossing of a biased coin with
probability P(e) of “Heads”.
Stochastic Sampling
Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate

\hat{P}(e) = \frac{\#\text{Heads}}{N} = \frac{\sum_i P(e \mid x[i])}{N}

where each P(e | x[i]) is zero or one.
The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
How many samples do we need to get a reliable estimate?
We will not discuss this issue here.
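A minimal sketch of this coin-flip view (the sampler and evidence predicate here are invented placeholders): draw N samples from the joint and report the fraction that satisfy e.

```python
import random

def estimate_p_e(sample_joint, satisfies_e, n_samples=10_000):
    """Monte Carlo estimate of P(e): fraction of joint samples satisfying e."""
    hits = sum(satisfies_e(sample_joint()) for _ in range(n_samples))
    return hits / n_samples

# Tiny placeholder joint: two independent binary variables; evidence e = {X1 = true}.
sample_joint = lambda: {"X1": random.random() < 0.3, "X2": random.random() < 0.6}
print(estimate_p_e(sample_joint, lambda x: x["X1"]))   # should be close to 0.3
```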
Sampling a Bayesian Network
If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network.
Write the distribution using the chain rule, and then sample each variable given its parents.
Logic sampling
[Figure: a network over Burglary (B), Earthquake (E), Alarm (A), Call (C), and Radio (R), with CPTs:
P(b) = 0.03    P(e) = 0.001
P(a | b, e) = 0.98    P(a | b, ¬e) = 0.7    P(a | ¬b, e) = 0.4    P(a | ¬b, ¬e) = 0.01
P(c | a) = 0.8    P(c | ¬a) = 0.05
P(r | e) = 0.3    P(r | ¬e) = 0.001]

Samples are built one variable at a time in the order B, E, A, C, R, each value drawn from its CPT given the values already sampled for its parents; in the slides' animation a single sample grows step by step: b, then b e, then b e a, then b e a c, and finally b e a c r.
Logic Sampling
Let X1, …, Xn be an order of the variables consistent with arc direction
for i = 1, …, n do
    sample xi from P(Xi | pai)
    (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to them)
return x1, …, xn
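A minimal Python sketch of this procedure for the alarm network above. The CPT numbers are read off the slides, the mapping of CPT columns to parent configurations is an assumption, and the function names are illustrative.

```python
import random

def bern(p):
    """Return True with probability p."""
    return random.random() < p

def logic_sample():
    """Forward-sample B, E, A, C, R in an order consistent with the arcs."""
    b = bern(0.03)
    e = bern(0.001)
    a = bern({(True, True): 0.98, (True, False): 0.7,
              (False, True): 0.4, (False, False): 0.01}[(b, e)])
    c = bern(0.8 if a else 0.05)
    r = bern(0.3 if e else 0.001)
    return {"B": b, "E": e, "A": a, "C": c, "R": r}

# Estimate, say, P(C = true) from N complete samples.
N = 100_000
print(sum(logic_sample()["C"] for _ in range(N)) / N)
```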
Logic Sampling
Sampling a complete instance is linear in the number of variables, regardless of the structure of the network.
However, if P(e) is small, we need many samples to get a decent estimate.
Can we sample from P(Xi|e) ?
If the evidence e is in the roots of the Bayesian network, we can sample easily.
If the evidence is in the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the network.
[Figure: a network over B, Z, R, X with evidence A = a at a leaf.]
Likelihood Weighting
Can we ensure that all of our samples satisfy e?
One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.
[Figure: a two-node network X → Y.]
For example: we know Y = 1.
Sample X from P(X), then take Y = 1.
Is this a sample from P(X, Y | Y = 1)? NO.
Likelihood Weighting
Problem: these samples of X are from P(X).
Solution: penalize samples in which P(Y = 1 | X) is small.
We now sample as follows:
Let xi be a sample from P(X).
Let wi = P(Y = 1 | X = xi).
Then

\hat{P}(X = x \mid Y = 1) = \frac{\sum_i w_i\, P(X = x \mid x_i)}{\sum_i w_i}

where P(X = x | x_i) is 1 if x_i = x and 0 otherwise.
Likelihood Weighting
Let X1, …, Xn be an order of the variables consistent with arc direction
w = 1
for i = 1, …, n do
    if Xi = xi has been observed
        w = w · P(Xi = xi | pai)
    else
        sample xi from P(Xi | pai)
return x1, …, xn, and w
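A minimal Python sketch of this procedure on the same alarm network, with evidence A = a and R = r as in the example that follows (same caveats as before: the CPT column ordering is assumed and the names are illustrative).

```python
import random

def bern(p):
    return random.random() < p

# CPTs read off the slides (parent-configuration ordering assumed).
def p_a(b, e): return {(True, True): 0.98, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.01}[(b, e)]
def p_c(a): return 0.8 if a else 0.05
def p_r(e): return 0.3 if e else 0.001

def lw_sample(evidence):
    """One likelihood-weighted sample: visit B, E, A, C, R in topological order;
    observed variables are fixed and the weight accumulates their CPT entries."""
    w, x = 1.0, {}
    for var, prob in (("B", lambda x: 0.03), ("E", lambda x: 0.001),
                      ("A", lambda x: p_a(x["B"], x["E"])),
                      ("C", lambda x: p_c(x["A"])),
                      ("R", lambda x: p_r(x["E"]))):
        p = prob(x)
        if var in evidence:
            x[var] = evidence[var]
            w *= p if evidence[var] else 1 - p
        else:
            x[var] = bern(p)
    return x, w

# Estimate P(B = true | A = true, R = true) as a weighted average.
num = den = 0.0
for _ in range(100_000):
    x, w = lw_sample({"A": True, "R": True})
    num += w * x["B"]
    den += w
print(num / den)
```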
Likelihood Weighting
[Figure: the same alarm network and CPTs as above, now with evidence A = a and R = r.]

Samples are again built in the order B, E, A, C, R, but the observed variables A and R are fixed to their evidence values and the sample's weight accumulates the corresponding CPT entries; in the slides' animation the sample grows b, b e, b e a, b e a c, b e a c r, with the weight growing from 1 to 0.6 and then to 0.6 · 0.3.
Likelihood Weighting
Why does this make sense?
When N is large, we expect to sample about N · P(X = x) samples with x[i] = x, each contributing P(Y = 1 | X = x) to the numerator. Thus,

\frac{\sum_i P(Y = 1 \mid X = x[i])\, P(X = x \mid x[i])}{\sum_i P(Y = 1 \mid X = x[i])} \approx \frac{N\, P(X = x, Y = 1)}{N\, P(Y = 1)} = P(X = x \mid Y = 1)
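A quick numerical check of this argument on the two-variable X → Y example from before (all probability values here are invented for illustration): the weighted estimate converges to the exact posterior.

```python
import random

# Invented model: P(X = 1) = 0.2, P(Y = 1 | X = 1) = 0.9, P(Y = 1 | X = 0) = 0.1.
# Exact posterior: P(X = 1 | Y = 1) = 0.18 / (0.18 + 0.08) = 0.6923...
p_x1 = 0.2
p_y1_given = {1: 0.9, 0: 0.1}

num = den = 0.0
for _ in range(200_000):
    x = int(random.random() < p_x1)   # sample X from its prior
    w = p_y1_given[x]                 # weight: P(Y = 1 | X = x)
    num += w * x                      # counts samples with x[i] = 1, weighted
    den += w
print(num / den)                      # should be close to 0.6923
```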
Summary
Approximate inference is needed for large pedigrees.
We have seen a few methods today; some are suitable for genetic linkage analysis and some are not. There are many other approximation algorithms: variational methods, MCMC, and others.
In next semester's Bioinformatics project course (236524), we will offer projects that seek to implement some of these approximation methods and embed them in the superlink software.