
Inference III: Approximate Inference
Slides by Nir Friedman
Global conditioning
[Figure: a network over the variables A, B, C, D, E, I, J, K, L, M, shown twice: as is, and with the values of A and B fixed ("Fixing value of A & B").]
P(m) = Σ_a Σ_b Σ_c ... Σ_l P(a, b, c, d, e, ..., l, m)
Fixing values at the beginning of the summation can decrease the size of the tables formed by variable elimination; in this way, time is traded for space. Special case: choose to fix a set of nodes that "break all loops". This method is called cutset conditioning. Alternatively, choose to fix some variables from the largest cliques in a clique tree.
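As a minimal illustration of cutset conditioning in Python (the helper functions and variable names are hypothetical, not from the slides): for every assignment to the cutset {A, B}, exact inference on the now loop-free remainder supplies P(m | a, b), and the results are combined weighted by P(a, b).

    from itertools import product

    # Cutset conditioning sketch: enumerate assignments to the cutset and run
    # cheap exact inference on the simplified network for each one.
    def cutset_conditioning(m, cutset_domains, prob_cutset, prob_query_given_cutset):
        total = 0.0
        for values in product(*cutset_domains.values()):
            assignment = dict(zip(cutset_domains.keys(), values))
            # prob_query_given_cutset would run variable elimination on the
            # network with the cutset variables clamped to this assignment.
            total += prob_cutset(assignment) * prob_query_given_cutset(m, assignment)
        return total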
Approximation
- Until now, we examined exact computation
- In many applications, approximations are sufficient
  - Example: P(X = x|e) = 0.3183098861838
  - Maybe P(X = x|e) ≈ 0.3 is a good enough approximation
  - e.g., we take action only if P(X = x|e) > 0.5
- Can we find good approximation algorithms?
Types of Approximations
Absolute error
- An estimate q of P(X = x | e) has absolute error ε if
  P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε
  equivalently
  q - ε ≤ P(X = x|e) ≤ q + ε
[Figure: the interval of width 2ε around q on the [0, 1] probability scale.]
- Absolute error is not always what we want:
  - If P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable
  - If P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise
Types of Approximations
Relative error
- An estimate q of P(X = x | e) has relative error ε if
  P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε)
  equivalently
  q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε)
[Figure: the interval [q/(1 + ε), q/(1 - ε)] around q on the [0, 1] probability scale.]
- Sensitivity of the approximation depends on the actual value of the desired result
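A small sketch contrasting the two notions (the function names and the numbers in the usage lines are mine, chosen to echo the examples above):

    # Check whether an estimate q of a true probability p meets each error criterion.
    def within_absolute_error(q, p, eps):
        return p - eps <= q <= p + eps

    def within_relative_error(q, p, eps):
        return p * (1 - eps) <= q <= p * (1 + eps)

    print(within_absolute_error(0.0011, 0.0001, 0.001))  # True, although q is 11x too large
    print(within_relative_error(0.0011, 0.0001, 0.001))  # False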
Complexity
- Recall, exact inference is NP-hard
- Is approximate inference any easier?
- Construction for exact inference:
  - Input: a 3-SAT problem φ
  - Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
- Suppose that q is a relative error estimate of P(X = t)
- If φ is not satisfiable, then P(X = t) = 0. Hence,
  0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0
  namely, q = 0. Thus, if q > 0, then φ is satisfiable
An immediate consequence:
Thm: Given ε, finding an ε-relative error approximation is NP-hard
Complexity: Absolute error
- We can find absolute error approximations to P(X = x) with high probability (via sampling)
  - We will see such algorithms shortly
- However, once we have evidence, the problem is harder
Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with ε absolute error is NP-hard
Proof
- Recall our construction:
[Figure: the 3-SAT network - the query variables Q1, Q2, Q3, Q4, ..., Qn feed the clause nodes φ1, φ2, φ3, ..., φk, which are combined through the conjunction nodes A1, A2, ... into the output node X.]
Proof (cont.)
- Suppose we can estimate with ε absolute error
- Let p1 ← P(Q1 = t | X = t)
  Assign q1 = t if p1 > 0.5, else q1 = f
- Let p2 ← P(Q2 = t | X = t, Q1 = q1)
  Assign q2 = t if p2 > 0.5, else q2 = f
- …
- Let pn ← P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1)
  Assign qn = t if pn > 0.5, else qn = f
Proof (cont.)
Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment
- Suppose φ is satisfiable
- By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi
Base case:
- If Q1 = t in all satisfying assignments, then
  P(Q1 = t | X = t) = 1
  p1 ≥ 1 - ε > 0.5
  q1 = t
- If Q1 = f in all satisfying assignments, then q1 = f
- Otherwise, the statement holds for either choice of q1
Proof (cont.)
Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment
- Suppose φ is satisfiable
- By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi
Induction argument:
- If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then
  P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1
  pi+1 ≥ 1 - ε > 0.5
  qi+1 = t
- If Qi+1 = f in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then qi+1 = f
Proof (cont.)
- We can efficiently check whether q1, …, qn is a satisfying assignment (linear time)
  - If it is, then φ is satisfiable
  - If it is not, then φ is not satisfiable
- Suppose we have an approximation procedure with ε absolute error
  ⇒ we can determine 3-SAT with n procedure calls
  ⇒ ε absolute error approximation is NP-hard
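A sketch of the reduction's decoding loop in Python, assuming a hypothetical oracle approx_prob(var, value, evidence) whose estimates have absolute error below 0.5 on the construction above (the oracle and the names are not from the slides):

    # Decode a candidate satisfying assignment with n calls to the oracle.
    def decode_assignment(query_vars, approx_prob):
        evidence = {"X": True}
        assignment = {}
        for q in query_vars:                     # Q1, ..., Qn in order
            p = approx_prob(q, True, evidence)   # estimate of P(q = t | X = t, earlier choices)
            assignment[q] = p > 0.5
            evidence[q] = assignment[q]          # condition later calls on this choice
        return assignment                        # then check it against the formula in linear time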
When can we hope to approximate?
Two situations:
- "Peaked" distributions: improbable values are ignored
- Highly stochastic distributions: "far" evidence is discarded
“Peaked” distributions
- If the distribution is "peaked", then most of the mass is on a few instances
- If we can focus on these instances, we can ignore the rest
[Figure: probability mass plotted over instances, with a few instances carrying most of the mass.]
Stochasticity & Approximations
- Consider a chain: X1 → X2 → X3 → ... → Xn
- P(Xi+1 = t | Xi = t) = 1 - ε
  P(Xi+1 = f | Xi = f) = 1 - ε
- Computing the probability of Xn+1 given X1, we get

  P(Xn+1 = t | X1 = t) = Σ_{k=0..⌊n/2⌋} C(n, 2k) ε^(2k) (1 - ε)^(n - 2k)
  P(Xn+1 = f | X1 = t) = Σ_{k=0..⌊(n-1)/2⌋} C(n, 2k+1) ε^(2k+1) (1 - ε)^(n - 2k - 1)

  where C(n, j) is the binomial coefficient.
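A quick Python sketch that evaluates this sum (the function name is mine):

    from math import comb

    # P(X_{n+1} = t | X_1 = t) after n noisy copy steps: the probability of an
    # even number of flips among n independent flips, each with probability eps.
    def prob_end_true(n, eps):
        return sum(comb(n, 2 * k) * eps ** (2 * k) * (1 - eps) ** (n - 2 * k)
                   for k in range(n // 2 + 1))

    # As n grows the value flattens toward 0.5 for any eps in (0, 0.5] - the chain "mixes".
    for n in (5, 10, 20):
        print(n, round(prob_end_true(n, 0.2), 4))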
[Plot of P(Xn = t | X1 = t) as a function of ε (from 0 to 0.5) for n = 5, 10, 20: the curves decrease from 1 toward 0.5 as ε grows, more sharply for larger n.]
Stochastic Processes
- This behavior of a chain (a Markov process) is called mixing. We will return to this as a tool in approximation
- In general networks there is similar behavior
  - If probabilities are far from 0 and 1, then the effect of "far" evidence vanishes (and so it can be discarded in approximations)
Bounded conditioning
[Figure: the earlier network with the values of A and B fixed ("Fixing value of A & B").]
By examining only the probable assignments of A and B, we perform several simple computations instead of a complex one
Bounded conditioning
P(Y = y, e) ≈ Σ_{a,b ∈ Probable} P(Y = y, e | a, b) · P(a, b)

- Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset
- Search for highly probable assignments to A, B
  - Option 1: select a, b with high P(a, b)
  - Option 2: select a, b with high P(a, b | e)
- We need to search for such high-mass values, and that can be hard
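A minimal sketch of this estimator (the helpers prob_cutset(a, b) and joint_given_cutset(y, e, a, b) stand for exact computations on the conditioned network; they are assumptions, not slide content):

    # Bounded conditioning: sum only over the probable cutset assignments,
    # and report how much prior mass was left unexamined (for error bars).
    def bounded_conditioning(y, e, probable_assignments, prob_cutset, joint_given_cutset):
        estimate = 0.0
        covered_mass = 0.0
        for (a, b) in probable_assignments:
            estimate += joint_given_cutset(y, e, a, b) * prob_cutset(a, b)
            covered_mass += prob_cutset(a, b)
        return estimate, 1.0 - covered_mass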
Bounded Conditioning
Advantages:
- Combines exact inference within approximation
- Continuous: more time can be used to examine more cases
- Bounds: the unexamined mass, 1 - Σ_{a,b ∈ Probable} P(a, b), can be used to compute error bars

Possible problems:
- P(a, b) is prior mass, not the posterior
- If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments
Network Simplifications
- In these approaches, we try to replace the original network with a simpler one
- The resulting network allows fast exact methods
Network Simplifications
Typical simplifications:
- Remove parts of the network
- Remove edges
- Reduce the number of values (value abstraction)
- Replace a sub-network with a simpler one (model abstraction)
These simplifications are often made w.r.t. the particular evidence and query
Stochastic Simulation
- Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn)
- What is the probability that a random sample <x1,…,xn> satisfies e?
  - This is exactly P(e)
- We can view each sample as tossing a biased coin with probability P(e) of "Heads"
Stochastic Sampling
- Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate

  P(e) ≈ #Heads / N = (Σ_i P(e | x[i])) / N

- The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability
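A small sketch of this estimator, assuming a hypothetical sampler sample_joint() that draws one complete instance from P(X1,…,Xn) (for example, by the logic-sampling procedure described next) and an evidence dict:

    # Estimate P(e) as the fraction of forward samples that agree with the evidence.
    def estimate_evidence_prob(sample_joint, evidence, num_samples):
        heads = 0
        for _ in range(num_samples):
            x = sample_joint()                        # a dict {variable: value}
            if all(x[var] == val for var, val in evidence.items()):
                heads += 1
        return heads / num_samples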
Sampling a Bayesian Network
- If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?
- Idea: sample according to the structure of the network
  - Write the distribution using the chain rule, and then sample each variable given its parents
Logic sampling (example, shown over five slides)
[Figure: the Burglary/Earthquake network (Burglary and Earthquake are parents of Alarm, Alarm is the parent of Call, Earthquake is the parent of Radio), with CPTs
  P(b) = 0.03,  P(e) = 0.001,
  P(a | b, e) = 0.98,  P(a | b, ¬e) = 0.7,  P(a | ¬b, e) = 0.4,  P(a | ¬b, ¬e) = 0.01,
  P(c | a) = 0.8,  P(c | ¬a) = 0.05,
  P(r | e) = 0.3,  P(r | ¬e) = 0.001.
A sample row <B, E, A, C, R> is filled in one variable per slide, in the order B, E, A, C, R, each value drawn from that variable's CPD given the already-sampled parents.]
Logic Sampling
- Let X1, …, Xn be an order of the variables consistent with arc direction
- for i = 1, …, n do
  - sample xi from P(Xi | pai)
  - (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to them)
- return x1, …, xn
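A runnable Python sketch of this procedure; the network representation (a dict mapping each variable to its parents and a CPT over parent values) is my own choice, with the CPT numbers taken from the example slides above:

    import random

    # Each entry maps a variable to (parents, cpt), where cpt[parent_values] = P(var = True | parents).
    # 'order' must be consistent with arc direction: parents appear before their children.
    def logic_sample(order, network):
        x = {}
        for var in order:
            parents, cpt = network[var]
            p_true = cpt[tuple(x[p] for p in parents)]   # parents are already sampled
            x[var] = random.random() < p_true
        return x

    network = {
        "B": ((), {(): 0.03}),
        "E": ((), {(): 0.001}),
        "A": (("B", "E"), {(True, True): 0.98, (True, False): 0.7,
                           (False, True): 0.4, (False, False): 0.01}),
        "C": (("A",), {(True,): 0.8, (False,): 0.05}),
        "R": (("E",), {(True,): 0.3, (False,): 0.001}),
    }
    sample = logic_sample(["B", "E", "A", "C", "R"], network)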
Logic Sampling
- Sampling a complete instance is linear in the number of variables
  - Regardless of the structure of the network
- However, if P(e) is small, we need many samples to get a decent estimate
Can we sample from P(X1,…,Xn |e)?
- If evidence is in the roots of the network, easily
- If evidence is in the leaves of the network, we have a problem
  - Our sampling method proceeds according to the order of nodes in the graph
- Note: we can use arc reversal to make the evidence nodes roots
  - In some networks, however, this will create exponentially large tables...
Likelihood Weighting
- Can we ensure that all of our samples satisfy e?
- One simple solution:
  - When we need to sample a variable whose value is assigned by e, use the specified value
- For example, in the two-node network X → Y, suppose we know Y = 1
  - Sample X from P(X)
  - Then take Y = 1
- Is this a sample from P(X, Y | Y = 1)?
Likelihood Weighting
- Problem: these samples of X are from P(X)
- Solution: penalize samples in which P(Y = 1 | X) is small
- We now sample as follows:
  - Let x[i] be a sample from P(X)
  - Let w[i] be P(Y = 1 | X = x[i])

  P(X = x | Y = 1) ≈ (Σ_i w[i] · P(X = x | x[i])) / (Σ_i w[i])

  where P(X = x | x[i]) is 1 if x[i] = x and 0 otherwise
Likelihood Weighting
- Why does this make sense?
- When N is large, we expect to sample N · P(X = x) samples with x[i] = x
- Thus,

  (Σ_i P(Y = 1 | X = x[i]) · P(X = x | x[i])) / (Σ_i P(Y = 1 | X = x[i]))
    ≈ (N · P(X = x, Y = 1)) / (N · P(Y = 1)) = P(X = x | Y = 1)

Using different notation to express the same idea: define pj = P(Y = 1 | X = j), and let Nj be the number of samples where X = j. Then

  P(X = j | Y = 1) ≈ (pj · Nj) / (Σ_j pj · Nj)
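A tiny sketch of this estimator for the X → Y example (sample_x and p_y1_given_x are hypothetical placeholders for P(X) and P(Y = 1 | X)):

    # Likelihood-weighted estimate of P(X = x | Y = 1) in the two-node network X -> Y.
    def lw_estimate(x, sample_x, p_y1_given_x, num_samples):
        weighted_hits = 0.0
        total_weight = 0.0
        for _ in range(num_samples):
            xi = sample_x()               # draw x[i] from P(X)
            wi = p_y1_given_x(xi)         # weight w[i] = P(Y = 1 | X = x[i])
            total_weight += wi
            if xi == x:
                weighted_hits += wi
        return weighted_hits / total_weight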
Likelihood Weighting (example, shown over five slides)
[Figure: the same Burglary/Earthquake network and CPTs as in the logic-sampling example, now with the values of Alarm and Radio fixed by the evidence. B, E, and C are sampled from their CPDs as before; at each evidence node the observed value is kept and the sample's weight is multiplied by the corresponding CPT entry. A Weight column accumulates these factors; for the sample shown it reaches 0.6 * 0.3.]
Likelihood Weighting
- Let X1, …, Xn be an order of the variables consistent with arc direction
- w = 1
- for i = 1, …, n do
  - if Xi = xi has been observed
    - w ← w · P(Xi = xi | pai)
  - else
    - sample xi from P(Xi | pai)
- return x1, …, xn, and w
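A runnable sketch of this procedure, reusing the network representation from the logic-sampling sketch above (the representation and the query helper are my own, not from the slides):

    import random

    # Likelihood weighting: clamp evidence variables, multiplying their CPT
    # probabilities into the weight; sample all other variables forward.
    def likelihood_weighted_sample(order, network, evidence):
        x, w = {}, 1.0
        for var in order:
            parents, cpt = network[var]
            p_true = cpt[tuple(x[p] for p in parents)]
            if var in evidence:
                x[var] = evidence[var]
                w *= p_true if evidence[var] else (1.0 - p_true)
            else:
                x[var] = random.random() < p_true
        return x, w

    # Example query: estimate P(query_var = True | evidence) from weighted samples.
    def lw_query(order, network, evidence, query_var, num_samples):
        num = den = 0.0
        for _ in range(num_samples):
            x, w = likelihood_weighted_sample(order, network, evidence)
            den += w
            if x[query_var]:
                num += w
        return num / den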
Likelihood Weighting
What can we say about the quality of the answer?
- Intuitively, the weights of the samples reflect their probability given the evidence. We need to collect a certain mass
- Another factor is the "extremeness" of the CPDs

Thm:
- If P(Xi | Pai) ∈ [l, u] for all CPDs, and
  Σ_i w[i] ≥ (4(1 + ε) / ε²) · ln(u·n / δ)

- then, with probability 1 - δ, the estimate is an ε relative error approximation
END