
Approximate Inference
Edited from Slides by Nir Friedman
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network is NP-hard.
Not surprising, since we can simulate Boolean gates.
Proof
We reduce 3-SAT to Bayesian network computation
Assume we are given a 3-SAT problem:
• Let q1,…,qn be propositions,
• Let φ1,…,φk be clauses, such that φi = li1 ∨ li2 ∨ li3,
  where each lij is a literal over q1,…,qn
• φ = φ1 ∧ … ∧ φk
We will construct a Bayesian network s.t.
P(X = t) > 0 iff φ is satisfiable
[Figure: the reduction network. Root nodes Q1,…,Qn feed clause nodes φ1,…,φk (each clause node has the propositions of its clause as parents); binary AND nodes A1, A2, …, Ak/2-1 combine the clause nodes, ending in the output node X.]
• P(Qi = true) = 0.5
• P(φI = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause φI
• A1, A2, … are simple binary AND gates
• It is easy to check:
  • Polynomial number of variables
  • Each local probability table can be described by a small table (8 parameters at most)
  • P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn
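As a concrete illustration of the construction, here is a minimal Python sketch (not part of the original slides) that builds the deterministic CPT of one clause node and of one AND gate; the function names and the literal encoding are assumptions made for this example.

```python
from itertools import product

def clause_cpt(literals):
    """CPT of a clause node phi_I whose parents are three propositions.

    `literals` is a list of (parent_position, negated) pairs; e.g.
    [(0, False), (1, True), (2, False)] encodes (Q_i OR NOT Q_j OR Q_l).
    Returns a dict mapping each parent assignment (a tuple of booleans) to
    P(phi_I = true | parents): 1.0 iff the assignment satisfies the clause,
    0.0 otherwise. At most 8 entries, as noted on the slide.
    """
    cpt = {}
    for assignment in product([False, True], repeat=3):
        satisfied = any(assignment[pos] != neg for pos, neg in literals)
        cpt[assignment] = 1.0 if satisfied else 0.0
    return cpt

def and_cpt():
    """CPT of a binary AND-gate node with two parents."""
    return {(a, b): 1.0 if (a and b) else 0.0
            for a, b in product([False, True], repeat=2)}

# Example: the clause (q1 OR NOT q2 OR q3)
print(clause_cpt([(0, False), (1, True), (2, False)]))
```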
• Conclusion: a polynomial reduction of 3-SAT to Bayesian network inference
Note: this construction also shows that computing P(X = t) is harder than NP:
• 2ⁿ·P(X = t) is the number of satisfying assignments to φ
  (each assignment to Q1,…,Qn has probability 2⁻ⁿ and yields X = t exactly when it satisfies φ)
• Thus, inference is #P-hard.
Hardness - Notes
• We used deterministic relations in our construction
• The same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5
• Hardness does not mean we cannot solve inference
• It implies that we cannot find a general procedure that works efficiently for all networks
• For particular families of networks, we can have provably efficient procedures
• We have seen such families in the course: HMMs, evolutionary trees.
Approximation
• Until now, we examined exact computation
• In many applications, approximations are sufficient
• Example: P(X = x|e) = 0.3183098861838
• Maybe P(X = x|e) ≈ 0.3 is a good enough approximation
  • e.g., we take action only if P(X = x|e) > 0.5
• Can we find good approximation algorithms?
Types of Approximations
Absolute error
• An estimate q of P(X = x|e) has absolute error ε if
  P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε
  equivalently
  q - ε ≤ P(X = x|e) ≤ q + ε
[Figure: the interval of width 2ε around q on the probability scale from 0 to 1.]
• Absolute error is not always what we want:
  • If P(X = x|e) = 0.0001, then an absolute error of 0.001 is unacceptable
  • If P(X = x|e) = 0.3, then an absolute error of 0.001 is overly precise
Types of Approximations
Relative error
• An estimate q of P(X = x|e) has relative error ε if
  P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε)
  equivalently
  q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε)
[Figure: the interval [q/(1+ε), q/(1-ε)] around q on the probability scale from 0 to 1.]
• Sensitivity of the approximation depends on the actual value of the desired result
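To make the two error notions concrete, here is a minimal Python sketch (illustrative names, not from the slides) that checks each definition and reproduces the point above about small probabilities:

```python
def within_absolute_error(q, p, eps):
    """True if the estimate q has absolute error eps w.r.t. the true value p."""
    return p - eps <= q <= p + eps

def within_relative_error(q, p, eps):
    """True if the estimate q has relative error eps w.r.t. the true value p."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# For a tiny true probability, an absolute error of 0.001 is easy to meet
# but hopeless in relative terms (cf. the 0.0001 example above).
p_true, q_est = 0.0001, 0.0011
print(within_absolute_error(q_est, p_true, 0.001))  # True
print(within_relative_error(q_est, p_true, 0.1))    # False (q is 11x too large)
```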
Complexity
• Exact inference is NP-hard
• Is approximate inference any easier?
• Construction for exact inference:
  • Input: a 3-SAT problem φ
  • Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
Theorem:
Given ε, finding an ε-relative error approximation is NP-hard.
• Suppose that q is a relative error estimate of P(X = t) = 0.
• Then
  0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0,
  namely q = 0. Thus, ε-relative error approximation and exact computation coincide for the value 0.
Complexity: Absolute error
Theorem:
• If ε < 0.5, then finding an estimate of P(X = x|e) with ε absolute error is NP-hard
Proof:
• Recall our construction
[Figure: the reduction network again: roots Q1,…,Qn, clause nodes φ1,…,φk, AND nodes A1,…,Ak/2-1, and output node X.]
Proof (cont.)
• Suppose we can estimate with ε absolute error
• Let p1 ≈ P(Q1 = t | X = t)
  Assign q1 = t if p1 > 0.5, else q1 = f
• Let p2 ≈ P(Q2 = t | X = t, Q1 = q1)
  Assign q2 = t if p2 > 0.5, else q2 = f
• …
• Let pn ≈ P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1)
  Assign qn = t if pn > 0.5, else qn = f
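This procedure can be read as an algorithm. Below is a minimal Python sketch of the decoding loop; the oracle `approx_prob` and its signature are hypothetical stand-ins for an ε-absolute-error inference procedure, not something defined in the slides.

```python
def decode_assignment(n, approx_prob):
    """Greedily build q_1,...,q_n from an epsilon-absolute-error oracle.

    `approx_prob(i, partial)` is assumed to return an estimate, within
    absolute error epsilon < 0.5, of
        P(Q_i = t | X = t, Q_1 = partial[1], ..., Q_{i-1} = partial[i-1]).
    """
    partial = {}
    for i in range(1, n + 1):
        p_i = approx_prob(i, dict(partial))
        partial[i] = (p_i > 0.5)   # q_i = t iff the estimate exceeds 0.5
    # The resulting assignment can then be checked for satisfiability in linear time.
    return partial
```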
Proof (cont.)
Claim: if φ is satisfiable, then q1,…,qn is a satisfying assignment
• Suppose φ is satisfiable
• By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi
Base case:
• If Q1 = t in all satisfying assignments, then
  P(Q1 = t | X = t) = 1
  ⇒ p1 ≥ 1 - ε > 0.5
  ⇒ q1 = t
• If Q1 = f in all satisfying assignments, then q1 = f
• Otherwise, the statement holds for any choice of q1
Proof (cont.)
Induction argument:
• If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then
  P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1
  ⇒ pi+1 ≥ 1 - ε > 0.5
  ⇒ qi+1 = t
• If Qi+1 = f in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then qi+1 = f
• Otherwise, the statement holds for any choice of qi+1.
Proof (cont.)
• We can efficiently check whether q1,…,qn is a satisfying assignment (linear time)
  • If it is, then φ is satisfiable
  • If it is not, then φ is not satisfiable
• Suppose we have an approximation procedure with ε absolute error.
  • We can decide 3-SAT with n procedure calls: we generate an assignment as in the proof and check satisfiability of the resulting assignment in linear time. If a satisfying assignment exists, we showed that this procedure finds one, and if no such assignment exists, it cannot find one.
• Thus, ε absolute error approximation is NP-hard.

When can we hope to approximate?
Two situations:
• “Peaked” distributions: improbable values are ignored
• Highly stochastic distributions: “far” evidence is discarded
  (e.g., far markers in genetic linkage analysis)
[Figure: a sharply peaked probability histogram illustrating a “peaked” distribution; y-axis is probability.]
Stochastic Simulation
• Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn)
• What is the probability that a random sample <x1,…,xn> satisfies e?
  • This is exactly P(e)
• We can view each sample as tossing a biased coin with probability P(e) of “Heads”
Stochastic Sampling
• Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
  P(e) ≈ #Heads / N = Σi P(e | x[i]) / N
  where each P(e | x[i]) is 1 or 0
• The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
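A minimal sketch of this estimator in Python (the function names are illustrative): each sample contributes a 0/1 indicator of whether it satisfies e, and we average.

```python
import random

def estimate_p_of_e(sample_instance, satisfies_e, num_samples=10_000):
    """Monte Carlo estimate of P(e): the fraction of sampled instances
    that satisfy the evidence e.  `sample_instance()` draws one instance
    from P(X1,...,Xn); `satisfies_e(instance)` returns True or False."""
    heads = sum(satisfies_e(sample_instance()) for _ in range(num_samples))
    return heads / num_samples

# Toy check with a biased coin whose true P(e) is 0.3:
print(estimate_p_of_e(lambda: random.random() < 0.3, lambda heads: heads))
```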
Sampling a Bayesian Network
• If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?
• YES: sample according to the structure of the network: sample each variable given its sampled parents
Logic sampling
[Figure: a sequence of slides runs logic sampling on the Burglary/Earthquake/Alarm/Call/Radio network, with CPTs P(b) = 0.03, P(e) = 0.001, P(a | b,e) = 0.98 / 0.7 / 0.4 / 0.01, P(c | a) = 0.8 / 0.05, P(r | e) = 0.3 / 0.001. One sample row (B, E, A, C, R) is filled in one variable at a time, in an order consistent with the arcs: first B, then E, A, C, and finally R.]
Logic Sampling
• Let X1, …, Xn be an order of the variables consistent with arc direction
• for i = 1, …, n do
  • sample xi from P(Xi | pai)
  • (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to them)
• return x1, …, xn
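Here is a minimal Python sketch of this procedure on the alarm-style network from the example above. The representation (variables listed in a topological order, each with a function returning P(var = true | parents)) and the column order assumed for P(a | b,e) are choices made for this illustration, not the slides' notation.

```python
import random

# Toy network in topological order: variable -> (parents, P(var = 1 | parents)).
# CPT numbers follow the example slides; the column order of P(a | b,e) is assumed.
NETWORK = {
    "B": ([],         lambda:       0.03),
    "E": ([],         lambda:       0.001),
    "A": (["B", "E"], lambda b, e:  {(1, 1): 0.98, (1, 0): 0.7,
                                     (0, 1): 0.4,  (0, 0): 0.01}[(b, e)]),
    "C": (["A"],      lambda a:     0.8 if a else 0.05),
    "R": (["E"],      lambda e:     0.3 if e else 0.001),
}

def logic_sample(network):
    """Draw one complete instance: visit the variables in topological order
    and sample each one given the values already drawn for its parents."""
    sample = {}
    for var, (parents, p_true) in network.items():
        p = p_true(*(sample[name] for name in parents))
        sample[var] = 1 if random.random() < p else 0
    return sample

print(logic_sample(NETWORK))   # e.g. {'B': 0, 'E': 0, 'A': 0, 'C': 0, 'R': 0}
```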
Logic Sampling
• Sampling a complete instance is linear in the number of variables
  • Regardless of the structure of the network
• However, if P(e) is small, we need many samples to get a decent estimate
• Can we sample from P(X1,…,Xn | e)?
  • If the evidence is in roots of the network: yes, as before.
  • If the evidence is in leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the graph, so we must retain only those samples that match e, and this might be a rare event.
Likelihood Weighting
• Can we ensure that all of our samples are used?
• One wrong (but fixable) approach:
  • When we need to sample a variable that is assigned a value by e, use that specified value.
• For example: we know Y = 1   [Figure: a two-node network X → Y]
  • Sample X from P(X)
  • Then take Y = 1
  • This is NOT a sample from P(X,Y | Y = 1)!
Likelihood Weighting
• Problem: these samples of X are from P(X)
• Solution:
  • Penalize samples in which P(Y = 1|X) is small
[Figure: the two-node network X → Y]
• We now sample as follows:
  • Let x[i] be a sample from P(X)
  • Let w[i] be P(Y = 1 | X = x[i])
  P(X = x | Y = 1) ≈ Σi w[i]·P(X = x | x[i]) / Σi w[i]
  where each P(X = x | x[i]) is 1 or 0 (it is 1 iff x[i] = x)
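A minimal sketch of this weighted estimator (function and argument names are illustrative):

```python
def weighted_estimate(samples, weights, x):
    """Likelihood-weighting estimate of P(X = x | Y = 1), given samples x[i]
    drawn from P(X) and weights w[i] = P(Y = 1 | X = x[i])."""
    numerator = sum(w for s, w in zip(samples, weights) if s == x)
    return numerator / sum(weights)
```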
Likelihood Weighting
[Figure: a sequence of slides runs likelihood weighting on the same Burglary/Earthquake/Alarm/Call/Radio network, now with evidence A = a and R = r. The non-evidence variables are sampled as before; when an evidence variable is reached, its observed value is used and the sample's weight is multiplied by the corresponding CPT entry, so the weight of the sample shown accumulates as 0.6 and then 0.6 · 0.3.]
Likelihood Weighting
• Let X1, …, Xn be an order of the variables consistent with arc direction
• w = 1
• for i = 1, …, n do
  • if Xi = xi has been observed
    • w ← w · P(Xi = xi | pai)
  • else
    • sample xi from P(Xi | pai)
• return x1, …, xn, and w
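A minimal Python sketch of this procedure, using the same toy network representation as the logic-sampling sketch above (the representation and the small X → Y usage example are illustrative assumptions):

```python
import random

def likelihood_weighted_sample(network, evidence):
    """Generate one weighted sample from a network given as
    {var: (parent_list, p_true_function)} in topological order.
    Observed variables keep their observed value and multiply the weight
    by the corresponding CPT entry; all other variables are sampled."""
    sample, weight = {}, 1.0
    for var, (parents, p_true) in network.items():
        p = p_true(*(sample[name] for name in parents))
        if var in evidence:
            sample[var] = evidence[var]
            weight *= p if evidence[var] else (1.0 - p)
        else:
            sample[var] = 1 if random.random() < p else 0
    return sample, weight

# Toy X -> Y network (illustrative numbers): P(X=1) = 0.5, P(Y=1 | X) = 0.9 or 0.2
toy = {"X": ([], lambda: 0.5), "Y": (["X"], lambda x: 0.9 if x else 0.2)}
pairs = [likelihood_weighted_sample(toy, {"Y": 1}) for _ in range(10_000)]
# Weighted estimate of P(X = 1 | Y = 1); the exact value is 0.45 / 0.55, about 0.818
print(sum(w for s, w in pairs if s["X"] == 1) / sum(w for s, w in pairs))
```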
Likelihood Weighting
• Why does this make sense?
• When N is large, we expect to sample N·P(X = x) samples with x[i] = x
• Thus,
  Σi P(Y = 1 | X = x[i])·P(X = x | x[i]) / Σi P(Y = 1 | X = x[i])
  ≈ N·P(X = x, Y = 1) / (N·P(Y = 1)) = P(X = x | Y = 1)
Likelihood Weighting
What can we say about the quality of the answer?
• Intuitively, the weight of a sample reflects its probability given the evidence. We need to collect enough weight mass for the samples to provide an accurate answer.
• Another factor is the “extremeness” of the CPDs.
Theorem (Dagum & Luby AIJ93):
If P(Xi | Pai) ∈ [l, u] for all local probability tables, and
  Σi w[i] ≥ (4(1 + ε)/ε²) · ln(2un/δ)
then, with probability 1 - δ, the estimate is an ε relative error approximation.
END