Class Slides 4


Sampling Bayesian Networks
ICS 276
2007
1
Answering BN Queries

Probability of evidence P(e): NP-hard
Conditional probability P(xi|e): NP-hard
MPE x* = arg max P(x|e): NP-hard
MAP y* = arg max P(y|e), y ⊆ x: NP^PP-hard
Approximating P(e) or P(xi|e) within relative error ε: NP-hard
2
Approximation Algorithms
Structural Approximations
  Eliminate some dependencies
  Remove edges
  Mini-Bucket Approach
Search
  Approach for optimization tasks: MPE, MAP
Sampling
  Generate random samples and compute values of interest from the samples, not the original network
3
Algorithm Tree
4
Sampling

Input: Bayesian network with set of nodes X
Sample = a tuple with assigned values s = (X1=x1, X2=x2, …, Xk=xk)
A tuple may include all variables (except evidence) or a subset
Sampling schemas dictate how to generate samples (tuples)
Ideally, samples are distributed according to P(X|E)
5
Sampling Fundamentals
Given a set of variables X = {X1, X2, …, Xn} that represent a joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):

E[g] = ∫ g(x) Π(x) dx
6
Sampling From Π(X)

A sample S^t is an instantiation:

S^t = {x1^t, x2^t, …, xn^t}

Given independent, identically distributed (iid) samples S^1, S^2, …, S^T from Π(X), it follows from the Strong Law of Large Numbers that:

ĝ = (1/T) Σ_{t=1..T} g(S^t)
7
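For intuition, this expectation can be approximated by averaging g over draws from Π. A minimal sketch in Python; the callables sample_pi and g are illustrative placeholders, not notation from the slides:

```python
import random

def monte_carlo_expectation(sample_pi, g, T=10000):
    """Estimate E[g(X)] under Pi by averaging g over T i.i.d. samples from Pi."""
    return sum(g(sample_pi()) for _ in range(T)) / T

# Toy check: Pi is a fair coin over {0, 1} and g is the identity,
# so the estimate should be close to 0.5.
print(monte_carlo_expectation(lambda: random.randint(0, 1), lambda x: x))
```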
Sampling Basics

Given random variable X, D(X) = {0, 1}
Given P(X) = {0.3, 0.7}
Generate k samples: 0,1,1,1,0,1,1,0,1
Approximate P'(X):

P'(X=0) = #samples(X=0) / #samples = 4/10 = 0.4
P'(X=1) = #samples(X=1) / #samples = 6/10 = 0.6
P'(X) = {0.4, 0.6}
8
How to draw a sample?

Given random variable X, D(X) = {0, 1}
Given P(X) = {0.3, 0.7}
Sample X ~ P(X):
  draw random number r ∈ [0, 1]
  If (r < 0.3) then set X = 0
  Else set X = 1
Can generalize to any domain size
9
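A sketch of this procedure, generalized to any domain size as the slide suggests; the helper name sample_discrete is mine:

```python
import random

def sample_discrete(probs):
    """Draw one value from the discrete distribution `probs` over the domain
    {0, 1, ..., len(probs)-1}: draw r uniform in [0, 1) and return the first
    value whose cumulative probability exceeds r."""
    r = random.random()
    cumulative = 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return value
    return len(probs) - 1  # guard against floating-point rounding

# P(X) = {0.3, 0.7}: about 30% of the draws should be 0.
print([sample_discrete([0.3, 0.7]) for _ in range(10)])
```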
Sampling in BN

Same idea: generate a set of T samples
Estimate P(Xi|E) from the samples
Challenge: X is a vector and P(X) is a huge distribution represented by the BN
Need to know:
  How to generate a new sample?
  How many samples T do we need?
  How to estimate P(E=e) and P(Xi|e)?
10
Sampling Algorithms

Forward Sampling
Gibbs Sampling (MCMC)
  Blocking
  Rao-Blackwellised
Likelihood Weighting
Importance Sampling
Sequential Monte Carlo (Particle Filtering) in Dynamic Bayesian Networks
11
Forward Sampling

Case with no evidence, E = {}
Case with evidence, E = e
# samples N and error bounds
12
Forward Sampling No Evidence
(Henrion 1988)

Input: Bayesian network X = {X1,…,XN}, N - #nodes, T - #samples
Output: T samples
Process nodes in topological order - first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
13
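A possible rendering of this loop in Python, reusing sample_discrete from the sketch above; the dictionary-based encoding of parents and CPTs is an assumption made for illustration, not the slides' notation:

```python
def forward_sample(order, parents, cpt):
    """Draw one sample by processing nodes in topological `order`.

    parents: dict mapping each variable to a tuple of its parent names
    cpt:     dict mapping each variable to a dict from parent-value tuples
             to a probability list over the variable's domain
    Reuses sample_discrete() from the earlier sketch."""
    sample = {}
    for x in order:                                   # ancestors come first
        pa_values = tuple(sample[p] for p in parents[x])
        sample[x] = sample_discrete(cpt[x][pa_values])
    return sample

# Hypothetical two-node network X1 -> X2 with made-up CPT entries.
parents = {'X1': (), 'X2': ('X1',)}
cpt = {'X1': {(): [0.3, 0.7]},
       'X2': {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}}
samples = [forward_sample(['X1', 'X2'], parents, cpt) for _ in range(1000)]
print(sum(s['X2'] for s in samples) / 1000)           # estimate of P(X2 = 1)
```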
Sampling A Value

What does it mean to sample xi^t from P(Xi | pai)?
  Assume D(Xi) = {0, 1}
  Assume P(Xi | pai) = (0.3, 0.7)

[Figure: unit interval [0,1] split at 0.3; a random point r falls into one of the two segments]

Draw a random number r from [0,1]
If r falls in [0, 0.3], set Xi = 0
If r falls in [0.3, 1], set Xi = 1
14
Forward sampling (example)

[Figure: network with X1 → X2, X1 → X3, and X2, X3 → X4; CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)]

No evidence.
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. Sample x4 from P(x4|x2,x3)
15
Forward Sampling - Answering Queries

Task: given T samples {S1, S2, …, ST}, estimate P(Xi = xi):

P̂(Xi = xi) = #samples(Xi = xi) / T

Basically, count the proportion of samples where Xi = xi.
16
Forward Sampling w/ Evidence

Input: Bayesian network X = {X1,…,XN}, N - #nodes,
E - evidence, T - #samples
Output: T samples consistent with E
1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
4.     If Xi ∈ E and xi^t ≠ ei, reject the sample:
5.       set i = 1 and go to step 2
17
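A hedged sketch of this rejection variant, built on the same assumed encoding and helpers as the forward-sampling sketch above:

```python
def rejection_forward_sample(order, parents, cpt, evidence, max_tries=100000):
    """Forward sampling with evidence: re-sample until a sample agrees with E.
    Same assumed BN encoding (and sample_discrete helper) as the sketch above."""
    for _ in range(max_tries):
        sample = {}
        consistent = True
        for x in order:
            pa_values = tuple(sample[p] for p in parents[x])
            sample[x] = sample_discrete(cpt[x][pa_values])
            if x in evidence and sample[x] != evidence[x]:
                consistent = False        # step 4: reject and restart
                break
        if consistent:
            return sample
    raise RuntimeError("evidence too unlikely: every sample was rejected")
```

The expected number of restarts grows like 1/P(e), which is exactly the rejection-rate drawback discussed on the following slides.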
Forward sampling (example)

Evidence: X3 = 0
[Figure: same network; CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)]

// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. If x3 ≠ 0, reject the sample and start from 1; otherwise
5. Sample x4 from P(x4|x2,x3)
18
Forward Sampling: Illustration
Let Y be a subset of evidence nodes s.t. Y=u
19
Forward Sampling – How many samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1−δ, it is enough to have:

T ≥ (c / (P(y)·ε²)) · (1/δ)

Derived from Chebychev's Bound:

P( P̂(y) ∉ [P(y)−ε, P(y)+ε] ) ≤ 2e^(−2Nε²)
20
Forward Sampling - How many samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1−δ, it is enough to have:

T ≥ (4 / (P(y)·ε²)) · ln(2/δ)

Derived from Hoeffding's Bound (full proof is given in Koller):

P( P̂(y) ∉ [P(y)−ε, P(y)+ε] ) ≤ 2e^(−2Nε²)
21
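As a worked example of the bound just stated (assuming the reconstruction above), a small helper that returns the required number of samples:

```python
import math

def samples_needed(p_y, eps, delta):
    """Hoeffding-based bound from the slide: T >= 4/(P(y)*eps^2) * ln(2/delta)."""
    return math.ceil(4.0 / (p_y * eps ** 2) * math.log(2.0 / delta))

# Relative error 0.1 with probability 0.95 when P(y) is roughly 0.2:
print(samples_needed(0.2, 0.1, 0.05))   # about 7378 samples
```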
Forward Sampling: Performance

Advantages:
  P(xi | pa(xi)) is readily available
  Samples are independent!
Drawbacks:
  If evidence E is rare (P(e) is low), then we will reject most of the samples!
  Since P(y) in the bound on T is unknown, we must estimate P(y) from the samples themselves!
  If P(e) is small, T will become very big!
22
Problem: Evidence

Forward Sampling
  High rejection rate
Fix evidence values:
  Gibbs sampling (MCMC)
  Likelihood Weighting
  Importance Sampling
23
Forward Sampling Bibliography

{henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp. 149-163, 1988.
24
Gibbs Sampling

Markov Chain Monte Carlo method
(Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)

Samples are dependent and form a Markov chain
Sample from P'(X|e), which converges to P(X|e)
Guaranteed to converge when all P > 0
Methods to improve convergence:
  Blocking
  Rao-Blackwellised
Error bounds:
  Lag-t autocovariance
  Multiple chains, Chebyshev's inequality
25
Gibbs Sampling (Pearl, 1988)

A sample t ∈ [1, 2, …] is an instantiation of all variables in the network:

x^t = {X1 = x1^t, X2 = x2^t, …, XN = xN^t}

Sampling process:
  Fix values of observed variables e
  Instantiate node values in sample x^0 at random
  Generate samples x^1, x^2, …, x^T from P(x|e)
  Compute posteriors from the samples
26
Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t by processing all variables in some order:

X1 ← x1^{t+1} ~ P(x1 | x2^t, x3^t, …, xN^t, e)
X2 ← x2^{t+1} ~ P(x2 | x1^{t+1}, x3^t, …, xN^t, e)
…
XN ← xN^{t+1} ~ P(xN | x1^{t+1}, x2^{t+1}, …, x_{N-1}^{t+1}, e)

In short, for i = 1 to N:

Xi ← xi^{t+1} sampled from P(xi | x^t \ xi, e)
27
Gibbs Sampling (cont'd)
(Pearl, 1988)

Important: P(xi | x^t \ xi) = P(xi | markov^t \ xi):

P(xi | x^t \ xi) = P(xi | pai) · Π_{Xj ∈ chi} P(xj | paj)

Markov blanket:

M(Xi) = pai ∪ chi ∪ (∪_{Xj ∈ chi} paj)

Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes.
28
Ordered Gibbs Sampling Algorithm

Input: X, E
Output: T samples {x^t}

Fix evidence E
Generate samples from P(X | E):
1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     Xi ← sample xi^t from P(Xi | markov^t \ Xi)
29
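A minimal sketch of the ordered Gibbs sampler under the same assumed dict-based BN encoding as the earlier sketches, plus an assumed children map; the Markov-blanket conditional follows slide 28, the burn-in handling anticipates slide 40, and all probabilities are assumed strictly positive:

```python
import random

def gibbs_conditional(x, var, domains, parents, children, cpt):
    """P(var | Markov blanket of var) for each value of var, computed as
    P(var | pa_var) * prod over children j of P(x_j | pa_j), then normalized
    (slide 28). Same assumed dict-based encoding as the earlier sketches."""
    weights = []
    for v in domains[var]:
        x[var] = v
        w = cpt[var][tuple(x[p] for p in parents[var])][v]
        for child in children[var]:
            w *= cpt[child][tuple(x[p] for p in parents[child])][x[child]]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def ordered_gibbs(order, domains, parents, children, cpt, evidence, T, burn_in=100):
    """Ordered Gibbs sampler: clamp evidence, start from a random assignment,
    then repeatedly re-sample every non-evidence variable in a fixed order."""
    x = {v: evidence.get(v, random.choice(domains[v])) for v in order}
    samples = []
    for t in range(T + burn_in):
        for var in order:
            if var in evidence:
                continue
            probs = gibbs_conditional(x, var, domains, parents, children, cpt)
            x[var] = sample_discrete(probs)
        if t >= burn_in:                  # throw away burn-in samples (slide 40)
            samples.append(dict(x))
    return samples
```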
Answering Queries

Query: P(xi | e) = ?
Method 1: count the # of samples where Xi = xi:

P̂(Xi = xi) = #samples(Xi = xi) / T

Method 2: average probability (mixture estimator):

P̂(Xi = xi) = (1/T) Σ_{t=1..T} P(Xi = xi | markov^t \ Xi)
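Both estimators are easy to compute from a run of the sampler; a short sketch (the helper names are mine, and `conditionals` is assumed to hold the value P(Xi = xi | markov^t \ Xi) recorded at each sampling step):

```python
def histogram_estimate(samples, var, value):
    """Method 1: fraction of samples in which `var` equals `value`."""
    return sum(s[var] == value for s in samples) / len(samples)

def mixture_estimate(conditionals):
    """Method 2: average of the per-sample conditional probabilities
    P(X_i = x_i | markov^t \\ X_i), one number per sample."""
    return sum(conditionals) / len(conditionals)
```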
Gibbs Sampling Example - BN

[Figure: Bayesian network over nodes X1, …, X9]

X = {X1, X2, …, X9}
E = {X9}
31
Gibbs Sampling Example - BN

[Figure: same network over X1, …, X9]

Initial sample x^0:
X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0
32
Gibbs Sampling Example - BN

[Figure: same network over X1, …, X9]

X1 ← P(X1 | x2^0, …, x8^0, x9)
E = {X9}
33
Gibbs Sampling Example - BN

[Figure: same network over X1, …, X9]

X2 ← P(X2 | x1^1, x3^0, …, x8^0, x9)
E = {X9}
34
Gibbs Sampling: Illustration
35
Gibbs Sampling: Burn-In

We want to sample from P(X | E)
But… the starting point is random
Solution: throw away the first K samples
Known as "burn-in"
What is K? Hard to tell. Use intuition.
Alternative: sample the first sample's values from an approximation of P(x|e) (for example, run IBP first)
40
Gibbs Sampling: Convergence

Converges to the stationary distribution π*:

π* = π* P

where P is a transition kernel, p_ij = P(Xi → Xj)
Guaranteed to converge iff the chain is:
  irreducible
  aperiodic
  ergodic (∀ i,j: p_ij > 0)
41
Irreducible

A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).
In other words, ∀ i,j ∃ k: P^(k)_ij > 0, where k is the number of steps taken to get to state j from state i.
42
Aperiodic

Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for ∀ i, then the chain is aperiodic.
43
Ergodicity

A recurrent state is a state to which the chain returns with probability 1:

Σ_n P^(n)_ii = ∞

Recurrent, aperiodic states are ergodic.
Note: an extra condition for ergodicity is that the expected recurrence time is finite. This holds for recurrent states in a finite-state chain.
44
Gibbs Convergence

Gibbs convergence is generally guaranteed as long as all probabilities are positive!
Intuition for the ergodicity requirement: if nodes X and Y are correlated s.t. X=0 ⇔ Y=0, then:
  once we sample and assign X=0, we are forced to assign Y=0;
  once we sample and assign Y=0, we are forced to assign X=0;
  we will never be able to change their values again!
Another problem: it can take a very long time to converge!
46
Gibbs Sampling: Performance

+ Advantage: guaranteed to converge to P(X|E)
- Disadvantage: convergence may be slow
Problems:
  Samples are dependent!
  Statistical variance is too big in high-dimensional problems
47
Gibbs: Speeding Convergence

Objectives:
1. Reduce dependence between samples (autocorrelation)
   Skip samples
   Randomize variable sampling order
2. Reduce variance
   Blocking Gibbs sampling
   Rao-Blackwellisation
48
Skipping Samples

Pick only every k-th sample (Geyer, 1992)
Can reduce dependence between samples!
Increases variance! Wastes samples!
49
Randomized Variable Order

Random-scan Gibbs sampler: pick each next variable Xi to update at random with probability pi, Σ_i pi = 1.
(In the simplest case, the pi are distributed uniformly.)
In some instances this reduces variance (MacEachern, Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction").
50
Blocking

Sample several variables together, as a block.
Example: Given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y,Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

x^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
(y^{t+1}, z^{t+1}) = w^{t+1} ← P(w | x^{t+1})

+ Can improve convergence greatly when two variables are strongly correlated!
- The domain of the block variable grows exponentially with the # of variables in a block!
51
Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993, "Blocking Gibbs Sampling Very Large Probabilistic Expert Systems"

Select a set of subsets E1, E2, E3, …, Ek s.t.
  Ei ⊆ X
  ∪i Ei = X
  Ai = X \ Ei
Sample P(Ei | Ai)
52
Rao-Blackwellisation

Do not sample all variables!
Sample a subset!
Example: Given three variables X, Y, Z, sample only X and Y and sum out Z. Given sample (x^t, y^t), compute the next sample:

x^{t+1} ← P(x | y^t)
y^{t+1} ← P(y | x^{t+1})
53
Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces variance!
54
Blocking vs. Rao-Blackwellisation

[Figure: chain X - Y - Z]

(1) Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y)
(2) Blocking: P(x|y,z), P(y,z|x)
(3) Rao-Blackwellised: P(x|y), P(y|x)

Var3 < Var2 < Var1
[Liu, Wong, Kong, 1994, "Covariance structure of the Gibbs sampler…"]
55
Rao-Blackwellised Gibbs: Cutset Sampling

Select C ⊆ X (possibly a cycle-cutset), |C| = m
Fix evidence E
Initialize nodes with random values:
  For i = 1 to m: assign Ci = ci^0
For t = 1 to n, generate samples:
  For i = 1 to m:
    Ci = ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cm^t, e)
56
Cutset Sampling

Select a subset C = {C1, …, CK} ⊆ X
A sample t ∈ [1, 2, …] is an instantiation of C:

c^t = {C1 = c1^t, C2 = c2^t, …, CK = cK^t}

Sampling process:
  Fix values of observed variables e
  Generate sample c^0 at random
  Generate samples c^1, c^2, …, c^T from P(c|e)
  Compute posteriors from the samples
57
Cutset Sampling
Generating Samples

Generate sample c^{t+1} from c^t:

C1 ← c1^{t+1} ~ P(c1 | c2^t, c3^t, …, cK^t, e)
C2 ← c2^{t+1} ~ P(c2 | c1^{t+1}, c3^t, …, cK^t, e)
…
CK ← cK^{t+1} ~ P(cK | c1^{t+1}, c2^{t+1}, …, c_{K-1}^{t+1}, e)

In short, for i = 1 to K:

Ci ← ci^{t+1} sampled from P(ci | c^t \ ci, e)
58
Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci | c^t \ ci, e)?
  Compute the joint P(ci, c^t \ ci, e) for each ci ∈ D(Ci)
  Then normalize: P(ci | c^t \ ci, e) = α P(ci, c^t \ ci, e)
  Computation efficiency depends on the choice of C
59
Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C?
  Special case: C is a cycle-cutset, O(N)
  General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed
  Pick C wisely so as to minimize w ⇒ notion of w-cutset
60
w-cutset Sampling

C = w-cutset of the network: a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w
Complexity of exact inference: bounded by w!
A cycle-cutset is a special case
61
Cutset Sampling - Answering Queries

Query: ci ∈ C, P(ci|e) = ? Same as Gibbs (special case of w-cutset):

P(ci|e) ≈ (1/T) Σ_{t=1..T} P(ci | c^t \ ci, e)

computed while generating sample t

Query: P(xi|e) = ?

P(xi|e) ≈ (1/T) Σ_{t=1..T} P(xi | c^t, e)

computed after generating sample t
62
Cutset Sampling Example

[Figure: network over X1, …, X9; cutset C = {X2, X5}]

c^0 = {x2^0, x5^0}
E = x9
63
Cutset Sampling Example

Sample a new value for X2:

c^0 = {x2^0, x5^0}

[Figure: network over X1, …, X9]

x2^1 ← P(x2 | x5^0, x9) = α [ BTE(x2', x5^0, x9), BTE(x2'', x5^0, x9) ]

i.e., run BTE for each value of X2 and normalize.
64
Cutset Sampling Example

Sample a new value for X5:

c^0 = {x2^0, x5^0}
x2^1 ← P(x2 | x5^0, x9)

[Figure: network over X1, …, X9]

x5^1 ← P(x5 | x2^1, x9) = α [ BTE(x2^1, x5', x9), BTE(x2^1, x5'', x9) ]

c^1 = {x2^1, x5^1}
65
Cutset Sampling Example

Query P(x2|e) for sampling node X2:

[Figure: network over X1, …, X9]

Sample 1: x2^1 ← P(x2 | x5^0, x9)
Sample 2: x2^2 ← P(x2 | x5^1, x9)
Sample 3: x2^3 ← P(x2 | x5^2, x9)

P(x2 | x9) ≈ (1/3) [ P(x2 | x5^0, x9) + P(x2 | x5^1, x9) + P(x2 | x5^2, x9) ]
66
Cutset Sampling Example

Query P(x3|e) for non-sampled node X3:

[Figure: network over X1, …, X9]

c^1 = {x2^1, x5^1} → P(x3 | x2^1, x5^1, x9)
c^2 = {x2^2, x5^2} → P(x3 | x2^2, x5^2, x9)
c^3 = {x2^3, x5^3} → P(x3 | x2^3, x5^3, x9)

P(x3 | x9) ≈ (1/3) [ P(x3 | x2^1, x5^1, x9) + P(x3 | x2^2, x5^2, x9) + P(x3 | x2^3, x5^3, x9) ]
67
Gibbs: Error Bounds

Objectives:
  Estimate the needed number of samples T
  Estimate the error
Methodology:
  1 chain → use lag-k autocovariance
    Estimate T
  M chains → standard sampling variance
    Estimate error
68
Gibbs: lag-k autocovariance

P_i^t = P(xi | x^t \ xi)

P̄ = P̂(xi|e) = (1/T) Σ_{t=1..T} P(xi | x^t \ xi)

Lag-k autocovariance:

γ(k) = (1/T) Σ_{t=1..T−k} (P_i^t − P̄)(P_i^{t+k} − P̄)

Var(P̄) = (1/T) [ γ(0) + 2 Σ_{i=1..2δ+1} γ(i) ]
69
Gibbs: lag-k autocovariance

Estimate the Monte Carlo variance:

Var(P̄) = (1/T) [ γ(0) + 2 Σ_{i=1..2δ+1} γ(i) ]

Here, δ is the smallest positive integer satisfying:

γ̂(2δ) + γ̂(2δ+1) > 0

Effective chain size:

T̂ = γ(0) / Var(P̄)

In the absence of autocovariance: T̂ = T
70
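A sketch of these estimates in Python; the truncation rule used for δ is one reading of the condition above and is an assumption:

```python
def autocovariance(p, k):
    """Lag-k autocovariance gamma(k) of the recorded values P_i^t."""
    T = len(p)
    mean = sum(p) / T
    return sum((p[t] - mean) * (p[t + k] - mean) for t in range(T - k)) / T

def mc_variance_and_ess(p):
    """Monte Carlo variance of the chain average and effective chain size
    T_hat = gamma(0) / Var(P_bar).  The autocovariance sum is truncated at the
    first point where an adjacent pair gamma(2d) + gamma(2d+1) is no longer
    positive (an assumed reading of the slide's rule for delta)."""
    T = len(p)
    gamma = [autocovariance(p, k) for k in range(T)]
    delta = 1
    while 2 * delta + 1 < T and gamma[2 * delta] + gamma[2 * delta + 1] > 0:
        delta += 1
    var = (gamma[0] + 2 * sum(gamma[1:2 * delta + 2])) / T
    ess = gamma[0] / var if var > 0 else float(T)
    return var, ess
```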
Gibbs: Multiple Chains

Generate M chains of size K
Each chain produces an independent estimate Pm:

Pm = P̂(xi|e) = (1/K) Σ_{t=1..K} P(xi | x^t \ xi)

Estimate P(xi|e) as the average of the Pm(xi|e):

P̄ = (1/M) Σ_{m=1..M} Pm

Treat the Pm as independent random variables.
71
Gibbs: Multiple Chains

The {Pm} are independent random variables. Therefore:

S² = (1/(M−1)) Σ_{m=1..M} (Pm − P̄)² = (1/(M−1)) [ Σ_{m=1..M} Pm² − M·P̄² ]

Var(P̄) ≈ S²/M

Error bound: ε = t_{α/2, M−1} · √(S²/M)
72
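A small helper for the multi-chain estimate and its confidence half-width; it assumes SciPy is available for the Student-t quantile, and the five chain estimates in the example are hypothetical:

```python
import math
from scipy.stats import t as student_t   # assumed available for the t quantile

def multi_chain_interval(chain_estimates, alpha=0.05):
    """Combine M independent per-chain estimates P_m into the overall estimate
    P_bar and the half-width eps = t_{alpha/2, M-1} * sqrt(S^2 / M)."""
    M = len(chain_estimates)
    p_bar = sum(chain_estimates) / M
    s2 = sum((pm - p_bar) ** 2 for pm in chain_estimates) / (M - 1)
    eps = student_t.ppf(1 - alpha / 2, M - 1) * math.sqrt(s2 / M)
    return p_bar, eps

# e.g. five hypothetical chain estimates of P(x_i | e):
print(multi_chain_interval([0.31, 0.29, 0.33, 0.30, 0.32]))
```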
Geman&Geman 1984

Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.

Introduces Gibbs sampling;
Places the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.
73
Tanner&Wong 1987

Tanner and Wong (1987)
  Data augmentation
  Convergence results
74
Pearl 1988

Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan-Kaufmann.

In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable, and the joint distribution is defined by the factorization of the network.
75
Gelfand&Smith, 1990

Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.

Show variance reduction in using the mixture estimator for posterior marginals.
76
Neal, 1992

R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, v. 56, pp. 71-118.

Stochastic simulation in Noisy-OR networks.
77
CPCS54 Test Results

[Charts: CPCS54, n=54, |C|=15, |E|=3; series: Cutset, Gibbs; y-axis: MSE; x-axis: # samples (left) and Time (sec) (right)]

MSE vs. #samples (left) and time (right)
Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4
Exact Time = 30 sec using Cutset Conditioning
78
CPCS179 Test Results

[Charts: CPCS179, n=179, |C|=8, |E|=35; series: Cutset, Gibbs; y-axis: MSE; x-axis: # samples (left) and Time (sec) (right)]

MSE vs. #samples (left) and time (right)
Non-Ergodic (1 deterministic CPT entry)
|X| = 179, |C| = 8, 2 <= D(Xi) <= 4, |E| = 35
Exact Time = 122 sec using Loop-Cutset Conditioning
79
CPCS360b Test Results

[Charts: CPCS360b, n=360, |C|=21, |E|=36; series: Cutset, Gibbs; y-axis: MSE; x-axis: # samples (left) and Time (sec) (right)]

MSE vs. #samples (left) and time (right)
Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36
Exact Time > 60 min using Cutset Conditioning
Exact values obtained via Bucket Elimination
80
Random Networks

[Charts: RANDOM, n=100, |C|=13, |E|=15-20; series: Cutset, Gibbs; y-axis: MSE; x-axis: # samples (left) and Time (sec) (right)]

MSE vs. #samples (left) and time (right)
|X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20
Exact Time = 30 sec using Cutset Conditioning
81
Coding Networks

[Figure: coding network over nodes x1-x4, u1-u4, p1-p4, y1-y4]
[Chart: Coding Networks, n=100, |C|=12-14; series: IBP, Gibbs, Cutset; y-axis: MSE; x-axis: Time (sec)]

MSE vs. time (right)
Non-Ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50
Sample Ergodic Subspace U = {U1, U2, …, Uk}
Exact Time = 50 sec using Cutset Conditioning
82
Non-Ergodic Hailfinder

[Charts: HailFinder, n=56, |C|=5, |E|=1; series: Cutset, Gibbs; y-axis: MSE; x-axis: # samples (left) and Time (sec) (right)]

MSE vs. #samples (left) and time (right)
Non-Ergodic, |X| = 56, |C| = 5, 2 <= D(Xi) <= 11, |E| = 0
Exact Time = 2 sec using Loop-Cutset Conditioning
83
Non-Ergodic CPCS360b - MSE

[Chart: cpcs360b, N=360, |E|=[20-34], w*=20; series: Gibbs, IBP, |C|=26 fw=3, |C|=48 fw=2; y-axis: MSE; x-axis: Time (sec)]

MSE vs. Time
Non-Ergodic, |X| = 360, |C| = 26, D(Xi) = 2
Exact Time = 50 min using BTE
84
Non-Ergodic CPCS360b - MaxErr

[Chart: cpcs360b, N=360, |E|=[20-34]; series: Gibbs, IBP, |C|=26 fw=3, |C|=48 fw=2; y-axis: MaxErr; x-axis: Time (sec)]
85
Likelihood Weighting
(Fung and Chang, 1990; Shachter and Peot, 1990)

"Clamp" the evidence + forward sampling + weight the samples by the evidence likelihood.
Works well for likely evidence!
86
Likelihood Weighting

Sample in topological order over X!

[Figure: chain of nodes; evidence nodes e are clamped, non-evidence nodes are sampled xi ~ P(Xi|pai)]

P(Xi|pai) is a look-up in the CPT!
87
Likelihood Weighting Outline

w^(t) ← 1
For each Xi ∈ X do
  If Xi ∈ E:
    Xi ← ei
    w^(t) ← w^(t) · P(ei | pai)
  Else:
    Xi ← xi^(t) ~ P(Xi | pai)
EndFor
88
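A sketch of this outline together with the estimator from the next slide, under the same assumed BN encoding and sample_discrete helper used in the earlier sketches:

```python
def likelihood_weighting_sample(order, parents, cpt, evidence):
    """One likelihood-weighted sample: clamp evidence variables and multiply
    their CPT entries into the weight; sample all other variables forward."""
    sample, weight = {}, 1.0
    for x in order:
        pa_values = tuple(sample[p] for p in parents[x])
        if x in evidence:
            sample[x] = evidence[x]
            weight *= cpt[x][pa_values][evidence[x]]   # w <- w * P(e_i | pa_i)
        else:
            sample[x] = sample_discrete(cpt[x][pa_values])
    return sample, weight

def lw_posterior(order, parents, cpt, evidence, var, value, T=10000):
    """P_hat(var = value | e): weight of agreeing samples over total weight."""
    num = den = 0.0
    for _ in range(T):
        s, w = likelihood_weighting_sample(order, parents, cpt, evidence)
        den += w
        if s[var] == value:
            num += w
    return num / den if den > 0 else 0.0
```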
Likelihood Weighting

Estimate posterior marginals:

P̂(xi | e) = P̂(xi, e) / P̂(e) = [ Σ_{t=1..T} w^(t) · δ(xi, x^(t)) ] / [ Σ_{t=1..T} w^(t) ]

w^(t) = P(x^(t)) / Q(x^(t)) = Π_j P(ej | paj), since Q(ej | paj) = 1
89
Likelihood Weighting

+ Converges to the exact posterior marginals
+ Generates samples fast
+ Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
- Increasing sampling variance ⇒ convergence may be slow
- Many samples with P(x^(t)) = 0 are rejected
90
Likelihood Convergence
(Chebychev's Inequality)

Assume P(X=x|e) has mean μ and variance σ²
Chebychev:

P( |(μ̂ − μ)/μ| ≥ ε ) ≤ c·σ² / (N ε² μ²) ≤ δ  ⇒  N ≥ c·σ² / (ε² μ² δ)

μ = P(x|e) is unknown => obtain it from the samples!
93
Error Bound Derivation

Chebychev's: P( |X − μ| ≥ kσ ) ≤ 1/k², k ≥ 1
Corollary: If ε = kσ, then P( |X − μ| ≥ ε ) ≤ σ²/ε²

P̂ = P̂(x'|e) = K/T (K = # samples with X = x'; K is a sum of Bernoulli random variables), Var(P̂) = pq/T, q = 1 − p

P( |P̂ − P| ≥ ε ) ≤ (pq/T)/ε² = P(1−P)/(T ε²) ≤ 1/(4 T ε²)

From the Law of Large Numbers: Var(X̄) = σ²/T, so P( |X̄ − μ| ≥ ε ) ≤ σ²/(T ε²)
94
Likelihood Convergence 2

Assume P(X=x|e) has mean μ and variance σ²
Zero-One Estimation Theory (Karp et al., 1989):

T ≥ (4 / (μ ε²)) · ln(2/δ)

μ = P(x|e) is unknown => obtain it from the samples!
95
Local Variance Bound (LVB)
(Dagum&Luby, 1994)

Let Δ be the LVB of a binary-valued network: ∀ l, u ∈ [0,1], l ≤ u, such that
P(x | pa(x)), P(x̄ | pa(x)) ∈ [l, u]  OR  P(x | pa(x)) ∈ [1−u, 1−l]:

Δ = max{ u/l, (1−l)/(1−u) }
96
LVB Estimate
(Pradhan, Dagum, 1996)

Using the LVB, the Zero-One Estimator can be re-written:

T ≥ (4·Δ^k / ε²) · ln(2/δ)
97
Importance Sampling Idea

In general, it is hard to sample from the target distribution P(X|E)
Generate samples from a sampling (proposal) distribution Q(X)
Weigh each sample against P(X|E)

I(f) = ∫ f(x) P(x) dx = ∫ [P(x)/Q(x)] f(x) Q(x) dx
98
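A generic sketch of the weighted estimator; sample_q, weight, and f are user-supplied callables (assumptions, not defined on the slides):

```python
def importance_estimate(sample_q, weight, f, T=10000):
    """Self-normalized importance sampling: draw x ~ Q, weight it by
    w(x) = P(x)/Q(x) (known only up to a constant is fine), and average f
    under those weights."""
    num = den = 0.0
    for _ in range(T):
        x = sample_q()
        w = weight(x)
        num += w * f(x)
        den += w
    return num / den
```

Likelihood weighting is the special case where Q clamps the evidence and samples every other node from its prior conditional.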
Importance Sampling Variants

Importance sampling: forward, non-adaptive
  Nodes sampled in topological order
  Sampling distribution (for non-instantiated nodes) equal to the prior conditionals
Importance sampling: forward, adaptive
  Nodes sampled in topological order
  Sampling distribution adapted according to average importance weights obtained in previous samples [Cheng, Druzdzel 2000]
99
AIS-BN

The most efficient variant of importance sampling to-date is AIS-BN - Adaptive Importance Sampling for Bayesian networks.

Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.
100
Importance vs. Gibbs

Gibbs: x^t ~ p̃(x|e), where p̃(x|e) → p(x|e) as T → ∞

f̂ = (1/T) Σ_{t=1..T} f(x^t)

Importance: x^t ~ q(x|e)

f̂ = (1/T) Σ_{t=1..T} f(x^t) · p(x^t)/q(x^t), with weight w^t = p(x^t)/q(x^t)
101