Transcript Slides
QUIZ!!
T/F: You can always (theoretically) do BN inference by enumeration. TRUE
T/F: In VE, always first marginalize, then join. FALSE
T/F: VE is a lot faster than enumeration, but not always exact. FALSE
T/F: The running time of VE is independent of the order you pick the r.v.s. FALSE
T/F: The more evidence, the faster VE runs. TRUE
P(X|Y) sums to ... |Y|
P(x|Y) sums to ... ??
P(X|y) sums to ... 1
P(x|y) sums to ... P(x|y)
P(x,Y) sums to ... P(x)
P(X,Y) sums to ... 1
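To make the quiz answers above concrete, here is a tiny numeric check on a made-up joint distribution over two two-valued variables (a sketch; the numbers are illustrative, not from the slides):

```python
# A quick numeric check of the quiz answers above, on a made-up joint
# distribution over X in {x1, x2} and Y in {y1, y2} (values are illustrative).
P = {("x1", "y1"): 0.1, ("x1", "y2"): 0.3,
     ("x2", "y1"): 0.2, ("x2", "y2"): 0.4}

P_Y = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in ("y1", "y2")}
P_X_given_Y = {(x, y): P[(x, y)] / P_Y[y] for (x, y) in P}

print(sum(P_X_given_Y.values()))                          # P(X|Y) sums to |Y| = 2
print(sum(P_X_given_Y[(x, "y1")] for x in ("x1", "x2")))  # P(X|y) sums to 1
print(sum(P[("x1", y)] for y in ("y1", "y2")))            # P(x,Y) sums to P(x) = 0.4
print(sum(P.values()))                                    # P(X,Y) sums to 1
```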
CSE 511a: Artificial Intelligence
Spring 2013
Lecture 17: Bayes’ Nets IV – Approximate Inference (Sampling)
04/01/2013
Robert Pless
Via Kilian Weinberger via Dan Klein – UC Berkeley
Announcements
Project 4 out soon.
Project 3 due at midnight.
Exact Inference
Variable Elimination
General Variable Elimination
Query: P(Q | e1, ..., ek)
Start with initial factors:
Local CPTs (but instantiated by evidence)
While there are still hidden variables (not Q or evidence):
Pick a hidden variable H
Join all factors mentioning H
Eliminate (sum out) H
Join all remaining factors and normalize (see the code sketch below)
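A minimal sketch of the two factor operations this loop relies on, join (pointwise product) and sum out (elimination), using dictionary-keyed factors over Boolean variables. The representation and function names are illustrative, not the course's project code:

```python
from itertools import product

def join(f1, f2):
    """Pointwise-multiply two factors; a factor is a (variables, table) pair."""
    (v1, t1), (v2, t2) = f1, f2
    vars_out = v1 + tuple(x for x in v2 if x not in v1)
    table = {}
    for assignment in product([True, False], repeat=len(vars_out)):
        a = dict(zip(vars_out, assignment))
        table[assignment] = t1[tuple(a[x] for x in v1)] * t2[tuple(a[x] for x in v2)]
    return vars_out, table

def sum_out(var, factor):
    """Eliminate `var` from a factor by summing over its values."""
    vars_in, table = factor
    vars_out = tuple(x for x in vars_in if x != var)
    out = {}
    for assignment, p in table.items():
        key = tuple(v for x, v in zip(vars_in, assignment) if x != var)
        out[key] = out.get(key, 0.0) + p
    return vars_out, out

def normalize(factor):
    """Rescale a factor so its entries sum to 1 (the final step above)."""
    vars_in, table = factor
    z = sum(table.values())
    return vars_in, {k: v / z for k, v in table.items()}
```

Variable elimination then just loops: for each hidden variable, join every factor that mentions it, sum it out, and finally join and normalize whatever is left.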
Example: Variable elimination
Network: attend and study are the parents of prepared; attend, prepared, and fair are the parents of pass.

Priors:  P(at) = 0.8    P(st) = 0.6    P(fa) = 0.9

P(pr = T | at, st):
  at  st | P(pr|at,st)
  T   T  |   0.9
  T   F  |   0.5
  F   T  |   0.7
  F   F  |   0.1

P(pa = T | at, pr, fa):
  pr  at  fa | P(pa|at,pr,fa)
  T   T   T  |   0.9
  T   T   F  |   0.1
  T   F   T  |   0.7
  T   F   F  |   0.1
  F   T   T  |   0.7
  F   T   F  |   0.1
  F   F   T  |   0.2
  F   F   F  |   0.1

Query: What is the probability that a student attends class, given that they pass the exam?
[based on slides taken from UMBC CMSC 671, 2005]
Join study factors

(P(at) = 0.8, P(fa) = 0.9, and the P(pa | at, pr, fa) table are unchanged.)

Join P(pr | at, st) with P(st):

  prep study attend | Original P(pr|at,st)   P(st) | Joint P(pr,st|at)
  T    T     T      |         0.9             0.6  |      0.54
  T    F     T      |         0.5             0.4  |      0.2
  T    T     F      |         0.7             0.6  |      0.42
  T    F     F      |         0.1             0.4  |      0.04
  F    T     T      |         0.1             0.6  |      0.06
  F    F     T      |         0.5             0.4  |      0.2
  F    T     F      |         0.3             0.6  |      0.18
  F    F     F      |         0.9             0.4  |      0.36

  Marginal P(pr|at):  (pr=T, at=T): 0.74   (pr=T, at=F): 0.46   (pr=F, at=T): 0.26   (pr=F, at=F): 0.54
Marginalize out study

(Same joined table as on the previous slide; P(at) = 0.8, P(fa) = 0.9, and the pass CPT are unchanged. The joined factor is now over prepared, study; summing the joint P(pr, st | at) over study gives:)

  Marginal P(pr | at):
  pr  at | P(pr|at)
  T   T  |  0.74
  T   F  |  0.46
  F   T  |  0.26
  F   F  |  0.54
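Assuming the join/sum_out helpers sketched after the variable-elimination algorithm above, this snippet reproduces the join-then-marginalize step for study (CPT values taken from these slides):

```python
T, F = True, False

f_st = (("st",), {(T,): 0.6, (F,): 0.4})                      # P(st)
f_pr = (("pr", "at", "st"),                                   # P(pr | at, st)
        {(T, T, T): 0.9, (T, T, F): 0.5, (T, F, T): 0.7, (T, F, F): 0.1,
         (F, T, T): 0.1, (F, T, F): 0.5, (F, F, T): 0.3, (F, F, F): 0.9})

joined = join(f_pr, f_st)         # factor over (pr, at, st): the joint P(pr, st | at)
marginal = sum_out("st", joined)  # factor over (pr, at): the marginal P(pr | at)
print(marginal)                   # ~0.74, 0.46, 0.26, 0.54 (matches the marginal above)
```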
Remove “study”

Remaining factors: P(at) = 0.8, P(fa) = 0.9, and:

  P(pr | at):
  pr  at | P(pr|at)
  T   T  |  0.74
  T   F  |  0.46
  F   T  |  0.26
  F   F  |  0.54

  P(pa = T | at, pr, fa)  (unchanged):
  pr  at  fa | P(pa|at,pr,fa)
  T   T   T  |   0.9
  T   T   F  |   0.1
  T   F   T  |   0.7
  T   F   F  |   0.1
  F   T   T  |   0.7
  F   T   F  |   0.1
  F   F   T  |   0.2
  F   F   F  |   0.1
Join factors “fair”

(P(at) = 0.8 and the P(pr | at) table are unchanged.)

Join P(pa | at, pre, fa) with P(fa)  (all rows have pa = t):

  pre attend fair | Original P(pa|at,pre,fa)   P(fair) | Joint P(pa,fa|at,pre)
  T   T      T    |          0.9                 0.9   |       0.81
  T   T      F    |          0.1                 0.1   |       0.01
  T   F      T    |          0.7                 0.9   |       0.63
  T   F      F    |          0.1                 0.1   |       0.01
  F   T      T    |          0.7                 0.9   |       0.63
  F   T      F    |          0.1                 0.1   |       0.01
  F   F      T    |          0.2                 0.9   |       0.18
  F   F      F    |          0.1                 0.1   |       0.01

  Marginal P(pa|at,pre):  (pre=T, at=T): 0.82   (pre=T, at=F): 0.64   (pre=F, at=T): 0.64   (pre=F, at=F): 0.19
Marginalize out “fair”

(Same joined table as on the previous slide; P(at) = 0.8 and P(pr | at) are unchanged. The joined factor is now over pass, fair; summing the joint P(pa, fa | at, pre) over fair gives:)

  Marginal P(pa | at, pre)  (pa = t):
  pre attend | P(pa|at,pre)
  T   T      |   0.82
  T   F      |   0.64
  F   T      |   0.64
  F   F      |   0.19
Marginalize out “fair”

Remaining factors: P(at) = 0.8, and:

  P(pr | at):
  pr  at | P(pr|at)
  T   T  |  0.74
  T   F  |  0.46
  F   T  |  0.26
  F   F  |  0.54

  P(pa | at, pre)  (pa = t):
  pre attend | P(pa|at,pre)
  T   T      |   0.82
  T   F      |   0.64
  F   T      |   0.64
  F   F      |   0.19
Join factors “prepared”

(P(at) = 0.8 is unchanged.)

Join P(pa | at, pr) with P(pr | at)  (all rows have pa = t):

  pre attend | Original P(pa|at,pr)   P(pr|at) | Joint P(pa,pr|at)
  T   T      |        0.82             0.74    |      0.6068
  T   F      |        0.64             0.46    |      0.2944
  F   T      |        0.64             0.26    |      0.1664
  F   F      |        0.19             0.54    |      0.1026

  Marginal P(pa|at):  (at=T): 0.7732   (at=F): 0.397
Join factors “prepared”

(Same table as on the previous slide; the joined factor is now over pass, prepared. Summing the joint P(pa, pr | at) over prepared gives:)

  Marginal P(pa | at)  (pa = t):
  attend | P(pa|at)
  T      |  0.7732
  F      |  0.397
Join factors “prepared”

Remaining factors: P(at) = 0.8, and:

  P(pa | at)  (pa = t):
  attend | P(pa|at)
  T      |  0.7732
  F      |  0.397
Join factors

Join P(pa | at) with P(at)  (all rows have pa = T):

  attend | Original P(pa|at)   P(at) | Joint P(pa,at)
  T      |      0.7732          0.8  |    0.61856
  F      |      0.397           0.2  |    0.0794

  Normalized:  P(at|pa):  (at=T): 0.89   (at=F): 0.11
Join factors

(Same table as on the previous slide; the final factor is over attend, pass. Normalizing the joint P(pa, at) answers the query:)

  P(at | pa):  (at=T): 0.89   (at=F): 0.11
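As a sanity check on the whole example, a brute-force inference-by-enumeration sketch over the original CPTs (the variable and name choices are illustrative) recovers the same answer:

```python
from itertools import product

# CPTs from the example; booleans index the + outcome, e.g. P_pr[(at, st)] = P(pr=T | at, st).
P_at, P_st, P_fa = 0.8, 0.6, 0.9
P_pr = {(True, True): 0.9, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.1}
P_pa = {(True, True, True): 0.9, (True, True, False): 0.1,
        (True, False, True): 0.7, (True, False, False): 0.1,
        (False, True, True): 0.7, (False, True, False): 0.1,
        (False, False, True): 0.2, (False, False, False): 0.1}   # keyed by (pr, at, fa)

def prob(value, p_true):
    """P(variable = value) for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1 - p_true

joint = {True: 0.0, False: 0.0}    # unnormalized P(at, pa=T), summing out st, fa, pr
for at, st, fa, pr in product([True, False], repeat=4):
    joint[at] += (prob(at, P_at) * prob(st, P_st) * prob(fa, P_fa)
                  * prob(pr, P_pr[(at, st)]) * P_pa[(pr, at, fa)])

print(joint[True] / (joint[True] + joint[False]))   # P(at=T | pa=T) ~ 0.89, as above
```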
Approximate Inference
Sampling (particle-based method)
Approximate Inference
Sampling – the basics ...
Scrooge McDuck gives you an ancient coin.
He wants to know what P(H) is.
You have no homework, and nothing good is on television – so you toss it 1 million times.
You obtain 700,000x Heads and 300,000x Tails.
What is P(H)?
Sampling – the basics ...
Exactly, P(H) = 0.7
Why?
  P(H) = #Heads / #Tosses = 700,000 / 1,000,000 = 0.7
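A tiny simulation of that estimator, assuming the coin's true bias really is 0.7: the empirical frequency of heads converges to P(H).

```python
import random

true_p_heads = 0.7                 # assumed true bias, matching the slide's answer
tosses = 1_000_000
heads = sum(random.random() < true_p_heads for _ in range(tosses))
print(heads / tosses)              # estimate of P(H); close to 0.7 for large N
```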
Monte Carlo Method
Who is more likely to win? Green or Purple?
What is the probability that green wins, P(G)?
Two ways to solve this:
1. Compute the exact probability.
2. Play 100,000 games and see how many times green wins.
Approximate Inference
Simulation has a name: sampling
Sampling is a hot topic in machine learning, and it's really simple
Basic idea:
Draw N samples from a sampling distribution S
Compute an approximate posterior probability
Show this converges to the true probability P
Why sample?
Learning: get samples from a distribution you don't know
Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Forward Sampling
[Excel Demo]

Network: Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are the parents of WetGrass.

P(C):        P(+c) = 0.5    P(-c) = 0.5

P(S | C):    P(+s | +c) = 0.1    P(-s | +c) = 0.9
             P(+s | -c) = 0.5    P(-s | -c) = 0.5

P(R | C):    P(+r | +c) = 0.8    P(-r | +c) = 0.2
             P(+r | -c) = 0.2    P(-r | -c) = 0.8

P(W | S, R): P(+w | +s, +r) = 0.99    P(-w | +s, +r) = 0.01
             P(+w | +s, -r) = 0.90    P(-w | +s, -r) = 0.10
             P(+w | -s, +r) = 0.90    P(-w | -s, +r) = 0.10
             P(+w | -s, -r) = 0.01    P(-w | -s, -r) = 0.99

Samples:
+c, -s, +r, +w
-c, +s, -r, +w
…
Forward Sampling
This process generates samples with probability
  S_PS(x_1, ..., x_n) = Π_i P(x_i | Parents(X_i))
…i.e. the BN's joint probability P(x_1, ..., x_n)
Let the number of samples of an event be N_PS(x_1, ..., x_n)
Then
  lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1, ..., x_n)
I.e., the sampling procedure is consistent
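A minimal forward-sampling sketch for the Cloudy/Sprinkler/Rain/WetGrass network above, using the CPT values from the slide (the function and variable names are illustrative):

```python
import random

# CPTs from the slide: probability of the + outcome, indexed by the parents' values.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                       # P(+s | c)
P_R = {True: 0.8, False: 0.2}                       # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}   # P(+w | s, r)

def forward_sample():
    """Sample (c, s, r, w) in topological order, each variable given its parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

# The fraction of samples with +w converges to the true P(+w).
N = 100_000
print(sum(forward_sample()[3] for _ in range(N)) / N)
```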
Example
We'll get a bunch of samples from the BN (the Cloudy / Sprinkler / Rain / WetGrass network):
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
If we want to know P(W):
We have counts <+w: 4, -w: 1>
Normalize to get P(W) = <+w: 0.8, -w: 0.2>
This will get closer to the true distribution with more samples
Can estimate anything else, too
What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
Fast: can use fewer samples if less time (what's the drawback?)
Rejection Sampling
Let's say we want P(C)
No point keeping all samples around
Just tally counts of C as we go
Let's say we want P(C | +s)
Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
This is called rejection sampling
It is also consistent for conditional probabilities (i.e., correct in the limit)
Samples (from the Cloudy / Sprinkler / Rain / WetGrass network):
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
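A rejection-sampling sketch for P(+c | +s); it assumes the forward_sample() function and CPTs from the forward-sampling sketch above:

```python
def estimate_c_given_s(n_samples):
    """Estimate P(+c | +s) by discarding samples that contradict the evidence S = +s."""
    kept = positives = 0
    for _ in range(n_samples):
        c, s, r, w = forward_sample()
        if not s:
            continue            # reject: the sample disagrees with the evidence
        kept += 1
        positives += c
    return positives / kept     # consistent: correct in the limit of many kept samples

print(estimate_c_given_s(100_000))
```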
Sampling Example
There are 2 cups.
The first contains 1 penny and 1 quarter.
The second contains 2 quarters.
Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!).
What is the probability that the other coin in that cup is also a quarter?
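A quick Monte Carlo check of this puzzle, in the same rejection-sampling spirit: trials where the drawn coin is not a quarter are thrown away. The setup is from the slide; the code is only an illustrative sketch.

```python
import random

cups = [["penny", "quarter"], ["quarter", "quarter"]]

kept = other_is_quarter = 0
for _ in range(100_000):
    cup = random.choice(cups)        # pick a cup uniformly at random
    i = random.randrange(2)          # pick a coin at random from that cup
    if cup[i] != "quarter":
        continue                     # reject: we were told the drawn coin is a quarter
    kept += 1
    other_is_quarter += (cup[1 - i] == "quarter")

print(other_is_quarter / kept)       # estimated answer to the question above
```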
Likelihood Weighting
Problem with rejection sampling:
If evidence is unlikely, you reject a lot of samples
You don't exploit your evidence as you sample
Consider P(B | +a) in a Burglary → Alarm network; most samples disagree with the evidence and are rejected:
  -b, -a    -b, -a    -b, -a    -b, -a    +b, +a
Idea: fix evidence variables and sample the rest:
  -b, +a    -b, +a    -b, +a    -b, +a    +b, +a
Problem: sample distribution not consistent!
Solution: weight by probability of evidence given parents
Likelihood Weighting
Sampling distribution if z is sampled and e is fixed evidence:
  S_WS(z, e) = Π_i P(z_i | Parents(Z_i))
Now, samples have weights:
  w(z, e) = Π_j P(e_j | Parents(E_j))
Together, the weighted sampling distribution is consistent:
  S_WS(z, e) · w(z, e) = Π_i P(z_i | Parents(Z_i)) · Π_j P(e_j | Parents(E_j)) = P(z, e)
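A minimal likelihood-weighting sketch on the same Cloudy/Sprinkler/Rain/WetGrass network, choosing S = +s and W = +w as the evidence for illustration: evidence variables are clamped, and each sample is weighted by the probability of its evidence values given their parents.

```python
import random

# Same CPTs as in the forward-sampling sketch (probability of the + outcome).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def weighted_sample():
    """Sample the non-evidence variables; weight by the clamped evidence rows."""
    weight = 1.0
    c = random.random() < P_C        # sampled
    s = True                         # evidence S = +s (clamped, not sampled)
    weight *= P_S[c]                 #   weight *= P(+s | c)
    r = random.random() < P_R[c]     # sampled
    w = True                         # evidence W = +w (clamped, not sampled)
    weight *= P_W[(s, r)]            #   weight *= P(+w | s, r)
    return c, weight

num = den = 0.0
for _ in range(100_000):
    c, wt = weighted_sample()
    den += wt
    num += wt if c else 0.0
print(num / den)                     # estimate of P(+c | +s, +w)
```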
Likelihood Weighting
(Same Cloudy / Sprinkler / Rain / WetGrass network and CPTs as in the forward-sampling example, now with the evidence variables clamped while sampling.)
Samples:
+c, +s, +r, +w
…
Likelihood Weighting Example

  Cloudy  Rainy  Sprinkler  Wet Grass | Weight
    0       1       1          1      | 0.495
    0       0       1          1      | 0.45
    0       0       1          1      | 0.45
    0       0       1          1      | 0.45
    1       0       1          1      | 0.09

Inference:
Sum over weights that match query value
Divide by total sample weight
What is P(C | +w, +r)?

  P(C | +w, +r) = Σ_{samples with C = 1} sample_weight / Σ_{all samples} sample_weight
Likelihood Weighting
Likelihood weighting is good
We have taken evidence into account as we generate the sample
E.g. here, W's value will get picked based on the evidence values of S, R
More of our samples will reflect the state of the world suggested by the evidence
Likelihood weighting doesn't solve all our problems
Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
We would like to consider evidence when we sample every variable
Markov Chain Monte Carlo*
Idea: instead of sampling from scratch, create samples that are each like the last one.
Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(B | +c):
  (+b, +a, +c)  →  (-b, +a, +c)  →  (-b, -a, +c)  →  ...
Properties: Now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!
What's the point: both upstream and downstream variables condition on evidence.
Random Walks
[Explain on Blackboard]
Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x1, ..., xn
3. Repeat:
   1. Pick any variable Xi uniformly at random.
   2. Resample xi' from P(Xi | x1, ..., xi-1, xi+1, ..., xn)
   3. Set all other xj' = xj
   4. The new sample is x1', ..., xn'
Markov Blanket
Markov blanket of X:
1. All parents of X
2. All children of X
3. All parents of children of X (except X itself)
X is conditionally independent from all other variables in the BN, given all variables in the Markov blanket (besides X).
Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x1, ..., xn
3. Repeat:
   1. Pick any variable Xi uniformly at random.
   2. Resample xi' from P(Xi | MarkovBlanket(Xi))
   3. Set all other xj' = xj
   4. The new sample is x1', ..., xn' (see the code sketch below)
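A minimal Gibbs-sampling sketch for the Cloudy/Sprinkler/Rain/WetGrass network with evidence S = +s and W = +w clamped; C and R are resampled in turn from their distributions given their Markov blankets (an illustrative sketch, not a reference implementation):

```python
import random

# Same CPTs as in the earlier sampling sketches (probability of the + outcome).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def flip(p_true, p_false):
    """Return True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

def gibbs(n_steps):
    s, w = True, True                # evidence: S = +s, W = +w, never resampled
    c, r = True, True                # arbitrary initial values for the hidden variables
    count_c = 0
    for _ in range(n_steps):
        # Resample C from P(C | MB(C)), proportional to P(C) * P(+s | C) * P(r | C)
        pt = P_C * P_S[True] * (P_R[True] if r else 1 - P_R[True])
        pf = (1 - P_C) * P_S[False] * (P_R[False] if r else 1 - P_R[False])
        c = flip(pt, pf)
        # Resample R from P(R | MB(R)), proportional to P(R | c) * P(+w | +s, R)
        pt = P_R[c] * P_W[(s, True)]
        pf = (1 - P_R[c]) * P_W[(s, False)]
        r = flip(pt, pf)
        count_c += c
    return count_c / n_steps         # estimate of P(+c | +s, +w)

print(gibbs(100_000))
```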
Summary
Sampling can be your salvation
The dominant approach to inference in BNs
Approaches:
Forward (/Prior) Sampling
Rejection Sampling
Likelihood Weighted Sampling
Gibbs Sampling
Learning in Bayes Nets
Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes Net. Might also start with priors for CPT probabilities.
Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).
Turing Award for Bayes Nets