Transcript Slides

QUIZ!!





T/F: You can always (theoretically) do BNs inference by enumeration. TRUE
T/F: In VE, always first marginalize, then join. FALSE
T/F: VE is a lot faster than enumeration, but not always exact. FALSE
T/F: The running time of VE is independent of the order you pick the r.v.s. FALSE
T/F: The more evidence, the faster VE runs. TRUE






P(X|Y) sums to ... |Y|
P(x|Y) sums to ... ?? (no fixed value)
P(X|y) sums to ... 1
P(x|y) sums to ... P(x|y)
P(x,Y) sums to ... P(x)
P(X,Y) sums to ... 1
CSE 511a: Artificial Intelligence
Spring 2013
Lecture 17: Bayes’ Nets IV – Approximate Inference (Sampling)
04/01/2013
Robert Pless
Via Kilian Weinberger via Dan Klein – UC Berkeley
Announcements
 Project 4 out soon.
 Project 3 due at midnight.
Exact Inference
Variable Elimination
General Variable Elimination
 Query: P(Q | E1 = e1, ..., Ek = ek)
 Start with initial factors:
   Local CPTs (but instantiated by evidence)
 While there are still hidden variables (not Q or evidence):
   Pick a hidden variable H
   Join all factors mentioning H
   Eliminate (sum out) H
 Join all remaining factors and normalize
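Below is a minimal sketch of the two core factor operations this loop relies on (join and sum out), with a factor stored as a (variables, table) pair whose table maps a tuple of values, one per variable in order, to a number. The names join, sum_out, and domains are illustrative, not from any library; the short usage at the end reproduces the elimination of "study" worked through on the slides that follow.

from itertools import product

def join(f1, f2, domains):
    # Pointwise product of two factors over the union of their variables.
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = list(vars1) + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product(*(domains[v] for v in out_vars)):
        a = dict(zip(out_vars, values))
        table[values] = (t1[tuple(a[v] for v in vars1)]
                         * t2[tuple(a[v] for v in vars2)])
    return out_vars, table

def sum_out(var, factor):
    # Marginalize `var` out of a factor by summing over its values.
    vars_, t = factor
    out_vars = [v for v in vars_ if v != var]
    table = {}
    for values, p in t.items():
        a = dict(zip(vars_, values))
        key = tuple(a[v] for v in out_vars)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# Usage (matches the elimination of "study" in the example that follows):
domains = {"at": (True, False), "st": (True, False), "pr": (True, False)}
P_st = (["st"], {(True,): 0.6, (False,): 0.4})
P_pr_at_st = (["pr", "at", "st"], {
    (True, True, True): 0.9,   (True, True, False): 0.5,
    (True, False, True): 0.7,  (True, False, False): 0.1,
    (False, True, True): 0.1,  (False, True, False): 0.5,
    (False, False, True): 0.3, (False, False, False): 0.9})
f = sum_out("st", join(P_pr_at_st, P_st, domains))
# f is P(pr | at): pr=T,at=T -> 0.74, pr=T,at=F -> 0.46,
#                  pr=F,at=T -> 0.26, pr=F,at=F -> 0.54 (up to float rounding)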
Example: Variable elimination

Network: attend and study are the parents of prepared; attend, prepared, and fair are the parents of pass.

Priors:  P(at) = .8    P(st) = .6    P(fa) = .9

P(pr | at, st):
  at  st  | pr=T
  T   T   | 0.9
  T   F   | 0.5
  F   T   | 0.7
  F   F   | 0.1

P(pa | at, pr, fa):
  pr  at  fa | pa=T
  T   T   T  | 0.9
  T   T   F  | 0.1
  T   F   T  | 0.7
  T   F   F  | 0.1
  F   T   T  | 0.7
  F   T   F  | 0.1
  F   F   T  | 0.2
  F   F   F  | 0.1

Query: What is the probability that a student attends class, given that they pass the exam, i.e. P(at | pa)?

[based on slides taken from UMBC CMSC 671, 2005]
Join study factors

Join P(pr | at, st) with P(st), producing a joint factor over prepared, study, and attend.
(P(at) = .8, P(fa) = .9, and the pass CPT P(pa | at, pr, fa) are unchanged from the previous slide.)

  pr  st  at | Original P(pr|at,st)  Original P(st) | Joint P(pr,st|at)
  T   T   T  |        0.9                 0.6       |      0.54
  T   F   T  |        0.5                 0.4       |      0.20
  T   T   F  |        0.7                 0.6       |      0.42
  T   F   F  |        0.1                 0.4       |      0.04
  F   T   T  |        0.1                 0.6       |      0.06
  F   F   T  |        0.5                 0.4       |      0.20
  F   T   F  |        0.3                 0.6       |      0.18
  F   F   F  |        0.9                 0.4       |      0.36
Marginalize out study

Sum study out of the joint factor P(pr, st | at), leaving P(pr | at):

  pr  at | Marginal P(pr|at)
  T   T  |      0.74
  T   F  |      0.46
  F   T  |      0.26
  F   F  |      0.54

(P(at) = .8, P(fa) = .9, and P(pa | at, pr, fa) are unchanged.)
Remove “study”

Remaining factors: P(at) = .8, P(fa) = .9, the new factor P(pr | at), and the unchanged CPT P(pa | at, pr, fa):

P(pr | at):
  pr  at | P(pr|at)
  T   T  |  0.74
  T   F  |  0.46
  F   T  |  0.26
  F   F  |  0.54

P(pa | at, pr, fa):
  pr  at  fa | pa=T
  T   T   T  | 0.9
  T   T   F  | 0.1
  T   F   T  | 0.7
  T   F   F  | 0.1
  F   T   T  | 0.7
  F   T   F  | 0.1
  F   F   T  | 0.2
  F   F   F  | 0.1
Join factors “fair”

Join P(pa | at, pr, fa) with P(fa), producing a joint factor P(pa, fa | at, pr) (only pa = t rows shown).
(P(at) = .8 and P(pr | at) are unchanged.)

  pa  pr  at  fa | Original P(pa|at,pr,fa)  Original P(fair) | Joint P(pa,fa|at,pr)
  t   T   T   T  |          0.9                   0.9        |       0.81
  t   T   T   F  |          0.1                   0.1        |       0.01
  t   T   F   T  |          0.7                   0.9        |       0.63
  t   T   F   F  |          0.1                   0.1        |       0.01
  t   F   T   T  |          0.7                   0.9        |       0.63
  t   F   T   F  |          0.1                   0.1        |       0.01
  t   F   F   T  |          0.2                   0.9        |       0.18
  t   F   F   F  |          0.1                   0.1        |       0.01
Marginalize out “fair”

Sum fair out of the joint factor P(pa, fa | at, pr), leaving P(pa | at, pr) (only pa = t rows shown):

  pa  pr  at | Marginal P(pa|at,pr)
  t   T   T  |        0.82
  t   T   F  |        0.64
  t   F   T  |        0.64
  t   F   F  |        0.19

(P(at) = .8 and P(pr | at) are unchanged.)
Marginalize out “fair”

Remaining factors: P(at) = .8, P(pr | at) (above), and the new factor P(pa | at, pr):

  pa  pr  at | P(pa|at,pr)
  t   T   T  |   0.82
  t   T   F  |   0.64
  t   F   T  |   0.64
  t   F   F  |   0.19
Join factors “prepared”

Join P(pa | at, pr) with P(pr | at), producing a joint factor P(pa, pr | at).
(P(at) = .8 is unchanged.)

  pa  pr  at | Original P(pa|at,pr)  Original P(pr|at) | Joint P(pa,pr|at)
  t   T   T  |        0.82                0.74         |      0.6068
  t   T   F  |        0.64                0.46         |      0.2944
  t   F   T  |        0.64                0.26         |      0.1664
  t   F   F  |        0.19                0.54         |      0.1026
Join factors “prepared”

Summing prepared out of the joint factor P(pa, pr | at) leaves P(pa | at):

  pa  at | Marginal P(pa|at)
  t   T  |  0.7732   (= 0.6068 + 0.1664)
  t   F  |  0.397    (= 0.2944 + 0.1026)
Join factors “prepared”

Remaining factors: P(at) = .8 and P(pa | at):

  pa  at | P(pa|at)
  t   T  |  0.7732
  t   F  |  0.397
Join factors

Join P(pa | at) with P(at), then normalize to answer the query:

  pa  at | Original P(pa|at)  Original P(at) | Joint P(pa,at) | Normalized P(at|pa)
  T   T  |      0.7732             0.8       |    0.61856     |      0.89
  T   F  |      0.397              0.2       |    0.0794      |      0.11
Join factors

The final factor over attend and pass answers the query: a student who passes the exam attended class with probability P(at | pa) ≈ 0.89 (and did not attend with probability ≈ 0.11).
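As a sanity check, here is a short brute-force enumeration of the same query directly from the CPTs given at the start of the example. Variable names follow the slides (at, st, fa, pr, pa); the code is an illustrative sketch, not part of the original lecture.

P_at = {True: 0.8, False: 0.2}
P_st = {True: 0.6, False: 0.4}
P_fa = {True: 0.9, False: 0.1}
P_pr_true = {(True, True): 0.9, (True, False): 0.5,        # P(pr=T | at, st)
             (False, True): 0.7, (False, False): 0.1}
P_pa_true = {(True, True, True): 0.9,   (True, True, False): 0.1,    # P(pa=T | pr, at, fa)
             (True, False, True): 0.7,  (True, False, False): 0.1,
             (False, True, True): 0.7,  (False, True, False): 0.1,
             (False, False, True): 0.2, (False, False, False): 0.1}

def p_pass_and(at):
    # P(pa=T, at): sum out st, pr, fa by brute force.
    total = 0.0
    for st in (True, False):
        for fa in (True, False):
            for pr in (True, False):
                p_pr = P_pr_true[(at, st)] if pr else 1 - P_pr_true[(at, st)]
                total += P_at[at] * P_st[st] * P_fa[fa] * p_pr * P_pa_true[(pr, at, fa)]
    return total

num = p_pass_and(True)
den = num + p_pass_and(False)
print(num, den, num / den)   # ~0.61856, ~0.69796, ~0.886 -- the 0.89 above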
Approximate Inference
Sampling (particle based method)
Approximate Inference
Sampling – the basics ...
 Scrooge McDuck gives you an ancient coin.
 He wants to know what P(H) is.
 You have no homework, and nothing good is on television – so you toss it 1 million times.
 You obtain 700,000 heads and 300,000 tails.
 What is P(H)?
Sampling – the basics ...
 Exactly, P(H) = 0.7
 Why?

    P(H) = #Heads / #Tosses = 700,000 / 1,000,000 = 0.7
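A minimal simulation sketch of this estimate (the 0.7 bias below stands in for the unknown true P(H); names are illustrative):

import random

N = 1_000_000
heads = sum(random.random() < 0.7 for _ in range(N))   # toss a coin with P(H) = 0.7
print(heads / N)                                       # fraction of heads, close to 0.7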
Monte Carlo Method
Who is more likely to win? Green or Purple?
What is the probability that green wins, P(G)?
Two ways to solve this:
1. Compute the exact probability.
2. Play 100,000 games and see how many times green wins.
Approximate Inference
 Simulation has a name: sampling
 Sampling is a hot topic in machine learning, and it’s really simple
 Basic idea:
   Draw N samples from a sampling distribution S
   Compute an approximate posterior probability
   Show this converges to the true probability P
 Why sample?
   Learning: get samples from a distribution you don’t know
   Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Forward Sampling
[Excel Demo]

Network: Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are the parents of WetGrass.

P(C):            P(S | C):                  P(R | C):
  +c  0.5          +c: +s 0.1, -s 0.9         +c: +r 0.8, -r 0.2
  -c  0.5          -c: +s 0.5, -s 0.5         -c: +r 0.2, -r 0.8

P(W | S, R):
  +s, +r: +w 0.99, -w 0.01
  +s, -r: +w 0.90, -w 0.10
  -s, +r: +w 0.90, -w 0.10
  -s, -r: +w 0.01, -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w
  …
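A minimal forward-sampling sketch for this network, drawing each variable in topological order (C, S, R, W) from the CPTs above; function and variable names are illustrative.

import random

def bern(p):
    return random.random() < p

def prior_sample():
    c = bern(0.5)                       # P(+c)
    s = bern(0.1 if c else 0.5)         # P(+s | C)
    r = bern(0.8 if c else 0.2)         # P(+r | C)
    if s and r:
        w = bern(0.99)                  # P(+w | +s, +r)
    elif s or r:
        w = bern(0.90)                  # P(+w | +s, -r) = P(+w | -s, +r)
    else:
        w = bern(0.01)                  # P(+w | -s, -r)
    return c, s, r, w

samples = [prior_sample() for _ in range(10000)]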
Forward Sampling
 This process generates samples with probability
     S_PS(x1, ..., xn) = Π_i P(xi | Parents(Xi)) = P(x1, ..., xn)
   …i.e. the BN’s joint probability
 Let the number of samples of an event be N_PS(x1, ..., xn)
 Then the estimate N_PS(x1, ..., xn) / N converges to P(x1, ..., xn) as N grows
 I.e., the sampling procedure is consistent
Example
 We’ll get a bunch of samples from the BN:
     +c, -s, +r, +w
     +c, +s, +r, +w
     -c, +s, +r, -w
     +c, -s, +r, +w
     -c, -s, -r, +w
 If we want to know P(W):
   We have counts <+w:4, -w:1>
   Normalize to get P(W) = <+w:0.8, -w:0.2>
   This will get closer to the true distribution with more samples
   Can estimate anything else, too
   What about P(C| +w)? P(C| +r, +w)? P(C| -r, -w)?
   Fast: can use fewer samples if less time (what’s the drawback?)
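The count-and-normalize step above, as a sketch that reuses prior_sample() from the forward-sampling snippet (illustrative only):

n = 10000
draws = [prior_sample() for _ in range(n)]
p_w = sum(1 for (c, s, r, w) in draws if w) / n   # fraction of samples with +w
print("P(+w) ~", p_w)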
Rejection Sampling
 Let’s say we want P(C)
   No point keeping all samples around
   Just tally counts of C as we go
 Let’s say we want P(C | +s)
   Same thing: tally C outcomes, but ignore (reject) samples which don’t have S=+s
   This is called rejection sampling
   It is also consistent for conditional probabilities (i.e., correct in the limit)

     +c, -s, +r, +w
     +c, +s, +r, +w
     -c, +s, +r, -w
     +c, -s, +r, +w
     -c, -s, -r, +w
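A rejection-sampling sketch for P(C | +s), again reusing prior_sample() from the forward-sampling snippet (illustrative only): draw from the prior, throw away samples with S != +s, and tally C among the rest.

kept = [c for (c, s, r, w) in (prior_sample() for _ in range(100000)) if s]
print("P(+c | +s) ~", sum(kept) / len(kept))   # many samples are wasted if the evidence is rare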
Sampling Example
 There are 2 cups.
 The first contains 1 penny and 1 quarter
 The second contains 2 quarters
 Say I pick a cup uniformly at random, then pick a
coin randomly from that cup. It's a quarter (yes!).
What is the probability that the other coin in that
cup is also a quarter?
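A Monte Carlo sketch of this puzzle (names are illustrative): simulate the cup-and-coin draw, keep only the trials where a quarter was drawn, and check how often the other coin in that cup is also a quarter.

import random

cups = [["penny", "quarter"], ["quarter", "quarter"]]
hits = trials = 0
for _ in range(100000):
    cup = random.choice(cups)          # pick a cup uniformly at random
    i = random.randrange(2)            # pick a coin from that cup
    if cup[i] == "quarter":            # condition on having drawn a quarter
        trials += 1
        hits += cup[1 - i] == "quarter"
print(hits / trials)                   # converges to 2/3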
Likelihood Weighting
 Problem with rejection sampling:
   If evidence is unlikely, you reject a lot of samples
   You don’t exploit your evidence as you sample
   Consider P(B|+a): samples from the prior over Burglary → Alarm come out as
       -b, -a   -b, -a   -b, -a   -b, -a   +b, +a
     and all but the last are rejected
 Idea: fix evidence variables and sample the rest
       -b, +a   -b, +a   -b, +a   -b, +a   +b, +a
 Problem: sample distribution not consistent!
 Solution: weight by probability of evidence given parents
Likelihood Weighting
 Sampling distribution if z sampled and e fixed evidence:
     S_WE(z, e) = Π_i P(zi | Parents(Zi))
 Now, samples have weights:
     w(z, e) = Π_j P(ej | Parents(Ej))
 Together, the weighted sampling distribution is consistent:
     S_WE(z, e) · w(z, e) = Π_i P(zi | Parents(Zi)) · Π_j P(ej | Parents(Ej)) = P(z, e)
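A likelihood-weighting sketch for the Cloudy/Sprinkler/Rain/WetGrass network with evidence S = +s and W = +w (this particular evidence choice is an assumption for illustration): sample the non-evidence variables from their CPTs, keep the evidence fixed, and weight each sample by P(evidence | parents).

import random

def bern(p):
    return random.random() < p

def weighted_sample():
    weight = 1.0
    c = bern(0.5)                        # sample Cloudy from its prior
    weight *= 0.1 if c else 0.5          # evidence S = +s: multiply in P(+s | C)
    r = bern(0.8 if c else 0.2)          # sample Rain given Cloudy
    weight *= 0.99 if r else 0.90        # evidence W = +w: multiply in P(+w | +s, R)
    return (c, r), weight

pairs = [weighted_sample() for _ in range(100000)]
num = sum(w for (c, r), w in pairs if c)
den = sum(w for (c, r), w in pairs)
print("P(+c | +s, +w) ~", num / den)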
Likelihood Weighting
(Same Cloudy/Sprinkler/Rain/WetGrass network and CPTs as on the Forward Sampling slide.)

Samples:
  +c, +s, +r, +w
  …
Likelihood Weighting Example

  Sample | Cloudy | Rainy | Sprinkler | Wet Grass | Weight
     1   |   0    |   1   |     1     |     1     | 0.495
     2   |   0    |   0   |     1     |     1     | 0.45
     3   |   0    |   0   |     1     |     1     | 0.45
     4   |   0    |   0   |     1     |     1     | 0.45
     5   |   1    |   0   |     1     |     1     | 0.09

 Inference:
   Sum over weights that match query value
   Divide by total sample weight
 What is P(C | +s, +w)?

     P(C | +s, +w) = ( Σ over samples with C=1 of sample_weight ) / ( Σ over all samples of sample_weight )
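Plugging the five weighted samples from the table into that formula (a quick check; the tabulated weights correspond to evidence Sprinkler = 1, Wet Grass = 1):

weights = [0.495, 0.45, 0.45, 0.45, 0.09]
cloudy  = [0,     0,    0,    0,    1]
num = sum(w for w, c in zip(weights, cloudy) if c == 1)   # 0.09
den = sum(weights)                                        # 1.935
print(num / den)                                          # ~0.047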
Likelihood Weighting
 Likelihood weighting is good
   We have taken evidence into account as we generate the sample
   E.g. here, W’s value will get picked based on the evidence values of S, R
   More of our samples will reflect the state of the world suggested by the evidence
 Likelihood weighting doesn’t solve all our problems
   Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)
 We would like to consider evidence when we sample every variable
Markov Chain Monte Carlo*
 Idea: instead of sampling from scratch, create samples that are each like the last one.
 Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(b|c):
     (+b, +a, +c)  →  (-b, +a, +c)  →  (-b, -a, +c)
 Properties: Now samples are not independent (in fact they’re nearly identical), but sample averages are still consistent estimators!
 What’s the point: both upstream and downstream variables condition on evidence.
Random Walks
 [Explain on Blackboard]
Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x1, ..., xn
3. Repeat:
   1. Pick any variable Xi uniformly at random.
   2. Resample xi’ from p(Xi | x1, ..., xi-1, xi+1, ..., xn)
   3. Set all other xj’ = xj
   4. The new sample is x1’, ..., xn’
Markov Blanket
Markov blanket of X:
1. All parents of X
2. All children of X
3. All parents of children of X (except X itself)

X is conditionally independent from all other variables in the BN, given all variables in the Markov blanket (besides X).
Gibbs Sampling
1. Set all evidence E to e
2. Do forward sampling to obtain x1, ..., xn
3. Repeat:
   1. Pick any variable Xi uniformly at random.
   2. Resample xi’ from p(Xi | MarkovBlanket(Xi))
   3. Set all other xj’ = xj
   4. The new sample is x1’, ..., xn’
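A Gibbs-sampling sketch for the Cloudy/Sprinkler/Rain/WetGrass network with evidence S = +s and W = +w (an assumed evidence choice for illustration), resampling each hidden variable from its distribution given its Markov blanket; the two conditionals below are worked out by hand from the CPTs on the Forward Sampling slide.

import random

def bern(p):
    return random.random() < p

def resample_c(r):
    # Markov blanket of C is {S, R}:  P(C | +s, r) ∝ P(C) · P(+s | C) · P(r | C)
    p_true  = 0.5 * 0.1 * (0.8 if r else 0.2)
    p_false = 0.5 * 0.5 * (0.2 if r else 0.8)
    return bern(p_true / (p_true + p_false))

def resample_r(c):
    # Markov blanket of R is {C, S, W}:  P(R | c, +s, +w) ∝ P(R | c) · P(+w | +s, R)
    p_true  = (0.8 if c else 0.2) * 0.99
    p_false = (0.2 if c else 0.8) * 0.90
    return bern(p_true / (p_true + p_false))

c, r = True, True                 # arbitrary initial state for the hidden variables
count_c, N = 0, 100000
for _ in range(N):
    if random.random() < 0.5:     # pick one hidden variable uniformly at random
        c = resample_c(r)
    else:
        r = resample_r(c)
    count_c += c
print("P(+c | +s, +w) ~", count_c / N)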
Summary
 Sampling can be your salvation
 It is the dominant approach to inference in BNs
 Approaches:
   Forward (/Prior) Sampling
   Rejection Sampling
   Likelihood Weighted Sampling
   Gibbs Sampling
Learning in Bayes Nets
 Task 1: Given the network structure and
given data, where a data point is an
observed setting for the variables, learn
the CPTs for the Bayes Net. Might also
start with priors for CPT probabilities.
 Task 2: Given only the data (and possibly
a prior over Bayes Nets), learn the entire
Bayes Net (both Net structure and CPTs).
Turing Award for Bayes Nets