Transcript: Lecture 15

CS 416
Artificial Intelligence
Lecture 15
Uncertainty
Chapter 14
Late start on Thursday
CS Annual Feedback Forum
Thursday, March 23, 4:30 – 5:25
MEC 205
Pizza
AI class will start at 5:30 on Thursday
Conditional probability
The probability of a given that all we know is b
• P(a | b)
Written as an unconditional probability:
• P(a | b) = P(a ∧ b) / P(b)
Conditioning
A distribution over Y can be obtained by summing
out all the other variables from any joint
distribution containing Y:
• P(X | e) = α P(X, e) = α Σy P(X, e, y)
– X is the query event, e is the evidence, and y ranges over
all other variables
– Each term in the sum is a joint entry anywhere x and
e are true
We need the full joint distribution to sum this up
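As a minimal sketch of this idea, the Python below conditions by summing a hidden variable out of a full joint stored as a table. The joint values are the standard toothache/cavity/catch numbers from the textbook's Chapter 13; the function name conditional is my own.

# Conditioning on a full joint table: P(Cavity | toothache) is found by
# summing the hidden Catch variable out, then normalizing (the alpha).
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def conditional(query_index, evidence):
    # Sum matching joint entries into a distribution over the query variable.
    dist = {True: 0.0, False: 0.0}
    for event, p in joint.items():
        if all(event[i] == v for i, v in evidence.items()):
            dist[event[query_index]] += p      # sums out all other variables
    total = sum(dist.values())                 # normalization constant
    return {x: p / total for x, p in dist.items()}

print(conditional(2, {0: True}))   # P(Cavity | toothache) = {True: 0.6, False: 0.4}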
Bayes Network
Bayes Network captures the full joint distribution
For comparison:
• P(x1, …, xn) = Πi P(xi | parents(Xi))
Example
P(B | JohnCalls=true, MaryCalls=true)
Example
P(B | JohnCalls=true, MaryCalls=true), the old way:
• P(B | j, m) = α Σe Σa P(B, e, a, j, m)
Example
Depth-first tree traversal required
Example
O(2^n) time complexity
Wastes time on repeated computations
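To make the "old way" concrete, here is a rough Python sketch of depth-first enumeration on the burglary network. The CPT numbers are the textbook's standard burglary-network values, assumed here because the transcript does not include the slide figures.

# "Old way": depth-first enumeration of P(B | j, m) over the hidden
# variables E and A. Note how each subtree is recomputed from scratch.
P_B = 0.001                       # P(Burglary)
P_E = 0.002                       # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def p(prob_true, value):
    # Probability of a boolean value given P(value = true).
    return prob_true if value else 1.0 - prob_true

def joint(b, e, a, j, m):
    # One full-joint entry via the Bayes-net factorization.
    return (p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
            * p(P_J[a], j) * p(P_M[a], m))

def query_burglary(j=True, m=True):
    # P(B | j, m) = alpha * sum over e, a of the joint entries.
    dist = {b: sum(joint(b, e, a, j, m)
                   for e in (True, False) for a in (True, False))
            for b in (True, False)}
    total = sum(dist.values())
    return {b: v / total for b, v in dist.items()}

print(query_burglary())   # approximately {True: 0.284, False: 0.716}

With n binary variables, the nested sums double for every hidden variable, which is exactly the O(2^n) blowup noted above.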
Complexity of Bayes Net
• Bayes Net reduces space complexity of full-joint distribution
• Bayes Net does not reduce time complexity for general case
Time complexity
• Note the repeated subexpressions: they motivate dynamic programming
Dynamic Programming
Fibonacci sequence example (naive recursion, in Python)

def fib(n):
    # base cases: fib(0) = 0, fib(1) = 1
    if n == 0 or n == 1:
        return n
    # recomputes both subproblems on every call: exponential time
    return fib(n - 1) + fib(n - 2)
1. fib(5)
2. fib(4) + fib(3)
3. (fib(3) + fib(2)) + (fib(2) + fib(1))
4. ((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
5. (((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
Dynamic Programming
Memoization
m = {0: 0, 1: 1}   # memo table, seeded with the base cases

def fib(n):
    if n not in m:             # compute each subproblem only once
        m[n] = fib(n - 1) + fib(n - 2)
    return m[n]
Approximate Inference
It’s expensive to work with the full joint
distribution… whether as a table or as a Bayesian
Network
Is approximation good enough?
Monte Carlo
Use samples to approximate posterior probabilities
• Simulated annealing used Monte Carlo arguments to justify why
random guesses, and sometimes going uphill, can lead to
optimality
More samples = better approximation
• How many are needed?
• Where should you take the samples?
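As a toy illustration of "more samples = better approximation", the sketch below estimates a made-up probability (0.3, an invented bias) from growing sample sizes; the error shrinks roughly like 1/sqrt(n), a point the rejection-sampling slide returns to.

import random

# Estimate a known probability (0.3, an invented bias) from n samples.
# The estimate tightens as n grows, with error on the order of 1/sqrt(n).
random.seed(0)   # fixed seed so the run is repeatable
for n in (100, 10_000, 1_000_000):
    estimate = sum(random.random() < 0.3 for _ in range(n)) / n
    print(n, estimate)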
An example
P(WetGrass)
• Requires the full-joint distribution
• Full-joint is O(2^n)
• Even unlikely events are tabulated in the full-joint
Prior sampling
An ability to model the prior probabilities of a set
of random variables
Imagine generating 100 of these samples
Prior sampling
Define S_PS(x1, x2, …, xn)
• Probability that event (x1, x2, …, xn) is generated by the network
• S_PS(x1, x2, …, xn) = Πi P(xi | parents(Xi)) = P(x1, x2, …, xn)
Approximating true distribution
With enough samples, the estimate converges to the true distribution (the estimator is consistent)
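A minimal prior-sampling sketch in Python, assuming the textbook's standard CPT values for the sprinkler network (Cloudy, Sprinkler, Rain, WetGrass), since the transcript omits the slide figures:

import random

# PriorSample: sample each variable in topological order, conditioned on
# its already-sampled parents. CPTs are the textbook's sprinkler values.
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}             # P(Sprinkler | Cloudy)
P_RAIN = {True: 0.80, False: 0.20}                  # P(Rain | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}  # P(WetGrass | S, R)

def prior_sample():
    c = random.random() < P_CLOUDY
    s = random.random() < P_SPRINKLER[c]
    r = random.random() < P_RAIN[c]
    w = random.random() < P_WET[(s, r)]
    return c, s, r, w

# Estimate P(WetGrass) by generating many samples and counting.
N = 100_000
print(sum(prior_sample()[3] for _ in range(N)) / N)   # near 0.647

Each sample costs only one pass over the network, so even P(WetGrass), which needed the full joint above, falls out of simple counting.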
Rejection sampling
Compute P(X | e)
• Use PriorSample (S_PS) to create N samples
• Inspect each sample for the truth of e
• Of those samples consistent with e, tabulate P(X | e)
– Keep track of the X values
– Normalize over the number of samples consistent with e
Example
• P(Rain | Sprinkler = true)
• Use Bayes Net to generate 100 samples
– Suppose 73 have Sprinkler=false
– Suppose 27 have Sprinkler=true
– 8 have Rain=true
– 19 have Rain=false
• P(Rain | Sprinkler=true) = Normalize(<8, 19>) ≈ <0.3, 0.7>
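Here is a rough rejection-sampling sketch for this query, reusing the same assumed sprinkler-network CPTs (WetGrass is omitted because it appears in neither the query nor the evidence):

import random

# Rejection sampling for P(Rain | Sprinkler=true): generate prior samples,
# throw away those inconsistent with the evidence, and count the rest.
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}   # P(Sprinkler | Cloudy)
P_RAIN = {True: 0.80, False: 0.20}        # P(Rain | Cloudy)

def prior_sample():
    c = random.random() < P_CLOUDY
    return c, random.random() < P_SPRINKLER[c], random.random() < P_RAIN[c]

def rejection_sample(n=100_000):
    counts = {True: 0, False: 0}
    for _ in range(n):
        c, s, r = prior_sample()
        if s:                  # keep only samples consistent with e
            counts[r] += 1
    total = counts[True] + counts[False]   # rejected samples are wasted work
    return {r: k / total for r, k in counts.items()}

print(rejection_sample())   # near {True: 0.3, False: 0.7}, as on the slide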
Problems with rejection sampling
• Standard deviation of the error in probability is proportional to
1/sqrt(n), where n is the number of samples consistent with
evidence
• As problems become complex, number of samples
consistent with evidence becomes small
Likelihood weighting
We only want to generate samples that are
consistent with the evidence, e
• We’ll sample the Bayesian Net, but we won’t let every
random variable be sampled; the evidence variables will be
forced to produce their observed values
Example – likelihood weighting
P(Rain | Sprinkler=true, WetGrass=true)
• First, the sample weight, w, is set to 1.0
Example – likelihood weighting
Keep track: (T, T, T, T) with weight 0.099
• w = 1.0 × P(Sprinkler=true | Cloudy=true) × P(WetGrass=true |
Sprinkler=true, Rain=true) = 0.1 × 0.99 = 0.099
• Notice that the weight is reduced according to how likely an
evidence variable’s output is given its parents
• So the final probability is a function of what comes from sampling the free
variables while constraining the evidence variables
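A sketch of likelihood weighting for this query, again assuming the textbook's sprinkler CPTs: the evidence variables are clamped rather than sampled, and each contributes a factor P(observed value | parents) to the sample's weight.

import random

# Likelihood weighting for P(Rain | Sprinkler=true, WetGrass=true).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}             # P(Sprinkler | Cloudy)
P_RAIN = {True: 0.80, False: 0.20}                  # P(Rain | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}  # P(WetGrass | S, R)

def weighted_sample():
    w = 1.0
    c = random.random() < P_CLOUDY    # free variable: sample it
    w *= P_SPRINKLER[c]               # evidence Sprinkler=true: weight it
    r = random.random() < P_RAIN[c]   # free variable: sample it
    w *= P_WET[(True, r)]             # evidence WetGrass=true: weight it
    return r, w

def likelihood_weighting(n=100_000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        r, w = weighted_sample()
        totals[r] += w                # every sample contributes its weight
    z = totals[True] + totals[False]
    return {r: v / z for r, v in totals.items()}

print(likelihood_weighting())   # near {True: 0.32, False: 0.68}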
Comparing techniques
• Likelihood weighting uses all samples
– More efficient than rejection sampling
• Less effective if lots of evidence variables (small weights)
• Less effective if evidence is late in variable ordering (samples
generated w/o early influence of evidence)
Markov Chain Monte Carlo (MCMC)
• Imagine being in a current state
– An assignment to all the random variables
• The next state is selected by randomly sampling one of the
non-evidence variables, Xi
– Conditioned on the current values of the variables in the
current state
• MCMC wanders around state space, flipping one variable at
a time while keeping evidence variables fixed
Example - MCMC
Solve P(Rain | Sprinkler=TRUE, WetGrass = TRUE)
• Fix Sprinkler and WetGrass to TRUE
• Initialize “state” to [Cloudy=T, Sprinkler=T, Rain=F, WetGrass=T]
• Sample Cloudy from P(Cloudy | Sprinkler=T, Rain=F)
– We want to “flip the Cloudy bit” subject to the conditional
probabilities of its parents, children, and children’s parents
(its Markov blanket)
– Cloudy becomes False
Example - MCMC
Solve P(Rain | Sprinkler=TRUE, WetGrass = TRUE)
• Fix Sprinkler and WetGrass to TRUE
• “state” is [Cloudy=F, Sprinkler=T, Rain=F, WetGrass=T]
• Sample Rain from P(Rain | Cloudy=F, Sprinkler=T, WetGrass=T)
– Rain becomes TRUE
• “state” is [Cloudy=F, Sprinkler=T, Rain=T, WetGrass=T]
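The walk above is Gibbs sampling. A minimal sketch, with the same assumed sprinkler CPTs: the evidence stays clamped, and Cloudy and Rain are resampled in turn from their conditionals (computed here by a joint-probability ratio, which on a four-node network is equivalent to conditioning on the Markov blanket).

import random

# Gibbs sampling (MCMC) for P(Rain | Sprinkler=true, WetGrass=true).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}
P_RAIN = {True: 0.80, False: 0.20}
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}

def p(prob_true, value):
    return prob_true if value else 1.0 - prob_true

def joint(c, s, r, w):
    # Full-joint entry via the Bayes-net factorization.
    return (p(P_CLOUDY, c) * p(P_SPRINKLER[c], s)
            * p(P_RAIN[c], r) * p(P_WET[(s, r)], w))

def gibbs(n=100_000):
    c, r = True, False      # arbitrary initial state
    s, w = True, True       # evidence: fixed, never resampled
    rain_true = 0
    for _ in range(n):
        # Resample Cloudy given everything else, then Rain likewise.
        pc = joint(True, s, r, w)
        c = random.random() < pc / (pc + joint(False, s, r, w))
        pr = joint(c, s, True, w)
        r = random.random() < pr / (pr + joint(c, s, False, w))
        rain_true += r      # tally how often Rain is true in the chain
    return rain_true / n

print(gibbs())   # long-run fraction near the true posterior, about 0.32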
Nice method of operation
• The sampling process settles into a “dynamic equilibrium” in
which the long-run fraction of time spent in each state is
exactly proportional to its posterior probability
• Let q(x → x’) = probability of transitioning from state x to x’
• A Markov chain is a sequence of state transitions according
to q( ) functions
• p_t(x) measures the probability of being in state x after t steps
Markov chains
• p_t+1(x’) = probability of being in x’ after t+1 steps
= Σx p_t(x) q(x → x’)
• If p_t = p_t+1, we have reached a stationary distribution