Transcript PowerPoint
CS 416
Artificial Intelligence
Lecture 16
Uncertainty
Chapter 14
Conditional probability
The probability of a given all we know is b
• P(a | b)
Written as an unconditional probability
• P(a | b) = P(a ∧ b) / P(b)
Conditioning
A distribution over Y can be obtained by summing
out all the other variables from any joint
distribution containing Y
• P(X | e) = α P(X, e) = α Σy P(X, e, y)
– X is the query event, e is the evidence, and y ranges over all
other variables; the sum covers every entry where X and e are true
We need the full joint distribution to sum this up
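The summing-out step above can be sketched directly on a full joint table. This is a toy example with a hypothetical joint over three Boolean variables A, B, C (the table values are made up for illustration, not from the lecture):

```python
# Hypothetical full joint over three Boolean variables A, B, C,
# stored as a table: {(a, b, c): probability}. Values sum to 1.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.016, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.008,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
VARS = ("A", "B", "C")

def conditional(query_var, evidence):
    """P(query_var | evidence) = alpha * sum, over all other variables,
    of the joint entries consistent with the evidence."""
    qi = VARS.index(query_var)
    dist = {True: 0.0, False: 0.0}
    for assignment, p in joint.items():
        if all(assignment[VARS.index(v)] == val for v, val in evidence.items()):
            dist[assignment[qi]] += p          # sum out the other variables
    alpha = 1.0 / sum(dist.values())           # normalize
    return {v: alpha * p for v, p in dist.items()}

print(conditional("B", {"A": True}))           # P(B | A=true) ≈ <0.6, 0.4>
```

Note the cost: the loop touches every entry of the joint table, which is exactly why we need the full joint distribution to sum this up.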
Bayes Network
Bayes Network captures the full joint distribution
For comparison:
• P(x1, …, xn) = Πi P(xi | parents(Xi))
Example
P(B | john calls, mary calls)
Example
P(B | john calls, mary calls)
• Old way: P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
• To expedite, move terms outside the summations:
P(B | j, m) = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)
Example
Depth-first tree traversal required
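The factored enumeration above can be sketched as code, assuming the usual textbook CPT values for the burglary network (P(B)=0.001, P(E)=0.002, etc.):

```python
# A sketch of enumeration for P(B | j, m) on the burglary network,
# with terms that don't depend on a or e moved outside the inner sums.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j | a)
P_M = {True: 0.70, False: 0.01}                      # P(m | a)

def p(prob_true, value):
    """Probability of a Boolean value given P(true)."""
    return prob_true if value else 1.0 - prob_true

def unnormalized(b):
    total = 0.0
    for e in (True, False):
        inner = 0.0
        for a in (True, False):
            inner += p(P_A[(b, e)], a) * P_J[a] * P_M[a]
        total += p(P_E, e) * inner                   # P(e) outside the a-sum
    return p(P_B, b) * total                         # P(b) outside both sums

scores = {b: unnormalized(b) for b in (True, False)}
alpha = 1.0 / sum(scores.values())
posterior = {b: alpha * s for b, s in scores.items()}
print(posterior[True])                               # ≈ 0.284
```

The nested loops are exactly the depth-first tree traversal: for each value of B we descend through E, then A, multiplying conditional probabilities along the way.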
Example
Complexity of Bayes Net
• Bayes Net reduces space complexity
• Bayes Net does not reduce time complexity for general case
Time complexity
Note repeated subexpressions
Dynamic
Programming
Time complexity
Dynamic programming
• Works well for polytrees (where there is at most one path
between any two nodes)
• Doesn’t work for multiply connected networks
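The repeated-subexpression point can be made concrete with a tiny memoization sketch (this caching demo is illustrative, not the full variable-elimination algorithm): in the burglary enumeration, P(j | a) P(m | a) depends only on a, yet naive enumeration recomputes it for every (b, e) pair.

```python
from functools import lru_cache

P_J = {True: 0.90, False: 0.05}   # P(j | a), textbook burglary numbers
P_M = {True: 0.70, False: 0.01}   # P(m | a)

calls = 0

@lru_cache(maxsize=None)
def evidence_term(a):
    """P(j | a) * P(m | a): depends only on a, so dynamic programming
    can cache it instead of recomputing it for every (b, e) pair."""
    global calls
    calls += 1
    return P_J[a] * P_M[a]

# Naive enumeration evaluates this 2 (b) * 2 (e) * 2 (a) = 8 times;
# with memoization the body runs only twice (a = True, a = False).
for b in (True, False):
    for e in (True, False):
        for a in (True, False):
            evidence_term(a)

print(calls)
```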
Clustering
• Try to convert multiply connected networks to polytrees
Approximate Inference
It’s expensive to work with the full joint
distribution… whether as a table or as a Bayes
Network
Is approximation good enough?
Monte Carlo
Monte Carlo
Use samples to approximate solution
• Simulated annealing used Monte Carlo arguments to justify why
random guesses, and sometimes moving uphill, can lead to
optimality
More samples = better approximation
• How many are needed?
• Where should you take the samples?
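A toy sketch of the sample-count question: estimate P(X > 0.7) for X uniform on [0, 1] (true value 0.3) from samples, and watch the estimate tighten as n grows. The threshold and sample sizes here are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def estimate(n):
    """Monte Carlo estimate of P(uniform(0,1) > 0.7) from n samples."""
    hits = sum(1 for _ in range(n) if random.random() > 0.7)
    return hits / n

# More samples = better approximation (error shrinks roughly as 1/sqrt(n)):
for n in (100, 10_000, 1_000_000):
    print(n, estimate(n))
```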
Prior sampling
An ability to model the prior probabilities of a set
of random variables
Approximating true distribution
• As the number of samples grows, the sample frequencies converge
to the true joint distribution
Rejection sampling
Compute P(X | e)
• Use PriorSample (SPS) and create N samples
• Inspect each sample for the truth of e
• Keep a count for how many samples are consistent with e
• P(X | e) can be computed from count / N
Example
• P(Rain | Sprinkler = true)
• Use Bayes Net to generate 100 samples
– Suppose 73 have Sprinkler=false
– Suppose 27 have Sprinkler=true
8 have Rain=true
19 have Rain=false
• P(Rain | Sprinkler=true) = Normalize(<8, 19>) ≈ <0.3, 0.7>
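The example above can be sketched as code on the sprinkler network, assuming the usual textbook CPT values (WetGrass is omitted here since it is neither queried nor observed); the exact answer is <0.3, 0.7>.

```python
import random

def prior_sample(rng):
    """One prior sample of (Sprinkler, Rain) from the sprinkler network."""
    c = rng.random() < 0.5                       # P(Cloudy)
    s = rng.random() < (0.1 if c else 0.5)       # P(Sprinkler | Cloudy)
    r = rng.random() < (0.8 if c else 0.2)       # P(Rain | Cloudy)
    return {"Sprinkler": s, "Rain": r}

rng = random.Random(2)
consistent = rain_true = 0
for _ in range(200_000):
    sample = prior_sample(rng)
    if sample["Sprinkler"]:                      # reject samples inconsistent with e
        consistent += 1
        rain_true += sample["Rain"]

print(rain_true / consistent)                    # ≈ 0.3
```

Note how many samples are thrown away: only about 30% are consistent with Sprinkler=true, which previews the weakness discussed next.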
Problems with rejection sampling
• Standard deviation of the error in probability is proportional to
1/sqrt(n), where n is the number of samples consistent with
evidence
• As problems become complex, number of samples
consistent with evidence becomes small and it becomes
harder to construct accurate estimates
Likelihood weighting
We only want to generate samples that are
consistent with the evidence, e
• We’ll sample the Bayes Net, but we won’t let every random
variable be sampled; evidence variables are forced to their
observed values
Example
P (Rain | Sprinkler=true, WetGrass=true)
Example
P (Rain | Sprinkler=true, WetGrass=true)
• First, the weight w is set to 1.0
Example
Notice that the weight is reduced according to how likely each
evidence variable’s observed value is given its parents
• So final probability is a function of what comes from sampling the free
variables while constraining the evidence variables
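The example above can be sketched as code on the sprinkler network (usual textbook CPT values assumed): Sprinkler and WetGrass are fixed to true, only Cloudy and Rain are sampled, and the weight picks up P(evidence value | parents) at each evidence variable.

```python
import random

W_TABLE = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.0}   # P(WetGrass | S, R)

def weighted_sample(rng):
    """One likelihood-weighted sample for query P(Rain | s, w)."""
    w = 1.0
    c = rng.random() < 0.5                     # sample Cloudy from P(C)
    w *= 0.1 if c else 0.5                     # Sprinkler fixed true: w *= P(s | c)
    r = rng.random() < (0.8 if c else 0.2)     # sample Rain from P(R | c)
    w *= W_TABLE[(True, r)]                    # WetGrass fixed true: w *= P(w | s, r)
    return r, w

rng = random.Random(3)
num = den = 0.0
for _ in range(200_000):
    r, w = weighted_sample(rng)
    den += w
    if r:
        num += w
print(num / den)                               # ≈ 0.32
```

Every sample contributes, unlike rejection sampling; low-probability evidence just shows up as small weights.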
Comparing techniques
• In likelihood weighting, attention is paid to evidence variables
before samples are collected
• In rejection sampling, evidence variables are considered after
the sampling
• Likelihood weighting’s sampling distribution is not the true
posterior, P(z | e), because the sampled variables ignore evidence
among z’s non-ancestors
Likelihood weighting
• Uses all the samples
• As the number of evidence variables increases, most samples
receive very small weights and estimate quality drops
Markov chain Monte Carlo (MCMC)
• Imagine being in a current state
– An assignment to all the random variables
• The next state is selected according to random sample of
one of the nonevidence variables, Xi
– Conditioned on the current values of the variables in the
current state
• MCMC wanders around state space, flipping one variable at
a time while keeping evidence variables fixed
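A sketch of this wandering on the sprinkler network (usual textbook CPT values assumed), in the Gibbs-sampling style: evidence Sprinkler=true, WetGrass=true stays fixed, and each step resamples the nonevidence variables Cloudy and Rain conditioned on their Markov blankets. This version sweeps both variables every step rather than picking one at random, a common variant.

```python
import random

def p_cloudy_given_rest(r):
    """P(C | s, r) ∝ P(C) P(s | C) P(r | C); returns P(C=true | rest)."""
    t = 0.5 * 0.1 * (0.8 if r else 0.2)
    f = 0.5 * 0.5 * (0.2 if r else 0.8)
    return t / (t + f)

def p_rain_given_rest(c):
    """P(R | c, s, w) ∝ P(R | c) P(w | s, R); returns P(R=true | rest)."""
    t = (0.8 if c else 0.2) * 0.99
    f = (0.2 if c else 0.8) * 0.90
    return t / (t + f)

rng = random.Random(4)
c, r = True, True                  # arbitrary initial state
rain_count = 0
STEPS = 100_000
for _ in range(STEPS):
    c = rng.random() < p_cloudy_given_rest(r)   # flip Cloudy
    r = rng.random() < p_rain_given_rest(c)     # flip Rain
    rain_count += r
print(rain_count / STEPS)          # ≈ 0.32, matching likelihood weighting
```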
Cool method of operation
• The sampling process settles into a “dynamic equilibrium” in
which the long-run fraction of time spent in each state is
exactly proportional to its posterior probability
• Let q(x → x’) = probability of transitioning from state x to x’
• A Markov chain is a sequence of state transitions according
to q( ) functions
pt(x) measures the probability of being in state x after t steps
Markov chains
pt+1(x) = probability of being in x after t+1 steps
• If pt = pt+1 we have reached a stationary distribution
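A toy illustration of reaching a stationary distribution, using a hypothetical 2-state chain (the transition probabilities here are made up): iterate pt+1(x’) = Σx pt(x) q(x → x’) until pt stops changing.

```python
# Hypothetical 2-state transition probabilities q(x -> x').
Q = [[0.9, 0.1],    # q(0 -> 0), q(0 -> 1)
     [0.5, 0.5]]    # q(1 -> 0), q(1 -> 1)

p = [1.0, 0.0]      # p_0: start in state 0 with certainty
for t in range(100):
    # p_{t+1}(x') = sum over x of p_t(x) * q(x -> x')
    p = [sum(p[x] * Q[x][x2] for x in range(2)) for x2 in range(2)]

print(p)            # converges to the stationary distribution <5/6, 1/6>
```

For this chain the stationary distribution solves p = pQ: p(0) = 0.9 p(0) + 0.5 p(1), giving p(0) = 5 p(1), i.e. <5/6, 1/6>.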