CS 188: Artificial Intelligence
Fall 2009
Lecture 17: Bayes Nets IV
10/27/2009
Dan Klein – UC Berkeley
Announcements

Midterms
  In glookup; back at end of lecture
  Class did really well!

Assignments
  W3 out later this week
  P4 out next week
  Contest!
Approximate Inference
Simulation has a name: sampling

Sampling is a hot topic in machine learning, and it's really simple

Basic idea:
  Draw N samples from a sampling distribution S
  Compute an approximate posterior probability
  Show this converges to the true probability P

Why sample?
  Learning: get samples from a distribution you don't know
  Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Prior Sampling
[Figure: the sprinkler Bayes net. Cloudy → Sprinkler, Cloudy → Rain, and Sprinkler, Rain → WetGrass, with CPTs:]

P(C):
  +c  0.5
  -c  0.5

P(S|C):
  +c:  +s 0.1,  -s 0.9
  -c:  +s 0.5,  -s 0.5

P(R|C):
  +c:  +r 0.8,  -r 0.2
  -c:  +r 0.2,  -r 0.8

P(W|S,R):
  +s, +r:  +w 0.99,  -w 0.01
  +s, -r:  +w 0.90,  -w 0.10
  -s, +r:  +w 0.90,  -w 0.10
  -s, -r:  +w 0.01,  -w 0.99
Samples:
+c, -s, +r, +w
-c, +s, -r, +w
…
Prior Sampling
This process generates samples with probability
  $S_{PS}(x_1,\ldots,x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i)) = P(x_1,\ldots,x_n)$
…i.e., the BN's joint probability.
Let $N_{PS}(x_1,\ldots,x_n)$ be the number of samples of an event. Then
  $\lim_{N\to\infty} \hat{P}(x_1,\ldots,x_n) = \lim_{N\to\infty} \frac{N_{PS}(x_1,\ldots,x_n)}{N} = S_{PS}(x_1,\ldots,x_n) = P(x_1,\ldots,x_n)$
I.e., the sampling procedure is consistent.
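As a concrete sketch (the dictionary encoding of the CPTs above is mine, not course code), prior sampling just walks the network in topological order, sampling each variable given its already-sampled parents:

```python
import random

# CPTs of the sprinkler network above, as {value: probability} dictionaries.
P_c = {'+c': 0.5, '-c': 0.5}
P_s = {'+c': {'+s': 0.1, '-s': 0.9}, '-c': {'+s': 0.5, '-s': 0.5}}
P_r = {'+c': {'+r': 0.8, '-r': 0.2}, '-c': {'+r': 0.2, '-r': 0.8}}
P_w = {('+s', '+r'): {'+w': 0.99, '-w': 0.01},
       ('+s', '-r'): {'+w': 0.90, '-w': 0.10},
       ('-s', '+r'): {'+w': 0.90, '-w': 0.10},
       ('-s', '-r'): {'+w': 0.01, '-w': 0.99}}

def sample_from(dist):
    """Draw one value from a {value: probability} distribution."""
    roll, cumulative = random.random(), 0.0
    for value, p in dist.items():
        cumulative += p
        if roll < cumulative:
            return value
    return value  # guard against floating-point rounding

def prior_sample():
    """One sample: visit variables parents-first, sampling from each CPT."""
    c = sample_from(P_c)
    s = sample_from(P_s[c])
    r = sample_from(P_r[c])
    w = sample_from(P_w[(s, r)])
    return c, s, r, w
```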
Example
We’ll get a bunch of samples from the BN:
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
[Figure: the sprinkler network again: C → S, C → R; S, R → W]
If we want to know P(W)
We have counts <+w:4, -w:1>
Normalize to get P(W) = <+w:0.8, -w:0.2>
This will get closer to the true distribution with more samples
Can estimate anything else, too
What about P(C| +w)? P(C| +r, +w)? P(C| -r, -w)?
Fast: can use fewer samples if less time (what’s the drawback?)
Rejection Sampling
Let’s say we want P(C)
No point keeping all samples around
Just tally counts of C as we go
Let's say we want P(C | +s)
  Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
  This is called rejection sampling
  It is also consistent for conditional probabilities (i.e., correct in the limit)
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
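A minimal sketch of this procedure for P(C | +s), reusing prior_sample and the CPT dictionaries from the earlier sketch (the function name is mine):

```python
def rejection_sample_C(n=10000):
    """Estimate P(C | S=+s): tally C, rejecting samples where S != +s."""
    counts = {'+c': 0, '-c': 0}
    for _ in range(n):
        c, s, r, w = prior_sample()
        if s != '+s':
            continue  # reject: this sample disagrees with the evidence
        counts[c] += 1
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

Note the drawback from the previous slide: only the samples that happen to draw +s survive; everything else is wasted work.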
Sampling Example
There are 2 cups:
  The first contains 1 penny and 1 quarter
  The second contains 2 quarters
Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
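The slide leaves the answer to the class; for reference, one worked Bayes'-rule check (notation mine):

$$P(\text{cup 2} \mid \text{quarter}) = \frac{P(\text{quarter} \mid \text{cup 2})\,P(\text{cup 2})}{P(\text{quarter})} = \frac{1 \cdot \tfrac{1}{2}}{\tfrac{1}{2}\cdot\tfrac{1}{2} + 1\cdot\tfrac{1}{2}} = \frac{2}{3}$$

The other coin is a quarter exactly when the quarter came from the all-quarters cup, so the answer is 2/3. This mirrors rejection sampling: simulate cup-then-coin draws, throw away the penny draws, and tally which cup the surviving samples came from.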
Likelihood Weighting
Problem with rejection sampling:
  If evidence is unlikely, you reject a lot of samples
  You don't exploit your evidence as you sample
  Consider P(B | +a) in a Burglary → Alarm net: prior sampling yields mostly rejected samples:
    -b, -a    -b, -a    -b, -a    -b, -a    +b, +a

Idea: fix evidence variables and sample the rest; now every sample agrees with the evidence:
    -b, +a    -b, +a    -b, +a    -b, +a    +b, +a

Problem: sample distribution not consistent!
Solution: weight each sample by the probability of the evidence given its parents
Likelihood Weighting
[Figure: the same sprinkler network and CPTs as on the Prior Sampling slide; here the evidence variables are fixed while the remaining variables are sampled]
Samples:
+c, +s, +r, +w
…
Likelihood Weighting
Sampling distribution if z sampled and e fixed evidence:
  $S_{WE}(z, e) = \prod_{i=1}^{l} P(z_i \mid \mathrm{Parents}(Z_i))$

Now, samples have weights:
  $w(z, e) = \prod_{i=1}^{m} P(e_i \mid \mathrm{Parents}(E_i))$

Together, the weighted sampling distribution is consistent:
  $S_{WE}(z, e) \cdot w(z, e) = \prod_i P(z_i \mid \mathrm{Parents}(Z_i)) \cdot \prod_i P(e_i \mid \mathrm{Parents}(E_i)) = P(z, e)$
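A sketch of likelihood weighting on the sprinkler network for P(C | S=+s, W=+w), again reusing the CPT dictionaries from the prior-sampling sketch (the evidence choice and names are mine, for illustration):

```python
def likelihood_weighted_C(n=10000):
    """Estimate P(C | S=+s, W=+w): fix the evidence, sample the rest,
    and weight each sample by P(evidence | its parents)."""
    weights = {'+c': 0.0, '-c': 0.0}
    for _ in range(n):
        c = sample_from(P_c)      # sampled
        s = '+s'                  # evidence: fixed, contributes P(+s | c)
        r = sample_from(P_r[c])   # sampled
        w = '+w'                  # evidence: fixed, contributes P(+w | s, r)
        weights[c] += P_s[c][s] * P_w[(s, r)][w]
    total = sum(weights.values())
    return {value: wt / total for value, wt in weights.items()}
```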
Likelihood Weighting
Likelihood weighting is good:
  We have taken evidence into account as we generate the sample
  E.g. here, W's value will get picked based on the evidence values of S, R
  More of our samples will reflect the state of the world suggested by the evidence

Likelihood weighting doesn't solve all our problems:
  Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence)
  We would like to consider evidence when we sample every variable
Markov Chain Monte Carlo*
Idea: instead of sampling from scratch, create samples that are each like the last one.

Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(B | +c), the chain might move
  [+b, +a, +c] → [-b, +a, +c] → [-b, -a, +c] → …

Properties: now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!

What's the point: both upstream and downstream variables condition on evidence.
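A minimal Gibbs-sampling sketch, shown on the sprinkler network for P(C | S=+s, W=+w) rather than the slide's burglary example (the conditionals, initialization, and burn-in choice are mine):

```python
def gibbs_C(n=10000, burn_in=1000):
    """Estimate P(C | S=+s, W=+w): start from an arbitrary state, then
    repeatedly resample each non-evidence variable (C, then R) conditioned
    on all the others, keeping S=+s and W=+w fixed."""
    s, w = '+s', '+w'
    c, r = '+c', '+r'  # arbitrary initial values for the non-evidence variables
    counts = {'+c': 0, '-c': 0}
    for t in range(n + burn_in):
        # Resample C: P(c | s, r, w) is proportional to P(c) P(s|c) P(r|c)
        dist = {cv: P_c[cv] * P_s[cv][s] * P_r[cv][r] for cv in ('+c', '-c')}
        z = sum(dist.values())
        c = sample_from({cv: p / z for cv, p in dist.items()})
        # Resample R: P(r | c, s, w) is proportional to P(r|c) P(w|s,r)
        dist = {rv: P_r[c][rv] * P_w[(s, rv)][w] for rv in ('+r', '-r')}
        z = sum(dist.values())
        r = sample_from({rv: p / z for rv, p in dist.items()})
        if t >= burn_in:  # discard the early, initialization-biased samples
            counts[c] += 1
    return {value: count / n for value, count in counts.items()}
```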
Decision Networks
MEU: choose the action which maximizes the expected utility given the evidence

Can directly operationalize this with decision networks:
  Bayes nets with nodes for utility and actions
  Lets us calculate the expected utility for each action

New node types:
  Chance nodes (just like BNs)
  Actions (rectangles, cannot have parents, act as observed evidence)
  Utility node (diamond, depends on action and chance nodes)

[Figure: Umbrella (action) and Weather → Forecast (chance) feed the utility node U]
[DEMO: Ghostbusters]
Decision Networks
Action selection:
  Instantiate all evidence
  Set action node(s) each possible way
  Calculate posterior for all parents of utility node, given the evidence
  Calculate expected utility for each action
  Choose maximizing action

[Figure: the Umbrella / Weather / Forecast decision network again]
Example: Decision Networks
[Figure: Umbrella (action) and Weather (chance) feed the utility node U]

P(W):
  W     P(W)
  sun   0.7
  rain  0.3

U(A,W):
  A      W     U(A,W)
  leave  sun   100
  leave  rain  0
  take   sun   20
  take   rain  70

Umbrella = leave:  EU(leave) = 0.7 · 100 + 0.3 · 0 = 70
Umbrella = take:   EU(take) = 0.7 · 20 + 0.3 · 70 = 35

Optimal decision = leave
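The same computation in a few lines of Python (the table encoding is mine):

```python
def expected_utility(action, belief, U):
    """EU(a) = sum over outcomes w of P(w) * U(a, w)."""
    return sum(p * U[(action, w)] for w, p in belief.items())

P_W = {'sun': 0.7, 'rain': 0.3}
U = {('leave', 'sun'): 100, ('leave', 'rain'): 0,
     ('take', 'sun'): 20, ('take', 'rain'): 70}

for a in ('leave', 'take'):
    print(a, expected_utility(a, P_W, U))  # leave 70.0, take 35.0
```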
Evidence in Decision Networks
Find P(W | F=bad)

[Figure: the same decision network; the Forecast node is observed, and we select the entries consistent with the evidence]

P(W):
  W     P(W)
  sun   0.7
  rain  0.3

P(F|W):
  W     P(F=good|W)  P(F=bad|W)
  sun   0.8          0.2
  rain  0.1          0.9

First we join P(W) and P(bad|W):
  W     P(W, F=bad)
  sun   0.7 · 0.2 = 0.14
  rain  0.3 · 0.9 = 0.27

Then we normalize (by 0.14 + 0.27 = 0.41):
  W     P(W | F=bad)
  sun   0.34
  rain  0.66
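The join-then-normalize step as a sketch, reusing P_W from the earlier snippet (the P_F_given_W encoding is mine):

```python
def posterior_given_forecast(P_W, P_F_given_W, f):
    """P(W | F=f): join P(W) with P(f|W), then normalize."""
    joint = {w: P_W[w] * P_F_given_W[w][f] for w in P_W}  # join
    z = sum(joint.values())                               # this is P(F=f)
    return {w: p / z for w, p in joint.items()}           # normalize

P_F_given_W = {'sun':  {'good': 0.8, 'bad': 0.2},
               'rain': {'good': 0.1, 'bad': 0.9}}
print(posterior_given_forecast(P_W, P_F_given_W, 'bad'))
# {'sun': 0.341..., 'rain': 0.658...}
```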
Example: Decision Networks
[Figure: the decision network with Forecast = bad observed]

P(W | F=bad):
  W     P(W|F=bad)
  sun   0.34
  rain  0.66

U(A,W):
  A      W     U(A,W)
  leave  sun   100
  leave  rain  0
  take   sun   20
  take   rain  70

Umbrella = leave:  EU(leave | F=bad) = 0.34 · 100 + 0.66 · 0 = 34
Umbrella = take:   EU(take | F=bad) = 0.34 · 20 + 0.66 · 70 = 53

Optimal decision = take
[Demo]
Conditioning on Action Nodes
An action node can be a parent of a chance node:
  Chance node conditions on the outcome of the action
  Action nodes are like observed variables in a Bayes' net, except we max over their values

[Figure: state S and action A feed successor state S' via the transition model T(s,a,s'); S, A, and S' feed the utility node U via the reward R(s,a,s')]
Value of Information
Idea: compute value of acquiring evidence
  Can be done directly from decision network

Example: buying oil drilling rights
  Two blocks A and B, exactly one has oil, worth k
  Prior probabilities 0.5 each, and mutually exclusive
  Drilling in either A or B has MEU = k/2
  Fair price of drilling rights: k/2

Question: what's the value of information?
  Value of knowing which of A or B has oil
  Value is expected gain in MEU from new info
  Survey may say "oil in A" or "oil in B", prob 0.5 each
  If we know OilLoc, MEU is k (either way)
  Gain in MEU from knowing OilLoc?
  VPI(OilLoc) = k/2
  Fair price of information: k/2

[Figure: DrillLoc (action) and OilLoc (chance) feed the utility node U]

P(OilLoc):
  O   P
  a   1/2
  b   1/2

U(D,O):
  D   O   U
  a   a   k
  a   b   0
  b   a   0
  b   b   k
Value of Perfect Information
Current evidence E=e, utility depends on S=s
Potential new evidence E’: suppose we knew E’ = e’
BUT E’ is a random variable whose value is currently unknown, so:
Must compute expected gain over all possible values
(VPI = value of perfect information)
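In formulas (the standard definition; the notation is mine):

$$\mathrm{MEU}(e) = \max_a \sum_s P(s \mid e)\, U(s, a)$$
$$\mathrm{MEU}(e, e') = \max_a \sum_s P(s \mid e, e')\, U(s, a)$$
$$\mathrm{VPI}(E' \mid e) = \Big(\sum_{e'} P(e' \mid e)\, \mathrm{MEU}(e, e')\Big) - \mathrm{MEU}(e)$$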
VPI Example: Weather
[Figure: the Umbrella / Weather / Forecast decision network]

U(A,W):
  A      W     U(A,W)
  leave  sun   100
  leave  rain  0
  take   sun   20
  take   rain  70

Forecast distribution:
  F     P(F)
  good  0.59
  bad   0.41

MEU with no evidence: MEU() = EU(leave) = 70
MEU if forecast is bad: MEU(F=bad) = EU(take | F=bad) = 53
MEU if forecast is good: MEU(F=good) = EU(leave | F=good) ≈ 95

VPI(F) ≈ 0.59 · 95 + 0.41 · 53 − 70 ≈ 7.8
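Putting the pieces together, a sketch that computes VPI(F) from the earlier snippets' P_W, P_F_given_W, U, and posterior_given_forecast (all names mine):

```python
def vpi_of_forecast(P_W, P_F_given_W, U, actions=('leave', 'take')):
    """VPI(F) = (sum over f of P(f) * MEU(F=f)) - MEU(no evidence)."""
    def meu(belief):  # best expected utility under a belief over W
        return max(sum(p * U[(a, w)] for w, p in belief.items())
                   for a in actions)
    # Forecast marginal: P(f) = sum over w of P(w) P(f|w)
    P_F = {f: sum(P_W[w] * P_F_given_W[w][f] for w in P_W)
           for f in ('good', 'bad')}
    expected_meu = sum(
        P_F[f] * meu(posterior_given_forecast(P_W, P_F_given_W, f))
        for f in P_F)
    return expected_meu - meu(P_W)

print(vpi_of_forecast(P_W, P_F_given_W, U))
# ≈ 7.7 exactly; the slide's 7.8 comes from rounding the MEUs to 95 and 53
```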
VPI Properties
Nonnegative in expectation
Nonadditive (consider, e.g., obtaining E_j twice)
Order-independent
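Stated in formulas (the standard versions of these properties; notation mine):

$$\forall E', e:\ \mathrm{VPI}(E' \mid e) \ge 0$$
$$\mathrm{VPI}(E_j, E_k \mid e) \ne \mathrm{VPI}(E_j \mid e) + \mathrm{VPI}(E_k \mid e)$$
$$\mathrm{VPI}(E_j, E_k \mid e) = \mathrm{VPI}(E_j \mid e) + \mathrm{VPI}(E_k \mid e, e_j) = \mathrm{VPI}(E_k \mid e) + \mathrm{VPI}(E_j \mid e, e_k)$$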
Quick VPI Questions
The soup of the day is either clam chowder or split pea, but you wouldn't order either one. What's the value of knowing which it is?

If you have $10 to bet and the odds are 3 to 1 that Berkeley will beat Stanford, what's the value of knowing the outcome in advance, assuming you can make a fair bet for either Cal or Stanford?

What if you are morally obligated not to bet against Cal, but you can refrain from betting?