Bayesian Modeling - Washington and Lee University

CSCI 121 Special Topic:
Bayesian Modeling
Uncertainty
Classical models of reasoning (logic) use “all-or-nothing” (discrete) variables and rules:
Hungry(Fido)
Toothache(Simon)
Toothache(X) → Cavity(X)
• Reality is usually more complicated:
Toothache(X) → Cavity(X)     70% of the time
Toothache(X) → Gingivitis(X) 20% of the time
Uncertainty
In general, all-or-nothing rules fail for three reasons:
1. Laziness – we don't have enough time or resources to
list all such rules for a given domain.
2. Theoretical ignorance - we don't have a complete
theory of the domain.
3. Practical ignorance – even with all rules and perfect
theory, we can't make the necessary observations.
Conventional Models of Uncertainty (Statistics) Are Failing Us
Online commentators cited my mother as an example of why no parent should hire a nanny. (In fact, parents and other family members are responsible for nearly eighty percent of cases involving shaken-baby syndrome.)
Bayesian Reasoning
[C]onsider a situation in which painstaking survey work has previously established that in the general population only 1% of subjects abuse a certain dangerous drug. Suppose that a person is randomly selected from [the] population for a drug test and the test yields a positive result. Suppose that the test has a 99% hit rate and a 5% false alarm rate. [How certain are we that the person is abusing the drug?]
Basic Probability
• Probability Distribution: All possible values of a
given variable, and their probabilities (sum = 1):
• cavity = 0.8; gingivitis = 0.1; abscess = 0.05; ? = 0.05
• Joint probability: How likely is it that two things
occur (are observed) together?
• rainy & cloudy = 0.3; cloudy & cool = 0.4
Axioms of Probability
1) All probabilities are between 0 and 1.
2) Necessarily true propositions (A V ~A) have
prob 1; necessarily false (A & ~A) have prob. 0.
3) P(A V B) = P(A) + P(B) – P(A & B)
Axioms of Probability
P(A V B) = P(A) + P(B) – P(A & B)
[Venn diagram: overlapping circles A and B]
E.g., in Los Angeles, maybe P(sunny) = 0.8; P(warm) = 0.7. Since a probability can never exceed 1, we can't just add 0.8 + 0.7 to get P(sunny V warm). We need to subtract P(sunny & warm) = 0.6 to get P(sunny V warm) = 0.9.
Bayes’ Rule
• From the Product Rule:
  P(A|B)   = P(A & B) / P(B)
  P(A & B) = P(A|B) * P(B)
           = P(B|A) * P(A)
• We derive Bayes’ Rule by substitution:
  P(A|B) = P(A & B) / P(B)
         = P(B|A) * P(A) / P(B)
Rev. Thomas Bayes
(1702-1761)
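As a quick sanity check (a sketch, not part of the original slides), Bayes’ Rule can be written as a one-line Python function and applied to the Los Angeles numbers from the earlier slide: P(warm|sunny) = P(sunny & warm) / P(sunny) = 0.6 / 0.8 = 0.75, so Bayes’ Rule should recover P(sunny|warm) = 0.6 / 0.7 ≈ 0.857.

```python
# A minimal sketch of Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B)
def bayes_rule(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Los Angeles numbers from the previous slide:
# P(warm|sunny) = 0.6 / 0.8 = 0.75, P(sunny) = 0.8, P(warm) = 0.7
print(bayes_rule(0.75, 0.8, 0.7))   # P(sunny|warm) ≈ 0.857 (= 0.6 / 0.7)
```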
Bayesian (“Belief”) Nets (Pearl 1982)
[Network structure: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]

Burglary: P(B) = .001
Earthquake: P(E) = .002

Alarm:
  B  E  | P(A)
  T  T  | .95
  T  F  | .94
  F  T  | .29
  F  F  | .001

JohnCalls:
  A | P(J)
  T | .90
  F | .05

MaryCalls:
  A | P(M)
  T | .70
  F | .01
Bayesian Nets
● Using recently developed techniques (Pearl 1982), we can ask, e.g., “how likely is there to be a burglary, given that John has called?”
● Can also learn relationships, creating “hidden” variables and probability tables, based on observations.
● Current “hot topic” in AI.
[Same network: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]

Burglary:
  P(B=F)  P(B=T)
  .999    .001

Earthquake:
  P(E=F)  P(E=T)
  .998    .002

Alarm:
  B  E  | P(A=F)  P(A=T)
  F  F  | .999    .001
  T  F  | .06     .94
  F  T  | .71     .29
  T  T  | .05     .95

JohnCalls:
  A | P(J=F)  P(J=T)
  F | .95     .05
  T | .10     .90

MaryCalls:
  A | P(M=F)  P(M=T)
  F | .99     .01
  T | .30     .70
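One way to make these tables concrete (a sketch, not from the slides) is to store each node’s table as a plain Python dict and multiply the relevant entries to get the probability of a complete assignment:

```python
# A sketch of the burglary network's tables as Python dicts.
P_B = {True: 0.001, False: 0.999}                  # P(Burglary)
P_E = {True: 0.002, False: 0.998}                  # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm=T | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls=T | A)
P_M = {True: 0.70, False: 0.01}                    # P(MaryCalls=T | A)

def joint(b, e, a, j, m):
    """P(B, E, A, J, M): the product of one entry from each node's table."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# e.g. P(B=T, E=F, A=T, J=T, M=T) = .001 * .998 * .94 * .90 * .70
print(joint(True, False, True, True, True))
```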
Why Use Bayes Nets?
• Why not just use joint prob. (can be used to
answer any query)?
B  E  A  J  M  | P
F  F  F  F  F  | P1
...
T  T  T  T  T  | Pm
• Table size is exponential (m = 2^n entries for n vars.)
• Network (graph) is sparse; only encodes local
dependencies.
• n * 2^k entries, where k = avg. # inputs to node.
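As a concrete check (a sketch; the per-node row counts below are read off the burglary network’s tables, not stated on the slide):

```python
# Size comparison for the 5-variable burglary network:
# full joint table vs. the per-node conditional tables.
n = 5
joint_entries = 2 ** n                                # 32 rows in the joint table
cpt_rows = {'Burglary': 2 ** 0, 'Earthquake': 2 ** 0,  # no parents
            'Alarm': 2 ** 2,                           # two parents (B, E)
            'JohnCalls': 2 ** 1, 'MaryCalls': 2 ** 1}  # one parent (A)
print(joint_entries, sum(cpt_rows.values()))           # 32 vs. 10
```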
Types of Inferences / Queries
• Diagnostic: P(B | J) = 0.02
• Causal: P(J | B) = 0.85
• Intercausal: P(B | J & M & ~E) = 0.34
• Explaining Away: A university accepts a student for being “brainy” and/or “sporty” (athletic). If an accepted student is brainy, are they also sporty? (Berkson's Paradox / selection bias)
[Network structure: Brainy and Sporty are parents of College]

Brainy: P(B=F) = .5, P(B=T) = .5
Sporty: P(S=F) = .5, P(S=T) = .5

College:
  B  S  | P(C=F)  P(C=T)
  F  F  | 1       0
  T  F  | 0       1
  F  T  | 0       1
  T  T  | 0       1

>> brainy
P(B=true|C=true) = 0.667
P(B=true|C=true,S=true) = 0.500
http://www.cs.ubc.ca/~murphyk/Bayes/brainy.m
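The linked brainy.m script does this calculation in MATLAB; the following Python sketch is an assumed equivalent that reproduces the two numbers above by brute-force enumeration of the joint distribution:

```python
# Explaining away: enumerate the joint over Brainy (B), Sporty (S), College (C).
from itertools import product

P_C = {(False, False): 0.0, (True, False): 1.0,   # P(College=T | Brainy, Sporty)
       (False, True): 1.0, (True, True): 1.0}

def joint(b, s, c):
    pc = P_C[(b, s)] if c else 1 - P_C[(b, s)]
    return 0.5 * 0.5 * pc                          # P(B) = P(S) = 0.5

def prob(query, evidence):
    """P(query | evidence); both are dicts over the variables 'B', 'S', 'C'."""
    num = den = 0.0
    for b, s, c in product([True, False], repeat=3):
        world = {'B': b, 'S': s, 'C': c}
        p = joint(b, s, c)
        if all(world[k] == v for k, v in evidence.items()):
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({'B': True}, {'C': True}))              # ≈ 0.667
print(prob({'B': True}, {'C': True, 'S': True}))   # 0.500
```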
Answering Queries
• E.g., compute P(B|A)
• Product Rule tells us that P(B|A) = P(B&A) / P(A)
• Problem: the only values we can obtain directly from
the probability tables are the priors and some joint
conditionals. For P(B&A) and P(A), the values are
confounded with other variables (E).
• Solution: we can marginalize (sum out) the other
variables to focus on the ones we care about
• First, we must convert the tables to joint probabilities of
all relevant variables:
Answering Queries:
Marginalization
P(B) = .001    P(E) = .002

B  E  | P(A)
T  T  | .95
T  F  | .94
F  T  | .29
F  F  | .001

B  E  A  | Prob
T  T  T  | .001 * .002 * .95  = .000001900
T  T  F  | .001 * .002 * .05  = .000000100
T  F  T  | .001 * .998 * .94  = .000938120
T  F  F  | .001 * .998 * .06  = .000059880
F  T  T  | .999 * .002 * .29  = .000579420
F  T  F  | .999 * .002 * .71  = .001418580
F  F  T  | .999 * .998 * .001 = .000997002
F  F  F  | .999 * .998 * .999 = .996004998
Answering Queries:
Marginalization
Now we get P(A) by summing over rows where
A = True:
B  E  A  | Prob
T  T  T  | .000001900
T  T  F  | .000000100
T  F  T  | .000938120
T  F  F  | .000059880
F  T  T  | .000579420
F  T  F  | .001418580
F  F  T  | .000997002
F  F  F  | .996004998

P(A) = .000001900 + .000938120 + .000579420 + .000997002
     = .002516442
Answering Queries:
Marginalization
● Next, we need to compute P(B&A)
● Again, we marginalize:

B  E  A  | Prob
T  T  T  | .000001900
T  T  F  | .000000100
T  F  T  | .000938120
T  F  F  | .000059880
F  T  T  | .000579420
F  T  F  | .001418580
F  F  T  | .000997002
F  F  F  | .996004998

P(B&A) = .000001900 + .000938120 = .000940020

So P(B|A) = .000940020 / .002516442 = .373551228
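The same marginalization can be written as a short enumeration sketch (not from the slides); it reproduces P(A), P(B&A), and P(B|A) above:

```python
# Enumerate the joint over B, E, A, then sum out the variables we don't need.
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

joint = {}
for b, e, a in product([True, False], repeat=3):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    joint[(b, e, a)] = P_B[b] * P_E[e] * pa

p_a  = sum(p for (b, e, a), p in joint.items() if a)        # marginalize out B, E
p_ba = sum(p for (b, e, a), p in joint.items() if b and a)  # marginalize out E

print(p_a)          # ≈ 0.002516442
print(p_ba)         # ≈ 0.000940020
print(p_ba / p_a)   # P(B|A) ≈ 0.3736
```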
Marginalization:
Revisiting the Drug Test Example
Suppose that the test has a 99% hit rate and a 5% false alarm rate. If we ignore the prior knowledge, we would conclude that there is at least a 95% chance that the tested person abuses the drug. But if we take into account the strong prior knowledge, then we conclude that there is only a 17% chance that the person abuses the drug.

P(A|B) = P(B|A) * P(A) / P(B)
0.17  ≈ 0.99 * 0.01 / P(B)
P(B) = .0594    Why?
Marginalization:
Revisiting the Drug Test Example
P(B=T | A=T) = 0.99    (hit)
P(B=F | A=T) = 0.01    (miss)
P(B=T | A=F) = 0.05    (false positive)
P(B=F | A=F) = 0.95    (ordinary situation)
Marginalization:
Revisiting the Drug Test Example
P(B=T | A=T) = 0.99    (hit)
P(B=F | A=T) = 0.01    (miss)
P(B=T | A=F) = 0.05    (false positive)
P(B=F | A=F) = 0.95    (ordinary situation)

A  B  | P
T  T  | (.01)(.99)    hit
T  F  | (.01)(.01)    miss
F  T  | (.99)(.05)    false positive
F  F  | (.99)(.95)    ordinary situation
Marginalization:
Revisiting the Drug Test Example
P(B=T) = (.01)(.99) + (.99)(.05) = .0594

A  B  | P
T  T  | (.01)(.99)
T  F  | (.01)(.01)
F  T  | (.99)(.05)
F  F  | (.99)(.95)
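A short sketch of the same arithmetic (variable names are illustrative, not from the slides); here A = “abuses the drug” and B = “test is positive”:

```python
# Marginalize to get P(positive test), then apply Bayes' Rule.
p_a = 0.01                     # prior: 1% of the population abuses the drug
hit_rate = 0.99                # P(B=T | A=T)
false_alarm = 0.05             # P(B=T | A=F)

p_b = p_a * hit_rate + (1 - p_a) * false_alarm   # 0.0594
posterior = hit_rate * p_a / p_b                 # P(A=T | B=T)
print(p_b, posterior)                            # 0.0594, ≈ 0.167 (the "17% chance")
```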
Picking the Likeliest Situation
P(A|B) = P(B|A) * P(A) / P(B)
Picking the Likeliest Situation
P(A|B) = P(B|A) * P(A) / P(B)
P(A=T|B=T) = P(B=T|A=T) * P(A=T) / P(B=T)
P(A=T|B=F) = P(B=F|A=T) * P(A=T) / P(B=F)
P(A=F|B=T) = P(B=T|A=F) * P(A=F) / P(B=T)
P(A=F|B=F) = P(B=F|A=F) * P(A=F) / P(B=F)
Picking the Likeliest Hypothesis
P(A|B) = P(B|A) * P(A) / P(B)
P(A=T|B=T) = P(B=T|A=T) * P(A=T) / P(B=T)
P(A=T|B=F) = P(B=F|A=T) * P(A=T) / P(B=F)
P(A=F|B=T) = P(B=T|A=F) * P(A=F) / P(B=T)
P(A=F|B=F) = P(B=F|A=F) * P(A=F) / P(B=F)
P(A=x | B=y) = P(B=y | A=x) * P(A=x) / P(B=y)
Picking the Likeliest Hypothesis
P(A=x | B=y) = P(B=y | A=x) * P(A=x) / P(B=y)
             = [1 / P(B=y)] * [P(B=y | A=x) * P(A=x)]
• Once we’ve observed that B = y (John calls; robot sensor detects light), we can treat [1 / P(B=y)] as a constant:
P(A=x | B=y) = k * [P(B=y | A=x) * P(A=x)]
• Then determine the value of x that maximizes P(A=x | B=y)
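A minimal sketch of this argmax, using the drug-test numbers; map_value is a hypothetical helper, not an API from the slides:

```python
# Since 1/P(B=y) is the same for every candidate x, we only need to compare
# P(B=y | A=x) * P(A=x) across values of x.
def map_value(prior, likelihood, y):
    """prior: P(A=x); likelihood: P(B=y | A=x) keyed by (y, x); returns best x."""
    return max(prior, key=lambda x: likelihood[(y, x)] * prior[x])

# Drug-test numbers (A = abuses the drug, B = test result):
prior = {True: 0.01, False: 0.99}
likelihood = {(True, True): 0.99, (True, False): 0.05,    # P(B=T | A=x)
              (False, True): 0.01, (False, False): 0.95}  # P(B=F | A=x)

print(map_value(prior, likelihood, True))   # False: one positive test isn't enough
```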
Answering Queries: Problems
Difficult if graph is not singly connected (a.k.a. polytree):

[Diagram: two example graphs over nodes A, B, C, D]
Multiply connected: more than one path from A to D. P(D|A) = ?
Singly connected (polytree): just one path from A to D.
Dealing with Multiply Connected
Graphs
• Three basic options:
1) Clustering – group “offending” nodes into
“meganodes”
2) Conditioning – set variables to definite values;
then build a polytree for each combo
3) Stochastic simulation – Generate a large
number of concrete models consistent with the
domain.
• Clustering seems to be the most popular.
Clustering with the Junction-Tree
Algorithm (Huang & Darwiche 1994)
0) Note that a BN is a directed acyclic graph (DAG): remove
direction arrows
1) “Moralize” the DAG: For parents A, B of node C, draw an
edge between A and B. Then remove arrows (undirected
graph).
[Diagram: the example DAG on nodes A, B, C, D, E, F, G, H, shown before and after moralization]
Clustering with the Junction-Tree
Algorithm
2) Triangulate the moral graph so that every cycle of
length ≥ 4 contains an edge between nonadjacent
nodes. Use a heuristic based on minimal # of edges
added, minimal # possible values.
[Diagram: the moral graph on nodes A through H, shown before and after triangulation]
Clustering with the Junction-Tree
Algorithm
3) Build cliques from triangulated graph:
[Diagram: the triangulated graph on nodes A through H]
Cliques: ABD, ACE, ADE, CEG, DEF, EGH
Clustering with the Junction-Tree
Algorithm
4) Connect cliques by separation sets to form the
junction tree:
ABD -(AD)- ADE -(AE)- ACE -(CE)- CEG
ADE -(DE)- DEF
CEG -(EG)- EGH
(each pair of adjacent cliques is joined by its separation set)
Message-Passing
● Sepset potentials are initialized via marginalization.
● When evidence is presented (A = False), heuristically pick a “root clique” and pass messages around the tree:
[Junction tree as above: ABD -(AD)- ADE -(AE)- ACE -(CE)- CEG, with ADE -(DE)- DEF and CEG -(EG)- EGH]
Marginalization: Details
• At this point, each cluster (clique; meganode)
has a joint probability table.
• To query a variable, we (heuristically) pick a
cluster containing it, and marginalize over the
joint probability Φ from the table:
A  B  D  | ΦABD
T  T  T  | .225
T  T  F  | .025
T  F  T  | .125
T  F  F  | .125
F  T  T  | .180
F  T  F  | .020
F  F  T  | .150
F  F  F  | .150

P(D) = Σ over A,B of ΦABD:

D | P(D)
T | .225 + .125 + .180 + .150 = .680
F | .025 + .125 + .020 + .150 = .320
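A sketch of this sum in Python (the potential values are the ones in the table above):

```python
# Marginalize a clique potential: sum Φ_ABD over A and B to get P(D).
phi_abd = {
    (True,  True,  True): .225, (True,  True,  False): .025,
    (True,  False, True): .125, (True,  False, False): .125,
    (False, True,  True): .180, (False, True,  False): .020,
    (False, False, True): .150, (False, False, False): .150,
}

p_d = {True: 0.0, False: 0.0}
for (a, b, d), p in phi_abd.items():
    p_d[d] += p

print(p_d)   # approximately {True: 0.68, False: 0.32}
```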
Message-Passing: Details
● Messages are passed from clique X to Y through sepset R by multiplication and division of table entries.
● Evidence is set by “masking” table entries with a bit vector (or probability distribution). E.g., observe B = F:

A  B  D  | ΦABD
T  T  T  | 0
T  T  F  | 0
T  F  T  | .125
T  F  F  | .125
F  T  T  | 0
F  T  F  | 0
F  F  T  | .150
F  F  F  | .150
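A sketch of the masking step in Python, using the same Φ_ABD values as before:

```python
# Set evidence by masking: zero out the rows of the clique potential that are
# inconsistent with the observation B = False.
phi_abd = {  # same table as in the previous sketch
    (True,  True,  True): .225, (True,  True,  False): .025,
    (True,  False, True): .125, (True,  False, False): .125,
    (False, True,  True): .180, (False, True,  False): .020,
    (False, False, True): .150, (False, False, False): .150,
}
mask = {True: 0.0, False: 1.0}   # "bit vector" over the values of B
masked = {(a, b, d): p * mask[b] for (a, b, d), p in phi_abd.items()}

print(masked[(True, False, True)], masked[(True, True, True)])   # 0.125, 0.0
```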
Learning in Bayes Nets
● So far we’ve been using models (networks) rather than building them.
● What if we had a lot of data but no explicit model?
● Machine Learning!
● The Holy Grail of science: Completely automated model building
Situation #1: Known Structure,
All Variables Observable
[Network diagram: Burglary, Earthquake, Alarm, John calls, and Mary calls, each shown with a column of observed values (F, F, T, ...)]
Solution: Build probability tables directly from
observations.
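A sketch of what “build probability tables directly from observations” amounts to, using made-up data for the Alarm → JohnCalls table (the data and the 2/3 estimate are illustrative only):

```python
# With every variable observed, each conditional probability table entry is
# just a ratio of counts in the data.
from collections import Counter

# Hypothetical observations: (alarm, john_calls) pairs.
data = [(False, False), (False, False), (True, True), (True, False), (True, True)]

counts = Counter(data)
alarm_true = sum(c for (a, j), c in counts.items() if a)
p_j_given_a = counts[(True, True)] / alarm_true   # estimate of P(J=T | A=T)
print(p_j_given_a)                                # 2/3 for this made-up data
```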
Situation #2: Known Structure,
Some Variables Unobservable
[Network diagram: same as above, except the Alarm column is unobserved (?, ?, ?, ...)]
Solution: Bayesian learning through Maximum
Likelihood Estimation
Bayesian Learning
• Given: Data D, Hypotheses H1, H2, ... Hn
• Want: Prediction for unknown quantity X
• E.g., D = almanac for past 100 years
Hi = chance of rain in May = 50%
X = how much rain tomorrow?
P(X | D) = Σi P(X | Hi) P(Hi | D)
Bayesian Learning
P(X | D) = Σi P(X | Hi) P(Hi | D)
• Maximum a posteriori (MAP) hypothesis
HMAP: The Hi that maximizes P(Hi | D)
• Use Bayes' Rule :
P(Hi | D) = P(D | Hi) P(Hi) / P(D)
Bayesian Learning
P(Hi | D) = P(D | Hi) P(Hi) / P(D)
• For a given set of hypotheses, P(D) is fixed.
• So we are maximizing P(D | Hi) P(Hi).
• P(Hi) is a philosophical issue: e.g., Ockham's
Razor (simplest consistent hypothesis is best)
• So it all comes down to P(D | Hi): Maximum
Likelihood Estimation
Maximum Likelihood Estimation
• For many problems, P(D | Hi) can’t be determined
analytically (in one step, using algebra).
• In such cases, iterative gradient methods can be
used to explore the space of possibilities.
• For example, each Hi might be a set of conditional
probability table values (a.k.a. weights)
• We can visualize the effect of different weight
values via a “hill-climbing” metaphor
• Mathematically, this is called gradient descent and
involves …calculus!
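The slides frame this as gradient descent (on the negative log-likelihood); the toy sketch below does the equivalent gradient ascent on the log-likelihood of a single made-up table entry. The data, step size, and variable names are all illustrative assumptions:

```python
# Estimate a single table entry theta = P(J=T | A=T) by gradient ascent on the
# log-likelihood of made-up observations.
data = [True, True, False, True, True, True]   # John called in 5 of 6 alarm cases

def log_likelihood_gradient(theta, data):
    # d/dtheta of sum(log P(d | theta)) for Bernoulli observations
    return sum((1 / theta) if d else (-1 / (1 - theta)) for d in data)

theta = 0.5                                    # initial guess
for _ in range(1000):
    theta += 0.001 * log_likelihood_gradient(theta, data)
    theta = min(max(theta, 0.001), 0.999)      # keep theta a valid probability

print(theta)   # converges near 5/6, the count-based estimate
```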
Maximum Likelihood Estimation:
Robot Localization Problem
http://www.cs.washington.edu/robotics/mcl/animations/global-floor.gif
Situation #3: Unknown Structure
[Diagram: columns of observed values (F, F, T, ...) for Burglary, Earthquake, John calls, and Mary calls, with no network structure given]
Solution: Hidden variables + structure learning
Learning Structure:
“Hidden” Variables
• Hidden variables: Unknown internal factors
that we posit to explain the relationship between
observed input and output
• Why not just figure out conditional probabilities
directly from observables (burglary, earthquake)
to observables (John calls, Mary calls)?
Why Hidden Variables?
Hidden variables can yield a more compact
model, making learning easier:
Learning Structure
• Problem: the number G(N) of possible graphs over N nodes grows explosively with N:

N = 1: G(N) = 1  (a single node A)
N = 2: G(N) = 3  (A and B with no edge, A → B, or B → A)

N    | G(N)
1    | 1
2    | 3
3    | 25
...  |
10   | 4.2 * 10^18
• Solution: Local (greedy hill-climbing) or
global (Monte Carlo) algorithms
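For reference (background knowledge, not stated on the slides): the G(N) values above match the count of labeled DAGs on N nodes, which Robinson's recurrence computes. A short sketch:

```python
# Robinson's recurrence for the number of labeled DAGs on n nodes.
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n):
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print([num_dags(n) for n in (1, 2, 3)])   # [1, 3, 25]
print(num_dags(10))                       # ≈ 4.2 * 10^18, matching the table
```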