Transcript old-04

Bayesian Networks
Background Readings: An Introduction to Bayesian Networks, Finn
Jensen, UCL Press, 1997.
Some slides have been edited from Nir Friedman's lectures, which are available at
www.cs.huji.ac.il/~nir.
Changes made by Dan Geiger.
The “Visit-to-Asia” Example
[Figure: the Asia network. Nodes: Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, Dyspnea.]
What are the relevant variables and their dependencies?
Verifying the (in)Dependencies
[Figure: the Asia network with nodes V, S, T, L, B, A, X, D.]

We can now ask the expert: do the following assertions hold?
• I( S; V )
• I( T; S | V )
• I( L; {T, V} | S )
• …
• I( X; {V, S, T, L, B, D} | A )
Alternative verification: does each variable become independent of the rest, given its Markov boundary?
Take-Home Question: Are other variable construction orders as good?
Quantifying the Bayesian Network
[Figure: the Asia network annotated with local distributions p(v), p(s), p(t|v), p(l|s), p(b|s), p(a|t,l), p(x|a), p(d|a,b).]

P(v, s, t, l, b, a, x, d) = P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Bayesian network = Directed Acyclic Graph (DAG),
annotated with conditional probability distributions.
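As a concrete illustration, here is a minimal sketch that evaluates this product for the Asia network. Only the p(a|t,l) entries come from the CPT on the next slide; every other number is an illustrative placeholder, not a value from the slides.

```python
# Joint probability in the Asia network as a product of local distributions.
# Only p_a_given_tl matches the CPT on the next slide; all other numbers
# are illustrative placeholders.

p_v = {True: 0.01, False: 0.99}                    # P(v): visit to Asia
p_s = {True: 0.50, False: 0.50}                    # P(s): smoking
p_t_given_v = {True: 0.05, False: 0.01}            # P(T=yes | v)
p_l_given_s = {True: 0.10, False: 0.01}            # P(L=yes | s)
p_b_given_s = {True: 0.60, False: 0.30}            # P(B=yes | s)
p_a_given_tl = {(False, False): 0.02, (True, False): 0.60,
                (False, True): 0.99, (True, True): 0.99}   # P(A=yes | t, l)
p_x_given_a = {True: 0.98, False: 0.05}            # P(X=yes | a)
p_d_given_ab = {(False, False): 0.10, (True, False): 0.70,
                (False, True): 0.80, (True, True): 0.90}   # P(D=yes | a, b)

def bern(p_yes, value):
    """P(variable = value) for a binary variable with the given P(yes)."""
    return p_yes if value else 1.0 - p_yes

def joint(v, s, t, l, b, a, x, d):
    """P(v,s,t,l,b,a,x,d) as the product of the eight local distributions."""
    return (p_v[v] * p_s[s]
            * bern(p_t_given_v[v], t)
            * bern(p_l_given_s[s], l)
            * bern(p_b_given_s[s], b)
            * bern(p_a_given_tl[(t, l)], a)
            * bern(p_x_given_a[a], x)
            * bern(p_d_given_ab[(a, b)], d))
```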
Local distributions
[Figure: T (Yes/No) and L (Yes/No) are parents of A (Yes/No), which carries p(A|T,L).]
Conditional Probability Table:
p(A=y | L=n, T=n) = 0.02
p(A=y | L=n, T=y) = 0.60
p(A=y | L=y, T=n) = 0.99
p(A=y | L=y, T=y) = 0.99
The last two rows agree: asymmetric (context-specific) independence in the CPT.
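A two-line check of this pattern, with the CPT stored as a dictionary keyed by (L, T):

```python
# Context-specific ("asymmetric") independence: once L = y, the two rows
# for T agree, so A is independent of T in that context only.
p_a = {("n", "n"): 0.02, ("n", "y"): 0.60,    # keys are (L, T)
       ("y", "n"): 0.99, ("y", "y"): 0.99}

print(p_a[("y", "n")] == p_a[("y", "y")])   # True : A independent of T when L = y
print(p_a[("n", "n")] == p_a[("n", "y")])   # False: A depends on T when L = n
```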
Queries
There are several types of queries. Most queries involve evidence.
Evidence e is an assignment of values to a set E of variables in the domain.
Example (a posteriori belief): P(D = yes | V = yes).
In general: P(H = h | E = e), where H and E are subsets of the variables.
This is equivalent to computing P(H = h, E = e) and then dividing by P(E = e).
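A brute-force sketch of this computation, reusing the factored joint() from the earlier sketch. Enumeration is fine for eight binary variables, though hopeless in general (see the complexity slides below).

```python
from itertools import product

# P(H = h | E = e) = P(H = h, E = e) / P(E = e), each term summing the
# factored joint over all unobserved variables.

VARS = ["v", "s", "t", "l", "b", "a", "x", "d"]

def prob(evidence):
    """Sum of joint(...) over all full assignments consistent with `evidence`."""
    total = 0.0
    for values in product([False, True], repeat=len(VARS)):
        assignment = dict(zip(VARS, values))
        if all(assignment[var] == val for var, val in evidence.items()):
            total += joint(**assignment)
    return total

def posterior(h, e):
    """P(h | e) computed as P(h, e) / P(e)."""
    return prob({**h, **e}) / prob(e)

# Example query: P(D = yes | V = yes)
print(posterior({"d": True}, {"v": True}))
```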
A posteriori belief

[Figure: the Asia network with nodes V, S, T, L, B, A, X, D.]

This query is useful in many cases:
• Prediction: what is the probability of an outcome given the starting condition?
• Diagnosis: what is the probability of disease/fault given symptoms?
Example: Predictive+Diagnostic
Probabilistic inference can combine evidence from all parts of the network, diagnostic and predictive, regardless of the directions of edges in the model.

P(T = Yes | Visit_to_Asia = Yes, Dyspnea = Yes)

[Figure: the Asia network with nodes V, S, T, L, B, A, X, D; evidence at V and D, query at T.]
Queries: MAP

• Find the maximum a posteriori assignment for some variables of interest (say H1,…,Hl).
• That is, find h1,…,hl that maximize the conditional probability P(h1,…,hl | e).
• This is equivalent to maximizing the joint P(h1,…,hl, e).
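A minimal sketch of MAP by enumeration, reusing the prob() helper from the previous sketch; it exploits exactly the equivalence above, since P(e) is constant over the candidate assignments.

```python
from itertools import product

def map_assignment(hypothesis_vars, evidence):
    """Return the h1,...,hl maximizing P(h1,...,hl, e), hence P(h1,...,hl | e)."""
    best_p, best_h = -1.0, None
    for values in product([False, True], repeat=len(hypothesis_vars)):
        h = dict(zip(hypothesis_vars, values))
        p = prob({**h, **evidence})        # proportional to P(h | e)
        if p > best_p:
            best_p, best_h = p, h
    return best_h

# Example: most likely (T, L) given symptoms
print(map_assignment(["t", "l"], {"d": True, "x": True}))
```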
Queries: MAP
[Figure: a diagnosis network with disease nodes D1, D2, D3, D4 and symptom nodes S1, S2, S3, S4.]

We can use MAP for:
• Explanation:
  – What is the most likely joint event, given the evidence (e.g., a set of likely diseases given the symptoms)?
  – What is the most likely scenario, given the evidence (e.g., a series of likely malfunctions that trigger a fault)?

[Figure: a car-fault example with nodes Bad alternator, Bad magneto, Not charging, Bad battery, Dead battery.]
How Expressive are Bayesian Networks?
1. Check the diamond example via all boundary bases.
2. The following property holds for d-separation but does not hold for conditional independence:
   ID(X, {}, Y) and ID(X, γ, Y) ⇒ ID(X, {}, γ) or ID(γ, {}, Y)
Markov Networks that represent probability distributions (rather than just independence)

1. For each (maximal) clique Ci, define a nonnegative function g(Ci), called the compatibility function.
2. Take the product Πi g(Ci) over all cliques.
3. Define P(X1,…,Xn) = K · Πi g(Ci), where K is a normalizing constant (the inverse of the sum of the product over all assignments).
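A minimal sketch of steps 1-3 on a toy chain X1 - X2 - X3, whose maximal cliques are {X1,X2} and {X2,X3}; the compatibility values are illustrative assumptions.

```python
from itertools import product

g12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # g(C1) on (x1, x2)
g23 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 1.0}   # g(C2) on (x2, x3)

def unnormalized(x1, x2, x3):
    """Step 2: the product of the compatibility functions over all cliques."""
    return g12[(x1, x2)] * g23[(x2, x3)]

# Step 3: K is the inverse of the sum of the product over all assignments.
K = 1.0 / sum(unnormalized(*xs) for xs in product([0, 1], repeat=3))

def P(x1, x2, x3):
    return K * unnormalized(x1, x2, x3)

# Sanity check: the normalized product is a probability distribution.
assert abs(sum(P(*xs) for xs in product([0, 1], repeat=3)) - 1.0) < 1e-12
```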
The two males and females example
Theorem 6 [Hammersley and Clifford 1971]: If a probability function P is formed by a normalized product of nonnegative functions on the cliques of G, then G is an I-map of P.
Proof: It suffices to show (Theorem 5) that the neighborhood basis of
G holds in P.
Namely, show that I(α, BG(α), U − α − BG(α)) holds in P, or just that:

P(α, BG(α), U − α − BG(α)) = f1(α, BG(α)) · f2(U − α)    (*)

Let J stand for the set of indices marking all cliques in G that include α. Then

P(x1,…,xn) = K · Π{j∈J} g(Cj) · Π{j∉J} g(Cj) = f1(α, BG(α)) · f2(U − α)

The first product contains only variables adjacent to α, because each Cj with j∈J is a clique containing α. The second product does not contain α. Hence (*) holds.
Note: The theorem and its converse hold also for extreme probabilities, but the presented proof does not apply, due to the use of Intersection in Theorem 5.
Theorem X: Every undirected graph G has a distribution P such
that G is a perfect map of P.
(In light of previous notes, it must have the form of a product
over cliques).
Drawback: Interpreting the Links is
not simple
Another drawback is the difficulty with extreme probabilities: there is no local test for I-mapness.
Both drawbacks disappear in the class of decomposable models, which are a special case of Bayesian networks.
Decomposable Models
Example: Markov Chains and Markov Trees
Assume the following chain is an I-map of some P(x1, x2, x3, x4) and was constructed using the methods we just described:

[Figure: the chain X1 — X2 — X3 — X4.]

The "compatibility functions" on all links can be easily interpreted in the case of chains: here P(x1, x2, x3, x4) = P(x1, x2) P(x2, x3) P(x3, x4) / (P(x2) P(x3)), so each link carries the pairwise marginal of its endpoints.
The same holds for trees, and this idea actually works for all chordal graphs.
Chordal Graphs
Interpretation of the links
[Figure: three overlapping cliques, Clique 1, Clique 2, Clique 3.]

A probability distribution that can be written as a product of low-order marginals (on the cliques) divided by a product of low-order marginals (on the clique intersections) is said to be decomposable.
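A small numeric check of this definition on a three-variable chain (cliques {X1,X2}, {X2,X3}, separator {X2}), with assumed local probabilities: the joint equals the product of clique marginals divided by the separator marginal.

```python
from itertools import product

p1 = {0: 0.3, 1: 0.7}                               # P(x1), assumed
p2_1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}   # P(x2 | x1), assumed
p3_2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P(x3 | x2), assumed

def P(x1, x2, x3):
    """A chain-structured joint: P(x1) P(x2|x1) P(x3|x2)."""
    return p1[x1] * p2_1[x1][x2] * p3_2[x2][x3]

def marg(keep):
    """Marginal over the variable positions listed in `keep`."""
    out = {}
    for xs in product([0, 1], repeat=3):
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + P(*xs)
    return out

m12, m23, m2 = marg([0, 1]), marg([1, 2]), marg([1])
for xs in product([0, 1], repeat=3):
    # P(x1,x2,x3) = P(x1,x2) P(x2,x3) / P(x2) on a chain
    decomposed = m12[(xs[0], xs[1])] * m23[(xs[1], xs[2])] / m2[(xs[1],)]
    assert abs(P(*xs) - decomposed) < 1e-12
```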
Importance of Decomposability
When assigning compatibility functions, it suffices to use marginal probabilities on cliques, making sure they are locally consistent (neighboring cliques agree on the marginals of shared variables). Marginals can be assessed from experts or estimated directly from data.
The Diamond Example – The
smallest non-chordal graph
Adding one more link turns the graph into a chordal one.
Turning a general undirected graph into a chordal graph in some optimal way is the key to all exact computations done on Markov and Bayesian networks.
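A quick illustration of the diamond, assuming the networkx library is available; adding a single chord makes the 4-cycle chordal.

```python
import networkx as nx

# The diamond: a 4-cycle with no chord, hence not chordal.
diamond = nx.Graph([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])
print(nx.is_chordal(diamond))     # False

diamond.add_edge("B", "C")        # add one chord
print(nx.is_chordal(diamond))     # True
```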
Extra Slides with more details
If time allows
Complexity of Inference
Theorem: Computing P(X = x) in a Bayesian network is NP-hard.

Main idea: conditional probability tables with zeros and ones are equivalent to logical gates. Hence a reduction from 3-SAT is the easiest to pursue.
Proof
We reduce 3-SAT to Bayesian network computation.
Assume we are given a 3-SAT problem:
• Q1,…,Qn are propositions,
• φ1,…,φk are clauses, such that φi = ℓi1 ∨ ℓi2 ∨ ℓi3, where each ℓij is a literal over Q1,…,Qn (e.g., Q1 = true),
• φ = φ1 ∧ … ∧ φk.
We will construct a Bayesian network s.t. P(X = t) > 0 iff φ is satisfiable.
[Figure: the construction. The propositions Q1,…,Qn point into the clause nodes φ1,…,φk, which feed a chain of AND gates A1, A2,…, Ak−2 ending in X.]

• P(Qi = true) = 0.5
• P(φI = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause φI
• A1, A2, … are simple binary AND gates
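A sketch of the construction for a tiny, made-up instance. Instead of materializing the CPTs, it enumerates the fair-coin assignments of the Qi and applies the 0/1 clause gates and the AND chain directly, which yields the same P(X = true).

```python
from itertools import product

n = 3
clauses = [(+1, -2, +3), (-1, +2, -3)]   # phi = (q1 v ~q2 v q3) ^ (~q1 v q2 v ~q3)

def clause_true(clause, assignment):
    """The 0/1 CPT of a clause node: 1 iff some literal is satisfied."""
    return any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)

def p_x_true():
    """P(X = true): each Qi is a fair coin; X is the AND of all clause nodes."""
    sat = sum(all(clause_true(c, a) for c in clauses)
              for a in product([0, 1], repeat=n))
    return sat / 2 ** n

print(p_x_true() > 0)    # True iff phi is satisfiable
```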
[Figure: the same construction as on the previous slide.]

It is easy to check:
• Polynomial number of variables.
• Each conditional probability table can be described by a small table (8 parameters at most).
• P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn.

Conclusion: a polynomial reduction from 3-SAT.
Inference is even #P-hard

• P(X = t) is the fraction of satisfying assignments to φ.
• Hence 2^n · P(X = t) is the number of satisfying assignments to φ.
• Thus, if we know how to compute P(X = t), we know how to count the number of satisfying assignments to φ.
• Consequently, computing P(X = t) is #P-hard.
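Continuing the previous sketch (n, clauses, and p_x_true are taken from there), the counting argument is one line:

```python
# 2^n * P(X = t) recovers the number of satisfying assignments of phi.
print(2 ** n * p_x_true())    # 6.0 for the two-clause example above
```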
Hardness - Notes

• We need not use deterministic relations in our construction.
• The construction shows that hardness holds even for graphs of small degree.
• Hardness does not mean we cannot do inference: it implies that we cannot find a general procedure that works efficiently for all networks.
• For particular families of networks, we can have provably efficient procedures (e.g., trees, HMMs).
• An example: variable elimination algorithms.
Proof of Theorem X
Given a graph G, it is sufficient to show that for an independence statement σ = I(α, Z, β) that does NOT hold in G, there exists a probability distribution that satisfies all independence statements that hold in the graph but does not satisfy σ = I(α, Z, β).

Well, simply pick a path in G between α and β that does not contain a node from Z. Define a probability distribution that is a perfect map of this chain and multiply it by arbitrary marginal probabilities on all other nodes, forming Pσ. Now "multiply" all the Pσ (an Armstrong relation) to obtain P.

Interesting task (replacing HW #4): given an undirected graph over binary variables, construct a perfect-map probability distribution. (Note: most Markov random fields are perfect maps!)
Interesting conclusion of Theorem X:
All independence statements that follow for strictly-positive
probability from the neighborhood basis are derivable via
symmetry, decomposition, intersection, and weak union.
These axioms are (sound and) complete for neighborhood bases.
These axioms are (sound and) complete also for pairwise bases.
In fact, for saturated statements, conditional independence and separation have the same characterization; see paper P2.
Chordal Graphs
Example of the Theorem
1. Each cycle has a chord.
2. There is a way to direct the edges legally, namely: A→B, A→C, B→C, B→D, C→D, C→E.
3. Legal removal order (e.g.): start with E, then D, then the rest.
4. The maximal cliques form a join (clique) tree.
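A sketch checking points 1 and 3 for this graph, assuming networkx for the chordality test; a removal order is legal when each removed vertex's remaining neighbors form a clique.

```python
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"),
              ("B", "D"), ("C", "D"), ("C", "E")])

def is_legal_removal_order(graph, order):
    """Check that each vertex's remaining neighbors form a clique when removed."""
    H = graph.copy()
    for v in order:
        nbrs = list(H[v])
        if not all(H.has_edge(u, w)
                   for i, u in enumerate(nbrs) for w in nbrs[i + 1:]):
            return False
        H.remove_node(v)
    return True

print(nx.is_chordal(G))                                          # True
print(is_legal_removal_order(G, ["E", "D", "C", "B", "A"]))      # True
```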