Transcript BN-04

COROLLARY 4: D is an I-map of P iff each variable X is
conditionally independent in P of all its non-descendants,
given its parents.
Proof : X is d-separated of all its non-descendants, given its
parents. Since D is an I-map, by the soundness theorem the
claim holds.
Proof : Each variable X is conditionally independent
of all its non-descendants, given its parents implies
using decomposition that it is also independent of its
predecessors in a particular order d.
1
COROLLARY 5: If D=(U,E) is a boundary DAG of P
constructed in some order d, then any topological order d’
of U will yield the same boundary DAG of P.
(Hence construction order can be forgotten).
Proof : Each variable X is d-separated from all its non-descendants, given its parents, in the boundary DAG of P.
In particular, due to decomposition, X is independent, given its parents, of all variables that precede it in any topological order d'.
Extension of the Markov Chain Property
I(Xk, Xk-1, {X1,…,Xk-2})  ⟹  I(Xk, {Xk-1, Xk+1}, {X1,…,Xk-2, Xk+2,…,Xn})
Holds due to the soundness theorem.
Converse holds when Intersection is assumed.
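As a quick sanity check of the forward direction, for a chain P(x1,…,xn) = P(x1) P(x2|x1) … P(xn|xn-1) one can compute directly:
P(xk | x1,…,xk-1, xk+1,…,xn) = P(xk|xk-1) P(xk+1|xk) / Σx'k P(x'k|xk-1) P(xk+1|x'k),
since every factor not involving xk cancels in the ratio; the result depends only on xk-1 and xk+1, which is exactly the blanket statement above.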
Markov Blankets in DAGs
Consequence: There is no improvement to d-separation and no
statement escapes graphical representation.
Reasoning: (1) If there were an independence statement σ not shown by d-separation, then σ must be true in all distributions
that satisfy the basis. But Theorem 10 states that there exists a distribution that satisfies the basis and its consequences but
violates σ. (2) Same argument. [Note that (2) is a stronger claim.]
Bayesian Networks
Background Readings: An Introduction to Bayesian Networks, Finn
Jensen, UCL Press, 1997.
Some slides have been edited from Nir Friedman’s lectures, which are available at www.cs.huji.ac.il/~nir.
Changes made by Dan Geiger.
The “Visit-to-Asia” Example
[Figure: DAG with nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, Dyspnea]
What are the relevant variables and their dependencies ?
Verifying the (in)Dependencies
[Figure: the Visit-to-Asia DAG with nodes V, S, T, L, B, A, X, D]
We can now ask the expert: do the following assertions hold?
- I(S; V)
- I(T; S | V)
- I(L; {T, V} | S)
- …
- I(X; {V, S, T, L, B, D} | A)
Alternative verification: Is each variable independent of the rest, given its Markov boundary?
Take-Home Question: Are other variable construction orders as good ?
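One way to verify such assertions mechanically is the standard ancestral-moral-graph test for d-separation. Below is a minimal sketch (not part of the original slides; it assumes the networkx library, and d_separated is a hypothetical helper name):

import networkx as nx

def d_separated(dag, xs, ys, zs):
    # d-separation via the ancestral moral graph:
    # restrict to ancestors of X, Y, Z; marry parents; drop directions; remove Z;
    # then X and Y are d-separated iff no undirected path connects them.
    xs, ys, zs = set(xs), set(ys), set(zs)
    relevant = xs | ys | zs
    for v in list(relevant):
        relevant |= nx.ancestors(dag, v)
    moral = nx.moral_graph(dag.subgraph(relevant))
    moral.remove_nodes_from(zs)
    return not any(nx.has_path(moral, x, y)
                   for x in xs if x in moral
                   for y in ys if y in moral)

# The Visit-to-Asia DAG from the slides.
G = nx.DiGraph([("V", "T"), ("S", "L"), ("S", "B"), ("T", "A"),
                ("L", "A"), ("A", "X"), ("A", "D"), ("B", "D")])
print(d_separated(G, {"S"}, {"V"}, set()))                           # I(S; V)                 -> True
print(d_separated(G, {"T"}, {"S"}, {"V"}))                           # I(T; S | V)             -> True
print(d_separated(G, {"X"}, {"V", "S", "T", "L", "B", "D"}, {"A"}))  # I(X; {V,S,T,L,B,D} | A) -> True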
Quantifying the Bayesian Network
[Figure: the Visit-to-Asia DAG annotated with local distributions p(v), p(s), p(t|v), p(l|s), p(b|s), p(a|t,l), p(x|a), p(d|a,b)]
P(v, s, t, l, b, a, x, d) = P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Bayesian network = Directed Acyclic Graph (DAG),
annotated with conditional probability distributions.
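As a minimal illustration of this factorization (hypothetical toy numbers, and only a three-variable fragment V -> T -> D rather than the full network), the joint probability is just a product of local CPT lookups:

p_v = {"yes": 0.01, "no": 0.99}                          # hypothetical P(v)
p_t_given_v = {("yes", "yes"): 0.05, ("no", "yes"): 0.95,
               ("yes", "no"): 0.01, ("no", "no"): 0.99}  # hypothetical P(t | v), keyed (t, v)
p_d_given_t = {("yes", "yes"): 0.80, ("no", "yes"): 0.20,
               ("yes", "no"): 0.10, ("no", "no"): 0.90}  # hypothetical P(d | t), keyed (d, t)

def joint(v, t, d):
    # P(v, t, d) = P(v) P(t | v) P(d | t), mirroring the factorization above
    return p_v[v] * p_t_given_v[(t, v)] * p_d_given_t[(d, t)]

print(joint("yes", "yes", "yes"))  # 0.01 * 0.05 * 0.80 = 0.0004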
Local distributions
[Figure: nodes T (Yes/No) and L (Yes/No), both parents of A (Yes/No), annotated with p(A|T,L)]
Conditional Probability Table:
p(A=y | L=n, T=n) = 0.02
p(A=y | L=n, T=y) = 0.60
p(A=y | L=y, T=n) = 0.99
p(A=y | L=y, T=y) = 0.99
Asymmetric independence in the CPT: given L=y, the probability of A does not depend on T.
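A direct check of that asymmetry on the table above (plain Python, dictionary keyed by (L, T)):

p_A_yes = {("n", "n"): 0.02, ("n", "y"): 0.60,
           ("y", "n"): 0.99, ("y", "y"): 0.99}  # p(A=y | L, T) from the table above

# In the context L = y the entry does not depend on T, so A is independent of T given L = y:
print(p_A_yes[("y", "n")] == p_A_yes[("y", "y")])  # True
# No such independence in the context L = n:
print(p_A_yes[("n", "n")] == p_A_yes[("n", "y")])  # False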
Queries
There are several types of queries.
Most queries involve evidence.
Evidence e is an assignment of values to a set E of variables in the domain.
Example (a posteriori belief): P(D = yes | V = yes)
Or in general: P(H = h | E = e), where H and E are subsets of variables.
Equivalent to computing P(H = h, E = e) and then dividing by P(E = e).
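A minimal sketch of this "compute the joint, then divide" recipe by brute-force enumeration, on the same hypothetical three-variable fragment V -> T -> D with toy numbers used earlier (real inference engines avoid enumerating the full joint):

p_v = {"yes": 0.01, "no": 0.99}
p_t_given_v = {("yes", "yes"): 0.05, ("no", "yes"): 0.95,
               ("yes", "no"): 0.01, ("no", "no"): 0.99}
p_d_given_t = {("yes", "yes"): 0.80, ("no", "yes"): 0.20,
               ("yes", "no"): 0.10, ("no", "no"): 0.90}

def joint(v, t, d):
    return p_v[v] * p_t_given_v[(t, v)] * p_d_given_t[(d, t)]

def posterior_d_given_v(d_val, v_val):
    # P(D = d | V = v) = P(D = d, V = v) / P(V = v); the hidden variable T is summed out
    numerator = sum(joint(v_val, t, d_val) for t in ("yes", "no"))
    denominator = sum(joint(v_val, t, d) for t in ("yes", "no") for d in ("yes", "no"))
    return numerator / denominator

print(posterior_d_given_v("yes", "yes"))  # P(D = yes | V = yes)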
A posteriori belief
[Figure: the Visit-to-Asia DAG with nodes V, S, T, L, B, A, X, D]
This query is useful in many cases:
- Prediction: what is the probability of an outcome given the starting condition?
- Diagnosis: what is the probability of disease/fault given symptoms?
Example: Predictive+Diagnostic
Probabilistic inference can combine evidence from all parts of the network, diagnostic and predictive, regardless of the directions of edges in the model.
P(T = Yes | Visit_to_Asia = Yes, Dyspnea = Yes)
[Figure: the Visit-to-Asia DAG with nodes V, S, T, L, B, A, X, D]
Queries: MAP
- Find the maximum a posteriori assignment for some variables of interest (say H1,…,Hl)
- That is, h1,…,hl maximize the conditional probability P(h1,…,hl | e)
- Equivalent to maximizing the joint P(h1,…,hl, e)
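A matching sketch of MAP by enumeration on the same hypothetical toy fragment V -> T -> D (maximizing the joint suffices because dividing by P(e) does not change the argmax):

p_v = {"yes": 0.01, "no": 0.99}
p_t_given_v = {("yes", "yes"): 0.05, ("no", "yes"): 0.95,
               ("yes", "no"): 0.01, ("no", "no"): 0.99}
p_d_given_t = {("yes", "yes"): 0.80, ("no", "yes"): 0.20,
               ("yes", "no"): 0.10, ("no", "no"): 0.90}

def joint(v, t, d):
    return p_v[v] * p_t_given_v[(t, v)] * p_d_given_t[(d, t)]

# Most likely assignment to the hidden variables (V, T) given the evidence D = yes:
map_assignment = max(((v, t) for v in ("yes", "no") for t in ("yes", "no")),
                     key=lambda vt: joint(vt[0], vt[1], "yes"))
print(map_assignment)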
Queries: MAP
[Figure: a two-layer network with disease nodes D1, D2, D3, D4 and symptom nodes S1, S2, S3, S4]
We can use MAP for:
- Explanation:
  - What is the most likely joint event, given the evidence (e.g., a set of likely diseases given the symptoms)?
  - What is the most likely scenario, given the evidence (e.g., a series of likely malfunctions that trigger a fault)?
[Figure: a fault network with nodes Bad alternator, Bad magneto, Not charging, Bad battery, Dead battery]
How Expressive are Bayesian Networks
1. Check the diamond example via all boundary
bases.
2. The following property holds for d-separation
but does not hold for conditional independence:
ID(X, {}, Y) and ID(X, {γ}, Y)  ⟹  ID(X, {}, γ) or ID(γ, {}, Y)   (γ a single variable)
Drawback: Interpreting the Links is not simple
Another drawback is the difficulty with extreme probabilities.
There is no local test for I-mapness.
Both drawbacks disappear in the class of decomposable models, which are a special case of Bayesian networks.
Decomposable Models
Example: Markov Chains and Markov Trees
Assume the following chain is an I-map of some P(x1, x2, x3, x4) and was constructed using the methods we just described:
[Figure: chain X1 - X2 - X3 - X4]
The “compatibility functions” on all links can be easily interpreted in the case of chains.
Same also for trees. This idea actually works for all chordal graphs.
Chordal Graphs
Interpretation of the links
[Figure: a graph covered by three overlapping cliques, Clique 1, Clique 2, Clique 3]
A probability distribution that can be written as a product
of low order marginals divided by a product of low order
marginals is said to be decomposable.
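For instance, for a Markov chain A - B - C - D with cliques {A,B}, {B,C}, {C,D} and separators {B}, {C}, decomposability reads:
P(a, b, c, d) = P(a, b) P(b, c) P(c, d) / ( P(b) P(c) ),
i.e., a product of clique marginals divided by a product of separator marginals.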
Importance of Decomposability
When assigning compatibility functions it suffices to
use marginal probabilities on cliques and just make
sure to be locally consistent. Marginals can be
assessed from experts or estimated directly from data.
The Diamond Example – The smallest non-chordal graph
[Figure: the diamond, a chordless cycle on four nodes]
Adding one more link (a chord) makes the graph chordal.
Turning a general undirected graph into a chordal graph in some
optimal way is the key for all exact computations done on Markov
and Bayesian networks.
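A quick illustration of the chordality claim (not from the slides; assumes the networkx library):

import networkx as nx

diamond = nx.Graph([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")])  # chordless 4-cycle
print(nx.is_chordal(diamond))   # False: the cycle A-B-D-C has no chord

diamond.add_edge("B", "C")      # add one more link (a chord)
print(nx.is_chordal(diamond))   # True: the graph is now chordal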
Complexity of Inference
Theorem:
Computing P(X = x) in a Bayesian network is NP-hard.
Main idea: conditional probability tables with zeros and ones are equivalent to logical gates.
Hence a reduction from 3-SAT is the easiest to pursue.
Proof
We reduce 3-SAT to Bayesian network computation
Assume we are given a 3-SAT problem:
- Let Q1,…,Qn be propositions,
- Let φ1,…,φk be clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over Q1,…,Qn (e.g., Q1 = true),
- Φ = φ1 ∧ … ∧ φk.
We will construct a Bayesian network s.t. P(X=t) > 0 iff Φ is satisfiable.
[Figure: the reduction network. Proposition nodes Q1, Q2, Q3, Q4, …, Qn point into clause nodes φ1, φ2, φ3, …, φk-1, φk, which are chained through AND-gate nodes A1, A2, …, Ak-2 into the output node X]
- P(Qi = true) = 0.5
- P(φi = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause φi
- A1, A2, …, are simple binary AND gates
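A minimal sketch of what the construction computes (hypothetical toy clauses, plain Python; the deterministic CPTs make P(X = t) the count of satisfying assignments divided by 2^n):

from itertools import product

n = 4
# Hypothetical toy formula: (Q1 or Q2 or not Q3) and (not Q1 or Q3 or Q4); literals as (index, sign).
clauses = [[(1, True), (2, True), (3, False)],
           [(1, False), (3, True), (4, True)]]

def clause_satisfied(assignment, clause):
    # mirrors P(phi_i = true | parents) = 1 iff the parents satisfy the clause
    return any(assignment[i] == sign for i, sign in clause)

p_x_true = 0.0
for bits in product([False, True], repeat=n):                     # P(Qi = true) = 0.5 for each i
    assignment = dict(zip(range(1, n + 1), bits))
    if all(clause_satisfied(assignment, c) for c in clauses):     # the AND gates and X
        p_x_true += 0.5 ** n

print(p_x_true > 0)              # Phi is satisfiable iff P(X = t) > 0
print(round(p_x_true * 2 ** n))  # 2^n * P(X = t) = number of satisfying assignments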
[Figure: the same reduction network with proposition nodes Q1,…,Qn, clause nodes φ1,…,φk, AND gates A1,…,Ak-2, and output X]
It is easy to check:
- Polynomial number of variables
- Each conditional probability table can be described by a small table (8 parameters at most)
- P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn
Conclusion: polynomial reduction of 3-SAT
Inference is even #P-hard
- P(X = t) is the fraction of satisfying assignments to Φ
- Hence 2^n P(X = t) is the number of satisfying assignments to Φ
- Thus, if we know how to compute P(X = t), we know how to count the number of satisfying assignments to Φ.
- Consequently, computing P(X = t) is #P-hard.
Hardness - Notes
- We need not use deterministic relations in our construction.
- The construction shows that hardness follows even for graphs of small degree.
- Hardness does not mean we cannot do inference:
  - It implies that we cannot find a general procedure that works efficiently for all networks.
  - For particular families of networks, we can have provably efficient procedures (e.g., trees, HMMs).
  - Variable elimination algorithms.
Extra Slides with more details
If time allows.
Chordal Graphs
Example of the Theorem
1. Each cycle has a chord.
2. There is a way to direct the edges legally, namely, A→B, A→C, B→C, B→D, C→D, C→E.
3. Legal removal order (e.g.): start with E, then D, then the rest.
4. The maximal cliques form a join (clique) tree.
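A small check of points 1 and 4 (not from the slides; assumes the networkx library):

import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"),
              ("B", "D"), ("C", "D"), ("C", "E")])
print(nx.is_chordal(G))   # True: every cycle of length > 3 has a chord
print(sorted(sorted(c) for c in nx.chordal_graph_cliques(G)))
# [['A', 'B', 'C'], ['B', 'C', 'D'], ['C', 'E']] : the maximal cliques that form the join (clique) tree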
Theorem X: Every undirected graph G has a distribution P such
that G is a perfect map of P.
Proof of Theorem X
Given a graph G, it is sufficient to show that for an independence statement σ = I(α, Z, β) that does NOT hold in G, there exists a
probability distribution that satisfies all independence statements that hold in the graph and does not satisfy σ = I(α, Z, β).
Well, simply pick a path in G between α and β that does not contain a node from Z. Define a probability distribution that is a perfect map
of this chain and multiply it by arbitrary marginal probabilities on all other nodes, forming Pσ. Now “multiply” all the Pσ (Armstrong relation) to
obtain P.
Interesting task (replacing HMW #4): Given an undirected graph over binary variables, construct a perfect map probability
distribution. (Note: most Markov random fields are perfect maps!)
Interesting conclusion of Theorem X:
All independence statements that follow for strictly-positive
probability from the neighborhood basis are derivable via
symmetry, decomposition, intersection, and weak union.
These axioms are (sound and) complete for neighborhood bases.
These axioms are (sound and) complete also for pairwise bases.
In fact, for saturated statements, conditional independence and separation have the same characterization. See paper P2.