Undirected Graphical Models


Undirected Graphical Models
Eran Segal
Weizmann Institute
Undirected Graphical Models


Useful when edge directionality cannot be assigned
Simpler interpretation of structure
Simpler inference
Simpler independency structure
Harder to learn
We will also see models with combined directed and undirected edges
Some computations require restriction to discrete variables
Reminder: Perfect Maps

G is a perfect map (P-Map) for P if I(P)=I(G)

Does every distribution have a P-Map?

No: some structures cannot be represented in a BN

Independencies in P: Ind(A;D | B,C), and Ind(B;C | A,D)
[Figure: two candidate Bayesian network structures over A, B, C, D]
In one structure, Ind(B;C | A,D) does not hold
In the other, Ind(A;D) also holds
Representing Dependencies

Ind(A;D | B,C), and Ind(B;C | A,D)


Cannot be modeled with a Bayesian network without
extraneous edges
Can be modeled with an undirected (Markov) network
Undirected Model (Informal)


Nodes correspond to random variables
Local factor models are attached to sets of nodes
  Factor elements are positive
  Do not have to sum to 1
  Represent affinities

Example factors over the network with edges A—B, A—C, B—D, C—D:

  φ1[A,C]:  a0 c0 → 4     a0 c1 → 12    a1 c0 → 2    a1 c1 → 9
  φ2[A,B]:  a0 b0 → 30    a0 b1 → 5     a1 b0 → 1    a1 b1 → 10
  φ3[C,D]:  c0 d0 → 30    c0 d1 → 5     c1 d0 → 1    c1 d1 → 1
  φ4[B,D]:  b0 d0 → 100   b0 d1 → 1     b1 d0 → 1    b1 d1 → 1000
Undirected Model (Informal)









Represents joint distribution

[Figure: Markov network over A, B, C, D]

Unnormalized factor:
  F(a,b,c,d) = φ1[a,c] · φ2[a,b] · φ3[c,d] · φ4[b,d]

Partition function:
  Z = Σ_{a,b,c,d} φ1[a,c] · φ2[a,b] · φ3[c,d] · φ4[b,d]

Probability:
  P(a,b,c,d) = (1/Z) φ1[a,c] · φ2[a,b] · φ3[c,d] · φ4[b,d]

As Markov networks represent joint distributions, they can be used for answering queries
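A minimal sketch (not from the slides) of these definitions, using the factor tables from the example above: the unnormalized measure F is the product of the pairwise factors, Z sums F over all joint assignments, and P divides by Z. Variable and factor names are illustrative.

```python
# Sketch: unnormalized measure, partition function, and probability for the
# four-variable example above (factor values taken from the slide tables).
from itertools import product

phi1 = {(0, 0): 4,   (0, 1): 12, (1, 0): 2, (1, 1): 9}      # phi1[A,C]
phi2 = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 10}     # phi2[A,B]
phi3 = {(0, 0): 30,  (0, 1): 5,  (1, 0): 1, (1, 1): 1}      # phi3[C,D]
phi4 = {(0, 0): 100, (0, 1): 1,  (1, 0): 1, (1, 1): 1000}   # phi4[B,D]

def F(a, b, c, d):
    """Unnormalized measure: product of the four pairwise factors."""
    return phi1[(a, c)] * phi2[(a, b)] * phi3[(c, d)] * phi4[(b, d)]

# Partition function: sum of F over all joint assignments.
Z = sum(F(a, b, c, d) for a, b, c, d in product([0, 1], repeat=4))

def P(a, b, c, d):
    """Normalized joint probability."""
    return F(a, b, c, d) / Z

# Example query: the marginal P(A=1), obtained by summing out B, C, D.
p_a1 = sum(P(1, b, c, d) for b, c, d in product([0, 1], repeat=3))
```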
Markov Network Structure

Undirected graph H


Nodes X1,…,Xn represent random variables
H encodes independence assumptions


A path X1,…,Xk is active if none of the Xi variables along the path are observed
X and Y are separated in H given Z if there is no active path between any node x∈X and any node y∈Y given Z
  Denoted sepH(X;Y|Z)

Example: D ⊥ {A,C} | B

[Figure: example undirected graph over A, B, C, D]

Global Markov assumptions: I(H) = {(X⊥Y|Z) : sepH(X;Y|Z)}
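A small sketch (assumptions: the graph is given as a dict of adjacency sets, and node/set names are illustrative) of the separation test sepH(X;Y|Z): X and Y are separated given Z if every path between them passes through an observed node.

```python
# Sketch: undirected separation via breadth-first search that never enters Z.
from collections import deque

def separated(adj, X, Y, Z):
    """Return True if no active path connects X and Y given the observed set Z."""
    Z = set(Z)
    frontier = deque(x for x in X if x not in Z)
    reached = set(frontier)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in Z and v not in reached:
                reached.add(v)
                frontier.append(v)
    return reached.isdisjoint(Y)

# Example: the 4-cycle A-B, A-C, B-D, C-D used earlier.
adj = {'A': {'B', 'C'}, 'B': {'A', 'D'}, 'C': {'A', 'D'}, 'D': {'B', 'C'}}
assert separated(adj, {'B'}, {'C'}, {'A', 'D'})      # sep_H(B; C | A, D) holds
assert not separated(adj, {'B'}, {'C'}, {'D'})       # path B-A-C is active given only D
```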
Relationship with Bayesian Network

Can all independencies encoded by Markov networks be encoded by Bayesian networks?
  No: e.g., Ind(A;B | C,D) and Ind(C;D | A,B)
Can all independencies encoded by Bayesian networks be encoded by Markov networks?
  No: immoral v-structures (explaining away)
Markov networks encode monotonic independencies
  If sepH(X;Y|Z) and Z⊆Z' then sepH(X;Y|Z')
Markov Network Factors

A factor is a function from value assignments of a set of random variables D to positive real numbers ℝ+
  The set of variables D is the scope of the factor
Factors generalize the notion of CPDs
  Every CPD is a factor (with additional constraints)
Markov Network Factors

Can we represent any joint distribution by using only factors that are defined on edges?
  No!
Example: binary variables
  Joint distribution has 2^n − 1 independent parameters
  Markov network with edge factors has 4·(n choose 2) parameters

[Figure: fully connected Markov network over A, B, C, D, E, F, G]

Edge parameters: 4·21 = 84
Needed: 127!
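A tiny sketch of the counting argument above (n = 7 as in the figure): a full joint over n binary variables needs 2^n − 1 independent parameters, while pairwise factors give only 4 entries per edge.

```python
# Sketch: parameter counts for n binary variables with pairwise factors only.
from math import comb

n = 7
joint_params = 2**n - 1          # 127
edge_params = 4 * comb(n, 2)     # 4 * 21 = 84, too few to match the joint
```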
Markov Network Factors

Are there constraints imposed on the network
structure H by a factor whose scope is D?


Hint 1: think of the independencies that must be satisfied
Hint 2: generalize from the basic case of |D|=2
The induced subgraph over D must be a clique (fully connected)
(otherwise two unconnected variables may be independent by
blocking the active path between them, contradicting the direct
dependency between them in the factor over D)
Markov Network Factors
[Figure: two undirected graphs over A, B, C, D]
Cycle A—B—C—D—A: maximal cliques {A,B}, {B,C}, {C,D}, {A,D}
Same graph with the chord A—C: maximal cliques {A,B,C}, {A,C,D}
Markov Network Distribution

A distribution P factorizes over H if it has:
  A set of subsets D1,...,Dm where each Di is a complete subgraph in H
  Factors φ1[D1],...,φm[Dm] such that
    P(X1,...,Xn) = (1/Z) f(X1,...,Xn)
  where:
    f(X1,...,Xn) = Πi φi[Di]
    Z = Σ_{X1,...,Xn} f(X1,...,Xn) = Σ_{X1,...,Xn} Πi φi[Di]
Z is called the partition function
P is also called a Gibbs distribution over H
Pairwise Markov Networks

A pairwise Markov network over a graph H has:
  A set of node potentials {φ[Xi] : i=1,...,n}
  A set of edge potentials {φ[Xi,Xj] : (Xi,Xj)∈H}
Example:

[Figure: 3×4 grid network with nodes X11,...,X34]
Logarithmic Representation

We represent energy potentials by applying a log transformation to the original potentials
  φ[D] = exp(−ε[D]), where ε[D] = −ln φ[D]
  Any Markov network parameterized with factors can be converted to a logarithmic representation
  The log-transformed potentials can take on any real value
  The joint distribution decomposes as
    P(X1,...,Xn) = (1/Z) exp(−Σ_{i=1}^{m} εi[Di])
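A minimal sketch of the log (energy) transformation, reusing one of the factors from the earlier example; the factor name is an assumption for illustration only.

```python
# Sketch: eps[D] = -ln(phi[D]), so exp(-eps[D]) recovers the original potential
# and the joint becomes (1/Z) * exp(-sum_i eps_i[D_i]).
import math

phi2 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # potential over (A, B)
eps2 = {ab: -math.log(v) for ab, v in phi2.items()}         # energy potential

# The transformation is exact.
assert all(abs(math.exp(-eps2[ab]) - phi2[ab]) < 1e-9 for ab in phi2)
```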

I-Maps and Factorization

I-Map
  I(P) denotes the set of independencies (X ⊥ Y | Z) that hold in P; a graph is an I-map of P if all the independencies it encodes hold in P
Bayesian Networks
  Factorization and reverse factorization theorems
  G is an I-map of P iff P factorizes as P(X1,...,Xn) = Π_{i=1}^{n} P(Xi | Pa(Xi))
Markov Networks
  Factorization and reverse factorization theorems
  H is an I-map of P iff P factorizes as P(X1,...,Xn) = (1/Z) Πi φi[Di]
Reverse Factorization

P(X1,...,Xn) = (1/Z) Πi φi[Di]  ⟹  H is an I-map of P

Proof:
  Let X,Y,W be any three disjoint sets of variables such that W separates X and Y in H
  We need to show Ind(X;Y|W)∈I(P)
  Case 1: X∪Y∪W = U (all variables)
    As W separates X and Y, there are no direct edges between X and Y
    ⟹ any clique in H is fully contained in X∪W or in Y∪W
    Let IX index the cliques contained in X∪W and IY index the cliques contained in Y∪W
    ⟹ P(X1,...,Xn) = (1/Z) Π_{i∈IX} φi[Di] · Π_{i∈IY} φi[Di] = (1/Z) f(X,W) g(Y,W)
    ⟹ Ind(X;Y|W)∈I(P)
Reverse Factorization

P(X1,...,Xn) = (1/Z) Πi φi[Di]  ⟹  H is an I-map of P

Proof:
  Let X,Y,W be any three disjoint sets of variables such that W separates X and Y in H
  We need to show Ind(X;Y|W)∈I(P)
  Case 2: X∪Y∪W ⊊ U (not all variables)
    Let S = U − (X∪Y∪W)
    S can be partitioned into two disjoint sets S1 and S2 such that W separates X∪S1 and Y∪S2 in H
    From case 1, we can derive Ind(X,S1;Y,S2|W)∈I(P)
    From decomposition of independencies ⟹ Ind(X;Y|W)∈I(P)
Factorization

If H is an I-map of P then P(X1,...,Xn) = (1/Z) Πi φi[Di]
  Holds only for positive distributions P
  Proof deferred
d-Separation: Completeness

Theorem: if d-sepG(X;Y | Z) does not hold, then there exists P such that
  G is an I-map of P
  P does not satisfy Ind(X;Y | Z)

Proof outline:
  Construct a distribution P in which the independence does not hold
  Since there is no d-separation, there is an active path
  For each interaction in the path, correlate the variables through the distribution in the CPDs
  Set all other CPDs to uniform, ensuring that influence flows only in a single path and cannot be cancelled out
Separation: Completeness

Theorem: if sepH(X;Y | Z) does not hold, then there exists P such that
  H is an I-map of P
  P does not satisfy Ind(X;Y | Z)

Proof outline:
  Construct a distribution P in which the independence does not hold
  Since there is no separation, there is an active path
  For each interaction in the path, correlate the variables through the potentials
  Set all other potentials to uniform, ensuring that influence flows only in a single path and cannot be cancelled out
Relationship with Bayesian Network

Bayesian Networks




Semantics defined via local Markov assumptions
  Global independencies induced by d-separation
  Local and global independencies are equivalent (each implies the other)
Markov Networks
  Semantics defined via a global separation property
  Can we define the induced local independencies?
    We show two definitions
    All three definitions (global and two local) are equivalent only for positive distributions P
Pairwise Markov Independencies


Every pair of disconnected nodes is separated given all other nodes in the network
Formally: IP(H) = {(X⊥Y | U−{X,Y}) : X—Y∉H}

Example:
[Figure: undirected graph over A, B, C, D, E]
Ind(A;D | B,C,E)
Ind(B;C | A,D,E)
Ind(D;E | A,B,C)
Markov Blanket Independencies


Every node is independent of all other nodes given its immediate neighboring nodes in the network
Formally: IL(H) = {(X ⊥ U−{X}−NH(X) | NH(X)) : X∈H}

Example:
[Figure: undirected graph over A, B, C, D, E]
Ind(A;D | B,C,E)
Ind(B;C | A,D,E)
Ind(C;B | A,D,E)
Ind(D;E,A | B,C)
Ind(E;D | A,B,C)
Relationship Between Properties

Let I(H) be the global separation independencies
Let IL(H) be the Markov blanket independencies
Let IP(H) be the pairwise Markov independencies

For any distribution P:
  I(H) ⟹ IL(H)
    The assertion in IL(H), that a node is independent of all other nodes given its neighbors, is part of the separation independencies, since there is no active path between a node and its non-neighbors given its neighbors
  IL(H) ⟹ IP(H)
    Follows from the monotonicity of independencies in Markov networks (if Ind(X;Y|Z) and Z⊆Z' then Ind(X;Y|Z'))
Relationship Between Properties

Let I(H) be the global separation independencies
Let IL(H) be the Markov blanket independencies
Let IP(H) be the pairwise Markov independencies

For any positive distribution P:
  IP(H) ⟹ I(H)
    Proof relies on the intersection property for probabilities:
      Ind(X;Y|Z,W) and Ind(X;W|Z,Y) ⟹ Ind(X;Y,W|Z)
    which holds in general only for positive distributions
Thus, for positive distributions
  I(H) ⟺ IL(H) ⟺ IP(H)
The Need for Positive Distribution

Let P satisfy:
  Ind(A;E)
  A and E uniformly distributed
  B=A
  D=E
  C = B ∨ D

[Figure: undirected graph over A, B, C, D, E]

P satisfies IP(H) and IL(H)
  E.g., Ind(B;D,E | A,C) (since A determines B)
P does not satisfy I(H)
  E.g., Ind(A;E | C) does not hold in P but should hold according to separation
The Need for Positive Distribution

Let P satisfy:
  A is uniformly distributed
  A=B=C

[Figure: graph with the edge A—B and C disconnected]

P satisfies IP(H)
  Ind(B;C|A), Ind(A;C|B)
  (since each variable determines all others)
P does not satisfy IL(H)
  Ind(C;A,B) needs to hold according to IL(H) but does not hold in the distribution
Constructing Markov Network for P

Goal: given a distribution P, we want to construct a Markov network which is an I-map of P
  Complete graphs always satisfy this but are not interesting
Minimal I-maps: a graph G is a minimal I-map for P if:
  G is an I-map for P
  Removing any edge from G renders it not an I-map
Goal: construct a graph which is a minimal I-map of P
Constructing Markov Network for P

If P is a positive distribution then I(H) ⟺ IL(H) ⟺ IP(H)
  Thus, it is sufficient to construct a network that satisfies IP(H)
Construction algorithm (a sketch appears below):
  For every pair (X,Y), add the edge X—Y if Ind(X;Y | U−{X,Y}) does not hold
Theorem: the constructed network is the minimal and unique I-map

Proof:
  I-map follows since IP(H) holds by construction and I(H) follows by equivalence
  Minimality follows since deleting an edge X—Y would imply Ind(X;Y | U−{X,Y}), but we know by construction that this does not hold since we added the edge
  Uniqueness follows since any other I-map has at least these edges and, to be minimal, cannot have additional edges
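A sketch of the pairwise construction algorithm, assuming P is given as a full joint table over binary variables (so the conditional-independence test is an exact table computation, feasible only for tiny examples); function and variable names are illustrative.

```python
# Sketch: build the minimal I-map Markov network of a positive P by adding
# the edge X-Y exactly when Ind(X;Y | U-{X,Y}) fails in P.
from itertools import combinations, product

def independent_given_rest(P, variables, X, Y):
    """Check Ind(X;Y | U-{X,Y}); P maps assignment tuples (ordered as
    `variables`) to probabilities, all variables binary."""
    iX, iY = variables.index(X), variables.index(Y)
    rest = [i for i in range(len(variables)) if i not in (iX, iY)]
    for r in product([0, 1], repeat=len(rest)):
        ctx = dict(zip(rest, r))

        def mass(x=None, y=None):
            return sum(pr for asg, pr in P.items()
                       if all(asg[i] == v for i, v in ctx.items())
                       and (x is None or asg[iX] == x)
                       and (y is None or asg[iY] == y))

        pc = mass()
        if pc == 0:
            continue
        for x, y in product([0, 1], repeat=2):
            # Compare P(x,y|ctx) with P(x|ctx) * P(y|ctx).
            if abs(mass(x, y) / pc - (mass(x=x) / pc) * (mass(y=y) / pc)) > 1e-9:
                return False
    return True

def minimal_imap(P, variables):
    """Edges of the (unique) minimal I-map Markov network of a positive P."""
    return [(X, Y) for X, Y in combinations(variables, 2)
            if not independent_given_rest(P, variables, X, Y)]
```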
Constructing Markov Network for P

If P is a positive distribution then I(H) ⟺ IL(H) ⟺ IP(H)
  Thus, it is also sufficient to construct a network that satisfies IL(H)
Construction algorithm:
  Connect each X to every node in the minimal set Y s.t.: {(X ⊥ U−{X}−Y | Y) : X∈H}
Theorem: the constructed network is the minimal and unique I-map
Markov Network Parameterization

Markov networks have too many degrees of freedom
  A clique over n binary variables has 2^n parameters but the joint distribution has only 2^n − 1 independent parameters
  The network A—B—C has cliques {A,B} and {B,C}
    Both capture information on B, and we can choose where to encode it (in which clique)
    We can add/subtract between the cliques
Need: conventions for avoiding ambiguity in parameterization
  Can be done using a canonical parameterization (see handout for details)
Factor Graphs

From the Markov network structure alone we do not know whether the parameterization involves maximal cliques
  Example: a fully connected graph may correspond to pairwise potentials or to one large (exponential) potential over all nodes
Solution: Factor Graphs
  Undirected graph
  Two types of nodes
    Variable nodes
    Factor nodes
  Parameterization
    Each factor node is associated with exactly one factor
    The scope of the factor is the set of variable nodes adjacent to the factor node
Factor Graphs

Example
  Exponential (joint) parameterization
  Pairwise parameterization

[Figure: a fully connected Markov network over A, B, C; a factor graph for the joint parameterization (a single factor node VABC connected to A, B, C); and a factor graph for the pairwise parameterization (one factor node per pair of variables, e.g. VAB)]
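A small sketch (illustrative names, not from the slides) of how a factor graph makes the parameterization explicit: each factor node lists the variable nodes in its scope, so the joint and pairwise parameterizations of the same Markov network are distinguishable.

```python
# Sketch: a factor graph as a mapping from factor nodes to their scopes.
joint_fg = {'V_ABC': ('A', 'B', 'C')}          # one factor over all three nodes
pairwise_fg = {'V_AB': ('A', 'B'),             # three pairwise factors
               'V_BC': ('B', 'C'),
               'V_AC': ('A', 'C')}

def table_size(factor_graph, domain_size):
    """Total number of factor entries needed by a parameterization."""
    return sum(domain_size ** len(scope) for scope in factor_graph.values())

# For larger domains the single joint factor grows exponentially:
print(table_size(joint_fg, 10), table_size(pairwise_fg, 10))   # 1000 vs. 300
```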
Local Structure

Factor graphs still encode complete tables
  Goal: as in Bayesian networks, represent context-specificity
A feature f[D] on variables D is an indicator function that, for some y∈Val(D),
  f[D] = 1 if D = y, and 0 otherwise
A distribution P is a log-linear model over H if it has
  Features f1[D1],...,fk[Dk] where each Di is a subclique in H
  A set of weights w1,...,wk such that
    P(X1,...,Xn) = (1/Z) exp(−Σ_{i=1}^{k} wi fi[Di])
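A minimal sketch of the log-linear representation; the feature and weight below are illustrative placeholders, not values from the slides.

```python
# Sketch: indicator features f_i on subcliques D_i with weights w_i,
# giving P(x) = (1/Z) * exp(-sum_i w_i * f_i(x)).
from itertools import product
import math

def f_agree(assignment):
    """Feature with scope {A, B}: indicator that A and B take the same value."""
    return 1.0 if assignment['A'] == assignment['B'] else 0.0

features = [(f_agree, 1.5)]   # list of (feature, weight) pairs

def unnormalized(assignment):
    return math.exp(-sum(w * f(assignment) for f, w in features))

assignments = [dict(zip('AB', vals)) for vals in product([0, 1], repeat=2)]
Z = sum(unnormalized(a) for a in assignments)
prob = {tuple(a.values()): unnormalized(a) / Z for a in assignments}
```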
Feature Representation

Several features can be defined on one subclique
  ⟹ any factor can be represented by features, where in the most general case we define a feature and weight for each entry in the factor
Log-linear models are more compact for many distributions, especially with large-domain variables
The representation is intuitive and modular
  Features can be modularly added between any interacting sets of variables
Markov Network Parameterizations

Choice 1: Markov network
  Product over potentials
  Right representation for discussing independence queries
Choice 2: Factor graph
  Product over factors in the graph
  Useful for inference (later)
Choice 3: Log-linear model
  Product over feature weights
  Useful for discussing parameterizations
  Useful for representing context-specific structures
All parameterizations are interchangeable
Domain Application: Vision

The image segmentation problem


Task: Partition an image into distinct parts of the scene
Example: separate water, sky, background
Markov Network for Segmentation


Grid-structured Markov network
Random variable Xi corresponds to pixel i
  Domain is {1,...,K}
  Value represents the region assignment of pixel i
Neighboring pixels are connected in the network
Appearance distribution
  wik – extent to which pixel i "fits" region k (e.g., difference from the typical pixel of region k)
  Introduce node potential exp(−wik·1{Xi=k})
Edge potentials
  Encode a contiguity preference via the edge potential exp(λ·1{Xi=Xj}) for λ>0
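A sketch of the segmentation potentials described above; the number of regions, the fit penalties w, and λ are placeholders, not values from the slides.

```python
# Sketch: node (appearance) and edge (contiguity) potentials for a grid MRF.
import math

lam = 2.0   # contiguity strength lambda > 0 (placeholder)

def node_potential(w_i, x_i):
    """Appearance term for pixel i: the product over k of exp(-w_i[k]*1{x_i=k}),
    which collapses to exp(-w_i[x_i])."""
    return math.exp(-w_i[x_i])

def edge_potential(x_i, x_j):
    """Contiguity term: exp(lambda * 1{x_i = x_j}) rewards equal labels."""
    return math.exp(lam * (1 if x_i == x_j else 0))

def score(labels, w, neighbors):
    """Unnormalized measure of a full labeling.
    labels: pixel -> region; w: pixel -> list of K fit penalties;
    neighbors: list of (i, j) grid edges."""
    s = 1.0
    for i, x_i in labels.items():
        s *= node_potential(w[i], x_i)
    for i, j in neighbors:
        s *= edge_potential(labels[i], labels[j])
    return s
```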
Markov Network for Segmentation

[Figure: 3×4 grid network over pixels X11,...,X34, with node potentials from the appearance distribution and edge potentials encoding the contiguity preference]

Solution: inference
  Find the most likely assignment to the Xi variables
From Bayesian nets to Markov nets

Goal: build a Markov network H capable of representing any distribution P that factorizes over G
  Equivalent to requiring I(H) ⊆ I(G)
Construction process
  Connect each X to every node in the smallest set Y s.t.: {(X ⊥ U−{X}−Y | Y) : X∈H}
  How can we find Y by querying G?
Blocking Paths

[Figure: three canonical active paths between X and Y]
Active path: parents — block the path by observing the parents
Active path: descendants — block the path by observing the children
Active path: v-structure — block the path by observing the children's parents (in addition to the children)
From Bayesian nets to Markov nets

Goal: build a Markov network H capable of representing any distribution P that factorizes over G
  Equivalent to requiring I(H) ⊆ I(G)
Construction process
  Connect each X to every node in the smallest set Y s.t.: {(X ⊥ U−{X}−Y | Y) : X∈H}
  How can we find Y by querying G?
    Y = Markov blanket of X in G (parents, children, children's parents)
Moralized Graphs

The moral graph of a Bayesian network structure G is the undirected graph that contains an undirected edge between X and Y if
  X and Y are directly connected in G, or
  X and Y have a common child in G

[Figure: a Bayesian network and its moralized graph]
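A minimal sketch of moralization, assuming the Bayesian network is given as a dict mapping each node to the set of its parents (names are illustrative).

```python
# Sketch: build the moral graph by keeping all edges (undirected) and
# "marrying" every pair of parents of a common child.
from itertools import combinations

def moralize(parents):
    """Return the edge set of the moral graph of a Bayesian network."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # original edges, made undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):      # marry co-parents
            edges.add(frozenset((p, q)))
    return edges

# Example: v-structure A -> C <- B; moralization adds the edge A - B.
print(moralize({'A': set(), 'B': set(), 'C': {'A', 'B'}}))
```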
Parameterizing Moralized Graphs

The moralized graph contains a full clique for every Xi and its parents Pa(Xi)
  ⟹ we can associate each CPD with a clique
Do we lose independence assumptions implied by the graph structure?
  Yes, immoral v-structures

[Figure: v-structure A → C ← B, in which Ind(A;B) holds, and its moralized graph with the added edge A—B, in which Ind(A;B) is lost]
From Markov nets to Bayesian nets


Transformation is more difficult and the resulting
network can be much larger than the Markov network
Construction algorithm



Use Markov network as template for independencies
Fix ordering of nodes
Add each node along with its minimal parent set according
to the independencies defined in the distribution
From Markov nets to Bayesian nets
[Figure: a Markov network over A, B, C, D, E, F and the Bayesian network constructed from it using the ordering A, B, C, D, E, F]
Chordal Graphs



Let X1—X2—...—Xk—X1 be a loop in the graph
A chord in the loop is an edge connecting Xi and Xj for two nonconsecutive nodes Xi and Xj
An undirected graph is chordal if any loop X1—X2—...—Xk—X1 with k ≥ 4 has a chord
  That is, the longest minimal loop is a triangle
  Chordal graphs are often called triangulated
A directed graph is chordal if its underlying undirected graph is chordal
From Markov Nets to Bayesian Nets


Theorem: let H be a Markov network structure and let G be any (Bayesian network) minimal I-map for H. Then G is chordal
The process of turning a Markov network into a Bayesian network is called triangulation
  The process loses independencies

[Figure: a Markov network over A, B, C, D, E, F and its triangulated Bayesian network; the independence Ind(B;C | A,F) is lost]
Chain Networks



Combines Markov networks and Bayesian networks
Partially directed acyclic graph (PDAG)
As for undirected graphs, we have three distinct interpretations of the independence assumptions implied by a PDAG

Example:
[Figure: chain network over A, B, C, D, E]
Pairwise Markov Independencies


Every node X is independent of any non-adjacent node that is not its descendant, given all other non-descendants of X
Formally: IP(K) = {(X⊥Y | ND(X)−{X,Y}) : X—Y∉K, Y∈ND(X)}

Example:
[Figure: chain network over A, B, C, D, E]
Ind(D;A | B,C,E)
Ind(C;E | A)
Local Markov Independencies



Let Boundary(X) be the union of the parents of X and the neighbors of X
Local Markov independencies state that a node X is independent of its non-descendants given its boundary
Formally: IL(K) = {(X ⊥ ND(X)−Boundary(X) | Boundary(X)) : X∈U}

Example:
[Figure: chain network over A, B, C, D, E]
Ind(D;A,E | B,C)
Global Independencies



I(K) = {(X⊥Y|Z) : X is c-separated from Y given Z}
X is c-separated from Y given Z if X is separated from Y given Z in the undirected moralized graph M[K]
The moralized graph of a PDAG K is the undirected graph M[K] obtained by
  Connecting any pair of parents of a given node
  Converting all directed edges to undirected edges
For positive distributions: I(K) ⟺ IL(K) ⟺ IP(K)
Application: Molecular Pathways


Pathway genes are regulated together
Pathway genes physically interact
Goal

Automatic discovery of molecular pathways: sets
of genes that


Are regulated together
Physically interact
Finding Pathways: Attempt I

Use physical interaction data


Yeast 2 Hybrid (pairwise interactions: DIP, BIND)
Mass spectrometry (protein complexes: Cellzome)
[Figure: physical interaction network partitioned into pathways]
Finding Pathways: Attempt I

Use physical interaction data



Yeast 2 Hybrid (pairwise interactions: DIP, BIND)
Mass spectrometry (protein complexes: Cellzome)
Problems:



Fully connected component
(DIP: 3527 of 3589 genes)
Data is noisy
No context-specific interactions
Finding Pathways: Attempt II

Use gene expression data

Thousands of arrays available under different conditions
[Figure: gene expression matrix (genes × experiments) clustered into Pathway I and Pathway II; color indicates less vs. more gene activity]
Finding Pathways: Attempt II

Use gene expression data


Thousands of arrays available under different conditions
Problems:
  Expression is only a 'weak' indicator of interaction
  Data is noisy
  Interacting pathways are not separable

[Figure: overlapping expression clusters for Pathway I and Pathway II]
Finding Pathways: Our Approach

Use both types of data to find pathways


Find “active” interactions using gene expression
Find pathway-related co-expression using interactions
[Figure: combined expression and interaction data partitioned into pathways I–IV]
Probabilistic Model

Genes are partitioned into "pathways":
  Every gene is assigned to one of k pathways
  Random variable for each gene with domain {1,…,k}
Expression component:
  Model likelihood is higher when genes in the same pathway have similar expression profiles
Interaction component:
  Model likelihood is higher when genes in the same pathway interact
Expression Component
Naïve Bayes

[Figure: pathway variable g.C of gene g with children g.E1, g.E2, ..., the expression levels of gene g in the arrays]

g.C – cluster (pathway) of gene g
g.Ei – expression of gene g in experiment i
Expression Component

[Figure: three copies of the naïve Bayes model, one per pathway (g.C=1, g.C=2, g.C=3), each with its own expression profile over the experiments]
Expression Component

[Figure: naïve Bayes model for a gene assigned to Pathway I]

Joint Likelihood:
  P(G.C, G.E) = Π_g [ P(g.C) Π_{i=1}^{m} P(g.Ei | g.C) ]
Protein Interaction Component

Interacting genes are more likely to be in the same pathway

[Figure: pathway variables g1.C and g2.C of two genes whose protein products interact, connected by a compatibility potential φ(g1.C, g2.C); the potential table assigns higher affinity to assignments with g1.C = g2.C]
Protein Interaction Component

Interaction data: g1–g2, g2–g3, g2–g4, g3–g4

[Figure: Markov network over the pathway variables g1.C, g2.C, g3.C, g4.C with an edge for each interacting pair]

Joint Likelihood:
  P(G.C) ∝ Π_{gi} φ(gi.C) · Π_{gi,gj} φ(gi.C, gj.C)^{1[gi—gj ∈ E]}
Joint Probabilistic Model

[Figure: chain network for four genes; the pathway variables g1.C,...,g4.C are connected by compatibility potentials along the interaction edges, and each gi.C has directed edges to its expression variables gi.E1, gi.E2, gi.E3]
Joint Probabilistic Model

Joint Likelihood:

  P(G.C, G.E | θ) = (1/Z) [ Π_{i=1}^{n} P(gi.C) Π_{j=1}^{m} P(gi.Ej | gi.C) ] · [ Π_{gi,gj} φ(gi.C, gj.C)^{1[gi—gj ∈ E]} ]

where the first product is the expression component and the second is the interaction component
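A minimal sketch of the unnormalized joint score above, combining the expression component (here modeled with illustrative Gaussian emissions per pathway) and the interaction component (a same/different compatibility value); all numbers, names, and the emission model are placeholder assumptions, not values from the slides.

```python
# Sketch: unnormalized joint score over pathway assignments.
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint_score(assign, expr, prior, means, var, interactions, same=10.0, diff=1.0):
    """assign: gene -> pathway; expr: gene -> list of expression values;
    prior: pathway -> prior probability; means[k][j]: mean expression of
    pathway k in experiment j; interactions: list of interacting gene pairs."""
    score = 1.0
    for g, k in assign.items():                       # expression component
        score *= prior[k]
        for j, e in enumerate(expr[g]):
            score *= gaussian(e, means[k][j], var)
    for gi, gj in interactions:                       # interaction component
        score *= same if assign[gi] == assign[gj] else diff
    return score                                      # unnormalized (no 1/Z)
```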


Model Involved


Chain graph (directed and undirected edges)
Hybrid network
  Discrete "cluster" (pathway) variables
  Continuous expression variables
Hidden and observed nodes
  Expression is observed
  Cluster variables are hidden
Conditional linear Gaussian CPDs
  Expression given cluster variable
Comparison to Expression Clustering

[Figure: number of interactions per pathway/cluster for our method vs. standard expression clustering; 20 pathways have significantly more interactions between genes in the pathway]
Comparison to Interaction Clustering

Interaction Clustering (MCL) – Enright et al. (2002)

[Figure: expression coherence per pathway/cluster for our method vs. interaction clustering (MCL); only one interaction pathway is active]
Capturing Biological Processes

Compare p-values of enrichment for GO annotations

[Figure: scatter plot of -log(p-value) for our method vs. standard expression clustering; 214 annotations are more significant for our method, 21 are more significant for expression clustering]
Capturing Biological Processes

Compare p-values of enrichment for GO annotations

[Figure: scatter plot of -log(p-value) for our method vs. interaction clustering; 73 vs. 20 functions are more enriched at P < 10^-20]

Includes processes with many interactions to other processes (e.g., translation, protein degradation, transcription, replication)
Summary

Markov Networks – undirected graphical models
  Like Bayesian networks, they define independence assumptions
  Three definitions exist, all equivalent for positive distributions
  Factorization is defined as a product of factors over complete subsets of the graph
Relationship to Bayesian networks
  They represent different families of independencies
  Independencies that can be represented in both – chordal graphs
Moralized graphs
  Minimal Markov network that represents a Bayesian network's dependencies
  d-separation in Bayesian networks is equivalent to separation in the moral graph