Gene Networks: Bayesian Networks models


Bayesian Networks & Gene Expression
The Problem

Input: an expression matrix A, where Aij is the mRNA level of gene j in experiment i.

Goal:
• Learn regulatory/metabolic networks
• Identify causal sources of the biological phenomena of interest
Analysis Approaches of Expression Data

• Clustering
  – Groups together genes with similar expression patterns
  – Does not reveal structural relations between genes
• Boolean networks
  – Deterministic models of the logical interactions between genes
  – Deterministic, static
• Linear models
  – Deterministic fully-connected linear model
  – Under-constrained, assumes linearity of interactions
Probabilistic Network Approach

• Characterizes stochastic (non-deterministic!) relationships between expression patterns of different genes
• Goes beyond pair-wise interactions => structure!
  – Many interactions are explained by intermediate factors
  – Regulation involves combined effects of several gene products
• Flexible in terms of types of interactions (not necessarily linear or Boolean!)
Probabilistic Network Structure

Example: a noisy stochastic process of hair heredity (pedigree):

[Pedigree figure: Homer, Al, Marge; Bart, Lisa, Maggie]

• A node represents an individual’s genotype
• An edge represents direct influence/interaction
• Note: Bart’s genotype is dependent on Al’s genotype, but independent of it given Marge’s genotype
A Model: Bayesian Network

• Structure: DAG
• Meaning: a child is conditionally independent of its non-descendants, given the value of its parents
• Does not impose causality, but suits modeling of causal processes

[Figure: a DAG with nodes labeled Ancestor, Parents (Y1, Y2), Child (X), Non-descendant, Descendant]
Local Structures & Independencies

• Common parent: A ← B → C
• Cascade: A → B → C
• V-structure: A → C ← B

The language is compact, the concepts are rich!
Bayesian Network – CPDs

Local probabilities: CPD – conditional probability distribution P(Xi | Pai)

• Discrete variables: multinomial distribution (can represent any kind of statistical dependency)

[Example network: Earthquake → Alarm ← Burglary; Earthquake → Radio; Alarm → Call]

P(A | E,B):

  E    B     P(a)   P(¬a)
  e    b     0.9    0.1
  e    ¬b    0.2    0.8
  ¬e   b     0.9    0.1
  ¬e   ¬b    0.01   0.99
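A multinomial CPD like P(A | E,B) is just a lookup table indexed by the parents' values. A minimal sketch in Python (the dictionary layout and function name are ours, not from the slides):

```python
# CPT for P(A | E,B) from the alarm example.
# Keys are (E, B) value pairs; values are P(A=1 | E, B).
cpt_alarm = {
    (1, 1): 0.9,   # earthquake and burglary
    (1, 0): 0.2,   # earthquake only
    (0, 1): 0.9,   # burglary only
    (0, 0): 0.01,  # neither
}

def p_alarm(a, e, b):
    """P(A=a | E=e, B=b) for binary a, e, b."""
    p1 = cpt_alarm[(e, b)]
    return p1 if a == 1 else 1.0 - p1
```

Because the table stores one free parameter per parent assignment, it can encode any statistical dependency of A on (E, B).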
Bayesian Network – CPDs (cont.)

Continuous variables: e.g. linear Gaussian

P(X | Y1, …, Yk) = N(a0 + Σ_{i=1..k} ai·yi, σ²)

[Figure: the density P(X | Y) as a Gaussian ridge over X and Y]
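The linear Gaussian CPD can be evaluated directly from its definition: the mean of X is a linear function of the parent values. A sketch (helper name is ours):

```python
import math

def linear_gaussian_density(x, ys, a0, a, sigma):
    """Density of the linear Gaussian CPD
    P(X | y1..yk) = N(a0 + sum_i a_i * y_i, sigma^2)."""
    mean = a0 + sum(ai * yi for ai, yi in zip(a, ys))
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

At x equal to the linear mean the density peaks at 1/(σ·√(2π)), and it decays symmetrically away from it.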
Bayesian Network Semantics

Qualitative part: the DAG specifies conditional independence statements
+ Quantitative part: local probability models
= Unique joint distribution over the domain

[Figure: example DAG over B, E, R, A, C]

The joint distribution decomposes nicely:
P(C,A,R,E,B) = P(B)·P(E|B)·P(R|E,B)·P(A|R,B,E)·P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B)·P(E)·P(R|E)·P(A|B,E)·P(C|A)

Every node may depend on many others, BUT:
≤ k parents per node => O(2^k · n) vs. O(2^n) parameters
=> good for memory conservation & learning robustness
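The factored form is what makes the joint cheap to represent and evaluate: multiply one CPD entry per node. A sketch for binary variables, using an alarm-style network (the values P(B)=0.1 and P(E)=0.2 are illustrative, not from the slides):

```python
# cpds[v] maps a tuple of parent values to P(v=1 | parents);
# parents[v] lists v's parents in the same order as the tuple keys.
# P(B)=0.1 and P(E)=0.2 are made-up numbers for the example.
parents = {'B': (), 'E': (), 'A': ('E', 'B')}
cpds = {
    'B': {(): 0.1},
    'E': {(): 0.2},
    'A': {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01},
}

def joint_prob(assign, cpds, parents):
    """P(x1..xn) = product over nodes of P(xi | Pa(xi))."""
    p = 1.0
    for v, table in cpds.items():
        p1 = table[tuple(assign[u] for u in parents[v])]
        p *= p1 if assign[v] == 1 else 1.0 - p1
    return p

# Parameter count: these 3 binary nodes with <= 2 parents need
# 1 + 1 + 4 = 6 parameters, versus 2^3 - 1 = 7 for the raw joint;
# the gap grows exponentially with n.
```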
So, Why Bayesian Networks?

• Flexible representation of the (in)dependency structure of multivariate distributions and interactions
• Natural for modeling global processes with local interactions => good for biology
• Clear probabilistic semantics
• Natural for statistical confidence analysis of results and answering of queries
• Stochastic in nature: models stochastic processes & deals with (“sums out”) noise in measurements
Inference

A Bayesian network supports answering valuable queries, e.g.:
• Is node X independent of node Y given nodes Z, W?
• What is the probability of X=true if (Y=false and Z=true)?
• What is the joint distribution of (X,Y) if R=false?
• What is the likelihood of some full assignment?
• What is the most likely assignment of values to all the nodes of the network?

The queries are relatively efficient.
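For a small binary network, conditional queries like P(X=true | evidence) can be answered by brute-force enumeration of all assignments (exact but exponential; practical inference engines exploit the network structure instead). A sketch reusing an illustrative alarm-style network (P(B)=0.1 and P(E)=0.2 are made up):

```python
from itertools import product

# Illustrative alarm-style network.
parents = {'B': (), 'E': (), 'A': ('E', 'B')}
cpds = {
    'B': {(): 0.1},
    'E': {(): 0.2},
    'A': {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01},
}

def query(target, evidence, cpds, parents):
    """P(target=1 | evidence) by summing the factored joint
    over all binary assignments consistent with the evidence."""
    names = list(cpds)
    num = den = 0.0
    for values in product((0, 1), repeat=len(names)):
        assign = dict(zip(names, values))
        if any(assign[v] != val for v, val in evidence.items()):
            continue
        p = 1.0
        for v, table in cpds.items():
            p1 = table[tuple(assign[u] for u in parents[v])]
            p *= p1 if assign[v] == 1 else 1.0 - p1
        den += p
        if assign[target] == 1:
            num += p
    return num / den
```

With no evidence the same routine computes marginals, e.g. P(A=1) by summing out B and E.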
Learning Bayesian Network

The goal:
• Given a set of independent samples (assignments to the random variables), find the best (the most likely?) Bayesian network (both DAG and CPDs)

Sample data:
{ (B,E,A,C,R) = (T,F,F,T,F)
  (B,E,A,C,R) = (T,F,T,T,F)
  ……..
  (B,E,A,C,R) = (F,T,T,T,F) }

[Figure: the learned DAG over B, E, R, A, C together with its CPD tables, e.g. P(A | E,B)]
Learning Bayesian Network

• Learning the best CPDs given a DAG is easy (collect statistics of the values of each node given each specific assignment to its parents). But…
• The structure (G) learning problem is NP-hard => a heuristic search for the best model must be applied, generally yielding a locally optimal network.
• It turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable), because…
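The easy part, collecting statistics, amounts to counting: for each parent assignment, the maximum-likelihood CPD entry is the empirical frequency of the child's value. A sketch for binary variables (function name is ours):

```python
from collections import Counter

def fit_cpd(samples, child, pa):
    """Maximum-likelihood CPD P(child=1 | pa): for each observed
    parent assignment, the empirical frequency of child=1.
    samples is a list of dicts mapping variable name -> 0/1."""
    ones, totals = Counter(), Counter()
    for s in samples:
        key = tuple(s[u] for u in pa)
        totals[key] += 1
        ones[key] += s[child]
    return {k: ones[k] / totals[k] for k in totals}
```

Parent assignments never seen in the data get no entry at all, which is exactly the sparse-data problem discussed later.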
Learning Bayesian Network

[Figure: A → C versus A → C ← B]

• If we add B to Pa(C), we have more parameters to fit => more freedom => we can always optimize CPD(C) such that the achieved likelihood satisfies:

  P(C | A) ≤ P(C | A, B)

• But we prefer simpler (more explanatory) networks (Occam’s razor!)
• Therefore, practical scores of Bayesian networks compensate for the likelihood improvement with a “fine” on complex networks.
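Such a penalized score can be sketched in BIC style: the log-likelihood competes with a complexity "fine" that grows with the number of parameters (this is one standard choice; the slides do not commit to a particular score):

```python
import math

def bic_score(loglik, num_params, num_samples):
    """BIC-style network score: log-likelihood minus a complexity
    penalty.  Adding parents can only raise loglik, but it also
    raises num_params, so richer structures are not automatically
    preferred."""
    return loglik - 0.5 * num_params * math.log(num_samples)
```

A denser network wins only when its likelihood gain outweighs the penalty for its extra parameters.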
Modeling Biological Regulation

Variables of interest:
• Expression levels of genes
• Concentration levels of proteins
• Exogenous variables: nutrient levels, metabolite levels, temperature
• Phenotype information
• …

Bayesian network structure:
• Captures dependencies among these variables
Possible Biological Interpretation

  Measured expression level of each gene  ↔  Random variables
  Gene interaction                        ↔  Probabilistic dependencies

Interactions are represented by a graph:
• Each gene is represented by a node in the graph
• Edges between the nodes represent direct dependency
More Local Structures

• Dependencies can be mediated through other nodes:
  A ← C → B   (common cause)
  A → C → B   (intermediate gene)
• Common/combinatorial effects:
  A → C ← B
The Approach of Friedman et al.

Expression data → Bayesian Network Learning Algorithm → learned network (e.g. over B, E, R, A, C)

Use the learned network to make predictions about the structure of the interactions between genes –
no prior biological knowledge is used!
The Discretization Problem

The expression measurements are real numbers.
=> We need to discretize them in order to learn general CPDs => we lose information
=> If we don’t, we must assume some specific type of CPD (like “linear Gaussian”) => we lose generality
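A common discretization in work of this kind maps each log expression ratio to three levels: under-expressed, unchanged, over-expressed relative to control. A sketch with an illustrative threshold (the 0.5 cutoff is an assumption, not prescribed by the slides):

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log expression ratio to three levels:
    -1 (under-expressed), 0 (unchanged), 1 (over-expressed).
    The 0.5 threshold is an illustrative choice."""
    if log_ratio > threshold:
        return 1
    if log_ratio < -threshold:
        return -1
    return 0
```

This is exactly the trade-off above: the three-level CPDs stay fully general, but any variation within a level is discarded.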
Problem of Sparse Data

• There are many more genes than experiments (= assignments) => many different networks suit the data well => we can’t rely only on the usual learning algorithm.
• Shrink the network search space. E.g., we can use the notion that in biological systems each gene is regulated directly by only a few regulators.
• Don’t take the resulting network for granted; instead, fetch from it pieces of reliable information.
Learning With Many Variables

Sparse Candidate algorithm – an efficient heuristic search that relies on the sparseness of regulation networks:

• For each gene, choose a promising “candidate parents set” for direct influence
• Find a (locally) optimal BN constrained to those parent candidates for each gene
• Iteratively improve the candidate sets using the parents in the learned BN
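The three steps above can be sketched as a loop. Here `score_edge` and `learn_constrained` are placeholders for the pairwise scoring and the constrained structure-learning subroutines, which is where the real work happens; everything else is bookkeeping:

```python
def sparse_candidate(genes, score_edge, learn_constrained, k=3, rounds=5):
    """Sketch of the Sparse Candidate loop:
    1) pick the k most promising candidate parents per gene,
    2) learn a BN whose parents are restricted to those candidates,
    3) refresh each candidate set from the learned parents, repeat."""
    candidates = {
        g: sorted((h for h in genes if h != g),
                  key=lambda h: score_edge(h, g), reverse=True)[:k]
        for g in genes
    }
    network = None
    for _ in range(rounds):
        # learn_constrained must return parents(g) within candidates[g]
        network = learn_constrained(candidates)
        new = {g: sorted(set(network[g]) | set(candidates[g]))[:k]
               for g in genes}
        if new == candidates:   # candidate sets stabilized
            break
        candidates = new
    return network
```

Restricting each gene to k candidate parents shrinks the search space drastically, which is exactly the biological sparseness assumption at work.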
Experiment

Data from Spellman et al. (Mol. Bio. of the Cell, 1998). Contains 76 samples of all the yeast genome:
• Different methods for synchronizing the cell cycle in yeast
• Time series at intervals of a few minutes (5–20 min)
• Spellman et al. identified 800 cell-cycle regulated genes
Network Learned

[Figure: the network learned from the cell-cycle data]
Challenge: Statistical Significance

Sparse data:
• Small number of samples
• “Flat posterior” – many networks fit the data

Solution: estimate confidence in network features. E.g., two types of features:
• Markov neighbors: X directly interacts with Y (they share an edge or have a mutual child)
• Order relations: X is an ancestor of Y
Confidence Estimates

Bootstrap approach [FGW, UAI99]: resample the dataset D into replicates D1, D2, …, Dm, and learn a network Gi from each replicate.

Estimate the “confidence level” of a feature f:

C(f) = (1/m) · Σ_{i=1..m} 1{f ∈ Gi}
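The bootstrap estimate can be sketched directly from the formula: resample, relearn, count. The `learn` and `has_feature` callables below are placeholders for the actual structure learner and feature test:

```python
import random

def bootstrap_confidence(data, learn, has_feature, m=100, seed=0):
    """Bootstrap confidence of a network feature:
    resample the dataset with replacement m times, learn a network
    from each replicate, and report the fraction of learned networks
    containing the feature: C(f) = (1/m) * sum_i 1{f in G_i}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(m):
        replicate = [rng.choice(data) for _ in data]
        if has_feature(learn(replicate)):
            hits += 1
    return hits / m
```

Features that survive most resampled datasets are the ones treated as reliable in the analysis that follows.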
In summary…

Expression data → Preprocess (normalization, discretization) → Learn model (Bayesian network learning algorithm, mutation modeling + bootstrap) → Feature extraction (Markov edge, separator, ancestor features)

Result: a list of features with high confidence. They can be biologically interpreted.
Resulting Features: Markov Relations

Question: do X and Y directly interact?

• Parent–child (one gene regulating the other), confidence 0.91:
  SST2 → STE6
  (SST2: regulator in mating pathway; STE6: exporter of mating factor)

• Hidden parent (two genes co-regulated by a hidden factor), confidence 0.84:
  ARG3 – ARG5, co-regulated by the transcription factor GCN4
  (ARG3, ARG5: arginine biosynthesis)
Resulting Features: Separators

• Question: given that X and Y are indirectly dependent, who mediates this dependence?
• Separator relation:
  – X affects Z, who in turn affects Y
  – Z regulates both X and Y

Example: KAR4 (mating transcriptional regulator of nuclear fusion) separates AGA1 and FUS1 (cell fusion).
Separators: Intra-cluster Context

[Figure: SLT2 (MAPK of cell wall integrity pathway) with CRH1, YPS3, YPS1 (cell wall proteins) and SLR3 (protein of unknown function)]

• All pairs have high correlation
• Clustered together
Separators: Intra-cluster Context

[Figure: SLT2 regulating CRH1, YPS3, YPS1 (cell wall proteins) and SLR3 (protein of unknown function)]

• SLT2: pathway regulator, explains the dependence
• Many signaling and regulatory proteins identified as direct and indirect separators
Next Step: Sub-Networks

Automatic reconstruction:
• Goal: dense sub-networks with highly confident pair-wise features
• Score: statistical significance
• Search: high-scoring sub-networks
• Advantages:
  – Global picture
  – Structured context for interactions
  – Incorporates mid-confidence features

Pipeline: Expression data → Preprocess (normalization, discretization) → Learn model (Bayesian network learning algorithm, mutation modeling + bootstrap) → Feature extraction (Markov edge, separator, ancestor) → Feature assembly → Reconstruct sub-networks

Global network → Local features → Sub-network
Results

6 well-structured sub-networks representing coherent molecular responses:
• Mating
• Iron metabolism
• Low osmolarity cell wall integrity pathway
• Stationary phase and stress response
• Amino acid metabolism, mitochondrial function and sulfate assimilation
• Citrate metabolism

Uncovered regulatory, signaling and metabolic interactions.
“Mating Response” Substructure

Two branches:
• Cell fusion (signaling pathway regulator)
• Outgoing mating signal (KAR4, transcriptional regulator of nuclear fusion)

[Figure: sub-network over KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W; a group of these genes participates in cell fusion]

We missed: STE12 (the main TF); FUS3 (the main MAPK) is only marginal.
More Insights to Incorporate in the BN Approach

• Sequence (mainly promoters)
• Protein–protein interaction measurements
• Cluster analysis of genes and/or experiments
• Incorporating prior knowledge: the large mass of biological knowledge, and insight from sequence/structure databases