
Modeling regulatory networks in cells using Bayesian networks
Golan Yona
Department of Computer Science
Cornell University
Outline
• Regulatory networks
• Expression data
• Bayesian Networks
– What
– Why
– How
• Learning networks from expression data
• Using Bayesian networks to analyze expression data (Friedman et al.)
Regulatory networks
[Figure: KEGG regulatory pathways]

Metabolic pathways
[Figure: KEGG metabolic pathways]
Expression Arrays
• Measure the expression levels of thousands of genes in a cell under
specific conditions (e.g. cell cycle) simultaneously
• Each cell has the same genomic data, but different subsets of proteins are expressed in different cells, and in the same cell under different conditions.
• Protein level is controlled by controlling
– transcription initiation
– mRNA transcription
– mRNA transport
– splicing
– post-translational modifications
– degradation of mRNA and proteins
• Microarrays measure the level of mRNA, thus providing indirect evidence for the control of protein levels
Micro spotting
[Figure: micro-spotting pin]
• Some genes are over-expressed (red) and some under-expressed (green), measured with respect to a control group of genes (“fixed” genes)
• Different pathways are activated under different conditions
Goals
• Recover protein interactions and sub-networks that correspond to
regulatory networks in the cell.
• Basic assumption: some genes are dependent on others while others
exhibit independence or conditional independence
• The means: Bayesian networks. Capable of modeling the statistical
dependencies between different variables (genes)
• Different from clustering analysis
• Applicable when the dependency between genes is “local”
• Problems: the data is noisy, partial, and sometimes misleading (translation, activation); it is not enough to ensure statistically significant models; and there are time-scale issues
Bayesian Networks
• A compromise between the assumption of
complete dependency and complete
conditional independence (Naïve Bayes)
• Less constraining yet still tractable
• We know something about the statistical
dependencies between features but not
necessarily about the type of the underlying
distributions
Example
[Figure: example network over the variables Oil pressure in engine, Oil temp., Engine temp., Air pressure in tire, Smoke, Fan speed, Coolant temp.]
Bayesian Networks
• Also called belief nets
• A graph description of dependencies and independencies between
variables. Each node corresponds to a variable (gene).
• The graph is directed and acyclic
• The variables are discrete
• A variable can take on a value from a set of values $\{a_1, a_2, \dots\}$, e.g. on/off
• The probability of a specific value is $P(a_i)$, with $\sum_i P(a_i) = 1$
• A link joining node A to node C is directional and represents the set of conditional probabilities $P(c_j \mid a_i)$ – causality (e.g. the probability that C is on when A is off)
• The network is described in terms of its nodes and edges and the
conditional probability distributions associated with each node.
Network structure
[Figure: fragment of the engine network – Oil temp., Engine temp., Smoke, Fan speed, Coolant temp.]
• For every node A:
– the parents of A are the set of immediate predecessors of A
– the children of A are the set of immediate successors of A
– B is a descendant of A if there is a directed path from A to B
• Network assertions: each node stores a conditional probability table given its parents, e.g. P(Coolant temp. | Engine temp., Fan speed):

         Eng cold,  Eng cold,  Eng hot,  Eng hot,
         Fan fast   Fan slow   Fan fast  Fan slow
High     0.1        0.4        0.6       0.9
Low      0.9        0.6        0.4       0.1

• The value of a variable depends on its parents
• A variable is conditionally independent of its non-descendants given its parents
Calculating the probability of an assignment
• The network describes the joint probability distribution of all variables (some variables are conditionally independent and some are not)
• Depends on the structure!
• The probability of a specific assignment of values $y_1, y_2, \dots, y_n$ to the variables $Y_1, Y_2, \dots, Y_n$ is
$$P(y_1, y_2, \dots, y_n) = \prod_{i=1}^{n} P(y_i \mid \mathrm{Parents}(Y_i))$$
This is the likelihood of the data given the model.
All you need to know is..
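A minimal Python sketch of this factorization; the toy network, variable names, and probabilities below are invented for illustration and are not from the slides:

```python
# A toy Bayesian network stored as a dict mapping each variable to its
# parents and its conditional probability table (CPT), indexed by the
# tuple of parent values. Hypothetical two-node network: engine -> coolant.
network = {
    "engine": {
        "parents": [],
        "cpt": {(): {"hot": 0.3, "cold": 0.7}},
    },
    "coolant": {
        "parents": ["engine"],
        "cpt": {
            ("hot",): {"high": 0.8, "low": 0.2},
            ("cold",): {"high": 0.1, "low": 0.9},
        },
    },
}

def joint_probability(network, assignment):
    """P(y1, ..., yn) = product over i of P(yi | Parents(Yi))."""
    p = 1.0
    for var, node in network.items():
        parent_values = tuple(assignment[q] for q in node["parents"])
        p *= node["cpt"][parent_values][assignment[var]]
    return p

print(joint_probability(network, {"engine": "hot", "coolant": "high"}))  # 0.3 * 0.8 = 0.24
```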
Learning Bayesian network from data
• Given the data set with specific assignments for variables
(on/off for each gene), how can we find the most probable
network structure that explains the data (the best “match”
to the data)?
• How to quantify a match?
• Note that there are two aspects of the network that we need
to learn
– Structure (nodes, edges)
– Conditional probability distributions
• Common strategy: assign a score to each network G
$$\mathrm{score}(G) = \log P(G \mid Data) = \log P(Data \mid G) + \log P(G) + \text{const}$$
where the first term is the likelihood and the second is the prior
• Pick the network that maximizes the score
Learning
• The likelihood of the data given the model is estimated by
averaging over all possible assignments of parameters
(conditional probabilities) to G
$$P(Data \mid G) = \int_{\Theta} P(Data \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$
• Integration over all possible assignments of the conditional probabilities; the major contribution comes from the set estimated from the data
• Given a specific structure, for every node we look up its parents and calculate the empirical conditional probability distribution, as sketched below
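A concrete sketch of that empirical estimate: for a fixed structure and complete discrete data, each conditional probability is a ratio of counts, and plugging those estimates back in gives the log-likelihood term of the score. The representation below (structure as a variable-to-parents dict, data as a list of assignment dicts) is an assumption for illustration, not the paper's code:

```python
from collections import Counter
from math import log

def log_likelihood(structure, data):
    """log P(Data | G) at the empirical (maximum-likelihood) CPT estimates."""
    total = 0.0
    for var, parents in structure.items():
        joint = Counter()  # counts of (parent values, child value)
        marg = Counter()   # counts of parent values alone
        for row in data:
            pv = tuple(row[q] for q in parents)
            joint[pv, row[var]] += 1
            marg[pv] += 1
        # each observed combination contributes n * log P_hat(value | parents)
        for (pv, _), n in joint.items():
            total += n * log(n / marg[pv])
    return total

# toy example: Y depends on X
structure = {"X": [], "Y": ["X"]}
data = [{"X": 1, "Y": 1}, {"X": 1, "Y": 0}, {"X": 0, "Y": 0}]
print(log_likelihood(structure, data))
```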
Model selection
• The second term (the log prior) is a measure of the complexity of the model (through uncertainty)
• Occam's razor:
entia non sunt multiplicanda praeter necessitatem
(entities must not be multiplied beyond necessity)
• The MDL (minimum description length) principle
• In the papers discussed here this term is ignored
In search of the best network
• In theory: test different structures, calculate the
probability of assignment to variables for each
network structure, and output the network that
maximizes the likelihood of the data given the
network.
• Impossible in practice – the number of possible networks over n genes is roughly $2^{\binom{n}{2}}$
• For the yeast genome with 6000 genes this is $> 10^{5{,}000{,}000}$
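A quick check of that count. The number of DAGs on n nodes is at least $2^{\binom{n}{2}}$ (every subset of the $\binom{n}{2}$ pairs, oriented consistently with one fixed node order, is a distinct DAG), so for n = 6000:

$$\binom{6000}{2} = \frac{6000 \cdot 5999}{2} \approx 1.8 \times 10^{7}, \qquad 2^{1.8 \times 10^{7}} = 10^{\,1.8 \times 10^{7} \cdot \log_{10} 2} \approx 10^{5.4 \times 10^{6}} > 10^{5{,}000{,}000}$$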
Possible solution
• Apply a heuristic local greedy search: start with a random network and locally improve it by testing perturbations of the original structure.
• Test one edge at a time, by adding, removing, or reversing the edge, and testing its effect on the score. If the score improves, accept the change, as sketched below.
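A sketch of this greedy procedure, under stated assumptions: a graph is a set of directed (u, v) edges, score(graph, data) is any structure score (e.g. the log-likelihood above, ideally with a complexity penalty), and we start from the empty graph rather than a random one to keep the sketch short:

```python
def is_acyclic(edges):
    """Check the DAG constraint by repeatedly removing source nodes."""
    edges = set(edges)
    nodes = {n for e in edges for n in e}
    while nodes:
        sources = [n for n in nodes if all(v != n for (_, v) in edges)]
        if not sources:
            return False  # every remaining node has an incoming edge: a cycle
        nodes -= set(sources)
        edges = {(u, v) for (u, v) in edges if u not in sources}
    return True

def neighbors(graph, variables):
    """All DAGs reachable by adding, removing, or reversing a single edge."""
    out = []
    for u in variables:
        for v in variables:
            if u == v:
                continue
            if (u, v) in graph:
                out.append(graph - {(u, v)})             # remove the edge
                out.append(graph - {(u, v)} | {(v, u)})  # reverse the edge
            elif (v, u) not in graph:
                out.append(graph | {(u, v)})             # add a new edge
    return [g for g in out if is_acyclic(g)]

def greedy_search(variables, data, score):
    graph = set()  # the slide starts from a random network; empty keeps this short
    best = score(graph, data)
    improved = True
    while improved:
        improved = False
        for g in neighbors(graph, variables):
            s = score(g, data)
            if s > best:  # accept a perturbation only if the score improves
                graph, best, improved = g, s, True
    return graph
```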
How to learn from expression data
• Two types of features learned from multiple
networks
• First: a gene Y is in the Markov blanket of X (the two genes are involved in the same biological process; no other gene mediates the dependence) – see the sketch after this list
• Problem: unobserved variables can mediate the interaction
• Second: a gene X is an ancestor of Y (based on all the networks that are learned)
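For the first feature type, the Markov blanket has a purely graphical definition: X's parents, X's children, and the other parents of X's children. This is the standard definition; the helper below is an illustrative sketch, with edges as (u, v) pairs meaning u → v:

```python
def markov_blanket(x, edges):
    """Parents, children, and co-parents of x's children in a DAG."""
    parents = {u for (u, v) in edges if v == x}
    children = {v for (u, v) in edges if u == x}
    spouses = {u for (u, v) in edges if v in children and u != x}
    return parents | children | spouses

edges = {("A", "X"), ("X", "B"), ("C", "B"), ("B", "D")}
print(markov_blanket("X", edges))  # members: A, B, C (set order may vary)
```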
Application to the Yeast Cell cycle data
• Expression level measurements for 6177 genes
along different time points in six cell cycles –
altogether 76 measurements for each gene
• Only about 800 genes vary during the cell cycle, and 250 of them cluster into 8 fairly distinct classes.
• Networks are learned for the 800 genes
• Confidence values based on the set of networks
learned from different bootstrap sets
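A sketch of that bootstrap estimate, where learn_network and has_feature are hypothetical placeholders for the structure-learning step and the feature test (e.g. "Y is in the Markov blanket of X"):

```python
import random

def bootstrap_confidence(data, feature, learn_network, has_feature, m=200):
    """Fraction of networks learned from bootstrap resamples showing the feature."""
    hits = 0
    for _ in range(m):
        sample = [random.choice(data) for _ in data]  # resample with replacement
        if has_feature(learn_network(sample), feature):
            hits += 1
    return hits / m
```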
Typical sub-network
[Figure: a typical learned sub-network]
Biological significance
• Order relations: there are a few dominant
genes that appear before many others, e.g.
genes that are involved in cell cycle control
and initiation.
• Most are nuclear proteins, but there are also cytoplasmic and membrane proteins (budding and sporulation)
• Some DNA repair proteins (prerequisite for transcription)
• RSR1 – an initiator of signal transduction cascades in the cell
Biological significance
• Markov relations: connected genes are functionally related
• Most pairs have similar functions (sometimes verified through transitivity)
• Some are physically adjacent on the chromosome
• Some relations cannot be detected directly from expression data
• Detect conditional independence – a group of genes that are expressed similarly, but one is a parent of all the others and there are no connections between the others: the parent is a control gene (e.g. CLN2, an early cell cycle control gene that controls RNR3, SVS1, SRO4 and RAD41, which are functionally unrelated)
Conclusions
• A powerful tool, but
– not enough data
– computational problems
– learning algorithms
– the authors decompose networks into basic elements again
• Many possible extensions