A Physical Module Network


Constrained graph structure learning by
integrating diverse data types
Sushmita Roy
Computational Network Biology
Biostatistics & Medical Informatics 826
Computer Sciences 838
https://compnetbiocourse.discovery.wisc.edu
Sep 27th 2016
Goals for this lecture
• Different types of integrative inference frameworks
– Supervised
• A Naïve Bayes Classification approach
– Unsupervised
• Physical Module Networks (PMNs)
– N. Novershtern, A. Regev, and N. Friedman, "Physical
module networks: an integrative approach for
reconstructing transcription regulation," Bioinformatics,
vol. 27, no. 13, pp. i177-i185, Jul. 2011.
• Application of PMNs to real data
Why constrained structure learning?
• Learning genome-scale networks is
computationally challenging
– The space of possible graphs is huge
– There are not enough training examples to learn
these networks reliably
– Multiple equivalent models can be learned
• One type of data (expression) might not
inform us of all the regulatory edges
RECAP: Different types of networks
• Physical networks
– Transcriptional regulatory networks: interactions between
regulatory proteins (transcription factors) and genes
– Protein-protein: interactions among proteins
– Signaling networks: protein-protein and protein-small molecule
interactions to relay signals from outside the cell to the nucleus
• Functional networks
– Metabolic: reactions through which enzymes convert substrates
to products
– Genetic: interactions among genes which, when perturbed
together, produce a more significant phenotype than when
perturbed individually
Types of integrative inference frameworks
• Supervised learning
– Requires examples of interactions and non-interactions
– Train a classifier based on edge-specific features
• Unsupervised learning
– Edge aggregation
– Model-based learning
• Auxiliary datasets serve to provide priors on the graph
structure
Supervised learning for integrative network
inference
A few supervised learning approaches
– Functional networks
• MouseNET (Y. Guan, C. L. Myers et al., "A genomewide functional
network for the laboratory mouse," PLoS Comput Biol, vol. 4, no. 9,
p. e1000165, Sep. 2008)
• STRING (L. J. Jensen, M. Kuhn, et al, "STRING 8-a global view on
proteins and their functional interactions in 630 organisms."
Nucleic acids research, vol. 37, no. Database issue, pp. D412-D416,
Jan. 2009)
– Regulatory networks
• D. Marbach, S. Roy, F. Ay et al., "Predictive regulatory models in
drosophila melanogaster by integrative inference of transcriptional
networks," Genome Research, vol. 22, no. 7, pp. 1334-1349, Jul.
2012
• F. Mordelet and J.-P. Vert, "SIRENE: supervised inference of
regulatory networks," Bioinformatics, vol. 24, no. 16, pp. i76-i82,
Aug. 2008.
Key points of supervised learning approaches
• Ground truth for training a classifier or
computing benchmarking scores
• Different datasets are represented as features
of a pair of genes/proteins
• Largely applied for functional network
inference and less for regulatory networks
Supervised learning of interactions
Define:
I12 = 1 if X1 interacts with Y2, 0 otherwise
Given:
X1Y2.features: Attributes of X1 and Y2
We need:
Prob. of interaction: P(I12=1|X1Y2.features)
Prob. of no interaction: P(I12=0|X1Y2.features)
Decision rule:
If Prob. of interacting > Prob. of non-interacting, predict I12=1 (edge between X1 and Y2); otherwise predict I12=0 (no edge).
Supervised learning of interactions
[Figure: supervised workflow. Positive and negative example pairs → feature extraction → feature set → training → trained classifier → testing on unlabeled pairs → predicted edges]
MouseNET: Inferring functional networks by
supervised integration of diverse datasets
• Goal:
– Predict functional interactions between pairs of genes based on diverse data
sets
• Gold standard:
– Positive set
• Hand-curated pairs of proteins that are known to be involved in the same function
– Negative set
• Pairs of proteins that have functional annotations but do not share any annotations
• Diverse datasets to represent noisy observations of edges:
– Physical interaction databases
– Co-association with different diseases
– Transferred interactions from orthologous pairs of yeast proteins
– Co-expression and co-tissue localization
• Based on a probabilistic framework for data integration
• Classification algorithm
– Naïve Bayes Classifier
Y. Guan, C. L. Myers et al., "A genomewide functional network for the laboratory mouse," PLoS Comput
Biol, vol. 4, no. 9, p. e1000165, Sep. 2008
Naïve Bayes classifier to integrate different
datasets
• Let FR denote the random variable for an interaction
• Let E1.. En represent the evidence from different databases for the edge
• Probabilistic data integration treats each of the evidences as noisy
observations for the edge
[Diagram: FR is the parent of E1, E2, …, En]
Naïve Bayes assumption: assume
independence among evidences given
the class variable
Learning entails estimating
the conditional distributions
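A minimal sketch of the Naïve Bayes integration described above, assuming binary evidence (each Ei is 0 or 1) and add-one smoothing when estimating P(Ei|FR). The function names, the smoothing choice, and all numbers are illustrative assumptions, not taken from MouseNET.

```python
import math

def estimate_likelihoods(examples, labels, pseudocount=1.0):
    """Estimate (P(Ei=1|FR=1), P(Ei=1|FR=0)) per evidence source from labeled pairs."""
    n_sources = len(examples[0])
    likelihoods = []
    for i in range(n_sources):
        per_class = []
        for c in (1, 0):
            rows = [x for x, y in zip(examples, labels) if y == c]
            hits = sum(x[i] for x in rows)
            # Smoothed frequency so no probability is exactly 0 or 1.
            per_class.append((hits + pseudocount) / (len(rows) + 2 * pseudocount))
        likelihoods.append(tuple(per_class))
    return likelihoods

def naive_bayes_posterior(prior, likelihoods, evidence):
    """Return P(FR=1 | E1..En), assuming the Ei are independent given FR."""
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for (p1, p0), e in zip(likelihoods, evidence):
        log_pos += math.log(p1 if e else 1.0 - p1)
        log_neg += math.log(p0 if e else 1.0 - p0)
    # Normalize in log space for numerical stability.
    m = max(log_pos, log_neg)
    pos, neg = math.exp(log_pos - m), math.exp(log_neg - m)
    return pos / (pos + neg)

# Hypothetical per-source likelihoods and one gene pair's evidence vector.
likelihoods = [(0.8, 0.1), (0.6, 0.3), (0.5, 0.4)]
posterior = naive_bayes_posterior(0.05, likelihoods, [1, 1, 0])
print(round(posterior, 3))
```

Because of the independence assumption, each dataset contributes one multiplicative factor, so adding a new evidence source only requires estimating one more pair of conditional probabilities.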
MouseNet: A Naïve Bayes classification
approach to infer a functional network
Functional Network for Mouse
Different types of datasets that
contribute to P(Ei|FR)
MouseNET recovers functional relationships
between mouse proteins
Data integration helps!
[Figure 2 of Guan et al., 2008 (doi:10.1371/journal.pcbi.1000165.g002): performance of the integrated network compared with individual datasets. (A) Five-fold cross-validation, with positive pairs defined by co-annotation to specific GO terms. (B) Evaluation against a different test set, with positive pairs defined by shared KEGG pathways.]
Classes of methods for integrative
unsupervised network inference
• Two approaches
– Weighted Edge aggregation
– Constrained model-based learning
• Weighted aggregation of different networks
– D. Marbach, S. Roy, F. Ay et al., "Predictive regulatory models in
drosophila melanogaster by integrative inference of transcriptional
networks," Genome Research, vol. 22, no. 7, pp. 1334-1349, Jul. 2012
• Model-based learning
– Auxiliary datasets serve to impose constraints on the graph structure
– We will look at three approaches to integrate other types of data to
better learn regulatory networks
– Physical Module Networks (Sep 27th)
– Bayesian network structure prior distributions (Oct 3rd, 4th, 6th)
– Dependency network parameter prior distributions (Oct 6th, 11th)
Strengths and weaknesses of different
integrative inference paradigms
Supervised
+ Evaluation is straightforward
+ Leverage ground truth directly
+ Easy to integrate different data
sources/clear optimization
function
- Need ground truth for training
- Negative examples are usually
not available
- Typically do not predict
expression of a target gene
Unsupervised
+ Do not need ground truth
+ Broadly applicable and flexible
with data sources
- Difficult to evaluate
- Typically do not perform as
well as supervised learning
when ground truth is known
- Learning/Setting hyperparameters is challenging
Motivation for Physical Module Networks
• Three main approaches to build a transcriptional
regulatory network
– Observational models (e.g. Bayesian networks)
• Fail to distinguish true regulation from co-expression
– Perturbational models (e.g. knockout)
• Fail to distinguish direct from indirect targets
– Physical models (TF binding)
• Fail to distinguish functional from non-functional binding
• Challenge
– Build a realistic model of gene regulation
– Combine changes in gene expression with the
underlying physical interactions.
Types of data used in Physical Module Networks
• Expression data
– Genome-wide mRNA levels from multiple
microarray or RNA-seq experiments
[Figure: expression matrix with genes as rows and samples as columns; example heat map from doi:10.1371/journal.pgen.1000223.g003]
• Physical interactions
– Transcription factor-gene interactions
• ChIP-chip and ChIP-seq
• Sequence-specific motifs
– Protein-protein interactions
Approach
• Formulate a probabilistic graphical model
called Physical Module Network (PMN)
• Two components of the model:
– Module network (M)
– Physical interaction graph (I)
A Physical Module Network
[Figure: a Physical Module Network, showing a module and its regulation program]
Module Network RECAP
• Segal et al, 2005
• Key assumptions
– Genes are co-expressed in modules
– Genes in the same module have the same regulators
– Expression of a gene is predictable by the expression of
the regulators (an assumption made by all expression-based network
inference methods)
• Module networks are made up of
– Module assignments
– Graph structure specifying parents of each module
– Conditional probability distributions of each module Mj given the parents of Mj
Physical Interaction Graph I
• A graph between genes and proteins
• Three types of edges
– Protein-protein interactions
– Protein-DNA interactions (TF binding)
– Transcriptional edges connecting each gene to its
protein product
• The graph may have nodes that are not
measured by expression
Consistency between M and I
• An MN is consistent with an interaction graph if for each pair
of regulator Xi and target module Mj, there is a consistent
physical Regulation Path from Xi to Mj.
• A Regulation Path explains how the “state” of the regulator
reaches a particular target module through a set of physical
interactions.
• Formally, a Regulation Path is
– a sequence of nodes ⟨v1,...,vn⟩ in I, where v1 is a protein
node of the protein product of Xi and vn is a transcription
factor (TF) that binds all the genes in Mj.
– partially directed, such that each edge between vl and vl+1 is
either undirected or directed from vl to vl+1
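The consistency check can be sketched as a breadth-first search, assuming I is stored as undirected edges (e.g. protein-protein interactions) plus directed edges, and that TF binding is kept in a separate map `binds` from each TF to the genes it binds. This representation and all node names are hypothetical, chosen only for illustration.

```python
from collections import deque

def build_adjacency(undirected, directed):
    """Adjacency sets: undirected edges go both ways, directed edges only forward."""
    adj = {}
    for u, v in undirected:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for u, v in directed:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())
    return adj

def find_regulation_path(adj, binds, regulator_protein, module_genes):
    """Return a path from the regulator's protein to a TF binding all module genes, or None."""
    module_genes = set(module_genes)
    parent = {regulator_protein: None}
    queue = deque([regulator_protein])
    while queue:
        node = queue.popleft()
        # A valid end point is a TF that binds every gene in the module.
        if module_genes <= binds.get(node, set()):
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Toy graph: pR - pA - pT are protein-protein edges; pB -> pR is directed.
adj = build_adjacency(undirected=[("pR", "pA"), ("pA", "pT")],
                      directed=[("pB", "pR")])
binds = {"pT": {"g1", "g2"}, "pB": {"g1"}}
path = find_regulation_path(adj, binds, "pR", ["g1", "g2"])
print(path)
```

In the toy graph, pB cannot be reached from pR because the directed edge points the wrong way, which is the sense in which the path is only partially directed.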
Example of consistency
A regulation path is needed for consistency between a Module Network and a Physical
interaction graph.
Learning a PMN
Learning in PMN
• Similar to Module Networks
– Optimize regulatory program per module
– Update module assignments
• But need to update I as well
– Need to check that I and M are consistent
– Change I to make sure it is consistent with M
– Assess the score due to change in I
PMN Learning algorithm
• Given
– Input gene expression data DX
– Observations of physical interactions DI
– Pool of potential regulators
• Find a PMN that best describes our data
• Use a score-based learning framework similar to MNs
– Iterative algorithm
• Optimize regulatory program per module (improve the quality of
gene expression prediction)
– Results in modification to the physical interaction graph
• Update module assignments
• Check for consistency in physical network and the MN
Scoring a PMN
• Score of a PMN P=<M,I> is
• Score decomposes into the Module Network (M) and
Interaction graph (I) part provided they are consistent
with each other
– Module Network part: from Segal et al., 2005
– Interaction graph part: introduced in this paper
A little bit of notation for the interaction graph
• Ie: An indicator variable set to 1 if edge e
appears in I
• de: An indicator variable set to 1 if edge e
appears in DI
Scoring the interaction graph
Score of the empty graph:
constant, that we will ignore
Assuming edges are independent, the first and third terms can be re-written as
Prior probabilities
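The equations on this slide were not preserved in the transcript. As a hedged sketch of how the edge-independence assumption simplifies an interaction graph score, assume the score is the log-likelihood of the observations plus the log prior, with I_e the indicator that edge e is in I and d_e the indicator that e is observed in D_I. This is a generic derivation under those assumptions and may differ in form from the paper's exact expression.

```latex
\begin{align*}
\log P(D_I \mid I) + \log P(I)
  &= \sum_{e} \log P(d_e \mid I_e) + \sum_{e} \log P(I_e) \\
\Delta(e)
  &= \log \frac{P(d_e \mid I_e = 1)}{P(d_e \mid I_e = 0)}
   + \log \frac{P(I_e = 1)}{P(I_e = 0)}
\end{align*}
```

Subtracting the score of the empty graph (all I_e = 0), which is a constant, leaves the sum of \Delta(e) over the edges with I_e = 1, so each edge can be scored on its own.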
Defining the edge probabilities
• Prior probability of an edge being present in I
– Adding an edge is more costly than not
• Probability of observing an edge in DI
– pe: P-value associated with e. P(de=1|Ie=1) will be high when pe is small
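A small sketch of how an edge's contribution to the score could be computed from these two ingredients. The functional forms are assumptions for illustration only: P(de=1|Ie=1) = 1 - pe, a fixed P(de=1|Ie=0), and a fixed prior P(Ie=1) below 0.5 so that adding an edge is penalized. The lecture does not give these forms.

```python
import math

def edge_log_odds(p_value, prior_edge=0.1, p_obs_given_absent=0.05):
    """Log-odds gained by adding an observed edge e (de = 1) to I.

    Assumed, hypothetical forms:
      P(de=1 | Ie=1) = 1 - p_value   (high when the p-value is small)
      P(de=1 | Ie=0) = p_obs_given_absent
      P(Ie=1)        = prior_edge    (< 0.5, so the prior term is negative)
    """
    eps = 1e-12
    p_obs_given_present = min(max(1.0 - p_value, eps), 1.0 - eps)
    likelihood_term = math.log(p_obs_given_present / p_obs_given_absent)
    prior_term = math.log(prior_edge / (1.0 - prior_edge))
    return likelihood_term + prior_term

for p in (0.001, 0.5, 0.9):
    print(p, round(edge_log_odds(p), 3))
```

Under these assumed numbers, a well-supported edge (small p-value) yields a positive contribution, while a weakly supported edge does not overcome the prior penalty.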
Consistency check and updating the interaction
graph
• Module network learning entails
– Adding an edge from regulator R to module Mj
– Removing an edge from regulator R to module Mj
– Reassigning genes to a module
• Each time, we need to change I to make sure
it is consistent with M and evaluate the effect on the
score
Updating interaction graph I when adding an
edge
• When adding an edge from regulator R to module Mj,
check if there is a consistent regulatory path from R to
Mj
• If there is none, consider adding TF-DNA edges to
introduce such paths
– For each TF T, search for the heaviest (shortest) path from
regulator R to T
– Add the cost of edges from T to genes in the module
– Select T that maximizes this score (sum of shortest path
and the sum of all edges)
– If addition of TF-DNA edges does not improve score, do
not add R to Mj.
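The TF selection step can be sketched with Dijkstra's algorithm, assuming each edge score has been converted to a non-negative cost (so the heaviest path in score is the shortest path in cost). The names `choose_tf`, `tf_dna_cost`, and `expression_gain` are hypothetical, and the accept/reject rule is a simplification of the score comparison described above.

```python
import heapq

def shortest_path_costs(adj, source):
    """Dijkstra over non-negative edge costs; adj[u] = {v: cost}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adj.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def choose_tf(adj, regulator, tfs, module_genes, tf_dna_cost, expression_gain):
    """Pick the TF with the cheapest path from R plus TF-DNA edges to the module.

    Returns (tf, net_gain), or (None, 0.0) if no TF makes adding R worthwhile.
    """
    dist = shortest_path_costs(adj, regulator)
    best_tf, best_cost = None, float("inf")
    for tf in tfs:
        if tf not in dist:
            continue  # no physical path from R to this TF
        total = dist[tf] + sum(tf_dna_cost(tf, g) for g in module_genes)
        if total < best_cost:
            best_tf, best_cost = tf, total
    if best_tf is None or expression_gain - best_cost <= 0:
        return None, 0.0
    return best_tf, expression_gain - best_cost

# Toy example: T2 is farther from R but already binds both module genes.
adj = {"R": {"A": 1.0, "T2": 5.0}, "A": {"T1": 1.0}}
existing = {("T2", "g1"), ("T2", "g2")}
cost = lambda tf, g: 0.0 if (tf, g) in existing else 2.0
result = choose_tf(adj, "R", ["T1", "T2"], ["g1", "g2"], cost, expression_gain=8.0)
print(result)
```

The toy case shows why both terms matter: the TF nearest to R is not chosen when connecting it to the module would require new, costly TF-DNA edges.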
Updating interaction graph I when adding an
edge
[Figure: deciding whether to add regulator R to module Mj (genes g1, g2). Left: the module network with the candidate edge from R to Mj. Middle: the current physical interaction graph I, with the shortest path from R to a TF T. Right: potential changes to the physical interaction graph, which must account for the cost of the shortest path and of the new TF-DNA interactions from T to g1 and g2.]
Updating interaction graph I when removing an
edge
• When removing an edge from regulator R to
module Mj, remove edges in I while
maintaining consistency
• Examine all edges on the paths from R to Mj and remove every
edge whose removal does not violate consistency
Updating interaction graph I during module reassignment
• Moving a gene g from Mj to Mk entails
– Removing TF-DNA edges for g in Mj
– Adding TF-DNA edges for g in Mk
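A minimal sketch of the bookkeeping for a reassignment, assuming each module has a known set of TFs at the ends of its regulation paths and that TF-DNA edges are stored as (TF, gene) pairs. The function and argument names are hypothetical.

```python
def reassignment_edge_changes(gene, tfs_of_source, tfs_of_target, tf_dna_edges):
    """Return (removed, added) TF-DNA edges when `gene` moves between modules."""
    # Edges from the old module's TFs to the gene are no longer required.
    removed = {(tf, gene) for tf in tfs_of_source if (tf, gene) in tf_dna_edges}
    # Edges from the new module's TFs to the gene must exist for consistency.
    added = {(tf, gene) for tf in tfs_of_target if (tf, gene) not in tf_dna_edges}
    # A TF shared by both modules still needs its edge, so keep it.
    removed -= {(tf, gene) for tf in tfs_of_target}
    return removed, added

edges = {("T1", "g"), ("T2", "g")}
removed, added = reassignment_edge_changes("g", ["T1", "T2"], ["T2", "T3"], edges)
print(sorted(removed), sorted(added))
```

The change in the interaction graph score for the move would then be the summed score of the added edges minus that of the removed edges.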
Summary of PMN learning
• Uses the same structure as the MN learning
• Each move in MN has some additional bookkeeping for ensuring a consistent physical
interaction graph
Experiments on simulated data
• 312 genes
– 7 modules regulated by 10 genes
• Sample physical networks to exhibit similar properties as
experimentally determined physical networks
– That is, node degree and edge density are similar to those
measured experimentally
– Select 7 TF proteins as true TFs associated with module
• Learn using 200 gene expression samples
• Evaluate using
– Likelihood on test data
– Accurate connection of regulators to genes
– Accurate inference of regulation path (only for PMN)
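One way the "accurate connection of regulators to genes" criterion could be computed is as precision and recall over regulator-gene pairs, after expanding each regulator-module edge to every gene in the module. This is an assumed evaluation recipe with hypothetical names, not necessarily the one used in the paper.

```python
def regulator_gene_pairs(module_regulators, module_genes):
    """Expand regulator -> module edges into a set of (regulator, gene) pairs."""
    return {(r, g)
            for m, regs in module_regulators.items()
            for r in regs
            for g in module_genes[m]}

def precision_recall(predicted, truth):
    """Precision and recall of predicted pairs against the true pairs."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Toy example with one learned module and a hypothetical ground truth.
predicted = regulator_gene_pairs({"M1": ["R1", "R2"]}, {"M1": ["g1", "g2"]})
truth = {("R1", "g1"), ("R1", "g2"), ("R3", "g1")}
precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```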
On simulated data PMNs have higher
likelihood, and higher precision
[Figure: panels (a)-(d). (a), (b): x-axis is number of modules. (c): x-axis is noise in distributions. (d): x-axis is number of bound targets per module (PMN only).]
Evaluation on real data: yeast
• Two expression datasets
– Assess the ability to recover "known" regulator-module
relationships based on regulator perturbation followed by
mRNA measurements
– Yeast cell cycle
• Physical interactions for 5,640 genes
– Protein-protein interactions: ~18K
– Protein-DNA interactions: ~91K
G1 and S phase induced module
Dataset description:
50 time points, 594 cycling genes, Protein-DNA interactions specifically from the cell cycle
PMN major results:
• The PMN learned 36 modules: 11 had one regulator and 4 had two regulators. Regulation path
length was ~2.5
• Modules differ based on which phase of cell cycle they peak
• The above module is one module associated
• TFs are chosen as regulators only in a few modules
PMN analysis of human host response to
influenza infection
• Novel insight: Viral polymerase proteins act upon host signaling pathway through several apoptosis pathway proteins (TRAF1, API1, etc.)
• Novel insight: New mechanistic pathways from viral proteins to known major immune response regulators (NFKB1, E2F1, IRF1)
Dataset description:
10 time points, protein-protein interactions between human host and 10 viral proteins.
Human-protein interactions from various sources, but including only 32 human TFs
Used 12 predefined modules. Connect 10 viral proteins to modules.
PMN key points
• A per-module probabilistic graphical model based
approach
• Regulatory program is learned while checking for
support in the physical network
– Learn a mechanistic program (we will see other ways
to do this in later lectures)
• Checking for consistency in the physical network
adds computational complexity
• Dependent upon the accuracy and completeness
of the physical network
PMNs vs MNs
• What are the advantages of physical module networks
compared to module networks?
– Enable a regulator to be selected based on expression
and a physical path
– Provides a more detailed picture of the regulatory
network
• What are the challenges in using PMNs?
– Need less noisy physical interaction graphs
– Application to mammalian systems required
additional pre-processing
– Likely not as scalable as module networks