Methods for evaluating inference algorithms
June 2005
Omer Berkman
Tel Aviv University, Israel.
Introduction
A central goal of molecular biology is to understand
the regulatory mechanisms of gene transcription and
protein synthesis.
One major goal of functional genomics research is to
take large sets of biological data and elucidate
functional interactions between elements in a causal
pathway or network.
The invention of DNA microarrays, which measure the
abundance of thousands of mRNA targets
simultaneously, has been an important milestone.
Many approaches to the reverse engineering of
genetic regulatory networks from gene expression
data have been explored.
Reminder of the problem
Interactions between thousands of genes have
to be learned from short time series of typically
only about a dozen measurements.
When can genetic networks be identified from microarray data?
According to Zak et al. (2002) – not too soon: identifying the network architecture may only be possible when a rich microarray time course is available.
Is the inference possible?
It is not clear whether an elicitation of biological
structures from expression data is at all
possible (whether the posterior probabilities
over structures can be expected to be
sufficiently informative).
Most studies assess the inference results by
comparing predicted regulatory interactions with
those known from the biological literature.
This approach is controversial due to the absence of known gold standards, which renders the estimation unreliable and difficult.
Biological literature – problem one
Having detected an interaction, authors tend to delve
into the literature to confirm their findings.
They back up their results with interactions between proteins similar to those encoded by the genes studied in the experiment.
First, sequence similarity does not necessarily imply
similar functions.
Second, there is an inherent arbitrariness in deciding
what is similar.
One may therefore suspect that some of the reported
true interactions are spurious and do not really support
the interactions detected in the reverse engineering
procedure.
Biological literature – problem two
The second and more serious drawback is the
difficulty in estimating the false detection rate.
This is because, when the algorithm predicts a gene interaction that is not supported by the literature, it is impossible to decide, without further expensive interventions in the form of multiple gene knock-out experiments, whether the algorithm has discovered a new, previously unknown interaction or whether it has flagged a spurious edge.
Simulation motivation
Some of the predicted interactions are
biologically reasonable, some appear to be
unreasonable, but the validity of the vast
majority is difficult to assess.
Testing all of the predicted interactions
experimentally would take decades.
No suitable approach has been formulated for
evaluating the effectiveness at recovering
models of complex biological systems from
limited data.
To overcome this limitation, recent works used biologically reasonable simulated data.
Simulation approach benefits
Enables evaluation of different algorithms.
Enables better understanding of the inference
algorithms.
Enables setting parameters such as noise level, network topology and sampling interval.
Enables learning with unlimited data.
Gives indications about environments that cannot yet be realized experimentally.
Enables the a priori design of experiments and data-collection protocols that are amenable to functional network inference.
Simulation approach illustration
(Smith et al. 2002)
Build a known network.
Create a simulator in which we know all the rules.
Sample data from it.
Present the sampled data to the inference algorithm.
Evaluate the algorithm output (a minimal sketch of this loop follows).
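A minimal Python sketch of this simulate–sample–infer–evaluate loop. The functions simulate, sample and infer are hypothetical placeholders standing in for a simulator such as BRAINSIM and an inference algorithm such as NETWORKINFERENCE; the recovery/correctness scoring mirrors the metrics reported later.

```python
# Hypothetical end-to-end evaluation loop: simulate, sample, infer and
# compare_edges are illustrative placeholders, not the authors' code.

def evaluate_inference(true_network, simulate, sample, infer, n_runs=10):
    """Score an inference algorithm against a known ground-truth network."""
    scores = []
    for _ in range(n_runs):
        trajectory = simulate(true_network)       # 1. run the known simulator
        data = sample(trajectory)                 # 2. sample (and discretize) data
        learned = infer(data)                     # 3. run the inference algorithm
        scores.append(compare_edges(true_network, learned))  # 4. evaluate output
    return scores

def compare_edges(true_net, learned_net):
    """Recovery (fraction of true edges found) and correctness
    (fraction of learned edges that are true)."""
    true_edges, learned_edges = set(true_net), set(learned_net)
    hits = len(true_edges & learned_edges)
    recovery = hits / len(true_edges)
    correctness = hits / max(len(learned_edges), 1)
    return recovery, correctness
```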
Smith et al. (2002) introduction
The first approach for evaluating the
effectiveness of inference algorithms at
recovering models from limited data.
They created a data simulator, sampled the
simulated data, and used the sampled data to
evaluate the effectiveness of the algorithm they
developed.
The first application to modeling complex
systems at multiple biological levels of
organization beyond genetic regulation.
BRAINSIM – Smith et al. (2002) Design
A simulator which models electrophysiological activity (0–400 Hz) and the expression levels of 100 genes (0–50 arbitrary units).
90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters.
(Smith et al. 2002)
BRAINSIM – Initialization
Activity begins at a random value.
The initial level of a gene is called its ‘target’ value, intended to correspond to its constitutive expression level.
This value is selected at random in the range 0–10 for up-regulated genes, 40–50 for down-regulated genes, and 0–50 for unregulated genes.
BRAINSIM – Dynamics rules
1. Degradation – a constant amount, chosen to be 4, is added to or subtracted from each gene to move it closer to its target level.
2. Regulation – for each regulated gene, a proportion (chosen to be 0.2) of its regulator’s level is added or subtracted.
Unregulated genes instead add or subtract a random number in the range 0–5, to simulate regulation by other, unmeasured processes.
3. Noise – a random amount in the range 0–6 is added to or subtracted from each gene to simulate stochasticity in gene expression. (A sketch of one update step follows.)
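A minimal Python sketch of one such update step, using the constants stated above (degradation step 4, regulation proportion 0.2, unregulated drift in 0–5, noise in 0–6). The function and variable names, the order in which the rules are applied, and the clipping to the 0–50 range are assumptions for illustration, not the published BRAINSIM code.

```python
import numpy as np

rng = np.random.default_rng(0)

def brainsim_step(levels, targets, regulator_of, sign_of):
    """One update of all gene levels under the three rules above.

    levels, targets : float arrays of current and target expression levels.
    regulator_of    : dict mapping a regulated gene's index to its regulator's index.
    sign_of         : dict mapping a regulated gene's index to +1 (up) or -1 (down).
    """
    new = levels.astype(float)                    # work on a copy
    for g in range(len(levels)):
        # 1. Degradation: a constant step of 4 toward the target level.
        new[g] += 4 * np.sign(targets[g] - levels[g])
        if g in regulator_of:
            # 2. Regulation: add/subtract a 0.2 proportion of the regulator's level.
            new[g] += sign_of[g] * 0.2 * levels[regulator_of[g]]
        else:
            # Unregulated genes take a random-walk step in [-5, 5].
            new[g] += rng.uniform(-5, 5)
        # 3. Noise: a random amount in [-6, 6] to simulate stochasticity.
        new[g] += rng.uniform(-6, 6)
    return np.clip(new, 0, 50)                    # keep levels in the 0-50 range
```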
BRAINSIM – Simulated Data
Generated data for Activity, Gene1, Gene10
and Gene12. Note the regulation effects and
the time lags, as well as the random walk of the
unregulated gene.
Data are sampled every 5 time steps, from time 45 to 150, causing loss of information.
Values are quartile-discretized, causing further information loss but making the data more robust (see the sketch below).
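A sketch of these two post-processing steps under the stated settings; numpy is assumed, and the function name and the (time × genes) array layout are illustrative.

```python
import numpy as np

def sample_and_discretize(trajectory, start=45, stop=150, step=5):
    """Subsample a (time x genes) trajectory every `step` steps, then
    quartile-discretize each gene into integer levels 0-3."""
    sampled = trajectory[start:stop + 1:step]              # keep every 5th time point
    discrete = np.empty_like(sampled, dtype=int)
    for g in range(sampled.shape[1]):
        q = np.quantile(sampled[:, g], [0.25, 0.5, 0.75])  # quartile boundaries
        discrete[:, g] = np.digitize(sampled[:, g], q)     # bin into 4 levels
    return discrete
```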
NETWORKINFERENCE – Smith et al. (2002) Design
Takes a collection of observed data as input.
Searches for Bayesian networks that are good
at explaining the observed data without
unnecessary complexity.
Designed in a manner capable of searching for dynamic Bayesian networks from temporal data.
Bayesian networks definitions
A representation of a joint probability distribution which consists of two components:
G is a directed acyclic graph (DAG) whose vertices correspond to the random variables.
θ describes a conditional distribution for each variable, given its parents in G.
Bayesian networks illustration
The network structure implies that the joint distribution has the product form P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E) (made concrete in the sketch below).
A graph-based model that captures properties
of conditional independence between
variables.
Encodes the Markov assumption: Each
variable is independent of its non-descendants,
given its parents in the graph.
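To make the product form concrete, a toy Python sketch that evaluates this joint distribution for binary variables; all probability values below are invented for illustration.

```python
from itertools import product

# Toy evaluation of P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E) for
# binary variables; the numbers are made up.
p_A, p_E = 0.3, 0.6                            # P(A=True), P(E=True)
p_B = {(True, True): 0.9, (True, False): 0.5,  # P(B=True | A, E)
       (False, True): 0.4, (False, False): 0.1}
p_C = {True: 0.8, False: 0.2}                  # P(C=True | B)
p_D = {True: 0.7, False: 0.3}                  # P(D=True | A)

def bern(p_true, value):
    """Probability of a binary outcome, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """Joint probability via the DAG factorization."""
    return (bern(p_A, a) * bern(p_B[(a, e)], b) * bern(p_C[b], c)
            * bern(p_D[a], d) * bern(p_E, e))

# Sanity check: the joint sums to 1 over all 32 assignments.
assert abs(sum(joint(*v) for v in product([True, False], repeat=5)) - 1.0) < 1e-12
```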
Learning a network from data
Find the parameters:

$$G^* = \arg\max_G P(G \mid D), \qquad \theta^* = \arg\max_{\theta} P(\theta \mid G^*, D).$$

From Bayes’ rule we have:

$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}, \qquad P(D) = \sum_G P(D \mid G)\,P(G),$$

$$P(D \mid G) = \int P(D \mid \theta, G)\,P(\theta \mid G)\,\mathrm{d}\theta.$$
Bayesian networks properties
Can capture many types of relationships between variables.
Owing to their probabilistic nature, BN
algorithms are capable of handling noisy
data.
Effective handling of hundreds of
variables.
BNs model genetic networks
BNs were first applied to this problem by Friedman et al. (2000) and Pe’er and Hartemink (2001).
At a qualitative level, the structure of a BN
describes the relationships between variables in
the form of conditional independence relations.
At a quantitative level, relationships between
the interacting agents are described by
conditional probability distributions.
The probabilistic nature of this approach makes it capable of handling the noise inherent in both the biological processes and the microarray experiments.
Bayesian networks drawbacks
Networks with the same skeleton but different edge directions can have the same marginal likelihood P(D|M). Thus it is impossible to distinguish between them on the basis of the data.
The acyclicity constraint rules out
recurrent structures. However, feedback
is an essential feature of biological
systems.
Dynamic Bayesian networks illustration
[Figure: a recurrent network comprising two genes that interact with each other through feedback, and the equivalent DBN obtained by unfolding the recurrent network in time (Husmeier 2003).]
An interaction between genes is not
instantaneous. Its effect happens with
a time delay after its cause.
NETWORKINFERENCE – Search strategy
Identifying the highest-scoring network has been shown to be NP-hard (Chickering 1996).
NETWORKINFERENCE uses simulated
annealing (with extension for re-annealing to
avoid becoming trapped in local maxima).
NETWORKINFERENCE also allows the user to modify the prior over network structures by specifying sets of links that are required to be present or required to be absent.
Simulated annealing overview
The concept is based on the process of annealing: a melt is slowly cooled so that the system at any time is approximately in thermodynamic equilibrium.
If the change in energy is negative, the new configuration is accepted. If the change in energy is positive, it is accepted with a certain probability. This process is then repeated sufficiently many times to give good sampling statistics for the current temperature; the temperature is then decremented and the process repeated until a frozen state is achieved. (A minimal sketch follows.)
If the initial temperature of the system is too low, or cooling is done insufficiently slowly, the system may become trapped in a local minimum energy state.
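A minimal generic sketch of this procedure in Python (Metropolis acceptance with geometric cooling). The energy function, the proposal mechanism and the schedule constants are assumptions for illustration, not the NETWORKINFERENCE settings.

```python
import math
import random

def simulated_annealing(initial, energy, propose,
                        t_start=10.0, cooling=0.95, sweeps=100, t_min=1e-3):
    """Generic simulated annealing: `energy` scores a configuration and
    `propose` generates a perturbed candidate."""
    state, e = initial, energy(initial)
    t = t_start
    while t > t_min:                   # cool until an (approximately) frozen state
        for _ in range(sweeps):        # enough samples at the current temperature
            candidate = propose(state)
            delta = energy(candidate) - e
            # Accept downhill moves always; uphill moves with prob. exp(-delta/t).
            if delta < 0 or random.random() < math.exp(-delta / t):
                state, e = candidate, e + delta
        t *= cooling                   # decrement the temperature
    return state, e
```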
Simulated annealing illustration
[Figure from the Computational Science Education Project, http://www.phy.ornl.gov/csep]
NETWORKINFERENCE – Results
Sampling rate of 5 – Recovery: 89±0.1%, Correctness: 98±0.1%.
Sampling rate of 10 – Recovery: 27±0.3%, Correctness: 30±0.4%.
Smith et al. (2002) – discussion
It is possible to sort out noise and distracters
from a specific regulatory network of interest.
It is possible to evaluate and thus develop
inference algorithms by using simulated data.
Shows how sampling influences the inference.
The simulator and the inference algorithm are based on distinct models.
A very simple model was used for the simulator.
The inference procedure was tested on an unrealistically large training set.
The conclusions are not convincing enough.
Husmeier (2003) – abstract
The objective of this study is to test the
viability of the BN paradigm in a realistic
simulation study.
First, gene expression data are simulated
from a realistic biological network.
Then, interaction networks are inferred from
these data using DBN and Bayesian
learning with Markov Chain Monte Carlo.
Inference – MCMC
Since exhaustive search over network structures is intractable, one has to resort to heuristic optimization methods.
Markov Chain Monte Carlo (Chib and Greenberg 1995): given a network structure Mold, a new structure Mnew is proposed with proposal probability Q(Mnew|Mold). (Husmeier 2003)
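Under the standard Metropolis–Hastings scheme this describes, the proposed structure is accepted with probability

$$A(M_{\text{new}} \mid M_{\text{old}}) = \min\left\{1,\ \frac{P(D \mid M_{\text{new}})\,P(M_{\text{new}})\,Q(M_{\text{old}} \mid M_{\text{new}})}{P(D \mid M_{\text{old}})\,P(M_{\text{old}})\,Q(M_{\text{new}} \mid M_{\text{old}})}\right\},$$

and otherwise the old structure is retained; sampling from this chain yields structures drawn, asymptotically, from the posterior P(M | D).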
DBN settings
(Nachman et al. 2004)
To avoid an explosion of the model complexity,
parameters are tied such that the transition
probabilities between time slices are the same.
Edges within a time slice are not allowed.
A limitation was set on the maximum number of
edges converging on a single node.
Simulation study
Data are generated from a known Bayesian
network.
New networks are learned and compared with
the true network.
Evaluating inference
Sensitivity is the proportion of recovered true
edges:
TP / (TP + FN)
Complementary specificity is the proportion of
erroneously recovered spurious edges:
FP / (TN + FP)
Plotting the ensuing sensitivity against the corresponding complementary specificity gives the Receiver Operating Characteristic (ROC) curve; a sketch of this computation follows.
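A Python sketch of these quantities, assuming the inference produces a confidence score (e.g. a marginal posterior probability) for every candidate edge; the function name and inputs are illustrative.

```python
def roc_points(edge_scores, true_edges, thresholds):
    """Sensitivity vs. complementary specificity at each threshold.

    edge_scores : dict mapping each candidate edge to an inferred confidence.
    true_edges  : set of edges in the ground-truth network (assumed non-empty,
                  with at least one non-edge among the candidates).
    """
    points = []
    for theta in thresholds:
        predicted = {e for e, s in edge_scores.items() if s >= theta}
        tp = len(predicted & true_edges)
        fp = len(predicted - true_edges)
        fn = len(true_edges - predicted)
        tn = len(edge_scores) - tp - fp - fn
        sensitivity = tp / (tp + fn)        # proportion of recovered true edges
        comp_spec = fp / (tn + fp)          # erroneously recovered spurious edges
        points.append((comp_spec, sensitivity))
    return points
```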
Evaluation on synthetic data – structure
Using the structure of a sub-network of the yeast cell cycle, taken from Friedman et al. (2000).
In a second series of simulations, 38 redundant
nodes were added to this network as
confounders, giving a total of 50 nodes.
Evaluation on synthetic data – distribution
First conditional probability distribution: Noisy
regulation according to a binomial distribution
Excitation: P(on|on) = 0.9, P(on|off) = 0.1;
Inhibition: P(on|on) = 0.1, P(on|off) = 0.9;
Noisy XOR co-regulation:
P(on|on,on) = P(on|off,off) = 0.1,
P(on|off,on) = P(on|on,off) = 0.9;
Second conditional probability distribution: stochastic interaction, where all parameters were chosen at random (binomial or trinomial). (Sampling from the first set of tables is sketched below.)
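A sketch of drawing one child state from the noisy-regulation tables above; the parameter values are exactly those listed, while the dictionary layout and function name are illustrative.

```python
import random

# P(child = 'on' | parent state(s)), as in the tables above.
EXCITATION = {'on': 0.9, 'off': 0.1}
INHIBITION = {'on': 0.1, 'off': 0.9}
NOISY_XOR = {('on', 'on'): 0.1, ('off', 'off'): 0.1,
             ('on', 'off'): 0.9, ('off', 'on'): 0.9}

def sample_child(p_on):
    """Draw 'on'/'off' from a binomial conditional distribution."""
    return 'on' if random.random() < p_on else 'off'

# Examples: an excited child of an 'on' parent, an inhibited child of an
# 'off' parent, and an XOR-co-regulated child of mixed parents.
print(sample_child(EXCITATION['on']))
print(sample_child(INHIBITION['off']))
print(sample_child(NOISY_XOR[('on', 'off')]))
```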
Evaluation on synthetic data – results (ROC curves)
[ROC curves for Structure 1 and Structure 2 at training-set sizes T = 100, 30 and 7. Solid lines show noisy regulation; dashed lines show stochastic interaction.]
And still, this is an upper bound
The model used for inference is identical to
the simulator model.
The true continuous signals are typically
sampled at discrete time points, which loses
information, especially if the sampling intervals
are not matched to the relaxation times of the
true biological processes.
Gene expression ratios are typically
discretized, which inevitably adds noise and
causes further loss of information.
Evaluation on realistic simulated data
The study applies the model regulatory network proposed by Zak et al. (2001) but, in contrast to that study, the system was augmented by adding 41 spurious, unconnected genes (giving a total of 50 genes), which were up- and down-regulated at random. (Husmeier 2003)
Realistic simulated data – two experiments
The first experiment (which closely followed Zak’s study): ligand was injected at time 1000 min; 12 data points were then collected over 4000 min at equidistant intervals.
The second experiment (which adopted a sampling strategy different from Zak’s) focused on the time immediately after the external perturbation, when the system is in disequilibrium. The sampler therefore collected 12 data points over a shorter interval of only 500 min immediately after ligand injection, between times 1100 and 1600 min.
Inference results
The most restrictive prior (maximum fan-in = 2).
The threshold θ was chosen so as to obtain the
same number of edges between non-spurious
nodes as in the true network.
[Figure: the true network vs. the learned network. Solid arrows show true edges and dashed arrows represent spurious edges.]
Realistic simulated data – Results
[ROC curves for Experiment 1 and Experiment 2, with maximum fan-in of 2, 3 and 4, averaged over three MCMC simulations. The solid line shows the ROC curve obtained from gene expression data alone; the dash-dotted line shows the ROC curve obtained when including sequence information.]
Realistic simulated data – Discussion
The inclusion of available prior knowledge improves
the performance of the inference.
Sampling over a long time interval, which covers the system in equilibrium, gives small areas under the ROC curve, and the low slope of the curves at the left-hand side implies that even the dominant true edges are obscured by a large proportion of spurious edges.
Sampling over a short time interval, when the system is in a perturbed non-equilibrium state, gives significantly larger areas, and the larger slope of the curves implies that the dominant true edges are obscured by far fewer spurious edges.
Discussion
The search for new genetic interactions is hard,
but it is significantly more effective than a
search from tabula rasa.
Simulations quantify how much can be learned
from the data in the described unfavourable
situation.
The study demonstrates how the network inference
performance varies with the training set size,
the prior assumptions, the experimental
sampling strategy and the inclusion of biological
information.
References
Friedman et al., “Learning the structure of dynamic probabilistic networks”,
Proceedings of the Fourteenth Conference on Uncertainty in Artificial
Intelligence, 1998.
Friedman et al., “Using Bayesian Networks to Analyze Expression Data”,
Journal of Computational Biology, 2000.
Husmeier D., “Sensitivity and specificity of inferring genetic regulatory
interactions from microarray experiments with dynamic Bayesian
networks”, Bioinformatics 2003.
Nachman et al., “Inferring quantitative models of regulatory networks from
expression data”, Bioinformatics 2004.
Smith et al. “Evaluating functional network inference using simulations of
complex biological systems”, Bioinformatics 2002.
Yu et al., “Advances to Bayesian network inference for generating causal
networks from observational biological data”, Bioinformatics 2004.
Zak et al., “Local Identifiability: when can genetic networks be identified from microarray data?”, Proceedings of the Third International Conference on Systems Biology, 2002.