Gene network inference - Institute for Mathematics and its


A comparative approach for gene
network inference using time-series
gene expression data
Guillaume Bourque* and
David Sankoff
*Centre de Recherches Mathématiques,
Université de Montréal
October 2003
DNA Microarrays
• Experiment design
• Noise reduction
• Normalization
•…
• Data analysis
http://www.sri.com/pharmdisc/cancer_biology/laderoute.html
Gene Expression Data
Beyond Clustering…
[Figure: time series data mapped to a gene network on genes x1 to x4, with edges labeled + or −; one edge is marked "?"]
Comparative Framework
Species A
Species B
Species C
Harder Problem?
• This new problem seems more ambitious and
harder to solve.
• BUT, we will show that, for closely related
species (samples), the comparative framework
can actually improve the quality of the
solutions recovered.
• The repetitive nature of the data can be used to
sort through some of the noise and some of the
ambiguity.
Outline
• Gene network model
• Single network inference
– Algorithm
– Simulations
• Multiple networks inference
– Algorithm
– Simulations
• Conclusions
Gene Network Model
• We use linear differential equations to model the gene
trajectories (Chen et al. ’99, D’haeseleer et al. ’99):
dxi(t) / dt = ai,0 + ai,1 x1(t) + ai,2 x2(t) + … + ai,m xm(t)
• Several reasons for that choice:
– Takes advantage of the continuous aspect of the data.
– Allows for feed-back loops.
– Low number of parameters implies that we are less likely
to overfit the data.
– Sufficient to model complex interactions between genes.
Small Network Example
[Figure: network on genes x1 to x4 with edges labeled + or −]
dx1(t) / dt = 0.491 - 0.248 x1(t)
dx2(t) / dt = -0.473 x3(t) + 0.374 x4(t)
dx3(t) / dt = -0.427 + 0.376 x1(t) - 0.241 x3(t)
dx4(t) / dt = 0.435 x1(t) - 0.315 x3(t) - 0.437 x4(t)
In the equations above, the numbers multiplying x1(t), …, x4(t) are the
interaction coefficients, and the standalone terms (0.491 and -0.427) are
the constant coefficients.
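To see what trajectories the small network produces, the four equations can be integrated numerically. The sketch below is only an illustration: the forward Euler scheme, the step size, the initial expression levels and the names `COEFFS`, `derivatives` and `simulate` are assumptions, not part of the original slides.

```python
# Coefficients copied from the four equations of the small network example.
# Row i holds [constant, effect of x1, effect of x2, effect of x3, effect of x4].
COEFFS = [
    [0.491, -0.248, 0.0, 0.0, 0.0],     # dx1/dt
    [0.0, 0.0, 0.0, -0.473, 0.374],     # dx2/dt
    [-0.427, 0.376, 0.0, -0.241, 0.0],  # dx3/dt
    [0.0, 0.435, 0.0, -0.315, -0.437],  # dx4/dt
]


def derivatives(x, coeffs=COEFFS):
    """Evaluate dx_i/dt = constant_i + sum_j a_ij * x_j for every gene i."""
    return [row[0] + sum(a * xj for a, xj in zip(row[1:], x)) for row in coeffs]


def simulate(x0, dt=0.01, steps=1000, coeffs=COEFFS):
    """Integrate with forward Euler; returns the list of states, x0 included."""
    trajectory = [list(x0)]
    for _ in range(steps):
        x = trajectory[-1]
        dx = derivatives(x, coeffs)
        trajectory.append([xi + dt * dxi for xi, dxi in zip(x, dx)])
    return trajectory


# Hypothetical initial expression levels (not from the slides).
trajectory = simulate([1.0, 1.0, 1.0, 1.0])
```

Sampling a few rows of `trajectory` gives the kind of short time series the inference problem starts from.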
Problem Revisited
      ai,0    ai,1    ai,2    ai,3    ai,4
x1    .431   -.248    0       0       0
x2    0       0       0      -.473    .374
x3   -.427    .376    0      -.241    0
x4    0       .435    0      -.315   -.437
Given the time-series data, can we find
the interaction coefficients?
Linear Differential Equations
• Even under the simplest linear model, there are
m(m+1) unknown parameters to estimate:
• m(m-1) directional effects
• m self effects
• m constant effects
• Number of data points is mn and we typically have
that n << m (few time-points).
• To avoid overfitting, extra constraints must be
incorporated into the model, such as:
• Smoothness of the equations (D’haeseleer et al. ’99)
• Sparseness of the network, i.e. few non-null interaction
coefficients (Yeung et al. ’02, De Hoon et al. ’02)
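The parameter count can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the values m = 100 genes and n = 10 time points are hypothetical, chosen only to show how quickly the unknowns outnumber the measurements.

```python
def parameter_count(m):
    """Unknowns in the linear model for m genes."""
    directional = m * (m - 1)  # effect of each gene on every other gene
    self_effects = m           # effect of each gene on itself
    constants = m              # one constant term per gene
    return directional + self_effects + constants


def data_point_count(m, n):
    """Measurements available: m genes observed at n time points."""
    return m * n


m, n = 100, 10  # hypothetical sizes, not taken from the slides
unknowns = parameter_count(m)
measurements = data_point_count(m, n)
print(unknowns, measurements)
```

With these sizes the model has roughly ten times more unknowns than data points, which is why a sparseness or smoothness constraint is needed.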
Algorithm for Network Inference
• To recover the interaction coefficients, we use
stepwise multiple linear regression.
• Why?
– This procedure finds coefficients that significantly improve
the fit in the regression. It limits the number of non-zero
coefficients (i.e. it finds sparse networks), a feature we
were seeking.
– It is highly flexible and provides p-value scores which can
be interpreted easily.
Partial F Test
• The procedure finds the interaction coefficients
iteratively for each gene xi.
• A partial F test is constructed to compare the total
square error of the predicted gene trajectory with a
specific subset of coefficients being added or
removed.
• If the p-value obtained from the test falls below a
certain cutoff, the subset of coefficients is
significant and is added; a subset already in the model is
removed once its p-value rises above the cutoff.
• The procedure iterates until no more subsets of
coefficients are either added or removed.
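The add/remove loop described above can be sketched for a single target gene as follows. This is not the authors' implementation: it assumes that `y` holds the estimated derivative of the target gene and the columns of `X` hold the expression levels of the candidate regulators, it adds or removes one coefficient at a time rather than a subset, it always keeps the constant term, and the cutoffs `alpha_in` and `alpha_out` are hypothetical. It relies on NumPy (`numpy.linalg.lstsq`) and SciPy (`scipy.stats.f.sf`).

```python
import numpy as np
from scipy import stats


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on a constant plus cols."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    return float(residual @ residual)


def partial_f_pvalue(rss_small, rss_big, n_obs, n_params_big, n_added=1):
    """p-value of the partial F test comparing two nested models."""
    dof = n_obs - n_params_big
    if dof <= 0:
        return 1.0  # too few observations to test the larger model
    numerator = max(rss_small - rss_big, 0.0) / n_added
    denominator = max(rss_big / dof, 1e-12)  # guard against a perfect fit
    return float(stats.f.sf(numerator / denominator, n_added, dof))


def stepwise_select(X, y, alpha_in=0.05, alpha_out=0.10, max_iter=100):
    """Return the sorted column indices kept by stepwise regression."""
    n_obs, n_candidates = X.shape
    selected = []
    for _ in range(max_iter):
        changed = False
        # Forward step: add the most significant remaining candidate.
        current = rss(X, y, selected)
        best_p, best_c = 1.0, None
        for c in range(n_candidates):
            if c in selected:
                continue
            trial = selected + [c]
            p = partial_f_pvalue(current, rss(X, y, trial), n_obs, len(trial) + 1)
            if p < best_p:
                best_p, best_c = p, c
        if best_c is not None and best_p < alpha_in:
            selected.append(best_c)
            changed = True
        # Backward step: drop the least significant term if it no longer helps.
        if selected:
            current = rss(X, y, selected)
            worst_p, worst_c = 0.0, None
            for c in selected:
                reduced = [s for s in selected if s != c]
                p = partial_f_pvalue(rss(X, y, reduced), current, n_obs,
                                     len(selected) + 1)
                if p > worst_p:
                    worst_p, worst_c = p, c
            if worst_c is not None and worst_p > alpha_out:
                selected.remove(worst_c)
                changed = True
        if not changed:
            break
    return sorted(selected)


# Synthetic check: only columns 1 and 4 influence y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6))
y_demo = 2.0 * X_demo[:, 1] - 3.0 * X_demo[:, 4] + 0.1 * rng.normal(size=40)
kept = stepwise_select(X_demo, y_demo)
```

Setting `alpha_out` above `alpha_in` keeps a term that was just added from being removed in the same pass, which avoids cycling.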
Simulations
• Difficult to find coefficients that will produce
realistic gene trajectories.
• We select coefficients such that the resulting
trajectories satisfy 3 conditions:
– They are bounded
– The correlation of any pair is not too high
– They are not too stable
• We added Gaussian noise to model errors.
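One way to screen simulated trajectories against the three conditions, and then perturb them, is sketched below. The thresholds `bound`, `max_corr` and `min_range` are hypothetical, and reading "not too stable" as "each trajectory varies by at least `min_range`" is an assumption; the slides do not give the exact criteria.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0.0 or sd_v == 0.0:
        return 0.0
    return cov / (sd_u * sd_v)


def acceptable(trajectories, bound=10.0, max_corr=0.95, min_range=0.1):
    """Check the three conditions on a list holding one series per gene."""
    for series in trajectories:
        if max(abs(v) for v in series) > bound:
            return False  # not bounded
        if max(series) - min(series) < min_range:
            return False  # too stable (assumed meaning: nearly flat)
    for i in range(len(trajectories)):
        for j in range(i + 1, len(trajectories)):
            if abs(pearson(trajectories[i], trajectories[j])) > max_corr:
                return False  # pair too highly correlated
    return True


def add_gaussian_noise(series, sigma, rng):
    """Return a copy of the series with independent N(0, sigma^2) noise added."""
    return [v + rng.gauss(0.0, sigma) for v in series]


# Two toy series that pass the screen, then a noisy copy of the first one.
a = [0.0, 1.0, 0.0, -1.0, 0.0]
b = [1.0, 0.0, -1.0, 0.0, 1.0]
passes = acceptable([a, b])
noisy_a = add_gaussian_noise(a, sigma=0.05, rng=random.Random(0))
```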
Gaussian Noise
Network Inference
[Figure: regression procedure, with the coefficient table from Problem Revisited and the corresponding 4-gene network]
The procedure perfectly recovers this network with 4 genes
and 10 interaction coefficients.
10 Genes
The procedure also perfectly recovers this network with 10
genes and 22 interaction coefficients.
Multiple Networks
Species A
Species B
Species C
Types of Problems
• Multiple networks related by a graph or a tree can
arise from various situations:
– Different species
– Different developmental stages
– Different tissues
• The goal is now not only to maximize the fit (with as
few interactions as possible) but also to minimize an
evolutionary cost on the graph of the networks.
Evolutionary Cost
[Figure: a tree of networks whose vertices carry sets of predicted regulators, {1, 2}, {1, 2, 3}, {1}, {1, 3} and {1, 2, 3}; evolutionary events are marked on the edges]
Evolutionary cost = 3
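A simple way to compute such a cost is to count, on every edge, the regulators gained or lost between its two endpoints. The sketch below assumes that definition and a hypothetical tree topology (root, one internal vertex, three leaves); the five regulator sets are the ones shown on the slide, but how they are attached to the tree is a guess that happens to reproduce the cost of 3.

```python
def evolutionary_cost(regulators, edges):
    """Sum, over all edges, of the regulators gained or lost along the edge."""
    return sum(len(regulators[u] ^ regulators[v]) for u, v in edges)


# Regulator sets from the slide; vertex names and topology are hypothetical.
regulators = {
    "root": {1, 2},
    "internal": {1, 2, 3},
    "leaf_a": {1, 2, 3},
    "leaf_b": {1, 3},
    "leaf_c": {1},
}
edges = [
    ("root", "internal"),    # gain of regulator 3
    ("internal", "leaf_a"),  # no change
    ("internal", "leaf_b"),  # loss of regulator 2
    ("root", "leaf_c"),      # loss of regulator 2
]
cost = evolutionary_cost(regulators, edges)
```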
Multiple Network Inference
• The stepwise regression algorithm is modified to add/remove
subsets of regulators directly on the edges of the graph.
• Partial F tests are computed on the vertices affected by this
change to evaluate the change in fit.
• The p-values obtained are then modified based on the change
in evolutionary cost.
• The p-values are finally combined into a scoring function
using a Kolmogorov-Smirnov Test.
• The algorithm iteratively adds/removes the best scoring move
when above/below a certain threshold.
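The slides do not say how the Kolmogorov-Smirnov test is applied. One plausible reading, shown here purely as an assumption, is to test whether the p-values collected from the affected vertices are smaller than a Uniform(0, 1) sample would be, using SciPy's `scipy.stats.kstest` with `alternative="greater"`. The function name `ks_score` and the example p-values are hypothetical, and the adjustment for evolutionary cost is left out.

```python
from scipy import stats


def ks_score(p_values):
    """One-sided one-sample KS test of the p-values against Uniform(0, 1).

    Returns (statistic, p-value). A small returned p-value means the input
    p-values are concentrated near zero more than uniform values would be.
    """
    result = stats.kstest(p_values, "uniform", alternative="greater")
    return float(result.statistic), float(result.pvalue)


# Hypothetical p-values from the vertices affected by one candidate move.
strong_move = [0.001, 0.002, 0.01, 0.02, 0.03]
weak_move = [0.1, 0.3, 0.5, 0.7, 0.9]
strong_stat, strong_p = ks_score(strong_move)
weak_stat, weak_p = ks_score(weak_move)
```

Under this reading, the move with the smaller combined p-value (`strong_move` here) would be the better scoring one.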
Simulation Example
Simulation Results
Conclusions
• The comparative framework actually simplifies the
inference process, especially for instances of the
problem with more genes, more noise or fewer time points.
• The procedure could also be used for the revision of
gene networks.
• Possibility of exploring different evolutionary
models.
• We need to try the procedure on real data.