Reverse engineering gene networks using singular value decomposition and robust regression
M.K. Stephen Yeung
Jesper Tegnér
James J. Collins
General idea
Reverse-engineer:
• Genome-wide scale
• Small amount of data
• No prior knowledge
• Using SVD for a family of possible solutions
• Using robust regression to choose from them
If the system is near a steady state, the dynamics can be approximated by a linear system of N ODEs:

$$\dot{x}_i(t) = -\lambda_i\, x_i(t) + \sum_{j=1}^{N} W_{ij}\, x_j(t) + b_i(t) + \xi_i(t)$$
xi = concentration of mRNA i (reflects the expression level of gene i)
λi = self-degradation rate
bi = external stimulus
ξi = noise
Wij = type and strength of the effect of gene j on gene i
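To make the model concrete, here is a minimal NumPy sketch (not from the paper; the network size, rates, and stimuli are made-up example values) that integrates the linearized dynamics with forward Euler, omitting the noise term ξ:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                                    # number of genes (example value)
lam = rng.uniform(0.5, 1.5, size=N)       # self-degradation rates lambda_i
W = rng.normal(size=(N, N)) * (rng.random((N, N)) < 0.3)  # sparse connections
b = rng.uniform(-0.1, 0.1, size=N)        # constant external stimuli b_i

A = W - np.diag(lam)                      # A = W + diag(-lambda_i), see below
x = rng.uniform(0.0, 0.1, size=N)         # small perturbation near steady state
dt, steps = 0.01, 200
for _ in range(steps):                    # forward-Euler integration, noise omitted
    x = x + dt * (A @ x + b)
```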
Assumptions made:
• No time dependency in the connections (so W is not time-dependent), and they are not changed by the experiments
• System near steady state
• Noise will be discarded, so exact measurements are assumed
• Ẋ (the time derivatives) can be estimated accurately enough
In M experiments with N genes,
• each time apply stimuli (b1,…,bN) to the genes
• measure concentrations of N mRNAs (x1,…,xN)
using a microarray
You get:
$$X_{N \times M} = \begin{pmatrix} x_1^1 & x_1^2 & \cdots & x_1^M \\ x_2^1 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \ddots & \vdots \\ x_N^1 & x_N^2 & \cdots & x_N^M \end{pmatrix}$$
subscript i = mRNA number
superscript j = experiment number
The goal is to use as few measurements as possible. With this method (and exact measurements):
M = O(log N)
(Figure: example results from the first test.)
The system becomes:

$$\dot{X}_{N \times M} = A_{N \times N}\, X_{N \times M} + B_{N \times M}$$

with A = W + diag(−λi)
Compute Ẋ using several measurements of X (e.g., by interpolation).
Goal = deduce W (or A) from the rest.
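One possible way to estimate Ẋ, sketched with SciPy cubic splines; the time grid and the decaying placeholder data below are assumptions, not the paper's data:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical samples: x_samples[i, k] = concentration of mRNA i at time t[k].
t = np.linspace(0.0, 1.0, 20)
x_samples = np.outer(np.arange(1, 6), np.exp(-t))   # placeholder: 5 decaying genes

# Fit a cubic spline per gene, then evaluate its first derivative at t = 0
# (nu=1 asks CubicSpline for the derivative of the interpolant).
x_dot = np.array([CubicSpline(t, xi)(0.0, nu=1) for xi in x_samples])
```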
Transposing the system:

$$X^T_{M \times N}\, A^T_{N \times N} = \dot{X}^T_{M \times N} - B^T_{M \times N}$$
If M = N, we could compute (X^T)^{-1}, but usually M << N
(our goal is M = O(log N)).
Therefore, use SVD (to find the least-squares solution):

$$X^T_{M \times N} = U_{M \times N}\, W_{N \times N}\, V^T_{N \times N}$$
Here, U and V are column-orthogonal (for square V, V^T = V^{-1}), and W = diag(w1,…,wN), with wi the singular values of X.
Suppose the zero singular values come first: wi = 0 for i = 1…L and wi ≠ 0 for i = L+1…N.
$$U_{M \times N}\, \mathrm{diag}(w_i)_{N \times N}\, V^T_{N \times N}\, A^T_{N \times N} = \dot{X}^T_{M \times N} - B^T_{M \times N}$$
Then the least-squares (L2) solution to the problem is:

$$A_0 = \left(\dot{X}_{N \times M} - B_{N \times M}\right)\, U_{M \times N}\, \mathrm{diag}\!\left(\frac{1}{w_j}\right)_{N \times N}\, V^T_{N \times N}$$

with 1/wj replaced by 0 if wj = 0.
This formula matches every data point as closely as possible (in the least-squares sense).
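A minimal NumPy sketch of this step; the function name and the tolerance tol are mine, and X, X_dot, B are the N×M matrices defined above:

```python
import numpy as np

def svd_particular_solution(X, X_dot, B, tol=1e-10):
    """Least-squares (L2) solution A0 of A X = X_dot - B, via SVD of X^T."""
    # Economy SVD: X.T = U @ diag(w) @ Vt.  With M < N this returns only the
    # first M singular values; the omitted ones are zero anyway.
    U, w, Vt = np.linalg.svd(X.T, full_matrices=False)
    w_inv = np.where(w > tol, 1.0 / w, 0.0)   # replace 1/w_j by 0 where w_j = 0
    # A0 = (X_dot - B) @ U @ diag(1/w_j) @ V^T
    return (X_dot - B) @ U @ np.diag(w_inv) @ Vt
```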
But all possible solutions are:

$$A = A_0 + C\, V^T$$

with C = (c_ij)_{N×N}, where c_ij = 0 if j > L and is an arbitrary scalar coefficient otherwise.
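Any member of this family fits the data equally well, which can be checked numerically. A sketch with random stand-in data and my own variable names (note NumPy returns singular values in descending order, so the zero ones come last rather than first):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3                                   # small example with M << N
X, X_dot = rng.normal(size=(N, M)), rng.normal(size=(N, M))
B = np.zeros((N, M))

U, w, Vt = np.linalg.svd(X.T, full_matrices=True)   # Vt is the full N x N V^T
w_full = np.concatenate([w, np.zeros(N - M)])       # pad the implicit zero w_j
null_cols = w_full < 1e-10                          # columns of V with w_j = 0

w_inv = np.where(w > 1e-10, 1.0 / w, 0.0)
A0 = (X_dot - B) @ U @ np.diag(w_inv) @ Vt[:M]      # particular solution

C = np.zeros((N, N))
C[:, null_cols] = rng.normal(size=(N, int(null_cols.sum())))
A = A0 + C @ Vt                                     # another member of the family

assert np.allclose(A0 @ X, A @ X)                   # identical fit to the data
```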
How to choose from the family of solutions?
The least-squares method tries to match every data point as closely as possible,
→ a not-so-sparse matrix with many small entries.
1. Based on prior biological knowledge, impose constraints on the solutions.
e.g., when we know two genes are related, the solution matrix must reflect this.
2. Work from the assumption that normal gene networks are sparse, and look for the sparsest matrix;
thus: search for the c_ij that maximize the number of zero entries in A.
So:
• get as many zero entries as you can
• thereby obtaining a sparse matrix
• the non-zero entries form the connections
• fit as many measurements as you can exactly: “robust regression”
(so exact measurements are assumed)
Do this using L1 regression. Thus, when considering

$$A = A_0 + C\, V^T$$

we want to “minimize” A. The L1-regression idea is then to look for the solution C for which $\|A_0 + C\, V^T\|_1$ is minimal.
This yields as many zeros as possible.
Implementation was done using the simplex method (a linear-programming algorithm).
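A sketch of the row-wise L1 step as a linear program via scipy.optimize.linprog (SciPy's HiGHS solver stands in for the simplex method; the function name and formulation are mine). Each row of A0 is adjusted within the SVD null space to minimize its L1 norm:

```python
import numpy as np
from scipy.optimize import linprog

def sparsest_row(a0_row, V_null):
    """Minimize ||a0_row + c @ V_null||_1 over c, via linear programming.

    a0_row : (N,)  one row of the particular solution A0
    V_null : (L, N) the rows of V^T whose singular values are zero
    """
    L, N = V_null.shape
    # Variables z = [c (L entries), u (N entries)]:
    #   minimize sum(u)  subject to  -u <= a0_row + c @ V_null <= u
    cost = np.concatenate([np.zeros(L), np.ones(N)])
    A_ub = np.block([[ V_null.T, -np.eye(N)],
                     [-V_null.T, -np.eye(N)]])
    b_ub = np.concatenate([-a0_row, a0_row])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * L + [(0, None)] * N)
    return a0_row + res.x[:L] @ V_null
```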
Thus, to reverse-engineer a network of N genes, we “only” need Mc = O(log N) experiments.
Then Mc << N, and the computational cost is O(N^4).
(Brute-force methods would cost O(N!/(k!(N−k)!)), with k the number of non-zero entries.)
Test 1
• Create a random connectivity matrix (see the sketch below):
for each row, select k entries to be non-zero
- k < kmax << N (to impose sparseness)
- each non-zero entry drawn from a uniform distribution
• Apply random perturbations
• Take measurements while the system relaxes back to its previous steady state → X
• Compute Ẋ by interpolation
• Repeat this M times
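A sketch of the Test 1 network generation (my function name; kmax and the weight range are example values):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_sparse_network(N, k_max=10):
    """Random connectivity matrix: fewer than k_max non-zeros per row."""
    W = np.zeros((N, N))
    for i in range(N):
        k = rng.integers(1, k_max)                    # k < k_max << N
        cols = rng.choice(N, size=k, replace=False)   # which genes affect gene i
        W[i, cols] = rng.uniform(-1.0, 1.0, size=k)   # uniform random strengths
    return W
```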
Test 1
• Then apply the algorithm to obtain an approximation of A
• Computed error (with Ã the computed A):

$$E = \sum_{i=1}^{N} \sum_{j=1}^{N} e_{ij}, \qquad e_{ij} = \begin{cases} 1 & \text{if } |\tilde{A}_{ij} - A_{ij}| > \epsilon \\ 0 & \text{otherwise} \end{cases}$$

(ε a small tolerance)
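The corresponding error count in NumPy (the tolerance eps is my choice of threshold):

```python
import numpy as np

def count_errors(A_hat, A_true, eps=1e-2):
    """E = number of entries where |A_hat - A_true| exceeds the tolerance."""
    return int(np.sum(np.abs(A_hat - A_true) > eps))
```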
• Results: Mc = O(log N)
• Better than SVD alone, without regression
Test 2
• One-dimensional cascade of genes
• Result for N = 400: Mc = 70
Test 3
• Large sparse gene network, with random connections, external stimuli, …
• Results were the same as in the previous tests
Discussion
Advantages:
• Very little data needed, in comparison with neural networks or Bayesian models
• No prior knowledge needed
• Easy to parallelize, as it recovers the connectivity matrix row by row (gene by gene); see the sketch after this list
• Also applicable to protein networks
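Since each row is an independent problem, the recovery parallelizes naturally. A sketch using Python's multiprocessing (it assumes the hypothetical sparsest_row from the earlier sketch, plus A0 and V_null already computed):

```python
import numpy as np
from functools import partial
from multiprocessing import Pool

def recover_network(A0, V_null):
    """Recover A row by row in parallel worker processes.

    Reuses the hypothetical sparsest_row(a0_row, V_null) sketched earlier;
    each row is an independent L1 problem, so the rows parallelize trivially.
    """
    with Pool() as pool:
        rows = pool.map(partial(sparsest_row, V_null=V_null), list(A0))
    return np.array(rows)
```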
Discussion
Disadvantages:
• Less efficient for small networks (M ≈ N)
• No quantification yet of the necessary “sparseness”, though an average of 10 connections works well for a network of > 200 genes
• Uncertainty in Ẋ
• Mainly useful with exact data, which we don’t have
Improvements
• Other algorithms to impose sparseness: alternatives are possible both for the L1 norm (the basic criterion) and for the simplex method (the implementation)
• By using a deterministic linear system of ODEs, a lot has been neglected (noise, time delays, nonlinearities)
• Connections could be changed by the experiments themselves; a time-dependent W would then be necessary