Presentation

Download Report

Transcript Presentation

Exploring Network
Inference Models
Math-in-Industry Camp & Workshop:
Michael Grigsby: Cal Poly, Pomona
Mustafa Kesir: Northeastern University
Nancy Rodriguez: University of California, Los Angeles
Man Vu: Cal State University, Long Beach
Introduction-Problem Statement

Problem proposed by Ruye Wang from Harvey Mudd
College.

Some biological processes are modeled by networks
comprised of a group of interacting components such as
genes in a gene regulatory network, neurons in the
brain, or proteins.

Biologists want to know how the components of the
network are related and how they interact to make
predictions about the behavior of a biological system.
Introduction

Network Inference is an approach for modeling
and analyzing networks composed of many
interacting component units. That is, given a set
of genes a biologist performs a series of
experiments to test how the genes affect (excite
or inhibit) one another and also determine the
magnitude of that affect.
Introduction

There are several different mathematical models
for network inference each with it’s own
advantages and disadvantages.

One is the Boolean Network Model which
simulates the components by a group of binary
nodes interacting with each other that follow
logical operations.
Introduction
Another is the Linear and Quasi-Linear
models that assume the components are
linearly or quasi-linearly related in the
network.
 Then there is the Differential Equation
(DE) model that simulates the dynamics of
the network by a system of differential
equations. This is the model we studied.

Introduction

Given a set of n nodes (genes) in the network
and a set of k data points taken over time the
differential equation governing the dynamics of
the network is:
 rivi’ (t)
+ λivi(t) = g[Σ Tim vm(t) + hi]
Where (i=1,…,n and t=1,…m)

vi(t) is the observed data and the other
parameters are unknown where ri is a time
constant, λi is a scaling factor, Tim is a constant
that describes how node m affects node i, and hi
is a constant.
Introduction

The goal is to find estimates for the n by n matrix
T, along with the other unknown parameters
from the observed data Vi(t).

However an optimal search of a O(n2)dimensional space must be conducted in order
to find the parameters that minimize the error.
This is very computationally expensive and can
realistically be done for a network with only a
small number of nodes.
Our first Attempt
rv’ + λv= g(X)
 Try to find a relationship between r and λ
to reduce parameters.
r

λ
r = f(λ)
λ
But How?


Given g(x), some s function with range (-1,1)
g(x)= ex -1 then g’(x) Є (0,1]
ex+1
α Є (-1,1) and β Є (0,1]
rv’ + λv = g(X)
a
b
r
rv’’ + λv = g’(X)
c
d
λ
=
α
β
Other Methods
Most existing methods requires a heuristic
approach
 Requires many assumptions and parallel
programming
 Other than heuristic methods, statistical
methods are viable but not feasible for
large numbers of genes

Bayesian Networks
Statistical approach for modeling gene
networks
 Treats each gene as a random variable
 Joint distribution over all genes represents
the cell states
 Goal: estimate and study the structures of
the
distributions
1. http://www.cs.huji.ac.il/labs/compbio/ismb01/ismb01.pdf

2. http://www.cs.unm.edu/~patrik/networks/robust.pdf
To Name a Few

Boolean Networks: uses 0’s and 1’s to
represent the excitation or not
http://www.cs.ucdavis.edu/~filkov/classes/289a-W03/l10.pdf

Differential Equation Models:
 Many
unknown parameters and assumptions
 Nonlinear models needs to be linearized
 Computationally costly for large number of
genes
http://www.biochemsoctrans.org/bst/031/1519/bst0311519.htm
Simulated Annealing
1. Let X := initial configuration
2. Let E := Energy(X)
3. Let i = random move from the Moveset
4. Let Ei := Eval(move(X,i))
5. If E < Ei then X := move(X,i)
E := Ei
Else with some probability,
accept the move even though
things get worse:
X := move(X,i)
E := Ei
6. Goto 3 unless we have reached t_max
Allowable moves. Choosing this
is key!
Algorithm: Choosing (τ,λ)
The domain of g-1 is (-1,1)! This
is where conditions for λ come
in.
Algorithm: Solving for T and h
i.e
Algorithm: Decreasing Cost
Tm decreases with each iteration. The more iterations the
less likely you make possible “bad moves” same for change
in cost.
Possible Area of Improvement

If we had more time where would we focus?
 Simulated Annealing
is a good idea provided you
move within your moveset intelligently.
 Choosing the moveset is also important, for us g(x)
helps restrict the domain of λ based on τ. How do
you know the domain of τ.
 Finding the derivative matrix can possibly be
improved.
 Recovering the data, solving the ODE.
 Choosing the correct energy function.
 Solving the system of algebraic equations.
Ideas for moving within Moveset
Neighborhood of
search must be small
enough.

Recall the computations:

Might be better to check if λ0 lies within the range dictated by τ1, and
compare C(λ0 , τ1) to C(λ0 , τ0).
When k is not big enough, i.e. when k<n;

One obvious way could be:
 Once
 We
we interpolate to get vi (t);
can get as many time observations as we
need, i.e. we can make k as big as necessary.
Another way could be:

Again taking DE as the model;
 We
can reduce the number of nodes, i.e. get
a smaller number of nodes
 To get all unknowns ,,Tij, hi we need to
have k=n+1 or bigger. If k<n, then eliminate
(n-k-1) nodes.
 It can result in a loss of important data, the
way we do that is really important. Thinking of
vi (t)’s as functions, it’s possible that all n of
them are linearly independent.

Functional Data Analysis (FDA)(*) could be
extremely helpful in this manner. The thing is, in
biological applications, we usually have huge
n(~10000), and FDA is extremely useful in
dealing with big data samples.

(*) Ramsay, J. O. and Silverman, B.W. (2002) Applied functional data analysis :
methods and case studies, Springer series in statistics, New York ; London : Springer
(*) Ramsay, J. O. and Silverman, B.W. (2005) Functional data analysis, 2nd ed., New
York : Springer
Also available to view online through Claremont campus:


http://site.ebrary.com/lib/claremont/docDetail.action?docID=5006429
Working with the DE model, one immediately
notices that computational cost (O(n2)) is a major
obstacle. As long as complexity of FDA is not as
big as O(n2), at does not make things any worse.
(Actually, even if O(n2) is fine).
