duplicativenetworks
Download
Report
Transcript duplicativenetworks
Proteome Network
Evolution
by Gene Duplication
S. Cenk Şahinalp
Simon Fraser University
Acknowledgements
Colin Cooper, KCL
Joe Nadeau, CWRU
Petra Berenbrink, SFU
Gurkan Bebek, CWRU
NSF, NSERC, CRC Programme, Charles Wang Foundation
Networks are found in biological systems of
varying scales:
time units: millions of years
1. the evolutionary tree of life
2. ecological networks
3. the genetic control networks of organisms
4. the protein interaction network in cells
5. the metabolic network in cells
time units: millionth of a second
Proteins in a cell
There are thousands of different active proteins
in a cell acting as:
enzymes, catalysors to chemical reactions of the
metabolism
components of cellular machinery (e.g. ribosomes)
regulators of gene expression
Certain proteins play specific roles in special
cellular compartments
Others move from one compartment to another
as “signals”.
Protein Interactions
Proteins are produced and degraded all of the time.
The rates at which these processes occur depend on
what proteins are already present, how they interact with
one another directly and how they interact with genes (at
DNA or mRNA level).
Proteins that bind to DNA or RNA have direct effect on
production or degradation of other proteins.
One protein can speed up or slow down the rate of
production of another by binding to the corresponding
DNA or mRNA,
What is a proteome network?
Represents interactions between pairs of
proteins as a binary relationship.
Forms a network in which:
Vertex = protein
Link = interaction
Establishes an ordinary graph of all
proteins in an organism and all possible
interactions between them.
MIPS Proteome Network Topology
PPI Database Sources
DIP (Database of Interacting Proteins - UCLA)
BIND (also include other molecule interactions)
MIPS (Munich information center for proteins)
others including:
PROTEOME
PRONET
CURAGEN
PIM
see http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html
Complete Yeast Proteome Network
The yeast proteome network seems to reveal
two basic graph theoretic properties:
The frequency of proteins having interactions with
exactly k other proteins follows a power law:
f(d) ~ C.d-a
The network exhibits the small world phenomena:
small degree of separation between individuals
Degree Distribution of PPI Network
of the yeast [Wagner], [Jeong et al.]
Small world phenomena & power-law degree distribution
also observed in:
communication networks
web graphs
research citation networks
social networks
[Albert, Barabasi & Jeong], [Broder et al.], [Faloutsos3]
Classical -Erdos-Renyi type random graphs do not
exhibit these properties:
Links between pairs of fixed set of nodes picked uniformly:
Maximum degree logarithmic with network size
No hubs to make short connections between nodes
Preferential Attachment Model
[Yule], [Simon]
New node
G
Power-law graphs can be
generated by an iterative
process:
Add one new node at a time
Connect new node to existing
ones independently:
Probability that a node is
connected to the new node is
proportional to degree
[Bollabas et al]
Such graphs also exhibit small
world phenomena
[Barabasi & Albert],
[Bollabas & Riordan]
Proteome network modeling
The model should capture the underlying mechanisms
that generate the network while satisfying known
mathematical properties:
Ohno’s model of genome growth by duplication
Duplication based graphs [Kleinberg et al.], [Kumar et al]
[Pastor-Satorras et al], [Chung et al.]:
each iteration duplicates a randomly chosen vertex with all its
links.
it then independently deletes existing edges and inserts new
ones.
Analysis of incoming degree distribution in directed graphs reveal a
power law.
Simulations on undirected networks exhibit power law like degree
distributions.
Duplication Model
At each iteration t
(= total number of nodes)
G
1. Existing vertex is chosen
uniformly at random and is
``duplicated'' with all its links.
2. Emulate mutations by
a. each link of the new
vertex is deleted with
probability q = 1-p
b. inserting edges between
the new node and every
other node with
probability r/t
Degree distribution of the best
fitting duplication model
Expected degree distribution
Iterative process give difference equations based on both degree and
time; for r = 0:
Ft+1(d) = Ft(d) (1- pd/t) + Ft(d-1) p(d-1)/t + 1/t Sj>d-1Ft(j) pk qj-k [j! / k! (j-k)!]
[Pastor-Satorras et al.]: approximate difference equations by differential
equations to come up with a power law with exponential cutoff:
F(d)/t = f(d) ~ C d-a b-d
Underlying assumption: Pr[ t+1 generates a degree d node]
depends on ft(d+1) and ft(d) only
[Chung et al.]: verify whether power law degree distribution is satisfied
by the difference equations:
ft(d) ~ C d-a for sufficiently large t,d
Underlying assumption: ft(d) is independent of t for all d
Counter evidence
ft(d) with exponential cutoff will result in a
maximum degree of O(log t) as per
Erdos-Renyi graphs.
Power law degree distribution implies a
maximum degree of W(tp)
ft(d) can not be independent of t:
for r=0 and p=.5, the fraction of singletons
approach 1 with growing t
What to do with singletons
Allow them to exist and duplicate (and let them dominate)
Allow them to exist but not duplicate (will have fixed
fraction – do not agree with the yeast network)
Do not allow them to exist:
Either delete as soon as one is created
Or, always have one default connection to one of the existing
nodes
adding a fourth term to the difference equation
Ft+1(k) = Ft(k) (1- pk/t)
+ Ft(k-1) p(k-1)/t
+ 1/t Sj>k-1Ft(j) pk qj-k [j! / k! (j-k)!]
+ (Ft(k-1) - Ft(k)) Sj>0 Ft(j) qj / t2
which gives a power law if 1 = pa – p + p-a
Other properties: k-reachability
rk(n): number of nodes that are at most k
hops away from n.
r1(n1) = 5
r2(n1) = 9
r3(n1) = 10
k-reachability of individual nodes
(nodes sorted by degree)
Average k-reachability
of nodes with fixed initial degree
Average degree
distribution of the
neighbors is
independent of
the degree of the
original node
However the
variance is high
Sequence Homology
Two proteins are sequencewise homologous if their pairwise cDNA
alignment results with 50% similarity and above:
Dual phase distribution of the total number of protein pairs as a function of
percentage similarity:
cDNA sequence source: ftp://genome-ftp.stanford.edu/pub/yeast
Sequence homology vs interactions
If p1 - p2 are similar and p2 - p3 interact,
then with .03 chance p1 - p3 interact.
if p1 - p2 are similar, p2 - p3 are similar,
then p1 - p3 are similar with .64 chance
If two genes physically interact with each other, it
is very likely that they are not similar (excluding
self interactions).
Sequence Homology Enhanced
Duplication Model
•
Given duplicate node i:
Each interaction edge (i,j) is deleted with probability q.
For each similarity edge (j,k), with .03 probabilty, the interaction edge (i,k) is
deleted.
•
Each similarity edge (i,j) is deleted with probability q’.
For each similarity edge (j,k), with .64 probabilty the similarity edge (i,k) is
deleted.
For each interaction edge (j,k), with .03 probabilty the interaction edge (i,k) is
deleted.
•
For each j, a new interaction edge (i,j) is added with probability r/t.
For each similarity edge (j,k), with .03 probabilty the interaction edge (i,k) is
added.
•
A new similarity edge (i,j) is added with probability r’/t.
For each similarity edge (j,k), with .64 probabilty, the similarity edge (i,k) is
added.
For each interaction edge (j,k), with .03 probabilty, the interaction edge (i,k) is
added.
Degree Distribution of the
Enhanced Model
k-reachability of individual nodes
(nodes sorted by degree)