Transcript Lecture2x
Lecture 2
Complex Network Models
Properties of Protein-Protein Interaction Networks
Handling Multivariate data: Concept and types of metrics, distances etc.
Hierarchical clustering
Self organizing Mapping
Complex Network Models:
Average Path length L, Clustering coefficient C, Degree
Distribution P(k) help understand the global structure
of the network.
Some well-known types of Network Models are as
follows:
•Regular Coupled Networks
•Random Graphs
•Small world Networks
•Scale-free Networks
•Hierarchical Networks
Regular networks
[Figure: diamond crystal and graphite crystal; both diamond and graphite are carbon]
Regular network (A ring lattice)
Average path length L is high
Clustering coefficient C is high
Degree distribution is delta type.
[Figure: P(k) versus k for k = 1 to 5, a single peak with P(k) = 1]
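The three quantities L, C and P(k) can be computed directly for a small ring lattice. The following is a minimal sketch, not taken from the lecture: the function names and the choice of 20 nodes with 4 neighbors each are assumptions made for illustration.

```python
from collections import deque

def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue  # such a node contributes 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS from every node)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def degree_distribution(adj):
    """Fraction of nodes having each degree k."""
    counts = {}
    for nbrs in adj.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return {k: c / len(adj) for k, c in counts.items()}

lattice = ring_lattice(20, 4)  # hypothetical example size
print(clustering_coefficient(lattice), average_path_length(lattice), degree_distribution(lattice))
```

Every node has the same degree here, so the degree distribution has a single entry, which is the delta-type distribution mentioned above.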
Random Graph
Erdos and Renyi introduced the concept of random
graph around 60 years ago.
[Figure: random graphs on N = 10 nodes, Emax = N(N-1)/2 = 45, for p = 0, p = 0.1, p = 0.15 and p = 0.25]
[Figure: ER networks with p = 0.01, p = 0.02 and p = 0.078]
The figure above shows ER networks of n = 50 nodes for three different values of p (p < 1/n, p = 1/n, p = log(n)/n).
For small p the network is disconnected and consists of isolated nodes and isolated components. At p = 1/n (when the average degree is 1) a phase transition occurs, creating a giant component.
In almost all cases, the ER network becomes connected for p > log(n)/n.
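The growth of the largest component with p can be explored with a small simulation. This is a sketch under assumptions: the function names and the fixed seed are illustrative, and the values of p follow the three regimes named above for n = 50.

```python
import random
from math import log

def er_graph(n, p, seed=None):
    """Erdos-Renyi G(n, p): each of the n(n-1)/2 possible edges is present with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def component_sizes(adj):
    """Sizes of the connected components, largest first."""
    seen, sizes = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, size = [s], 0
        seen.add(s)
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)

n = 50
for p in (0.01, 1 / n, log(n) / n):
    sizes = component_sizes(er_graph(n, p, seed=0))
    print(f"p={p:.3f}: largest component has {sizes[0]} of {n} nodes")
```

The exact sizes depend on the random seed; repeating the run with several seeds gives a better picture of the transition than a single graph.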
Random Graph
The degree distribution of the ER model follows a binomial distribution, which becomes approximately Poissonian as the network size grows:
$P(k) \approx e^{-\lambda} \lambda^{k} / k!$
Here, λ = average degree = p(N-1) ≈ pN
[Figure: random graph with p = 0.25]
Average path length L is low
Clustering coefficient C is low
Degree distribution is exponential type.
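The closeness of the binomial and Poisson forms can be checked numerically. The sketch below assumes N = 1000 and p = 0.005, values chosen only for illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability that a node has degree k when each of n possible edges exists with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson approximation with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

N, p = 1000, 0.005        # hypothetical network size and edge probability
lam = p * (N - 1)         # average degree
max_gap = max(abs(binomial_pmf(k, N - 1, p) - poisson_pmf(k, lam)) for k in range(30))
print(f"lambda = {lam:.3f}, largest pointwise gap for k < 30: {max_gap:.2e}")
```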
Random Graph
Usually to compare a real network with a random network we
first generate a random network of the same size i.e. with the
same number of nodes and edges.
Besides Erdős-Rényi random graphs, there are other types of random graphs.
A random graph can be constructed such that it matches the degree distribution or some other topological properties of a given graph.
Geometric random graphs:
each vertex is assigned random coordinates in a geometric
space of arbitrary dimensionality and random edges are allowed
between adjacent points or points constrained by a threshold
distance.
Geometric random graph: Example
Small world model (Watts and Strogatz)
Oftentimes, soon after meeting a stranger, one is surprised to find that they have a common friend; so they both cheer:
“What a small world!”
Small world model (Watts and Strogatz)
Begin with a nearest-neighbor coupled network
Randomly rewire each edge of the network with some probability p
Fig A
Fig B
Fig. B shows the small-world network generated by starting from the regular coupled network of Fig. A with p = 0.25, i.e. 25% of the edges of the network of Fig. A are rewired randomly to generate the network of Fig. B.
As p approaches 1, the network approaches a random network of ER type.
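The rewiring procedure can be sketched as follows. This is an illustrative simplification, not the exact procedure behind the figures: each edge keeps its first endpoint and, with probability p, gets a new second endpoint chosen uniformly among nodes that would not create a self-loop or a duplicate edge. Function names and sizes are assumptions.

```python
import random

def ring_lattice_edges(n, k):
    """Edges of a nearest-neighbor coupled ring: node i is linked to i+1, ..., i+k/2 (mod n)."""
    return [(i, (i + step) % n) for i in range(n) for step in range(1, k // 2 + 1)]

def rewire(n, edges, p, seed=None):
    """Rewire each edge with probability p, avoiding self-loops and duplicate edges."""
    rng = random.Random(seed)
    present = {frozenset(e) for e in edges}
    result = []
    for u, v in edges:
        if rng.random() < p:
            candidates = [w for w in range(n)
                          if w != u and frozenset((u, w)) not in present]
            if candidates:
                w = rng.choice(candidates)
                present.discard(frozenset((u, v)))
                present.add(frozenset((u, w)))
                v = w
        result.append((u, v))
    return result

regular = ring_lattice_edges(20, 4)            # hypothetical starting lattice
small_world = rewire(20, regular, 0.25, seed=0)
print(sum(1 for a, b in zip(regular, small_world) if a != b), "edges rewired")
```

The number of edges never changes; only their endpoints do, which is why the average degree stays fixed while the path length drops.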
Small world model (Watts and Strogatz)
Average path length L is low
Clustering coefficient C is high
Degree distribution is exponential type.
Scale-free model (Barabási and Albert)
Start with a small number of nodes; at every time step, a new node is introduced and is connected to already-existing nodes following Preferential Attachment (the probability is high that a new node is connected to high-degree nodes).
Average path length L is low
Clustering coefficient C is not clearly known.
Degree distribution is power-law type: P(k) ~ k^{-γ}
[Figure: log-log plot of P(k) (from 1 down to 0.0001) versus k (from 1 to 1000) for γ = 2 and γ = 3]
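Preferential attachment can be sketched with a list in which every node appears once per incident edge, so that a uniform draw from the list is proportional to degree. The seed network (a complete graph on m + 1 nodes) and the parameter names are assumptions of this sketch, not details given in the lecture.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a network to n nodes; each new node attaches to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed seed network: a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Each node is listed once per incident edge, so rng.choice is degree-proportional.
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges

edges = barabasi_albert(200, 2, seed=1)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print("max degree:", max(degree.values()), "min degree:", min(degree.values()))
```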
Scale-free networks exhibit robustness
Robustness – The ability of complex systems to maintain their
function even when the structure of the system changes significantly
Tolerant to random removal of nodes (mutations)
Vulnerable to targeted attack of hubs (mutations) – Drug
targets
Scale-free model (Barabási and Albert)
The term “scale-free” refers to any functional form f(x) that remains unchanged to within a multiplicative factor under a rescaling of the independent variable x, i.e. f(ax) = bf(x).
This means power-law forms (P(k) ~ k^{-γ}), since these are the only solutions to f(ax) = bf(x), and hence “power-law” is referred to as “scale-free”.
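As a short check of this property, a power law with an arbitrary constant c (a symbol introduced here only for illustration) satisfies the rescaling condition, whereas an exponential does not:

```latex
f(x) = c\,x^{-\gamma}
\;\Longrightarrow\;
f(ax) = c\,(ax)^{-\gamma} = a^{-\gamma}\,c\,x^{-\gamma} = b\,f(x),
\qquad b = a^{-\gamma}.

g(x) = e^{-x}
\;\Longrightarrow\;
g(ax) = e^{-ax} = g(x)^{a},
\quad \text{which is not of the form } b\,g(x) \text{ with } b \text{ independent of } x.
```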
Hierarchical Graphs
NETWORK BIOLOGY: UNDERSTANDING THE CELL’S FUNCTIONAL ORGANIZATION
Albert-László Barabási & Zoltán N. Oltvai
NATURE REVIEWS | GENETICS VOLUME 5 | FEBRUARY 2004 | 101
The starting point of this construction is a small cluster of four densely linked nodes (see the four central nodes in the figure). Next, three replicas of this module are generated and the three external nodes of the replicated clusters are connected to the central node of the old cluster, which produces a large 16-node module. Three replicas of this 16-node module are then generated and the 12 peripheral nodes are connected to the central node of the old module, which produces a new module of 64 nodes.
Hierarchical Graphs
The hierarchical network model seamlessly integrates a scale-free topology with an inherent modular structure by generating a network that has a power-law degree distribution with degree exponent γ = 1 + ln4/ln3 = 2.26 and a large, system-size independent average clustering coefficient <C> ~ 0.6. The most important signature of hierarchical modularity is the scaling of the clustering coefficient, which follows C(k) ~ k^{-1}, a straight line of slope -1 on a log-log plot.
[Figure: comparison of random, scale-free and hierarchical networks]
protein-protein interaction
Typical protein-protein interaction
A protein binds with one or several other proteins in order to perform different biological functions; the resulting assemblies are called protein complexes.
[Figure: a protein complex] This complex transports oxygen from the lungs to cells all over the body through blood circulation.
Source: PROTEIN-PROTEIN INTERACTIONS by Catherine Royer, Biophysics Textbook Online
Network of interactions and complexes
•Usually protein-protein interaction data are produced by laboratory experiments (yeast two-hybrid, pull-down assay etc.)
[Figure: detected complex data with bait protein A and interacted proteins B, C, D, E and F, converted to binary interactions by the spoke approach (the bait is linked to each interacted protein) and by the matrix approach (all members of the complex are linked to each other)]
•The results of the experiments are converted to binary interactions.
•The binary interactions can be represented as a network/graph where a node represents a protein and an edge represents an interaction.
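The two conversions from a detected complex to binary interactions can be sketched as below. The function names are illustrative; the bait A and the interacted proteins B to F follow the figure labels.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke approach: the bait is linked to every interacted protein."""
    return {frozenset((bait, prey)) for prey in preys}

def matrix_edges(bait, preys):
    """Matrix approach: every pair of proteins in the detected complex is linked."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}

preys = ["B", "C", "D", "E", "F"]
spoke = spoke_edges("A", preys)
matrix = matrix_edges("A", preys)
print(len(spoke), "spoke interactions;", len(matrix), "matrix interactions")
```

The spoke edges are always a subset of the matrix edges, so the matrix approach yields denser networks from the same experiment.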
Network of interactions
List of interactions:
AtpB  AtpA
AtpG  AtpE
AtpA  AtpH
AtpB  AtpH
AtpG  AtpH
AtpE  AtpH
Adjacency matrix:
00101
00011
10001
01001
11110
[Figure: corresponding network]
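Building an adjacency matrix from a list of interactions can be sketched as follows. The pairing of the protein names into six interactions and the row order (AtpB, AtpG, AtpA, AtpE, AtpH) are assumptions reconstructed from the slide; they are one reading that reproduces the five matrix rows shown.

```python
# Assumed pairing of the names on the slide into binary interactions.
interactions = [
    ("AtpB", "AtpA"), ("AtpG", "AtpE"), ("AtpA", "AtpH"),
    ("AtpB", "AtpH"), ("AtpG", "AtpH"), ("AtpE", "AtpH"),
]
# Assumed row/column order of the matrix.
proteins = ["AtpB", "AtpG", "AtpA", "AtpE", "AtpH"]

def adjacency_matrix(nodes, edges):
    """Symmetric 0/1 matrix: entry [i][j] is 1 when nodes i and j interact."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix

rows = ["".join(str(x) for x in row) for row in adjacency_matrix(proteins, interactions)]
for name, row in zip(proteins, rows):
    print(f"{name}  {row}")
```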
"The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes" by A. Wagner, Mol. Biol. Evol. 2001
S. cerevisiae: 985 proteins and 899 interactions
Giant component consists of 466 proteins
Average degree ~ 2
Clustering coefficient = 0.022
Degree distribution is scale free
An E. coli interaction network from DIP (http://dip.mbi.ucla.edu/).
E. coli: 300 proteins and 287 interactions
The components of this graph have been determined by applying the Depth First Search algorithm.
There are 62 components in total.
Giant component: 93 proteins
[Figure: log-log plot of Log(No. of Node) versus Log(Degree)]
Average degree ~ 1.913
Clustering coefficient = 0.29
Degree distribution ~ scale free
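Finding components with depth-first search can be sketched as below. The protein names P1 to P5 and their interactions are hypothetical; proteins without any interaction do not appear in an edge list and would need to be added as singleton components.

```python
def connected_components(edges):
    """Group the nodes of an undirected edge list into components using iterative DFS."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()          # last in, first out: depth-first order
            members.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(members))
    return sorted(components, key=len, reverse=True)

# Hypothetical toy interaction list.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P4", "P5")]
components = connected_components(toy_edges)
print(len(components), "components; giant component:", components[0])
```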
"Lethality and Centrality in protein networks" by H. Jeong, S. P. Mason, A.-L. Barabasi, Z. N. Oltvai, Nature, May 2001
S. cerevisiae: 1870 proteins and 2240 interactions
Almost all proteins are connected
Degree distribution is scale free
PPI network based on MIPS database consisting of 4546 proteins and 12319 interactions
Average degree 5.42
Clustering coefficient = 0.18
Giant component consists of 4385 proteins
Degree distribution ~ scale free
[Figure: log-log plot of the degree distribution]
# of proteins   # of interactions   Average degree   Clustering coefficient   Giant component   Degree distribution
985             899                 ~2               0.022                    Exists, 47.3%     Power law
300             287                 1.913            0.29                     Exists, 31%       Almost power law
1870            2240                ______           ______                   Exists, ~100%     Power law
4546            12319               5.42             0.18                     Exists, ~96%      Almost power law
A complete PPI network tends to be a connected graph and tends to have a power-law degree distribution.
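The average degrees and giant-component percentages quoted for these networks follow from the protein, interaction and giant-component counts. A small sketch of the arithmetic (the dictionary keys are labels chosen here for readability):

```python
# (number of proteins, number of interactions, giant component size) as quoted in the lecture
networks = {
    "S. cerevisiae (Wagner)": (985, 899, 466),
    "E. coli (DIP)": (300, 287, 93),
    "MIPS": (4546, 12319, 4385),
}

def average_degree(n_nodes, n_edges):
    """Each edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes

def giant_percentage(n_nodes, giant_size):
    """Share of all proteins that lie in the giant component."""
    return 100 * giant_size / n_nodes

for name, (n, e, g) in networks.items():
    print(f"{name}: average degree {average_degree(n, e):.3f}, giant component {giant_percentage(n, g):.1f}%")
```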
Handling Multivariate data: Concept and types of metrics
Multivariate data format
Multivariate data example
Distances, metrics, dissimilarities and similarities are related concepts.
A metric is a function that satisfies the following properties:
A function that satisfies only conditions (i)-(iii) is referred to as a distance.
Source: Bioinformatics and Computational Biology Solutions Using R and
Bioconductor (Statistics for Biology and Health)
Robert Gentleman ,Vincent Carey ,Wolfgang Huber ,Rafael Irizarry ,Sandrine Dudoit
(Editors)
These measures consider the expression measurements as points in some
metric space.
Example:
Let,
X = (4, 6, 8)
Y = (5, 3, 9)
A widely used function for finding similarity is correlation.
Correlation gives a measure of linear association between variables and ranges between -1 and +1.
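For the example vectors X and Y, a few common measures can be computed as below. The choice of Euclidean distance, Manhattan distance and Pearson correlation is an assumption of this sketch; the lecture slide may have listed other measures.

```python
from math import sqrt

X = (4, 6, 8)
Y = (5, 3, 9)

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient, between -1 and +1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

print(euclidean(X, Y), manhattan(X, Y), pearson(X, Y))
```

Here the Euclidean distance is the square root of 1 + 9 + 1 = 11 and the Manhattan distance is 1 + 3 + 1 = 5.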
Statistical distance between points
The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or estimated.
The Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to be part of the same cluster while Q and O do not.
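A two-dimensional sketch shows how the Mahalanobis distance can reverse the ordering given by the Euclidean distance. The covariance matrix and the two points are hypothetical values chosen for illustration.

```python
from math import sqrt

def mahalanobis_2d(x, y, cov):
    """sqrt((x - y)^T S^{-1} (x - y)) for a 2 x 2 variance-covariance matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    return sqrt(dx * (inv[0][0] * dx + inv[0][1] * dy)
                + dy * (inv[1][0] * dx + inv[1][1] * dy))

# Hypothetical data cloud stretched along the diagonal (strong positive correlation).
cov = ((1.0, 0.9), (0.9, 1.0))
origin = (0.0, 0.0)
along = (2.0, 2.0)     # far from the origin, but along the direction of spread
across = (1.0, -1.0)   # close to the origin, but against the direction of spread

print(mahalanobis_2d(along, origin, cov), mahalanobis_2d(across, origin, cov))
```

With an identity covariance matrix the function reduces to the ordinary Euclidean distance.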
Distances between distributions
Different from the previous approach (i.e. considering expression measurements as points in some metric space), the data for each feature can be considered as an independent sample from a population.
Therefore the data reflect the underlying population, and we need to measure similarities between two densities/distributions.
Kullback-Leibler Information: KLI measures how much the shape of one distribution resembles the other.
Mutual information: MI is large when the joint distribution is quite different from the product of the marginals.
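For discrete distributions both quantities can be sketched in a few lines. The formulas used are the standard discrete definitions with natural logarithms, and the example distributions are hypothetical; the lecture's own formulas were on a slide image and may be written for continuous densities.

```python
from math import log

def kl_information(p, q):
    """Kullback-Leibler information sum_i p_i * log(p_i / q_i); zero when p equals q."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """KL information between a joint distribution and the product of its marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # the two variables always agree
print(mutual_information(independent), mutual_information(dependent))
```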
Hierarchical clustering
Data is not always available as binary relations, as in the case of protein-protein interactions, where we can directly apply network clustering algorithms.
[Figure: list of interactions among AtpA, AtpB, AtpE, AtpG and AtpH]
In many cases, for example in microarray gene expression analysis, the data is of multivariate type.
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
We can convert multivariate data into networks and apply network clustering algorithms, which we will discuss in a later class.
If the dimension of the multivariate data is 3 or less, we can cluster them by plotting directly.
Some data reveal good cluster structure when plotted but some data do not.
[Figure: data plotted in 2 dimensions]
However, when the dimension is more than 3, we can apply hierarchical clustering to multivariate data.
In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place.
Hierarchical clustering is a technique that organizes elements into a tree.
A tree is a connected graph that has no cycle.
A tree with n nodes has exactly n-1 edges.
[Figure: a graph and a tree]
Hierarchical Clustering is subdivided into 2 types:
1. agglomerative methods, which proceed by a series of fusions of the n objects into groups,
2. divisive methods, which separate the n objects successively into finer groupings.
Agglomerative techniques are more commonly used.
The data can be viewed at any level between a single cluster containing all objects and n clusters each containing a single object.
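Distance measurements
The Euclidean distance between points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn), in Euclidean n-space, is defined as:
$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Euclidean distance between g1 and g2:
$\sqrt{(10 - 10)^2 + (8 - 0)^2 + (10 - 9)^2} = \sqrt{0 + 64 + 1} = 8.0622$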
Instead of Euclidean distance, correlation can also be used as a distance measurement.
For biological analysis involving genes and proteins, nucleotide and/or amino acid sequence similarity can also be used as the distance between objects.
Hierarchical Clustering
•An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists of a single group containing all n cases.
•At each particular stage the method joins together the two
clusters which are closest together (most similar). (At the first
stage, of course, this amounts to joining together the two
objects that are closest together, since at the initial stage each
cluster has one object.)
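The agglomerative procedure can be sketched as a loop that repeatedly merges the two closest clusters. This is a naive illustration rather than an efficient implementation: the function name, the example points, and the use of `min` (closest pair) as the default cluster-to-cluster distance are assumptions.

```python
from math import dist

def agglomerate(points, linkage=min):
    """Merge the two closest clusters until one remains.
    Returns the merges in order as (cluster_a, cluster_b, distance), using point indices."""
    clusters = [[i] for i in range(len(points))]   # P_n: n single-object clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical 2-D objects.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
for left, right, d in agglomerate(points):
    print(left, "+", right, "at distance", round(d, 3))
```

Passing `linkage=max` instead of the default changes only how the distance between two clusters is measured, not the merging loop itself.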
Hierarchical Clustering
An Introduction to Bioinformatics Algorithms by Jones & Pevzner
Differences between methods arise because of the
different ways of defining distance (or similarity)
between clusters.
Hierarchical Clustering
How can we measure distances between clusters?
Single linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Min { d(i,j) : object i is in cluster A and object j is in cluster B }
Complete linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Max { d(i,j) : object i is in cluster A and object j is in cluster B }
Average linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = T_AB / (N_A * N_B)
where T_AB is the sum of all pairwise distances between objects of cluster A and cluster B, and N_A and N_B are the sizes of clusters A and B respectively (N_A * N_B edges in total).
Average group linkage clustering
Distance between two clusters A and B, D(A,B), is computed as
D(A,B) = Average { d(i,j) : observations i and j are in cluster t, the cluster formed by merging clusters A and B } (n(n-1)/2 edges in total, where n is the size of cluster t)
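The four cluster-to-cluster distances can be written as small functions and compared on the same pair of clusters. The function names and the example clusters A and B are hypothetical; Euclidean distance is assumed for d(i,j).

```python
from math import dist
from itertools import combinations

def single_linkage(A, B):
    """Smallest distance between an object of A and an object of B."""
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B):
    """Largest distance between an object of A and an object of B."""
    return max(dist(a, b) for a in A for b in B)

def average_linkage(A, B):
    """Sum of the N_A * N_B between-cluster distances divided by N_A * N_B."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def average_group_linkage(A, B):
    """Average over all n(n-1)/2 pairs inside the merged cluster."""
    merged = list(A) + list(B)
    pairs = list(combinations(merged, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clusters.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0)]
print(single_linkage(A, B), complete_linkage(A, B),
      average_linkage(A, B), average_group_linkage(A, B))
```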
Hierarchical Clustering
Alizadeh et al.
Nature 403: 503-511
(2000).
Classifying bacteria
based on 16s rRNA
sequences.
Self organizing Maps
Time-series Data
[Figure: growth curve on a log scale (0.01 to 10) versus time, sampled at stages 1, 2, ..., j, ..., T]
Expression profiles
When we measure time-series microarrays, the gene expression profile is represented by a matrix:

         Stage 1   Stage 2   ...   Stage j   ...   Stage T
Gene1    x11       x12       ...   x1j       ...   x1T
Gene2    x21       x22       ...   x2j       ...   x2T
...      ...       ...       ...   ...       ...   ...
Genei    xi1       xi2       ...   xij       ...   xiT
...      ...       ...       ...   ...       ...   ...
GeneD    xD1       xD2       ...   xDj       ...   xDT

Row i of the matrix is the expression profile xi of Gene i.
T, # of time-series microarray experiments
D, # of genes in a microarray
Comparing rows gives the expression similarity between genes; comparing columns gives the stage similarity, from which states and state transitions are identified.
SOM makes it possible to examine gene similarity and stage similarity simultaneously.
Multivariate Analysis
SOM : expression similarity of
genes and stage similarity
simultaneously.
BL-SOM is available at
http://kanaya.aist-nara.ac.jp/SOM/
SOM was developed by Prof. Teuvo Kohonen in the early 1980s.
Multi-dimensional data/input vectors are mapped onto a two-dimensional array of nodes.
In the original SOM, the output depends on the input order of the vectors.
To remove this problem, Prof. Kanaya developed BL-SOM:
[1] Initial model vectors are determined based on PCA of the data.
[2] The learning process of BL-SOM makes the output independent of the order of the input vectors.
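The order dependence of the original (online) SOM can be illustrated with a minimal sketch. This is a generic Kohonen-style update on a one-dimensional lattice with a Gaussian neighborhood, not the BL-SOM procedure; the function names, learning rate `alpha`, neighborhood width `sigma` and the toy vectors are all assumptions.

```python
from math import dist, exp

def som_step(weights, x, alpha, sigma):
    """One online update: find the best-matching unit (BMU) and pull every lattice node
    toward x, with nodes closer to the BMU on the lattice pulled more strongly."""
    bmu = min(range(len(weights)), key=lambda c: dist(weights[c], x))
    updated = []
    for c, w in enumerate(weights):
        h = exp(-((c - bmu) ** 2) / (2 * sigma ** 2))   # Gaussian neighborhood on the lattice
        updated.append(tuple(wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)))
    return bmu, updated

def train(weights, inputs, alpha=0.5, sigma=1.0):
    """Present the input vectors one at a time, in the given order."""
    for x in inputs:
        _, weights = som_step(weights, x, alpha, sigma)
    return weights

start = [(0.0,), (1.0,)]            # two lattice nodes with 1-D model vectors
a, b = (0.0,), (1.0,)               # two input vectors
print(train(start, [a, b]))
print(train(start, [b, a]))         # same inputs, different order, different result
```

A batch scheme that updates the model vectors only after all inputs have been assigned avoids this dependence, which is the motivation stated above for BL-SOM.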
SOM Algorithm
Source: “Clustering Challenges in Biological Networks” edited by S. Butenko et. al.
Self-organizing Mapping (Summary)
[1] Detection method for transition points in metabolite quantity based on batch-learning (BL-SOM)
[2] Diversity of metabolites in species: Species-metabolite relation Database
[Figure: Gene i plotted as the point (xi1, xi2, ..., xiT) in the expression space with axes X1, X2, ..., XT; each row of the gene expression matrix is one such point]
T, different time-series microarray experiments
Self-organizing Mapping (Summary)
Arrangement of lattice points in multi-dimensional expression space
[Figure: Gene i plotted as the point (xi1, xi2, ..., xiT) among lattice points in the space with axes X1, X2, ..., XT]
Lattice points are optimized to reflect the data distribution.
Gene Classification
Genes are classified into the nearest lattice points.
Genes with similar expression profiles are clustered to identical or nearby lattice points.
Feature Mapping
In the k-th condition (k = 1, 2, ..., T), lattice points containing only highly (lowly) expressed genes are colored red (blue), e.g. red where Xk > Th.(k) and blue where Xk < -Th.(k).
[Figure: one feature map per time point: X1 (Time 1), X2 (Time 2), X3 (Time 3), ..., XT (Time T)]
The stages of the time-series data can be compared visually.
Non-linear projection of multi-dimensional expression profiles of genes.
Original dimension is conserved in individual lattice points.
Several types of information are stored in SOM.
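Gene classification and feature-map coloring can be sketched as follows, assuming the lattice vectors have already been trained. The lattice vectors, gene profiles, threshold and function names are hypothetical, and a single threshold is used for all conditions instead of a separate Th.(k).

```python
from math import dist

def classify(genes, lattice):
    """Assign each gene to the index of its nearest lattice point."""
    return {name: min(range(len(lattice)), key=lambda c: dist(profile, lattice[c]))
            for name, profile in genes.items()}

def feature_map(genes, assignment, n_lattice, k, threshold):
    """Color lattice points for condition k: red if all member genes exceed the threshold,
    blue if all are below its negative, otherwise no color."""
    colors = []
    for c in range(n_lattice):
        values = [genes[g][k] for g, cell in assignment.items() if cell == c]
        if values and all(v > threshold for v in values):
            colors.append("red")
        elif values and all(v < -threshold for v in values):
            colors.append("blue")
        else:
            colors.append("none")
    return colors

# Hypothetical trained lattice (3 points) and expression profiles over T = 2 conditions.
lattice = [(2.0, 2.0), (0.0, 0.0), (-2.0, 2.0)]
genes = {"g1": (1.8, 2.1), "g2": (2.2, 1.7), "g3": (0.1, -0.2), "g4": (-1.9, 2.3)}

assignment = classify(genes, lattice)
for k in range(2):
    print(f"condition {k + 1}:", feature_map(genes, assignment, len(lattice), k, threshold=1.0))
```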
Estimation of transition points; Bacillus subtilis (LB medium)
(Data: Kazuo Kobayashi, Naotake Ogasawara (NAIST))
[Figure: cell density (OD600) and log(Prob. Density) versus time (0 to 1000 min) in LB medium, with stages 1 to 8 marked and a color scale from low prob. to high prob.]
SOM for time-series expression profile
State transition point is observed between stages 3 and 4
Integrated analysis of gene expression profile and metabolite quantity data of Arabidopsis thaliana
(sulfur def./cont.; Data are provided by K. Saito, M. Hirai group (PSC))
Nakamura et al (2004)
[Figure: state transition and feature maps for genes and metabolites (m/z) in leaf and root; axis label: ppm (error rate)]
Lattice points with large differences between 12 and 24 h. Blue: decreased. Red: increased.
Accurate molecular weights
Candidate metabolites corresponding to accurate molecular weights
3. Species-metabolite relation Database
Download sites of BL-SOM
Riken: http://prime.psc.riken.jp/
NAIST: http://kanaya.naist.jp/SOM/
Application of BL-SOM to “-omics”
Genome
Kanaya et al., Gene, 276, 89-99 (2001)
Abe et al., Genome Res., 13, 693-702, (2003)
Abe et al., J.Earth Simulator, 6, 17-23, (2003)
Abe et al., DNA Res., 12, 281-290. (2005)
Transcriptome
Hasegawa et al., Plant Methods, 2:5:1-18 (2006)
Metabolome
Kim et al., J. Exp.Botany, 58, 415-424, (2007)
Fukusaki et al., J.Biosci.Bioeng., 100, 347-354, (2005)
Transcriptome and Metabolome
Hirai, M. Y., M. Klein, et al. J.Biol. Chem., 280, 25590-5 (2005)
Hirai, M. Y., M. Yano, et al. Proc Natl Acad Sci U S A 101, 10205-10 (2004)
Morioka, R, et al., BMC Bioinformatics, 8, 343, (2007)
Yano et al., J.Comput. Aided Chem.,7,125-136 (2007)
Summary of Bioinformatics Tools developed in our laboratory
http://kanaya.naist.jp/~skanaya/Web/JTop.html
All software and databases are freely accessible via the Web.
Metabolomics
-- MS data processing
Transcriptome and Metabolomics Profiling
-- estimation of transition points
Species-metabolite DB
Network analysis: PPI
Transcriptomics
-- Statistics, Profiling, …
Introduction to self organizing mapping software