CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics

Lecture 18 Network Biology: Comparison of
Networks Across Species
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
In the beginning there was DNA…
Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides, NC. The Genomes On Line Database
(GOLD) v.2: a monitor of genome projects worldwide. NAR 34, D332-334
…then came protein interactions
Arabidopsis
PPI network
E. Coli
PPI network
Yeast PPI network
Comparative Genomics to
Comparative Interactomics

Evolutionary conservation implies functional
relevance
◦ Sequence conservation implies functional
conservation
◦ Network conservation implies functional
conservation too!

What new insights might we gain from network
comparisons? (Why should we care?)
Network comparisons allow us to:
Identify conserved functional modules
 Query for a module, ala BLAST
 Predict functions of a module
 Predict protein functions
 Validate protein interactions
 Predict protein interactions

Only possible with
network comparisons
Possible with existing
techniques, but improved
with network comparisons
What is a Protein Interaction Network?
Proteins are nodes
 Interactions are edges
 Edges may have
weights

Yeast PPI network
H. Jeong et al. Lethality and centrality in protein networks. Nature 411, 41 (2001)
The Network Alignment Problem
Given k different protein interaction
networks belonging to different species,
we wish to find conserved subnetworks within these networks
 Conserved in terms of protein sequence
similarity (node similarity) and interaction
similarity (network topology similarity)

Example Network Alignment
Sharan and Ideker. Modeling cellular machinery through biological
network comparison. Nature Biotechnology 24, pp. 427-433, 2006
General Framework For Network
Alignment Algorithms
Network construction
Alignment
algorithm
Scoring function
Sharan and Ideker. Modeling cellular machinery through biological
network comparison. Nature Biotechnology 24, pp. 427-433, 2006
Building Co-expression Networks
Gene B
Gene A
Arrays
Gene C
Pearson Correlation
Expression
Genes
Gene A
Gene B
Gene C
1
.8
1
.8
-.7
-.6
1
=
-.7
-.6
Microarray data
Balaji S. Srinivasan
Two Algorithms
NetworkBLAST (covered today)
Sharan et al. Conserved patterns of
protein interaction in multiple
species. PNAS, 102(6):1974-1979, 2005.
 Græmlin
Flannick et al. Græmlin: General and
robust alignment of multiple large
interaction networks. Genome Res 16:
1169-1181, 2006.

Overview of
Sharan et al. Conserved patterns of protein interaction in multiple
species. PNAS, 102(6):1974-1979, 2005.
Estimation of Interaction Probabilities

In the preprocessing step, edges in the
network are given a reliability score using a
logistic regression model based on three
features:
1.
2.
3.
Number of times an interaction was observed
Pearson correlation coefficient between
expression profiles
Proteins’ small world clustering coefficient
Network Alignment Graphs




Construct a Network Alignment Graph to represent
the alignment
Nodes contain groups of sequence similar proteins
from the k organisms
Edges represent conserved interactions.
An edge between two nodes is present if:
1.
2.
3.

One pair of proteins directly interacts, the rest are distance
at most 2 away
All protein pairs are of distance exactly 2
At least max(2, k – 1) protein pairs directly interact
Tries to account for interaction deletions
Example Network Alignment Graph
b
a
b’’
Species X
Species Y
Species Z
b’
a’’
b’
b
a’
a
a’
a’’
Nodes
c
c’’
c
c’
Network alignment graph
b’’
c’
c’’
Individual species’ PPI network
Scoring Function
Sharan et al. devise a scoring scheme
based on a likelihood model for the fit of
a single sub-network to the given
structure
 High scoring subgraphs correspond to
structured sub-networks (cliques or
pathways)
 Only network topology is scored, node
similarity is not

Log Likelihood Ratio Model

Measures the likelihood that a subgraph occurs
if it is a conserved network vs. that if it were a
randomly constructed network
Pr(Subgraph occurs | Conserved Network)
log
Pr(Subgraph occurs | Random Network)

Randomly constructed network preserves
degree distribution for nodes
Log Likelihood Ratio Model

(i) in a real subnetwork,each
interaction should be present
independently with high probability,and
(ii) in a random subnetwork, the
probability of an interaction between
any two proteins depends on their total
number of connections in the network.
Likelihood Ratio Scoring of a Protein
Complex in a Single Species
Probability of complex being observed in a conserved network model
Probability of subgraph being observed in a random network model
U : a subset of vertices (proteins) in the PPI graph
OU : collection of all observations on vertex pairs in U
Ouv : interaction between proteins u, v observed
Ms : conserved network model
Mn: random network (null) model
Tuv : proteins u, v interact
Fuv : proteins u, v do not interact
β : probability that proteins u, v interact in conserved model
puv : probability that edge u, v exists in a random model
Likelihood Ratio Scoring of a Protein
Complex in a Single Species

Hence, log likelihood for a complex
occurring in a single species is given by

For multiple complexes across different
species, it is the sum of the log likelihoods
L(A, B, C) = L(A) + L(B) + L(C)
Example of Complex Scoring
b
a
b’’
Complex X1
in Species X
Complex Y1
in Species Y
Complex Z1
in Species Z
b’
a’’
b’
b
a’
a
a’
a’’
Nodes
c
b’’
c’
c’’
Individual species’ PPI network
c’’
c
c’
Conserved complex A in the
Network alignment graph
L(A) = L(X1) + L(Y1) + L (Z1)
Alignment algorithm
Problem of identifying conserved subnetworks reduces to finding high scoring
subgraphs
 NP-complete problem
 Heuristic solution:

◦ Greedy extension of high scoring seeds
◦ (Does this sound familiar? BLAST?)
◦ Common to both papers discussed
Alignment algorithm
1.
Find seeds for each node v in the
alignment graph
a. Find high scoring paths of 4 nodes by
exhaustive search
b. Greedily add 3 other nodes one by one,
that maximally increase the score of the
seed
Alignment algorithm
2.
Iteratively add or remove nodes to
increase the overall score of the node
 Original seeds are preserved
 Limit size of discovered subgraphs to 15
nodes
 Record up to 4 highest scoring subgraphs
discovered around each node
Alignment algorithm
3.
Filter subgraphs with a high degree of
overlap
 Iteratively find high scoring subgraph and
remove all highly overlapping ones
remaining
Results
Conserved network regions
within yeast (orange ovals), fly
(green rectangles) and worm
(blue hexagons) PPI
networks.
Results
Prediction of protein function
• ‘Guilt by association’
• If a conserved cluster or path is
significantly enriched in a functional
annotation
Prediction of protein interactions
Predictions based on 2 strategies:
•
Evidence that proteins with
similar sequences interact
•
Co-occurrence of proteins in the
same conserved cluster or path
•
Experimental verification of
Yeast interactions using Y2H
yielded 40-62% success rate
Overview of

Fast, scalable, network alignment
◦ Scales linearly in number of networks
compared
◦ NetworkBLAST scales exponentially
Supports efficient querying of modules
 Speed-sensitivity control via user defined
parameter

◦ Not supported in NetworkBLAST
Input to the Algorithm

Weighted protein interaction graphs
◦ Weights represent probability that proteins
interact
◦ Constructed via network integration
algorithm

A phylogenetic tree relating the species in
the desired alignment
◦ Used for progressive alignment
Key Ideas of Graimin
Generating An Initial Alignment From The
Seed
 Greedy Seed Extension Phase
 Progressive alignment technique using the
phylogenetic tree

Results
Functional module
identification using
network alignment
Results
Multiple alignment of 10
networks showing possible cell
division module
Functional annotation
using network alignment
The Future of Network Comparison
Græmlin?
Græmlin
Sharan and Ideker. Modeling cellular machinery through biological
network comparison. Nature Biotechnology 24, pp. 427-433, 2006
Summary
The problem: Network
comparison/comparative interactomes
 NetworkBlast algorithm
 Brief introduction fo
 The analogy between sequence
comparison and network comparison

Reference & Acknowledgements

Chuan Sheng Foo

Sharan et al. Conserved patterns of protein interaction
in multiple species. PNAS. February 8, 2005 | vol. 102 |
no. 6 | 1974-1979

CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

Directory