
Predicting Protein Function
Annotation using Protein-Protein Interaction Networks
By Tamar Eldad
Advisor: Dr. Yanay Ofran
89-385 Computational Biology - Projects Workshop
Bar-Ilan University, the Mina and Everard Goodman Faculty of Life Sciences
1
Protein Function Prediction
• Exponential increase in the number of proteins being identified
by genome sequencing projects
• Impossible to perform a functional assay for every uncharacterized
gene
• Turn to sophisticated computational methods for assistance in
annotating the huge volume of sequence and structure data being
produced:
  • homology-based annotation transfer
  • sequence patterns
  • structure similarity
  • structure patterns
  • genomic context
  • microarray data
2
What is Function?
• Biological function has more than one aspect
• Sub-cellular to whole-organism context
• Physiological aspect
• Phenotype
The need for a well-defined
vocabulary
3
Protein Sequence:
Protein Structure:
4
The Gene Ontology
The Gene Ontology project is a major bioinformatics initiative
with the aim of standardizing the representation of gene and
gene product attributes across species and databases.
The project provides a controlled vocabulary of terms for
describing gene product characteristics and gene product
annotation data.
6
The Gene Ontology
• Cellular component
• Molecular function
• Biological process
• DAG (directed acyclic graph; each term has 1…N parent nodes)
• General → Specific
• Terms are assigned to gene products
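Because a term in a DAG can have several parents, collecting everything more general than a term means following all parent links rather than a single chain. The following is a minimal sketch of that idea, assuming a toy ontology with made-up term names (they are not real GO identifiers) and hypothetical helper names.

```python
# Toy ontology: each term maps to its parent terms.
# All names are hypothetical and only illustrate the DAG shape.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "dna_cutting": ["dna_binding", "catalysis"],  # two parents: a DAG, not a tree
}


def ancestors(term, parents=PARENTS):
    """Return every more-general term reachable from `term` via parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotate(term, parents=PARENTS):
    """A product annotated with a specific term also carries all of its ancestors."""
    return {term} | ancestors(term, parents)
```

Here `annotate("dna_cutting")` returns the term itself plus both of its parent lineages, which is the General → Specific direction read backwards.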
7
The Gene Ontology
8
A New Approach
• Classical Biology – collect a set of features for each protein
• Systems Biology – study protein function in the context of a network
Assemblies represent more than the
sum of their parts
9
Protein Interactions

• Data on thousands of interactions in humans and most model
species have become available through:
  • mass spectrometry
  • genome-wide chromatin immunoprecipitation
  • yeast two-hybrid assays
  • combinatorial reverse genetic screens
  • rapid literature mining techniques
10
PPI Networks
• Data are represented as networks, with nodes representing
proteins and edges representing the detected PPIs.
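One simple way to hold such a network in memory is an adjacency map from each protein to the set of its interaction partners. This is only a sketch: the function name and the toy interactions below are assumptions for illustration, not data from the project.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)  # PPIs are undirected, so store both directions
    return dict(neighbors)


# Hypothetical toy interactions between proteins numbered 1..4.
toy_ppis = [(1, 2), (1, 4), (2, 1)]
network = build_network(toy_ppis)
```

Using sets makes duplicate reports of the same interaction harmless, and checking whether two proteins interact is a single membership test.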
11
Existing Methods
• Alignment – align sequence-similar proteins between species and check
whether their interactions are conserved as well; this can reveal
pathways conserved between species
• Integration – combine data from different types of networks (i.e. protein,
genetic, and transcriptional interaction networks) to get a better picture
of the whole biological system
• Querying – search for sub-networks similar to known functional units (by
comparing the interactions and the proteins themselves); these are likely
to be functional units too
12
New Method
Conserved network motifs between two species convey evidence for
functional similarity of the individual proteins that make up these motifs
[Figure: HUMAN and YEAST network motifs; labels: 1e-09, 5e-15, 8e-13, 2e-10]
13
New Method
What do we need?
1. list of proteins in human cell
2. list of proteins in yeast cell
3. interactions in each cell
4. sequence similarity grades
5. known GO annotations
6. function distance calculation
14
Protein Lists - UniProt DB
15
Interaction Databases
HPRD – Human Protein Reference Database
DIP – Database of Interacting Proteins
MIPS – Munich Information Center for Protein Sequences
IntAct – molecular interaction database
An interaction is considered reliable if it meets one of these conditions:
1. It was observed in at least 2 different experiments
OR
2. It was reported in 3 different articles
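The reliability rule above can be written as a small filter. This is a sketch under assumptions: the record fields (`pair`, `experiments`, `articles`) are hypothetical, and "3 different articles" is read here as "at least 3".

```python
def is_reliable(n_experiments, n_articles, min_experiments=2, min_articles=3):
    """Apply the two-condition rule: enough experiments OR enough articles."""
    return n_experiments >= min_experiments or n_articles >= min_articles


# Hypothetical interaction records merged from several databases.
records = [
    {"pair": ("A", "B"), "experiments": 2, "articles": 1},  # passes on experiments
    {"pair": ("A", "C"), "experiments": 1, "articles": 3},  # passes on articles
    {"pair": ("B", "C"), "experiments": 1, "articles": 2},  # fails both
]

reliable_pairs = [
    r["pair"] for r in records if is_reliable(r["experiments"], r["articles"])
]
```

Keeping the thresholds as parameters makes it easy to tighten or relax the rule when the network turns out too sparse or too noisy.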
16
Sequence Similarity Grades
BLAST - bl2seq
Pairwise E-values (YEAST vs. HUMAN proteins; "-" and "X" as on the slide):
        1       2       3       4
1       -       0.008   3e-18   X
2       10      -       0.02    3.6
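Pairwise E-values like these can be turned into TRUE/FALSE similarity calls by comparing each one to a cutoff. The sketch below assumes hypothetical protein IDs and values, treats a missing alignment as `None`, and uses `<=` for the comparison; the cutoff of 5e-05 is the one named on the Results slide, passed in as a parameter.

```python
def similarity_matrix(evalues, cutoff):
    """Turn pairwise E-values into TRUE/FALSE similarity calls.

    `evalues` maps (yeast_id, human_id) to an E-value, or to None when
    no alignment was reported (an assumed convention for this sketch).
    """
    return {pair: (e is not None and e <= cutoff) for pair, e in evalues.items()}


# Hypothetical IDs and E-values, for illustration only.
toy_evalues = {
    ("y1", "h3"): 3e-18,
    ("y1", "h4"): None,
    ("y2", "h1"): 10.0,
    ("y2", "h3"): 0.02,
}
similar = similarity_matrix(toy_evalues, cutoff=5e-05)
```

Only the pair with the very small E-value is called similar; everything above the cutoff, and every pair without an alignment, is treated as non-similar.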
17
GO Annotations – UniProt DB
18
Evidence Codes
19
Function Distance Calculation
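The definition used on this slide is not preserved in the transcript. One common choice, shown here purely as an assumption, is the number of edges on the shortest path between two terms when the parent links of the ontology are treated as undirected. The term names and the `term_distance` helper are hypothetical.

```python
from collections import deque

# Toy ontology (hypothetical names): each term maps to its parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "dna_cutting": ["dna_binding", "catalysis"],
}


def term_distance(a, b, parents=PARENTS):
    """Shortest number of edges between two terms, ignoring edge direction."""
    # Build an undirected view of the child -> parents map.
    links = {term: set(ps) for term, ps in parents.items()}
    for child, ps in parents.items():
        for parent in ps:
            links.setdefault(parent, set()).add(child)
    # Breadth-first search guarantees the first hit is the shortest path.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in links.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected
```

Under this reading, identical terms have distance 0 and the distance grows as two annotations sit further apart in the graph; other measures (for example, depth of the deepest shared ancestor) would be equally legitimate choices.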
20
Implementation
1. Prepare a similarity matrix for the cutoff E-value
2. Find all connected sub-graphs of size N – 1 (DFS)
3. Compare the sub-graphs found, using the similarity matrix
4. Add an N-th, non-similar component to each pair of matching sub-graphs
5. Get the GO function annotations of the N-th components
6. Calculate the average distance between the N-th components’ functions
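Steps 2 to 4 above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the networks and similarity calls are toy data, all function names are hypothetical, and two sub-graphs are considered matching when their proteins can be paired one-to-one as sequence-similar (a stricter version would also require the interactions between paired proteins to be preserved). Steps 5 and 6 need GO annotations and a distance measure and are left out.

```python
from itertools import permutations


def connected_subgraphs(network, size):
    """Enumerate all connected node sets of the given size by depth-first growth."""
    found = set()

    def grow(nodes, frontier):
        if len(nodes) == size:
            found.add(frozenset(nodes))
            return
        for candidate in frontier:
            new_nodes = nodes | {candidate}
            new_frontier = (frontier | network[candidate]) - new_nodes
            grow(new_nodes, new_frontier)

    for start in network:
        grow({start}, set(network[start]))
    return found


def match(sub_a, sub_b, similar):
    """Return a one-to-one pairing of similar proteins, or None if there is none."""
    a_nodes = sorted(sub_a)
    for perm in permutations(sorted(sub_b)):
        if all(similar.get((a, b), False) for a, b in zip(a_nodes, perm)):
            return list(zip(a_nodes, perm))
    return None


def extend(sub_a, sub_b, net_a, net_b, similar):
    """Yield neighbor pairs (x, y) of the two sub-graphs that are NOT similar."""
    border_a = set().union(*(net_a[n] for n in sub_a)) - sub_a
    border_b = set().union(*(net_b[n] for n in sub_b)) - sub_b
    for x in sorted(border_a):
        for y in sorted(border_b):
            if not similar.get((x, y), False):
                yield x, y


def candidate_pairs(net_a, net_b, similar, n):
    """Pair up the N-th proteins attached to matching sub-graphs of size N - 1."""
    subs_b = connected_subgraphs(net_b, n - 1)
    results = []
    for sub_a in connected_subgraphs(net_a, n - 1):
        for sub_b in subs_b:
            if match(sub_a, sub_b, similar) is not None:
                results.extend(extend(sub_a, sub_b, net_a, net_b, similar))
    return results


# Toy networks: two chains of three proteins each (hypothetical IDs).
yeast = {"y1": {"y2"}, "y2": {"y1", "y3"}, "y3": {"y2"}}
human = {"h1": {"h2"}, "h2": {"h1", "h3"}, "h3": {"h2"}}
similar = {("y1", "h1"): True, ("y2", "h2"): True}

pairs = candidate_pairs(yeast, human, similar, n=3)
```

In the toy data, the sub-graphs {y1, y2} and {h1, h2} match, so their remaining neighbors y3 and h3, which share no sequence similarity, become a candidate pair for function comparison. The enumeration is exponential in N, which is acceptable only for small sub-graph sizes.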
21
Quality Assurance
1. Compare to random-pair annotation
   • No sequence similarity
2. Compare to sequence-similar annotation
   • BLAST
   • Only proteins under the cut-off value
   • Human genes only
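The random-pair control can be sketched as drawing yeast-human pairs that are not sequence-similar and averaging their function distance, so that the number can be set against the average obtained for the predicted pairs. Everything here is an assumption for illustration: the helper names, the toy IDs, and the `distance` callable, which stands in for whatever function-distance measure is used.

```python
import random
from statistics import mean


def random_pairs(yeast_ids, human_ids, similar, n_pairs, seed=0):
    """Draw random (yeast, human) pairs, rejecting sequence-similar ones.

    Assumes at least one non-similar pair exists, otherwise this loops forever.
    """
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    pairs = []
    while len(pairs) < n_pairs:
        y, h = rng.choice(yeast_ids), rng.choice(human_ids)
        if not similar.get((y, h), False):
            pairs.append((y, h))
    return pairs


def mean_distance(pairs, distance):
    """Average function distance over a list of protein pairs."""
    return mean(distance(y, h) for y, h in pairs)


# Hypothetical toy inputs.
yeast_ids = ["y1", "y2"]
human_ids = ["h1", "h2"]
similar = {("y1", "h1"): True}
control = random_pairs(yeast_ids, human_ids, similar, n_pairs=50)
```

A method that carries real signal should give a clearly lower `mean_distance` for its predicted pairs than for `control`.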
22
Detailed Results
graph1                 new comp  go func     graph2                 new comp  go func     term type          dist  Eval average
,4814,4256,591,1584,   Q12495    GO:0005515  ,4253,1335,2447,2353,  Q9UHD2    GO:0005515  MolecularFunction  4     0.079
,4814,4256,591,1584,   Q12495    GO:0030528  ,4253,1335,2447,2353,  Q9UHD2    GO:0030528  MolecularFunction  3     0.079
,4814,4256,591,1584,   Q12495    GO:0006334  ,4253,1335,2447,2353,  Q9UHD2    GO:0006334  BiologicalProcess  0     0.079
,4814,4256,591,1584,   Q12495    GO:0005515  ,4253,1335,2447,2353,  O15111    GO:0005515  MolecularFunction  1     0.079
,4814,4256,591,1584,   Q12495    GO:0005515  ,4253,1335,2447,2353,  O15111    GO:0005515  MolecularFunction  12    0.079
,4819,2,236,234,       P16649    GO:0016584  ,4354,2303,2890,3693,  P55060    GO:0016584  BiologicalProcess  1     0.062
,4819,2,236,234,       P16649    GO:0016565  ,4354,2303,2890,3693,  Q96KB5    GO:0016565  MolecularFunction  1     0.062
,4819,2,236,234,       P16649    GO:0016584  ,4354,2303,2890,3693,  Q15699    GO:0016584  BiologicalProcess  8     0.062
,4819,2,236,234,       P16649    GO:0016584  ,4354,2303,2890,3693,  Q15699    GO:0016584  BiologicalProcess  5     0.062
,4867,2966,168,1224,   P13393    GO:0000120  ,4387,1383,1452,2289,  P63279    GO:0000120  CellularComponent  4     0.041
,4867,2966,168,1224,   P13393    GO:0000120  ,4387,1383,1452,2289,  P63279    GO:0000120  CellularComponent  3     0.041
,4867,2966,168,1224,   P13393    GO:0000126  ,4387,1383,1452,2289,  P63279    GO:0000126  CellularComponent  7     0.041
23
Results
E-value 5e-05
24
Play with Parameters
• Change graph size
• Lower the E-value cutoff
• Start with a larger number of connected components
• Use only graphs with higher connectivity
• Non-similar proteins can be any protein in the graph
• Different network topology
• Limit number of paired proteins
25
Results
26
Conclusions
• Most results are no better than random
• Significant improvement only for Biological Process
prediction
• Still far behind homology-based transfer
27
Summary
• Functional annotation is one of the greatest challenges in
the post-genomic era
• PPI data for functional annotation as a new approach for
advancing this field
• The method tried was unsuccessful
• Other ideas:
  • Find a more specific search pattern
  • Start from the best results – what distinguishes them?
28
Thanks
• Advisor – Dr. Yanay Ofran
• Guys at the lab – Rotem, Vered, Sivan
• Roi Adadi & Omer Erel
30
[Figure panels: Alignment, Querying, Integration]
Similarity Matrix
E-value = 0.0005
YEAST / HUMAN
        1             2               3               4
1       -             0.008 (TRUE)    3e-18 (TRUE)    X (FALSE)
2       10 (FALSE)    - (FALSE)       0.02 (FALSE)    3.6
Neighboring matrix
HUMAN CELL INTERACTIONS
        1       2       3       4
1       -       TRUE    FALSE   TRUE
2       TRUE    -       FALSE   FALSE