
Predicting Protein Function Using Machine-Learned Hierarchical Classifiers

Roman Eisner
Supervisors: Duane Szafron and Paul Lu
09 / 23 / 2005
[email protected]
Outline

- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
Proteins

- Functional units in the cell
- Perform a variety of functions
  - e.g. catalysis of reactions, structural and mechanical roles, transport of other molecules
- Can take years to study a single protein
  - Any good leads would be helpful!
Protein Function Prediction and Protein Function Determination

- Prediction: an estimate of what function a protein performs
- Determination: work in a laboratory to observe and discover what function a protein performs
- Prediction complements determination
Proteins

- Chain of amino acids
  - 20 amino acids
- FastA format:

>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
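For concreteness, a minimal sketch (illustrative, not part of the thesis) of parsing a record in this format:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FastA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

record = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI"""

pairs = list(read_fasta(record.splitlines()))
print(pairs[0][0])        # header without the ">"
print(len(pairs[0][1]))   # length of the joined sequence
```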
Ontologies

- Standardized vocabularies (a common language)
- In biological literature, different terms can be used to describe the same function
  - e.g. “peroxiredoxin activity” and “thioredoxin peroxidase activity”
- Can be structured in a hierarchy to show relationships
Gene Ontology

- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotations:
  - Molecular Function
  - Biological Process
  - Cellular Component
Hierarchical Ontologies

- Can help to represent a large number of classes
- Represent general and specific data
- Some data is incomplete – could become more specific in the future
Incomplete Annotations
Goal

- To predict the function of proteins given their sequence
Data Set

- Protein sequences
  - UniProt database
- Ontology
  - Gene Ontology Molecular Function aspect
  - Pruned ontology: 406 nodes (out of 7,399) with ≥ 20 proteins
- Experimental annotations
  - Gene Ontology Annotation project @ EBI
- Final data set: 14,362 proteins
Predictors

- Global:
  - BLAST NN
- Local:
  - PA-SVM
  - PFAM-SVM
  - Probabilistic Suffix Trees
Why Linear SVMs?

- Accurate
- Explainable
  - Each term in the dot product is meaningful
PA-SVM
Proteome Analyst
PFAM-SVM
Hidden Markov Models
PST

- Probabilistic Suffix Trees
  - Efficient Markov chains
- Model the protein sequences directly
- Prediction
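The slide's equations did not survive extraction. As a rough stand-in for the idea, a fixed-order Markov chain (a PST generalizes this to variable-length contexts) can score how well a sequence fits a protein family; all names here are illustrative:

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_markov(sequences, order=2):
    """Count (context -> next residue) transitions with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + len(AMINO_ACIDS)
        model[ctx] = {a: (nxt.get(a, 0) + 1) / total for a in AMINO_ACIDS}
    return model

def log_likelihood(model, seq, order=2):
    """Average log-probability of seq under the model (higher = better fit)."""
    ll, n = 0.0, 0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs[seq[i]] if probs else 1.0 / len(AMINO_ACIDS)
        ll += math.log(p)
        n += 1
    return ll / max(n, 1)

# Train one model per family; classify by whichever scores higher.
family_model = train_markov(["ACACACACAC", "ACACACAC"])
```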
BLAST

- Protein sequence alignment for a query protein against any set of protein sequences
Evaluating Predictions in a Hierarchy

- Not all errors are equivalent
  - An error to a sibling is different from an error to an unrelated part of the hierarchy
- Proteins can perform more than one function
  - Need to combine predictions of multiple functions into a single measure
Evaluating Predictions in a Hierarchy

- Semantics of the hierarchy – the True Path Rule
- Protein labelled with: {T} -> {T, A1, A2}
- Predicted functions: {S} -> {S, A1, A2}
- Precision = 2/3 = 67%
- Recall = 2/3 = 67%
Evaluating Predictions in a Hierarchy

- Protein labelled with: {T} -> {T, A1, A2}
- Predicted: {C1} -> {C1, T, A1, A2}
- Precision = 3/4 = 75%
- Recall = 3/3 = 100%
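These worked examples can be reproduced mechanically: expand both label sets with their ancestors (True Path Rule), then compare. The toy hierarchy below is an assumption matching the slide's symbols, not taken from the thesis:

```python
def expand(nodes, parents):
    """Close a set of nodes under the ancestor relation (True Path Rule)."""
    closed, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in closed:
            closed.add(n)
            stack.extend(parents.get(n, ()))
    return closed

def hierarchical_pr(true, pred, parents):
    """Precision and recall over the ancestor-expanded label sets."""
    t, p = expand(true, parents), expand(pred, parents)
    overlap = len(t & p)
    return overlap / len(p), overlap / len(t)

# Assumed shape: S and T are siblings under A2 -> A1; C1 is a child of T.
parents = {"A2": ["A1"], "S": ["A2"], "T": ["A2"], "C1": ["T"]}
print(hierarchical_pr({"T"}, {"S"}, parents))   # precision 2/3, recall 2/3
print(hierarchical_pr({"T"}, {"C1"}, parents))  # precision 3/4, recall 3/3
```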
Supervised Learning
Cross-Validation

- Used to estimate performance of the classification system on future data
- 5-fold cross-validation
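A 5-fold split can be sketched as follows (illustrative, not the thesis code): each fold is held out once for testing while the rest trains the system.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) lists; every item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

proteins = list(range(10))
splits = list(k_fold_splits(proteins, k=5))
```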
Inclusive vs Exclusive Local Predictors

- In a system of local predictors, how should each local predictor behave?
- Two extremes:
  - A local predictor predicts positive only for those proteins that belong exactly at that node
  - A local predictor predicts positive for those proteins that belong at or below it in the hierarchy
- No a priori reason to choose either
Exclusive Local Predictors
Inclusive Local Predictors
Training Set Design

- Proteins in the current fold’s training set can be used in any way
- Need to select, for each local predictor:
  - Positive training examples
  - Negative training examples
Training Set Design

Scheme           Positive Examples     Negative Examples
Exclusive        T                     Not [T]
Less Exclusive   T                     Not [T ∪ Descendants(T)]
Less Inclusive   T ∪ Descendants(T)    Not [T ∪ Descendants(T)]
Inclusive        T ∪ Descendants(T)    Not [T ∪ Descendants(T) ∪ Ancestors(T)]
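The four schemes in the table can be expressed directly in code. A sketch, assuming labels[p] holds each protein's annotated nodes and the descendants/ancestors maps come from the ontology (all names hypothetical):

```python
def training_sets(scheme, node, proteins, labels, descendants, ancestors):
    """Select (positive, negative) training proteins for one local predictor."""
    at_or_below = {node} | descendants[node]
    # Positives: exactly at the node, or anywhere at/below it.
    if scheme in ("exclusive", "less_exclusive"):
        pos = {p for p in proteins if node in labels[p]}
    else:  # less_inclusive, inclusive
        pos = {p for p in proteins if labels[p] & at_or_below}
    # Negatives: everything outside the excluded region.
    if scheme == "exclusive":
        neg = {p for p in proteins if node not in labels[p]}
    elif scheme in ("less_exclusive", "less_inclusive"):
        neg = {p for p in proteins if not (labels[p] & at_or_below)}
    else:  # inclusive also excludes ancestors
        excluded = at_or_below | ancestors[node]
        neg = {p for p in proteins if not (labels[p] & excluded)}
    return pos, neg
```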
Comparing Training Set Design Schemes

Using PA-SVM:

Method           Precision   Recall   F1-Measure   Exceptions per Protein
Exclusive        75.8%       32.8%    45.8%        1.52
Less Exclusive   77.7%       40.4%    53.1%        1.74
Less Inclusive   77.3%       63.8%    69.9%        0.05
Inclusive        75.3%       65.2%    69.9%        0.09
Exclusive schemes produce more exceptions
Lowering the Cost of Local Predictors

- Top-Down search
  - Compute local predictors from top to bottom until a negative prediction is reached
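The top-down strategy can be sketched as follows, assuming each node has a trained local predictor (names illustrative): subtrees below a negative prediction are never computed, which is where the cost savings come from.

```python
def top_down_predict(root, children, predict):
    """Descend the hierarchy from the root; stop below negative predictions."""
    positives, frontier = [], [root]
    while frontier:
        node = frontier.pop()
        if predict(node):
            positives.append(node)
            frontier.extend(children.get(node, ()))
    return positives

# Toy tree: only the left branch is truly positive.
children = {"root": ["L", "R"], "L": ["LL"], "R": ["RR"]}
truth = {"root", "L", "LL"}
calls = []

def predict(node):
    calls.append(node)          # record how many local predictors run
    return node in truth

result = top_down_predict("root", children, predict)
```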
Top-Down Search

Method           Previous     Top-Down     Number of Local
                 F1-Measure   F1-Measure   Predictors Computed
Exclusive        45.8%        0.4%         10
Less Exclusive   53.1%        2.7%         10
Less Inclusive   69.9%        69.8%        32
Inclusive        69.9%        69.9%        32
Predictor Results

Predictor   Precision   Recall
PA-SVM      75.4%       64.8%
PFAM-SVM    74.0%       57.5%
PST         57.5%       63.6%
BLAST       76.7%       69.6%
Voting      76.3%       73.3%
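The slides do not spell out the voting rule; one plausible reading is a per-node majority vote over the four predictors, sketched here as an assumption rather than the thesis method:

```python
from collections import Counter

def majority_vote(predictions):
    """Keep a node if at least half of the predictors propose it.

    predictions: one set of predicted nodes per predictor.
    """
    votes = Counter(node for pred in predictions for node in pred)
    threshold = len(predictions) / 2
    return {node for node, count in votes.items() if count >= threshold}

# "kinase" gets 3 of 4 votes and survives; "transport" gets 1 and is dropped.
combined = majority_vote([{"kinase"}, {"kinase", "transport"}, {"kinase"}, set()])
```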
Similar and Dissimilar Proteins

- 89% of proteins – at least one good BLAST hit
  - Proteins which are similar (often homologous) to the set of well-studied proteins
- 11% of proteins – no good BLAST hit
  - Proteins which are not similar to the set of well-studied proteins
Coverage

- Coverage: percentage of proteins for which a prediction is made

Organism          Good BLAST Hit   No Good BLAST Hit
D. melanogaster   60%              40%
S. cerevisiae     62%              38%
Similar Proteins – Exploiting BLAST

- BLAST is fast and accurate when a good hit is found
  - Can exploit this to lower the cost of local predictors
- Generate candidate nodes
- Only compute local predictors for candidate nodes
- Candidate node set should have:
  - High recall
  - Minimal size
Similar Proteins – Exploiting BLAST

- Candidate node generation methods:
  - Searching outward from a BLAST hit
  - Taking the union of more than one BLAST hit’s annotations
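A sketch of the union method, under my reading of "BLAST-2-Union" (union of the top-2 hits' annotations, closed under ancestors so the candidate set respects the hierarchy); the data structures are hypothetical:

```python
def blast_union_candidates(ranked_hits, annotations, ancestors, k=2):
    """Candidate ontology nodes from the top-k BLAST hits' annotations.

    annotations: database protein -> set of its annotated nodes.
    ancestors:   node -> set of all its ancestors in the ontology.
    """
    candidates = set()
    for hit in ranked_hits[:k]:
        for node in annotations.get(hit, ()):
            candidates.add(node)
            candidates |= ancestors.get(node, set())
    return candidates
```

Local predictors then run only on this (small) candidate set instead of all 406 nodes.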
Similar Proteins – Exploiting BLAST

Method           Precision   Recall   Avg Cost per Protein
All              77%         80%      1219
Top-Down         77%         79%      111
BLAST-2-Union    79%         78%      20
BLAST-Search-3   78%         78%      221
Dissimilar Proteins

- The more interesting case

Method            Precision   Recall   Avg Cost per Protein
BLAST             19%         20%      1
Voting            55%         32%      812
Top-Down Voting   56%         32%      58
Comparison to Protfun

- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 “no good BLAST hit” proteins

Method    Precision   Recall
Protfun   14%         13%
Voting    69%         29%
Future Work

- Try the other two ontologies – Biological Process and Cellular Component
- Use other local predictors
- More parameter tuning
- Predictor cost
Conclusion

- Protein function prediction provides good leads for protein function determination
- Hierarchical ontologies can represent incomplete data, allowing the prediction of more functions
- Considering the hierarchy:
  - More accurate & less computationally intensive
- Methods presented have a higher coverage than BLAST alone
- Results accepted to IEEE CIBCB 2005
Thanks to…

- Duane Szafron and Paul Lu
- Brett Poulin and Russ Greiner
- Everyone in the Proteome Analyst research group
Incomplete Data & Prediction

- Inclusive avoids using ambiguous (incomplete) training data
- Does this help?
- To test:
  - Train on more incomplete data: choose X% of proteins, and move one annotation up
  - Evaluate predictions on “complete” data
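The degradation step can be sketched as follows (illustrative, not the thesis code): for a fraction of proteins, one annotation is replaced by its parent, simulating an incompletely annotated training set.

```python
import random

def make_incomplete(labels, parents, fraction, rng=None):
    """For ~fraction of proteins, move one annotation a step up the hierarchy.

    labels:  protein -> set of annotated nodes.
    parents: node -> list of parent nodes (roots are absent or empty).
    """
    rng = rng or random.Random(0)
    degraded = {}
    for protein, nodes in labels.items():
        nodes = set(nodes)
        movable = [n for n in nodes if parents.get(n)]
        if movable and rng.random() < fraction:
            node = rng.choice(movable)
            nodes.discard(node)
            nodes.add(rng.choice(parents[node]))
        degraded[protein] = nodes
    return degraded
```

Predictions are then evaluated against the original ("complete") labels to measure robustness.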
Robustness to Incomplete Data
Local vs Global Cross-Validation

- Some node predictors have as few as 20 positive examples
- How to do cross-validation to make sure each predictor has enough positive training examples?
Local vs Global Cross-Validation

- Local cross-validation is invalid:
  - Predictions must be consistent
  - Need fold isolation
- A single global split
  - Global cross-validation