Protein Function Prediction
Predicting Protein Function
Using Machine-Learned
Hierarchical Classifiers
Roman Eisner
Supervisors: Duane Szafron and Paul Lu
09 / 23 / 2005
[email protected]
Outline
Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion
Proteins
Functional Units in the cell
Perform a Variety of Functions
e.g. Catalysis of reactions, Structural and
mechanical roles, transport of other molecules
Can take years to study a single protein
Any good leads would be helpful!
Protein Function Prediction and
Protein Function Determination
Prediction:
An estimate of what function a protein performs
Determination:
Work in a laboratory to observe and discover what function a protein performs
Prediction complements determination
Proteins
Chain of amino acids
20 Amino Acids
FASTA format:
>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
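A minimal sketch of reading the record above in Python. The function name `parse_fasta` is hypothetical, and the sketch assumes the usual FASTA layout: one header line starting with `>`, followed by sequence lines that are concatenated.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>P18077 – R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
"""

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

records = parse_fasta(EXAMPLE)
header, sequence = records[0]
uses_only_standard_residues = set(sequence) <= STANDARD_AMINO_ACIDS
```

The sequence string is the only input the predictors in this talk need; the header is kept just to identify the protein.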
Ontologies
Standardized Vocabularies
(Common Language)
In biological literature, different terms can be
used to describe the same function
e.g. “peroxiredoxin activity” and
“thioredoxin peroxidase activity”
Can be structured in a hierarchy to show
relationships
Gene Ontology
Directed Acyclic Graph (DAG)
Always changing
Describes 3 aspects of protein annotations:
Molecular Function
Biological Process
Cellular Component
Hierarchical Ontologies
Can help to represent a large number of
classes
Represent General and Specific data
Some data is incomplete – could become
more specific in the future
Incomplete Annotations
Goal
To predict the function of proteins given their
sequence
Data Set
Protein Sequences
Ontology
Gene Ontology Molecular Function aspect
Experimental Annotations
UniProt database
Gene Ontology Annotation project @ EBI
Pruned Ontology: 406 nodes (out of 7,399)
with ≥ 20 proteins
Final Data Set: 14,362 proteins
Predictors
Global:
BLAST NN
Local:
PA-SVM (linear)
PFAM-SVM (linear)
Probabilistic Suffix Trees
Why Linear SVMs?
Accurate
Explainability
Each term in the dot product is meaningful
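A small sketch of the explainability point: a linear classifier's decision value is w . x + b, so each feature's term can be read off separately. The feature names, weights, and bias below are invented for illustration and are not taken from PA-SVM or PFAM-SVM.

```python
def linear_decision(weights, bias, features):
    """Return the decision value w . x + b and its per-feature terms."""
    terms = {name: weights.get(name, 0.0) * value
             for name, value in features.items()}
    return sum(terms.values()) + bias, terms


# Hypothetical weights for one node's classifier (illustrative numbers only).
weights = {"kinase": 1.5, "membrane": -0.5, "atp-binding": 1.0}
bias = -1.0
# Hypothetical feature values for one protein.
features = {"kinase": 1.0, "atp-binding": 1.0, "ribosomal": 1.0}

score, terms = linear_decision(weights, bias, features)
predicted_positive = score > 0

# Rank the terms by how strongly they pushed the decision.
explanation = sorted(terms.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

With a non-linear kernel the decision value no longer splits into one term per feature, which is the trade-off this slide is pointing at.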
PA-SVM
Proteome Analyst
PFAM-SVM
Hidden Markov Models
PST
Probabilistic Suffix Trees
Efficient Markov chains
Model the protein sequences directly:
Prediction:
BLAST
Protein Sequence Alignment for a query protein
against any set of protein sequences
Evaluating Predictions in a Hierarchy
Not all errors are equivalent
An error to a sibling is different from an error to an unrelated part of the hierarchy
Proteins can perform more than one function
Need to combine predictions of multiple functions into a single measure
Evaluating Predictions in a Hierarchy
Semantics of the
hierarchy – True Path
Rule
Protein labeled with:
{T} -> {T, A1, A2}
Predicted functions:
{S} -> {S, A1, A2}
Precision = 2/3 = 67%
Recall = 2/3 = 67%
Evaluating Predictions in a Hierarchy
Protein labeled with:
{T} -> {T, A1, A2}
Predicted:
{C1} -> {C1, T, A1, A2}
Precision = 3/4 = 75%
Recall = 3/3 = 100%
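The two worked examples can be reproduced with a short sketch. The hierarchy below is an assumption chosen to match the slides (T and S share the ancestors A1 and A2, and C1 is a child of T); the function names are hypothetical.

```python
# Hypothetical child -> parents map consistent with the two examples.
PARENTS = {
    "A2": [],
    "A1": ["A2"],
    "T": ["A1"],
    "S": ["A1"],
    "C1": ["T"],
}


def with_ancestors(nodes, parents=PARENTS):
    """Apply the True Path Rule: add every ancestor of every node."""
    closed = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents[node])
    return closed


def hierarchical_precision_recall(actual, predicted, parents=PARENTS):
    """Precision and recall computed on the ancestor-closed label sets."""
    actual_set = with_ancestors(actual, parents)
    predicted_set = with_ancestors(predicted, parents)
    overlap = len(actual_set & predicted_set)
    precision = overlap / len(predicted_set) if predicted_set else 0.0
    recall = overlap / len(actual_set) if actual_set else 0.0
    return precision, recall


# Predicting the sibling S for a protein labeled T.
sibling = hierarchical_precision_recall({"T"}, {"S"})
# Predicting the child C1 for a protein labeled T.
child = hierarchical_precision_recall({"T"}, {"C1"})
```

Here `sibling` gives precision 2/3 and recall 2/3, and `child` gives precision 3/4 and recall 3/3, matching the figures on the two slides.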
Supervised Learning
Cross-Validation
Used to estimate performance of classification system on future data
5-Fold Cross-Validation
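A generic sketch of 5-fold cross-validation, not the exact procedure used in the experiments: the proteins are shuffled once, split into five folds, and each fold serves as the test set exactly once. The helper name `k_fold_splits` and the protein identifiers are hypothetical.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists; each item is in a test set exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


proteins = [f"protein_{i}" for i in range(23)]  # hypothetical identifiers
splits = list(k_fold_splits(proteins, k=5))
```

Performance is then averaged over the five test folds, so every protein is predicted by a model that never saw it during training.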
Inclusive vs Exclusive
Local Predictors
In a system of local predictors, how should each local predictor behave?
Two extremes:
Exclusive: a local predictor predicts positive only for those proteins that belong exactly at that node
Inclusive: a local predictor predicts positive for those proteins that belong at or below that node in the hierarchy
No a priori reason to choose either
Exclusive Local Predictors
Inclusive Local Predictors
Training Set Design
Proteins in the current fold’s training set can
be used in any way
Need to select for each local predictor:
Positive training examples
Negative training examples
Training Set Design
Method           Positive Examples     Negative Examples
Exclusive        T                     Not [T]
Less Exclusive   T                     Not [T U Descendants(T)]
Less Inclusive   T U Descendants(T)    Not [T U Descendants(T)]
Inclusive        T U Descendants(T)    Not [T U Descendants(T) U Ancestors(T)]
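The four schemes can be written as set operations. This is a sketch under simplifying assumptions that are not in the slides: the hierarchy is a small invented tree rather than a DAG, and each protein carries a single most-specific annotation. Proteins that fall in neither set are left out of training for that node.

```python
# Hypothetical hierarchy: child -> parent (a tree, for simplicity).
PARENT = {"A1": "A2", "T": "A1", "S": "A1", "C1": "T"}


def ancestors(node):
    """All nodes on the path from node (exclusive) up to the root."""
    result = set()
    while node in PARENT:
        node = PARENT[node]
        result.add(node)
    return result


def descendants(node):
    """All nodes that have `node` among their ancestors."""
    return {n for n in PARENT if node in ancestors(n)}


def training_sets(node, labels, scheme):
    """Return (positives, negatives) for the local predictor at `node`.

    `labels` maps each protein to the node it is annotated at.
    """
    at = {p for p, n in labels.items() if n == node}
    below = {p for p, n in labels.items() if n in descendants(node)}
    above = {p for p, n in labels.items() if n in ancestors(node)}
    everyone = set(labels)
    if scheme == "exclusive":
        return at, everyone - at
    if scheme == "less exclusive":
        return at, everyone - at - below
    if scheme == "less inclusive":
        return at | below, everyone - at - below
    if scheme == "inclusive":
        return at | below, everyone - at - below - above
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical annotations for five proteins.
labels = {"p1": "T", "p2": "C1", "p3": "A1", "p4": "S", "p5": "A2"}
inclusive_pos, inclusive_neg = training_sets("T", labels, "inclusive")
```

For node T in this toy data, the inclusive scheme uses p1 and p2 as positives and only p4 as a negative: p3 and p5 sit at ancestors of T, so they might really belong at or below T and are not used as negatives.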
Comparing Training Set Design Schemes
Using PA-SVM
Method           Precision   Recall   F1-Measure   Exceptions per Protein
Exclusive        75.8%       32.8%    45.8%        1.52
Less Exclusive   77.7%       40.4%    53.1%        1.74
Less Inclusive   77.3%       63.8%    69.9%        0.05
Inclusive        75.3%       65.2%    69.9%        0.09
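The F1-Measure column is the harmonic mean of precision and recall. A quick check, recomputing it from the rounded precision and recall values in the table (so the last digit can differ slightly from the reported figure):

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each scheme, as fractions.
TABLE = {
    "Exclusive": (0.758, 0.328, 0.458),
    "Less Exclusive": (0.777, 0.404, 0.531),
    "Less Inclusive": (0.773, 0.638, 0.699),
    "Inclusive": (0.753, 0.652, 0.699),
}

recomputed = {name: f1_measure(p, r) for name, (p, r, _) in TABLE.items()}
```

The recomputed values agree with the reported column to within rounding, which shows that the gain of the inclusive schemes comes almost entirely from recall.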
The exclusive schemes produce more exceptions per protein
Lowering the Cost of Local Predictors
Top-Down
Compute local predictors
top to bottom until a
negative prediction is
reached
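A sketch of the top-down idea, assuming an invented hierarchy and stand-in local predictors (each is just a function returning True or False). A node's children are evaluated only if the node itself was predicted positive, and the sketch counts how many predictors were actually computed.

```python
# Hypothetical hierarchy: node -> children.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}


def top_down_predict(protein, local_predictors, children=CHILDREN, root="root"):
    """Return (predicted nodes, number of local predictors evaluated)."""
    predicted, evaluated = set(), 0
    visited = set()
    frontier = list(children[root])
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue  # in a DAG a node can be reached from several parents
        visited.add(node)
        evaluated += 1
        if local_predictors[node](protein):
            predicted.add(node)
            # Only descend below nodes that were predicted positive.
            frontier.extend(children[node])
    return predicted, evaluated


# Stand-in predictors: a "protein" is the set of nodes it should be positive at.
local_predictors = {
    node: (lambda protein, node=node: node in protein)
    for node in CHILDREN if node != "root"
}

predicted, evaluated = top_down_predict({"A", "A1"}, local_predictors)
```

In this toy run four of the five predictors are evaluated, because nothing below the negative node B is computed. The approach relies on ancestors answering positive for proteins annotated deeper, so an exclusive predictor at an ancestor would stop the descent early.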
Top-Down Search
Method           Previous F1-Measure   Top-Down F1-Measure   Number of Local Predictors Computed
Exclusive        45.8%                 0.4%                  10
Less Exclusive   53.1%                 2.7%                  10
Less Inclusive   69.9%                 69.8%                 32
Inclusive        69.9%                 69.9%                 32
Predictor Results
Predictor   Precision   Recall
PA-SVM      75.4%       64.8%
PFAM-SVM    74.0%       57.5%
PST         57.5%       63.6%
BLAST       76.7%       69.6%
Voting      76.3%       73.3%
Similar and Dissimilar Proteins
89% of proteins – at least one good BLAST
hit
Proteins which are similar (often homologous) to
the set of well studied proteins
11% of proteins – no good BLAST hit
Proteins which are not similar to the set of well
studied proteins
Coverage
Coverage: Percentage of proteins for which a prediction is made
Organism          Good BLAST Hit   No Good BLAST Hit
D. melanogaster   60%              40%
S. cerevisiae     62%              38%
Similar Proteins – Exploiting BLAST
BLAST is fast and accurate when a good hit is found
Can exploit this to lower the cost of local predictors
Generate candidate nodes
Only compute local predictors for candidate nodes
Candidate node set should have:
High Recall
Minimal Size
Similar Proteins – Exploiting BLAST
Candidate-node generation methods:
Searching outward from the BLAST hit
Taking the union of more than one BLAST hit's annotations
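A sketch of the two candidate-generation ideas. The details are assumptions, since the slides do not spell them out: "searching outward" is read here as a breadth-first walk a fixed number of edges up or down from the hit's annotations, and "union" as merging the annotation sets of the best few hits. The hierarchy and the hit annotations are invented.

```python
from collections import deque

# Hypothetical hierarchy: child -> parents.
PARENTS = {
    "A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"],
    "C1": ["T"], "C2": ["T"],
}


def neighbours(node):
    """Parents and children of a node."""
    kids = [child for child, ps in PARENTS.items() if node in ps]
    return PARENTS[node] + kids


def search_outward(start_nodes, max_steps):
    """Nodes within max_steps edges (up or down) of the start nodes."""
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue
        for nxt in neighbours(node):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


def union_of_hits(hit_annotations, top_n):
    """Union of the annotation sets of the first top_n BLAST hits."""
    candidates = set()
    for annotation in hit_annotations[:top_n]:
        candidates |= annotation
    return candidates


# Hypothetical annotations of the BLAST hits, best hit first.
hits = [{"T"}, {"S"}, {"C2"}]
by_search = search_outward(hits[0], max_steps=1)
by_union = union_of_hits(hits, top_n=2)
```

Local predictors are then computed only for the candidate nodes, which is the trade-off the slide names: the candidate set should be small but still contain the true functions.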
Similar Proteins – Exploiting BLAST
Method           Precision   Recall   Avg Cost per Protein
All              77%         80%      1219
Top-Down         77%         79%      111
BLAST-2-Union    79%         78%      20
BLAST-Search-3   78%         78%      221
Dissimilar Proteins
The more interesting case
Method            Precision   Recall   Avg Cost per Protein
BLAST             19%         20%      1
Voting            55%         32%      812
Top-Down Voting   56%         32%      58
Comparison to Protfun
On a pruned ontology (9 Gene Ontology classes)
On 1,637 “no good BLAST hit” proteins
Method    Precision   Recall
Protfun   14%         13%
Voting    69%         29%
Future Work
Try the other two Gene Ontology aspects: biological process and cellular component
Use other local predictors
More parameter tuning
Predictor cost
Conclusion
Protein Function Prediction provides good leads for
Protein Function Determination
Hierarchical ontologies can represent incomplete
data allowing the prediction of more functions
Considering the hierarchy:
More accurate & Less Computationally Intensive
Methods presented have a higher coverage than
BLAST alone
Results accepted to IEEE CIBCB 2005
Thanks to…
Duane Szafron and Paul Lu
Brett Poulin and Russ Greiner
Everyone in the Proteome Analyst research
group
Incomplete Data & Prediction
Inclusive avoids using ambiguous
(incomplete) training data
Does this help?
To test:
Train on more Incomplete Data:
Choose X% of proteins, and move one annotation up
Evaluate predictions on “Complete” data
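One way the test could be set up, as a sketch: a chosen fraction of proteins has its annotation replaced by the parent node before training, while evaluation still uses the original labels. The tree, the function name `make_incomplete`, and the choice to sample only among proteins that have a parent to move to are all assumptions.

```python
import random

# Hypothetical hierarchy: child -> parent.
PARENT = {"A1": "A2", "T": "A1", "S": "A1", "C1": "T"}


def make_incomplete(labels, fraction, seed=0):
    """Copy `labels`, moving a random fraction of proteins up one level."""
    rng = random.Random(seed)
    # Proteins annotated at the root cannot be moved up, so skip them.
    movable = sorted(p for p, node in labels.items() if node in PARENT)
    chosen = rng.sample(movable, round(fraction * len(movable)))
    degraded = dict(labels)
    for protein in chosen:
        degraded[protein] = PARENT[degraded[protein]]
    return degraded


complete = {f"p{i}": "C1" for i in range(10)}  # hypothetical annotations
training_labels = make_incomplete(complete, fraction=0.5)
moved = [p for p in complete if training_labels[p] != complete[p]]
```

Training on `training_labels` and scoring against `complete` for increasing fractions shows how quickly each training-set scheme degrades as annotations become less specific.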
Robustness to Incomplete Data
Local vs Global Cross-Validation
Some node predictors have as few as 20 positive examples
How to do cross-validation to make sure each
predictor has enough positive training examples?
Local vs Global Cross-Validation
Local cross-validation is invalid
Predictions must be consistent
Need fold isolation
Use a single global split: global cross-validation