
Methods for Improving Protein
Disorder Prediction
Slobodan Vucetic1, Predrag Radivojac3, Zoran Obradovic3, Celeste J.
Brown2, Keith Dunker2
1 School of Electrical Engineering and Computer Science,
2 Department of Biochemistry and Biophysics
Washington State University, Pullman, WA 99164
3 Center for Information Science and Technology
Temple University, Philadelphia, PA 19122
ABSTRACT
Attribute construction, choice of classifier, and post-processing were
explored for improving prediction of protein disorder. While ensembles
of neural networks achieved the highest accuracy, the difference
compared to logistic regression classifiers was smaller than 1%.
Bagging of neural networks, where moving averages over windows of
length 61 were used for attribute construction, combined with
post-processing by averaging predictions over windows of length 81,
resulted in 82.6% accuracy for a larger set of ordered and disordered
proteins than used previously.
This result was a significant improvement over the previous
methodology, which gave an accuracy of 70.2%. Moreover, unlike the
previous methodology, the modified attribute construction allowed
prediction at protein ends.
Motivation
Standard "Lock and Key" Paradigm for
Protein Structure/Function Relationships
Amino Acid Sequence
3-D Structure
Protein Function
(Fischer, Ber. Dt. Chem. Ges.,1894)
Protein Disorder - Part of a Protein without
a Unique 3D Structure
Example: Calcineurin Protein
[Figure: calcineurin structure, disordered regions indicated]
(Kissinger et al, Nature, 1995)
Overall Objective
Better Understand Protein Disorders
Hypothesis:
• Since amino acid sequence determines
structure, sequence should determine lack of
structure (disorder) as well.
Test
• Construct a protein disorder predictor
• Check its accuracy
• Apply it on large protein sequence databases
Objective of this Study
• Previous results showed that disorder can be
predicted from sequence with ~70% accuracy
(based on 32 disordered proteins)
• Our goals are to increase accuracy by
– Increasing the database of disordered proteins
– Improving knowledge representation and attribute selection
– Examining predictor types and post-processing
– Performing extensive cross-validation using different accuracy measures
Data Sets
• Searching disordered proteins
(DIFFICULT)
– Keyword search of PubMed
(http://www.ncbi.nlm.nih.gov) for disorders
identified by NMR, Circular dichroism,
protease digestion
– Search over Protein Data Bank (PDB) for
disorders identified by X-ray crystallography
• Searching ordered proteins (EASY)
– Most proteins in Protein Data Bank (PDB)
are ordered
Data Sets
• Set of protein disorders (D_145)
– Search revealed 145 nonredundant
proteins (<25% identity) with long
disordered regions (>40 amino acids) with
16,705 disordered residues
• Set of ordered proteins (O_130)
– 130 nonredundant completely ordered
proteins with 32,506 residues were chosen
to represent examples of protein order
Data representation
Background
• Conformation is mostly influenced by
locally surrounding amino acids
• Higher order statistics not very useful in
proteins [Nevill-Manning, Witten, DCC
1999]
• Domain knowledge is a source of
potentially discriminative features
Attribute Selection
(including protein ends)
[Figure: a sliding WINDOW of size Win moves along the amino acid
SEQUENCE; each window position is labeled with class 1 (disordered)
or 0 (ordered)]
Calculate over window:
• 20 Compositions
• K2 entropy
• 14Å Contact Number
• Hydropathy
• Flexibility
• Coordination Number
• Bulkiness
• CFYW
• Volume
• Net Charge
Attribute Selection
(including protein ends)
• Attribute construction resembles low-pass filtering. As a consequence:
– the effective data size of D_145 is ~ 2*16,705/Win
– the effective data size of O_130 is ~ 2*32,506/Win
• K2 entropy - low-complexity proteins are likely disordered
• Flexibility, Hydropathy, etc. - correlated with disorder
• 20 AA compositions - the occurrence or absence of certain amino
acids in the window is correlated with disorder incidence
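The windowed attribute construction described above can be sketched in a few lines. This is a minimal illustration, not the published feature pipeline: it computes only the 20 amino-acid compositions and the Shannon (K2-style) entropy over a window centered on each residue, and it truncates the window at protein ends, which is what permits prediction for terminal residues.

```python
import math

def window_attributes(seq, i, win=61, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Window attributes for residue i: 20 AA compositions and a
    Shannon entropy over the window. The window shrinks at protein
    ends instead of being dropped, so terminal residues still get
    a prediction."""
    half = win // 2
    window = seq[max(0, i - half): i + half + 1]  # truncated at ends
    comp = {aa: window.count(aa) / len(window) for aa in alphabet}
    entropy = -sum(p * math.log2(p) for p in comp.values() if p > 0)
    return comp, entropy

comp, k2 = window_attributes("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 5, win=9)
```

Because the windowed values change slowly along the sequence, neighboring examples are highly correlated, which is why the effective data size scales roughly as 2N/Win.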
Disorder Predictor Models
We examine:
• Logistic Regression (LR)
Classification model, stable, linear
• Neural Networks
Slow training, unstable, powerful, need much data
• Ensemble of Neural Networks (Bagging,
Boosting)
Very slow, stable, powerful
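The bagging procedure used for the neural-network ensembles can be sketched with a stand-in base learner. This is a toy illustration, assuming one-dimensional examples and substituting a brute-force decision stump for the neural network; only the bagging logic (bootstrap resampling plus averaged votes) mirrors the method above.

```python
import random

def train_stump(examples):
    """Fit a threshold classifier (stand-in for a neural network)
    by exhaustive search over candidate thresholds."""
    best = None
    for t in sorted(x for x, _ in examples):
        acc = sum((x > t) == y for x, y in examples) / len(examples)
        for flip in (False, True):
            a = 1 - acc if flip else acc
            if best is None or a > best[0]:
                best = (a, t, flip)
    _, t, flip = best
    return (lambda x: x <= t) if flip else (lambda x: x > t)

def bag(examples, n_models=30, seed=0):
    """Bagging: train each model on a bootstrap resample of the
    training set, then average the ensemble's votes into a
    prediction in [0, 1]."""
    rng = random.Random(seed)
    models = [train_stump([rng.choice(examples) for _ in examples])
              for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / n_models
```

Averaging over 30 bootstrap-trained models stabilizes an unstable base learner, which is why the ensembles above are described as stable despite the neural networks being unstable individually.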
Postprocessing
• We examine LONG disordered regions:
– neighboring residues likely belong to the same ordered/disordered region
• Predictions can be improved:
– perform a moving average of the predictions over a window of length Wout
Data → Disorder Predictor → Wout Filter → Prediction
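The Wout filter in the pipeline above is a plain moving average over the raw per-residue predictions. A minimal sketch, again truncating the window at protein ends:

```python
def smooth_predictions(preds, wout=81):
    """Post-process raw per-residue disorder predictions with a
    moving average of length Wout. The window is truncated at
    protein ends so every residue keeps a smoothed value."""
    half = wout // 2
    out = []
    for i in range(len(preds)):
        window = preds[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out
```

Because long disordered regions span many residues, isolated mispredictions are damped by their neighbors, which is the intuition behind the accuracy gain from post-processing.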
Accuracy Measures
• Length of disordered regions in different
proteins varies from 40 to 1,800 AA
• We measure two types of accuracy
– Per-residue (averaged over residues)
– Per-protein (averaged over proteins)
• ROC curve - measures True Positive
(TP) against False Negative (FN)
predictions
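The distinction between the two accuracy measures is worth making concrete: per-residue accuracy pools all residues, so a 1,800-residue region dominates a 40-residue one, while per-protein accuracy weights every protein equally. A minimal sketch:

```python
def per_residue_accuracy(proteins):
    """proteins: list of (predictions, labels) pairs, one pair per
    protein. Pool all residues, then compute one global accuracy."""
    correct = total = 0
    for preds, labels in proteins:
        correct += sum(p == y for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total

def per_protein_accuracy(proteins):
    """Compute each protein's accuracy first, then average over
    proteins, so long regions do not dominate short ones."""
    accs = [sum(p == y for p, y in zip(preds, labels)) / len(labels)
            for preds, labels in proteins]
    return sum(accs) / len(accs)
```

With one long well-predicted protein and one short poorly predicted one, per-residue accuracy exceeds per-protein accuracy, which matches the gap between the two measures reported in the results below.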
Experimental Methodology
• Balanced data sets of order/disorder
examples
• Cross-validation:
– 145 disordered proteins divided into 15 subsets
(15-fold cross validation for TP accuracy)
– 130 ordered proteins divided into 13 subsets (13-fold CV for TN accuracy)
• To prevent collinearity and overfitting, 20 attributes are selected
(18 AA compositions, Flexibility, and K2 entropy)
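The cross-validation above splits at the protein level, not the residue level: since windowed attributes of neighboring residues are nearly identical, putting residues from one protein into both train and test folds would leak information. A minimal sketch of such a split (the round-robin assignment is an illustrative choice, not necessarily the published one):

```python
def protein_folds(proteins, k):
    """Assign whole proteins to k CV subsets, so residues from one
    protein never appear in both the training and the test fold."""
    folds = [[] for _ in range(k)]
    for i, p in enumerate(proteins):
        folds[i % k].append(p)
    return folds

def cv_splits(proteins, k):
    """Yield (train, test) pairs for k-fold cross-validation over
    whole proteins."""
    folds = protein_folds(proteins, k)
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        yield train, test
```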
Experimental Methodology
• 2,000 examples randomly selected for training
• Feedforward Neural Networks with one hidden layer
and 5 hidden nodes.
• 100 epochs of resilient backpropagation
• Bagging and Boosting ensembles with 30 neural
networks
• Examined Win, Wout = {1, 9, 21, 41, 61, 81, 121}
• For each pair (Win, Wout) CV repeated 10 times for
neural networks and once for Logistic Regression,
Bagging and Boosting
Results – Model Comparison
Per-protein accuracy, (Win, Wout) = (41,1)
Model                 TN         TP         Average
Logistic Regression   79.7       69.9       73.5
Neural Networks       79.2±1.3   72.5±1.4   75.8
Bagging               81.4       72.8       77.1
Boosting              81.5       73.1       77.3
• Neural networks are slightly more accurate than linear predictors
• Ensembles of NNs are slightly better than individual NNs
• Boosting and Bagging result in similar accuracy
• TN rate is significantly higher than TP rate (~10%)
[Figure: attribute-space coverage of DISORDER enclosing that of ORDER]
• Indication that the attribute-space coverage of disorder is larger
than the coverage of order
→ Disorder is more diverse than order
Results – Influence of Filter Size
Per-protein accuracy with bagging

[Plot: per-protein accuracy (0.65–0.85) versus Wout (0–120), one
curve each for Win = 9, 21, and 61]

• Different pairs of (Win, Wout) can result in similar accuracy
• Wout = 81 seems to be the optimal choice
Results – Optimal (Win, Wout)
Per-protein and per-residue accuracy of bagging
                  Per-protein Accuracy
Win   Wout*    TN     TP     Average      Per-residue Accuracy
  9    81     93.5   65.2     79.3              81.1
 21    81     93.5   71.5     82.5              84.3
 41    81     90.3   73.7     82.0              84.5
 61    81     88.8   76.5     82.6              85.3
 81    61     86.1   77.9     82.0              85.3
121    61     85.3   76.8     81.0              85.4

• Per-residue accuracy gives higher values
• For a wide range of Win, the optimal Wout = 81
• The best per-protein result (82.6%) was achieved with (Win, Wout) = (61, 81)
Results – ROC Curve
Compare (Win, Wout) = (21, 1) and (61, 81)

[Plot: ROC curves, TP versus FN, for (21, 1) and (61, 81)]

• (Win, Wout) = (61, 81) is superior: ~10% improvement in
per-protein accuracy
• (Win, Wout) = (21, 1) corresponds to our previous predictor
Results – Accuracy at Protein Ends
Comparison on O_130 proteins

[Plot: accuracy (0.4–0.9) over the first 20 positions (Region I) and
last 20 positions (Region II); solid: (Win = 61, Wout = 81), dashed:
(Win = 21, Wout = 1)]

• Comparison of accuracies at the first 20 (Region I) and last 20
(Region II) positions of O_130 proteins
Results – Accuracy at Protein Ends
Comparison on D_145 proteins

[Plot: accuracy (0.5–0.9) over Regions I–IV; solid:
(Win = 61, Wout = 81), dashed: (Win = 21, Wout = 1)]

• Averaged accuracies of the first 20 positions of 91 disordered
regions that start at the beginning of the protein sequence
(Region I) and 54 disordered regions that do not start at the
beginning of the protein sequence (Region II)
• Averaged accuracies of the last 20 positions of 76 disordered
regions that do not end at the end of the protein sequence
(Region III) and 69 disordered regions that end at the end of the
protein sequence (Region IV)
Conclusions
• Modifications in data representation, attribute selection, and
prediction post-processing were proposed
• Predictors of different complexity were examined
• A 10% accuracy improvement over our previous predictors was achieved
• The difference in accuracy between linear models and ensembles of
neural networks is fairly small
Acknowledgements
Support from NSF-CSE-IIS-9711532 and NSF-IIS-0196237 to Z.O. and
A.K.D. and from N.I.H. 1R01 LM06916 to A.K.D. and Z.O. is gratefully
acknowledged.