PREDICTION OF CATALYTIC RESIDUES IN PROTEINS USING MACHINE-LEARNING TECHNIQUES
Natalia V. Petrova (Ph.D. Student, Georgetown University, Biochemistry Department), Cathy H. Wu (Protein Information Resource, Georgetown University, Departments of Biochemistry and Molecular Biology and Oncology)
ABSTRACT
We present a method for predicting catalytic residues in proteins
using machine-learning techniques. Using a benchmarking dataset of
enzymes with known catalytic sites, we identified the best-performing
machine-learning algorithm (the support vector classifier, SMO) and
the features of protein residues that are relevant for predicting
catalytic residues.
This method can predict catalytic residues and the 3D location of the
active site with an accuracy > 86% for proteins of unknown function,
provided that the structure of the protein is known.
PRELIMINARY ATTRIBUTE SET
BENCHMARKING DATASET
WRAPPER ATTRIBUTE SELECTION ALGORITHM
[Figure: 10-fold Cross-Validation Analysis of the Performance of the
Different Algorithms. Y-axis: Correctly Classified, % (50.0 to 90.0).
X-axis: Algorithm (NaiveBayes, Logistic, NeuralNetwork, SMO,
SimpleLogistic, VotedPerceptron, IB1, IBK, KStar, LWL, HyperPipes, VFI,
ADTree, DecisionStump, J48, LMT, RandomForest, RandomTree, REPTree,
ConjunctiveRule, DecisionTable, NNge, OneR, PART, JRip, ZeroR).]
SMO is the best-performing algorithm (among those tested) for the
prediction of catalytic residues.
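The algorithm comparison reported here rests on 10-fold cross-validation: each candidate classifier is trained on nine folds and scored on the held-out fold, and the percentage of correctly classified instances is compared across classifiers. The following is a minimal sketch of that procedure in plain Python rather than WEKA. The two classifiers (`MajorityClass`, `NearestCentroid`), the helper names, and the synthetic two-attribute data are illustrative assumptions and are not taken from the poster.

```python
import random
from statistics import mean


def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, x):
        return self.label


class NearestCentroid:
    """Toy classifier: predicts the class whose mean attribute vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, x):
        def dist2(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=dist2)


def cv_accuracy(make_classifier, X, y, k=10):
    """Percentage of correctly classified instances, pooled over k folds."""
    correct = 0
    for train, test in k_fold_splits(len(X), k):
        model = make_classifier().fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(model.predict(X[i]) == y[i] for i in test)
    return 100.0 * correct / len(X)


# Synthetic data (illustrative only): 100 instances per class, labels +1 / -1.
rng = random.Random(1)
X = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(100)] + \
    [[rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)] for _ in range(100)]
y = [+1] * 100 + [-1] * 100

candidates = {"MajorityClass": MajorityClass, "NearestCentroid": NearestCentroid}
results = {name: cv_accuracy(cls, X, y) for name, cls in candidates.items()}
best = max(results, key=results.get)
```

The same loop extends to any number of candidates; only the `candidates` dictionary changes.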
BEST-PERFORMING CLASSIFIER – ‘SMO’
METHODS
INTRODUCTION
One of the major goals of proteomics is to assign a function to
every protein. Knowledge of a protein's function is key to
determining the role it plays in the cell. The number of proteins
whose functions have been experimentally characterized is growing
linearly every year. Experimental data provide reliable (in most
cases) information about functional residues as well as the possible
mechanism of protein function. However, the analytical methods used
for experimental characterization of protein function involve many
man-hours. This effort can be reduced by improving existing methods
or, perhaps, by developing new methods in experimental biology. But
since the sizes of the protein sequence and protein structure
databases are growing exponentially, the gap between experimentally
characterized and uncharacterized proteins is also growing
exponentially. As a result, two major groups of computational methods
are being actively developed: homology transfer of known experimental
data, and prediction of protein function from various properties of
proteins and amino acids.
Prediction of the functional residues is a challenging and
interesting task. The results of such prediction could be successfully
used in many research areas such as drug design, experimental
biology, and protein database annotations.
REFERENCES
To train the machine-learning algorithms, we used a benchmarking
dataset that is a subset of the “Catalytic Residue Dataset”. Every
protein in the benchmarking dataset is a member of a manually curated
protein family in the PIR iProClass database. The dataset contains 254
catalytic residues from 79 proteins, drawn from the 178 enzymes of the
Catalytic Residue Dataset (1).
CONCLUSIONS
Using the “Catalytic Residue Dataset”, we built a dataset in which
each instance is represented as a list of attribute values and a class
label {+1 / -1} indicating whether the residue is catalytic (+1) or
not (-1). Each attribute in this dataset is a property of a protein
residue. The list of attributes was chosen mostly on the basis of the
work of Bartlett et al. and of other authors who pointed out the
importance of particular residue properties (2).
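The instance representation described above (a list of attribute values plus a +1 / -1 class label per residue) can be sketched as follows. This is an illustration under stated assumptions: the `ResidueInstance` type, the `build_instances` helper, the residue identifiers, and all attribute values are hypothetical. Only `aa_name` is an attribute name that appears on the poster.

```python
from dataclasses import dataclass


@dataclass
class ResidueInstance:
    residue_id: str    # hypothetical identifier for the residue
    attributes: dict   # attribute name -> value
    label: int         # +1 = catalytic, -1 = not catalytic


def build_instances(residues, catalytic_ids):
    """Attach a +1 / -1 class label to each residue's attribute values."""
    return [
        ResidueInstance(rid, attrs, +1 if rid in catalytic_ids else -1)
        for rid, attrs in residues.items()
    ]


# Made-up attribute values, for illustration only.
residues = {
    "res_1": {"aa_name": "H", "conservation_score": 0.95},
    "res_2": {"aa_name": "L", "conservation_score": 0.40},
    "res_3": {"aa_name": "C", "conservation_score": 0.90},
}
instances = build_instances(residues, catalytic_ids={"res_1", "res_3"})
labels = [inst.label for inst in instances]
```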
Since for a complex dataset it is almost impossible to know a priori
which classification algorithm will perform best, our first goal was
to determine one of the best-performing algorithms among the
machine-learning techniques built into WEKA, a Java software
package (3, 4).
FINAL ATTRIBUTE SET
#    Attribute name
1    aa_name
8    SAS_Total_Side_REL_Naccess
15   cleft_rank_CastP
16   cleft_Vol_SA_CastP
18   nearest_cleft_distance
19   distance_to_3_largest_clefts
20   HB_main_chain_protein_MolMol
24   conservation_score_Scorecons
[Table: Improvement of the prediction. Labels: all, part 1, part 2;
all attributes (24), selected attributes (8).]
ACKNOWLEDGEMENTS
This work would not have been complete without the wise help and guidance
provided by our colleagues at PIR:
• Hongzhan Huang, Ph.D. (PIR: Team Lead, Bioinformatics and Research Assistant Professor)
• Sona Vasudevan, Ph.D. (PIR: Senior Bioinformatics Scientist)
• C.R. Vinayaka, Ph.D. (PIR: Senior Research Scientist)
83.0  87.5  87.1  88.3  87.3  87.5  87.7  78.3  86.1  88.3
82.0  86.7  86.9  86.3  85.9  86.3  86.1  78.3  87.1  86.3
Reducing the number of attributes increases the prediction accuracy
of the SMO algorithm.
8 out of 24 attributes are selected as relevant for the prediction
of catalytic residues.
Model
Different authors focus on different features of the protein in order
to predict catalytic residues. We therefore identified the features of
protein residues that are relevant for predicting catalytic residues,
using our benchmarking dataset of enzymes with known catalytic sites
and a machine-learning attribute selection algorithm,
“Wrapper” (5).
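A wrapper approach scores each candidate attribute subset by the cross-validated accuracy of the classifier itself, so the selection is tuned to the learner that will be used. The sketch below illustrates the idea with a greedy forward search. The poster does not state which search strategy was used, so the forward search is an assumption, as are the nearest-centroid stand-in for SMO, the 5-fold evaluation, the function names, and the synthetic data (attributes 0 and 1 carry signal, attributes 2 and 3 are noise).

```python
import random
from statistics import mean


def cv_accuracy(X, y, columns, k=5, seed=0):
    """k-fold CV accuracy (%) of a nearest-centroid classifier using only `columns`."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for i, test in enumerate(folds):
        train = [j for n, f in enumerate(folds) if n != i for j in f]
        centroids = {}
        for label in (+1, -1):
            rows = [[X[j][c] for c in columns] for j in train if y[j] == label]
            centroids[label] = [mean(col) for col in zip(*rows)]
        for j in test:
            x = [X[j][c] for c in columns]

            def dist2(label):
                return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))

            correct += min(centroids, key=dist2) == y[j]
    return 100.0 * correct / len(X)


def forward_wrapper_selection(X, y, n_attributes):
    """Greedily add the attribute that most improves CV accuracy; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(n_attributes))
    while remaining:
        score, attr = max((cv_accuracy(X, y, selected + [a]), a) for a in remaining)
        if score <= best_score:
            break
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score


# Synthetic data (illustrative only).
rng = random.Random(42)


def sample(label):
    return [rng.gauss(2.0 * label, 1.0), rng.gauss(1.0 * label, 1.0),
            rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]


y = [+1] * 100 + [-1] * 100
X = [sample(label) for label in y]
selected, score = forward_wrapper_selection(X, y, n_attributes=4)
```

Because every candidate subset triggers a full cross-validation run, the wrapper is much more expensive than filter-style selection, which is the price paid for tailoring the subset to the classifier.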
EXAMPLES OF PREDICTION
The selected attributes, combined with the best-performing
algorithm, were used to build a model for the prediction of catalytic
residues (6).
Acetyl-CoA Acetyltransferase, 1afw
GenBank database statistics, http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
PDB database statistics, http://www.rcsb.org/pdb/holdings.html
Bartlett G. J., Porter C. T., Borkakoti N., Thornton J. M., Analysis of Catalytic Residues in Enzyme Active Sites. J. Mol. Biol., 324: 105-121, 2002
Campbell S. J., Gold N. D., Jackson R. M., Westhead D. R., Ligand binding: functional site location, similarity and docking. Current Opinion in Structural Biology, 13: 389-395, 2003
Sjolander K., Karplus K., Brown M., Hughey R., Krogh A., Mian S., Haussler D., Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology, 1996
Smith D. K., Radivojac P., Obradovic Z., Dunker A. K., Zhu G., Improved amino acid flexibility parameters. Protein Science, 12: 1060-1072, 2003
82.6  87.5  86.1  86.7  84.6  86.1  85.7  80.9  85.9  86.9
Average  82.5  87.2  86.7  87.1  85.9  86.6  86.5  79.1  86.4  87.2
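The ten values labeled "Average" appear to be the column-wise means of the three ten-value rows that precede them. This reading is an inference from the numbers and is not stated on the poster. A short check, with the values copied from the poster and the variable names chosen for illustration:

```python
from statistics import mean

# The three rows of accuracies (%) that precede the "Average" row.
rows = [
    [83.0, 87.5, 87.1, 88.3, 87.3, 87.5, 87.7, 78.3, 86.1, 88.3],
    [82.0, 86.7, 86.9, 86.3, 85.9, 86.3, 86.1, 78.3, 87.1, 86.3],
    [82.6, 87.5, 86.1, 86.7, 84.6, 86.1, 85.7, 80.9, 85.9, 86.9],
]
reported_average = [82.5, 87.2, 86.7, 87.1, 85.9, 86.6, 86.5, 79.1, 86.4, 87.2]

# Column-wise mean of the three rows, compared with the reported averages.
column_means = [mean(col) for col in zip(*rows)]
max_deviation = max(abs(m - r) for m, r in zip(column_means, reported_average))
```

Every computed mean agrees with the reported value to within 0.1 percentage points, which is consistent with one-decimal rounding or truncation.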
The performance of the support vector classifier
suggests that linear separation using one
dimension, corresponding to one feature, is not
sufficient for the prediction of catalytic residues.
The SMO algorithm trained on the dataset represented
by the selected attributes has:
Prediction Accuracy: > 86%
TP Rate: 0.898
FP Rate: 0.126
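TP rate and FP rate are fractions between 0 and 1: the TP rate is the share of catalytic residues that are predicted as catalytic, and the FP rate is the share of non-catalytic residues that are wrongly predicted as catalytic. The sketch below shows how they relate to accuracy. The confusion-matrix counts are hypothetical: they were chosen to reproduce rates of 0.898 and 0.126 under the assumption of 254 catalytic and 254 non-catalytic residues, and the poster does not report the number of non-catalytic instances.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate, accuracy) from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)                    # catalytic residues recovered
    fp_rate = fp / (fp + tn)                    # non-catalytic residues misflagged
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # all correct predictions
    return tp_rate, fp_rate, accuracy


# Hypothetical counts (assumption): 254 positives and 254 negatives.
tp_rate, fp_rate, accuracy = classification_rates(tp=228, fn=26, fp=32, tn=222)
```

With these illustrative counts the accuracy comes out at about 88.6%, which is compatible with the reported "> 86%".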
Acylphosphatase, 2acy
[Figure legends: True Positive (TP): red, False Positive (FP): yellow;
True Positive (TP): red, False Positive (FP): blue]
RESULTS
SMO (the support vector classifier) was found to be the
best-performing algorithm (among those tested) for the prediction of
catalytic residues (4).
8 attributes out of 24 were selected as relevant.
As anticipated, attribute selection improved the performance of the
SMO classifier (5).
We measured the prediction accuracy of the algorithm with each
individual attribute removed and found that no attribute can be
excluded from the final list without reducing the performance of the
SMO classifier (5).
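The leave-one-attribute-out check described above can be sketched as follows: evaluate the full attribute set, re-evaluate with each attribute removed in turn, and flag any attribute whose removal does not lower the score. The `evaluate` callable stands in for "10-fold cross-validated accuracy of the SMO classifier". The toy additive scoring function, the attribute names, and the weights are made up for illustration and are not results from the poster.

```python
def ablation_drops(attributes, evaluate):
    """Map each attribute to the accuracy lost when it alone is left out."""
    full_score = evaluate(attributes)
    return {
        a: full_score - evaluate([b for b in attributes if b != a])
        for a in attributes
    }


def removable_attributes(attributes, evaluate):
    """Attributes whose removal does not reduce the score."""
    return [a for a, drop in ablation_drops(attributes, evaluate).items() if drop <= 0]


# Toy evaluator (assumption): accuracy rises by a fixed amount per attribute.
weights = {"attr_a": 5.0, "attr_b": 1.5, "attr_c": 0.5}


def evaluate(attrs):
    return 50.0 + sum(weights[a] for a in attrs)


drops = ablation_drops(list(weights), evaluate)
removable = removable_attributes(list(weights), evaluate)
```

An empty `removable` list corresponds to the situation reported above, where every attribute in the final set contributes to the classifier's performance.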
Catalytic Residues (Acetyl-CoA Acetyltransferase, 1afw): C125, H375, C403, G405
Catalytic Residues (Acylphosphatase, 2acy): R23, N41