Identification of amino acid residues in protein-protein
interaction interfaces using machine learning and a
comparative analysis of the generalized sequence- and
structure- based features employed
Angshuman Bagchi, Ph.D
Assistant Professor of Biochemistry
Department of Biochemistry and Biophysics
University of Kalyani
Formerly postdoctoral fellow in Buck Institute,
Stanford University, California, USA
Purdue University, Indianapolis, USA
Email: [email protected]
Importance of protein-protein interactions (PPIs)
• Crucial for understanding biological pathways such as cell signalling
• PPI dysfunction may lead to disease
• Important targets for therapy
http://nrc.bu.edu/cluster/
Angshuman Bagchi – [email protected]
Aim of the Present Research
• To extract features of PPIs from known protein-protein heterocomplex structures and use them to
predict PPIs with machine learning tools
• To build machine learning classifiers (Support Vector
Machine and Random Forest) using the training dataset
• To set up an online server to predict PPI residues
from protein sequence and structural information
• To build a web service plug-in for UCSF Chimera
to visualize the PPI residues
Overview of Support Vector Machine (SVM)
•A support vector machine (SVM) is a set of
related supervised learning methods that analyze
data and recognize patterns. SVMs are used for
classification and regression analysis.
•Given a set of training examples, each marked
as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new
examples into one category or the other.
•An SVM model is a representation of the
examples as points in space, mapped so that the
examples of the separate categories are divided
by a clear gap that is as wide as possible.
•New examples are then mapped into that same
space and predicted to belong to a category
based on which side of the gap they fall on.
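The maximum-margin idea above can be sketched in a few lines of plain Python. This is only an illustration: it trains a linear SVM by subgradient descent on the regularized hinge loss, whereas the work presented here used LibSVM. The function names, the toy data, and the hyperparameters (lam, lr, epochs) are assumptions made for the example.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=200):
    """Fit w and b by subgradient descent on the regularized hinge loss.

    Labels must be +1 or -1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks w; a smaller norm means a wider gap.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Example lies inside the gap or on the wrong side: move the boundary.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Assign a category from the side of the separating hyperplane x falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, linearly separable toy data with two centered features.
train_x = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5), (1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
train_y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(train_x, train_y)
new_labels = [predict(w, b, p) for p in [(-1.2, -1.8), (1.8, 1.2)]]
```

An RBF kernel replaces the dot product with a similarity that decays with distance, which this linear sketch does not cover.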
Overview of Random Forest (RF)
•A Random Forest (RF) is an ensemble classifier that consists of many decision
trees.
•Given a set of training examples, it generates randomized decision trees. The output
of the forest is the class that receives the most votes from the individual trees.
•RF can estimate the importance of each variable.
•It handles missing data efficiently.
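The voting scheme can be illustrated with a deliberately small forest. In this sketch each "tree" is a one-split stump trained on a bootstrap sample and one randomly chosen feature. That is a simplification of a real Random Forest, which grows full trees and draws a random feature subset at every split. All names and the toy data are assumptions for the example.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if r[feature] <= threshold else right) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left, right)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Return the class that receives the most votes across the trees."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 1 has high values for both features.
rows = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
        (0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (0.9, 0.9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
predictions = [predict_forest(forest, r) for r in [(0.05, 0.05), (0.95, 0.95)]]
```

Individual stumps trained on an unlucky bootstrap sample can vote wrongly; the majority vote is what makes the ensemble robust.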
Assumptions employed
• Surface residue: An amino acid with its accessible surface area
(ASA) > 15% of its total area
• Interface residue: A surface residue with at least one heavy atom
located within a distance of 5 Å from any of the heavy atoms of its
interacting partner
• Dataset: 274 high-resolution X-ray heterocomplex structure files
with 10597 interface residues (positives) and 27333 non-interface
surface residues (negatives) (Jo-Lan et al., Proteins, 2006)
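The two residue definitions above translate directly into code. The sketch below assumes that heavy-atom coordinates (in Å) and ASA values have already been extracted from the structure file; the function names and the example coordinates are hypothetical.

```python
import math

SURFACE_RELATIVE_ASA = 0.15  # surface residue: ASA > 15% of total area
INTERFACE_CUTOFF = 5.0       # interface residue: heavy atom within 5 Å of the partner


def is_surface(asa, total_area):
    """True when the accessible surface area exceeds 15% of the residue's total area."""
    return asa / total_area > SURFACE_RELATIVE_ASA


def is_interface(residue_atoms, partner_atoms, asa, total_area):
    """True for a surface residue with any heavy atom within 5 Å of a partner heavy atom."""
    if not is_surface(asa, total_area):
        return False
    return any(
        math.dist(a, p) <= INTERFACE_CUTOFF
        for a in residue_atoms
        for p in partner_atoms
    )


# Hypothetical heavy-atom coordinates (x, y, z) in Å.
residue = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
partner = [(5.5, 0.0, 0.0), (12.0, 0.0, 0.0)]
exposed_and_close = is_interface(residue, partner, asa=40.0, total_area=150.0)
buried = is_interface(residue, partner, asa=10.0, total_area=150.0)
```

The all-pairs distance check is quadratic in the number of atoms; for whole complexes a spatial grid or k-d tree would be the usual optimization.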
Features
• Sequence based: Obtained from sequence conservation using
PSI-BLAST
• Structure based (secondary structure, charge, solvent accessibility,
B-factor, etc.): Obtained using S-BLEST (Mooney et al., Proteins,
2005), DSSP (Kabsch & Sander, Biopolymers, 1983), and PDB files
Development of PPI predictor
The dataset was divided into the following two categories, each with an
equal number of PPI (positive) and non-PPI (negative)
examples. These balanced datasets were used for training.
Dataset
• Sequence based
• Structure based
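One common way to obtain equal numbers of positive and negative examples is random undersampling of the larger class. The slides do not say how the balancing was done, so the sketch below is an assumption, not the procedure used in this work. The function name and the placeholder examples are hypothetical; only the class sizes (10597 and 27333) come from the dataset described earlier.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Return equal-sized positive and negative lists by sampling the larger class down."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    # random.sample draws without replacement, so no example is duplicated.
    return rng.sample(positives, n), rng.sample(negatives, n)


# Placeholder examples standing in for per-residue feature vectors.
interface = [("interface", i) for i in range(10597)]
non_interface = [("non-interface", i) for i in range(27333)]
pos_balanced, neg_balanced = balance_by_undersampling(interface, non_interface)
```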
Development of PPI predictor-Continued
•The RF package in R and the LibSVM package were used to
implement separate RF and SVM predictors using each of the
aforementioned datasets with 10-fold cross-validation.
•Two SVM predictors, one using a linear kernel and the other
using a Radial Basis Function (RBF) kernel, were created from
each dataset.
•Throughout the experiments, the default values of the
regularization parameter (C) and γ for linear and RBF kernel
SVM were used.
•For RF, we generated 1000 trees, keeping the other parameters at
their default values.
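For readers unfamiliar with 10-fold cross-validation, the splitting step can be sketched as follows. This only shows how the examples are partitioned; the classifier that would be trained and tested on each split is not shown, and the function name is hypothetical. Whether the original folds were stratified by class is not stated in the slides.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each example is tested exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each model is trained on nine folds and evaluated on the held-out fold, and the reported metrics are averaged over the ten held-out folds.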
Best features ranked on the basis of their AUC
Rank  Description                                                   AUC
1     B-factor                                                      0.91
2     PSSM                                                          0.85
3     Frequency of Lys residues in a 20 amino acid sequence window  0.83
4     Solvent accessibility                                         0.80
5     Number of neighboring charged residues (Arg, Asp, Glu, Lys)   0.78
6     Acidic residue                                                0.75
7     Atomic charge                                                 0.71
8     Hydrophobicity                                                0.70
AUC: Area under the Receiver Operating Characteristic (ROC) curve
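A single feature can be scored by AUC by treating its value as the classifier score. The sketch below uses the rank interpretation of AUC: the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function name and the example values are hypothetical and are not taken from the dataset.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical feature values for three interface (1) and three non-interface (0) residues.
feature_values = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
is_interface_label = [1, 1, 1, 0, 0, 0]
feature_auc = auc(feature_values, is_interface_label)  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 means the feature ranks residues no better than chance, and 1.0 means it separates the two classes perfectly.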
Machine learning results
The dataset used is sequence based (interface residues as positives
and all non-interface surface and core residues as negatives)
Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   60.5           57.9              63.1              0.63
SVM RBF      58.9           51.6              66.3              0.59
RF           76.7           74.8              78.7              0.77
TPR = True Positive Rate, FPR = False Positive Rate
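The reported metrics follow from the four confusion-matrix counts. The sketch below is illustrative: the function name is hypothetical, and the counts are invented so that, on a balanced set of 1000 positives and 1000 negatives, they reproduce the RF sensitivity and specificity reported above. They are not the actual counts from the experiments.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity (TPR), specificity, and FPR as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": 100.0 - specificity,
    }


# Hypothetical counts on a balanced set (1000 positives, 1000 negatives).
rf_like = classification_metrics(tp=748, fn=252, tn=787, fp=213)
```

On a balanced dataset, accuracy is the mean of sensitivity and specificity, which is consistent with each row of the results tables.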
Machine learning results-continued
The dataset used is sequence based (interface residues as positives
and non-interface surface residues as negatives)
Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   53.3           22.7              83.9              0.53
SVM RBF      50.2           70.7              29.6              0.50
RF           69.3           67.3              71.3              0.70
The dataset used is structure based (interface residues as positives
and non-interface surface residues as negatives)
Method       Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC
SVM linear   57             47.1              66.6              0.57
SVM RBF      57.4           49.3              65.5              0.57
RF           70.7           66.3              75.1              0.71
Case Study
Top-scoring amino acid residues from the crystal
structure of the antibody N10-staphylococcal
nuclease complex (PDB ID: 1NSN). The backbone
of the antibody N10 is presented in black whereas
the staphylococcal nuclease is shown as surface in
cyan. The top scoring amino acid residues are
highlighted.
Conclusion
•We have developed and evaluated several classification models (RF, SVM-linear
and SVM-RBF) for identifying PPI interfaces, using either a combination of sequence- and
structure-based features or sequence-based features alone.
•The wider application of our classifier could have important consequences for the
prediction, prognosis and treatment of inherited disease states brought about by
disruption of PPI sites.
•Since we have developed a sequence-only predictor of PPI interfaces,
researchers can use our method to get a quick idea of the probable
function of a protein for which no structure is available.
•Finally, we have constructed a web resource that can be used for the prediction of
PPI sites using either sequence alone, or structure and sequence together. This
resource can be found at http://www.sblest.org/ppi
Acknowledgement