Protein-Protein Interaction Hotspots Carved
into Sequences
Yanay Ofran1,2, Burkhard Rost1,2,3
1. Department of Biochemistry and Molecular Biophysics, Columbia University
2. Columbia University Center for Computational Biology and
Bioinformatics (C2B2)
3. NorthEast Structural Genomics Consortium (NESG), Columbia University
Presented by
Navreet Virk
CISC-841
Background
Protein-Protein Interactions (PPI)
• Refer to associations of protein molecules
• Important for almost all biological processes
E.g. signals from the exterior of a cell are
relayed to its interior by PPIs of
signaling molecules
• Understanding biological mechanisms requires
knowledge of PPIs and the underlying molecular
principles
Protein-Protein Interface Residues
• Studies have discovered networks of PPI in
cells and identified pairs of interacting
proteins
• To understand PPI, the residues that account
for binding of proteins and stabilizing
complexes need to be identified i.e. the
interface residues
• Prediction methods have mainly focused on
entire interfaces of 2 interacting proteins
Hotspots
• Only a few of the interface residues are
essential for recognition and binding to other
proteins
• This small subset of essential residues is
referred to as “hotspots”
• If mutated, these residues hamper the
protein-protein interaction
Significance of Hotspots
• For large interfaces, fewer than 5% of the interface
residues contribute most of the total binding
energy
• For small interfaces, it could be a single amino acid
on each protein
• Identification of these residues may benefit
function prediction
• Hotspots predictions assist in identifying
binding sites for drugs or desired drug targets
Protein-protein Interfaces, Hotspots and Predictions
(A) Human growth hormone
(yellow) bound to the
extracellular portion of its
receptor
(B) The chains of receptor are
201 residues long. PPI
interface covers 31 of these
residues (blue and red) on
each chain. Mutating one of
6 red residues hampers the
interaction.
(C) The prediction method ISIS
captures 5 of the interface
residues (green)
Hotspot Prediction
Identifying Hotspots
• Given 3-D structure of complex, the residues
crucial for binding can be predicted
• Less than 1% of 3-D structures are known for
interacting pairs
• More efficient method: alanine scans
Experimentally mutate residues to alanine
and measure the effect of each substitution on
the interaction
• Method theoretically debatable
Current methods to predict hotspots
• Currently no method identifies hotspots
• Methods that identify all interface residues
have good accuracy and low coverage
Accuracy = TP / (TP + FP)
Coverage = TP / (TP + FN)
• Authors gave a hypothesis for low coverage of
prediction methods
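The two measures can be made concrete with a minimal sketch. It assumes that accuracy means the fraction of predicted residues that are truly in the interface, and coverage means the fraction of true interface residues that are recovered. The function names and the counts are illustrative and are not taken from the paper.

```python
def accuracy(tp: int, fp: int) -> float:
    """Fraction of predicted residues that are true interface residues."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def coverage(tp: int, fn: int) -> float:
    """Fraction of true interface residues that the method recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for a cautious predictor: it labels only 10 residues,
# 9 of them correctly, out of 180 true interface residues.
tp, fp, fn = 9, 1, 171
print(f"accuracy = {accuracy(tp, fp):.2f}")  # 0.90
print(f"coverage = {coverage(tp, fn):.2f}")  # 0.05
```

A predictor of this kind is rarely wrong about what it reports but leaves most of the interface unreported, which is the pattern the hypothesis sets out to explain.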
Hypothesis
• The residues that are missed by prediction
methods are more similar to the general
population of surface residues than to the
essential residues
• Machine learning algorithm trained on all
protein-protein interfaces discards non-hotspot
residues as noise
• Only hotspots are identified as signal to be
learned
Test of hypothesis using ISIS
• Authors applied the ISIS method to the task of
identifying hotspots
ISIS: Interaction Sites Identified from Sequence
• Knowledge based method developed to
identify interface residues from sequence
• Based on system of neural networks
• Takes as input:
- sequence environment of residues
- evolutionary profile of residues in the window
- predicted solvent accessibility
- predicted secondary structure
- predicted conservation score
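The first input, the sequence environment, can be sketched as a sliding window around each residue. This is only an illustration of the idea, not the ISIS implementation: the window half-width of 4, the padding character, the one-hot encoding and all function names are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window(sequence: str, center: int, half_width: int = 4, pad: str = "-") -> str:
    """Return the residues around `center`, padded past the sequence ends."""
    chars = []
    for i in range(center - half_width, center + half_width + 1):
        chars.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(chars)


def one_hot(residue: str) -> list[int]:
    """20-dimensional indicator vector; padding maps to all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def encode_window(sequence: str, center: int, half_width: int = 4) -> list[int]:
    """Concatenate the one-hot vectors of every position in the window."""
    features: list[int] = []
    for residue in window(sequence, center, half_width):
        features.extend(one_hot(residue))
    return features


seq = "MKTAYIAKQR"  # hypothetical sequence
print(window(seq, 0))              # ----MKTAY
print(len(encode_window(seq, 0)))  # 180 (9 positions x 20 amino acids)
```

In a fuller feature vector, the evolutionary profile and the predicted structural features listed above would be appended per window position in the same way.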
Overlap between two subsets
• 90% accuracy and 5% coverage
• Results of ISIS confirmed the hypothesis
• There was a surprising overlap between:
- the subset of residues confirmed to be
hotspots by experimental alanine mutation
- the subset of residues predicted by ISIS to
be protein-protein interface residues
Results
ISIS prediction method to identify hotspots
• Dataset - ASEdb database of experimental
alanine scans
• Used 296 point mutations from 30 proteins
• Used 2 representative examples initially
• Compared experimental results of alanine
scans and residues predicted by ISIS
• No 3-D structures of the proteins were used
Example 1: HIV gp120/ CD4 Receptor Complex
• Ashkenazi et al. sequentially
mutated many residues in the
V1 domain of the CD4 receptor
• 25 positions within 94
residues were known to
substantially affect the affinity
of CD4
• ISIS predicted 30 residues
• 19 of them were also
identified experimentally
• Of the 6 missed, 4 were next
to predicted residues
• 5 of the predicted residues were not
experimentally mutated
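The counts on this slide can be turned into rough performance figures. The sketch below is derived arithmetic, not a result reported by the authors; it assumes accuracy is the fraction of predicted residues that are experimental hotspots and coverage is the fraction of experimental hotspots that were predicted. The variable names are illustrative.

```python
known_hotspots = 25         # positions known to affect CD4 affinity
predicted = 30              # residues predicted by ISIS
predicted_and_hotspot = 19  # predictions that match experimental hotspots
predicted_not_mutated = 5   # predictions with no experimental data

# Hotspots that ISIS did not predict.
missed = known_hotspots - predicted_and_hotspot

# Fraction of experimental hotspots recovered.
coverage = predicted_and_hotspot / known_hotspots

# Accuracy over all predictions, counting untested residues as wrong.
accuracy_all = predicted_and_hotspot / predicted

# Accuracy over only those predictions that were actually mutated.
tested = predicted - predicted_not_mutated
accuracy_tested = predicted_and_hotspot / tested

print(missed)                     # 6
print(f"{coverage:.2f}")          # 0.76
print(f"{accuracy_all:.2f}")      # 0.63
print(f"{accuracy_tested:.2f}")   # 0.76
```

The gap between the two accuracy values shows how much the estimate depends on how untested residues are counted.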
Example 2: Voltage-Gated Potassium Channel
• A membrane protein which helps control
voltage gradient across plasma membrane
• Within region of 29 consecutive residues, 8
had significant effect on affinity of channel to
its inhibitors agitoxin2 and charybdotoxin
• ISIS predicted 13 residues including 7 of the 8
experimentally found hotspots
• Residue missed was buried in structure and
didn’t have a direct effect on binding
• 5 of the 6 predicted residues that did not coincide with
hotspots had a significant but less dramatic effect than the others
Performance over Entire Dataset
• Almost all binding residues predicted by ISIS
were experimentally found to be significant
• 90% of negative predictions were
experimentally confirmed
• By choosing different points on the curve, accuracy (true
positives / predicted positives) can be increased at the
cost of coverage (true positives / observed
positives)
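The trade-off can be illustrated by sweeping a decision threshold over per-residue prediction scores. This is a minimal sketch: the scores, labels, thresholds and the function name `sweep` are all hypothetical, and accuracy and coverage are assumed to mean TP / (TP + FP) and TP / (TP + FN) respectively.

```python
def sweep(scores, labels, thresholds):
    """Return (threshold, accuracy, coverage) for each threshold."""
    rows = []
    for t in thresholds:
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= t and y)
        fp = sum(1 for s, y in pairs if s >= t and not y)
        fn = sum(1 for s, y in pairs if s < t and y)
        # Accuracy is undefined when nothing is predicted positive.
        acc = tp / (tp + fp) if (tp + fp) else float("nan")
        cov = tp / (tp + fn) if (tp + fn) else float("nan")
        rows.append((t, acc, cov))
    return rows


# Hypothetical scores and experimental labels (True = hotspot).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, True, False, False]

for t, acc, cov in sweep(scores, labels, [0.85, 0.65, 0.35]):
    print(f"threshold {t:.2f}: accuracy {acc:.2f}, coverage {cov:.2f}")
# threshold 0.85: accuracy 1.00, coverage 0.50
# threshold 0.65: accuracy 0.75, coverage 0.75
# threshold 0.35: accuracy 0.67, coverage 1.00
```

Raising the threshold keeps only the most confident predictions, so accuracy rises while coverage falls.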
Discussion
Hotspots are easy to identify but hard to define
• Neural networks were able to disregard most
of the residues observed in interfaces, i.e. the pupil
(neural network) ignored the teacher (labeled
data)
• NNs identified a set of complex nonlinear
correlations between input features and
hotspot residues
• Not possible to translate subtle and complex
dependencies into simple explanations
• Authors inferred which features are more or
less relevant
Training Neural Networks on Different Parameters
• NNs trained on sequence environment alone
performed slightly better than random
• Adding evolutionary information significantly
improved performance on both interface
residues and hotspots
• Conservation of hotspots is marginally
different from other residues
• Although the difference is marginal,
performance improved significantly
• NNs learned to distinguish between
conservation that is indicative of hotspots and
conservation that is not
Sequence Conservation
• X-axis is conservation score
of residues
• Y-axis gives the fraction of
residues within a given level
of conservation
• Compares the distribution of
residue conservation between
entire dataset, hotspots and
residues with no effect
• Differences are marginal but
overall effect of conservation
is substantial
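The kind of distribution plotted in this figure can be computed as follows. The sketch assumes conservation scores are binned into integer levels; the scores for each group and the function name are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter


def conservation_distribution(scores):
    """Fraction of residues at each conservation level (y-axis of the plot)."""
    counts = Counter(scores)
    total = len(scores)
    return {level: counts[level] / total for level in sorted(counts)}


# Hypothetical integer conservation levels for two groups of residues.
hotspots = [7, 8, 8, 9, 6, 8, 7, 9]
no_effect = [5, 6, 7, 8, 6, 7, 5, 8]

print(conservation_distribution(hotspots))
# {6: 0.125, 7: 0.25, 8: 0.375, 9: 0.25}
print(conservation_distribution(no_effect))
# {5: 0.25, 6: 0.25, 7: 0.25, 8: 0.25}
```

Plotting these fractions per group against the conservation level gives curves of the kind the slide compares.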
Results of different methods to predict hotspots
• Hydrophobic moment method
represents the approach that
relies on local
physicochemical features
• Knowledge based method
using amino acid info for
sequence only approach
• Evolutionary Trace method
for evolutionary conservation
• ISIS for combination of all of
them
• ProMate for representing
method using known 3-D
structure
What does it take to predict hotspots?
• ISIS and ProMate were most successful
• A linear combination of features is not enough
to predict hotspots
• Results show that complex combination of:
- physicochemical
- evolutionary
- structural
features is required to predict hotspots
How Hotspots Differ from Other Residues?
• ISIS was trained on large dataset to create
large datasets of predicted hotspots and
predicted non-hotspots
• Characteristics of hotspots (positive values)
and non-hotspots (negative values) were
compared
• Multiple sequence alignment is done for each
protein
• For each interface residue, the average
occupancy of its position by each type of amino
acid is calculated
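The occupancy calculation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the alignment is a list of equal-length strings, gaps are ignored, and the toy alignment, the chosen columns and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_occupancy(alignment, column):
    """Fraction of aligned sequences with each amino acid at one position."""
    residues = [row[column] for row in alignment if row[column] in AMINO_ACIDS]
    counts = Counter(residues)
    n = len(residues)
    return {aa: (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS}


def average_occupancy(alignment, columns):
    """Average the per-position occupancy over a set of positions."""
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for col in columns:
        occ = column_occupancy(alignment, col)
        for aa in AMINO_ACIDS:
            totals[aa] += occ[aa]
    k = len(columns)
    return {aa: (totals[aa] / k if k else 0.0) for aa in AMINO_ACIDS}


# Toy alignment; columns 2 and 4 play the role of predicted hotspots.
alignment = ["MKWLY", "MRWLF", "MKW-Y", "MKYLY"]
hotspot_columns = [2, 4]

avg = average_occupancy(alignment, hotspot_columns)
print(avg["W"], avg["Y"], avg["F"])  # 0.375 0.5 0.125
```

Running the same function over hotspot positions and over the remaining interface positions gives two occupancy profiles that can be compared amino acid by amino acid.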
Position Occupancy in Hotspots versus Rest of the Interface
• the p-value is the probability of
obtaining a result at least as
extreme as a given data point,
under the null hypothesis
• for some amino acids there are
significant differences between
hotspot and non-hotspot
interface residues, while for
others there are no substantial
differences
• Experimental values are close
to values obtained for predicted
hotspots
Secondary Structure Comparison of Hotspots and Non-hotspots
Table comparing structural
features
• The secondary structure state
of 39% of non-hotspot
interface residues is loop.
• In predicted hotspots, 57% of
residues are in a loop state.
• In both categories, rest of the
residues were divided
roughly equally between
helices and strands
• There is again a similarity
between experimentally
determined and predicted
values
Conclusion
• Alanine scans indeed capture some genuine physicochemical
commonalities of interaction hotspots (as shown by Baker Lab)
that could be identified by a general method
• While in silico alanine scans require a high-resolution structure of the protein
complex, ISIS needs only the sequence of a single chain, regardless of
its binding partner
• In silico alanine scans produce numerical predictions in terms of binding
energy (∆∆G); ISIS produces a binary prediction (hotspot/non-hotspot)
• For similar levels of positive accuracy, coverage of ISIS is roughly
half that of in silico alanine scanning
• Thus when 3-D structure of complex is available, the in silico
alanine scan is a powerful tool for identifying hotspots whereas
when only sequence is available, ISIS can provide accurate
predictions
• Analyzing a single protein using ISIS typically requires a few
minutes, allowing large-scale analysis of hotspots at a
relatively small CPU cost.
• Server for prediction available at
http://www.rostlab.org/services/isis