Transcript Slide 1
Prediction of Protein Inter-Domain Linkers Using
Compositional Index and Simulated Annealing
Maad Shatnawi and Nazar Zaki
College of Information Technology
United Arab Emirates University (UAEU)
UAE
Nazar Zaki
Amsterdam, The Netherlands, July 06-10, 2013
Outline
• Introduction
• Existing methods
• Proposed solution
• Method
– Compositional index
– SA optimization
• Experimental results
• Conclusion and future directions
Introduction
• Proteins have two types of segments: domains
and linkers
• Predicting inter-domain linkers is very important
– Accurate identification of functional domains
– Less computational cost
– Supports protein classification, PPI prediction, fold prediction, transmembrane prediction, etc.
Existing methods
Approach | Extracted Features | Technique/Tools | Weaknesses
DomCut (Suyama and Ohara 2003) | Linker index | Linker index profile | Information contained in the linker index is not sufficient; lack of biological knowledge input
Scooby-Domain (George et al. 2005) | Domain lengths and hydrophobicities | A*-search | A* search suffers from exponential computational time complexity
FIEFDom (Bondugula et al. 2009) | PSSM | FMO | Did not address the issue of predicting domains with non-contiguous sequences and therefore discarded such proteins
DROP (Ebina et al. 2011) | Secondary structures; PSSM elements of hydrophilic residues and prolines | Random forest, SVM | Random forest can possibly be trapped in local minima and suffers from over-prediction
Proposed solution
• Our approach consists of
two main steps:
– Calculation of the
compositional index
– Employing Simulated
Annealing to refine the
prediction
Compositional index
Calculate the averaged compositional index values
Domain linker (12-35), L_k = 46, threshold = 0, w = 5 to 19
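The windowed averaging step can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not give the formula for the per-residue index, so a DomCut-style log ratio, -ln(f_linker / f_domain), is assumed here, and the frequency tables are hypothetical toy values, not statistics from any dataset.

```python
import math

def compositional_index(linker_freq, domain_freq):
    """Per-amino-acid index, assumed form: -ln(f_linker / f_domain)."""
    return {aa: -math.log(linker_freq[aa] / domain_freq[aa]) for aa in linker_freq}

def windowed_average(values, w):
    """Average each position over a centred window of odd width w.
    The window is truncated at the two ends of the sequence."""
    half = w // 2
    averaged = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        averaged.append(sum(window) / len(window))
    return averaged

# Hypothetical toy frequencies over a three-letter alphabet (assumption).
linker_freq = {"A": 0.2, "P": 0.5, "L": 0.3}
domain_freq = {"A": 0.2, "P": 0.2, "L": 0.6}

index = compositional_index(linker_freq, domain_freq)
profile = windowed_average([index[aa] for aa in "APLLA"], w=3)
```

Under this assumed sign convention, residues that are more frequent in linkers than in domains receive negative index values, so low stretches of the averaged profile are the linker candidates.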
Compositional index (Illustration)
>1LGH:B
(AERSLSGLTEEEAIAVHDQFKTTFSAFIILAAVAHVLVWVWKPWF)
• Window size 5.
Compositional index (Illustration)
Dynamic threshold is needed
Why Simulated Annealing (SA)?
• A protein sequence is seen as a set of sequence chunks.
• Each chunk has its own dynamic threshold value.
• This is a search problem over a set of dynamic threshold values.
• In other terms: partitioning a given set of positive real numbers into k subsets (k is unknown) so as to maximize an objective function.
• SA is known to be well adapted to partitioning problems.
• An intuitive customization is straightforward.
SA Customization
• SA is a probabilistic search method for the global optimization of a given function in a large search space.
• It is inspired by annealing, the heating and controlled cooling of a metal to increase the size of its crystals and reduce their defects.
• It can avoid being trapped in local optima.
• SA algorithms usually do better than greedy algorithms on problems that have numerous locally optimal solutions.
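The escape from local optima comes from the acceptance rule. Below is a minimal Python sketch of the standard Metropolis criterion for a maximization problem; the function name `accept` and the temperatures used in the demonstration are illustrative assumptions, not values taken from the slides.

```python
import math
import random

def accept(delta, temperature, rng=random):
    """Metropolis rule for maximisation.
    delta is (candidate score - current score). Improvements are always
    accepted; a worsening move is accepted with probability
    exp(delta / temperature), which shrinks as the temperature falls."""
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

# Fraction of accepted worsening moves (delta = -1) at two temperatures.
rng = random.Random(0)
hot = sum(accept(-1.0, 10.0, rng) for _ in range(10_000)) / 10_000
cold = sum(accept(-1.0, 0.01, rng) for _ in range(10_000)) / 10_000
```

At a high temperature most worsening moves are accepted, so the search explores; at a low temperature almost none are, so the search behaves like a greedy climber.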
[Figure: SA versus a greedy algorithm. Labels: "Initial position of the ball"; "Simulated Annealing explores more. Chooses this move with a small probability (Hill Climbing)"; "Greedy Algorithm gets stuck here! Locally Optimum Solution."; "Upon a large no. of iterations, SA converges to this solution."]
SA Optimization
• Divide each protein sequence into segments. The segment size was set to the average linker size among the dataset.
• Start from a random threshold value for each segment (starting 0.1)
• Calculate the AA compositional index of the input protein sequence.
• Classify each AA as linker or domain according to its compositional index value with respect to the corresponding segment threshold.
• Calculate recall and precision.
• Randomly increase or decrease the threshold value of a segment.
• SA accepts or rejects the transition in order to maximize both the recall and precision of the linker segment prediction.
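The steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation, and several choices are assumptions: a residue is labelled linker when its index falls below its segment threshold, the objective is recall plus precision counted per residue, cooling is geometric, the step size is 0.1, and every segment starts at 0.1. The profile and labels are hypothetical toy data.

```python
import math
import random

def classify(profile, thresholds, seg_size):
    """Label residue i as linker (True) when its index is below
    the threshold of the segment it belongs to (assumed convention)."""
    return [v < thresholds[i // seg_size] for i, v in enumerate(profile)]

def recall_precision(predicted, actual):
    """Per-residue recall and precision of the linker class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    n_actual, n_pred = sum(actual), sum(predicted)
    recall = tp / n_actual if n_actual else 0.0
    precision = tp / n_pred if n_pred else 0.0
    return recall, precision

def anneal_thresholds(profile, actual, seg_size, steps=2000,
                      t0=1.0, cooling=0.995, step=0.1, seed=0):
    """Search for one threshold per segment with simulated annealing."""
    rng = random.Random(seed)
    n_seg = math.ceil(len(profile) / seg_size)
    thresholds = [0.1] * n_seg  # assumed starting value for every segment
    score = sum(recall_precision(classify(profile, thresholds, seg_size), actual))
    best, best_score = list(thresholds), score
    temp = t0
    for _ in range(steps):
        # Randomly raise or lower the threshold of one segment.
        cand = list(thresholds)
        cand[rng.randrange(n_seg)] += rng.choice((-step, step))
        cand_score = sum(recall_precision(classify(profile, cand, seg_size), actual))
        delta = cand_score - score
        # Metropolis rule: keep improvements, sometimes keep worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            thresholds, score = cand, cand_score
            if score > best_score:
                best, best_score = list(thresholds), score
        temp *= cooling  # geometric cooling (assumption)
    return best, best_score

# Hypothetical toy profile: the linker (positions 4-7) has index values
# above those of the following domain, so no single threshold separates them.
profile = [0.9, 0.8, 0.9, 0.7, 0.3, 0.35, 0.3, 0.25, 0.2, 0.15, 0.2, 0.18]
actual = [False] * 4 + [True] * 4 + [False] * 4
best, best_score = anneal_thresholds(profile, actual, seg_size=4)
```

The toy profile is built so that a per-segment threshold can separate the linker while a single global threshold cannot, which is the situation the dynamic threshold is meant to handle.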
[Figure: Optimal threshold values for the XYNA_THENE protein sequence in the DomCut dataset, which contains 133 AAs. x-axis: Amino Acid (0 to 500); y-axis: Threshold (-5 to 5).]
Evaluation Measures
• Recall is the proportion of correctly predicted linkers
to all of the structure-derived linkers listed in the
dataset
• Precision is defined as the proportion of correctly
predicted linkers to all of the predicted linkers
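In symbols, the two definitions can be written as below. The names TP, FN, and FP are introduced here for illustration: TP counts correctly predicted linkers, FN counts structure-derived linkers that were missed, and FP counts predicted linkers that are not in the dataset.

```latex
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
```

As a hypothetical example, if 8 of 10 structure-derived linkers are found among 16 predictions, then TP = 8, FN = 2, and FP = 8, giving a recall of 8/10 = 0.8 and a precision of 8/16 = 0.5.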
Experimental Results
Datasets
Experimental Results
Applying the proposed method on Dataset (1)
[Figure: Recall and precision on Dataset (1); y-axis from 0 to 0.9. Series: Proposed Method (average linker size 36 AA); Proposed Method (average linker size 18 AA); DomCut (Threshold = -0.09).]
Experimental Results
Applying the proposed method on Dataset (2)
[Figure: Recall and precision on Dataset (2); y-axis from 0 to 0.7. Series: Proposed method; DROP; DROP-SD5.0; DROP-SD8.0; SVM-PeP; SVM-SD3.0; SVM-SD2.0; SVM-Af; Random.]
Conclusion
• We examined the amino acid compositional index to predict
protein inter-domain linker segments from amino acid
sequence information.
• We employed simulated annealing to improve the prediction
by finding the optimal set of threshold values that separate
domains from linker segments.
• Experimental results show that the proposed method outperformed the currently available approaches for inter-domain linker prediction in terms of recall and precision.
Conclusion
• This work can be extended by examining different sliding window sizes in computing the AA compositional index.
• Additional SA parameter tuning and the use of dynamic segment sizes can be explored.
• Combining the compositional index with other features, such as PSSM, AA physicochemical properties, and hydrophobicity, can be examined.
Thank you