Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features
Ashraf Yaseen
Department of Mathematics & Computer Science
Central State University
Wilberforce, Ohio
12/05/2013
Yaohang Li
Computer Science Department
Old Dominion University
Norfolk, Virginia
BIOT 2013: Biotechnology and Bioinformatics Symposium
Contents
Introduction
Research Objective
Background
Method
Protein data sets
Context-based features
Neural Network model
Results
Summary
Introduction
The solvent-accessible surface area, or accessibility, of a residue is the surface area of the residue that is exposed to solvent.
Residue accessibility is a useful indicator of the residue's location: on the surface or in the core.
Figure: Surface area of a protein segment
Introduction-cont.
The DSSP program calculates the absolute solvent accessibility values of proteins.
Relative values are calculated as the ratio between the absolute solvent accessibility value and that in an extended tripeptide (Ala-X-Ala) conformation.
• This allows comparisons between the accessibility of the different amino acids in proteins.
A threshold of 0.25 defines the 2-state classification (exposed if > 0.25, buried otherwise).
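The ratio and threshold described above can be sketched in a few lines of Python. This is a minimal illustration, not the DSSP program itself: the function names are hypothetical, and the reference (Ala-X-Ala) value is passed in by the caller rather than taken from any published table.

```python
def relative_accessibility(absolute_asa: float, reference_asa: float) -> float:
    """Ratio of a residue's absolute accessibility to its accessibility
    in the extended Ala-X-Ala tripeptide (reference value supplied by caller)."""
    if reference_asa <= 0:
        raise ValueError("reference_asa must be positive")
    return absolute_asa / reference_asa


def two_state(rsa: float, threshold: float = 0.25) -> str:
    """Return 'E' (exposed) if rsa is strictly above the threshold, else 'B' (buried)."""
    return "E" if rsa > threshold else "B"


# Hypothetical numbers for illustration only (not DSSP output, not from the slides).
rsa = relative_accessibility(absolute_asa=30.0, reference_asa=120.0)
print(rsa, two_state(rsa))  # prints: 0.25 B  (exactly at the threshold counts as buried)
```

Because the rule is "exposed if > 0.25", a residue sitting exactly on the threshold is labeled buried.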
Prediction effectiveness
Residue solvent accessibility plays an important role in protein folding and in enhancing proteins’ thermodynamic and mechanical stability.
• The burial of hydrophobic residues in the core is a major driving force for folding.
Active sites of proteins are located on their surfaces.
Predicted accessibility reduces the conformational space, which aids modeling protein structures in three dimensions.
It also helps predict important protein functions.
Predicting Structural Features in Protein Modeling
Protein Modeling: Sequence → intermediate prediction steps → 3D
Correctly predicting structural features is a critical stepping stone to obtaining correct 3D models.
Protein Structural Features
Properties of the residues in proteins
Figure: Protein 1BOO Chain A
Secondary Structure: General 3D form of local segments of residues
Disulfide bond in protein chain
Surface area of a protein segment
Background
Many methods have been proposed, using different protein datasets and different computational techniques:
• Neural networks, support vector machines, nearest neighbor, information theory, and Bayesian statistics
The prediction is made in a discrete fashion.
Significant accuracy increase when using evolutionary information
• 2-state prediction accuracy of ~75% with a 0.25 threshold
PSI-BLAST derived profiles
• 2-state prediction accuracy of ~78%
Background-cont.
Structural feature prediction as classification
Each residue is predicted to be in one of a few states
Machine Learning predictor (ANN, SVM, HMM, ...) → Structural feature (state) of Ri
Residue Solvent Accessibility Prediction
• 2-state (buried or exposed)
Secondary Structure Prediction
• 3-state (helix, sheet, coil)
• 8-state (α-helix, π-helix, 3₁₀-helix, β-strand, β-bridge, turn, bend and others)
Disulfide Bonding Prediction
• Stage 1: Bonding state prediction (bonded/free)
• Stage 2: Connectivity prediction (connected, not connected)
Statement of the Problem
The improvement of prediction methods benefits from the incorporation of effective features
• Multiple sequence alignment (MSA) in machine learning
The accuracy of current prediction methods has stagnated over the past few years
• 2-state solvent accessibility ~78%
• 3-state secondary structure ~76-80%
• 8-state secondary structure ~68%
Statement of the Problem-cont.
How can we continue to improve the accuracy of predicting protein structural features toward their theoretical upper bounds?
Reducing the inaccuracy of protein structural feature prediction will be very useful in improving the efficiency of protein tertiary structure prediction
• The search space for finding a tertiary structure grows super-linearly with the fraction of inaccuracy in structural feature prediction
Our Approach
Extracting and selecting “good” features can significantly enhance prediction performance
Probably the most effective features, when predicting the structural state of a residue, are the structural states of the neighboring residues
• With the true neighboring states as features, accuracy is >90%
Figure: residue Ri and its neighbors labeled with their structural states
Secondary Structure: H = Helix, E = Sheet, C = Coil
Solvent Accessibility: B = Buried, E = Exposed
Our Approach-cont.
Unfortunately, using the true structural states as features is not feasible, since they are exactly what is being predicted
However, this suggests that the favorability of a residue adopting a certain structural state can also be an effective feature
• Statistical scores measuring the favorability of a residue adopting a certain structural state within its amino acid environment can be evaluated from the experimentally determined protein structures in the Protein Data Bank (PDB)
Our Approach-cont.
Input encoding: Sequence & evolutionary info (MSA) + Structure info (context-based scores) → Predictor → Structural feature (state) of Ri
We expect that our approach will improve the prediction of protein structural features, with the goal of achieving high accuracy levels
Method
Context-based features
• Potential scores calculated from the context-based statistics derived from the protein datasets
• Estimate the favorability of residues adopting specific structural states within their amino acid environment
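The slides do not give the formula behind the context-based scores, so the following is only a rough sketch of the general idea: count how often a residue adopts each state next to a given neighbor, and compare that with the background frequency of the state. The log-odds form, the single-neighbor context, the pseudocount, and the name `context_scores` are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def context_scores(chains, offset=1, pseudocount=1.0):
    """Toy log-odds favorability of (residue, neighbor at +offset) -> state.

    chains: list of (sequence, states) string pairs of equal length.
    Returns {(residue, neighbor, state): score}; positive means the state is
    seen more often in that context than in the background.
    """
    triple_counts = Counter()  # (residue, neighbor, state) occurrences
    pair_counts = Counter()    # (residue, neighbor) occurrences
    state_counts = Counter()   # background occurrences of each state
    for seq, states in chains:
        if len(seq) != len(states):
            raise ValueError("sequence and state strings must have equal length")
        for i, (res, state) in enumerate(zip(seq, states)):
            state_counts[state] += 1
            j = i + offset
            if 0 <= j < len(seq):
                triple_counts[(res, seq[j], state)] += 1
                pair_counts[(res, seq[j])] += 1

    n_residues = sum(state_counts.values())
    n_states = len(state_counts)
    scores = {}
    for (res, nb), total in pair_counts.items():
        for state, count in state_counts.items():
            # Smoothed probability of the state in this context vs. background.
            p_context = (triple_counts[(res, nb, state)] + pseudocount) / (
                total + pseudocount * n_states
            )
            p_background = count / n_residues
            scores[(res, nb, state)] = math.log(p_context / p_background)
    return scores


# Hypothetical two-chain "dataset" with 2-state labels (B = buried, E = exposed).
toy = [("AVAV", "EBEB"), ("AVA", "EBE")]
scores = context_scores(toy)
print(scores[("A", "V", "E")] > 0 > scores[("A", "V", "B")])  # prints: True
```

In this toy data, alanine followed by valine is always exposed, so the exposed state scores above zero in that context and the buried state scores below zero.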
Context-based Model
Context-based Statistics & Potentials
Figure: residue Ri in its context Ci, with neighboring residues X and Y
Encoding & Neural Network Model
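The encoding diagram for this slide did not survive extraction. As a hedged sketch of how sequence profile values and context-based scores might be combined into one input vector per residue, the code below uses a sliding window with zero padding at the chain ends. The window size, the padding scheme, and the function name `encode_windows` are assumptions, not details taken from the slides.

```python
def encode_windows(profile, scores, half_window=7):
    """Build one flat feature vector per residue from a sliding window.

    profile: per-residue lists of profile values (e.g. one PSSM row each).
    scores:  per-residue lists of context-based scores.
    Positions falling outside the chain are zero-padded (an assumption).
    """
    if len(profile) != len(scores):
        raise ValueError("profile and scores must cover the same residues")
    # Concatenate profile values and scores for each residue.
    rows = [list(p) + list(s) for p, s in zip(profile, scores)]
    n = len(rows)
    padding = [0.0] * (len(rows[0]) if rows else 0)

    features = []
    for i in range(n):
        vector = []
        for j in range(i - half_window, i + half_window + 1):
            vector.extend(rows[j] if 0 <= j < n else padding)
        features.append(vector)
    return features


# Hypothetical 3-residue chain with 2 profile values and 1 score per residue.
profile = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [[0.1], [0.2], [0.3]]
features = encode_windows(profile, scores, half_window=1)
print(features[1])  # prints: [1.0, 2.0, 0.1, 3.0, 4.0, 0.2, 5.0, 6.0, 0.3]
```

Each vector has (2 × half_window + 1) × (profile width + score width) entries, which would then be fed to the neural network predictor.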
Results
COMPARISON OF PREDICTION PERFORMANCE OF SOLVENT ACCESSIBILITY USING PSSM ONLY AND PSSM WITH CONTEXT-BASED SCORES ON CULL USING 7-FOLD CROSS VALIDATION

              QB       QE       Q2
PSSM Only     78.44%   80.61%   79.50%
PSSM+Score    79.21%   82.00%   80.76%
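For readers unfamiliar with the 7-fold cross validation protocol named in the caption, here is a minimal sketch of how a dataset can be partitioned so that every item is tested exactly once. Splitting at the level of whole protein chains, the shuffling seed, and the helper name `k_fold_splits` are assumptions; the slides do not describe how the folds were built.

```python
import random


def k_fold_splits(n_items, k=7, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once, then dealt round-robin into k folds so that
    fold sizes differ by at most one.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical dataset of 20 chains, identified by index.
for train, test in k_fold_splits(20, k=7):
    print(len(train), len(test))
```

The model is trained on six folds and evaluated on the seventh, and the reported accuracy is aggregated over the seven held-out folds.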
Q2 = total number of residues correctly predicted / total number of residues
QB and QE measure the quality of predicting the buried state and the exposed state, respectively

COMPARISON OF Q2 ACCURACY BETWEEN OUR AND OTHER POPULARLY USED SOLVENT ACCESSIBILITY PREDICTION SERVERS

                CASP9                   Manesh215               Carugo338
Method          Q2     QB     QE        Q2     QB     QE        Q2     QB     QE
NETASA          69.32  70.86  67.59     71.09  72.1   69.9      69.7   72.04  67.22
Sable (t=0.2)   78.47  78.27  78.69     79.83  80.2   79.4      78.68  78.48  78.91
Sable (t=0.3)   75.13  89.55  59.58     77.04  91.08  60.35     75.94  90.29  60.33
Netsurf         79.15  80.04  78.19     80.83  83.35  78.49     80.04  81.27  78.13
SPINE           77.86  83.22  72.08     80.5   85.3   74.8      79.68  85.33  73.53
ACCpro          76.18  81.15  70.81     78.87  83.19  73.76     77.99  83.12  72.41
Casa            80.82  81.46  80.13     81.93  84.27  79.14     81.14  83.65  78.39
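The Q2, QB and QE measures used in these tables can be computed from a pair of observed and predicted state strings. The sketch below follows the Q2 definition given in the slides; treating QB and QE as per-state fractions of observed residues that are predicted correctly, and skipping positions with no assigned state (such as '.'), are assumptions. The function name `accessibility_scores` is hypothetical.

```python
def accessibility_scores(observed, predicted):
    """Return (Q2, QB, QE) as percentages for two equal-length state strings.

    'B' = buried, 'E' = exposed. Positions whose observed state is neither
    (for example '.') are skipped, which is an assumption of this sketch.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    correct = {"B": 0, "E": 0}
    total = {"B": 0, "E": 0}
    for obs, pred in zip(observed, predicted):
        if obs not in total:
            continue
        total[obs] += 1
        if obs == pred:
            correct[obs] += 1

    def percent(hits, count):
        return 100.0 * hits / count if count else float("nan")

    q2 = percent(correct["B"] + correct["E"], total["B"] + total["E"])
    return q2, percent(correct["B"], total["B"]), percent(correct["E"], total["E"])


# Hypothetical short strings, not taken from any protein in the slides.
q2, qb, qe = accessibility_scores("BBEE.E", "BEEE.B")
print(q2, qb)  # prints: 60.0 50.0
```

Under this definition Q2 is a weighted average of QB and QE, which is consistent with every Q2 value in the tables lying between the corresponding QB and QE.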
Results-cont.
3NRF-A       DAVMVFARQGDKGSVSVGDKHFRTQAFKVRLVNAAKSEISLKNSCLVAQSAAGQSFRLDTVDEELTADTLKPGASVEGDAIFASEDDAVYGASLVRLSDRCK
DSSP SA2     EEB.BEBEEEEEEEEEEEEEEEEBBBBEBEBBBEBEEEBEBEEEBBBBBBEEEEEBEEEEEEEEBEEEEBEEEEEBEBEBEBBBEEEBBEEBBBBBBBEEEE
PSSM Only    EEB.BBBBEEEEBBBBEEEEEEBBBEBEBBBBEBEEEEBEBEEBBBBBBBEEEEEBEBEEBEEEBEEEBBEEEEEBEBBBBBBBEEEEBBEBEBBEBBEEBE   Q2 = 73.58
PSSM+Score   EEB.BBBEEEEEEEBEEEEEEBBBBEBEBBBBEEEEEEBEBEEBBBBBBBEEEEEBEBEEBEEEBEEEEBEEEEEBEBBBBBBBEEEBBBEBBBBEBBEEBE   Q2 = 80.19
Solvent Accessibility Prediction on protein 3NRF(A)
Working with Casa
Input a title
Input your sequence
Input your e-mail
Submit, then wait for the results
“Casa” is available at: http://hpcr.cs.odu.edu/casa
Working with Casa-cont.
Check your e-mail
Click the link provided
The results are displayed
Summary
The effectiveness of using context-based features has been demonstrated in our computational results, both in N-fold cross validation and on benchmarks, where prediction accuracy improves for secondary structure, disulfide bonding and solvent accessibility.
Web servers implementing our prediction methods are currently available:
• Dinosolve, available at: http://hpcr.cs.odu.edu/dinosolve
• C3-Scorpion, available at: http://hpcr.cs.odu.edu/c3scorpion
• C8-Scorpion, available at: http://hpcr.cs.odu.edu/c8scorpion
• Casa, available at: http://hpcr.cs.odu.edu/casa
Publications
1. Ashraf Yaseen and Yaohang Li, “Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features”, Proceedings of Biotechnology and Bioinformatics Symposium (BIOT2012), Provo, 2012.
2. Ashraf Yaseen and Yaohang Li, “Dinosolve: A Protein Disulfide Bonding Prediction Server using Context-based Features to Enhance Prediction Accuracy”. Accepted, BMC Bioinformatics 2013.
3. Ashraf Yaseen and Yaohang Li, “Template-based Prediction of Protein 8-state Secondary structures”, 3rd IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), New Orleans, April 2013.
• Accepted, BMC Bioinformatics
4. Ashraf Yaseen and Yaohang Li, “Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features”. Accepted at BIOT2013.
5. Ashraf Yaseen and Yaohang Li, “Context-based features can enhance protein secondary structure prediction accuracy”. Submitted to Bioinformatics.
6. Ashraf Yaseen and Yaohang Li, “Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units”, Journal of Parallel and Distributed Computing, 72(2): 297-307, 2012.
Acknowledgement
This work is partially supported by NSF through
grant 1066471 and ODU SEECR grant
Questions?
Thank You