
IPM-POLYTECHNIQUE-WPI
Workshop on
Bioinformatics and
Biomathematics
April 11-21, 2005
IPM
School of Mathematics
Tehran
Prediction of protein surface
accessibility based on residue
pair types and accessibility state
using dynamic programming
algorithm
R. Zarei1, M. Sadeghi2, and S. Arab3
1,2) NRCGEB, Tehran, Iran
3) IBB, University of Tehran
Proteins & structure of proteins
Prediction of protein structure
Prediction of protein accessible surface
area
Method
conclusion
Flow of information
DNA
RNA
PROTEIN SEQ
PROTEIN STRUCT
PROTEIN FUNCTION
……….
Proteins are the Machinery of life
Proteins have Structural & functional roles
in cells
No other type of biological macromolecule
could possibly assume all of the functions
that proteins have amassed over billions of
years of evolution.
Protein structure leads to protein
function
Precise placement of chemical groups allows proteins
to have:
Catalytic function
Structural role
Transport function
Regulatory function
Determining the 3-dimensional structure of
proteins is therefore important.
4 levels of protein structure
• The primary structure of proteins (a string of 20
different amino acids)
• The secondary structure of proteins (local 3-D
structure)
• The tertiary structure of proteins (global 3-D structure)
• The quaternary structure of proteins (association of
multiple polypeptide chains)
The Primary structure of
proteins
The secondary structure of proteins
• α-helices
– α-helix
– 3₁₀-helix
– π-helix
• β-sheets
– parallel
– antiparallel
• Other secondary structures
– Loops
– Coils
– Hairpin loops
– Ω loops
– Extended loops
– Random coil
The tertiary structure of proteins
• There is a wide variety of ways in which the
various helix, sheet and loop elements can
combine to produce a complete structure.
• At the level of tertiary structure, the side
chains play a much more active role in
creating the final structure.
Why predict protein structure?
Structural knowledge brings understanding of
function and mechanism of action
Protein structure is determined experimentally
by X-ray crystallography and NMR
The sequence-structure gap is rapidly
increasing.
1 000 000 known sequences, 20 000 known
structures
What is protein structure
prediction?
In its most general form
A prediction of the (relative) spatial position of
each atom in the tertiary structure generated
from knowledge only of the primary structure
(sequence)
Hypotheses of Prediction
No general prediction of 3D structure from
sequence yet.
Sequence determines structure determines
function
The 3D structure of a protein (the fold) is
uniquely determined by the specificity of the
sequence (Anfinsen, 1973)
Methods of structure prediction
Comparative (homology) modelling
Fold recognition/threading
Ab initio protein folding approaches
3D structure prediction of proteins
[Figure: prediction methods along a sequence similarity (%) axis from 0 to 100. New folds: ab initio prediction. Existing folds: threading and building by homology.]
Levels of structure prediction
1D
secondary structure, accessibility,……
2D
contact map of residues
3D
Tertiary structure
Prediction in 1D
Structure prediction in 1D projects the 3D structure
onto strings of structural assignments.
Secondary Structure prediction
Prediction of Accessible Surface Area
Prediction of Membrane Helices
What is prediction in 1D?
• Given a protein sequence (primary structure)
GHWIATRGQLIREAYEDYRHFSSECPFIP
• Assign a state to each residue
(C = coil, H = alpha helix, E = beta strand)
CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Secondary structure prediction in 1D
Less detailed results
• Only predicts the H (helix), E (extended) or C
(coil/loop) state of each residue; does not predict the
full atomic structure
Accuracy of secondary structure prediction
• The best methods have an average accuracy of
about 73% (the percentage of residues predicted
correctly)
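The accuracy measure described above (percentage of residues predicted correctly) can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the observed string is the state string from the earlier 1D example, and the predicted string is invented for the demonstration.

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Observed states: the C/H/E string from the 1D example.
observed = "CEEEEECHHHHHHHHHHHCCCHHCCCCCC"
# Hypothetical prediction (made up): the short helix at positions 22-23 is missed.
predicted = "CEEEEEC" + "H" * 11 + "C" * 11

accuracy = per_residue_accuracy(predicted, observed)
print(f"{accuracy:.1f}% of residues predicted correctly")
```

The same function applies unchanged to accessibility strings (B/I/E), since it only compares per-residue labels.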
History of 1D protein structure
prediction methods
• First generation
– How: single-residue statistics
– Accuracy: low
• Second generation
– How: segment statistics
– Accuracy: ~60%
• Third generation
– How: long-range interactions, homology based
– Accuracy: ~70%
Protein surface
Accessible Surface Area
[Figure: a solvent probe rolling over the protein, with the accessible surface, reentrant surface and van der Waals surface labeled]
The accessible surface is traced out by the center of the probe
sphere as it rolls over the protein. It is a kind of expanded
van der Waals surface.
Accessibility
• Accessibility = (ASA in folded protein) / (Maximum ASA),
where ASA is the Accessible Surface Area
• Two states: b (buried), e (exposed)
e.g. b ≤ 16%, e > 16%
• Three states: b (buried), i (intermediate), e (exposed)
e.g. b ≤ 16%, 16% < i ≤ 36%, e > 36%
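A minimal sketch of the ratio and the example thresholds above. The function names, the ASA values in the usage lines, and the choice to count a value exactly at the upper boundary (36%) as intermediate are assumptions made for illustration; the maximum ASA of a residue type must come from a reference table that is not given in these slides.

```python
def relative_accessibility(asa_folded: float, asa_max: float) -> float:
    """Accessibility (%) = ASA in the folded protein / maximum ASA."""
    if asa_max <= 0:
        raise ValueError("maximum ASA must be positive")
    return 100.0 * asa_folded / asa_max


def two_state(rel: float, threshold: float = 16.0) -> str:
    """b (buried) if at or below the threshold, otherwise e (exposed)."""
    return "b" if rel <= threshold else "e"


def three_state(rel: float, low: float = 16.0, high: float = 36.0) -> str:
    """b (buried), i (intermediate) or e (exposed)."""
    if rel <= low:
        return "b"
    if rel <= high:  # assumption: the upper boundary belongs to "i"
        return "i"
    return "e"


# Hypothetical residue: 30 square angstroms exposed out of a maximum of 120.
rel = relative_accessibility(30.0, 120.0)
print(rel, two_state(rel), three_state(rel))
```

Passing other thresholds (for example `two_state(rel, threshold=9.0)`) reproduces the alternative cut-offs without changing the code.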
Use of Solvent Accessibility
Studies of solvent accessibility in proteins have led to
many insights into protein structure, such as:
• Protein function
• Sequence motifs
• Domains
• Formulating antigenic determinants & site-directed
mutagenesis
Why Predict Solvent Accessibility?
Helpful for:
• Predicting the arrangement of secondary structure segments
in 3-D structure
• Estimating the number of protein-protein & protein-solvent
contacts of residues
• Threading procedures to find putative remote homologues
• Improving prediction of glycosylation sites
• Predicting epitopes
Problems of predicting solvent
accessibility
Prediction of solvent accessibility is less
accurate than that of secondary structure
Problem of approximating residue
accessibility (projecting surface area
onto 2 states leads to a loss of information)
The problem of how to define the threshold
ASA Calculation
• DSSP - Database of Secondary Structures for
Proteins (swift.embl-heidelberg.de/dssp)
• VADAR - Volume Area Dihedral Angle Reporter
(http://redpoll.pharmacy.ualberta.ca/vadar/)
• GetArea - www.scsb.utmb.edu/getarea/area_form.html
Other ASA sites
• Connolly Molecular Surface Home Page
http://www.biohedron.com/
• Naccess Home Page
http://sjh.bi.umist.ac.uk/naccess.html
• ASA Parallelization
http://cmag.cit.nih.gov/Asa.htm
• Protein Structure Database
http://www.psc.edu/biomed/pages/research/PSdb/
Methods of Accessibility prediction
No.  Method                                Abbr.  CC     Accuracy  Year  Scientists
1    Decision tree                         DT     0.43   71~72%    1998  Salzberg
2    Bayesian statistics                   BS     0.43   71~72%    1996  Thompson, Goldstein
3    Multiple linear regression            MLR    0.43   71~72%    2001  Li, Pan
4    Support vector machine                SVM    2~4%   79%       2002  Yuan et al.
5    Neural network                        -      2~4%   79%       1994  Rost, Sander
6    A method based on information theory  -      -      -         2001  Sadeghi et al.
PHD Prediction of rCD2
Accessibility Prediction
• PredictProtein-PHDacc (58%)
http://cubic.bioc.columbia.edu/predictprotein
• PredAcc (70%?)
http://condor.urbb.jussieu.fr/PredAccCfg.html
QHTAW...
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB
THEORY & METHOD
Data sets
A set of 230 nonredundant protein structures
from the PDB was selected from PDBSELECT to
construct the training and testing sets. The
structures have mutual sequence similarity
<25%, were determined by X-ray at 2.5 Å
resolution, and contain no chain breaks.
ASA calculation
Surface area and accessibility for the dataset
proteins were calculated with software
developed in our group
Accessibility states were defined as two states
and three states with different thresholds
Two states B and E (5%, 9%, and 16%)
Three states B, I, E (4, 9%; 9, 16%; 4, 16%)
The conformation (state) of a residue is affected by:
Short-range interactions (between near residues)
Long-range interactions (between far residues)
Most efforts have focused on the analysis
of near residues (local effects).
• Our method is based on:
• Residue type (R)
• Residue conformation (states of neighboring
residues, S & S’):
different neighboring residue types cause a residue to adopt
different states.
[Figure: tree of accessibility states (B, I, E) branching at each residue n1, n2, n3, ...; the tree has 3^n branches, where n = length of protein. Highlighted: the branch with maximum information.]
Single residue prediction
[Figure: residues n1 to n10 with single-residue states s1 to s6]
Double residue prediction
[Figure: residues n1 to n10 with states S assigned to overlapping residue pairs]
Where
P(SS’ = XX’) is the probability of occurrence of the event SS’ = XX’
P(SS’ = XX’ | RiRj) is the conditional probability
of SS’ = XX’ given that residues Ri and Rj have occurred.
The complementary event of
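The formula that these definitions belong to did not survive in this transcript. As an illustration only, the sketch below estimates the two probabilities from counts and combines them in a log-odds form, I = log[P(SS’ = XX’ | RiRj) / P(SS’ = XX’)]. That form is an assumption borrowed from common information-theoretic scoring and may differ from the authors' definition; the function name and the toy observations are also made up.

```python
import math


def pair_information(samples, residue_pair, state_pair):
    """Log-odds information that a residue pair carries about a state pair.

    samples: list of ((Ri, Rj), (S, S')) observations.
    Assumed form: log( P(state_pair | residue_pair) / P(state_pair) ).
    """
    samples = list(samples)
    total = len(samples)
    n_state = sum(1 for _, s in samples if s == state_pair)
    n_res = sum(1 for r, _ in samples if r == residue_pair)
    n_joint = sum(1 for r, s in samples if r == residue_pair and s == state_pair)
    if min(total, n_state, n_res, n_joint) == 0:
        raise ValueError("not enough observations to estimate the probabilities")
    p_state = n_state / total   # P(SS' = XX')
    p_cond = n_joint / n_res    # P(SS' = XX' | RiRj)
    return math.log(p_cond / p_state)


# Toy observations (invented): residue pair A-L is mostly buried-buried.
toy = (
    [(("A", "L"), ("B", "B"))] * 3
    + [(("A", "L"), ("E", "E"))] * 1
    + [(("K", "E"), ("E", "E"))] * 3
    + [(("K", "E"), ("B", "B"))] * 1
)
info = pair_information(toy, ("A", "L"), ("B", "B"))
print(round(info, 3))
```

A positive value means the residue pair favors the state pair more than the background does; a negative value means it disfavors it.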
Complexity & problems of the method
• Considering pairwise residue types:
– 20*20 entries
• Considering both pair residue types & pair
residue states simultaneously:
– For two states: 20*20*2 entries
– For three states: 20*20*3 entries
Note:
Because of the limited sample size we can’t analyze triplets or
more.
Problem encountered when considering
pairwise residue types & states simultaneously:
• Each residue in a window of length L is predicted L times.
For example, in a window of two residues, each
residue is predicted 2 times, and so on.
Result: ambiguity about which state should be
assigned to each residue.
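The source of this ambiguity can be shown by counting how many sliding windows cover each residue. The sketch below is a plain illustration with a hypothetical function name; it assumes windows of fixed length that slide one residue at a time along the chain.

```python
def window_coverage(n_residues: int, window_length: int) -> list:
    """Number of sliding windows of the given length that contain each residue."""
    if window_length < 1 or window_length > n_residues:
        raise ValueError("window length must be between 1 and the chain length")
    counts = [0] * n_residues
    for start in range(n_residues - window_length + 1):
        for pos in range(start, start + window_length):
            counts[pos] += 1  # this window makes one prediction for this residue
    return counts


print(window_coverage(10, 2))
print(window_coverage(10, 3))
```

Every residue away from the chain ends is covered by L windows, so it receives L possibly conflicting state predictions that have to be reconciled.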
Solution: Use of dynamic programming
Double residue prediction
[Figure: residues n1 to n10 with states S assigned to overlapping residue pairs]
Double residue prediction for long
windows
[Figure: residues n1 to n9 with states S assigned to residue pairs within longer windows]
The information content I for a sequence of length L, amino acid
types Ri and Ri+m, and
accessibility states S and S’ ∈ (E, I, B) in a window of size L is calculated as follows:
Dynamic programming algorithm
Build an optimal solution from optimal
solutions to subproblems.
Decompose a large problem into a
number of small problems. Solve the
small problems and use these to solve
the large problem.
Three basic components
The development of a dynamic programming
algorithm has three basic components:
– The recurrence relation (for defining the value of
an optimal solution);
– The tabular computation (for computing the value
of an optimal solution);
– The trace back (for delivering an optimal solution).
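The three components can be sketched for the task of choosing one accessibility state per residue so that the summed scores of adjacent state pairs are maximal. This is a generic Viterbi-style sketch under stated assumptions, not the authors' implementation: the function name, the `pair_score` callback and the toy score table are hypothetical, and in practice the scores would come from the pair statistics.

```python
def best_state_path(n_residues, states, pair_score):
    """Pick one state per residue maximizing the sum of adjacent-pair scores.

    pair_score(i, s, t): score for residue i in state s and residue i+1 in state t.
    """
    if n_residues < 1:
        raise ValueError("need at least one residue")
    # Tabular computation: best[i][t] = best score of any path ending at residue i in state t.
    best = [{s: 0.0 for s in states}]
    back = []
    for i in range(1, n_residues):
        row, ptr = {}, {}
        for t in states:
            # Recurrence: best[i][t] = max over s of best[i-1][s] + pair_score(i-1, s, t).
            prev = max(states, key=lambda s: best[i - 1][s] + pair_score(i - 1, s, t))
            row[t] = best[i - 1][prev] + pair_score(i - 1, prev, t)
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    # Trace back: start from the best final state and follow the stored pointers.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), best[-1][last]


# Toy scores (invented); any pair not listed scores 0.
toy_scores = {
    (0, "B", "B"): 2.0, (0, "E", "E"): 1.0,
    (1, "B", "I"): 3.0, (1, "E", "E"): 1.0,
    (2, "I", "E"): 2.0, (2, "E", "E"): 1.0,
}
path, score = best_state_path(
    4, ("B", "I", "E"), lambda i, s, t: toy_scores.get((i, s, t), 0.0)
)
print(path, score)
```

The table has one row per residue and one column per state, so the work grows linearly with chain length instead of enumerating all 3^n state assignments.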
Three-state accessibility for a
two-residue window
[Figure: dynamic programming lattice over residues n1 to n6. Each pair of adjacent residues (n1n2, n2n3, n3n4, ...) can take one of nine state pairs: EE, EB, EI, BE, BB, BI, IE, IB, II.]
Results & discussion
Two states accuracy (%)
Threshold   Window length
            2      3      4      5      6      7
5%          66.77  68.51  69.34  70.2   70.96  71.93
9%          68.2   69.37  70.22  71.29  71.34  72.1
16%         65.2   66.37  66.42  67.29  67.34  68.3
Two states accuracy
[Figure: accuracy (%) versus window length (2 to 7) for thresholds 5%, 9% and 16%; vertical axis from 60 to 74]
Three states accuracy (%)
Thresholds  Window length
            2      3      4      5      6      7
4, 9%       63.81  64.21  64.56  65.3   65.8   66.18
9, 16%      64.79  65.54  66.74  67.36  68.15  69.3
4, 16%      62.79  63.54  63.74  64.26  64.85  65.1
Three states accuracy
[Figure: accuracy (%) versus window length (2 to 7) for thresholds 4, 9%; 9, 16%; and 4, 16%; vertical axis from 58 to 70]
Suggestions
• Taking longer windows is expected to increase
prediction accuracy
• Analysis and scoring of amino acid pairs
by other statistical methods such as
Markov chains
• Using larger data sets and analysis of
amino acid triplets (8000 × 27 states)