Protein secondary structure


Structure Prediction (I):
Secondary structure
DNA/Protein structure-function analysis and prediction
Lecture 8
Victor A. Simossis
Center for Integrative Bioinformatics VU
Faculty of Sciences
First two levels of protein structure

[Figure: protein primary structure, a chain of the 20 amino acid types, with a generic residue and the peptide bond indicated, alongside protein secondary structure.]
SARS protein from Staphylococcus aureus

  1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV
 31 DMTIKEFILL TYLFHQQENT LPFKKIVSDL
 61 CYKQSDLVQH IKVLVKHSYI SKVRSKIDER
 91 NTYISISEEQ REKIAERVTL FDQIIKQFNL
121 ADQSESQMIP KDSKEFLNLM MYTMYFKNII
151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL
181 IETIHHKYPQ TVRALNNLKK QGYLIKERST
211 EDERKILIHM DDAQQDHAEQ LLAQVNQLLA
241 DKDHLHLVFE

[Figure: the corresponding structure cartoon, with the alpha-helices and beta strands/sheet highlighted.]
SARS protein from Staphylococcus aureus, with the secondary structure assignment under each sequence block:

  1 MKYNNHDKIR DFIIIEAYMF RFKKKVKPEV DMTIKEFILL TYLFHQQENT
    SHHH       HHHHHHHHHH HHHHHHTTT  SS HHHHHHH HHHHS S SE
 51 LPFKKIVSDL CYKQSDLVQH IKVLVKHSYI SKVRSKIDER NTYISISEEQ
    EEHHHHHHHS SS GGGTHHH HHHHHHTTS  EEEE SSSTT EEEE HHH
101 REKIAERVTL FDQIIKQFNL ADQSESQMIP KDSKEFLNLM MYTMYFKNII
    HHHHHHHHHH HHHHHHHHHH HTT SS S   SHHHHHHHH  HHHHHHHHHH
151 KKHLTLSFVE FTILAIITSQ NKNIVLLKDL IETIHHKYPQ TVRALNNLKK
    HHH SS HHH HHHHHHHHTT TT EEHHHH  HHHSSS HHH HHHHHHHHHH
201 QGYLIKERST EDERKILIHM DDAQQDHAEQ LLAQVNQLLA DKDHLHLVFE
    HTSSEEEE S SSTT EEEE  HHHHHHHHH  HHHHHHHHTS SS TT SS
Why predict when we can get the real thing?

UniProt Release 1.2 consists of:
• Swiss-Prot Release: 143790 protein sequences
• TrEMBL Release: 1075779 protein sequences
• PDB structures: 24168 protein structures

Secondary structure is derived from tertiary coordinates. To get to tertiary structure we need NMR or X-ray crystallography. We have an abundance of primary sequences, so why not use them?

• Primary structure: no problems
• Secondary structure: overall 77% prediction accuracy
• Tertiary structure: overall 30% prediction accuracy
• Quaternary structure: no reliable means of prediction yet
• Function: do you feel like guessing?
Some SSE rules that help

ALPHA-HELIX: hydrophobic-hydrophilic residue periodicity patterns.
BETA-STRAND: edge and buried strands; hydrophobic-hydrophilic residue periodicity patterns.
OTHER: loop regions contain a high proportion of small polar residues such as alanine, glycine, serine and threonine. Glycine is abundant because of its flexibility; proline occurs for entropic reasons related to the rigidity it imposes by kinking the main chain. Because proline kinks the main chain in a way that is incompatible with helices and strands, it is normally not observed in these two structures, although it can occur in the N-terminal two positions of α-helices.

[Figure: edge and buried β-strands in a sheet.]
Historical background

The use of computers for predicting protein secondary structure had its onset about 30 years ago (Nagano (1973) J. Mol. Biol., 75, 401), working on single sequences. The accuracy of the computational methods devised early on was in the range 50-56% (Q3). The highest accuracy was achieved by Lim, with a Q3 of 56% (Lim, V. I. (1974) J. Mol. Biol., 88, 857). The most widely used method was that of Chou and Fasman (Chou, P. Y. and Fasman, G. D. (1974) Biochemistry, 13, 211).

Random prediction would yield about 40% (Q3) correctness, given the observed distribution of the three states H, E and C in globular proteins (generally about 30% helix, 20% strand and 50% coil).

Nagano 1973 – Interactions of residues in a window of 6. The interactions were linearly combined to calculate interacting-residue propensities for each SSE type (H, E or C) over 95 crystallographically determined protein tertiary structures.

Lim 1974 – Predictions are based on a set of complicated stereochemical prediction rules for α-helices and β-sheets, based on their observed frequencies in globular proteins.

Chou-Fasman 1974 – Predictions are based on differences in residue-type composition for three states of secondary structure: α-helix, β-strand and turn (i.e., neither α-helix nor β-strand). Neighbouring residues were checked for helices and strands, the predicted type was selected according to the higher-scoring preference, and segments were extended as long as residues rarely observed in that state (e.g. proline) were not encountered and the scores remained high.
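To make the propensity idea concrete, here is a minimal Python sketch of Chou-Fasman-style window scoring. The propensity values, window size and threshold below are illustrative placeholders rather than the published parameters, and the original nucleation-and-extension logic is reduced to a per-residue comparison:

```python
# Sketch of Chou-Fasman-style propensity scoring.
# The propensity tables below are illustrative placeholders,
# NOT the published Chou-Fasman parameters.

HELIX_PROPENSITY = {"A": 1.42, "E": 1.51, "L": 1.21, "P": 0.57, "G": 0.57}
STRAND_PROPENSITY = {"V": 1.70, "I": 1.60, "Y": 1.47, "P": 0.55, "G": 0.75}

def mean_propensity(fragment, table):
    """Average propensity of a fragment; unlisted residues count as neutral (1.0)."""
    return sum(table.get(aa, 1.0) for aa in fragment) / len(fragment)

def predict(sequence, window=6, threshold=1.05):
    """Assign H/E/C per residue by comparing windowed helix and strand propensities."""
    states = []
    for i in range(len(sequence)):
        frag = sequence[max(0, i - window // 2): i + window // 2 + 1]
        h = mean_propensity(frag, HELIX_PROPENSITY)
        e = mean_propensity(frag, STRAND_PROPENSITY)
        if max(h, e) < threshold:
            states.append("C")  # neither state is favoured: coil
        else:
            states.append("H" if h >= e else "E")
    return "".join(states)

print(predict("MKYNNHDKIRDFIIIEAYMF"))
```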
The older standard: GOR

The GOR method (version IV) was reported by its authors to achieve a single-sequence prediction accuracy of 64.4%, as assessed through jackknife testing over a database of 267 proteins with known structure (Garnier, J. G., Gibrat, J.-F. and Robson, B. (1996) In: Methods in Enzymology (Doolittle, R. F., Ed.) Vol. 266, pp. 540-553).

The GOR method relies on the frequencies observed for residues in a 17-residue window (i.e. eight residues N-terminal and eight residues C-terminal of the central window position) for each of the three structural states.
The sliding window: GOR

[Figure: a constant window of n residues slides along a sequence of known structure (annotated H H H E E E E); the frequencies of the residues in the window are converted to probabilities of observing each SSE type at the central residue.]
The amino acid frequencies are converted to secondary structure propensities for the central window position using an information function based on conditional probabilities. As it is not feasible to sample all possible 17-residue fragments directly from the PDB (there are 20^17 possibilities), increasingly complex approximations have been applied.

In GOR I and GOR II, the 17 positions in the window were treated as independent, so single-position information could be summed over the 17-residue window.

In GOR III, this approach was refined by including pair frequencies derived from the 16 pairs between each non-central residue and the central residue in the 17-residue window.

The current version, GOR IV, combines pair-wise information over all possible paired positions in the window.
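A minimal sketch of the GOR I independence assumption follows. The per-position information values below are fabricated toy numbers, not the real parameter tables (which are derived from a structure database); the point is the summation of single-position information over the 17-residue window:

```python
# Sketch of GOR I scoring: information values for (state, residue, offset)
# triples are summed over the 17-residue window. The table is a fabricated
# toy example, not real GOR parameters.

# INFO[(state, residue, offset)] ~ log(P(state | residue at offset) / P(state))
INFO = {
    ("H", "A", 0): 0.30, ("H", "P", 0): -0.80,
    ("E", "V", 0): 0.40, ("E", "V", 1): 0.10,
    ("C", "G", 0): 0.25,
}

def score(sequence, i, state, half_window=8):
    """Sum single-position information over the 17-residue window centred on i."""
    total = 0.0
    for offset in range(-half_window, half_window + 1):
        j = i + offset
        if 0 <= j < len(sequence):
            total += INFO.get((state, sequence[j], offset), 0.0)
    return total

def predict_state(sequence, i):
    """Pick the state with the highest summed information for residue i."""
    return max("HEC", key=lambda s: score(sequence, i, s))

print(predict_state("MKYVNNAHDKIRPGDMTIKEF", 10))
```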
Accuracy burst due to four separate improvements

1) Using multiple sequence alignments instead of single-sequence input
2) More advanced decision-making algorithms
3) Improvement of sequence database search tools:
   • PSI-BLAST (Altschul et al., 1997) – most widely used
   • SAM (Karplus et al., 1998)
4) Increasingly large database sizes (more candidates)
Using Multiple Sequence Alignments

Zvelebil et al. (1987) were the first to exploit multiple sequence alignments to predict secondary structure automatically, by extending the GOR method; they reported that predictions improved by 9% compared with single-sequence prediction.

Multiple alignments, as opposed to single sequences, offer a much better means of recognising positional physicochemical features such as hydrophobicity patterns. Moreover, they provide better insight into the positional constraints on amino acid composition. Finally, the placement of gaps in an alignment can be indicative of loop regions.

Levin et al. (1993) also quantified the effect and observed an 8% increase in accuracy when multiple alignments of homologous sequences with sequence identities of 25% were used.

As a consequence, the current state-of-the-art methods all use input information from multiple sequence alignments, typically built automatically by database searching from a single input sequence.
Sequence cheY (PDB code 3chy)

[Figure: PHD secondary structure predictions for cheY after the initial search and each of nine PSI-BLAST iterations (INIT, Iter 1-9), shown against the amino acid sequence and the DSSP assignment. First half of the sequence: ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP; second half: NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM. The predicted helix (H) and strand (E) segments shift from iteration to iteration as more homologues enter the profile.]
Improved Methods: K-Nearest Neighbour

Requires an initial training phase.

TRAINING: sequence fragments of a certain length are derived from a database of known structures, so that the central residue of each fragment can be assigned its true secondary structural state as a label.

PREDICTION: a window of the same length is slid over the query sequence (or multiple alignment), and for each window the k most similar fragments are determined using a chosen similarity criterion. The distribution of the k secondary structure labels thus obtained is then used to derive propensities for the three SSE states (H, E or C).

[Figure: a window slides over the query sequence (Qseq); database fragments whose similarity to the window is good enough contribute their labels (H, H, E) to the predicted secondary structure (PSS) of the central residue.]
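A minimal Python sketch of this scheme follows. The fragment database and the similarity measure are toy stand-ins: real methods use large structure-derived fragment sets and substitution-matrix or profile-based similarity:

```python
from collections import Counter

# Sketch of k-nearest-neighbour secondary structure prediction.
# FRAGMENT_DB pairs each fragment with the true state of its central residue.

FRAGMENT_DB = [
    ("AELKA", "H"), ("VIVTV", "E"), ("GPNGS", "C"),
    ("KELQA", "H"), ("FVISE", "E"), ("SDGNP", "C"),
]

def similarity(a, b):
    """Toy similarity: number of identical residues at matching positions."""
    return sum(x == y for x, y in zip(a, b))

def knn_predict(window, k=3):
    """Label the central residue with the majority state of the k nearest fragments."""
    neighbours = sorted(FRAGMENT_DB, key=lambda f: similarity(window, f[0]),
                        reverse=True)[:k]
    states = Counter(state for _, state in neighbours)
    return states.most_common(1)[0][0]

def predict(sequence, window=5):
    """Slide the window over a padded query sequence and predict each residue."""
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    return "".join(knn_predict(padded[i:i + window]) for i in range(len(sequence)))

print(predict("AELKVIVTV"))
```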
Improved Methods: Neural Networks

Neural networks are learning systems based on complex non-linear statistics. They are organised as interconnected layers of input and output units, and can also contain intermediate (or "hidden") unit layers (neurons). Each unit in a layer receives information from one or more other connected units and determines its output signal based on the weights of the input signals (synapses).

A neural network has to be trained on a sequence database of known structures.

TRAINING: as for k-NN, but this time the information is used to adjust the weights of the internal connections, optimising the mapping of a set of input patterns onto a set of output patterns. The weights are adjusted according to the model used to handle the input data. It is normally difficult to understand the internal functioning of the network. Beware of overtraining the network.

[Figure: a window slides over the query sequence (Qseq) and is fed to the neural network, which outputs a state for the central residue.]
Diversity and alignment size give better predictions

The reigning secondary structure prediction method of the last five years, PSIPRED (Jones, 1999), incorporates multiple sequence information from database searching and neural nets. The method exploits the position-specific scoring matrices (PSSMs) generated by the PSI-BLAST algorithm (Altschul et al., 1997) and feeds these to a two-layered neural network. Since the method invokes the PSI-BLAST database search engine to gather information from related sequences, it only needs a single sequence as input. The accuracy of the PSIPRED method is 76.5%, as evaluated by the author.

An investigation into the effects of larger databases and more accurate sequence selection methods has shown that these improvements provide better and more diverse MSAs for secondary structure prediction (Przybylski, D. and Rost, B. (2002) Proteins, 46, 197-205).
PHD, PHDpsi, PROFsec

The PHD method (Profile network from HeiDelberg) broke the 70% barrier of prediction accuracy (Rost and Sander, 1993).

Three neural networks:
1) A 13-residue window slides over the alignment and produces 3-state raw secondary structure predictions.
2) A 17-residue window filters the output of network 1. The output of the second network then comprises, for each alignment position, three adjusted state probabilities. This post-processing step for the raw predictions of the first network is aimed at correcting unfeasible predictions and would, for example, change (HHHEEHH) into (HHHHHHH).
3) A network makes a so-called jury decision between networks 1 and 2 and a set of independently trained networks (extra predictions to correct for training biases). The predictions obtained by the jury network undergo a final simple filtering step that deletes predicted helices of one or two residues, changing those into coil (sketched below).
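A minimal sketch of that final filtering step, assuming an H/E/C prediction string as input (the choice of a regular expression is mine, not PHD's published implementation):

```python
import re

# Sketch of the final PHD-style filter: predicted helices of one or two
# residues are physically implausible and are reset to coil.

def filter_short_helices(prediction):
    """Replace isolated runs of one or two 'H' states with coil ('C')."""
    return re.sub(r"(?<!H)H{1,2}(?!H)",
                  lambda m: "C" * len(m.group()),
                  prediction)

print(filter_short_helices("CCHHCCHHHHHCC"))  # -> CCCCCCHHHHHCC
```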
Since the original method, the BLAST search and MAXHOM alignment routines have been replaced by PSI-BLAST in PHDpsi, and more recently the use of complex bi-directional neural networks has given rise to PROFsec, which is a close competitor to, and in many cases better than, PSIPRED.
How to develop a secondary structure prediction method

[Flowchart: a database of N sequences with known structure is split into a training set of K < N sequences and a test set of T << N sequences (for a jackknife test, K = N-1 and T = 1). The method is trained on the training set; the trained method predicts the test sequences; an assessment method scores the predictions against the standard of truth, yielding the prediction accuracy; other methods' predictions on the same test set serve as the benchmark. For a full jackknife test, the process is repeated N times and the prediction scores are averaged.]
The Jackknife test

A jackknife test is a test scenario for prediction methods that need to be tuned using a training database.

In its simplest form: for a database containing N sequences with known tertiary (and hence secondary) structure, a prediction is made for one test sequence after training the method on the remaining N-1 sequences (one-at-a-time jackknife testing). A complete jackknife test involves N such predictions.

If N is large enough, meaningful statistics can be derived from the observed performance. For example, the mean prediction accuracy and associated standard deviation give a good indication of the sustained performance of the method tested.

If this is computationally too expensive, the database can be split into larger groups, which are then jackknifed.
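A minimal sketch of the one-at-a-time scheme, assuming hypothetical `train`, `predict_one` and `accuracy` callables supplied by whichever method is being tested:

```python
import statistics

# Sketch of one-at-a-time jackknife (leave-one-out) testing.
# `train`, `predict_one` and `accuracy` are hypothetical stand-ins
# for the routines of the method under assessment.

def jackknife(database, train, predict_one, accuracy):
    """Train on N-1 entries, predict the held-out one; repeat for all N."""
    scores = []
    for i, test_entry in enumerate(database):
        training_set = database[:i] + database[i + 1:]
        model = train(training_set)
        prediction = predict_one(model, test_entry["sequence"])
        scores.append(accuracy(prediction, test_entry["structure"]))
    # Mean and standard deviation summarise the sustained performance.
    return statistics.mean(scores), statistics.stdev(scores)
```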
Protein Secondary structure: Standards of Truth

What is a standard of truth?
• a structurally derived secondary structure assignment
Why do we need one?
• it dictates how accurate our prediction is
How do we get it?
• methods use hydrogen-bonding patterns along the main chain to define the Secondary Structure Elements (SSEs)

Assignment methods:
1) DSSP (Kabsch and Sander, 1983) – most popular
2) STRIDE (Frishman and Argos, 1995)
3) DEFINE (Richards and Kundrot, 1988)

Annotation:
Helix: 3/10-helix (G), α-helix (H), π-helix (I)
Strand: β-strand (E), β-bulge (B)
Turn: H-bonded turn (T), bend (S)
Rest: coil (" ")
Assessing prediction accuracy

How do we decide how good a prediction is?

1) Qn: the number of correctly predicted SSE states (over n states) divided by the total number of predicted states.
2) SOV: the number of correctly predicted SSE states over the total number of predictions, with higher penalties for errors in core segment regions (Zemla et al., 1999).
3) MCC (Matthews Correlation Coefficient): the number of correctly predicted SSE states over the total number of predictions, taking into account how many prediction errors were made for each state.

Which one would you use?
• Biological information impact
• What are you testing?
• What is your prediction used for?

Making sense of the scores:
• Compare to your selected Standard of Truth
• Use all three to get a better picture
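A minimal sketch of two of these scores in Python (SOV requires segment-overlap bookkeeping and is omitted; the MCC here is computed per state from confusion counts):

```python
import math

# Sketch of Q3 (fraction of residues predicted in the correct state)
# and a per-state Matthews Correlation Coefficient.

def q3(predicted, observed):
    """Percentage of positions where the predicted state matches the truth."""
    assert len(predicted) == len(observed)
    return 100.0 * sum(p == o for p, o in zip(predicted, observed)) / len(observed)

def mcc(predicted, observed, state):
    """Matthews correlation for one state, treated as a binary classification."""
    tp = sum(p == state and o == state for p, o in zip(predicted, observed))
    tn = sum(p != state and o != state for p, o in zip(predicted, observed))
    fp = sum(p == state and o != state for p, o in zip(predicted, observed))
    fn = sum(p != state and o == state for p, o in zip(predicted, observed))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

obs = "HHHHCCEEEECC"
pred = "HHHCCCEEEECC"
print(q3(pred, obs), mcc(pred, obs, "H"))
```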
Automated Evaluation Initiatives

• The EVA server
• CASP (which also includes fold recognition assessments) and CAFASP, biannual experiments

With the number of methods freely available online, biologists are puzzled and have no way of knowing which one to use. These initiatives allow continual evaluation on sequences newly added to the PDB, using DSSP as the standard of truth.

LET'S GO TO THE WEB …
The consensus superiority

Deriving a consensus from multiple methods is always more accurate than any one individual method used alone.

Early on, Jpred (Cuff and Barton, 1998) investigated weighted and unweighted multiple-method majority voting, with an upper-limit increase of 4%. Nowadays, any three top-scoring methods can be improved by 1.5-2% through a simple majority-voting consensus. It is the three-clocks-on-a-boat scenario: if one clock goes wrong, the likelihood that the other two will go wrong at the same time and in the same way is very low.

We are currently completing a dynamic programming consensus algorithm that produces an optimally segmented consensus which is more biologically correct than simple majority voting, and we intend to set it as a standard on EVA for method consensus evaluation.
[Figure: consensus by majority voting. For a given position, a set of methods predicts H, H, H, E, E, E, E, C; the state observed most often (E) is kept as the correct consensus state.]
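A minimal sketch of per-position majority voting over several methods' predictions (ties are broken arbitrarily here by Counter ordering):

```python
from collections import Counter

# Sketch of simple majority-voting consensus over equal-length
# H/E/C prediction strings, one string per method.

def consensus(predictions):
    """Return the most frequent state at each position across all methods."""
    return "".join(
        Counter(states).most_common(1)[0][0]
        for states in zip(*predictions)
    )

print(consensus(["HHHEEC", "HHHEEC", "HHEEEC"]))  # -> HHHEEC
```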
[Flowchart: the standard prediction pipeline, starting from a single sequence. Step 1: database sequence search with PSI-BLAST or SAM-T2K against a sequence database, yielding homologous sequences plus a check file/PSSM (PSI-BLAST) or an HMM model (SAM-T2K). Step 2: an MSA method builds a multiple sequence alignment from the homologues. Step 3: trained machine-learning algorithm(s) produce the secondary structure prediction.]

[Flowchart: an optimised variant in which the MSA and the secondary structure prediction are iteratively and mutually optimised, and homologue detection is itself iterated using the optimised information.]