
Chapter 9
Structure Prediction
Motivation





Given a protein sequence, can you predict its molecular
structure?
Want to avoid repeated X-ray crystallography,
but want accuracy
You could use sequence alignment, but what
do you do with the gapped regions?
More complex methods are only justified if
they can be shown to perform better than
simpler methods
Simpler methods are only justified if they can
perform better than basic sequence
alignment
First Step
Some structure comparison methods
use secondary structures of the new
sequence
 Predict location of secondary structure
elements along the protein’s backbone
and the degree of residue burial
 Supervised learning has been shown to
perform well in this task

Artificial Neural Network
[Figure: neural network; label: "Predicts structure at this point"]
Danger
You may train the network on your
training set, but it may not generalize to
other data
 Perhaps we should train several ANNs
and then let them vote on the structure
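The voting idea can be sketched roughly as below. This is only an illustration: it assumes each trained network has already produced one state per residue, and the three-letter alphabet (H = helix, E = strand, C = coil), the function name `jury_vote`, and the tie-breaking rule are all assumptions made for the example, not details from the slides.

```python
from collections import Counter

def jury_vote(predictions):
    """Majority vote over per-residue predictions from several models.

    predictions: list of equal-length strings, one string per model.
    Ties go to the state proposed by the earliest model (arbitrary choice).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    consensus = []
    for column in zip(*predictions):  # one column per residue
        counts = Counter(column)
        best = max(counts.values())
        # First state, in model order, that reaches the top count.
        winner = next(state for state in column if counts[state] == best)
        consensus.append(winner)
    return "".join(consensus)

# Three hypothetical networks that disagree at two positions.
votes = ["HHHCCEE", "HHCCCEE", "HHHCEEE"]
print(jury_vote(votes))
```

A plain majority is the simplest jury; averaging the networks' output probabilities before choosing a state would be another reasonable option.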

Profile network from HeiDelberg






Uses the whole protein family (a multiple alignment is used as input) instead of just
the new sequence
On the first level, a window of length 13 around
the residue is used
The window slides down the sequence, making a
prediction for each residue
The input includes the frequency of amino acids
occurring in each position in the multiple
alignment (In the example, there are 5
sequences in the multiple alignment)
The second level takes these predictions from
neural networks that are centered on neighboring
residues
The third level does a jury selection
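The first-level input described above can be sketched as follows. This is a rough illustration, not the PHD implementation: the toy alignment, the zero padding at the sequence ends, the handling of gaps, and the names `column_frequencies` and `window_inputs` are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 13  # window length given in the slides

def column_frequencies(alignment):
    """Amino acid frequencies for each column of a multiple alignment.

    alignment: list of equal-length strings; '-' gaps are ignored (assumption).
    """
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO_ACIDS]
        total = len(residues)
        profile.append(
            [residues.count(a) / total if total else 0.0 for a in AMINO_ACIDS]
        )
    return profile

def window_inputs(profile, window=WINDOW):
    """Yield one flattened input vector per residue (window x 20 values).

    Positions that fall off either end are padded with zeros (assumption).
    """
    half = window // 2
    blank = [0.0] * len(AMINO_ACIDS)
    for center in range(len(profile)):
        vector = []
        for pos in range(center - half, center + half + 1):
            vector.extend(profile[pos] if 0 <= pos < len(profile) else blank)
        yield vector

# Toy alignment with 5 sequences, as in the slide's example.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVIA", "MKVLA"]
profile = column_frequencies(alignment)
inputs = list(window_inputs(profile))
print(len(inputs), len(inputs[0]))
```

Each vector would be fed to the first-level network, which outputs a prediction for the residue at the center of the window.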
PHD
[Figure: PHD network; labels: "Predicts 4", "Predicts 5", "Predicts 6"]
Threading
Threading matches structure to
sequence
 True threading considers 3D spatial
interactions

3D-1D Matching (Bowie et al.)
Convert 3D structure into a string
 Include -helix, -sheet or neither
 Include buried or solvent accessible (6
levels)
 Total of 3X6=18 distinct states
 With $P_{a:j}$ = probability of finding amino
acid $a$ in environment $j$ and
$P_a$ = probability of finding $a$ anywhere, the score is

\[ s_{aj} = \log\left(\frac{P_{a:j}}{P_a}\right) \]
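As a rough sketch, the score can be estimated from counts of (amino acid, environment) observations. The toy data, the two environment labels, the use of the natural logarithm, and the function name `environment_scores` are assumptions for illustration; a real table would use the 18 environment classes and counts taken from known structures.

```python
import math
from collections import Counter

def environment_scores(observations):
    """Return {(a, j): log(P(a in j) / P(a))} for every observed pair.

    observations: list of (amino_acid, environment) tuples.
    Pairs that were never observed get no entry (no pseudocounts here).
    """
    pair_counts = Counter(observations)
    aa_counts = Counter(a for a, _ in observations)
    env_counts = Counter(j for _, j in observations)
    total = len(observations)

    scores = {}
    for (a, j), n in pair_counts.items():
        p_a_in_j = n / env_counts[j]   # P_{a:j}
        p_a = aa_counts[a] / total     # P_a
        scores[(a, j)] = math.log(p_a_in_j / p_a)
    return scores

# Toy counts: valine mostly buried, aspartate mostly exposed.
obs = (
    [("V", "buried_helix")] * 3 + [("D", "buried_helix")] * 1
    + [("V", "exposed_coil")] * 1 + [("D", "exposed_coil")] * 3
)
scores = environment_scores(obs)
print(scores[("V", "buried_helix")], scores[("D", "buried_helix")])
```

A positive score means the amino acid is found in that environment more often than expected from its overall frequency; a negative score means less often.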
3D-1D
Calculate the information value scores
on a training set of multiple alignments;
the scores are then used as a profile for
each column
 When applied to the globin family, the method
clearly distinguished myoglobins from
nonglobins but not from other globins

Methods using 3D interactions
Residues that have large separation in
the sequence may end up next to each
other when the protein is folded.
 Define a measure of contact between
residues (two atoms within 5Å) and
count frequency of contact between all
pairs in PDB
 Use measure in alignment to evaluate
cost, or to select the best alignment
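The contact measure above can be sketched as follows. This is a minimal illustration on made-up coordinates: the input layout, the `min_separation` parameter, and the function names are assumptions, and a real count would read atom coordinates from PDB entries.

```python
import math
from collections import Counter
from itertools import combinations

CONTACT_CUTOFF = 5.0  # Å, the cutoff given in the slides

def in_contact(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if any atom of one residue lies within the cutoff of any atom of the other."""
    return any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b)

def count_contacts(residues, min_separation=1):
    """Count contacts per unordered pair of amino acid types.

    residues: list of (amino_acid, [(x, y, z), ...]) in sequence order.
    min_separation: minimum distance along the sequence for a pair to be
    counted (hypothetical parameter; 1 counts every pair).
    """
    counts = Counter()
    indexed = enumerate(residues)
    for (i, (aa_i, atoms_i)), (j, (aa_j, atoms_j)) in combinations(indexed, 2):
        if j - i >= min_separation and in_contact(atoms_i, atoms_j):
            counts[tuple(sorted((aa_i, aa_j)))] += 1
    return counts

# Toy chain with one atom per residue; the aspartate sits far from the rest.
residues = [
    ("A", [(0.0, 0.0, 0.0)]),
    ("V", [(3.0, 0.0, 0.0)]),
    ("D", [(20.0, 0.0, 0.0)]),
    ("V", [(0.0, 3.0, 0.0)]),
]
counts = count_contacts(residues)
print(counts)
```

Summing such counts over many structures gives the pair frequencies that can then be used to score an alignment.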

3D interactions
Potentials of mean force (POMF)
Since the notion of contact is somewhat
arbitrary, a more general formulation
can be tried
 Derive an empirical function for the
propensity of each of the 400 pairs of
residues to be any given distance apart.
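One possible way to derive such a function is sketched below. It bins observed distances for each residue pair and compares them with the distribution over all pairs. The slides do not give a formula, so the negative-log ratio (an inverse Boltzmann style conversion), the 1 Å bins, the 15 Å limit, the toy distances, and all names are assumptions.

```python
import math
from collections import defaultdict

BIN_WIDTH = 1.0   # Å (assumed)
MAX_DIST = 15.0   # Å (assumed)

def _normalize(counts):
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def distance_distributions(observations):
    """Return ({pair: P(distance bin)}, reference P(distance bin) over all pairs).

    observations: iterable of (aa1, aa2, distance_in_angstrom).
    """
    n_bins = int(MAX_DIST / BIN_WIDTH)
    pair_counts = defaultdict(lambda: [0] * n_bins)
    ref_counts = [0] * n_bins
    for aa1, aa2, d in observations:
        if 0.0 <= d < MAX_DIST:
            b = int(d / BIN_WIDTH)
            pair_counts[tuple(sorted((aa1, aa2)))][b] += 1
            ref_counts[b] += 1
    pair_probs = {pair: _normalize(v) for pair, v in pair_counts.items()}
    return pair_probs, _normalize(ref_counts)

def pomf(pair_probs, ref_probs):
    """-ln(P_pair / P_ref) per bin; None where either probability is zero."""
    return [
        -math.log(p / r) if p > 0 and r > 0 else None
        for p, r in zip(pair_probs, ref_probs)
    ]

# Toy distances: A-V pairs cluster near 5 Å, D-V pairs are mostly farther apart.
obs = [
    ("A", "V", 5.2), ("A", "V", 5.6), ("A", "V", 4.8), ("A", "V", 9.1),
    ("D", "V", 5.5), ("D", "V", 9.3), ("D", "V", 10.4), ("D", "V", 11.7),
]
pair_probs, ref_probs = distance_distributions(obs)
pomf_av = pomf(pair_probs[("A", "V")], ref_probs)
pomf_dv = pomf(pair_probs[("D", "V")], ref_probs)
print(pomf_av[5], pomf_dv[5])
```

With this sign convention, a distance that a pair adopts more often than average gets a negative value, so a peak in probability corresponds to a favorable score.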

Multiple Sequence Threading

Multiple Sequence Alignment



Align the most similar sequences to create a consensus
sequence
Align consensus sequences to create overall
alignment
Use the same strategy with structures
 Assume that conserved hydrophobic
positions should pack in the core
 This appears to be work in progress (1997)
Example


POMF(A,V)
Two small hydrophobic
residues, alanine (A) and
valine (V), both of which
favor packing in the core
of the protein.
The POMF would have a
peak around 5 Å
[Plot: probability vs. distance, 5 Å marked]

POMF(D,V)
Aspartate (D) and valine (V)
do not often pack
together
The POMF will have a dip
around 5 Å
[Plot: probability vs. distance, 5 Å marked]
Sequence-Structure Alignment

For all known structures
 Align the unknown sequence to that
structure
 Find the best alignment
 Return the structure with the best global
alignment
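The loop over known structures can be sketched as below, under a strong simplification: each structure is reduced to a string of per-position environment labels and scored with position-independent values (in the spirit of 3D-1D matching), so an ordinary Needleman-Wunsch global alignment applies. The two-letter environment alphabet (B = buried, E = exposed), the score table, the gap penalty, and the names `global_alignment_score` and `best_fold` are assumptions for illustration. Pairwise 3D interaction terms are deliberately left out.

```python
def global_alignment_score(sequence, environments, scores, gap=-1.0, default=-0.5):
    """Needleman-Wunsch score of a sequence against a string of environment labels.

    scores: {(amino_acid, environment): value}; missing pairs use `default`.
    """
    m = len(environments)
    prev = [j * gap for j in range(m + 1)]  # row for the empty sequence prefix
    for i in range(1, len(sequence) + 1):
        curr = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            pair = (sequence[i - 1], environments[j - 1])
            match = prev[j - 1] + scores.get(pair, default)
            curr[j] = max(match, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]

def best_fold(sequence, library, scores):
    """Align the sequence to every structure and return (name, score) of the best."""
    scored = (
        (name, global_alignment_score(sequence, envs, scores))
        for name, envs in library.items()
    )
    return max(scored, key=lambda item: item[1])

# Toy score table: valine prefers buried positions, aspartate exposed ones.
scores = {("V", "B"): 1.0, ("V", "E"): -1.0, ("D", "B"): -1.0, ("D", "E"): 1.0}
library = {"fold1": "BBEE", "fold2": "EEBB"}  # hypothetical structure library
print(best_fold("VVDD", library, scores))
```

The dynamic programming step works here only because each position is scored independently of where the other residues end up.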

Unfortunately, with pairwise 3D interactions we can't use dynamic
programming (the problem is NP-complete)
 Heuristics must be used to explore the
space.
Evaluating Methods



Is the complexity worth it?
This is difficult without a benchmark
Few comparative studies have been
performed


When they have been performed, authors of
competing methods have complained that wrong
parameters were used …
Critical Assessment of Structure Prediction
(CASP, first held in 1994) releases the sequences of proteins
whose structures have not yet been published.


All methods submit their predictions
Predictions are analyzed based on fold
recognition, modeling accuracy and alignment
accuracy.
 No one method or approach is obviously superior