Prediction of Local Structure
in Proteins Using a Library of
Sequence-Structure Motifs
Christopher Bystroff & David Baker
Paper presented by: Tal Blum
The Approach
• Learn a set of clusters of structure
segments that can be identified from short
local sequences
• Combine a set of local structural
predictions into one whole structure
Methods - Database
• Database of 471 protein sequence families
• By Sander & Schneider 1994
• Each family contains one sequence with a
known structure
• No more than 25% sequence identity
between any 2 alignments
• Well determined structures
• Non-membrane proteins
Clustering of Sequence Segments
• Each position in the database is described by a
weighted amino acid frequency (Vingron & Argos
1989)
• Similarity between a sequence and a cluster is
defined by “Cross-Entropy”:
• Segments of given length (3-15) were clustered
via the K-means algorithm
• Unsupervised
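A minimal K-means sketch for the clustering step above. It assumes each segment has been flattened into a fixed-length list of profile frequencies, and it uses squared Euclidean distance as a stand-in, because the slide's cross-entropy formula did not survive in the transcript. The names `kmeans` and `sq_euclid` are illustrative, not from the paper.

```python
def sq_euclid(p, q):
    """Squared Euclidean distance; a stand-in for the paper's similarity measure."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, distance=sq_euclid, iterations=20):
    """Plain K-means: assign each point to its nearest center, then re-average."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: distance(p, centers[c]))
            groups[nearest].append(p)
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                # Column-wise mean of the member vectors
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center unchanged
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, groups

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, groups = kmeans(points, [[0, 0], [10, 10]])
```

In the method described above, one such clustering would be run per segment length (3 to 15).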
Assessing Structure within a cluster
and Choice of Paradigm
• Structural similarity between 2 peptide structure
segments
• s1(i→j) is the distance between α-carbon atoms i
and j in segment s1
• The paradigm for a cluster was chosen from the
top 20 segments as the one with the smallest
sum of mda/dme values with the others
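The two structural measures and the paradigm choice can be sketched as follows. The slide's formulas are not in the transcript, so this assumes the usual readings: dme as the root-mean-square difference between corresponding α-carbon distances, and mda as the largest backbone-angle difference over the segment. The function names are illustrative.

```python
import math
from itertools import combinations

def dme(ca1, ca2):
    """Distance matrix error between two equal-length lists of (x, y, z) alpha-carbons."""
    pairs = list(combinations(range(len(ca1)), 2))
    total = 0.0
    for i, j in pairs:
        total += (math.dist(ca1[i], ca1[j]) - math.dist(ca2[i], ca2[j])) ** 2
    return math.sqrt(total / len(pairs))

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mda(tors1, tors2):
    """Maximum deviation in backbone angles; each input is a list of (phi, psi) in degrees."""
    return max(
        max(angle_diff(phi1, phi2), angle_diff(psi1, psi2))
        for (phi1, psi1), (phi2, psi2) in zip(tors1, tors2)
    )

def choose_paradigm(segments, measure):
    """Index of the segment with the smallest summed distance to all the others."""
    totals = [sum(measure(s, t) for t in segments if t is not s) for s in segments]
    return totals.index(min(totals))
```

For the paradigm choice on the slide, `segments` would be the top 20 segments of a cluster and `measure` either `dme` or `mda`.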
True/False Boundaries
in Structure Space
• Used for the refinement procedure
• Find Natural Boundaries
• Compute Histograms of dme & mda vs the
paradigm over all segments in the cluster
• The boundary was set to the point where the
histogram first dropped to ½ of its maximum
• If it reached 130° or 1.3 Å, the cluster was rejected
• Average boundaries are 81° and 0.89 Å
• 82 clusters were constructed (the I-sites library)
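The half-maximum boundary rule can be sketched as below. The bin width, the non-negative inputs, and the choice to report the lower edge of the first bin that falls to half the peak are assumptions; the slide gives only the rule and the rejection limits.

```python
def natural_boundary(values, bin_width, reject_at):
    """Boundary where the histogram first drops to half its maximum, or None if rejected.

    values: non-negative dme or mda distances from each segment to the paradigm.
    """
    n_bins = int(max(values) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int(v // bin_width)] += 1
    peak = max(counts)
    peak_bin = counts.index(peak)
    # Walk outward from the peak; past the last bin the count is zero.
    for b in range(peak_bin + 1, n_bins + 1):
        count = counts[b] if b < n_bins else 0
        if count <= peak / 2:
            boundary = b * bin_width
            return None if boundary >= reject_at else boundary

mda_values = [5] * 10 + [15] * 8 + [25] * 3 + [35]
boundary = natural_boundary(mda_values, bin_width=10, reject_at=130)
```

Here the counts per 10° bin are 10, 8, 3, 1, so the histogram first falls to half its peak in the third bin and the boundary is 20°.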
[Figure: dme vs. mda for the 9-residue serine β-hairpin]
Iterative Refinement of Clusters
• For each cluster with good boundaries
• Clustering increases P(cluster|sequence)
• In order to increase P(structure|cluster)
• 2 residues are also observed on each side of each
sequence
• All segments that are not within the natural boundaries
of the paradigm are removed
• The frequency profile of the cluster is calculated
• The database is searched using the new profile and
the highest 400 scored sequences are the new cluster
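One pass of the refinement loop can be sketched as below. The callables `struct_dist`, `make_profile` and `score` are hypothetical placeholders for the structural measure, the frequency-profile calculation and the profile-to-segment similarity; none of them is specified on the slides.

```python
def refine_once(members, paradigm, boundary, database,
                struct_dist, make_profile, score, keep=400):
    """One refinement pass: filter by structure, re-profile, re-search the database."""
    # 1. Drop members outside the natural boundary of the paradigm
    kept = [m for m in members if struct_dist(m, paradigm) <= boundary]
    # 2. Recompute the cluster's sequence profile from the survivors
    profile = make_profile(kept)
    # 3. The best-scoring database segments become the new cluster
    ranked = sorted(database, key=lambda seg: score(seg, profile), reverse=True)
    return ranked[:keep], profile

# Toy data: each segment is (sequence_value, structure_value)
members = [(1.0, 0.0), (2.0, 0.1), (9.0, 5.0)]
database = [(1.4, 0.0), (1.0, 0.2), (8.0, 4.0), (2.5, 0.0)]
new_members, profile = refine_once(
    members, paradigm=(1.0, 0.0), boundary=1.0, database=database,
    struct_dist=lambda a, b: abs(a[1] - b[1]),
    make_profile=lambda segs: sum(s[0] for s in segs) / len(segs),
    score=lambda seg, prof: -abs(seg[0] - prof),
    keep=2,
)
```

Repeating this pass until the membership stops changing would give the iterative procedure described above.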
Cross-Validation and confidence
• A 10 fold cross validation was performed
• If the 10 paradigms were not structurally the
same or if the 10 runs did not converge to the
same profile then the cluster was rejected
• If the cluster was not rejected, a confidence
curve was computed as a function of Dpq, the
sequence-to-cluster similarity
• This makes it possible to compare different profile lengths
and incorporates P(clust|seq) and P(struct|clust)
Confidence for Similarity
Clustering – What do we want?
• Direction: Sequence -> Structure
• We want clusters of sequences that are as well
separated as possible, so that a given test sequence
can be assigned to 1 cluster
• Each cluster should have 1 or a few possible
structures. Those structures will be used to
predict the test protein structure
• P(struct|seq) = Σ_clust P(struct|clust,seq) · P(clust|seq)
= Σ_clust P(struct|clust) · P(clust|seq)
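A small numeric illustration of the sum over clusters in the last bullet. The cluster names and probabilities are made up for the example.

```python
def p_struct_given_seq(p_clust_given_seq, p_struct_given_clust):
    """Sum over clusters of P(struct|clust) * P(clust|seq)."""
    return sum(
        p_clust_given_seq[c] * p_struct_given_clust[c]
        for c in p_clust_given_seq
    )

# Two hypothetical clusters a test segment could belong to
p_clust = {"helix_cap": 0.7, "hairpin": 0.3}    # P(clust|seq)
p_struct = {"helix_cap": 0.9, "hairpin": 0.2}   # P(struct|clust) for one target structure
p = p_struct_given_seq(p_clust, p_struct)       # 0.7*0.9 + 0.3*0.2
```

Well-separated clusters push P(clust|seq) toward 0 or 1, and structurally tight clusters push P(struct|clust) toward 1, which is why both matter for the final prediction.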
Iterative Peak Removal
• Similar Sequences can map to different
structures in some cases
• When this happens, the predominant pattern
occludes the second one
• To find those clusters, the refinement was
performed using a subset of the data that excludes
the other class members
• This helped identify two distinct α-C-cap
extensions which were very similar in sequence
Cluster Weights
• The prediction accuracy is improved by
weighting the confidence curves
• Iterative update was used
• Where F⁺_C are the false positives of cluster
C and F⁻_C are the false negatives
Prediction Protocol
• Given a sequence to predict:
1. Submit the sequence to PHD (Rost 94) to obtain
a set of multiple aligned sequences and hence a
profile
2. Each segment of the profile is scored against each
of the 82 clusters to produce weighted
confidences
3. Confidences are sorted
4. The first segment assigns φ & ψ from its paradigm
5. For all the subsequent segments in the sorted list
the prediction is used if it doesn’t conflict with
previously assigned φ & ψ
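Steps 3 to 5 amount to a greedy assignment, sketched below. The slide does not define "conflict", so this sketch assumes two predictions conflict when φ or ψ at an overlapping residue differs by more than a hypothetical `max_diff` threshold. Segments are assumed to fit inside the sequence.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def assign_torsions(n_residues, scored_segments, max_diff=60.0):
    """Greedy phi/psi assignment.

    scored_segments: list of (confidence, start_index, [(phi, psi), ...]).
    max_diff is an assumed conflict threshold, not a value from the paper.
    """
    assigned = [None] * n_residues
    ranked = sorted(scored_segments, key=lambda s: s[0], reverse=True)
    for confidence, start, angles in ranked:
        conflict = False
        for offset, (phi, psi) in enumerate(angles):
            prev = assigned[start + offset]
            if prev is not None and (angle_diff(phi, prev[0]) > max_diff
                                     or angle_diff(psi, prev[1]) > max_diff):
                conflict = True
                break
        if conflict:
            continue  # skip predictions that contradict higher-confidence ones
        for offset, pair in enumerate(angles):
            if assigned[start + offset] is None:
                assigned[start + offset] = pair
    return assigned

helix, strand = (-60, -40), (-120, 130)
segments = [(0.9, 0, [helix] * 3), (0.8, 2, [strand] * 3), (0.5, 3, [strand] * 2)]
result = assign_torsions(6, segments)
```

In this toy run the 0.8 segment is discarded because it disagrees with the helix already assigned at residue 2, while the 0.5 segment fills residues 3 and 4.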
Results
• Reported on the training set and on an
independent set of 55 protein families
• Local evaluation is measured by agreement over
8 residue window
• An 8-residue segment prediction is considered to be
correct if none of the φ & ψ differences is larger
than 120° or if the rmsd between the correct and
predicted structure is less than 1.4 Å
• An error is counted per position iff all 8
overlapping segments are incorrect
• Mda is stricter than the commonly used Q3 score
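The per-position error rule can be sketched as follows. The treatment of terminal residues, which are covered by fewer than 8 windows, is an assumption: only the windows that exist are checked.

```python
def segment_is_correct(mda_deg, rmsd_angstrom):
    """A segment counts as correct if mda <= 120 degrees or rmsd < 1.4 angstroms."""
    return mda_deg <= 120.0 or rmsd_angstrom < 1.4

def position_errors(segment_correct, window=8):
    """Flag a residue as an error iff every window covering it is incorrect.

    segment_correct[k] is True when the window starting at residue k is correct.
    """
    n_windows = len(segment_correct)
    n_residues = n_windows + window - 1
    errors = []
    for pos in range(n_residues):
        first = max(0, pos - window + 1)   # earliest window covering pos
        last = min(pos, n_windows - 1)     # latest window covering pos
        errors.append(not any(segment_correct[first:last + 1]))
    return errors

# Toy case with a window of 3: only the window starting at residue 2 is correct
errors = position_errors([False, False, True, False], window=3)
```

A single correct window clears every residue it covers, so only residues 0, 1 and 5 are counted as errors in the toy case.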
Results
• Training Set
– 471 sequences -> 122,510 residues
– 95% of the 471 had 1 match with confidence ≥ 0.8
– 40% of the residues had confidence ≥ 0.6 and
were 71% (mda) correct
Results
Combinations of I-sites and
conventional Secondary Structure
Predictions
• With the PHD program
• Requires translation into Sec Structure or from
SS into torsion angles
• Every program performed better in its own
domain
• 64% Q3 because of underpredicting loops and
overpredicting strands
• I-sites was much better in loops and specific
angles of turns
• Can complement PHD
Comparison of I-Site & PHD
I-site library
• The 82 clusters represent 13 structural motifs
Summary of the I-site library
Conclusions
• Method is fast – requires only profile
comparisons
• There is a measure of “confidence” in the
prediction
• They do not provide accuracy over the
whole protein
• They believe that the strong local sequence-structure
relationships (those that occur more
than 30 times) are present in I-sites
Discussion
• NMR studies of isolated peptides of fewer
than 30 residues show that the peptides do
not have a well-defined structure. The I-site
motifs are the exceptions
• It might be that the motifs are the areas
that adopt structure independently of the
rest of the protein
• An extension might be context specific
motifs
2 Approaches for global scoring
functions
• Derived from the protein Database
– Large # of parameters
– Complicated
• Potentials
– Based on Chemical Intuitions
– Simpler
– Clearer insights into sequence/structure relations
• They chose the Database approach
– Because of the dangers of crafting a measure for a
specific protein family rather than for the whole DB
Scoring Functions
P(Structure | Sequence) = P(Structure) × P(Sequence | Structure) / P(Sequence)
∝ P(Structure) × P(Sequence | Structure)
since P(Sequence) is independent of Structure
• P(Seq|Str) is used when computing sequence profiles
for motifs
• P(Structure) is hardest to estimate and contains most
of the non-local interactions.
• For ab-initio, P(Structure) captures the features that
distinguish folded structures from random chain (local)
configurations.
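A toy illustration of the Bayes relation above: dividing by P(Sequence) rescales every candidate equally, so the ranking depends only on P(Structure) × P(Sequence | Structure). The candidate names and numbers are invented, and the candidates are assumed to be the only possible structures so that P(Sequence) can be obtained by summing.

```python
def posteriors(candidates):
    """candidates: name -> (P(structure), P(sequence | structure)).

    Returns P(structure | sequence), assuming the candidates are exhaustive.
    """
    unnormalized = {name: prior * lik for name, (prior, lik) in candidates.items()}
    p_sequence = sum(unnormalized.values())  # the same constant for every candidate
    return {name: value / p_sequence for name, value in unnormalized.items()}

candidates = {"compact": (0.6, 0.02), "extended": (0.4, 0.01)}
post = posteriors(candidates)
best = max(post, key=post.get)
```

In practice the candidates are not exhaustive, which is why the proportional form is used for scoring.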
Radius of gyration²
Scoring Function
• Measures how compact the fold is: the root-mean-square
distance of the residues from the center of the fold
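For reference, the radius of gyration is conventionally computed as the root-mean-square distance of the atoms from their centroid. The sketch below assumes equal weights and one (x, y, z) point per residue, for example the α-carbons.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of equally weighted points from their centroid."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)

# Four points at unit distance from the origin
rg = radius_of_gyration([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)])
```

A scoring term based on this quantity favors compact chains over extended ones, whatever their secondary structure.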
Radius of gyration²
Scoring Function
• Advantages
– Does not depend on the alpha-beta decomposition:
since the generated structures are made from
segments of real proteins, their alpha-beta
decomposition is much like that of real proteins
• Disadvantages
– Structures with paired beta strands are no
more probable than those with unpaired beta
strands