Transcript Document

Structural Bioinformatics
•
•
•
•
Motivation
Concepts
Structure Prediction
Summary
Lecture 12 CS566
1
Motivation
• Holy Grail: Mapping between sequence and
structure. Structure = F(Sequence). What is F?
• Why
– Structure dictates chemistry, thermodynamics and
therefore function
– Not all structures can be (need be?) determined
experimentally
• Cost
• Experimental limitations
Lecture 12 CS566
2
Concepts – Prediction spectrum
Decreasing reliance on known structures
Homology
Modeling
Threading
Lecture 12 CS566
ab initio
Quantum
Mechanics
3
Concepts - Common Principles
• Constraints to reduce search space
• Consideration of many alternate conformations
– Protein backbone dihedral angles (‘Twists along axis
of protein’)
– Amino-acid geometry (‘Amino-acids can have more
than one shape’)
• Method for local optimization
• Scoring function to compare conformations
Lecture 12 CS566
4
Evaluation of quality of prediction
• RMSD comparison with experimentally known
structure
• Comparison with crystal structure quality criteria
– Ramachandran Plot
• Residue specific dihedral angle distribution
• CASP (Critical assessment of structure prediction)
and CAFASP (..Fully Automated..) competitions
Lecture 12 CS566
5
Methods
• Knowledge-based constraints of search space
– Homology Modeling
– Threading
– ab initio (Based on knowledge primitives: not true ab
initio)
• Approaches to refinement
– Quantum mechanics (ab initio)
• Based on quantum mechanical model of elementary particles
• Unscalable
– Molecular mechanics
• Uses parametric Force Fields (Newton’s laws, Hooke’s law,
…)
• Typically used for local or constrained global optimization
• Molecular Dynamics or Monte Carlo-based
Lecture 12 CS566
6
Homology modeling
• Homology
– Based on sequence-sequence similarity ( > ~25%, the
higher, the better)
– Steps
• Pair-wise local sequence similarity to identify related structures
(possible templates)
• Refine alignment by global pair-wise sequence similarity and
msa
• Overlay sequence backbone (N-C-C) on template
• Model loops based on
– Statistical knowledge from databases of known structures
– Molecular mechanics
• Model side-chains (approach similar to that of loops)
• Molecular mechanical unconstrained local optimization
• Pray for a good solution!
Lecture 12 CS566
7
Threading
• Based on sequence-structure similarity
• Concept
– Residues in core adopt fewer conformations than
surface
• Approach
– Thread sequence through all known structures
– Score match with core of each structure based on
• Environmental scoring matrices and/or
• Amino acid neighborhood matrices (a la Dot matrix)
– Refine structure using molecular mechanics based on
best template(s)
Lecture 12 CS566
8
Rosetta (“ab initio”) Approach
• Pioneered by David Baker’s group in the late 1990s
• Remarkable success in CASP and CAFASP experiments
• Recently made publicly available on an automated server
by Christopher Bystroff’s group
• Pot pourri of many different approaches
• Key components
– ‘Divide and conquer’ strategy with respect to length of sequence to
be modeled
– Use of knowledge based energy function
Lecture 12 CS566
9
‘Divide and conquer’
• Mimics natural process of protein folding
• Compromise between extremes of
– Looking for homologous sequences with known
structure
– Modeling a priori (one amino acid at a time)
• Use library of 3D structures of fragments of length
3 and 9 derived from the crystal structure database
(a priori estimates = 8K and ~ 1012).
• Break up query sequence into a set of 3mers and
9mers, to find matches with above library – using
a sequence profile approach
Lecture 12 CS566
10
‘Divide and conquer’
• Once matches found, reduces to
combinatorial problem of selecting best set
of fragments with most energetically
favorable structure
• In practice, Monte Carlo based search of
possible combinations is carried out.
Lecture 12 CS566
11
Knowledge based energy function
• Fundamentally,
∆G = ∆H - T ∆S
• Free energy is the enthalpy less an entropic
term that is proportional to temperature
• Entropy is proportional to the natural log of
the number of conformations/possible states
S = K ln W
Lecture 12 CS566
12
Knowledge based energy function
• Hence makes sense to use existing distribution of
structures to derive energy function
• Energy function is based on taking statistical
distribution of 3D shapes in database of known
structures as the underlying probability
distribution
• For a given structure, deviations from probability
distribution are subject to proportional energetic
penalties
Lecture 12 CS566
13
Rosetta – Steps used in CASP4
1. If possible, use PSI-BLAST to find similar
sequences
A. If found, use the multiple sequence alignment
to break down sequence into domains to be
modeled independently
B. For domains with similarity to known
structures, use Homology based approach
C. For remaining domains, carry out Rosetta
Lecture 12 CS566
14
Rosetta - Steps
2.
3.
For domains with similarity to other sequences, apply
following steps to the homologs as well (consensus
modeling)
Generate fragment library for each query
A.
4.
Collect 3mer and 9mer sub-structures from the PDB with
similarity to 3mer and 9mer subsequences
Use Monte Carlo approach for backbone fragment
substitution into query
A.
B.
C.
Pick a fragment at random from library (~40,000 fragment
substitutions for each structure)
Repeat A several times
Between 10K and 100K conformations (‘decoys’) generated for
each target
Lecture 12 CS566
15
Rosetta - Steps
5.
Filter set of conformations to remove unlikely structures
A.
B.
6.
7.
8.
Remove structures with minimal long range interactions (low
contact order)
Remove structures with unrealistic strands
Add side chains as statistically predicted by the
backbone conformation
Cluster set of conformations (including, when available,
the generated structures of homologues)
Representative structures from the top 5 most-populous
clusters are candidate structures
Lecture 12 CS566
16
Summary
• Methods like Rosetta represents a breakthrough in
the ab initio prediction of protein 3D structure and
are very useful in cases where homology cannot
be observed
• For CASP4, at least one subsequence longer than
50 residues could be predicted ‘correctly’ (< 6.5
rmsd) in 17 of 21 cases
• Combination of various approaches works best
Lecture 12 CS566
17
Summary
• However, both completeness and accuracy
of prediction leave ample room for
improvement
– RMS error frequently too high to be useful
– Even in homology modeling, template per se is
often better match!
– Often, only subsequences are accurately
modeled, and not the whole structure
– The Nobel Prize is still up for grabs!
Lecture 12 CS566
18