Transcript Charles
Protein Design
CS273: Final Project
Charles Kou
[email protected]
Crystal structure of top7 –
A novel protein structure created with RosettaDesign.
http://rosettadesign.med.unc.edu/
What is Protein Design
Opposite of structure prediction:
determine low energy sequence that
yield given structure.
Computationally difficult:
Search space of 20^n where n =
sequence length (20 amino acids)
Major algorithms: Dead-end elimination,
genetic algorithms, Monte Carlo, Branch
& Bound.
http://www.stanford.edu/class/cs273/project/project.html
Major Algorithms
Trade off between thoroughness and
computational speed.
Monte Carlo / Genetic Algorithm:
Can sample space with infinite number of solutions
Sidechain identity, side chain orientation and
backbone structure can be varied continuously.
No guarantee of reaching global energy minimum.
Dead-End Elimination
Allows only discrete conformations.
Rejection criteria is used to prune the search
space.
Desjarlais JR, Clarke ND.
Computer search algorithms in protein modification and design.
Curr Opin Struct Biol. 1998 Aug;8(4):471-5. PubMed
Review: Energy Landscape
qj
defined over large dimensional
conformation space
qi
qN-1
qN
q2
q1
JC Lantombe, Energy2.ppt
Review: Example Energy Function
• E=
•
S bonded terms
+ S non-bonded terms
+ S solvation terms
E = (ES + EQ + ES-B + ETor) + (EvdW + Edipole)
Bonded terms
- Relatively few
Non-bonded terms
- Depend on distances between pairs of atoms
- O(n2) Expensive to compute
Solvation terms
- May require computing molecular surface
JC Lantombe, Energy2.ppt
Review: Monte Carlo Simulation
(MCS)
Random walk through conformation space
At each cycle:
– Perturb current conformation at random
– Accept step with probability:
- ΔE
kT
P(accept)=min 1,e
(Metropolis acceptance criterion)
The conformations generated by an arbitrarily long
MCS are Boltzman distributed, i.e.,
#conformations in V ~
V
e
-
E
kT
dV
JC Lantombe, Energy2.ppt
Monte Carlo Simulation
Tend to waste time in local min.
May consist of millions of steps.
Energy must be evaluated frequently
(computationally expensive).
Use ChainTree to improve performance.
Lotan, I., Schwarzer, F., Halperin, D., Latombe, J.C.:
Efficient maintenance and self-collision testing for kinematic chains.
In: Symposium on Computational Geometry (2002) 43–52
Koehl, P and Levitt, M. De novo protein design. I. In search of stability and specificity.
Journal of Molecular Biology, 293, 1161-1181 (1999).
Genetic Algorithm
Starts with First generation pool.
Iteratively apply genetic operators
(selection, recombination, mutation).
Evloves toward better solution (low
energy function).
S. M. Larson, J. England, J. DesJarlais, and V. S. Pande.
Thoroughly sampling sequence space: large-scale protein design of structural ensembles.
Protein Science 11 2804-281 (2002). Protein Science
Selection
• Selection function takes into account
the value of fitness function. This gives
priority to the “fit” organism but also
gives chance for “less fit” organisms.
http://en.wikipedia.org/wiki/Genetic_algorithm
Selection Method
• Roulette Method: probability of selection is
proportional to the value of fitness function
• Tournament: picks k individuals (tournament
size), and choose the individual with
probability p. Iterate with probability p*(1-p),
then p*(p*(1-p)) …
• Higher k = less chance for weaker individual.
http://en.wikipedia.org/wiki/Roulette_wheel_selection
http://en.wikipedia.org/wiki/Tournament_selection
Recombination, Mutation
Recombination: different segment of the
structure which is optimized in parallel can be
recombined into the same model.
Recombination occurs with a set probability.
Otherwise, the population is propogated to
the next generation.
Mutation: avoids local minima by mutating the
child with a set probability.
Similar to MC: there is no guarantee to
converge into global minimum.
http://en.wikipedia.org/wiki/Genetic_operator
Genome@home
• Genome@Home uses
distributed computing
and genetic algorithm.
• It also incorporates
backbone flexibility
using Monte Carlo
(random perturbation
with RMSD<1.0a) which
improves the result.
http://www.stanford.edu/group/pandegroup/genome/
Dead-end Elimination
Discrete conformational search.
Functionally equivalent to exhustive search.
It uses rejection criteria to prune the search
space.
The robustness depends on the discreteness
and the rejection criteria used.
Guaranteed convergence to global min.
Initially used for sidechain placement. More
difficult for protein design because of high
degrees of freedom.
Looger LL, Hellinga HW.
Generalized dead-end elimination algorithms make large-scale protein side-chain structure prediction tractable:
implications for protein design and structural genomics.
J Mol Biol. 2001 Mar 16;307(1):429-45. PubMed
Energy of conformation
Reformulation of sidechain placement problem:
Amino acid identity is used instead of rotamer.
The general DEE allows residue up to 300.
Energy of conformation is defined as sum of
interaction among side chains and sum of interaction
of sidechain and the backbone.
Rejection criteria is used and iterated until no more
rotamers can be eliminated. Convergence occurs, or
reduces the problem sufficiently for exhaustive
serach.
DEE filter: Rejection Criteria
Simple Criterion: If lowest energy
struct that can be found using a given
sidechain rotamer (low energy side chain
conformation) is higer than the highest
energy struct w/ different rotamer, the
first rotamer is eliminated.
DEE filter: Rejection Criteria
Goldstein Criteria: if energy struct
containing one rotamer is always
lowered by changing to a second one,
the first one is eliminated.
DEE filter: Rejection Criteria
Generalized Criterion: residues are
added in group, eliminated clusters of
rotamers in the groups maybe excluded
from the minimum operator, in addition
to those which form dead-end clusters
with c.
Mean Field Theory
• Reduce search space.
• Self-consistency is sought
by placing amino acids at
pre-selected positions in a
given structure.
• Energy function is
minimized by mean field.
Voigt CA, Mayo SL, Arnold FH, Wang ZG.
Computational method to reduce the search space for directed protein evolution.
Proc Natl Acad Sci U S A. 2001 Mar 27;98(7):3778-83. PNAS
Review: Branch & Bound
•
•
Set of solutions can be
partitioned into subsets
(branch)
Upper limit on a subset’s
solution can be computed
fast (bound)
Branch & Bound
1. Select subset with best
possible bound
2. Subdivide it, and
compute a bound for
each subset
S.Batzoglou, Threading2.ppt
Rosetta Design
• Initial backbone designed
without regard to side-chain
packing.
• Iterates between sequence
design and backbone optimization
using Monte Carlo.
• Perturbation in random change in
the torsional angles of 1-5
random residue, or substitution
of backbone torsonal angles of 13 consecutive residues with
torsional angles from a structure
in the PDB. Sidechain
optimization. Accept/reject
using Metropolis criterion.
• 1.17-a backbone atom RMSD
between model and structure.
Crystal structure of top7 –
A novel protein structure created with RosettaDesign.
http://rosettadesign.med.unc.edu/
Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D.
Design of a novel globular protein fold with atomic-level accuracy.
Science. 2003 Nov 21;302(5649):1364-8. PubMed
Using Rosetta Design
• Red: PDB 1A1M: Mhc
Class I Molecule
B*5301 Complexed
With Peptide
Typdinqml From Gag
Protein Of Hiv2
• Blue: Rosetta Stone
Designed
• Visualized with Deep
View / SwissPdbViewer.
http://us.expasy.org/spdbv/
http://www.rcsb.org/pdb/cgi/explore.cgi?pid=195321117535569&pdbId=1A1M
b.e.a.n.s.
• A simple openGL based
program was developed
to test monte carol and
genetic algorithms on
designing “chain of
jelly beans.”
• User is able to vary the
initial structure of the
“beans” and compare
the efficiency of the
algorithms via built-in
timer.