Transcript Round 1
Protein Folding: Interrelation
between Secondary and Tertiary
Structure Determination
Karl F. Freed
James Franck Institute
and Department of Chemistry
University of Chicago
KITPC, Beijing, China, July 29, 2009.
Human, Dog, Rat, Worm Genome Projects:
Obtain genes which code for protein sequences
www.genomesonline.org/images/gold_s1.gif
www.ncbi.nlm.nih.gov/Genbank/genbankgrowth.jpg
Proteins: The primary functional biomolecules
insulin: hormone (glucose levels)
hemoglobin: oxygen transport
cytochrome c: heme group, electron transfer
ribonuclease: cleaves RNA
lysozyme: cleaves carbohydrates
antibody: recognize/target foreign bodies
Function follows from structure
myoglobin: oxygen storage
glutamine synthetase: synthesize glutamine
Our goal is to determine the folded structures
Protein Folding Problem
SEQUENCE → (Folding, 0.00001 – 10+ sec) → 3D STRUCTURE
“Unfolded State”: many conformations
Amino acids: polar, neutral, charged, or hydrophobic
“Native State”: responsible for function
Why haven’t we solved
The Protein Folding Problem
after 40+ years?
Sequence
“Mother Folding”
Unfolded State, “statistical coil”
Sequence determines structure
Two aspects:
1. Predict pathways
2. Predict structure
Give me an aa sequence; I’ll produce a pathway
and the final structure.
Why so difficult?
1. We’re not smart enough
2. It’s a very complex system
3. …
Complex problem:
•Too many atoms (not enough computing power)
•Force fields inaccurate (pairwise interactions inadequate)
•Complex interplay between secondary and tertiary structure
formation (local vs. long-range structure)
•High degree of folding cooperativity
•Averaging doesn’t work (no mean field models)
•H2O solvent is difficult to treat
•Don’t know all the rules?
•Not enough information?
•Reductionist models (e.g. H/P) often too simple
What are the fundamental
principles needed to
predict pathways and
structures?
Two aspects of “The Protein Folding Problem”:
- Mechanistic studies:
“How does it get from the U-state to the N-state?”
- Predict structure from sequence:
Common Strategy:
Use sequence similarity (homology)
to proteins with known structures.
…ALEVADKVAIYTRE…
…ALEVAEKVAIWTRE…
?
Use parts or full structure to predict new structure (homology).
Successful when homology exists, but side-steps The Question:
“What are the Principles?”
Why do they fold into specific structures?
Hydrophobic (greasy)
atoms in the center
away from water
Form hydrogen bonds
between all the
nitrogens and oxygens
Pack all atoms with no
spaces
[Figure labels: H2O (solvent), α helix, β strand, van der Waals interactions]
What level of representation is needed?
Cα
Cβ
Beads-on-a-string
Monomer structure & miscibility of polyolefins
K. F. Freed and J. Dudowicz, Adv. Polym. Sci. 183, 63-126 (2005).
Major themes and challenges in protein folding
1) Polymer bends in certain ways
2) Satisfy main-chain hydrogen bonds and form secondary structure.
[Figure: backbone dihedral angles ψ, φ]
3) Bury hydrophobic residues and pack the atoms
Must satisfy 1, 2 & 3 simultaneously
Vast Conformational Search: Levinthal Paradox
How does a protein find the time to fold?
Polypeptide backbone is flexible, adopting specific conformations
[Figure: Ramachandran map (ψ vs. φ) showing basin preferences: poly-proline II basin, beta basin, helical/turn basin]
Each φ,ψ pair: 3 conformations per amino acid, so ~$3^n \approx 10^{0.5n}$ conformations for n amino acids
e.g., $10^{50}$ states for a 100 a.a. protein
Random, unbiased search would take $10^{35}$ s even with fsec sampling
And we have to search too!
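The arithmetic behind the Levinthal estimate can be checked with a short sketch. It assumes 3 conformations per residue and one sampled conformation per femtosecond, as on the slide; the function name and defaults are illustrative only. The exact exponent is about 0.477n, which the slide rounds to 0.5n, so the sketch gives slightly smaller numbers than the rounded slide values.

```python
import math

def levinthal_estimate(n_residues, conformations_per_residue=3,
                       seconds_per_sample=1e-15):
    """Return (log10 of the number of states, log10 of the search time in s)."""
    # Number of states is conformations_per_residue ** n_residues;
    # work in log10 to avoid huge numbers.
    log10_states = n_residues * math.log10(conformations_per_residue)
    # An exhaustive search visits every state once at the given sampling rate.
    log10_seconds = log10_states + math.log10(seconds_per_sample)
    return log10_states, log10_seconds

log_states, log_seconds = levinthal_estimate(100)
print(f"~10^{log_states:.0f} states, ~10^{log_seconds:.0f} s")
# prints "~10^48 states, ~10^33 s"
```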
What info is needed to fold proteins? all-atom → dihedral angles
All-atom protein + solvent: simulation would take decades
Step 1: Remove solvent (hydrophobic, screening, etc. terms), leaving the all-atom protein
Step 2: Replace side chains by Cβ atoms.
Only main chain heavy atoms and Cβ atoms.
Backbone φ,ψ angles are the only coordinates.
2N total for N aa’s (need ~10×3N for x,y,z of heavy atoms)
Reduced representation:
Side chains are only Cβ
[Figure: backbone dihedral angles ψ, φ]
Retains the 3 themes
Big Challenge:
Retain sequence information lost with removal of side chains
1) Proteins bend in certain ways
Sampling in dihedral space: φ-ψ angles and Ramachandran BASINS
φ & ψ show very strong preferences for certain regions of the φ-ψ map, called Ramachandran basins (due to steric & electrostatic interactions).
[Figure: Ramachandran map (Psi vs. Phi) with labels β-basin (β), PPII-basin (PP2), ε, αL, α-basin, Extended, Helical]
Where to get this information?
From computer simulations? But force fields can vary widely (Zaman et al., JMB 2003).
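As a rough illustration of what assigning a residue to a Ramachandran basin could look like, the sketch below maps a (φ, ψ) pair to a coarse basin label. The boundary values are assumptions chosen for this example, not the basin definitions used in the talk, and the ε region is not treated separately.

```python
def ramachandran_basin(phi, psi):
    """Assign (phi, psi), in degrees within [-180, 180], to a coarse basin.

    The boundaries are illustrative assumptions, not published definitions.
    """
    if phi >= 0:
        return "alpha_L"      # left-handed helical region (positive phi)
    if -120 <= psi < 50:
        return "alpha"        # helical / turn region
    # Remaining psi values: the upper strip (psi >= 50) or wrapped (psi < -120).
    return "beta" if phi < -90 else "PPII"

for phi, psi in [(-60, -45), (-120, 130), (-75, 145), (60, 45)]:
    print(phi, psi, ramachandran_basin(phi, psi))
```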
Data mine the Protein Data Bank (PDB) of crystal structures:
Extract the distributions for each type of amino acid
[Figure: Ramachandran maps (ψ vs. φ) from the PDB for THR and ALA]
Ramachandran Map of ALA with neighbors=ALA, ASP
[Figure panels: ALA with ALA neighbor; ALA with ASP neighbor]
Map depends on neighbor type and conformation
Strong correlations in sequence
Move-set uses highly selected trimers from PDB
PDB → trimer library → filter by residue type & conformation → trimer move
Specifying side chain type and backbone geometry
implicitly includes all-atom side chain information
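A trimer move of this kind might be sketched as follows. The library contents, the data layout, and the function name are all hypothetical, and this version filters by residue type only (the slide also filters by conformation); it is a minimal sketch, not the move set used in the actual simulations.

```python
import random

# Hypothetical trimer library: residue-type triple -> observed backbone
# geometries, each a tuple of three (phi, psi) pairs. Values are made up.
TRIMER_LIBRARY = {
    ("ALA", "LEU", "GLU"): [
        ((-60.0, -45.0), (-62.0, -41.0), (-65.0, -40.0)),
        ((-120.0, 130.0), (-115.0, 125.0), (-130.0, 140.0)),
    ],
}

def trimer_move(sequence, angles, start, library, rng):
    """Replace the (phi, psi) angles of residues start..start+2 with those of
    a library trimer of the same residue types. Returns a new list."""
    key = tuple(sequence[start:start + 3])
    candidates = library.get(key)
    if not candidates:
        return list(angles)          # no matching trimer: chain unchanged
    new_angles = list(angles)
    new_angles[start:start + 3] = rng.choice(candidates)
    return new_angles

sequence = ["GLY", "ALA", "LEU", "GLU", "SER"]
angles = [(-150.0, 150.0)] * 5       # arbitrary extended starting chain
moved = trimer_move(sequence, angles, 1, TRIMER_LIBRARY, random.Random(0))
print(moved)
```

Because the angles are copied from a fragment with the same residue types, side chain information enters only implicitly, through which backbone geometries appear in the library.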
Knowledge-based energy function: Assign interaction energies
according to the observed distances in the PDB
[Figure: histogram of pair distances observed in the PDB; vertical axis: Probability (0 to 6000)]
ProbPDB(rij): probability of finding 2 atoms some distance apart, e.g. Dist(Cα_ala – Cα_val).
Energy = -ln(Prob):
$\mathrm{Energy}_{\mathrm{PDB}}(r_{ij}) = -\ln\left(\mathrm{Prob}_{\mathrm{PDB}}(r_{ij})\right)$
[Figure: energy vs. distance (Cα_ala – Cβ_val); distance axis 0 to 10]
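The relation Energy = -ln(Prob) can be illustrated with a small histogram-based sketch. The distances, bin width, and function name are made up for the example, and it applies the formula exactly as written on the slide, without the reference-state normalization that knowledge-based potentials commonly include.

```python
import math

def statistical_potential(distances, bin_width=1.0, max_distance=10.0):
    """Histogram pair distances and return {bin_start: -ln(probability)}
    for every occupied bin."""
    n_bins = int(round(max_distance / bin_width))
    counts = [0] * n_bins
    for d in distances:
        if 0.0 <= d < max_distance:
            # Clamp guards against floating-point rounding at the upper edge.
            counts[min(int(d / bin_width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        return {}
    return {i * bin_width: -math.log(c / total)
            for i, c in enumerate(counts) if c > 0}

# Made-up distances (in Å) between one pair of atom types.
distances = [3.8, 5.2, 5.4, 5.9, 6.1, 9.5]
energy = statistical_potential(distances)
for r, e in sorted(energy.items()):
    print(f"{r:4.1f} Å  E = {e:.3f}")
```

Frequently observed distances get low energies, and distances never observed get no entry at all, which a real potential would have to handle (for example with pseudocounts).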
Secondary structure prediction methods…
Statistics: single residue and nearest neighbors
Chou and Fasman (1974)
e.g. QVD: secondary structure probability conditional on identity of neighbors
Machine Learning:
Neural Network: Jones (1999), Dor & Zhou (2007)
Support Vector Machines: Kim (2003), Hu et al. (2004)
Multiple sequence alignments + SASA, …: McGuffin and Jones (2003): PSIPRED; Dor and Zhou (2007)
Current Accuracy of Best Methods is 75 – 80 %:
Good for α, but poorer for β and α/β
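Accuracy figures of this kind are usually per-residue scores over three states (helix, strand, coil), often called Q3. A minimal sketch of that calculation, assuming one state letter per residue and made-up example strings:

```python
def per_residue_accuracy(predicted, native):
    """Fraction of residues whose predicted state (e.g. H, E, C) matches
    the native state."""
    if len(predicted) != len(native):
        raise ValueError("sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)

# Made-up 12-residue example: 10 of 12 positions agree.
accuracy = per_residue_accuracy("CCHHHHCCEEEC", "CHHHHHCCEEEE")
print(f"{accuracy:.1%}")
```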
What prevents accuracy of secondary structure prediction from reaching ~90%?
Secondary structure often depends on long-range
interactions, i.e. tertiary structure
This is supported by the following studies:
•The same fragment in different parts of protein G forms varying secondary structures (Minor and Kim, 1996)
•Secondary structure prediction accuracy decreases with increasing contact order
•The same sequence fragment can be found in multiple native secondary structure types
Attempt to integrate secondary and tertiary structure
Add 3-D models to Machine Learning
Folds target & homologs (~ 5% gain)
Kihara (2005)
Pan et al. (1999)
Jacobini et al. (2000)
Zhou et al. (2000)
Ikeda and Higo (2006)
Neural Network + tertiary
Meiler and Baker (2003)
What we do differently:
Couple secondary and tertiary structure
during the folding process
Restrict possible secondary structure as
the chain folds
Tertiary
structure
Secondary
structure
“Eliminate all other factors, and the one which remains must be the truth.”
A. C. Doyle, The Sign of the Four (1890)
ItFix Algorithm
Trimer library: Full PDB. Keep track of where the trimers came from (Helix, extended, coil)
Starting configuration (no SecStr restriction)
Round 1:
Stage 1. Fold 100 times with trimers from entire PDB (filter by residue identity only)
Stage 2. Refine each Stage 1 end structure
Eliminate an option (H,E,C) if PHelix, PCoil < 10%, PExt < 5%
Round 2:
Stage 1. Fold 100 times with restricted library
Stage 2. Refine each Stage 1 end structure
Rounds 3, 4, …:
Stage 1. Fold 100 times with restricted library
Stage 2. Refine each Stage 1 end structure
Repeat until no further fixing is possible…
Final Round:
Stage 1. Fold 100 times with restricted library
Stage 2. Refine each Stage 1 end structure
Predicted secondary structure and 3-D model
[Figure legend: helix, strand, Not(Strand), Not(Helix), Coil: B, G, I, S, T, N]
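The iterate-and-eliminate logic of the rounds can be sketched as below. Only the thresholds (10% for helix and coil, 5% for extended) and the count of 100 folding runs come from the slide; the function names, the data layout, and especially `toy_fold`, a deterministic stand-in for the real Stage 1 folding and Stage 2 refinement, are assumptions made for illustration.

```python
STATES = ("H", "E", "C")                         # helix, extended, coil
THRESHOLDS = {"H": 0.10, "C": 0.10, "E": 0.05}   # cutoffs from the slide

def state_frequencies(structures, n_res):
    """Per-residue frequency of each state over one round's end structures."""
    return [
        {s: sum(ss[i] == s for ss in structures) / len(structures)
         for s in STATES}
        for i in range(n_res)
    ]

def eliminate(allowed, freqs):
    """Drop rarely sampled states from each residue's allowed set.
    Returns True if anything was removed."""
    changed = False
    for options, freq in zip(allowed, freqs):
        for s in STATES:
            if s in options and len(options) > 1 and freq[s] < THRESHOLDS[s]:
                options.discard(s)
                changed = True
    return changed

def itfix(n_res, fold, n_runs=100, max_rounds=20):
    """Repeat fold -> tally -> eliminate until no further option is fixed."""
    allowed = [set(STATES) for _ in range(n_res)]
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        structures = [fold(allowed) for _ in range(n_runs)]
        if not eliminate(allowed, state_frequencies(structures, n_res)):
            break
    return allowed, rounds

def toy_fold(allowed):
    """Hypothetical stand-in for folding: picks helix whenever it is allowed."""
    return "".join("H" if "H" in opts else sorted(opts)[0] for opts in allowed)

allowed, rounds = itfix(5, toy_fold)
print(allowed, rounds)
```

With this toy folder every residue is fixed to helix in the first round and the second round finds nothing left to eliminate, so the loop stops. A real folding function would sample from a trimer library restricted to the states still allowed at each position.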
Iterations mimic steps in folding pathway
[Figure: major folding pathway from the unfolded state to the native state, with steps labeled β1-β2 hairpin, + helix, + β4, + β3, + 3_10 helix, + β5. Panels show secondary structure frequency (0 to 1) vs. residue index for successive rounds (Round 0 through Round 9) and the final prediction, with elements β1, β2, helix, β3, β4, 3_10, β5 marked.]
Cα-RMSD 3.1 Å
Figure 4
A: 1af7: prediction 2.9 Å, best 2.5 Å
   1r69: prediction 4.2 Å, best 2.4 Å
   1ubq: prediction 5.3 Å, best 3.1 Å
B, C: Energy vs. Cα rmsd for 1af7, 1r69, 1ubq
[Figure: secondary structure frequency (0 to 1) vs. residue index, by round (Round 0 through Round 8), for 1b72A, 1di2, 1r69, 1af7]
1AF7 3.4 Å RMSD
1TIF 5.4 Å RMSD
1SAP 7.8 Å RMSD
Novel Aspects
•Predict 2° & 3° structure without using homology
•Use principles of protein structure and folding:
i) couple 2° & 3° structure formation
ii) sequential stabilization
•Iterative fixing to reduce the search.
•Use a Cβ representation
•Potential function: orientational & 2° structure dependence in a Cβ model
•Q8-level 2° structure, (φ,ψ) prediction, can outperform PSIPRED
•Outputs pathway information
Conclusions
We couple the formation of tertiary structure and the identification of
secondary structure to improve both types of structure prediction
Goal: Predict secondary structures – Exceed the 80% barrier… without
homology
Future applications:
Improved ab initio structure prediction
Assist in identifying and ranking of threading targets
Future improvements:
Better energy functions = better folding and predictions
Bigger proteins – how far can we push secondary structure accuracy even when
we fail to determine the folded structure?
Acknowledgements
Prof. Tobin Sosnick
Joe DeBartolo - ab initio folding, secondary structure prediction
Dr. Andres Colubri – Folding simulations, software
Dr. Abhishek Jha (MIT) – coil library, unfolded state, α, β propensities,
structure refinement, electrostatics
James Fitzgerald (Stanford) - Statistical potentials, torsional dynamics
Prof. M. Zaman (UT Austin) – Simulations on peptides
All-atom statistical potential
Dr. Min-yi Shen, Prof. A. Sali, UCSF
Funding: NIH, NSF, Burroughs Wellcome Fund