Ming Li Talk about Bioinformatics - the David R. Cheriton School of

Download Report

Transcript Ming Li Talk about Bioinformatics - the David R. Cheriton School of

Computational Protein Folding
Ming Li
Canada Research Chair in Bioinformatics
Cheriton School of Computer Science
University of Waterloo
T
A
A
T
C
G
T
A
Human: 3 billion bases, 30k genes.
E. coli: 5 million bases, 4k genes
cDNA
reverse transcription
A
G
C
G
T
C
G
T
C
G
T
A
mRNA
(A,C,G,U)
C
A
translation
transcription
Protein
(20 amino acids)
Codon: three nucleotides encode
an amino acid.
64 codons
20 amino acids, some w/more codes
A
T
Coding proteins
They are built from 20 amino acids and
fold in space into functional shapes
Several polypeptide chains can form
more complex structures:
Why should you care?
 Through 3 billion years of evolution, nature has
created an enormous number of protein structures
for different biological functions. Understanding
these structures is key to proteomics. Fast
computation of protein structures is one of the most
important unsolved problems in science today.
Much more important than, for example, the P≠NP
conjecture.
 We now have a real chance to solve it.
Proteins – the life story
 Proteins are building blocks of life. In a cell, 70% is
water and 15%-20% are proteins.
 Examples:
 hormones – regulate metabolism
 structures – hair, wool, muscle,…
 antibodies – immune response
 enzymes – chemical reactions
 Sickle-cell anemia: hemoglobin protein is made of 4
chains, 2 alphas and 2 betas. Single mutation from
Glu to Val happens at residue 6 of the beta chain.
This is recessive. Homozygotes die but
Heterozygotes have resistance to malaria, hence it
had some evolutionary advantage in Africa. 1 in 12
African Americans are carriers.
What happened in sickle-cell anemia
Mutating to
Valine.
Hydrophobic
patch on the
surface.
Codon: GTT
GTA,GTC,
GTG
Glu: Glutamic
acid, E,
Codon:
GAA,GAG
Mutating to
Valine.
Hydrophobic
patch on the
surface.
Hemoglobin
Amino acids
 There are 500 amino acids in nature. Only 20
(22) are used in proteins.
 The first amino acid was discovered from
asparagus, hence called Asparagine, in 1806.
All 20 amino acids in proteins are discovered
by 1935.
 Traces of glycin, alanine etc were found in a
meteorite in Australia in 1969. That brings the
conjecture that life began from extraterrestrial
origin.
20 Amino acids
 Hydrophobic amino acids







Alanine
Neutral
Valine
Non-polar
Phenylalanine
Proline
Methionine
Isoleucine
Leucine
 Charged Amino Acids




Aspartic acid
Glutamic acid
Lysine
Arginine
 Polar amino acids
 Serine
Polar: one positive
and one negative charged ends,
 Threonine
e.g. H O is polar, oil is non-polar.
 Tyrosine
 Histidine
 Cysteine
 Asparagine
 Glutamine
 Tryptophan
2
 Simplest Amino Acid

Glycine
The Φ and Ψ angles
 The angle at N-Cα is Φ
angle
 The angle at Cα-C’ is Ψ
angle
 No side chain is
involved (which is at Cα)
 These angles determine
the backbone structure.
Cα
Homologous proteins have similar
structure and functions
 Being homologous means that they have
evolved from a common ancestral gene.
Hence at least in the past they had the same
structure and function.
 Caution: old genes can be recruited for new
functions. Example: a structural protein in eye
lens is homologous to an ancient glycolytic
enzyme.
Conserving core regions
 Homologous proteins usually have conserved
core regions.
 When we model one protein after a similar
protein with known structure, the main
problem becomes modeling loop regions.
 Modeling loops can also depend on database
to some degree.
 Side chains: only a few side-chain
conformations frequently occur – they are
called rotamers, there is a such a database.
There are not too many candidates!
 There are only about 1000 topologically
different domain structures. There is no
reason whatsoever that we cannot compute
their structures accurately.
Protein data bank
http://www.rcsb.org/pdb/Welcome.do
 As of Oct 10, 2006 there
are 39323 structures.
 But there are only about
1000 unique folds.
 And its growth is very
slow. Each year, over
90% structures
deposited into PDB have
similar folds in PDB
already.
Why do proteins fold?



The folded structure of a protein is actually thermodynamically less favorable
because it reduces the disorder or entropy of the protein. So, why do proteins
fold? One of the most important factors driving the folding of a protein is the
interaction of polar and nonpolar side chains with the environment. Nonpolar
(water hating) side chains tend to push themselves to the inside of a protein
while polar (water loving) side chains tend to place themselves to the outside of
the molecule. In addition, other noncovalent interactions including electrostatic
and van der Waals will enable the protein once folded to be slightly more stable
than not.
When oil, a nonpolar, hydrophobic molecule, is placed into water, they push
each other away.
Since proteins have nonpolar side chains their reaction in a watery environment
is similar to that of oil in water. The nonpolar side chains are pushed to the
interior of the protein allowing them to avoid water molecule and giving the
protein a globular shape. There is, however, a substantial difference in how the
polar side chains react to the water. The polar side chains place themselves to
the outside of the protein molecule which allows for their interact with water
molecules by forming hydrogen bonds. The folding of the protein increases
entropy by placing the nonpolar molecules to the inside, which in turn,
compensates for the decrease in entropy as hydrogen bonds form with the polar
side chains and water molecules.
Marginal Stability
 The marginal stability
between native and
denatured states is
biologically important
 Control quantities of
some proteins
 Timing
 Must be able to degrade
and create proteins
easily.
 Fast turnover means
marginal stability
 Some enzymes need
structural flexibility.
How does nature fold proteins?
 X-ray studies show each sequence has a unique fold, although
having many sub-states with minor structural differences.
 How did they all get to there?
 Each protein did random search? This is impossible, timewise, the problem is NP-hard. In real life, they fold within 0.1
to 1000 seconds, in vivo or in vitro.
 They do parallel search, and once one found it, it starts
cascade effect (like the prion protein)? Project: show this is
also not possible.
 More likely: there is a fast kinetic folding pathway. The
obstacles on such pathway becomes key issues (such as
formation of wrong disulfide bonds etc)
 Finding the folding path experimentally is difficult since the
intermediates have very short lifetime.
Folding steps and molten globules
 Step 1: within a few
milliseconds, local secondary
structures form, also some
native like alpha helix and beta
strand positions. This is called
molten globule. Not unique.
 Step 2: lasts up to 1 second,
native elements and tertiary
structures begin to develop,
possibly sub-domains, although
not docked perhaps.
 Step 3: single native form is
reached, forming native
interactions, including
hydrophobic packing in the
interior & fixing surface loops.
Burying hydrophobic side chains
 The last step is the biggest mystery. There is a very little change in free
energy by forming the internal hydrophobic bonds for alpha and beta
structures since the in unfolded state, equally stable hydrogen bonds
can also be formed to water molecules!! Thus secondary structure
formation cannot be thermodynamic driving force of protein folding.
 On the other hand, there is a large free energy change by bringing
hydrophobic side chains out of contact with water and into contact with
each other in the interior of a globular entity.
 Thus a likely scenario:



Hydrophobic side chains partially buried very early
Thus it vastly reduces the number of possible conformations that need
to be searched because only those that are sterically accessible within
this shape can be sampled.
Furthermore, when side chains are buried, their polar backbone –NH
and –CO groups are also buried in a hydrophobic environment, hence
unable to form hydrogen bonds to water – hence they bond to each
other – so you get alpha and beta structures.
The α helix
The arrow indicates
direction from N to C terminal
Note: natural α
helices are
right-handed
Height: 5.4A
per turn.
Each residue
gives1.5A rise
5.4A
Hydrogen bond
Water molecule, H2O
Hydrogen bond (you know ionic bond
and covalent bond from high school)
–
+
H
Water
(H2O)
O
H
+
–
Ammonia
(NH3)
N
H
H
+
H
+
+
A hydrogen
bond results
from the
attraction
between the
partial positive
charge on the
hydrogen atom
of water and
the partial
negative charge
on the nitrogen
atom of
ammonia.
Walking on water
Antiparallel β strands
Side chains
in purple
Hydrogen bonds, note their unevenness
Core question
 Looking at the protein sequences of globular proteins, one finds
that hydrophobic side chains are usually scattered along the
entire sequence, seemingly randomly.
 In the native state of folded protein, ½ of these side chains are
buried, and the rest are scattered on the surface of the protein,
surrounded by hydrophilic side chains.
 The buried hydrophobic side chains are not clustered in the
sequence.
 Central Question: what causes these residues to
be selectively buried during the early and rapid
formation of the molten globule?
Folding pathways
 Ui‘s --- unfolded states, many
of them.
 Mi’s --- molten globule
states, i can be 1. Has most
secondary structures, but
less compact.
 Converging to F. During this
relatively slower process it
passes a high energy
transition state T.
 These facts have been
verified by NMR, hydrogen
exchange, spectroscopy, and
thermo-chemistry.
Web Lab Protein Structure Determination
 Wet Lab:
 X-ray crystallography
 NMR
 The wet lab technologies not only are slow
and expansive, but also they simply fail for:




Protein design
Alternative splicing
Insoluble proteins
Not to mention millions of proteins they can do
but will never finish.
Computational Approaches
RAPTOR: Protein Threading by
Linear Programming
 Make a structure prediction through finding an optimal
placement (threading) of a protein sequence onto each known
structure (structural template)
 “placement” quality is measured by some statistics-based
energy function
 best overall “placement” among all templates may give a
structure prediction
target sequence
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
template library
Threading
Threading Example
Introduction to Linear Program





Optimize (Maximize or Minimize) a linear objective function
 e.g.
2x+3y+4z
The variables satisfy some linear constraints. e.g.
1. x+y-z ≥ 1
2. 2x+y+3z=3
integer program (IP) =linear program (LP) + integral variables
LP can be solved within polynomial time --- Interior point method.
Simplex method also runs fast.
Polynomial time for IP is not likely. It is NP-hard, But:
 IP can be relaxed to LP, solve the non-integral version
 Branch-and-bound or branch-and-cut (may cost exponential
time)
Why Integer Programming?
 Treat pairwise potentials rigorously
 critical for fold-level targets
 Existing exact algorithms for pairwise potentials
 High memory requirement, or
 Expensive computational time
 Inflexibility, messy formulation
 Exploit correlations between various kinds of item
scores in the energy function
 99% real data generate integral solutions directly, no
branch-and-bound needed.
Previous approaches for threading
 Heuristic Algorithms
 Interaction-Frozen Algorithm (A. Godzik et al.)
 Monte Carlo Sampling (T. Madej et al.)
 Double dynamic programming (D. Jones et al.)
 Recursive dynamic programming (R. Thiele et al.)
 Exact Exponential Time Algorithms

Branch-and-bound (R.H. Lathrop et al.)


Exploit the relationship among various scoring
parameters, fast self-threading
Divide-and-conquer (Y. Xu et al.)

Exploit the topological structure of template
contact graphs
Formulating Protein Threading by LP
•
Protein Threading Needs:
1.
2.
3.
4.
Construction of Template Library
Design of Energy Function
Sequence-Structure Alignment
Template Selection and Model Construction
Threading Energy Function
how preferable to
put two particular
residues nearby: Ep
how well a residue
fits a structural
environment: Es
(Pairwise potential)
(Fitness score)
sequence similarity
between query and
template proteins: Em
alignment gap
penalty: Eg
(gap score)
(Mutation score)
Consistency with the secondary structures: Ess
E= Ep + Es + Em + Eg + Ess
Minimize E to find a sequence-structure alignment
A sample of detail:
 The objective function is




min E = Ep + Es + Em + Eg + Ess
Let xi,j indicate amino acid ai in the query sequence is
aligned to position j in the template structure. I.e. xi,j =
1 if ai is aligned to position j, otherwise xi,j=0.
Then if that position j is exposed to water, and ai is
hydrophobic, then we give a negative weight ai,j in the
environmental energy:
Es =  ai,j xi,j
Some contraints would be xi,j = {0,1}. Or in the LP
relaxation: 0 ≤ xi,j ≤ 1.
j=1..n xi,j =1
Contact Graph
template
1.
2.
3.
Each residue as a vertex
One edge between two
residues if their spatial
distance is within a given
cutoff.
Cores are the most
conserved segments in the
template: alpha-helix, betasheet
Simplified Contact Graph
Contact Graph and Alignment
Diagram
Contact Graph and Alignment
Diagram
Variables
 x(i,l) denotes core i is aligned to sequence position l
 y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to
position k at the same time.
 D[i] = set of positions core i can be aligned to. R[i,j,k] = set of positions core j
can be aligned to given core i is aligned to k.
Formulation 1
Minimize
E   ai ,l xi ,l   b(i ,l )( j ,k ) y(i ,l )( j ,k )
s.t.
xi ,l  xi 1,k  1
k< l
y(i ,l )( j ,k )  xi ,l x j ,k
Encodes
scoring system
x
lD[ i ]
i ,l
Eg , Ep
Es , Ess , Em
1
xi ,l , y(i ,l )( j ,k )  {0,1}
Encodes interaction structures:
the first makes sure no crosses;
the second is quadratic, but can
be converted to linear: a=bc is
eqivalent to: a≤b, a≤c, a≥b+c-1
Formulation used in RAPTOR
Minimize
E   ai ,l xi ,l   b(i ,l )( j ,k ) y(i ,l )( j ,k )
Eg, Ep
s.t.
xi ,l 
Encodes
scoring system
, l  D[i ]
y
, k  D[ j ]
( i ,l )( j , k )
kR[ i , j ,l ]
x j ,k 
x
lD[ i ]
y
( i ,l )( j , k )
lR[ j , k ,i ]
i ,l
1
xi ,l , y(i ,l )( j ,k )  {0,1}
Es, Ess, En
Encodes interaction
structures
Solving the Problem Practically
1. More than 99% threading instances can be
solved directly by linear programming, the
rest can be solved by branch-and-bound
with only several branch nodes
2. Less memory consumption
3. Less computational time
4. Easy to extend to incorporate other
constraints
CPU Time for CAFASP3 targets
Fold Recognition
 Support Vector Machines (SVM) Approach



Features are extracted from the alignments
A threading pair is treated as a positive pattern
only if they are in at least fold-level similarity
60,000 threading pairs are employed to train
SVM model.
 5% more targets are recognized by SVM
approach than the traditional z-Score
Lindahl Benchmark Test
RAPTOR
FUGUE
PSI-BLAST
HMMER-PSIBLAST
SAMT98-PSIBLAST
BLASTLINK
SSEARCH
THREADER
family
Top1
Top5
84.8
87.1
82.2
85.8
71.2
72.3
67.7
73.5
70.1
75.4
74.6
78.9
68.6
75.7
49.2
58.9
superfamily
Top1
Top5
47.0
60.0
41.9
53.2
27.4
27.9
20.7
31.3
28.3
38.9
29.3
40.6
20.7
32.5
10.8
24.7
fold
Top1
31.3
12.5
4.0
4.4
3.4
6.9
5.6
14.6
Top5
54.2
26.8
4.7
14.6
18.7
16.5
15.6
37.7
976*975 threading pairs are tested, the results of other servers are taken from
Shi et al.’s paper.
CASP5, CASP6, CASP7
 Held every 2 years.
 RAPTOR consistently ranked high since
CASP5. It was voted by CASP5 attendees as
the most novel approach, at http://forcasp.org
 62—100 targets each time. 48 hours allowed for
each target.
 No manual intervention.
 Evaluated by computer programs.
Example, CASP5 Target Category
CASP5
CM
CM/FR
FR(H)
FR(A)
CAFASP
3
HM easy
(family level)
HM hard
(superfamily
level)
FR (fold level)
# targets
20
12
30
NF/FR
Hard
Easy
Prediction Difficulty
CM: Comparative Modelling, HM: Homology Modelling
FR: Fold Recogniton, NF: New Fold
NF
RAPTOR Sensitivity on CASP5
FR targets
30 FR
targets
54 servers
Servers
Sum MaxSub Score
# correct
3ds5 robetta
5.17-5.25
15-17
pmod 3ds3 pmode3
4.21-4.36
13-14
RAPTOR
3.98
13
shgu
3.93
13
3dsn orfeus
3.64-3.90
12-13
pcons3
3.75
12
fugu3 orf_c
3.38-3.67
11-12
…
…
…
pdbblast
0.00
0
…
…
…
blast
0.00
0
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released on Dec., 2002.)
CAFASP3 Example
 Target ID: T0136_1
 Target Size:144
 Superimposable size
within 5Å: 118
 RMSD:1.9Å
Red: Experimental Structure
Blue/green: RAPTOR model
CASP6, T0199-2, ACE buffalo rank: 9th
From RAPTOR rank 1 model. TM=0.4183 MaxSub=0.2857. Good
parts: 116-134, 286-332
Left: predicted structure. Right: experimental structure
CASP6, T0203 ACE buffalo rank: 1st
From RAPTOR 2nd model. TM=0.6041, MaxSub=0.3485. Good parts:
19-57, 89-94, 139-178, 224-239, 312-372
RAPTOR first
Model ranks
5th
Predicted
Experimental
CASP6, T0262-2, ACE buffalo rank: 4th
From Fugue3 6th model. TM=0.4306, MaxSub=0.3459.
Good parts: 162-203
Fugue’s
top
model
ranks
low
Predicted
Experimental
CASP6, T0242, NF, ACE buffalo rank: 1
From RAPTOR rank 5 model.
TM score=0.2784, MaxSub score=0.1645
However,
RAPTOR top
model
ranks 44th !
Trivial error?
Predicted
Experimental
CASP6, T0238, NF ACE buffalo rank 1st
From RAPTOR 8th model TM=0.2748, MaxSub=0.1633
Good part: 188-237. High TM score, low MaxSub
Raptor
top
model
ranks 4th
Predicted
Experimental
About RAPTOR
 Jinbo Xu’s Ph.D. thesis work.
 The RAPTOR system has benefited
significantly from PROSPECT (Ying Xu, Dong
Xu, et al).
 References: J. Xu, M. Li, D. Kim, Y. Xu, Journal of
Bioinformatics and Computational Biology, 1:1(2003), 95-118.
J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics,
CASP5 special issue.
Old Paradigm
New RAPTOR, New Paradigm
Contact Prediction
Local Threading/
Large fragments
Short Fragment
selection
Super motif/domain
Modeling
Loop / Side Chain
Modeling
Assembly by
Molecular dynamics
Refinement
Global threading/
Old RAPTOR
Hydrophobic s.c.
Burying information
NMR constraints