Comparative Modeling for Beta
Protein Structure Prediction
Lenore J. Cowen
Tufts University
Amino Acids
A protein is composed of a central backbone and a
collection of (typically) 50-2000 amino acids
(a.k.a. residues).
There are 20 different kinds of amino acids each
consisting of up to 18 atoms, e.g.,
Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T
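As a small illustration of the naming scheme, here is a minimal sketch that converts three-letter codes into a one-letter sequence. It covers only the seven amino acids listed above; the names `THREE_TO_ONE` and `to_one_letter` are made up for this example.

```python
# Three-letter to one-letter codes for the seven amino acids listed above.
THREE_TO_ONE = {
    "Leu": "L", "Ala": "A", "Ser": "S", "Gly": "G",
    "Val": "V", "Glu": "E", "Thr": "T",
}

def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r] for r in residues)

print(to_one_letter(["Gly", "Ala", "Val", "Leu"]))  # GAVL
```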
Protein Structure
[Figure: chemical diagram of an eight-residue peptide, showing the repeating backbone structure (H3N+ ... COO-) with each residue's side chain attached]
Residues, in order: Asp (D), Arg (R), Val (V), Tyr (Y), Ile (I), His (H), Pro (P), Phe (F)
Protein sequence: DRVYIHPF
Protein Folding Problem
Given an amino acid sequence, e.g.,
MDPNCSCAAAGDSCTCANSCTCLACKCTSCK,
how will it fold in 3D?
The fold is important
because it determines
the function of the
protein.
Note: The pictures I’ve been giving
are “cartoons” of the backbone
The Inverse Protein Folding Problem
Instead of being given a sequence and asking
what its fold is, take a fold and ask for all
the sequences that form that fold.
…VLWIXS….
…SSCILWG…
What do we mean by “that fold”?
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
Can we recognize and model all
folds that form a beta-trefoil, etc.?
• If they are evolutionarily close enough the
answer is YES.
• Use BLAST to recognize homology (similar
sequences have similar folds) and align
conserved parts of the backbone.
…GVFIIIMGSHGK…
…GVD-LMG-HGR…
Comparative modeling
• Once the backbone of
the conserved core is
fixed, pack in the
sidechains.
• Add loops and
unstructured regions.
Can we recognize and model all
folds that form a beta-trefoil, etc.?
• But STRUCTURE can be more
CONSERVED than sequence: maybe the
structures align but we can no longer use
BLAST because the sequence similarity is
too weak.
…GVFIIIMGSHGK…
…GR--CV-GCAGR…
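To make "too weak" concrete, here is a minimal sketch that computes percent identity over the gap-free columns of the two aligned fragments above. It assumes the long dash in the second fragment stands for two gap characters, so that both rows have twelve columns. The function name `percent_identity` is illustrative and not part of BLAST.

```python
def percent_identity(a, b, gap="-"):
    """Percent of gap-free alignment columns holding the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

query = "GVFIIIMGSHGK"
target = "GR--CV-GCAGR"  # assumption: the long dash on the slide is two gaps
print(round(percent_identity(query, target), 1))  # 33.3
```

Only three of the nine gap-free columns match, and all three are glycines, which is the kind of signal a sequence-only search can easily miss.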
Comparative modeling
• If you CAN find the
correct alignment, you can
proceed as before.
• Once the backbone of the
conserved core is fixed,
pack in the sidechains.
• Add loops and
unstructured regions.
Approaches to Structural Motif Recognition
• Statistical template/profile methods (Altschul
et al. 1990)
• Hidden Markov Models (Eddy, 1998)
• Threading Methods (Jones et al. 1992)
• Combinations of two or more of the above
Our Results
Recognizing the Beta Helix and Beta
Trefoil Folds
The Right-handed Parallel Beta-Helix
A processive fold composed of repeated supersecondary units.
Each rung consists of three beta-strands separated by turn regions.
Pectate Lyase C (Yoder et al. 1993)
No sequence repeat.
Biological Importance of Beta Helices
Surface proteins in human infectious disease:
• virulence factors
• adhesins
• toxins
• allergens
Proposed as a model for amyloid fibrils
(e.g. Alzheimer’s and Creutzfeldt-Jakob)
Virulence factors in plant pathogens
What was Known
Solved beta-helix structures:
12 structures in PDB in 7 different SCOP families
Pectate Lyase:
Pectate Lyase C
Pectate Lyase E
Pectate Lyase
Pectin Lyase:
Pectin Lyase A
Pectin Lyase B
Galacturonase:
Polygalacturonase
Polygalacturonase II
Rhamnogalacturonase A
Chondroitinase B
Pectin Methylesterase
P.69 Pertactin
P22 Tailspike
BetaWrap Program
[Bradley, Cowen, Menke, King, Berger (2001), PNAS, 98(26), 14819-14824; Cowen, Bradley, Menke, King, Berger (2002), J Comp Biol, 9,
261-276]
Performance:
• On PDB: no false positives & no false negatives.
Recognizes beta helices in PDB across SCOP
families in cross-validation.
• Recognizes many new potential beta helices when
run on larger sequence databases.
• Runs in linear time (~5 min. on SWISS-PROT).
BetaWrap Program
Histogram of protein scores for:
• beta helices not in database (12 proteins)
• non-beta helices in PDB (1346 proteins)
Single Rung of a Beta Helix
3D Pairwise Correlations
[Figure: single rung of a beta helix, with labels B3, T2, B2, B1]
Stacking residues in adjacent beta-strands exhibit strong correlations.
Residues in the T2 turn have special correlations (asparagine ladder, aliphatic stacking).
Question: how can we find these correlations, which are a variable distance apart in sequence?
Finding Candidate Wraps
• Assume we have the correct locations of a
single T2 turn (fixed B2 & B3).
[Figure: fixed B2 and B3 strands around the T2 turn, with a candidate next rung]
• Generate the 5 best-scoring candidates for the
next rung.
Scoring Candidate Wraps (rung-to-rung)
Rung-to-rung alignment score incorporates:
• Beta sheet pairwise alignment
preferences taken from
amphipathic beta
structures in PDB.
(w/o beta helices)
• Additional stacking bonuses
on internal pairs.
• Distribution on turn lengths.
Scoring Candidate Wraps (5 rungs)
• Iterate out to 5 rungs generating candidate wraps:
• Score each wrap:
- sum the rung-to-rung scores
- B1 correlations filter
- screen for alpha-helical content
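The rung-by-rung iteration can be sketched as a small beam search. This is only an illustration under stated assumptions, not the BetaWrap implementation: a rung is reduced to the start position of one strand, `toy_rung_score` is a hypothetical stand-in for the real rung-to-rung alignment score, and the gap range and the choice to keep the 5 best partial wraps at each step are guesses. The B1 filter and the alpha-helix screen are omitted.

```python
import heapq

STRAND_LEN = 3  # hypothetical strand length used by the toy score

def toy_rung_score(seq, p, q):
    """Hypothetical stand-in score: count identical stacked residues."""
    return sum(1 for k in range(STRAND_LEN) if seq[p + k] == seq[q + k])

def extend_wraps(seq, start, n_rungs=5, beam=5, min_gap=4, max_gap=8,
                 score=toy_rung_score):
    """Grow wraps rung by rung, keeping the `beam` best partial wraps."""
    wraps = [(0, [start])]  # (summed rung-to-rung score, rung start positions)
    for _ in range(n_rungs - 1):
        candidates = []
        for total, rungs in wraps:
            last = rungs[-1]
            for gap in range(min_gap, max_gap + 1):
                nxt = last + gap
                if nxt + STRAND_LEN > len(seq):
                    continue  # the next strand would run off the sequence
                candidates.append((total + score(seq, last, nxt), rungs + [nxt]))
        if not candidates:
            return []
        wraps = heapq.nlargest(beam, candidates, key=lambda c: c[0])
    return wraps

seq = "ABCxx" * 4 + "ABC"
print(extend_wraps(seq, 0)[0])  # (12, [0, 5, 10, 15, 20])
```

The wrap score here is just the sum of the rung-to-rung scores, matching the first scoring step on the slide.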
Predicted Beta Helices
Features of the 200 top-scoring proteins in the
NCBI’s protein sequence database:
• Many proteins of similar function to the known beta helices; some with similar sequences.
• A significant fraction are characterized as microbial
outer membrane or cell-surface proteins.
• Mouse, human, worm and fly sequences significantly
underrepresented – only two proteins!
Some Predicted Beta Helices in Human Pathogens
Pathogen                    Disease
Vibrio cholerae             Cholera
Helicobacter pylori         Ulcers
Plasmodium falciparum       Malaria
Chlamydia trachomatis       Venereal infection
Chlamydophila pneumoniae    Respiratory infection
Listeria monocytogenes      Listeriosis
Trypanosoma brucei          Sleeping sickness
Borrelia burgdorferi        Lyme disease
Leishmania donovani         Leishmaniasis
Bordetella bronchiseptica   Respiratory infection
Trypanosoma cruzi           Sleeping sickness
Bordetella parapertussis    Whooping cough
Bacillus anthracis          Anthrax
Rickettsia rickettsii       Rocky Mtn. spotted fever
Rickettsia japonica         Oriental spotted fever
Neisseria meningitidis      Meningitis
Legionella pneumophila      Legionnaire’s disease
The Beta-Trefoil
The beta-trefoil consists of three leaves around an axis of
three-fold symmetry.
[Figure: a single leaf with strands B1-B4 and turns T1-T3 labeled and its cap and barrel regions marked, repeated x3 to form the entire trefoil (3 leaves); structure 1BFF (Kitagawa et al. 1991)]
Templates
[Figure: cap template, with strands B1-B4 and turns T1-T3 labeled]
A leaf template consists of:
• a B1-strand, followed by a T1 turn of length 2 to 17, followed by
• a B2-strand, followed by a T2 turn of length 0 to 11, followed by a B3-strand, followed by
• a T3 turn of length 4 to 20, followed by a B4 strand.
In addition, it is between 26 and 64 residues long.
A trefoil template consists of three leaf templates
separated by two T4 turns of length 0 to 16.
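The length constraints above translate directly into a validity check. The sketch below is a minimal illustration: the turn and leaf ranges come from the slide, while the strand lengths are not specified there and are simply taken as caller-supplied numbers. The function names are invented for this example.

```python
# Inclusive length ranges quoted on the slide.
T1_RANGE, T2_RANGE, T3_RANGE = (2, 17), (0, 11), (4, 20)
T4_RANGE = (0, 16)
LEAF_RANGE = (26, 64)

def in_range(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

def leaf_ok(b1, t1, b2, t2, b3, t3, b4):
    """Check one leaf, given the lengths of its strands and turns in order."""
    total = b1 + t1 + b2 + t2 + b3 + t3 + b4
    return (in_range(t1, T1_RANGE) and in_range(t2, T2_RANGE)
            and in_range(t3, T3_RANGE) and in_range(total, LEAF_RANGE))

def trefoil_ok(leaves, t4_turns):
    """Three valid leaves separated by two T4 turns of allowed length."""
    return (len(leaves) == 3 and len(t4_turns) == 2
            and all(leaf_ok(*leaf) for leaf in leaves)
            and all(in_range(t, T4_RANGE) for t in t4_turns))

leaf = (5, 4, 5, 3, 5, 6, 5)  # hypothetical strand and turn lengths, 33 residues
print(trefoil_ok([leaf, leaf, leaf], [2, 10]))  # True
```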
What Pairs Do We Consider?
[Figure: leaf with strands B1-B4 and turns T1-T3 labeled]
In both the barrel and the cap, we consider both directly aligned
pairs of residues and pairs of residues one-off from each other.
Different tables are used for pairwise preferences for buried,
exposed, and one-off pairs of residues.
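As a rough illustration of which pairs are meant, the sketch below enumerates directly aligned and one-off pairs for two strands. It assumes a simple index-for-index register between the strands, which is a simplification for this example; the real register and strand orientation come from the template. `strand_pairs` is a hypothetical name.

```python
def strand_pairs(a_start, b_start, length):
    """List directly aligned and one-off residue pairs for two paired strands."""
    aligned = [(a_start + k, b_start + k) for k in range(length)]
    one_off = []
    for k in range(length):
        for shift in (-1, 1):
            if 0 <= k + shift < length:  # stay inside the partner strand
                one_off.append((a_start + k, b_start + k + shift))
    return aligned, one_off

aligned, one_off = strand_pairs(0, 10, 3)
print(aligned)  # [(0, 10), (1, 11), (2, 12)]
print(one_off)  # [(0, 11), (1, 10), (1, 12), (2, 11)]
```

Each pair would then be looked up in the buried, exposed, or one-off preference table as appropriate.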
Packing moves earlier in the
modeling process
• In order to produce more accurate
sequence-structure alignments, we return
several possible “wraps” and try to pack
sidechains.
• So sidechain packing is used earlier in the
comparative modeling process; also to help
find the correct sequence-structure
alignment.
The Packing Function
Top wraps fed to packing function.
• SCWRL (Canutescu et al., 2003) is better at packing caps
than barrels.
• Input to SCWRL:
• Atomic coordinates of the backbone of cap strand
pairs from a member of each trefoil superfamily
in the training set.
• Top 4 wraps of the target sequence onto the
trefoil template.
• Return best-scoring wrap with a good packing, if one
exists, else reject.
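The accept-or-reject step can be sketched as follows. This is a minimal illustration, not the actual pipeline: `packs_well` is a hypothetical callable standing in for a SCWRL run followed by a check of the packing (for example, for steric clashes), and wraps are represented as `(score, wrap)` tuples.

```python
def select_wrap(wraps, packs_well, top_n=4):
    """Return the best-scoring of the top wraps that packs well, else None."""
    ranked = sorted(wraps, key=lambda w: w[0], reverse=True)[:top_n]
    for score, wrap in ranked:
        if packs_well(wrap):
            return score, wrap
    return None  # no top wrap packs well: reject the sequence

wraps = [(3.0, "w1"), (5.0, "w2"), (4.0, "w3")]
print(select_wrap(wraps, lambda w: w != "w2"))  # (4.0, 'w3')
```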
Example of the Packing Phase
Partial PDB file from actual trefoil:

ATOM  4340  N    LEU  B  196  41.442  …
ATOM  4341  CA   LEU  B  196  40.705  …
ATOM  4342  C    LEU  B  196  40.704  …
ATOM  4343  O    LEU  B  196  41.787  …
ATOM  4344  CB   LEU  B  196  41.441  …
ATOM  4345  CG   LEU  B  196  41.503  …
ATOM  4346  CD1  LEU  B  196  41.902  …
ATOM  4347  CD2  LEU  B  196  40.155  …
ATOM  4348  H    LEU  B  196  42.299  …
ATOM  4349  N    THR  B  197  39.524  …
ATOM  4350  CA   THR  B  197  39.397  …
ATOM  4351  C    THR  B  197  38.506  …
ATOM  4352  O    THR  B  197  37.700  …
ATOM  4353  CB   THR  B  197  38.704  …
ATOM  4354  OG1  THR  B  197  39.307  …
ATOM  4355  CG2  THR  B  197  38.808  …
ATOM  4356  H    THR  B  197  38.752  …
…

Known cap:  LTSKD STILL
            12345 67890

SCWRL predicted cap atomic positions:

ATOM  1   N    LEU  1  41.442  …
ATOM  2   CA   LEU  1  40.705  …
ATOM  3   C    LEU  1  40.704  …
ATOM  4   O    LEU  1  41.787  …
ATOM  5   CB   LEU  1  41.412  …
ATOM  6   CG   LEU  1  40.686  …
ATOM  7   CD1  LEU  1  39.364  …
ATOM  8   CD2  LEU  1  41.533  …
ATOM  9   N    ARG  2  39.524  …
ATOM  10  CA   ARG  2  39.397  …
ATOM  11  C    ARG  2  38.506  …
ATOM  12  O    ARG  2  37.700  …
ATOM  13  CB   ARG  2  38.788  …
ATOM  14  CG   ARG  2  39.658  …
ATOM  15  CD   ARG  2  38.984  …
ATOM  16  NE   ARG  2  39.799  …
ATOM  17  CZ   ARG  2  39.404  …
…

Cap from top wrap:  LRVYY RILHN
                    12345 67890

[Figure: cap strands B2 and B3 with positions 1-10 labeled, from 1ABR (Tahirov et al. 1995); a steric clash is marked]
Toward Automation
• For each SCOP beta-structural template:
* align all known examples of the fold
* find pairs in the conserved core
* thread onto the template (additionally use
profiles); find candidate alignments
* pack sidechains for each, determine the best
structure
* place loops and unstructured regions
Multiple Structure Alignment for
Remote Protein Homologs
• We spend the remainder of the talk
discussing our new program for multiple
structure alignment: MATT
The Multiple Structure Alignment
Problem
Input: atomic coordinates for the backbones of
m protein structures
Output: A sequence alignment of the protein
structures, together with a superimposition
of the structures in 3D space.
The Multiple Structure Alignment
Problem
Def: the common core of a
multiple structure alignment is the set
of positions where every
structure contributes a
residue to the alignment.
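Read as a statement about alignment columns, the definition can be sketched in a few lines. This minimal example assumes the alignment is given as equal-length strings with `-` marking gaps; the toy rows and the name `common_core` are made up for illustration.

```python
def common_core(alignment, gap="-"):
    """Return indices of columns where every aligned structure has a residue."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same aligned length")
    return [i for i, column in enumerate(zip(*alignment)) if gap not in column]

rows = ["GVFII-MG",
        "GV--LCMG",
        "G-FIL-MG"]
print(common_core(rows))  # [0, 4, 6, 7]
```

The common core size used as a quality measure is then simply the length of this list.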
The Multiple Structure Alignment
Problem
Geometric criteria:
Good multiple structure
alignments MAXIMIZE
common core size while
MINIMIZING pairwise
RMSDs between structures.
Note: even simplified versions
NP-Hard (Goldman, Istrail
and Papadimitriou, 1999)
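For reference, the RMSD between two structures over a set of aligned positions can be computed as in the sketch below. It assumes the two coordinate lists are already superimposed and matched position by position; finding the best superposition is a separate step not shown here. The function name is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))  # 1.0
```

The tension in the geometric criteria is visible here: adding more aligned positions to the core can only keep or raise the summed squared deviation, so core size and RMSD pull in opposite directions.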
The Multiple Structure Alignment
Problem
Discrimination criteria:
Good multiple structure
alignments align what is
“supposed to be aligned”
because it is part of the
evolutionarily conserved
core.
Approaches to Structure Alignment
• AFP chaining methods
align all short pieces and
chain together using
dynamic programming
• Contact map methods look
for similarities within
distance matrices
• Geometric hashing,
secondary structure
elements, etc.
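The chaining step of an AFP method can be sketched with a simple quadratic dynamic program. This is a generic illustration, not the algorithm of any particular aligner: each aligned fragment pair (AFP) is a tuple `(i, j, length, score)` with a hypothetical precomputed score, two AFPs are treated as compatible when one ends before the other starts in both structures, and gap or geometric-consistency penalties are left out.

```python
def chain_afps(afps):
    """Pick the highest-scoring chain of mutually compatible fragment pairs.

    Each AFP is (i, j, length, score): residues i..i+length-1 of structure A
    aligned to residues j..j+length-1 of structure B.
    """
    if not afps:
        return 0.0, []
    afps = sorted(afps)                 # order by start position in structure A
    best = [a[3] for a in afps]         # best chain score ending at each AFP
    prev = [None] * len(afps)
    for n, (i, j, _, score) in enumerate(afps):
        for m in range(n):
            pi, pj, plen, _ = afps[m]
            # The earlier AFP must end before this one starts in both structures.
            if pi + plen <= i and pj + plen <= j and best[m] + score > best[n]:
                best[n] = best[m] + score
                prev[n] = m
    end = max(range(len(afps)), key=lambda n: best[n])
    total = best[end]
    chain = []
    while end is not None:              # walk back through the predecessors
        chain.append(afps[end])
        end = prev[end]
    return total, chain[::-1]

afps = [(0, 0, 8, 5.0), (4, 20, 8, 9.0), (10, 10, 8, 6.0), (20, 30, 8, 3.0)]
print(chain_afps(afps))
# (14.0, [(0, 0, 8, 5.0), (10, 10, 8, 6.0), (20, 30, 8, 3.0)])
```

Note that the single best fragment, scoring 9.0, is left out because a chain of three weaker but mutually consistent fragments scores higher overall.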
Some Popular Structure Aligners
• Dali (Holm 93)
• VAST (Bryant 96)
• LOCK (Singh 97)
• FlexProt (Shatsky et al. 02)
• FATCAT (Ye & Godzik 04)
• LOVOALIGN (Andreani et al. 06)
• CE/CE-MC (Shindyalov 2000)
• SSAP (Orengo & Taylor 96)
• MultiProt (Shatsky & Wolfson 04)
• POSA (Ye & Godzik 05)
• Mustang (Konagurthu et al. 06)
• CBA (Ebert 07)
The Benchmark Datasets
• Globins
• Homstrad
– 1028 alignments
– Each alignment contains 2-41 structures
– 399 sets with > 2 structures
The Benchmark Datasets
Sabmark
Superfamily set:
– 3645 domains in 426 subsets
Twilight zone set:
– 1740 domains in 209 subsets
Both sets contain:
– Between 3 and 25 structures
– Decoy structures (sequence matches that reside
in different SCOP domains)
Matt: Multiple Alignment with
Translation and Twists
• Matt is an AFP
chaining method that
additionally adds
flexibility in the form
of geometrically
impossible bends and
breaks.
Other work modeling flexibility
• In structure alignment:
– Flexprot [Shatsky et al., 2002]
– Fatcat/POSA [Ye&Godzik, 2004, 2005]
• For other reasons:
– Molecular docking [Echols et al,03; Bonvin,06]
– Ligand binding [Lemmen et al, 2006]
– Decoy construction [Singh&Berger, 2006]
Outline of the Matt Algorithm
Results on Sabmark (Superfamily)
Program Name    Avg. Core Size    Avg. RMSD
Multiprot       68.701            1.498
Mustang         104.162           4.146
Matt            104.692           2.639
Results on Sabmark (Twilight Zone)
Program Name    Avg. Core Size    Avg. RMSD
Multiprot       36.54             1.536
Mustang         66.833            5.035
Matt            66.967            2.916
Sabmark Decoy Set
• For each SCOP superfamily, positive
examples of the fold, and negative examples
that are
– Random examples from a different superfamily
– Examples from a different superfamily that are
nonetheless good BLAST hits
Toward Automation
• For each SCOP beta-structural template:
* align all known examples of the fold
* find pairs in the conserved core
* thread onto the template (additionally use
profiles); find candidate alignments
* pack sidechains for each, determine the best
structure
* place loops and unstructured regions
On the Web
• BetawrapPro for predicting beta-helices and
beta-trefoils at:
http://betawrappro.csail.mit.edu
• Matt at: http://matt.csail.mit.edu OR
http://matt.cs.tufts.edu
Acknowledgements
• Matt Menke
• Andrew McDonnell
• Phil Bradley
• Bonnie Berger
• Jonathan King
• National Science Foundation