bystroff_12jun01 - Rensselaer Polytechnic Institute

Protein Folding Initiation Site Motifs
Chris Bystroff
Dept of Biology
Rensselaer Polytechnic Institute, Troy, NY
ATCTGTATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATCGGATTTA
TTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGCGCATTAG
CAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGTCAGTAGT
AGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACGTACGTCA
TGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCAGTCATAT
GCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGATCGATCG
GATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTACGTACGT
AGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACTGCAGTC
AGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATGCCGTAG
GCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATACTGCCCA
AAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACCACGATC
GGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGTGGTTGC
GCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCCCACAGT
CAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGTCAGTCATACG
TACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGCATCCCAGTCA
GTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCAGTCAGACTGA
TCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACTGCATGCAGTA
CGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGACTGACTGACT
GCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATGATCGATCGTACATGCAGATG
CCGTAGGCTAGCTAGCTAGCACTACGATGCATGCTAGCTAGCTACGACCAGTACCATGATGACTGCATGATCATAC
TGCCCAAAAAACGACTTAATCGTATCGTATTTCTGGHACCCCCTGATGTAAAAGAGAGTTCTATATTACTACAACC
ACGATCGGATTTATTTTGGTCTADCAGCTCAGGATCATCACAGGATTCAAATCCTATCATCAGGAGGGGGGTCGT
GGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAVATCC
CACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGGAGGGGGG
TCGTGGTTGCGCATTAGCAAAGTTGCAGTCAGTCGTCATGCAGCGACCACATACACACTGCATGCGCGTCTTCAV
ATCCCACAGTCAGTAGTAGTCACAGACCTCCAGTCAGTCGAGTACGACGTCAGTACGTCAGTCAGCCAGTCAGT
CAGTCATACGTACGTCATGCATACGTAGCTAGCAGACGCAGCATTACGTCGCGATCGATCGATCGGCATAGCAGC
ATCCCAGTCAGTCATATGCATAGTCGATCGACGTCAGTCATGAGATCGTACGAAATACGTAGCTGATCGACGTCA
GTCAGACTGATCGATCGGATTCAGTCACGATGCATGCTAGCAAAGTCAGCGCATGCTAGCTACGTAGTCAGTACT
GCATGCAGTACGTACGTAGACGTCAGTCAGTCAGTCATGATGCTAGCTAGCTACGTCACAGTCAGTCATGACTGA
CTGACTGACTGCAGTCAGTCATCGATACGTAGCTAGCTACGTCAGTCATGCAGTCAGTCATTGATCATGATCATAC
TGCCCAAAAAACGACTTA
Bioinformatics = sequence analysis
Biological sequences come in two types: DNA and protein
DNA has a four-letter alphabet
Protein has a 20-letter alphabet
Sequences are an abstraction. As such, they are treated
abstractly...
Sequence alignment
Phylogenetic trees
Gene finding
Data mining
behind the abstraction...
"A free-standing reality"
ATGCATCAGG
ACTAGCTATCA
GAATC
Any DNA sequence REPRESENTS a physical object,
and some DNA sequences translate to protein
sequences, which also REPRESENT physical objects.
Sequence = Structure
Structure = Function
Function = Life
__________________
∴ Sequence = Life
Sequence = Structure
The protein folding problem
Unfolded
Folded
This happens spontaneously (in water).
The problem with the protein folding problem.
Number of amino acid residues in a typical protein: 100
Approximate number of degrees of freedom per residue: 3
Estimated total number of conformations (= 3^100): 10^45
Time required to fold if all conformations are sampled at the rate
of 1 per 10^-15 s: 10^20 y
Time since the Big Bang: ~13 x 10^9 y
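The arithmetic behind this estimate can be checked with a short sketch. The function name and the seconds-per-year constant are illustrative assumptions, not part of the slides; the inputs (100 residues, 3 states per residue, one conformation per 10^-15 s) are the ones quoted above. Exact arithmetic gives values at least as large as the round figures on the slide, so the conclusion is unchanged.

```python
# Levinthal-style estimate: exhaustive sampling time vs. age of the universe.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed Julian year


def levinthal_years(n_residues=100, states_per_residue=3, samples_per_second=1e15):
    """Return (number of conformations, years needed to sample them all)."""
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return conformations, seconds / SECONDS_PER_YEAR


conformations, years = levinthal_years()
age_of_universe_years = 13e9  # figure quoted on the slide

print(f"conformations: {conformations:.2e}")
print(f"years to sample all: {years:.2e}")
print(f"ratio to age of universe: {years / age_of_universe_years:.2e}")
```

Whatever the exact rounding, a random search is many orders of magnitude too slow, which is why folding pathways must exist.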
pathways
folding pathways must exist
The protein is unfolded...
...something happens first...
...then something else happens.
Early events eliminate alternative pathways
What happens first?
Helix/coil transition: 10-100 ns
Beta-hairpin: 0.1-1.0 ms
Transient intermediates: < 1 ms
Equilibrium: 0.001-1.0 s
Local structure usually isn't stable
Helices and turns form quickly but just as quickly fall apart.
Most short peptides (<20aa) do not show structural stability in
NMR studies.
Exceptions:
A few short peptides have been shown to be conformationally
stable (for example Met-enkephalin = YGGFM)
Interesting parallels between bioinformatics and semantics
language      proteins
letters       amino acids
words         motifs
phrases       modules
sentences     whole proteins
meaning       structure
literature    genome
grammar       folding??
[Slide background: the same block of DNA sequence text shown on the title slide]
Does anyone know the words?
What if we use the enormous database of protein sequences to
find recurrent short patterns?
Those short patterns would be the words.
But, are they "meaningful words"?
(Does the sequence correlate with the local structure?)
Maybe, protein folding pathways
can be found in protein sequence
"grammar"
1. Letters
2. Words
3. Phrases
4. Sentences
Amino acids can be grouped
Pairwise substitution scores (upper triangle; the values match the BLOSUM62 matrix):
    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2
C       9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2
D           6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3
E               5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2
F                   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3
G                       6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3
H                           8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2
I                               4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1
K                                   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2
L                                       4   2  -3  -3  -2  -2  -2  -1   1  -2  -1
M                                           5  -2  -2   0  -1  -1  -1   1  -1  -1
N                                               6  -2   0   0   1   0  -3  -4  -2
P                                                   7  -1  -2  -1  -1  -2  -4  -3
Q                                                       5   1   0  -1  -2  -2  -1
R                                                           5  -1  -1  -3  -3  -2
S                                                               4   1  -2  -3  -2
T                                                                   5   0  -2  -2
V                                                                       4  -3  -1
W                                                                          11   2
Y                                                                               7
Sequence alignments show evolutionary diversity
Sequence alignment
•••
VIVAANRSA
VIVSAARTA
VIASAVRTA
VIVDAGRSA
VIASGVRTA
VIVAAKRTA
VIVSAVRTP
VIVSAARTA
VIVSAVRTP
VIVDAGRTA
VIVDAGRTA
VIVSGARTP
VIVDFGRTP
VIVSATRTP
VIVSATRTP
VIVGALRTP
VIVSATRTP
VIVSATRTP
VIASAARTA
VIVDAIRTP
VIVAAYRTA
VIVSAARTP
VIVDAIRTP
VIVSAVRTA
VIVAAHRTA
Sequence profiles are condensed
sequence alignments
Sequence profile
(Gribskov)
•••
$$P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k}$$
Red = high prob ratio (>3)
Green = background prob ratio(~1)
Blue = low prob ratio (< 1/3)
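A minimal sketch of building such a profile from an alignment, assuming uniform sequence weights when none are given and an ungapped alignment. The function name `profile` and the choice of four example sequences (taken from the alignment shown earlier) are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile(seqs, weights=None):
    """Weighted frequency of each amino acid (i) at each alignment position (j)."""
    if weights is None:
        weights = [1.0] * len(seqs)  # assumption: equal weights
    total = sum(weights)
    columns = []
    for j in range(len(seqs[0])):
        column = {aa: 0.0 for aa in AMINO_ACIDS}
        for seq, w in zip(seqs, weights):
            column[seq[j]] += w  # delta(s_kj = aa_i) picks exactly one amino acid
        columns.append({aa: v / total for aa, v in column.items()})
    return columns


alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA"]
p = profile(alignment)
print(p[2]["V"], p[2]["A"])  # third column is V in three sequences, A in one
```

Dividing each entry by a background frequency would give the probability ratios used for the red/green/blue coloring; that step is left out here.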
Clustering profiles
each dot represents a different 1-residue profile
did it!
Resulting clusters:
KQR
AST
ACS
WYF
APG
DEN
ILVM
HY
"Kmeans" clustering
“distance” between two points j and k (one position, l = 1):
$$\mathrm{distance}(j,k) = \sum_{i=1}^{20} \left| P_{ijl} - P_{ikl} \right|, \quad l = 1$$
Protein sequence grammar
1. Letters: amino acid profiles
2. Words
3. Phrases
4. Sentences
Clustering profile segments, length L
each dot represents a different short profile
~120,000 segments
[Figure: two example segment profiles; x-axis: position (26-32); y-axis: amino acid (G P D E K R H S T N Q A M Y W V I L F C)]
~800 clusters for each L
L = 3 to 15
“distance” between segments j and k =
$$\mathrm{distance}(j,k) = \sum_{l=1}^{L} \sum_{i=1}^{20} \left| P_{ijl} - P_{ikl} \right|$$
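The segment distance can be sketched as a sum of absolute differences over positions and amino acids. The representation (a segment as a list of per-position probability lists) and the names `segment_distance` and `nearest_cluster` are assumptions made for illustration. The toy example uses a two-letter alphabet to stay short, whereas the real profiles have 20 entries per position.

```python
def segment_distance(seg_j, seg_k):
    """Sum over positions l and amino acids i of |P_ijl - P_ikl|."""
    if len(seg_j) != len(seg_k):
        raise ValueError("segments must have the same length L")
    return sum(
        abs(p_j - p_k)
        for col_j, col_k in zip(seg_j, seg_k)  # positions l = 1..L
        for p_j, p_k in zip(col_j, col_k)      # amino acids i
    )


def nearest_cluster(segment, centers):
    """Index of the cluster center closest to the segment (one k-means assignment step)."""
    return min(range(len(centers)), key=lambda c: segment_distance(segment, centers[c]))


# Toy profiles over a two-letter alphabet, two positions each.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[0.0, 1.0], [0.0, 1.0]]
print(segment_distance(a, b))
print(nearest_cluster(a, [b, a]))
```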
Learning the structure of each sequence cluster
(an iterative cycle)
1. Start from the profile of a cluster.
2. Search the database for the 400 nearest neighbors.
3. This gives a cluster of nearest neighbors.
4. Remove all cluster members that do not conform with the paradigm.
5. Rebuild the profile of the cluster and repeat.
After convergence, a cross-validation test is done.
I-sites library of sequence
structure motifs
1000's of sequence clusters
supervised learning
Cross-validation
262 motifs
Number of different motifs after removing register variants: 31
Example of a motif
Sequences that
match sequence
profile....
...tend to have
the same
structure...
...and this is it.
Clustering finds previously known
sequence-structure motifs
amphipathic α-helix: p•nppn•
α-helix N-cap: nS••En•p
amphipathic β-strand: •n•n
Many new motifs are found
Diverging type-2 turn
Frayed helix
Serine hairpin
Proline helix C-cap
Type-I hairpin
alpha-alpha corner
glycine helix N-cap
Why are there motifs in proteins?
Ancient conserved regions?
Selection for stability?
Folding initiation sites?
Structural features seem to drive
clustering.
1. glycine at strained angles
2. conserved sidechain contacts
3. negative design against alternative structures (helix)
[Figure: backbone-angle plot with axes φ and ψ]
I-sites sequence patterns are distinct
Columns: motif number; motif; number of clusters; number of sites per 100 positions (overall, and with confid. > 0.60); average boundaries: mda° (len), dme, rmsd; pattern of conserved non-polar residues.
 #  Motif                   Clusters  Overall  Conf>0.60  mda° (len)  dme   rmsd  Non-polar pattern
 1  Amphipathic α-helix        13       3.1      0.9       56 (15)    0.71  0.78  1-4-8, 1-5-8
 2  Non-polar α-helix           6       0.9      0.12      54 (11)    0.58  0.40  1-4-8, 1-5-8
 3  Schellman cap Type 1        6       0.09     0.07      81 (15)    1.01  1.02  1-6-9-11
 4  Schellman cap Type 2       10       0.3      0.14      76 (15)    0.94  0.94  1-6-8-9
 5  Proline α-helix C cap      10       1.8      0.6       92 (13)    1.07  0.89  1-2-5-8
 6  Frayed α-helix              2       1.2      0.13      75 (15)    0.96  0.69  1-5-9-13
 7  Helix N capping box        10       1.1      0.6       99 (15)    0.95  0.65  1-6-9-13
 8  Amphipathic β-strand        8       6.8      2.1       89 (6)     0.87  0.87  1-3, 1-3-5
 9  Hydrophobic β-strand        5       2.3      0.3      101 (7)     0.91  0.91  1-2-3
10  β-bulge                     2       0.5      0.15     100 (7)     0.97  0.78  1-4-6
11  Serine β-hairpin            4       1.3      0.3       94 (9)     0.76  0.81  1-8
12  Type-I hairpin              2       0.07     0.04      80 (13)    0.94  1.23  1-7-8
13  Diverging Type-II turn      4       0.3      0.14      87 (9)     1.04  1.00  1-7-9
(Bystroff & Baker, J. Mol. Biol, 1998)
A hypothesis:
I-sites sequence motifs are folding initiation sites.
• The I-sites sequence patterns are mutually exclusive.
• Each I-sites motif is found in a variety of contexts.
• Local structure forms fast.
• Early-folding units 'initiate' folding.
One reason this hypothesis may be wrong:
Database statistics may reflect bias in the data.
maybe not...
Alpha helices may fold by
packing interactions.
Dots show positions of alpha-carbons relative to the amphipathic
helix motif. The hydrophobic side is up.
How do we test this hypothesis?
• See if I-sites peptides fold in isolation from the rest
of the protein.
... by NMR.
... by simulation.
NMR structure of a 7-residue I-sites motif in isolation
diverging turn
(Yi et al, J. Mol. Biol, 1998)
Partial literature search of peptide NMR structures
I-sites motif        Authors    Date
glycine helix cap    Viguera    1995
serine hairpin       Blanco     1994
Type-I hairpin       deAlba     1996
diverging turn       Sieber     1996
Molecular dynamics
... is a cheap substitute for an NMR spectrometer.
What is MD?
• A simulation of the dynamic behavior of the molecule in
water, using "first principles."
Advantages?
• You can observe the system directly.
Disadvantages?
• It's not a real system, just an approximation.
Sequences
Helical peptide simulations
AAALDRMR
AALEALLR
AANRSHMP
AARYKFIE
ADFKAAVA
AFDGETEI
AKELVVVY
AKGVETAD
ARFTKRLG
ATLEEKLN
CNGGHWIA
DAVTRYWP
DEAIDAYI
DELTRHIR
DYVRSKIA
EDLVERLK
EELKQALR
EEMVSKLK
EKLLESLE
EKPFGTSY
EQIKAAVK
FHMYFMLR
FSVMNDAS
FYSSYVYL
GQLMALKQ
HNLIEAFE
IEHTLNEK
IQNGDWTF
KAAIAQLR
KKYRPETD
KNPDNVVG
KPMGPLLV
KQAHPDLK
KQDKHYGY
KSYLRSLR
LDLHQTYL
NAVWAAIK
NETHSGRK
NFLEVGEY
NPVKESRH
PAIISAAE
PLQHHNLL
PRDANTSH
QDDARKLM
QGIIDKLD
QKMKTYFN
QTLAQLSV
RDFEERMN
RIILDRHR
RLLLKAYR
RPIARMLS
RVLGRDLF
SCDVKFPI
TEVMKRLV
TLNEKRIL
YASLRSLV
YESHVGCR
• AMBER (parm94) force field.
• Randomly chosen natural sequences.
• Initially extended.
• 800-900 waters added.
• Ions added (Na, Cl).
• 7-30 ns at 340 K
The MD scheme
• Select random peptides and predict how much helix they will
have, using the I-sites motif pattern.
• Run LONG simulations.
• Test to see whether they have reached equilibrium.
• If they have, find out how much of the time the peptide spent
in a helical state. (by cluster analysis)
• Does the fraction helix correlate with the prediction?
Cluster analysis of trajectories
1) Define a node for every step in the trajectory, keeping
the backbone angles (θ).
2) For each node, draw an edge to every other node
for which max(Δθ) < 60°.
3) The node with the most edges defines the first
cluster. Remove it and all its neighbors. The node with the
most remaining edges then defines the second cluster, and so on.
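The three steps above can be sketched as a greedy neighbor-count clustering. This is an illustration under stated assumptions, not the code used in the study: frames are tuples of backbone angles in degrees, angle differences wrap around at 360°, ties are broken by the lowest frame index, and the function names are hypothetical.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def max_deviation(x, y):
    """Largest per-angle difference between two frames."""
    return max(angle_diff(a, b) for a, b in zip(x, y))


def greedy_clusters(frames, cutoff=60.0):
    """Repeatedly take the frame with the most remaining neighbors as a cluster center."""
    n = len(frames)
    neighbors = {
        i: {j for j in range(n) if j != i and max_deviation(frames[i], frames[j]) < cutoff}
        for i in range(n)
    }
    remaining = set(range(n))
    clusters = []
    while remaining:
        center = max(sorted(remaining), key=lambda i: len(neighbors[i] & remaining))
        members = {center} | (neighbors[center] & remaining)
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters


# Toy trajectory: three helix-like frames and two strand-like frames (phi, psi).
frames = [(-60, -45), (-65, -40), (-58, -50), (-120, 130), (-125, 135)]
clusters = greedy_clusters(frames)
print(clusters)
```

The all-pairs neighbor search is quadratic in the number of frames, which is acceptable for a sketch but would need care for long trajectories.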
Clusters in conformational space
RPIARMLS
Our criteria for good clustering: no two clusters look
alike, and no cluster looks like two.
This is what a trajectory looks
like if it has reached equilibrium
[Plot: cluster number vs. time (ns)]
Both halves of the trajectory have about the same distribution.
This is what it looks like if it has
not.
[Plot: cluster number vs. time (ns)]
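One way to make the "both halves have about the same distribution" check quantitative is to compare cluster occupancies in the two halves of the trajectory. The slides do not say which statistic was used; the total variation distance below is an assumption chosen for simplicity, and the function names are illustrative.

```python
from collections import Counter


def occupancy(labels):
    """Fraction of frames spent in each cluster."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in counts}


def half_difference(labels):
    """Total variation distance between cluster occupancies of the two halves.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    mid = len(labels) // 2
    first, second = occupancy(labels[:mid]), occupancy(labels[mid:])
    clusters = set(first) | set(second)
    return 0.5 * sum(abs(first.get(c, 0.0) - second.get(c, 0.0)) for c in clusters)


equilibrated = [1, 2, 1, 2, 1, 2, 1, 2]      # same clusters visited throughout
not_equilibrated = [1, 1, 1, 1, 2, 2, 2, 2]  # second half visits a new cluster
print(half_difference(equilibrated), half_difference(not_equilibrated))
```

A small value is necessary but not sufficient for equilibrium: a trajectory trapped in one meta-stable state would also score near zero.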
NAIIQELE movie
A rough energy landscape.
There is a correlation between I-sites
sequence score and the simulations
r=0.48 (all peptides)
r=0.61 (trajectories > 20ns long)
Sampling of sequence space
72 peptides were simulated. Is this a representative sample of
the space of amphipathic helix sequences?
I-sites motif
72 peptides,
weighted by %helix
72 peptides,
unweighted
What does this mean?
The MD experiment separates the local effects from the nonlocal effects on helix formation.
In the simulation, there are only local interactions.
So the propensity for amphipathic sequences to form helix is
mostly intrinsic.
Outliers
• Simulation too short.
We see only meta-stable states.
• I-sites scoring method is missing something.
Using additive probabilities ignores statistical
dependence between different positions.
(+-) and (-+) look just like (++) and (--)
• Part-helix was not counted as helix in this study.
Helix caps are competing motifs.
an outlier
QVFMRIME (a helix in 1dldA)
Predicted to be helix with confidence = 0.86
Zero helix found in 17ns trajectory. What does it fold into?
Protein sequence grammar
1. Letters: Amino acid profiles
2. Words: I-sites motifs
3. Phrases:
4. Sentences
Protein sequence grammar
1. Letters: Amino acid profiles
2. Words: I-sites motifs
3. Phrases: a hidden Markov model
4. Sentences
Motif “grammar”?
Arrangement of I-sites motifs in
proteins is highly non-random
helix
helix
cap
beta
strand
beta
turn
The dependencies can be modeled as a Markov chain
How to make a Markov chain
Sequence data
The dog bit the mailman. The mailman
kicked the dog back.
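Markov model
[Diagram: one state per word (the, dog, mailman, bit, kicked, back), with an arrow for each word-to-word transition seen in the sequence data]
Stochastic output
The dog back. The mailman kicked the mailman kicked the dog bit the dog bit
the dog bit the mailman kicked the dog.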
...
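A minimal sketch of the same idea in code: count word-to-word transitions in the two example sentences, then walk the chain at random. Treating the full stop as its own token, lower-casing the text and using a fixed random seed are assumptions made to keep the example small and repeatable.

```python
import random
from collections import defaultdict

TEXT = "The dog bit the mailman. The mailman kicked the dog back."


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.lower().replace(".", " .").split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start="the", n_words=20, seed=0):
    """Random walk through the chain; repeated successors are proportionally more likely."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(n_words - 1):
        choices = chain.get(output[-1])
        if not choices:
            break
        output.append(rng.choice(choices))
    return output


chain = build_chain(TEXT)
words = generate(chain)
print(" ".join(words))
```

Every adjacent pair in the output occurs somewhere in the input, yet the output as a whole need not make sense, which is the point of the slide.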
A "hidden" Markov model
What's "hidden" about it?
An HMM is a Markov chain where the meaning of the
Markov state is probabilistic.
How to make a hidden Markov chain
Sequence alignment data
The dog bit the mailman. The mailman kicked the dog back.
The dog attacked the postman. The postman hit the dog.
hidden Markov model
[Diagram: the same chain, but some states now emit one of several words with the probabilities shown]
the
dog
mailman 0.5 | postman 0.5
bit 0.3 | attacked 0.7
kicked 0.6 | hit 0.4
back
Stochastic output
The dog back. The mailman kicked the postman kicked the dog bit the dog bit
the dog attacked the mailman kicked the dog.
...
One Markov state from HMMSTR
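[Diagram: one state i, reached by transition a_hi from the previous letter(s) and leaving by transitions a_ij and a_ik to the next letter(s)]
Probabilistic meaning of the state:
b_i = {ACDEF...}    amino acid symbols (sequence profile)
r_i = {HGEBdblLex}  structure symbols (regions)
d_i = {HST}         structure symbols
c_i = {mnhd...}     structure symbols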
One state emits one
letter of each type
(b,r,d,c)
Constructing a HMM by aligning motifs
Merging many motifs into one HMM
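Related motifs, branched model.
[Diagram: two separate motif models, each an α helix followed by a Type-1 or a Type-2 G α C-cap, are merged into one model in which a shared α helix branches into the Type-1 and Type-2 G α C-caps]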
HMMSTR: Hidden Markov Model for local protein STRucture
282 nodes
317 transitions
Unified model for 31 distinct sequence-structure motifs
(Bystroff & Baker, J. Mol. Biol., 2000)
Variations on a motif theme are modeled
as parallel paths
Multiple state-pathways for the helix N-cap motif
Common sub-graphs represent common
sub-structures
These
peptide
segments
have the
same state
sequence
(except
shaded
residues)
How an HMM works
We have S (the sequence).
We want Q (the 1D structure), and
P (how well S fits Q)
$$P(Q \mid S) = \pi_{q_1} \, b_{q_1}(S_1) \prod_{i=2}^{N} a_{q_{i-1} q_i} \, b_{q_i}(S_i)$$
initiation probability
transition probability
emission probability
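As a worked example of the product of initiation, transition and emission probabilities along one state path, here is a sketch on a toy two-state model. The states ("H", "E"), the two-letter emission alphabet and all probabilities are invented for illustration and have nothing to do with the actual HMMSTR parameters.

```python
def path_probability(seq, path, init, trans, emit):
    """Initiation x emission for the first symbol, then transition x emission for the rest."""
    p = init[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p


# Hypothetical toy model: two states, two amino acid symbols.
init = {"H": 0.6, "E": 0.4}
trans = {"H": {"H": 0.8, "E": 0.2}, "E": {"H": 0.3, "E": 0.7}}
emit = {"H": {"A": 0.7, "V": 0.3}, "E": {"A": 0.2, "V": 0.8}}

p = path_probability("AAV", "HHE", init, trans, emit)
print(p)  # 0.6*0.7 * 0.8*0.7 * 0.2*0.8
```

For real sequences the product underflows quickly, so implementations normally sum log-probabilities instead; finding the best path Q also requires a dynamic-programming search over paths rather than scoring a single one.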
3-state secondary structure prediction
74.9%
correct
74.6%
correct
Predicting super-secondary context
Results are for the independent test set.
Fully-automated tertiary structure
prediction
Protocol used for CAFASP2 experiment (2000)
sequence
(1) Find homologues in the database (Psi-Blast)
(2) Predict local structure (HMMSTR)
(3) Assemble fragments (ROSETTA, D.Baker)
structure
Rosetta ab initio
Scoring function: Bayesian classification of pairwise secondary
structure contact types.
Search function: Monte Carlo fragment insertion. A move
consists of selecting a fragment at random from a set of local
structure predictions. Coordinates are re-generated after
swapping in the new fragment.
(Simons et al, PNAS, 1997)
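The search move described above can be sketched with backbone torsions standing in for coordinates. Everything specific here is an assumption for illustration: the toy score (count of non-helical residues) replaces the Bayesian scoring function, the accept-if-not-worse rule replaces whatever acceptance criterion Rosetta actually uses, and the fragment library is hand-made.

```python
import random

HELIX = (-60.0, -45.0)    # toy (phi, psi) pair
STRAND = (-120.0, 130.0)  # toy (phi, psi) pair


def insert_fragment(torsions, start, fragment):
    """Return a copy of the torsion list with the fragment swapped in at `start`."""
    new = list(torsions)
    new[start:start + len(fragment)] = fragment
    return new


def fragment_search(torsions, library, score, steps=200, seed=0):
    """Random fragment insertion, keeping a move only if the score does not get worse."""
    rng = random.Random(seed)
    best, best_score = list(torsions), score(torsions)
    positions = sorted(library)
    for _ in range(steps):
        start = rng.choice(positions)            # pick an insertion window
        fragment = rng.choice(library[start])    # pick a local structure prediction
        trial = insert_fragment(best, start, fragment)
        trial_score = score(trial)
        if trial_score <= best_score:
            best, best_score = trial, trial_score
    return best, best_score


def toy_score(torsions):
    """Hypothetical score: number of residues that are not helical."""
    return sum(1 for t in torsions if t != HELIX)


start_torsions = [STRAND] * 6
library = {0: [[HELIX] * 3, [STRAND] * 3], 3: [[HELIX] * 3, [STRAND] * 3]}
best, best_score = fragment_search(start_torsions, library, toy_score)
print(best_score)
```

In a real run the Cartesian coordinates would be rebuilt from the torsions after each insertion before scoring; the toy score skips that step because it reads the torsions directly.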
CASP3 Prediction results for Target 56 : DNA helicase
Predicted structure of 66-residue fragment (23-88)
True structure of same
fragment
CAFASP Prediction results for Target 122: 1GEQ Tryptophan Synthase
Predicted
97-residue fragment
True structure of same
fragment
Protein sequence grammar
1. Alphabet: amino acid profile
2. Words: I-sites motifs
3. Phrases: HMMSTR pathways
4. Sentences: contact maps
the next step...
In progress:
Data mining of contact maps
Protein sequences
+ contact maps
HMMSTR
predictions
Association-rule
mining (M. Zaki)
Rules for tertiary contacts
Predicting tertiary contacts
Contact predictions for 2igd
overall : 20% coverage w/20% accuracy
Can the 2D map be translated to 3D?
Bystroff Lab
Yu Shao
Xin Yuan
Jerry Huang
I-sites/HMMSTR collaborators
David Baker          U. Washington
Karen Han            UCSF
Vestienn Thorsson    U. Washington
Qian Yi              U. Washington
Edward Thayer        Zymogenetics
Shekhar Garde        RPI
Mohammed Zaki        RPI
Susan Baxter         Wadsworth (->Novartis)
Chip Lawrence        Wadsworth/RPI
Bobbie Jo Webb       Wadsworth
Kim Simons           U. Washington (->Harvard)
isites.bio.rpi.edu