Structure Classification


Protein Structure
Lesk, chapter 5
Details on SCOP and CATH can be found in
Structural Bioinformatics, Bourne/Weissig, chapter 12 and 13
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Folding
• Proteins are linear polymers: a mainchain with different amino acid side chains
• Proteins fold spontaneously, reaching a state of minimal energy
• Side chains and mainchain interact with one another and with the solvent
• Example movie
Jones, D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs. PROTEINS, Suppl. 1, 185-191
By Michael Schroeder, Biotec
Examining Proteins
• Specialised tools with different views of structure
• Corey, Pauling, Koltun (CPK)
  • Diameter of sphere ~ atomic radius
  • Hydrogen white, carbon grey, nitrogen blue, oxygen red, sulphur yellow
• Cartoon
• Wire
• Balls
Protein Folding
• Conformation of a residue
  • Rotation around the N-Cα bond: φ (phi)
  • Rotation around the Cα-C bond: ψ (psi)
  • Rotation around the peptide bond: ω (omega)
• The peptide bond tends to be
  • planar and
  • in one of two states:
    • trans, ω = 180° (usually), or
    • cis, ω = 0° (rarely, and mostly at proline)
Image taken from www.expasy.org/swissmod/course
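The torsion angles φ, ψ and ω are each defined by four consecutive backbone atoms. As a minimal sketch, the following pure-Python function computes a signed dihedral angle from four 3D points; the function name and the toy coordinates are assumptions for illustration, not taken from the slides.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal of the plane p0, p1, p2
    n2 = _cross(b1, b2)          # normal of the plane p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    y = _dot(b1, _cross(n1, n2)) / b1_len
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (hypothetical): cis-like, trans-like and a quarter turn.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

For φ the four atoms would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1).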
Sasisekharan-Ramakrishnan-Ramachandran plot
• Solid line = energetically preferred
• Outside dotted line = disallowed
• Most amino acids fall into the αR region (right-handed alpha helix) or the β region (beta-strand)
• Glycine has additional conformations (e.g. left-handed alpha helix = αL region) and conformations in the lower right panel
Image taken from www.expasy.org/swissmod/course
Ramachandran plot
Plot for a protein with
mostly beta-sheets
Image taken from www.expasy.org/swissmod/course
Example for conformations
Helices and Strands
• Consecutive residues in alpha or beta conformation generate alpha-helices and beta-strands, respectively
• Such secondary structure elements are stabilised by weak hydrogen bonds
• They are connected by turns or loops, regions in which the chain alters direction
• Turns are often surface exposed and tend to contain charged or polar residues
Alpha Helix
• Residue j is hydrogen-bonded to residue j+4
• 3.6 residues per turn
• 1.5 Å rise per residue
• Repeat (pitch) every 3.6 × 1.5 Å = 5.4 Å
• φ = -60°, ψ = -45°
Image taken from www.expasy.org/swissmod/course
Beta strand
Image taken from www.expasy.org/swissmod/course
Beta Sheets
Image taken from www.expasy.org/swissmod/course
Turn
• Residue j is hydrogen-bonded to residue j+3
• Often contains proline and glycine
Image taken from www.expasy.org/swissmod/course
How to Fold a Structure
• All residues must have stereochemically allowed conformations
• Buried polar atoms must be hydrogen-bonded
  • If a few are missed, it might be energetically preferable to bond these to solvent
• Enough hydrophobic surface must be buried and the interior must be sufficiently densely packed
• There is evidence that folding occurs hierarchically: first secondary structure elements, then supersecondary, …
  • This justifies a hierarchical approach when simulating folding
Structure Alignment
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Structure Alignment
• In the same way that we align sequences, we wish to align structures
• Let's start simple: how do we score an alignment?
  • Sequences: e.g. percentage of matching residues
  • Structures: RMSD (root mean square deviation)
Root Mean Square Deviation
• What is the distance between two points a with coordinates (x_a, y_a, z_a) and b with coordinates (x_b, y_b, z_b)?
• Euclidean distance:
$$d(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$$
Root Mean Square Deviation
• In a structure alignment the score measures how far the aligned atoms are from each other on average
• Given the distances d_i between n aligned atom pairs, the root mean square deviation is defined as
$$\mathrm{rmsd} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$$
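The RMSD definition above translates directly into code. This is a minimal sketch that assumes the two coordinate lists are already superimposed and listed in aligned order; the function names and toy coordinates are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def rmsd(coords_a, coords_b):
    """Root mean square deviation over n aligned atom pairs."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [euclidean(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy example (hypothetical coordinates): every atom pair is 1 Å apart.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
example_rmsd = rmsd(structure_a, structure_b)
```

In practice the optimal superposition has to be found first; this sketch only scores a given superposition.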
Quality of Alignment and Example
• Unit of RMSD: e.g. Ångströms
• Identical structures: RMSD = 0
• Similar structures: RMSD is small (1-3 Å)
• Distant structures: RMSD > 3 Å
• Example: structural superposition of gamma-chymotrypsin and Staphylococcus aureus epidermolytic toxin A
Pitfalls of RMSD
• All atoms are treated equally (although, e.g., residues on the surface have a higher degree of freedom than those in the core)
• The best alignment does not always mean minimal RMSD
• The significance of an RMSD value depends on the size of the aligned structures
From www.uwyo.edu/molecbio/LectureNotes/ MOLB5650
Alternative RMSDs
• aRMSD = best root-mean-square deviation calculated over all aligned alpha-carbon atoms
• bRMSD = the RMSD over the highest scoring residue pairs
• wRMSD = weighted RMSD
Source: W. Taylor (1999), Protein Science, 8: 654-665.
http://www.prosci.uci.edu/Articles/Vol8/issue3/8272/8272.html#relat
From www.uwyo.edu/molecbio/LectureNotes/ MOLB5650
Computing Structural Alignments
• DALI (Distance-matrix ALIgnment) is one of the first tools for structural alignment
• How does it work?
  • Input: the atomic coordinates of two structures
  • Compute two distance matrices:
    • For each structure, compute all pairwise inter-atom distances
    • This step is needed because the distances are independent of the coordinate system
    • The two original coordinate sets cannot be compared directly, but the two distance matrices can
  • Align the two distance matrices:
    • Find small (e.g. 6x6) sub-matrices along the diagonal that match
    • Extend these matches to form the overall alignment
  • This method is somewhat similar to how BLAST works
• SSAP (double dynamic programming) is covered in term 3
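The key property used by DALI, that inter-atom distances do not depend on the coordinate system, can be checked with a small sketch. The code below is not DALI itself; it only builds a distance matrix and shows that it is unchanged after a rotation and translation. All names and coordinates are assumptions for illustration.

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between points of one structure."""
    return [[math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
             for b in coords] for a in coords]

def rotate_z_and_shift(coords, angle_deg, shift):
    """Rotate points around the z-axis, then translate them."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

# Hypothetical coordinates of four atoms.
original = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 2.0)]
moved = rotate_z_and_shift(original, 37.0, (10.0, -4.0, 2.5))

dm_original = distance_matrix(original)
dm_moved = distance_matrix(moved)

# The coordinates differ, but the distance matrices agree.
max_difference = max(abs(a - b)
                     for row_a, row_b in zip(dm_original, dm_moved)
                     for a, b in zip(row_a, row_b))
```

Here `max_difference` is zero up to floating-point rounding, which is why distance matrices can be compared without first superimposing the structures.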
DALI Example
 The regions of common fold, as determined by the program
DALI by L. Holm and C. Sander, in the TIM-barrel proteins
mouse adenosine deaminase [1fkx] (black) and Pseudomonas
diminuta phosphotriesterase [1pta] (red):
Protein zinc finger (4znf)
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Superimposed 3znf and 4znf
30 CA atoms RMS = 0.70Å
248 atoms RMS = 1.42Å
Lys30
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Superimposed 3znf and 4znf
backbones
30 CA atoms RMS = 0.70Å
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
RMSD vs. Sequence Similarity
• Even at low sequence identity, good structural alignments are possible
Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Structure Classification
Why classify structures?
• Structural similarity is a good indicator of homology, therefore classify structures
• Classification at different levels:
  • Similar general folding patterns (structures not necessarily related)
  • Possibly low sequence similarity, but similar structure and function imply very likely homology
  • High sequence similarity implies similar structures and homology
• Classification can be used to investigate evolutionary relationships and possibly infer function
Structure Classification
• SCOP: Structural Classification of Proteins
  • Hand curated (Alexei Murzin, Cambridge) with some automation
• CATH: Class, Architecture, Topology, Homology
  • Automated where possible, with some checks by hand
• FSSP: Fold classification based on Structure-Structure alignment of Proteins
  • Fully automated
• Reasonable correspondence between the classifications (>80%)
Evolutionary Relation
• Strong sequence similarity is assumed to be sufficient to infer homology
• Close structural and functional similarity together are also considered sufficient to infer homology
• Similar structure alone is not sufficient, as proteins may have converged on a structure due to physicochemical necessity
• Similar function alone is not sufficient, as proteins may have developed it due to functional selection
• In general, structure is more conserved than sequence
• Beware: descendants of an ancestor may have different function, structure, and sequence! This is difficult to detect
What is a domain?
Single and Multi-Domain Proteins
What is a domain?
• Functional: a domain is an "independent" functional unit, which occurs in more than one protein
• Physicochemical: a domain has a hydrophobic core
• Topological: intra-domain distances of atoms are minimal, inter-domain distances maximal
• It is difficult to define a domain exactly
• It is difficult to agree on exact domain borders
Domains re-occur
• A domain re-occurs in different structures and possibly in the context of different other domains
• P-loop domain in
  • 1goj: Structure of a fast kinesin: implications for ATPase mechanism and interactions with microtubules (motor protein, single domain)
  • 1ii6: Crystal structure of the mitotic kinesin Eg5 in complex with Mg-ADP (cell cycle, two domains)
Domains re-occur
1in5: interaction of P-loop domain
(green & orange) and winged helix
DNA binding domain
1a5t: interaction of P-loop domain (green & orange) and DNA polymerase III domain
Domains have hydrophobic core
• Kyte J., Doolittle R.F., J. Mol. Biol. 157:105-132 (1982).
[Figure: Hydrophobicity Plot for 1GOJ Kinesin Motor; x-axis: Residue, y-axis: Hydrophobicity]
Hydrophobicity values:
Residue  Value
Ala       1.800
Arg      -4.500
Asn      -3.500
Asp      -3.500
Cys       2.500
Gln      -3.500
Glu      -3.500
Gly      -0.400
His      -3.200
Ile       4.500
Leu       3.800
Lys      -3.900
Met       1.900
Phe       2.800
Pro      -1.600
Ser      -0.800
Thr      -0.700
Trp      -0.900
Tyr      -1.300
Val       4.200
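A hydrophobicity plot like the one for 1GOJ is typically produced by averaging per-residue hydrophobicity values over a sliding window. The sketch below uses the Kyte-Doolittle values listed on the slide; the window size of 9 and the function name are assumptions, since the slide does not state how its plot was computed.

```python
# Kyte-Doolittle hydrophobicity values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=9):
    """Mean hydrophobicity of every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Toy sequences (hypothetical): a hydrophobic and a charged stretch.
hydrophobic = hydropathy_profile("IIIIIVVVVV", window=5)
charged = hydropathy_profile("KKKKKDDDDD", window=5)
```

Stretches with a high window average are candidates for buried core regions, while strongly negative stretches tend to be surface exposed.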
Intra-domain distances minimal
 Distances between atoms
within domain are minimal
 Distances between atoms of
two different domains are
maximal
PDB, Proteins, and Domains
• Approx. 20,000 structures in PDB
• 50% single domain
• 50% multiple domains
• 90% have fewer than 5 domains
[Figure: Distribution of Number of Domains; x-axis: Number of Domains, y-axis: Frequency]
Dom#  Freq.
1     8464
2     4358
3     926
4     1888
5     148
6     624
7     42
8     491
9     22
10    58
…     …
30    7
31    1
32    16
36    1
40    8
42    1
48    3
49    1
A structure with 49 domains
 1AON, Asymmetric Chaperonin Complex Groel/Groes/(ADP)7
SCOP: Structural Classification of Proteins
top
CLASS: All alpha (218), All beta (144), Alpha+Beta (279), Alpha/Beta (136)
FOLD: Trypsin-like serine proteases (1), Immunoglobulin-like (23)
SUPERFAMILY: Transglutaminase (1), Immunoglobulin (6)
FAMILY: C1 set domains (antibody constant), V set domains (antibody variable)
Class
 All beta
 (possibly small alpha
adornments)
 All alpha
 (possibly small beta
adornments)
Class
• Alpha/beta (alpha and beta) = single beta sheet with alpha helices joining the C-terminus of one strand to the N-terminus of the next
  • Subclass: beta sheet forming a barrel surrounded by alpha helices
  • Subclass: central planar beta sheet
• Alpha+beta (alpha plus beta) = alpha and beta units are largely separated
  • Strands joined by hairpins, leading to antiparallel sheets
Class
 Multi-domain proteins
 have domains placed in
different classes
 domains have not been
observed elsewhere
 E.g. 1hle
Class
 Membrane (few and most
unique) and cell surface
proteins
 E.g. Aquaporin 1ih5
Class
 Small Proteins
 E.g. Insulin, 1pid
Class
 Coiled coil proteins
 E.g. 1i4d, Arfaptin-Rac
binding fragment
Class
 Low-resolution structures,
peptides, designed proteins
 E.g. 1cis, a designed
protein, hybrid protein
between chymotrypsin
inhibitor CI-2 and helix E
from subtilisin Carlsberg
from Barley (Hordeum
vulgare), hiproly strain
Fold, Superfamily, Family
 Fold
 Common core structure
i.e. same secondary structure elements in the same
arrangement with the same topological structure
 Superfamily
 Very similar structure and function
 Family
 Sequence identity (>30%) or extremely similar
structure and function
Distribution (2007)
Class                      Fold  Superfamily  Family
All alpha                   259          459     772
All beta                    165          331     679
Alpha/beta                  141          232     736
Alpha+beta                  334          488     897
Multidomain                  53           53      74
Membrane and cell surface    50           92     104
Small proteins               85          122     202
Total                      1086         1777    3464
Uses of SCOP
 Automatic classification
 Understanding of protein enzymatic function
 Use superfamily and fold to study distantly related
proteins
 Study sequence and structure variability
 Derive substitution matrices for sequence
comparison
 Extract structural principles for design
 Study decomposition of multi domain proteins
 Estimate total number of folds
 Derived databases
PDB, Proteins, Domains revisited
• 80% of PDB structures contain only one SCOP superfamily
• 15% of PDB structures contain two different SCOP superfamilies
[Figure: Frequency of Number of SCOP Superfamilies; x-axis: Number of Superfamilies, y-axis: Frequency]
sfNo  sfNoFreq
1     13960
2     2721
3     495
4     178
5     33
6     25
7     1
9     4
20    9
21    1
22    1
23    6
A structure with 23 different superfamilies
• 1k9m: Co-crystal structure of tylosin bound to the 50S ribosomal subunit of Haloarcula marismortui (ribosome)
The 20 Most Frequently Occurring Superfamilies
Superfamily                                            SCOP ID  #PDB
Immunoglobulin                                         b.1.1    823
Lysozyme-like                                          d.2.1    777
Trypsin-like serine proteases                          b.47.1   649
P-loop containing nucleotide triphosphate hydrolases   c.37.1   521
NAD(P)-binding Rossmann-fold domains                   c.2.1    384
Globin-like                                            a.1.1    384
(Trans)glycosidases                                    c.1.8    332
Acid proteases                                         b.50.1   288
Concanavalin A-like lectins/glucanases                 b.29.1   230
Thioredoxin-like                                       c.47.1   217
EF-hand                                                a.39.1   212
alpha/beta-Hydrolases                                  c.69.1   195
Cupredoxins                                            b.6.1    178
Ribonuclease H-like                                    c.55.3   178
PLP-dependent transferases                             c.67.1   176
Periplasmic binding protein-like II                    c.94.1   171
Carbonic anhydrase                                     b.74.1   169
Metalloproteases ("zincins"), catalytic domain         d.92.1   169
FAD/NAD(P)-binding domain                              c.3.1    162
Cytochrome c                                           a.3.1    161
CATH
• Class: secondary structure composition
• Architecture: orientation in 3D
• Topology: connectivity
• Homology: grouped by evidence for homology (sequence, structure and function)
Generating CATH
1. Identify close relatives by pairwise sequence alignment
2. Detect more distant relatives using
   2a. sequence profiles and
   2b. structure alignment
3. Structures still unclassified after steps 1 and 2 are examined by hand to detect domain boundaries
4. Try steps 2 and 3 again
5. If still unclassified, assign manually
CATH step 1:
Sequence-based Identification of Homologous Structures
• > 30% sequence similarity implies similar structure
• Relatives identified using pairwise alignment are clustered using hierarchical clustering with single linkage
• Reminder…
Hierarchical Clustering
Initial distance matrix:
       1   2   3   4   5
1      0   2   6  10   9
2          0   5   9   8
3              0   4   5
4                  0   3
5                      0
After merging 1 and 2 (distance 2):
       (1,2)   3   4   5
(1,2)    0     5   9   8
3              0   4   5
4                  0   3
5                      0
After merging 4 and 5 (distance 3):
       (1,2)   3   (4,5)
(1,2)    0     5     8
3              0     4
(4,5)                0
After merging 3 and (4,5) (distance 4):
            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0
Dendrogram (leaf order): 1 2 3 4 5
Hierarchical Clustering:
• How to define the distance between clusters?
Example distance matrix:
    A  B  C
A   0  1  2
B      0  1
C         0
• Single linkage: minimum
  • Example: distance from (A,B) to C is 1
• Complete linkage: maximum
  • Example: distance from (A,B) to C is 2
• Average linkage: average
  • Example: distance from (A,B) to C is 1.5
• Are dendrograms always the same, independent of the linkage method?
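The three linkage rules differ only in how the leaf-to-leaf distances between two clusters are combined. The following is a minimal sketch of agglomerative clustering with a pluggable linkage function; the function names and the dictionary-based distance representation are assumptions for illustration, and ties are broken by cluster order.

```python
def average(values):
    return sum(values) / len(values)

def agglomerate(labels, dist, linkage=min):
    """Agglomerative clustering.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    linkage combines leaf distances: min (single), max (complete), average.
    Returns the list of merges as (cluster1, cluster2, distance).
    """
    clusters = [(label,) for label in labels]   # every leaf starts alone
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, i, j = min(
            (linkage([dist[frozenset((a, b))] for a in c1 for b in c2]), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters) if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# The A, B, C example from the slide.
abc = {frozenset("AB"): 1, frozenset("AC"): 2, frozenset("BC"): 1}
single = agglomerate("ABC", abc, linkage=min)
complete = agglomerate("ABC", abc, linkage=max)
avg = agglomerate("ABC", abc, linkage=average)

# The five-item example: single linkage reproduces the merge order shown.
five = {frozenset(p): d for p, d in [
    ((1, 2), 2), ((1, 3), 6), ((1, 4), 10), ((1, 5), 9), ((2, 3), 5),
    ((2, 4), 9), ((2, 5), 8), ((3, 4), 4), ((3, 5), 5), ((4, 5), 3)]}
five_merges = agglomerate([1, 2, 3, 4, 5], five, linkage=min)
```

The last merge distance for A, B, C is 1, 2 and 1.5 under single, complete and average linkage respectively, matching the examples above.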
Hierarchical Clustering: Chaining
• Beware of chaining when using single linkage
• As the nearest neighbour is selected, it appears that all members of the cluster are very similar to each other, when in fact A and Z are very different
Example distance matrix:
    A  B  C  D  …  Z
A   0  1  2  3  …  25
B      0  1  2  …  24
C         0  1  …  23
D            0  …  22
…               …
Z                  0
Dendrogram (leaf order): A B C D … Z
CATH and single linkage
 It is argued that
 structural data is quite sparse,
 hence it cannot be expected that all cluster
members will be very similar (in terms of
sequence) to each other,
 so that the chaining effect is even useful
CATH step 2a:
 Profile-based methods such as PSI-BLAST are used
to detect distant relatives
 Build profiles using all sequence data available
(rather than only sequences for which structure
exists)
 This increases quality of profiles dramatically
 51% distant relatives retrieved using profiles based on
sequences with known structure only
 82% distant relatives retrieved using profile based on
all sequences
CATH step 2b: Structure-based methods to detect distant relatives
• For ca. 15% of structures, the sequence-based method does not work
• Example: for globins, sequence similarity can fall below 10%, yet structure and function (oxygen-binding) are preserved
• Use SSAP, the Sequential Structure Alignment Program
Clustering Result of
Structure Alignment
 Relatives identified using pairwise alignment
are clustered using hierarchical clustering
with single linkage
Improving Efficiency: GRATH
• Screening large structures (>300 residues) against the database can take days
• Idea of GRATH (Graphical Representation of CATH):
  • Improve efficiency by filtering at a higher level before doing a detailed comparison
  • Represent each protein as a graph where
    • Nodes are secondary structure elements, represented by their midpoint, tilt, and rotation
    • Edges are distances between midpoints of secondary structure elements
  • Use an algorithm to determine subgraph isomorphism (i.e. does one graph occur in another one?)
  • If yes, then do a detailed comparison using SSAP
Structure Prediction and Modelling
Structure Prediction:
Four Main Problem Areas
 Given a sequence with unknown structure, predict its structure
 Secondary structure prediction
 Predict regions of helices and strands
 Homology modelling
 Predict structure from known structures of one or more related
proteins
 Fold recognition
 Given a library of structures, determine which one (if any) is
the fold of the given sequence
 Prediction of novel folds: A-priori and knowledge-based
methods
Structure Prediction of Novel Folds:
Two Approaches
• A priori:
  • Most approaches aim to reproduce inter-atomic interactions by defining an energy function and trying to find its global minimum
  • Problems:
    • Inadequacy of the energy function
    • Algorithms get stuck in local minima
• Knowledge-based:
  • Find similarities to known structures or substructures
Secondary Structure Prediction
• A successful tool for secondary structure prediction is PROF
• PROF uses neural networks to learn secondary structure from known structures
• About three quarters of PROF's predictions are correct
• At CASP 2000 it predicted, e.g., the following:
Residues 1-50
Sequence   ALVEDPPLKVSEGGLIREGYDPDLDALRAAHREGVAYFLELEERERERTG
Prediction HH------------EEE------HHHHHHHHHH-HHHHHHHHHHHHHHH
Experiment -E-------------E-----HHHHHHHHHHHHHHHHHHHHHHHHHHHH
Residues 51-100
Sequence   IPTLKVGYNAVFGYYLEVTRPYYERVPKEYRPVQTLKDRQRYTLPEMKEK
Prediction --EEEEEEEEEEEEEEEE-----------EEEEEEEE—-EEEE-HHHHHH
Experiment ----EEEEE---EEEEEEEHHHHHH-----EEEEE---EEEEE-HHHHHH
Residues 101-129
Sequence   EREVYRLEALIRRREEEVFLEVRERAKRQ
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-
PROF's prediction
• The regions predicted by the PROF server of Rost to be helical are shown as wider ribbons. The prediction missed only a short helix, at the top left of the picture
Homology modelling
 Define the model of an unknown structure by making
minimal changes to a relative with known structure
 Align amino acid sequences of target and one or more
known structures
 Insertions and deletions should be in loop regions
 Determine mainchain segments to represent the regions
containing insertions and deletions and stitch these into the
known structure
 Replace the sidechains of the residues that have been
mutated
 Examine the model (by hand and computationally) to detect
collisions between atoms
 Refine the model by limited energy minimisation
Accuracy of Homology Modelling
 Works for >40-50% sequence similarity
 Example: SWISS-MODEL Prediction of neurotoxin of red
scorpion (1DQ7) from neurotoxin of yellow scorpion (1PTX)
Fold Recognition: 3D Profiles
• Given a sequence, determine which fold (if any) is most similar
• Can we build profiles to represent structures of similar fold (similar to sequence profiles)?
• 3D profiles:
  • Classify the environment of each residue
  • Secondary structure:
    • Is it part of a helix, sheet or other (determined by mainchain hydrogen-bonding interactions)?
  • Surface exposure:
    • <40 Å², 40-114 Å², or >114 Å² accessible surface area
  • Polar or non-polar nature of the environment
  • Total of 18 residue classes; each residue belongs to exactly one
  • The sequence of these residue classes is the 3D profile
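One way to read the count of 18 classes is 3 secondary structure states × 3 exposure bins × 2 polarity states. The sketch below encodes that reading; it is an assumption made for illustration, and the published 3D profile method may partition burial and polarity differently. All names are hypothetical.

```python
SECONDARY_STATES = ("helix", "sheet", "other")

def exposure_bin(accessible_area):
    """Map accessible surface area (in Å²) to one of three exposure bins."""
    if accessible_area < 40.0:
        return 0        # buried
    if accessible_area <= 114.0:
        return 1        # intermediate
    return 2            # exposed

def environment_class(secondary, accessible_area, polar_environment):
    """Return a class index in 0..17 (assumed 3 x 3 x 2 layout)."""
    s = SECONDARY_STATES.index(secondary)
    e = exposure_bin(accessible_area)
    p = 1 if polar_environment else 0
    return (s * 3 + e) * 2 + p

def profile_3d(residue_environments):
    """The 3D profile is the sequence of environment classes."""
    return [environment_class(*env) for env in residue_environments]

# Hypothetical environments for four residues.
example_profile = profile_3d([
    ("helix", 12.0, False),
    ("helix", 80.0, True),
    ("sheet", 130.0, True),
    ("other", 40.0, False),
])
```

Each structure is thereby reduced to a one-dimensional string of classes, which is what makes alignment against a sequence possible.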
3D Profiles and Alignments
 Structure-Structure Alignment:
 3D profiles of two known structures can be aligned against each other
 Sequence-Structure Alignment:
 Based on existing 3D profiles, probability can be determined for a
residue occurring in a residue class.
 Using this probability, we can assign 3D profile to a sequence
 And hence align the sequence 3D profile to a structure 3D profile
 For correctly determined protein structures, the structure 3D profile
fits the sequence 3D profile well
 However, other proteins may score even better
 If a structure does not match its own 3D profile well it is likely that
there is an error in the structure determination
Threading
• Pull the query sequence through a known structure and rate the score
• Necessary:
  • A method to score the models to select the best one
  • A method to calibrate the scores to decide which of the best is correct
Homology modelling vs. threading:
Homology modelling            Threading
Identify homologues           Try all possible parents
Determine optimal alignment   Try many alignments
Optimize one model            Evaluate many rough models
Scoring for Threading
• Empirical patterns of residue neighbours are derived from known structures
• Observe the distribution of inter-residue distances for all 20 x 20 residue pairs
• Derive a probability distribution as a function of distance in space and on the sequence
• The Boltzmann equation relates probability and energy
• Reverse this and derive an energy function from the probability distribution
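Reversing the Boltzmann relation gives an energy of the form E(d) = -kT ln(p_observed(d) / p_reference(d)) for each distance bin d. The sketch below applies this to binned counts; the choice of reference distribution, the pseudocount, and all names are assumptions for illustration rather than the scoring function of any particular threading program.

```python
import math

def knowledge_based_energy(observed_counts, reference_counts,
                           kT=1.0, pseudocount=1.0):
    """Inverse Boltzmann: per-bin energy from observed vs. reference counts.

    observed_counts:  counts per distance bin for one residue pair type
    reference_counts: counts per distance bin over all residue pairs
    """
    bins = len(observed_counts)
    if bins != len(reference_counts):
        raise ValueError("both count lists need the same number of bins")
    obs_total = sum(observed_counts) + pseudocount * bins
    ref_total = sum(reference_counts) + pseudocount * bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Hypothetical counts: the pair is enriched at short distances (bin 0).
energies = knowledge_based_energy([90, 10], [50, 50], pseudocount=0.0)
neutral = knowledge_based_energy([20, 30, 50], [20, 30, 50])
```

Distances that occur more often than expected get a negative (favourable) energy, and distances that occur less often get a positive one.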
Threading the sequence
template
Target
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
“Threaded” sequence
Yellow = adrenergic receptor sequence (target)
Blue = template structure, bovine rhodopsin (PDB 1F88)
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Modeled structure
Gaps
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Corrected Model
Slides from Hanekamp, University of Wyoming, www.uwyo.edu
Ab initio Structure Prediction
Molecular dynamics
 Structure prediction = place atoms so that
interactions between them create a unique state of
maximum stability
 Problem:
 Model of inter-atomic distances is not complete
 Computational scale:
Large number of variables and massive search space
Non-linearities
Rough energy surface with many local minima
Conformational energy calculations
• Bond stretching
• Bond angle bending
• Torsion angles (e.g. φ, ψ, ω)
• Van der Waals interactions
  • Short-range repulsion ~R^-12 and long-range attraction ~R^-6, where R is the inter-atom distance
• Hydrogen bond
  • Weak chemical/electrostatic interaction, ~R^-12 and ~R^-10
• Electrostatics
  • Charges on atoms
• Solvent
  • Interactions with water, salt, sugar, etc.
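The van der Waals term with R^-12 repulsion and R^-6 attraction is commonly written as the Lennard-Jones 12-6 potential, E(R) = 4ε[(σ/R)^12 - (σ/R)^6]. The sketch below evaluates it; the parameter values ε = 1 and σ = 1 are placeholders, not values from any real force field.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones 12-6 energy for inter-atom distance r.

    epsilon: depth of the energy well
    sigma:   distance at which the energy crosses zero
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)

# The minimum lies at r = 2^(1/6) * sigma with energy -epsilon.
r_min = 2.0 ** (1.0 / 6.0)
e_min = lennard_jones(r_min)
e_close = lennard_jones(0.9)    # strongly repulsive at short range
e_far = lennard_jones(3.0)      # weakly attractive at long range
```

The steep repulsive wall and the shallow attractive tail are one reason the overall energy surface is rough and hard to minimise.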
Rosetta
 Predicts structure by first generating structures of
fragments using known structures (3-9 residues)
 Combine fragments using Monte Carlo simulation
using an energy function with terms for
 Paired beta-sheets
 Burial of hydrophobic residues
 Carries out 1000 simulations
 Results are clustered and the centre of the largest
cluster is presented as prediction
 Demo
ROSETTA
 The program ROSETTA, by D. Baker and colleagues,
can predict the structures of proteins for which no
complete domain of similar folding pattern appears in
the database. Prediction by ROSETTA of H. influenzae,
hypothetical protein. Black lines, experimental
structure; red lines, prediction
Rosetta
• Prediction by ROSETTA of the N-terminal half of domain 1 of human DNA repair protein Xrcc4. This figure shows a selected substructure of Xrcc4 containing the N-terminal 55 out of 116 residues. Black lines, experimental structure; red lines, prediction
LINUS
• Another program with a similar idea
• Prediction by LINUS (program by G.D. Rose and R. Srinivasan) of the C-terminal domain of rat endoplasmic reticulum protein ERp29. Black lines, experimental structure; red lines, prediction
Monte Carlo Simulation
• Objective: find the conformation with minimal energy
• Problem: avoid local minima
• Algorithm:
  1. Generate a random initial conformation x
  2. Perturb conformation x to generate a neighbouring conformation x'
  3. Calculate the energies E(x) and E(x') for conformations x and x'
  4. If E(x) > E(x') (i.e. x' is an improvement, we go downhill from x to x'), accept x' as the new conformation and go to step 2
  5. If E(x) ≤ E(x') (i.e. x' is no improvement, we go uphill from x to x'), accept x' as the new conformation with probability p
  6. Reduce the probability p of accepting uphill moves with every step
  7. Go to step 2
• Steps 1-4 make sure that we "walk" downhill towards a minimum
• Steps 5-7 make sure that if we are in a local minimum there is a chance to get out of it by accepting an uphill move. It is important that this probability decreases, so that walking uphill becomes more and more unlikely
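The algorithm above can be sketched on a toy problem. Here the "conformation" is a single number and the energy is an artificial one-dimensional function with many local minima; both are assumptions for illustration, as are the parameter defaults. The uphill acceptance probability p simply decays geometrically, following the steps on the slide, whereas Metropolis-style implementations usually derive it from the energy difference and a temperature.

```python
import math
import random

def rough_energy(x):
    """Toy 1D energy surface with many local minima (illustrative only)."""
    return x * x + 10.0 * math.sin(3.0 * x)

def monte_carlo_minimise(energy, lower, upper, steps=5000, step_size=0.5,
                         p_start=0.5, decay=0.999, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(lower, upper)              # step 1: random start
    start_x = x
    e = energy(x)
    best_x, best_e = x, e
    p = p_start
    for _ in range(steps):
        # Step 2: perturb x to get a neighbouring conformation.
        candidate = x + rng.uniform(-step_size, step_size)
        candidate = min(max(candidate, lower), upper)
        e_new = energy(candidate)              # step 3: compare energies
        # Step 4: always accept downhill; step 5: accept uphill with prob. p.
        if e_new < e or rng.random() < p:
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
        p *= decay                             # step 6: reduce p
    return start_x, best_x, best_e

start_x, best_x, best_e = monte_carlo_minimise(rough_energy, -5.0, 5.0)
```

The function also keeps track of the best conformation seen so far, which is a small addition to the listed steps so that a late uphill move cannot lose a good solution.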
Summary
• You should now know
  • What helices, strands, and sheets are
  • What a Ramachandran plot is
  • How to score a structural alignment (RMSD)
  • How to compute a structural alignment
  • How a domain can be characterised
  • Why structure classification is useful
  • What the main structure classes are
  • How classifications can be generated automatically
  • What the problems are
  • What secondary structure prediction, homology modelling, threading, and ab initio and knowledge-based structure prediction of novel folds are
• Visit the PDB, SCOP and CATH websites
• Read chapter 5