www.physics.usyd.edu.au

Download Report

Transcript www.physics.usyd.edu.au

Computational Methods in
Molecular Modelling
Uğur Sezerman
Biological Sciences and Bioengineering Program
Sabancı University, Istanbul
Motivation
Knowing the structure of molecules
enables us to understand its mechanism
of function
Current experimental techniques
X-ray cystallography
NMR
PROTEIN FOLDING
PROBLEM
STARTING FROM AMINO ACID SEQUENCE
FINDING THE STRUCTURE OF PROTEINS IS
CALLED THE PROTEIN FOLDING PROBLEM
Forces driving protein
folding
It is believed that hydrophobic collapse is
a key driving force for protein folding
Hydrophobic core
Polar surface interacting with solvent
Minimum volume (no cavities) Van der
Walls
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions
SECONDARY STRUCTURE
PREDICTION
Intro. To Struc.
(Tooze and Branden)
Secondary Structure
Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…
AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Chou-Fasman Parameters
Name
Alanine
Arginine
Aspartic Acid
Asparagine
Cysteine
Glutamic Acid
Glutamine
Glycine
Histidine
Isoleucine
Leucine
Lysine
Methionine
Phenylalanine
Proline
Serine
Threonine
Tryptophan
Tyrosine
Valine
Abbrv
A
R
D
N
C
E
Q
G
H
I
L
K
M
F
P
S
T
W
Y
V
P(a)
142
98
101
67
70
151
111
57
100
108
121
114
145
113
57
77
83
108
69
106
P(b) P(turn)
66
83
95
93
146
54
156
89
119
119
74
37
98
110
156
75
95
87
47
160
59
130
101
74
60
105
60
138
152
55
143
75
96
119
96
137
114
147
50
170
f(i)
0.06
0.07
0.147
0.161
0.149
0.056
0.074
0.102
0.14
0.043
0.061
0.055
0.068
0.059
0.102
0.12
0.086
0.077
0.082
0.062
f(i+1)
0.076
0.106
0.11
0.083
0.05
0.06
0.098
0.085
0.047
0.034
0.025
0.115
0.082
0.041
0.301
0.139
0.108
0.013
0.065
0.048
f(i+2)
0.035
0.099
0.179
0.191
0.117
0.077
0.037
0.19
0.093
0.013
0.036
0.072
0.014
0.065
0.034
0.125
0.065
0.064
0.114
0.028
f(i+3)
0.058
0.085
0.081
0.091
0.128
0.064
0.098
0.152
0.054
0.056
0.07
0.095
0.055
0.065
0.068
0.106
0.079
0.167
0.125
0.053
Computational Approaches




Ab initio methods
Threading
Comperative Modelling
Fragment Assembly
Ab-initio protein structure prediction as
an optimization problem
1. Define a function that map protein
structures to some quality measure.
2. Solve the computational problem of
finding an optimal structure.

energy
3.
Chen Keasar
BGU
conformation
A dream function
 Has a clear minimum in the native structure.
 Has a clear path towards the minimum.
 Global optimization algorithm should find the
native structure.
Chen Keasar
BGU
An approximate function
 Easier to design and compute.
 Native structure not always the global minimum.
 Global optimization methods do not converge. Many
alternative models (decoys) should be generated.
 No clear way of choosing among them.
Decoy set
Chen Keasar
BGU
Fold Optimization
Simple lattice models
(HP-models)
Two types of residues:
hydrophobic and polar
2-D or 3-D lattice
The only force is
hydrophobic collapse
Score = number of HH
contacts
Scoring Lattice Models
H/P model scoring:
Sometimes:
Penalize for buried polar or surface
hydrophobic residues
Learning from Lattice
Models
Ken Dill ~ 1997
Hydrophobic zipper effect
Basic element
electrons &
protons
atom
extended
atom
half a
residue
residue
Some
residues
Hinds &
Levitt
diamond
lattice
Chen Keasar
BGU
torsion fine square
angle lattice lattice
fragments continuous
What can we do with
lattice models?
For smaller polypeptides, exhaustive
search can be used
Looking at the “best” fold, even in such a
simple model, can teach us interesting things
about the protein folding process
For larger chains, other optimization and
search methods must be used
Greedy, branch and bound
Evolutionary computing, simulated annealing
Graph theoretical methods
Inverse Protein Folding
Problem
Given a structure (or a functionality) identify
an amino acid sequence whose fold will be
that structure (exhibit that functionality).
Crucial problem in drug design.
NP-hard under most models.
PROTEIN THREADING
Thread the given sequence to the
different structural families exist in
structural databases
Choose the optimum structure based on
the potential energy function ( contact
potential, free energy, e.g.) used
Threading: Fold
recognition
Given:
Sequence:
IVACIVSTEYDVMKAAR
…
A database of
molecular coordinates
Map the sequence
onto each fold
Evaluate
Objective 1: improve
scoring function
Objective 2: folding
Protein Fold Families
(CATH,SCOP)

CATH website
www.cathdb.info
Genetic Algorithm used as
a search tool
 We are searching for the minima of our fitness function composed of
profile and contact energy terms.
 In this problem value encoding have been used. Parents are represented as
strings of positions. Population Size is 50.
 A sample parent (string of positions) is figured below:
12345
10 11 12 13 14
23 24 25 26 27 28 29 30 31 32
55 56 57 58
 Branch and Bound algorithm have been used to produce random initial
parents.
 Mutation:
Mutation operator is the shifting of the structure’s position either to the right
or left by some units.
 Crossover:
Two-point cross-over is applied where , selected suitable structures are
exchanged between two parents.
Our Aim
In this research, we have threaded a
structurally unknown protein sequence to
over 2200 SCOP family fold proteins and
sought the best fitting structural family.
We also tried to find the optimum fit of
the query sequence to a given fold.
Fitness Function
Energy function is a combination of
The sequence profile energy
Contact Potential energy (inter & intra
structural residues are taken into account)
TotalEnergy= p1 ( ProfileEnergy ) + c1(ContactEnergy)
The weights are chosen such that the contributing
energy from profile and contact energy terms will be
equal.
Profile Energy
 We do structural alignment on all selected secondary
structural units of the sequences.
 Same numbered secondary structural units are
selected.
 Length of the units may differ.
-- P E E L L L R W A N F H L E N ( 1aoa)
-- S E K I L L K W V R Q T -- -- -- (1qag)
N S E K I L L S W V R Q S T R -- (1dxx)
Sixth helices of the selected all-alfa sequences
Profile Matrix calculated
from a structure group
Residue Names

A
C
D
E
F
G
H
I
K
L
M
N
P
Q
R
S
T
V
W
Y
-
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01
0.34 -2 -0.33 -1 -3.33 -0.67 -1.33 -3 -0.33 -3.33 -2.33 0.01 2.68 -0.33 -1.67 3.01 1.01 -2.33 -4 -2.33 0.01
-1 -3 2.01 6.01 -3 -3 0.01 -4 1.01 -3 -2 0.01 -1 2.01 0.01 -1 -1 -3 -3 -2 0.01
-1 -3 0.01 2.68 -3.67 -2.33 0.01 -3.33 4.34 -3 -2 0.01 -1 2.01 2.01 -0.33 -1 -3 -3 -2 0.01
-1.33 -2 -4 -3.67 0.34 -4 -3.67 4.01 -3 3.01 2.34 -3.33 -3.33 -2.67 -3.67 -3 -1 3.01 -2.67 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
0.01 -2 -0.67 -0.67 -3 -1 -0.67 -3.33 1.01 -3 -2 0.34 -1.67 0.34 1.68 3.01 1.01 -2.33 -3.67 -1.67 0.01
-3 -5 -5 -3 1.01 -3 -3 -3 -3 -2 -1 -4 -4 -1 -3 -4 -3 -3 15.01 2.01 0.01
1.68 -1 -3.33 -2.33 -1.67 -2.67 -3.33 2.34 -2.33 0.01 0.34 -2.33 -2.33 -2.33 -2.67 -1 0.01 3.34 -3 -1.33 0.01
-1.67 -3.33 -0.67 0.01 -3.33 -2 0.34 -3.67 2.01 -3.33 -2 1.68 -2.67 0.68 4.34 -0.33 -0.67 -3 -3.33 -1.33 0.01
-1.67 -2.67 -1.67 0.34 0.01 -2.67 0.34 -2 0.01 -1 0.01 -1.33 -2 3.34 -0.33 -1 -1.33 -2.33 -0.33 0.68 0.01
-0.33 -1.67 -0.67 -0.67 -2 -1.33 2.34 -2.67 -0.33 -2.33 -1.33 0.68 -1.33 0.01 -0.67 2.01 1.68 -2 -3.33 -0.67 0.01
-0.67 -1 -1.67 -1.33 -0.33 -2 -1.67 0.34 -1.33 1.34 0.68 -1.33 -1.67 -1 -1.33 -0.33 1.34 0.34 -1.67 -1 2.01
-1 -2.33 0.01 2.01 -2 -2 0.01 -2.67 1.34 -2 -1.33 -0.33 -1.33 1.01 2.34 -0.67 -0.67 -2 -2 -1 2.01
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01
Positions
Profile scores
Contact Potential Energy
Based on the counts of frequency of
contacts in a database of known
structures converted into energy values.
In this study, contact potential energy is
the sum of energies of the residues that
are closer than seven angstroms in
distance to each other.
Jernigan’s & Dill’s Contact Potential
Energy Tables have been used.
Selected Benchmark Set
All Alfa Set :1aoa,1dxx,1qag
Fold: Calponin-homology domain, CH-domain core: 4 helices: bundle
Superfamily: Calponin-homology domain, CH-domain
Family: Calponin-homology domain, CH-domain
All Beta Set :1acx,1hzk,1noa,2mcm
Fold: Immunoglobulin-like beta-sandwich sandwich; 7 strands in 2 sheets
Superfamily: Actinoxanthin-like
Family: Actinoxanthin-like
Alfa+Beta Set : 1dwn,1e6t,1frs,1qbe,1una
Fold: RNA bacteriophage capsid protein
6-standed beta-sheet followed with 2 helices; meander
Superfamily: RNA bacteriophage capsid protein
Family: RNA bacteriophage capsid protein
Secondary structure
prediction results of the
family of all alfa proteins

Eight helixes of the following sequences are selected and
each sequence is threaded to the other one and the shifts
from the real structures are shown below.
Target Sequences
1aoa
Template
sequences
1dxx
1qag
1aoa T T T T T T T 30
T T T -6 -1 -1 T 27
1 T T T T 12 T T
1dxx T T T -4 1 5 4 9
T -3 T -5 T T T T
3 T T T 1 T 41 37
1qag -1 T T -5 T 4 41 32 5 1 T -6 -1 T -13 -1
TTTTTTTT
prediction results of the
family of all beta proteins

Nine beta sheets of the following sequences are selected
and each sequence is threaded to the other one and the
shifts from the real structures are shown below.
Target Sequences
1acx
1hzk
1noa
2mcm
Template 1acx T T T T T T T T T
1 T T T T -2 T T T
T T T T T -3 -1 T T T T T 2 T T 1 2 4
sequences 1hzk T T T T T T T T T
TTTTTTTTT
TTTTT14TT
T T T T -3 -3 T T T
1noa T T T T T T 1 T T
T T T T T -1 T T T
TTTTTT5TT
T T T T -2 -2 T T T
2mcm T T T T T T T T T
TTTTTTTTT
TTT1TTTTT
T T T 1 T -1 T T T
family of
alfa-beta proteins

Target Sequences
1dwn
1e6t
1frs
1qbe
1una
1dwn T T 4 T T T T 4
TTTTTTT5
TTTTTTTT
TTTTTTT5
T T T -1 -1 T T 1
1e6t
TTTTTTTT
TTTTTTTT
TTTTTTTT
TTTTTTTT
TTTTTTTT
1frs
TTTTTTTT
TTTTTTTT
TTTTTTTT
T T T T T T T -1 T T T T T T 1 T
1qbe
-1 T T T 1 T T 3
T T -3 -11 T T T 4 T T -3 -11 T T T 1
TTTTTTTT
-1 T T T T T T 1
1una
TTT11TTT
T T T -5 T T T T
1TTT2TTT
TTTTTTTT
Template
sequences
T T T -5 T T 1 T
Conclusion for fitting to a
given fold
We obtained very good results for all-beta and
alfa+beta proteins .
All alfa proteins gave good results generally but
we had some shifts for the all alfa structures.
 The main reason for the alfa shifts was mainly
due to the fact that our all-alfa sequences had a
very different lenghts and highly variable
sequences which lowered the contribution from
the profile scores.
Fold Classification Results
1ubi Threading Results
-1000
-1200
-1400
Energy Values
-1600
-1800
-2000
-2200
-2400
Other members
of 1ubi's family
-2600
1e0q
1ubi
-2800
1f9j
-3000
0
100
200
300
Protein ID
400
500
600
All Beta
1acx Threading Results
-1000
-1200
-1400
Energy Values
-1600
-1800
-2000
-2200
1zfo
1klo
-2400
1c01
-2600
-2800
1acx
-3000
0
100
200
300
400
Protein ID
500
600
700
All Alpha
1bhd Threading Result
-1000
-1200
-1400
Energy Values
-1600
-1800
-2000
-2200
1qld
1hg6
1dfu
2pcf
-2400
-2600
1bhd
-2800
-3000
0
100
200
300
400
Protein ID
500
600
700
CONCLUSION
By optimising the fitting process with
genetic algorithm and using a correct
target function we have obtained quite
clear classifications in the base of families.
It is also possible to use this method for
superfamily classification by adjusting only
profile information and weights.
We also applied the method to 6 CASP
proteins and correctly classified their
HOMOLOGY MODELLING
Using database search algorithms find the
sequence with known structure that best
matches the query sequence
Assign the structure of the core regions
obtained from the structure database to
the query sequence
Find the structure of the intervening loops
using loop closure algorithms
Homology Modeling: How it works
o Find template
o Align target sequence
with template
o Generate model:
- add loops
- add sidechains
o Refine model
Prediction of Protein
Structures
 Examples – a few good examples
actual
actual
predicted
predicted
actual
actual
predicted
predicted
Prediction of Protein
Structures
 Not so good example
1esr
TURALIGN: Constrained
Structural Alignment Tool For
Structure Prediction
Motivation -1:
Structure based Alignment
Most of the alignment algorithms are only
sequence dependent (Needleman-Wunsch &
Smith-Waterman )
Functional sites are usually mismatched
Fail to give the best alignment between
highly divergent sequences having very
similar structures
Motivation -2:
Structure prediction of novel
proteins
Using
evolutionary
information
on
sequence confirmation
Secondary structure predictions and
possible locations of turns should be used
for threading
Preservation of favorable contacts
Methods
Motif Alignment Based on Dynamic Algorithm
Approach
 Recursive Smith-Waterman Local Alignment
Algorithm with Affine Gap Penalty
Secondary Structure Similarity Matrix
BLOSSUM 62
Position Specific Entropy Information
Filtering step using neighbourhood information
Jernigan Contact Potential Matrix
Motif Alignment Using
Dynamic Algorithm
Motif Alignment Using
Dynamic Algorithm
Functional sites and motifs in template
protein can be either given as input to the
program or prosite scan* tool is used to
detect the motifs.
*Gattiker,A et.al. Bioinformatics 2002:1(2) 107-108.
Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
pL>0.9xpc
pL>0.9xpc
pc
pc
50
47
pR>0.9xpc
pR>0.9xpc
Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
•A(i, j )  max X  { A, B , C} { X (i-1, j-1)  S(i,j)}
•B(i, j )  max{ A(i-1, j )  go  ge, B(i-1, j )  ge, C(i-1, j )  go  ge}
•C(i, j )  max{ A(i, j-1)  go  ge, B(i, j-1)  go  ge, C(i, j-1)  ge}
 Build 3 matrices:
A for the matches;
B for the gaps on template;
C for gaps on target.
S(i,j) : Pairwise Similarity Score
go : Gap opening penalty
ge : Gap extension penalty
 Tracing back : Include the paths that have score > 0.9xMax
Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
S(i,j)  sc  SSS(i,j)  ac SS(i,j)  tc  TS(i,j)
SSS(i,j) : Secondary Structure Similarity
SS(i,j) : Sequence Similarity
TS(i,j) : Turn Similarity
sc : Secondary Structure Similarity Coefficient
ac : Sequence Similarity Coefficient
tc : Turn Similarity Coefficient
Secondary Structure
Similarity
H
H
L
H:0.7 0.5 0.0
E:0.2 0.4 0.3
L:0.1 0.1 0.6
S
H
E
L
H
2
-15
-4
E
-15
4
-4
L
-4
-4
2
Secondary Structure Similarity Matrix*
Secondary Structure Prediction Servers
3
SSS(i, j )  sc   S (T (i), k )  P(k , i)
k 1
T (i) : SecondaryStructureof T emplateat positioni
P(., j ) : SecondaryStructureprofileof T argetat positionj
sc : SecondaryStructureSimilarityCoefficient
*Wallqvist,A et al. Bioinformatics. 2000 Nov;16(11):988-1002.
Sequence Similarity
Multiple Sequence Alignment
of
Template Protein’s family*
...ALVKLI...
...A-IEII...
...AL-KLI...
20
S ( j )   P(i, j )  log P(i, j )
i 1
1
C (i ) 
1  S (i)
S (i ) : Entropyat position j of template
P : Family ProfileMatrix
C (i ) : Conservation score at position j of template
SS(i, j ) ac  C (i)  BLOSSUM62(i, j )
*Glaser,F. Et al. Bioinformatics 19:163-164(2003)
Turn Similarity
T
T
N
T:0.7 0.5 0.0
N:0.3 0.5 1.0
TS (i, j )  tc  4  T (i)  P(T , j )
Turn Prediction Servers
T (i)  1 if i  T ;else 0
P(., j ) : T urn profileof T argetat position j
tc : T urnSimilarityCoefficient
Gap Penalties
...L...
...-...
go  
...H/E...
... - ...
gapSec 20
And vice versa...
2
 go
3
2
ge    ge
3
Filtering
 For each of the motif
alignments get the 25
best alignments
 Build a connectivity map
of template protein and
thread onto target.
CS  cs 
  (i, j )  J (i, j )
i 1, j i
 : Kirchoff Mat rix
J : Jernigan Cont actP ot ent ialMat rix*
Get the best 25 alignments
According to the score:
TS  S  CS



1
if
i

j

R

7
.
3
Å


ij


(i, j )  0 if i  j  Rij  7.3Å 



(
i
,
j
)
if
i

j



 i,i  j

*Miyazawa S, Jernigan R L.(1983) Macromolecules ;18:534–552.
RESULTS
To test our program we have chosen 3
families from ASTRAL40* protein list.
Citrate Synthase : 1csh,1iomA,1k3pA
Methionine aminopeptidase:1b6a,1xgsA
Methyltransferase:1fp2A,1fp1D
As testing measure: RMSD between the
predicted and actual structure of target.
RESULTS
 For all the experiments done, our algorithm perfectly matched
functional sites and motifs given as input to the program.
1csh vs 1iomA :
RMSD = 2.50
1csh vs 1k3pA
RMSD = 2.12
1k3pA vs 1iomA
RMSD = 3.03
1b6a vs 1xgsA
RMSD = 2.23
1fp2A vs 1fp1D
RMSD = 2.98
 At average we got the best results for 5 experiments:
RMSD = 2.57 with ac:0.4,sc:0.4,tc:0.2,cc:0
User Interface of TURALIGN
DOMAIN INTERACTIONS