Homology Modelling - CBS


Protein Homology Modelling
Thomas Blicher
Center for Biological Sequence Analysis
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Learning Objectives
After this lesson you should be able to:
– Explain the individual steps involved in calculating a protein homology model.
– Identify suitable templates for modelling.
– Outline the principles behind ab initio protein structure prediction.
– Describe the differences between homology modelling and ab initio structure prediction.
– Describe the major pitfalls in protein modelling.
Outline
• Protein homology modelling
– Individual steps
– Caveats
– Pitfalls
• Ab initio protein structure prediction
– Threading
– True ab initio methods
Why Do We Need Homology Modelling?
• Ab initio protein folding (“random” sampling):
– 100 aa with 3 conformations per residue gives 3^100, or approximately 10^48, different overall conformations!
• Random sampling is NOT feasible, even if conformations can be sampled at picosecond (10^-12 s) rates.
– Levinthal’s paradox
• Do homology modelling instead.
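The arithmetic behind the Levinthal argument can be checked in a few lines. This is a minimal sketch that uses only the numbers quoted on the slide (100 residues, 3 conformations per residue, one conformation sampled per picosecond); the variable names are illustrative.

```python
# Levinthal-style estimate using the numbers quoted on the slide.
RESIDUES = 100               # chain length (aa)
CONFS_PER_RESIDUE = 3        # conformations per residue
SAMPLES_PER_SECOND = 1e12    # one conformation per picosecond
SECONDS_PER_YEAR = 3600 * 24 * 365.25

# Exact integer: 3**100, which is about 5 x 10^47 (roughly 10^48).
conformations = CONFS_PER_RESIDUE ** RESIDUES

# Time needed to visit every conformation once.
seconds_needed = conformations / SAMPLES_PER_SECOND
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"conformations: {conformations:.2e}")
print(f"exhaustive sampling: {years_needed:.2e} years")
```

Even at picosecond sampling rates the exhaustive search comes out on the order of 10^28 years, which is why random sampling is not an option.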
How Is It Possible?
• The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough):
– prions
– pH, ions, cofactors, chaperones
• Structure is conserved much longer than sequence in evolution.
– Structure > Function >> Sequence
How Often Can We Do It?
• There are currently ~47000 structures in the PDB (but only ~4000 if you count only those that share no more than 30% sequence identity with one another and have a resolution better than 3.0 Å).
• An estimated 25% of all sequences can be modeled, and structural information can be obtained for ~50%.
Worldwide Structural Genomics
• Complete genomes
• Signaling proteins
• Disease-causing organisms
• Model organisms
• Membrane proteins
• Protein-ligand interactions
"Fold space coverage"
Structural Genomics in North America
• 10-year, $600 million project initiated in 2000, funded largely by NIH.
• AIM: structural information on 10,000 unique proteins (now 4,000-6,000); so far 1,000 have been determined.
• Improve current techniques to reduce time (from months to days) and cost (from $100,000 to $20,000 per structure).
• 9 research centers currently funded (2005); targets are from model and disease-causing organisms (with a separate project on TB proteins).
Homology Modeling for Structural Genomics
Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)
How Well Can We Do It?
Sali, A. & Kuriyan, J. Trends Biochem. Sci. 22, M20–M24 (1999)
How Is It Done?
• Identify template(s) – initial alignment
• Improve alignment
• Backbone generation
• Loop modelling
• Side chains
• Refinement
• Validation
Template Identification
• Search with sequence
– BLAST
– PSI-BLAST
– Fold recognition methods
• Use biological information
– Functional annotation in databases
– Active site/motifs
Alignment
  1   2   3   4   5   6   7   8   9  10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
[Figure: residue-by-residue score matrix for the two sequences being aligned.]
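The initial alignment is normally computed by dynamic programming over a score matrix like the one in the figure. The following is a minimal sketch of a global (Needleman-Wunsch) alignment in plain Python. It assumes a uniform match/mismatch score and a linear gap penalty instead of the residue-specific scores on the slide, so the function name, the parameter values and the resulting gap placement are illustrative only.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    The scoring values are assumptions for illustration; a real run would
    use a substitution matrix such as BLOSUM62.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


# The two sequences from the slide, in one-letter code.
template = "FDICRLPGSAEAVC"
target = "FNVCRTPEAIC"
row_t, row_q, total = needleman_wunsch(template, target)
print(row_t)
print(row_q)
print("score:", total)
```

Several alignments can share the same optimal score; the traceback returns only one of them, which is one reason the initial alignment should be inspected and improved by hand.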
[Figure: the score matrix shown again for the same sequences and alignments.]
Improving the Alignment
  1   2   3   4   5   6   7   8   9  10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS
From "Professional Gambling" by Gert Vriend
http://www.cmbi.kun.nl/gv/articles/text/gambling.html
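The two alternative gap placements can be compared by counting identical positions. This is a small sketch using the slide's sequences in one-letter code; the variable and function names are illustrative.

```python
# Template (row 1 of the slide) and the two alternative alignments (rows 2 and 3).
template    = "FDICRLPGSAEAVC"
alignment_1 = "FNVCRTP---EAIC"
alignment_2 = "FNVCR---TPEAIC"


def identities(ref, aligned):
    """Count columns where both rows hold the same residue (gaps never count)."""
    return sum(1 for r, q in zip(ref, aligned) if q != "-" and r == q)


for name, row in (("alignment 1", alignment_1), ("alignment 2", alignment_2)):
    print(f"{name}: {identities(template, row)} identical positions of {len(template)}")
```

Alignment 1 has 7 identical positions and alignment 2 has 6. A sequence-based score therefore favours alignment 1, but it cannot tell whether the residues on either side of the gap can actually be joined in three dimensions; that has to be checked against the template structure.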
Template Quality
• Selecting the best template is crucial!
• The best template may not be the one with the highest % identity (or the best p-value).
– Template 1: 93% id, 3.5 Å resolution
– Template 2: 90% id, 1.5 Å resolution (likely the better choice)
The Importance of Resolution
[Figure: the effect of resolution, from 4 Å (low) through 3 Å and 2 Å to 1 Å (high).]
Evaluation of NMR Structures
Which regions in the structure are best defined?
Look at the PDB ensembles to see which regions are well-defined.
1RJH, Nielbo et al., Biochemistry, 2003
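One way to put a number on "well-defined" is the spread of each atom across the models of the ensemble. The sketch below assumes the models are already superimposed and list their atoms in the same order; the toy coordinates are invented for illustration and are not taken from 1RJH.

```python
import math


def per_atom_spread(models):
    """RMS deviation of each atom from its mean position across models.

    models: list of models, each a list of (x, y, z) tuples in the same
    atom order, assumed to be already superimposed.
    """
    n_models = len(models)
    n_atoms = len(models[0])
    spread = []
    for k in range(n_atoms):
        # Mean position of atom k over all models.
        mean = [sum(m[k][d] for m in models) / n_models for d in range(3)]
        # Mean squared distance of atom k from that mean position.
        msd = sum(sum((m[k][d] - mean[d]) ** 2 for d in range(3))
                  for m in models) / n_models
        spread.append(math.sqrt(msd))
    return spread


# Toy ensemble (hypothetical): 3 models, 3 atoms; the third atom varies most.
ensemble = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0), (3.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, -0.1, 0.0), (3.0, -2.0, 0.0)],
]
spread = per_atom_spread(ensemble)
for index, value in enumerate(spread):
    print(f"atom {index}: spread {value:.2f} Å")
```

Atoms with a small spread are consistently placed in every model; a large spread marks regions that are poorly restrained by the NMR data and should be treated with caution as a template.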
Ramachandran Plot
• Allowed backbone torsion angles (φ, ψ) in proteins
[Figure: an amino acid residue with its backbone atoms labelled.]
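The backbone torsion angles plotted in a Ramachandran plot are ordinary dihedral angles over four consecutive backbone atoms: φ over C(i-1), N(i), Cα(i), C(i) and ψ over N(i), Cα(i), C(i), N(i+1). The following is a minimal sketch of the dihedral calculation in plain Python; the helper names and the test points are illustrative.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Simple geometric checks: eclipsed (cis), perpendicular, and trans.
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, quarter, trans)
```

Applying `dihedral` to the backbone atoms of every residue in a template gives the (φ, ψ) pairs that a validation program places on the Ramachandran plot.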
Template Quality – Ramachandran Plot
X-ray structure – good data.
NMR structure – low quality data…
Backbone Generation
• Generate the backbone coordinates from the template for the aligned regions.
• Several programs can do this; most of the groups at CASP6 used Modeller:
http://salilab.org/modeller/modeller.html
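The core idea of backbone generation, copying template coordinates wherever the alignment pairs a target residue with a template residue, can be sketched as follows. This is not how Modeller works internally; `copy_backbone` and its arguments are hypothetical names, and a single coordinate per residue stands in for the full set of backbone atoms.

```python
def copy_backbone(template_row, target_row, template_coords):
    """Transfer coordinates for aligned (non-gap) columns.

    template_row, target_row: equal-length aligned strings ('-' = gap).
    template_coords: one entry per template residue, in sequence order.
    Returns one entry per target residue: the template coordinates where a
    template residue is aligned, or None where nothing can be copied and
    the region must be built later (loop modelling).
    """
    model = []
    t_index = 0  # index of the current template residue
    for t_res, q_res in zip(template_row, target_row):
        if q_res != "-":
            model.append(template_coords[t_index] if t_res != "-" else None)
        if t_res != "-":
            t_index += 1
    return model


# Slide alignment (one-letter code) with made-up coordinates along the x axis.
template_row = "FDICRLPGSAEAVC"
target_row   = "FNVCRTP---EAIC"
coords = [(float(i), 0.0, 0.0) for i in range(len(template_row))]

model = copy_backbone(template_row, target_row, coords)
print(len(model), "target residues placed")
```

Template residues 8 to 10 have no partner in the target, so target residue E (the eighth target residue) takes the coordinates of template position 11, and the chain between P and E has to be closed in the loop modelling step.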
Loop Modelling
• Knowledge based:
– Searches PDB for fragments that match the sequence to be modelled (Levitt, Holm, Baker etc.).
• Energy based:
– Uses an energy function to evaluate the quality of the loop and minimizes this function by Monte Carlo (sampling) or molecular dynamics (MD) techniques.
• Combination of the two
Loops – the Rosetta Method
• Find fragments (10 per amino acid) with the same sequence and secondary structure profile as the query sequence.
• Combine them using a Monte Carlo scheme to build the loop.
David Baker et al.
Side Chains
• If the sequence identity is high, the networks of side chain contacts may be conserved, and keeping the side chain rotamers from the template may be better than predicting new ones.
Predicting Side Chain Conformations
• Side chain rotamers are dependent on backbone conformation.
• The most successful method in CASP6 was SCWRL by Dunbrack et al.:
– A graph-theory, knowledge-based method to solve the combinatorial problem of side chain modelling.
http://dunbrack.fccc.edu/SCWRL3.php
Side Chains - Accuracy
• Prediction accuracy is high for buried residues, but much lower for surface residues.
– Experimental reasons: side chains at the surface are more flexible.
– Theoretical reasons: it is much easier to handle hydrophobic packing in the core than the electrostatic interactions, including H-bonds to waters.
Refinement
• Energy minimization
• Molecular dynamics
– Big errors like atom clashes can be removed, but force fields are not perfect and small errors will also be introduced; keep minimization to a minimum or matters will only get worse.
Error Recovery
• If errors are introduced in the model, they normally can NOT be recovered at a later step.
– The alignment cannot make up for a bad choice of template.
– Loop modeling cannot make up for a poor alignment.
• If errors are discovered, the step where they were introduced should be redone.
Validation
• Most programs will get the bond lengths and angles right.
• The Ramachandran plot of the model usually looks much like the Ramachandran plot of the template (so select a high-quality template).
• Inside/outside distributions of polar and apolar residues can be useful.
• Biological/biochemical data
– Active site residues
– Modification sites
– Interaction sites
Validation – ProQ Server
• ProQ is a neural-network-based predictor that estimates the quality of a protein model from a number of structural features.
• ProQ is optimized to find correct models, in contrast to other methods, which are optimized to find native structures.
Arne Elofsson's group: http://www.sbc.su.se/~bjorn/ProQ/
Structure Validation
• ProCheck
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
• WhatIf server
http://swift.cmbi.kun.nl/WIWWWI/
Homology Modelling Servers
• Eva-CM performs continuous and automated analysis of comparative protein structure modeling servers.
• A current list of the best performing servers can be found at:
http://cubic.bioc.columbia.edu/eva/doc/intro_cm.html
Summary – Homology Modelling
• Successful homology modelling depends on the following:
– Template quality
– Alignment (add biological information)
– Modelling program/procedure (use more than one)
• Always validate your final model!
Fold Recognition and Ab Initio Protein Structure Prediction
Outline
• Threading and pair potentials
• Ab initio methods
• Human intervention (what kind of knowledge can be used for alignment and selection of templates?)
• Meta-servers (the principle, 3DJury)
• Summary of take-home messages
Threading and Pair Potentials
• Compares a given sequence against known structures (folds), using potentials that describe tendencies observed in known protein structures.
Example: Pair potentials
How common is it to observe a pair of an alanine and a valine separated by 20 residues in the sequence and 3 Å in space? (X)
How common is it to observe any pair of residues separated by 20 residues in the sequence and 3 Å in space? (Y)
Potential: log (X/Y)
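The log (X/Y) potential can be computed directly from counts collected over known structures. The sketch below assumes that X and Y are relative frequencies, i.e. the number of pairs observed at the given distance divided by the number of pairs at that sequence separation overall; this normalisation and the counts are assumptions made for illustration, not values from a real database.

```python
import math


def pair_potential(pair_in_bin, pair_total, any_in_bin, any_total):
    """log(X / Y) for one residue pair type, sequence separation and distance bin.

    X: fraction of the specific pairs (e.g. Ala-Val at separation 20)
       that fall in the distance bin.
    Y: fraction of all residue pairs at the same separation that fall
       in the same distance bin.
    """
    x = pair_in_bin / pair_total
    y = any_in_bin / any_total
    return math.log(x / y)


# Hypothetical counts: 30 of 1000 Ala-Val pairs are about 3 Å apart,
# against 2000 of 100000 pairs of any type.
score = pair_potential(30, 1000, 2000, 100000)
print(f"log(X/Y) = {score:.3f}")
```

With the slide's sign convention a positive value means the pair is seen at this distance more often than an average pair. Energy-like formulations often use the negative of this quantity, so the sign should be checked before mixing potentials from different sources.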
Potentials of Mean Force
Alignment score from structural fitness (pair potential)
[Figure: the sequence .. A T N L Y K E T L .. threaded onto template positions numbered 1 to 10, with deletions marked.]
How well does K fit the environment at P6? If P8 is acidic then fine; if P8 is basic then poor.
Threading Methods Today
• Problem: No protein is average
• Interactions in proteins cannot be described by pairs of amino acids alone
• The information in the potentials is partly captured with sequence profiles
• Today mostly used in HYBRID approaches in combination with profile-profile-based methods
• Potentials can be used to score models based on different templates or alignments
Ab Initio Methods
• The aim is to find the fold of the native protein by simulating the biological process of protein folding.
• A VERY DIFFICULT task because a protein chain can fold into an astronomically large number of different conformations.
• Use it only when no detectable homologues are available.
• Methods can also be useful for fold recognition in cases of extremely low homology (e.g. convergent evolution).
Fragment-based Ab Initio Modelling
• Rosetta method of the Baker group:
– Submit sequence to a number of secondary structure predictors.
– Compare fragments of 3 and 9 residues to a library from known structures.
– Link fragments together.
– Use energy minimization techniques (Monte Carlo optimization) to calculate tertiary structure.
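The fragment insertion plus Monte Carlo idea can be illustrated with a toy model. This is not Rosetta: the conformation is just a list of integer states per residue, the fragment library and the energy function are invented for illustration, and only the Metropolis acceptance rule is the standard technique.

```python
import math
import random

# Hypothetical fragment library: 3-residue fragments of integer "states".
FRAGMENTS = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (0, 1, 2), (2, 1, 0)]


def toy_energy(conf):
    """Hypothetical energy: penalises differences between neighbouring states."""
    return sum(abs(a - b) for a, b in zip(conf, conf[1:]))


def fragment_monte_carlo(length, fragments, steps=2000, temperature=1.0, seed=0):
    """Metropolis Monte Carlo over fragment insertions; returns the best state seen."""
    rng = random.Random(seed)
    conf = [rng.choice(fragments)[0] for _ in range(length)]
    energy = toy_energy(conf)
    best, best_energy = list(conf), energy
    for _ in range(steps):
        # Propose: overwrite a random window with a random fragment.
        frag = rng.choice(fragments)
        start = rng.randrange(length - len(frag) + 1)
        trial = conf[:start] + list(frag) + conf[start + len(frag):]
        trial_energy = toy_energy(trial)
        delta = trial_energy - energy
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            conf, energy = trial, trial_energy
            if energy < best_energy:
                best, best_energy = list(conf), energy
    return best, best_energy


best_conf, best_e = fragment_monte_carlo(12, FRAGMENTS)
print(best_conf, best_e)
```

Accepting some uphill moves is what lets the search escape local minima. In a real method the energy function and the fragment library carry all the physical and statistical knowledge.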
Potentials for Finding Good Models
• Use of energy potentials for scoring and computing models.
• Potentials should make models more “native-like”.
• These can be based on contact potentials, solvation potentials, van der Waals repulsion and attractive forces, hydrogen bond potentials.
• Globularity/radius of gyration (ab initio).
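The radius of gyration mentioned as a globularity term is simple to compute from coordinates. This is a minimal sketch that treats all atoms as having equal mass (an assumption; a mass-weighted version is also common), with toy coordinates invented for illustration.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal masses)."""
    n = len(coords)
    centroid = [sum(c[d] for c in coords) / n for d in range(3)]
    msd = sum(sum((c[d] - centroid[d]) ** 2 for d in range(3))
              for c in coords) / n
    return math.sqrt(msd)


# Toy comparison: 8 points packed on a unit cube versus 8 points on a line.
compact = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
extended = [(float(i), 0.0, 0.0) for i in range(8)]

print(f"compact:  {radius_of_gyration(compact):.2f}")
print(f"extended: {radius_of_gyration(extended):.2f}")
```

A compact, globular arrangement gives a smaller radius of gyration than an extended chain of the same length, which is why the term can be used to favour native-like compaction in ab initio scoring.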
Problems with Empirical Potentials
Fragments with correct local structure
Human Intervention
• The best methods use maximum knowledge of query proteins.
• Specialists can help to find a correct template and correct alignments.
Knowledge that can be used:
– Knowledge of function
– Cysteines forming disulfide bridges or binding e.g. zinc ions
– Proteolytic cleavage sites
– Other metal binding residues
– Antibody epitopes or escape mutants
– Ligand binding
– Results from CD or fluorescence experiments
– Knowledge of secondary structure
Meta-Servers
• Democratic modeling
– The highest-scoring hit is often wrong.
– Many prediction methods have the correct fold among the top 10-20 hits.
– If many different prediction methods all have the same fold among the top hits, this fold is probably correct.
Example of a Meta-Server
• 3DJury http://bioinfo.pl/meta/
– Inspired by ab initio modeling methods
• The average of frequently obtained low-energy structures is often closer to the native structure than the lowest-energy structure
– Find the most abundant high-scoring model in a list of predictions from several predictors
1. Use output from a set of servers
2. Superimpose all pairs of structures
3. Similarity score based on # of Cα pairs within 3.5 Å
– Similar methods developed by A. Elofsson (Pcons) and D. Fischer (3D shotgun).
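The consensus idea in the three steps can be sketched in a few lines. This is not the 3DJury implementation: it assumes the models are already superimposed and have the same residues in the same order, and it omits any normalisation or thresholds the real server may apply. The function names and toy coordinates are illustrative.

```python
import math


def ca_pairs_within(model_a, model_b, cutoff=3.5):
    """Number of corresponding Cα atoms closer than the cutoff (in Å)."""
    return sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)


def consensus_scores(models, cutoff=3.5):
    """For each model, sum its similarity to every other model."""
    scores = []
    for i, model in enumerate(models):
        scores.append(sum(ca_pairs_within(model, other, cutoff)
                          for j, other in enumerate(models) if j != i))
    return scores


# Toy Cα traces (hypothetical): two similar models and one outlier.
model_a = [(i * 3.8, 0.0, 0.0) for i in range(5)]
model_b = [(i * 3.8, 1.0, 0.0) for i in range(5)]
model_c = [(i * 3.8, 10.0, 0.0) for i in range(5)]

scores = consensus_scores([model_a, model_b, model_c])
print(scores)
```

The two mutually similar models support each other and receive the highest scores, while the outlier gets none; the model with the highest consensus score is the one a jury-style meta-server would report.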
3DJury
• Because it is a meta-server, it can be slow.
• If the queue is too long, some servers are skipped.
• Output is only Cα coordinates.
• What to do with the rest of the structure?
• Use e.g. the MaxSprout server to build side chains and backbone atoms.
http://www.ebi.ac.uk/maxsprout/
Summary – Ab Initio Methods
• Hybrid methods using both threading methods and profile-profile alignments are the best.
• Use ab initio methods only if necessary, and be aware that the quality is really low!
• Try to use as much knowledge as possible for alignment and template selection in difficult cases.
• Use meta-servers when you can.