Homology modeling workshop

Homology Modeling
Workshop
GHIKLSYTVNEQNLKPERFFYTSAVAIL
Outline:
• Introduction to protein structure & databases
• Structure prediction approaches
– Ab-initio
– Threading
– Homology modeling
• Hands ON
• Model evaluation
From Sequence to Structure
Protein structure is hierarchic:
• Primary – sequence of covalently attached amino acids
• Secondary – local 3D patterns (helices, sheets, loops)
• Tertiary – overall 3D fold
• Quaternary – two or more protein chains
From Sequence to Structure
• All information about the native structure of a protein is
encoded in the amino acid sequence + its native solution
environment.
• Many possible conformations → still, only one or a few native
folds are exhibited for each protein (Levinthal’s paradox)
• Protein folding is driven by various forces:
– Ionic forces
– Hydrogen bonds
– The hydrophobic effect
– ...
Protein 3D Structures
A protein’s structure has a critical effect on its function:
1. Binding pockets
PDB ID 1nw7
Protein 3D Structures
A protein’s structure has a critical effect on its function:
2. Areas of specific chemical/electrical properties
Protein 3D Structures
A protein’s structure has a critical effect on its function:
3. Importance of the global fold for function
Motivation to Acquire a Structure
• Identifying active and binding sites
• Characterization of the protein’s mechanism
(catalysis & interactions)
• Searching for ligands of a given binding site
• Understanding the molecular basis of diseases
• Designing mutants
• Drug design
• And more...
Determining Structure
• NMR
• X-ray diffraction
• Electron Microscopy
Protein Sequence
& Structure Databases
Some of the available databases:
• RCSB- the Protein Data Bank- all deposited structures
• UniProt- main sequence database
– SwissProt
– TrEMBL
• NCBI- lots of databases, including sequence and structures
• PDBsum- combines structural & sequence data
Protein Data Bank (PDB)
• The PDB archive contains information about experimentally determined structures of proteins, nucleic acids, and complex
assemblies.
• The structures in the archive range from tiny proteins and bits
of DNA to complex molecular machines like the ribosome.
• There are currently 57013 structures deposited in the PDB.
However, removing redundant sequences (e.g. at 90% sequence identity) reduces
the number of structures to 19988…
• Each structure receives a unique 4-character ID
Protein Data Bank (PDB)
[Figure: number of structures in the PDB by year]
SCOP – fold classification
All alpha
All beta
Alpha and beta
Protein Data Bank (PDB)
[Figure: growth of unique folds as defined by SCOP (number by year)]
UniProt- Protein Sequence
Database
• UniProt is a collaboration between the
European Bioinformatics Institute (EBI), the
Swiss Institute of Bioinformatics (SIB) and the
Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool
their resources and expertise and formed the
UniProt Consortium.
UniProt- Protein Sequence
Database
• The world's most comprehensive catalog of information on
proteins
• Sequence, function & more…
• Comprised mainly of the databases:
– SwissProt – 366226 last year, 412525 protein entries now –
high quality annotation, non-redundant & cross-referenced to many
other databases.
– TrEMBL - 5708298 last year, 7527796 protein entries now –
computer translation of the genetic information from the EMBL
Nucleotide Sequence Database → many proteins are poorly annotated
since only automatic annotation is generated
More Sequences Than Structures
• Discrepancy between the number of known
sequences and solved structures:
5,047,807 UniRef90 entries vs.
19988 90% Non-redundant structures
Computational methods are needed to
obtain more structures
Protein Structure Prediction
Why predict protein structure if we can use
experimental tools to determine it?
• Experimental methods are slow and expensive
• Some structures could not be solved experimentally
• One representative structure of a family can suffice to
deduce the structures of all sequences in that family
Structure Prediction Approaches
1. Homology (Comparative) Modeling
Based on sequence similarity with a protein for
which a structure has been solved.
2. Threading (Fold Recognition)
Requires a structure similar to a known structure
3. Ab-initio fold prediction
Not based on similarity to a sequence/structure
Ab-initio
Structure prediction from “first principles”:
Given only the sequence, try to predict the structure
based on physico-chemical properties
(energy, hydrophobicity etc.)
• When all else fails → works for novel folds
• Shows that we understand the process
The Force Field
(energy function)
A group of mathematical expressions describing the
potential energy of a molecular system
• Each expression describes a different type of physicochemical interaction between atoms in the system:
– Van der Waals forces
– Covalent bonds
– Hydrogen bonds
– Charges
– Hydrophobic effects
[Figure label: Non-bonded terms]
Approaches to Ab-initio Prediction
1. Molecular Dynamics
• Simulates the forces that govern the protein within water.
• Since proteins usually naturally fold, this would lead to the
native protein structure.
Problems:
• Thousands of atoms
• Huge number of time steps to reach folded protein
→ feasible only for very small proteins
Approaches to Ab-initio Prediction
2. Minimal Energy
Assumption: the folded form is the minimal energy
conformation of a protein
Main principles:
• Define an energy function.
• Search for the 3D conformation that minimizes the energy.
Ab-initio
2. Minimal Energy
• Use of simplified energy function
• Search methods for minimal energy conformation:
– Greedy search
– Simulated annealing
–…
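The simulated-annealing search named above can be sketched on a toy problem. The one-dimensional energy function, the move size and the cooling schedule below are illustrative assumptions and not a protein force field; the point is only the accept/reject logic.

```python
import math
import random


def toy_energy(x):
    """Hypothetical 1D 'energy landscape' with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(energy, x0, steps=20000, t_start=2.0, t_end=1e-3, seed=0):
    """Minimize `energy` using Metropolis acceptance and geometric cooling."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        x_new = x + rng.uniform(-0.5, 0.5)  # small random move
        e_new = energy(x_new)
        # Downhill moves are always accepted; uphill moves are accepted
        # with the Boltzmann probability exp(-dE / T).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = simulated_annealing(toy_energy, x0=4.0)
print(f"lowest energy found: {best_e:.3f} at x = {best_x:.3f}")
```

A greedy search corresponds to the limiting case in which uphill moves are never accepted, which is why it tends to get trapped in the nearest local minimum.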
Ab-initio
• Current methods (e.g. Rosetta) primarily utilize the
fact that although we are far from observing all
protein folds, we probably have seen nearly all substructures:
Local sequence-structure relationships:
• A library of known sub-structures
(fragments less than 10 residues) is created.
• A range of possible conformations for
each fragment in the query protein is selected.
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Ab-initio
Non-local sequence-structure relationships:
• The primary nonlocal interactions considered are hydrophobic
burial, electrostatics, main-chain hydrogen bonding etc.
Structures that are consistent with both the local and
non-local interactions are generated by minimizing
the non-local interaction energy in the space defined
by the local structure distributions.
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Ab-initio - Example
Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
Fold Recognition
(Threading)
Given a sequence and a library of folds, thread the sequence
through each fold. Take the one with the highest score.
• Method will fail if new protein does not belong to any fold in
the library.
• Score of the threading is computed based on known
physical chemistry properties and statistics of amino acids.
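Threading: example
• structural template
• neighbor definition
• energy function: $E = \sum_{i,j\ \text{neighbors}} E_{a_i b_j}$, where $a_i$ and $b_j$ are the residue types at positions $i$ and $j$
[Figure: template with positions 1-10, threaded with the sequence ACCECADAAC]
Score of this threading: -3-1-4-4-1-4-3-3 = -23
Pairwise energies Eab:
      A    C    D    E   …
A    -3   -1    0    0   …
C    -1   -4    1    2   …
D     0    1    5    6   …
E     0    2    6    7   …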
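A contact-based threading score of this kind can be sketched in a few lines. The Eab values are the ones from the example; the neighbor list is a hypothetical one chosen for illustration, since the neighbor definition of the figure is not recoverable from the transcript, so the resulting score differs from the -23 of the slide.

```python
# Pairwise contact energies E_ab from the example (symmetric matrix).
E = {
    ("A", "A"): -3, ("A", "C"): -1, ("A", "D"): 0, ("A", "E"): 0,
    ("C", "C"): -4, ("C", "D"): 1, ("C", "E"): 2,
    ("D", "D"): 5, ("D", "E"): 6,
    ("E", "E"): 7,
}


def pair_energy(a, b):
    """Look up E_ab regardless of the order of the two residue types."""
    return E[(a, b)] if (a, b) in E else E[(b, a)]


def threading_score(sequence, contacts):
    """Sum E_ab over all neighboring template positions (i, j), 1-based."""
    return sum(pair_energy(sequence[i - 1], sequence[j - 1]) for i, j in contacts)


# Hypothetical neighbor list for a 10-position template (NOT the one in the figure).
contacts = [(1, 8), (2, 5), (3, 10), (6, 9), (1, 10), (4, 7)]
score = threading_score("ACCECADAAC", contacts)
print(score)
```

Fold recognition then amounts to repeating this calculation for every template in the fold library and ranking the templates by score.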
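Find best fold for a protein sequence:
Fold recognition (threading)
[Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded through each fold of the library (1 ... 56 ... n) and every fold receives a score (values shown: -10, -123, 20.5); one fold is labeled “Potential fold”]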
GenTHREADER
• Align the query sequence with each template
(requires some sequence homology!)
• Assess the alignment by:
– Sequence alignment score
– Pairwise potentials
– Solvation function
• Record lengths of: alignment, query, template
• A neural network computes the overall score from these inputs.
Jones DT et al. J. Mol. Biol. 287: 797-815(1999)
GenTHREADER
Jones DT et al. J. Mol. Biol. 287: 797-815(1999)
I-TASSER- Hybrid Approach
• In a recent community-wide blind experiment, CASP7, I-TASSER
generated the best 3D structure predictions among all
automated servers.
• Based on secondary-structure threading and the iterative
implementation of the Threading ASSEmbly Refinement
(TASSER) program.
• For predicting the biological function of the protein, the
I-TASSER server matches the predicted 3D models to the
proteins in 3 independent libraries, which consist of proteins of
known enzyme classification (EC) number, gene ontology (GO)
vocabulary, and ligand-binding sites.
I-TASSER
Test Case:
Rosetta Vs. TASSER
Grey: Crystal
structure of the beta2
adrenergic receptor
Purple: Rosetta
prediction, starting
from homology
modeling
Green: TASSER
prediction
Homology Modeling
Homology Modeling –
Basic Idea
1. A protein structure is defined by
its amino acid sequence.
2. Closely related sequences adopt
highly similar structures; distantly
related sequences may still fold
into similar structures.
3. The three-dimensional structure of
proteins from the same family is
more conserved than their
primary sequences.
Triosephosphate isomerases
44.7% sequence identity
0.95 RMSD
General Scheme
1. Searching for structures related to the query sequence
2. Selecting templates
3. Aligning query sequence with template structures
4. Building a model for the query using information from
the template structures (NEST)
5. Evaluating the model
Fiser A et al. Methods in Enzymology 374: 461-491(2004)
General Scheme
Homology modeling requires
handling structures & sequences
• Query- only the protein sequence is available- usually found
at the UniProt database
• Template- after identification, both structural and sequence-related data should be found- UniProt (or NCBI databases),
RCSB and PDBsum
1. Searching For Structures
• Sequence search against the PDB sequences
• Sequence-profile search
• Threading: sequence-structure fitness function
1. Searching For Structures
If a BLAST search against the PDB fails to identify adequate
templates, turn to fold recognition (threading) servers:
• FFAS03- http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
• HHPRED- http://toolkit.tuebingen.mpg.de/hhpred
• HMAP (available through the PUDGE pipeline)- http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PUDGE
• I-TASSER- http://zhang.bioinformatics.ku.edu/I-TASSER/
These servers not only find candidate templates, but also suggest a
pairwise alignment and in some cases even construct the 3D
model.
2. Selecting Templates
How to select the right template?
• Higher sequence similarity - %ID
• Close subfamily - phylogenetic tree
• “Environment” similarity - solvent, pH, ligand,
quaternary interactions
• The quality of the experimentally determined
structure
• Purpose of modeling - e.g. protein-ligand model vs.
geometry of active site
[Figure: phylogenetic tree of Seq. 1 to Seq. 6]
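The %ID criterion can be computed directly from a pairwise alignment. The sketch below counts identity over columns in which both sequences have a residue; other conventions for the denominator exist (alignment length, shorter sequence length), so this choice is an assumption. The two strings are only the first 20 columns of the query-template alignment used later in the hands-on part, not the full alignment.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        identical += a == b
    return 100.0 * identical / compared if compared else 0.0


# First 20 alignment columns of DAPB_ECOLI (query) and 1VM6 chain A (template).
query_part = "MHDANIRVAIAGAGGRMGRQ"
template_part = "-----MKYGIVGYSGRMGQE"
print(percent_identity(query_part, template_part))
```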
2. Selecting Templates
More than one template
• Two ways to combine multiple templates:
– Global model – the templates align with different domains of
the target, with little overlap between them
– Local model – the templates align with the same part of the
target
2. Selecting Templates
More than one template
The more the merrier -
multiple structures with
the same fold:
2. Selecting Templates
Trial and error
• Generate a model for each candidate
template and/or their combination.
• Evaluate the models by an energy or
any other scoring function.
(will be discussed later…)
3. Aligning query and
template sequences
• All comparative modeling programs depend on a
target-template alignment.
• When the sequence similarity between the template
and target proteins is high, simple pairwise alignments
are usually fine (e.g. Needleman-Wunsch global
alignment).
• Gaps or low/medium sequence similarity indicate that
we should improve the alignment...
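For the high-similarity case, a global alignment of the Needleman-Wunsch type can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions; real protein alignments typically use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```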
3. Aligning query and
template sequences
Guidelines:
1. Create a multiple sequence alignment and extract the
template-query pairwise alignment.
Pairwise alignments – not enough!
• Visual inspection of alignments - difficult to teach…
a matter of experience…
[Figure: Template and Query rows of the alignment]
2. Use secondary structure information to improve the
pairwise alignment - avoid gaps in these regions!
[Figure: Query and Template with secondary structure]
3. Previous biochemical and structural data
3. Aligning query and
template sequences
Tips for MSA building
• Where? (to find homologues)
• Structural templates- search against the PDB
• Sequence homologues- search against SwissProt or
UniProt (recommended!)- usually using BLAST
• How many?
• As many as possible, as long as the MSA looks good
(next week…)
3. Aligning query and
template sequences
Tips for MSA building
• How long? (length of homologues)
• Fragments- short homologues (less than 50-60% of the
query’s length) = bad alignment
• Ensure your sequences exhibit the wanted domain(s)
• N/C termini tend to vary in length between homologues
• How close? (distance from query sequence)
• All too close- no information
• Too many too far- bad alignment
• Ensure that you have a balanced collection!
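The length rule of thumb above can be applied automatically before building the MSA. The sketch below is a minimal filter under stated assumptions: the function name, the 0.6 cutoff (the upper end of the range mentioned above) and the toy sequences are hypothetical, and only exact duplicates are treated as "too close".

```python
def filter_homologues(query, hits, min_length_fraction=0.6):
    """Keep hits that are long enough relative to the query and not exact duplicates.

    `hits` maps a sequence name to its (unaligned) sequence.
    """
    kept = {}
    seen = {query}
    for name, seq in hits.items():
        if len(seq) < min_length_fraction * len(query):
            continue  # fragment: likely to degrade the alignment
        if seq in seen:
            continue  # identical sequence adds no information
        seen.add(seq)
        kept[name] = seq
    return kept


# Toy data for illustration only.
query = "ACDEFGHIKL"
hits = {
    "frag": "ACDE",
    "dup_of_query": "ACDEFGHIKL",
    "hom1": "ACDEFGHIKV",
    "hom2": "ACDEYGHIK",
    "hom1_copy": "ACDEFGHIKV",
}
print(list(filter_homologues(query, hits)))
```

Balancing close and distant homologues would additionally need pairwise identities, which this sketch does not compute.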
3. Aligning query and
template sequences
Tips for MSA building
• From whom? (which species the sequence belongs to)
• Don’t care, all homologues are welcome
• Orthologues/paralogues may be helpful
• Sequences from distant/close species provide different
types of information
• Which alignment method?
• The best today are MUSCLE, T-Coffee and MAFFT. All
available at
3. Aligning query and
template sequences
Tips for MSA building
• Most importantly, make sure that both the query
and the selected template are included in the MSA.
• Sequences that are more distant from the query than the template
need not be included in the alignment.
3. Aligning query and
template sequences
Query-template alignment
via a profile-to-profile approach:
1. Construct an MSA for the query, serving as a profile that captures
the properties of the protein family.
2. Align the profile to profiles of all proteins of the PDB, using,
e.g., FFAS03 or HHpred.
3. Compare pairwise alignments constructed via the different
methods – hope to get a consensus prediction…
3. Aligning query and
template sequences
Different levels of similarity between the template & query
call for different computational approaches:
4. Building a model
Once you have an improved pairwise
alignment between your query & template
Use NEST to build your model!
Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M., Mitros, T., Soto, C. S.,
Goldsmith-Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I. Y. Y., Alexov, E. and
Honig, B. (2003) Using Multiple Structure Alignments, Fast Model Building, and
Energetic Analysis in Fold Recognition and Homology Modeling.
Proteins: Struct., Funct. and Genet. 53:430-435.
NEST
Incorporates a variety of programs to
facilitate the model building
• Input:
1. Sequence alignment of a query to one (or more) template PDBs
2. The template PDB file(s) in the same directory
• Output: a 3D model in PDB format
• Capabilities:
1. Model building with artificial evolution
2. Sequence alignment tuning
3. Composite structure building / multiple templates
4. Structure refinement
NEST
Based on “artificial evolution”:
• Changes to the template structure, such as residue mutations,
insertions or deletions, are made one at a time.
• After each change, a slight energy minimization is performed
to avoid atom clashes.
• This process is repeated until the target sequence is
completely modeled.
• The resulting structure is subjected to minimization; the energy is calculated based on a simplified potential function
that includes: van der Waals, hydrophobic, electrostatic, torsion
angle and hydrogen-bond terms.
5. Model Evaluation
• The accuracy of the model depends on its
sequence identity with the template:
5. Model Evaluation
The model can be assessed at two levels:
• Global- reliability of the model as a whole.
*Useful when several models are generated and
one should be chosen as the best one.
*When different models were based on various
templates, may help choose the best one.
• Local- assessing the reliability of the different
regions, even specific residues, of the model.
*Useful to detect local mistakes, which often
originate from alignment errors.
5. Model Evaluation
Examples of assessment approaches:
1. Assessment of the model’s stereochemistry
2. Prediction of unreliable regions of the model - “pseudo energy” profile: peaks → errors
3. Consistency with experimental observations
4. Consistency with evolutionary conservation rates
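The "peaks → errors" idea of a pseudo-energy profile can be sketched as a sliding-window average over per-residue scores, flagging residues whose smoothed value exceeds a threshold. The window size, the threshold and the toy scores are assumptions for illustration; actual evaluation servers use their own statistical potentials and cutoffs.

```python
def smooth_profile(scores, window=5):
    """Sliding-window mean of per-residue scores (the window shrinks at the termini)."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def flag_regions(scores, threshold=0.0, window=5):
    """Return 1-based residue numbers whose smoothed pseudo-energy exceeds the threshold."""
    return [i + 1 for i, s in enumerate(smooth_profile(scores, window)) if s > threshold]


# Toy per-residue pseudo-energies: a peak around residues 4-6.
toy_scores = [-1, -1, -1, 2, 3, 2, -1, -1, -1]
print(flag_regions(toy_scores, threshold=0.0, window=3))
```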
Summary:
5 Basic Steps
Hands ON
The Query Protein
Name: Dihydrodipicolinate reductase
Enzyme reaction:
Molecular process: Lysine biosynthesis (early stages)
Organism: E. coli
Sequence length: 273 aa
1. Searching For Structures
Get your sequence
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
http://www.uniprot.org/
1. Searching For Structures
Find templates with significant homology:
• BLAST against the sequences in the PDB
Find also more distant templates, using a profile-to-profile approach:
• FFAS03 server
• HHPRED server
1. Searching For Structures
Blast against the PDB
http://www.ncbi.nlm.nih.gov/BLAST/
1. Searching For Structures
Blast against the PDB
1. Paste sequence
2. Select the PDB database
3.
http://www.ncbi.nlm.nih.gov/BLAST/
1. Searching For Structures
Use fold recognition - FFAS03
1. Paste sequence
Select the PDB database
Run
1. Searching For Structures
Use fold recognition - HHPRED
http://toolkit.tuebingen.mpg.de/hhpred
1. Paste sequence
Select the PDB database
Run
2. Selecting templates
Blast against the PDB
The real structure
of our protein
Closest homologous
structure
2. Selecting templates
Blast against the PDB
The selected
template:
1VM6, chain A
http://www.ncbi.nlm.nih.gov/BLAST/
2. Selecting templates
Use fold recognition - FFAS03
http://ffas.ljcrf.edu/ffas-cgi/cgi/get_mu.pl?ses=&qdb=public&tdb=PDB0408&type=re&key=221830166.3750.0000000
2. Selecting templates
Use fold recognition - FFAS03
Scores below -9.5 → significant
2. Selecting templates
Use fold recognition - HHPRED
http://toolkit.tuebingen.mpg.de/hhpred/histograms/8455009
2. Selecting templates
Use fold recognition - HHPRED
2. Selecting templates
Who is our template?
PDB ID 1VM6 is
UniProt entry
‘DAPB_THEMA’
www.ebi.ac.uk/thornton-srv/databases/pdbsum
3. Alignment
Find query’s homologous sequences
1. Paste query sequence
2.
http://conseq.bioinfo.tau.ac.il/
3. Alignment
Find query’s homologous sequences
Download the
query’s
alignment
3. Alignment
Extract query-template pairwise alignment
1. Open: Start → Phylogeny → BioEdit
2. Open the alignment: File → Open → ‘query.aln’
3. Select the template:
Edit → Search → Find in Titles → “DAPB_THEMA”
3. Alignment
Extract query-template pairwise alignment
“DAPB_THEMA”
3. Alignment
Extract query-template pairwise alignment
4. Add the query to the template selection: ctrl + ‘query’
5. Invert selection: Edit → invert title selection
6. Delete other sequences: Edit → Cut Sequence(s)
7. Minimize gaps: Alignment → Minimize Alignment
8. Save the pairwise alignment:
File → Save as → “DAPB_ECOLI_1VM6.fas”
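The same extraction can also be scripted. The sketch below assumes the MSA has been saved in FASTA format (the workshop file ‘query.aln’ may be in a different format) and that the titles are plain identifiers; it keeps the query and template rows and removes columns that are gaps in both, which is what the "Minimize Alignment" step achieves. The function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {title: sequence}."""
    records, title = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            title = line[1:].split()[0]  # first word of the header line
            records[title] = []
        elif title is not None:
            records[title].append(line)
    return {t: "".join(parts) for t, parts in records.items()}


def extract_pair(msa, query_id, template_id, gap="-"):
    """Return the two aligned rows with columns that are gaps in both removed."""
    q, t = msa[query_id], msa[template_id]
    columns = [(a, b) for a, b in zip(q, t) if not (a == gap and b == gap)]
    return "".join(a for a, _ in columns), "".join(b for _, b in columns)


# Toy MSA for illustration only.
toy_msa = read_fasta(">query\nAC--GT\n>other\nACTTGT\n>DAPB_THEMA\nA---GT\n")
print(extract_pair(toy_msa, "query", "DAPB_THEMA"))
```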
3. Alignment
Extract query-template pairwise alignment
query
DAPB_THEMA
File name
Save as “fasta” format
3. Alignment
Use fold recognition - FFAS03
Scores below -9.5 → significant
3. Alignment
Use fold recognition - FFAS03
http://ffas.ljcrf.edu/ffas-cgi/cgi/get_mu.pl?ses=&qdb=public&tdb=PDB0408&type=re&key=221830166.3750.0000000
3. Alignment
Use fold recognition - HHPRED
http://toolkit.tuebingen.mpg.de/hhpred/histograms/8455009
3. Alignment
Use fold recognition - HHPRED
3. Alignment
Edit query-template pairwise alignment
• NEST requires a specific file format - unfortunately we will
have to edit the pairwise alignment.
3. Alignment
Edit query-template pairwise alignment
>P1;SEQ
sequence:DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAGKTGV
TVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQAIRDAAAD
IAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTALAMGEAIAH
ALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGERLEITHKASSR
MTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
>P1;1VM6
structure:1VM6:A
-----MKYGIVGYSGRMGQEIQKVFSE-KGHELVLKVDV-----------------------NGVEEL-DSPDVVIDFSSPEALPKTVDLCKKYRAGLVLGTTALKEEHLQMLRELSKE
VPVVQAYNFSIGINVLKRFLSELVKVLE-DWDVEIVETHHRFKKDAPSGTAILLESAL-------------------GK----SVPIHSLRVGGVPGDHVVVFGNIGETIEIKHRAISR
TVFAIGALKAAEFLVGKDPGMYSFEEVI-----
Notes:
• sequence:DAPB_ECOLI - the name of the query protein (this will
be the name of the modeled PDB file)
• structure:1VM6:A - the PDB file of the template
(rename DAPB_THEMA)
Save as “DAPB_ECOLI_1VM6.pir”
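The manual edit can be replaced by a small formatter that writes a pairwise alignment in the layout of the example above. This is a sketch that mirrors only what the example shows; whether NEST accepts or requires further fields is not confirmed here, and the function name and line width are assumptions.

```python
def to_nest_pir(query_id, query_aln, template_id, chain, template_aln, width=60):
    """Format a pairwise alignment as a query block followed by a template block."""
    def wrap(seq):
        # Break an aligned sequence into lines of at most `width` characters.
        return [seq[i:i + width] for i in range(0, len(seq), width)]

    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    lines = [
        ">P1;SEQ",
        f"sequence:{query_id}",
        *wrap(query_aln),
        f">P1;{template_id}",
        f"structure:{template_id}:{chain}",
        *wrap(template_aln),
    ]
    return "\n".join(lines) + "\n"


# Toy alignment for illustration only.
print(to_nest_pir("DAPB_ECOLI", "ACGT", "1VM6", "A", "A-GT"))
```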
4. Model Building
Get the template structure
1. Paste the template’s PDB ID “1VM6”
2.
http://www.rcsb.org/pdb/home/home.do
4. Model Building
Get the template structure: 1vm6 chain A
Save as: “1VM6.pdb”
Notice: case sensitive!
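If the downloaded file contains more than one chain, chain A can be isolated with a few lines of code. The sketch relies on the fixed-column PDB format, in which the chain identifier is column 22 of ATOM and HETATM records; the function name is hypothetical, and the workshop itself does not state that this step is required.

```python
def extract_chain(pdb_text, chain_id="A"):
    """Keep the ATOM/HETATM records of one chain and terminate with END."""
    kept = []
    for line in pdb_text.splitlines():
        is_coordinate = line.startswith(("ATOM", "HETATM"))
        # Column 22 of a PDB record is index 21 in a Python string.
        if is_coordinate and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"


# Minimal mock records, padded so that the chain ID sits in column 22.
record_a = "ATOM".ljust(21) + "A"
record_b = "ATOM".ljust(21) + "B"
print(extract_chain(record_a + "\n" + record_b + "\n", "A"))
```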
4. Model Building
File Transfer: pir file + template PDB
1. Open explorer
2. Enter address: ftp://[email protected]
3. Enter:
user: nest
password: uniprot1
4. Create directory: “[your name]”
5. Copy the pairwise pir file & the pdb file into your directory
[Screenshot label: Your user name]
4. Model Building
Open:
Start → Programs → Tera Term
1. Enter server
2.
4. Model Building
Run NEST:
• Enter: cd [your name]
• Type: nest DAPB_ECOLI_1VM6.pir
• Take a coffee break…
4. Model Building
Run NEST
• Well, the model (“DAPB_ECOLI_final.pdb”) is
probably ready by now:
Copy it back to the computer using the FTP window…
5. Evaluation
Model Visualization
1. Open: Start → Bioinformatics → RasTop
2. Get the model: File → Open → DAPB_ECOLI_final.pdb
5. Evaluation
Active Site Residues
5. Evaluation
Stereochemistry - PROCHECK
5. Evaluation
Model Conservation
http://consurf.tau.ac.il
Real Vs. Model
Superimposition
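Once the model and the real structure are superimposed, their agreement is commonly summarized as an RMSD over matched atoms. The sketch below assumes the two coordinate lists are already superimposed and list equivalent atoms (for example C-alpha atoms) in the same order; it does not perform the superposition itself, and the toy coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points, already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy coordinates: every atom of the second set is shifted by 1 along z.
real = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(real, model))
```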
Useful Links
1. Searching for structures
– PDB-Blast at NCBI - http://blast.ncbi.nlm.nih.gov/Blast.cgi
– Meta server - 3D-Jury - http://bioinfo.pl/meta/
– FFAS03 - http://ffas.ljcrf.edu/ffas-cgi/cgi/ffas.pl
– HHPRED - http://toolkit.tuebingen.mpg.de/hhpred
– PUDGE pipeline - http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PUDGE
2. Selecting templates
3. Aligning query sequence with template structures
– MSA - MUSCLE, T-Coffee and MAFFT at http://toolkit.tuebingen.mpg.de/sections/alignment
– Alignment editor - BioEdit - http://www.mbio.ncsu.edu/BioEdit/bioedit.html
4. Building a model
– NEST - http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:nest
– Modeller - http://salilab.org/modeller/modeller.html
5. Evaluating the model
– ConSurf - http://consurf.tau.ac.il
– PROCHECK - http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
– WHATCHECK - www.cmbi.kun.nl/swift/whatcheck/
– ProSA - https://prosa.services.came.sbg.ac.at/prosa.php
– ProQ - http://www.sbc.su.se/~bjornw/ProQ/ProQ.cgi
– At the Honig lab - http://luna.bioc.columbia.edu/Model_Quality_Assessment/cgi-bin/Model_Quality_Assessment.cgi
Any future questions:
Maya Schushan
[email protected]