Structural Analysis
"Big Science at Small Colleges"
Curriculum Development Workshop
Ranyee Chiang
University of California, San Francisco
Outline – Structural Analysis
• Key themes
• Structural visualization tools (hands-on
tutorial)
• Structure prediction
• Structure prediction tools (hands-on activity)
• Genetic variation and structures
• Hemoglobin exploration (hands-on activity)
Proteins have many functions
• Enzymes– catalyze chemical reactions
• Transport proteins– transport other molecules
across cell membranes and throughout the
body
• Structural proteins– provide support
• Receptors– regulate and coordinate bodily
activities
Proteins have characteristic shapes
• A protein is made of a chain of amino acids
RGAEEVWWPILG…
• After the protein is produced, it folds up
[Figure: the 20 amino acids: Arg, His, Lys, Asp, Glu, Ser, Thr, Asn, Gln, Cys, Gly, Pro, Ala, Ile, Leu, Met, Phe, Trp, Tyr, Val]
sequence → structure → function
KNGTIVTADGI… → D-hydantoinase → cleavage of a 5-membered cyclic diamide
• same sequences → same structures
• similar sequences → similar structures
RGAEEVWWPILG…
RGRGEVWWPILG…
→ Way to deduce function from structure or sequence
→ Way to deduce structure from sequence
How to measure protein similarity?
Comparing sequences…
RGAEEVWWPILG
RGRGEVWWPILK
Comparing structures…
How similar are these two sequences?
RGAEEVWWPILGRRKHGPKRLGRRKHGPKR
RGATEVRWPILGRRKHGPKRLGRRKHGPKR
These sequences have 30 amino acids,
28 are identical → Sequence identity = 28/30 ≈ 93%
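A minimal sketch of the calculation, assuming the two sequences are already aligned position by position with no gaps. The function name `sequence_identity` is illustrative and not taken from any particular tool; the two strings are the ones printed above.

```python
def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count positions where both sequences carry the same amino acid.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


seq_1 = "RGAEEVWWPILGRRKHGPKRLGRRKHGPKR"
seq_2 = "RGATEVRWPILGRRKHGPKRLGRRKHGPKR"
identity = sequence_identity(seq_1, seq_2)
print(f"{len(seq_1)} residues, sequence identity = {identity:.1%}")
```

Real sequence comparisons usually need an alignment step first, because insertions and deletions shift the positions that should be compared.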
How similar are two structures?
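[Figure: two superimposed three-residue fragments, GRK and GKK; the distances between corresponding residues are 3.0, 4.0, and 5.0 Å]
• $\mathrm{RMSD} = \sqrt{(d_1^2 + d_2^2 + d_3^2)/3} = \sqrt{(9 + 16 + 25)/3} \approx 4.1$ Å
• More similar pairs of structures → lower RMSD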
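A small sketch of the RMSD calculation, using the standard root-mean-square definition: the squared distances are averaged over the number of atom pairs before the square root is taken. It assumes the two structures have already been superimposed and that the per-pair distances (3.0, 4.0, 5.0) are given; the function name `rmsd` is illustrative.

```python
import math


def rmsd(distances):
    """Root-mean-square deviation from per-pair distances (same units as input)."""
    if not distances:
        raise ValueError("at least one distance is required")
    # Mean of the squared distances, then the square root.
    mean_square = sum(d * d for d in distances) / len(distances)
    return math.sqrt(mean_square)


pair_distances = [3.0, 4.0, 5.0]
value = rmsd(pair_distances)
print(f"RMSD = {value:.2f}")
```

Dividing by the number of pairs keeps the value comparable between small and large proteins; without it, the number would grow with chain length even for equally good superpositions.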
Tutorials
• Introduction to structural visualization
• http://www.cgl.ucsf.edu/chimera/current/docs/UsersGuide/tutorials/menutut.html OR
• http://www.cgl.ucsf.edu/chimera/current/docs/UsersGuide/tutorials/getting_started.html
• Structure comparisons
• http://www.cgl.ucsf.edu/chimera/current/docs/UsersGuide/tutorials/squalene.html
• Sequence and structural alignments
• http://www.cgl.ucsf.edu/chimera/current/docs/UsersGuide/tutorials/super.html
Other resources
• Fold It - http://fold.it/portal/adobe_main
• Solve protein folding puzzles
• Enables researchers to collect data about human
pattern recognition and problem-solving
• Learn elements of protein folding and stability
• List of sequence alignments (pre-generated and online servers) - http://www.cgl.ucsf.edu/home/meng/sources.html
• List of structure alignments (pre-generated and online servers) - http://www.cgl.ucsf.edu/home/meng/grpmt/structalign.html
Protein Structure Prediction with Emphasis
on Comparative or Homology Modeling
1. Introduction and motivation
2. Types of protein structure prediction methods
3. Comparative modeling
4. Errors in comparative models
5. Modeling of loops in protein structures
6. Prediction of errors in comparative models
7. Structural genomics
8. Tools
Protein structure provides important
information
• Knowledge of a protein’s structure helps us
• design drugs that target that protein
• engineer new functions for that protein
• determine its evolutionary relationship to other
proteins
Experimental determination of structures
is costly
How much does it cost to determine the
crystal structure of a protein?
NIH estimate: $250,000 [1]
[1] R. Service. Structural Genomics, Round 2. Science 307, 1554-1558, 2005.
Why Protein Structure Prediction?
2008:
• Sequences: 5,000,000+
• Structures: 49,000
We have an experimentally determined atomic structure for only
~1% of the known protein sequences.
Principles of protein structure
GFCHIKAYTRLIMVG…
• Folding (physics): Ab initio prediction
• Evolution (“statistical” rules): Threading, Comparative Modeling
[Figure: example structures from Anacystis nidulans, Anabaena 7120, Chondrus crispus, and Desulfovibrio vulgaris]
D. Baker & A. Sali. Science 294, 93, 2001.
The “physics” principle
The native structure of a protein is determined by its
amino acid sequence, under native conditions
(uniqueness, stability, kinetic accessibility).
C.B. Anfinsen
The “comparative modeling” principle
Evolution of protein families
[Figure: Cα RMSD in Å (% EQV) versus % sequence identity for structures from Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris, and Clostridium mp.; RMSD axis ticks 0 (100), 1 (80), 2 (50); sequence identity axis ticks 20, 50, 100]
• Families (very similar sequences): 30,000
• Superfamilies (similar sequences): 10,000
• Folds (similar 3D structure): 3,000
~30% are known
10/2/02
Comparative Protein Structure Modeling
[Figure: Flavodoxin family; Cα RMSD in Å (% EQV) versus % sequence identity for Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris, and Clostridium mp.; comparative modeling applied to the sequence KIGIFFSTSTGNTTEVA…]
Protein structure modeling
Ab initio prediction
• Applicable to any sequence.
• Not very accurate (>4 Å RMSD).
• Attempted for proteins of <100 residues.
• Accuracy and applicability are limited by our understanding of the protein folding problem.
Comparative Modeling
• Applicable only to sequences that share recognizable similarity to a template structure.
• Fairly accurate (<3 Å RMSD), typically comparable to a low-resolution X-ray experiment.
• Not limited by size.
• Accuracy and applicability are limited by the number of known folds.
Steps in Comparative Protein Structure Modeling
START
1. Template Search
2. Target – Template Alignment
3. Model Building
4. Model Evaluation
5. OK? If No, repeat earlier steps; if Yes, END
TARGET sequence:
ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPERASFQWMNDK
Target – template alignment (TARGET on the first row, TEMPLATE on the second):
ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE
MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE
M. Marti-Renom et al. Ann. Rev. Biophys. Biomolec. Struct. 29, 291, 2000.
N. Eswar et al. Curr. Protocols Bioinformatics 5.6, 2006.
http://salilab.org/
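The target-template alignment above contains a gap, so identity has to be counted over aligned columns only. The sketch below shows one common convention (columns with a gap in either row are skipped); other tools normalize by the shorter sequence or by the full alignment length, so treat the convention and the helper name `aligned_identity` as assumptions.

```python
def aligned_identity(row_a: str, row_b: str, gap: str = "-"):
    """Return (identical columns, gap-free columns) for two alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    # Keep only columns where both rows carry a residue.
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    matches = sum(a == b for a, b in pairs)
    return matches, len(pairs)


target_row = "ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE"
template_row = "MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE"
matches, aligned = aligned_identity(target_row, template_row)
print(f"{matches} identical of {aligned} aligned columns "
      f"({matches / aligned:.0%} identity)")
```

The resulting percentage is the quantity that the later accuracy plots use on their horizontal axis: the higher the target-template identity, the more reliable the model tends to be.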
Comparative modeling by satisfaction of spatial restraints: MODELLER
3D  GKITFYERGFQGHCYESDC-NLQP…
SEQ GKITFYERG---RCYESDCPNLQP…
1. Extract spatial restraints
2. Satisfy spatial restraints
$F(R) = \prod_i p_i(f_i / I)$
A. Šali & T. Blundell. J. Mol. Biol. 234, 779, 1993.
J.P. Overington & A. Šali. Prot. Sci. 3, 1582, 1994.
A. Fiser, R. Do & A. Šali, Prot. Sci., 9, 1753, 2000.
http://salilab.org/
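Extracting spatial restraints from template
spatial restraint: limit on structural feature of model
template GKTIFYERKRD…
target   GKITFY– RGRF…
[Figure: a 12 Å distance measured in the template is transferred to the corresponding, still unknown (?) distance in the target]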
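To make the idea of "satisfying" restraints concrete, here is a toy sketch. It assumes each restraint is a Gaussian probability density centered on the value seen in the template, and scores a model by the negative logarithm of the product of those densities, so a lower score means better agreement. The feature names, the 1.0 Å standard deviation, and the two candidate models are hypothetical; the restraint forms MODELLER actually uses are more elaborate than this.

```python
import math


def gaussian_pdf(value: float, mean: float, sigma: float) -> float:
    """Probability density of a normal distribution at `value`."""
    z = (value - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def objective(features: dict, restraints: dict) -> float:
    """Negative log of the product of restraint densities (lower is better)."""
    total = 0.0
    for name, (mean, sigma) in restraints.items():
        # -ln(prod p_i) equals the sum of -ln(p_i).
        total -= math.log(gaussian_pdf(features[name], mean, sigma))
    return total


# Hypothetical restraint: a distance observed as 12 A in the template.
restraints = {"distance_a": (12.0, 1.0)}
model_close = {"distance_a": 12.3}  # candidate model near the template value
model_far = {"distance_a": 15.0}    # candidate model far from the template value

print(f"close model: {objective(model_close, restraints):.2f}")
print(f"far model:   {objective(model_far, restraints):.2f}")
```

A modeling program would search over atomic coordinates to minimize such a score across thousands of restraints at once, rather than comparing two fixed candidates.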
Key components of modeling
• Representation
• Sampling
• Scoring
Some restraints in MODELLER that are useful in comparative modeling
Homology-based (from related structures):
• p(distance / d’,a,g,s,i)
• p(SDCH / R,S’,R’,t,s)
• p(MNCH / R,M’,R,s)
Statistical potentials (from all known structures):
• p(distance / atom types)
• p(MNCH / residue type)
• p(SDCH / residue type)
MM Force-Field (structure-independent):
• CHARMM-19, 22, 
• Generalized Born / Surface Area solvation
Šali & Blundell. J. Mol. Biol. 234, 779, 1993.
Overington & Šali. Prot. Sci. 3, 1582, 1994.
Fiser, Do & Šali. Prot. Sci. 9, 1753, 2000.
Melo, Sanchez & Šali. Prot. Sci. 11, 430, 2002.
M.-Y. Shen, B. Webb
M. Karplus et al.
Model accuracy as a function of
target-template sequence identity
[Plot: fraction of Cα atoms within 3.5 Å of their correct positions, as a function of target-template sequence identity]
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
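The accuracy measure named here can be sketched as follows, assuming the model has already been superimposed on the experimentally determined structure and that both are given as lists of Cα coordinates in the same residue order. The coordinates below are invented for illustration, and `fraction_within_cutoff` is a hypothetical helper name.

```python
import math


def fraction_within_cutoff(model_ca, native_ca, cutoff=3.5):
    """Fraction of C-alpha atoms lying within `cutoff` angstroms of the native position."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # math.dist gives the Euclidean distance between two points (Python 3.8+).
    close = sum(1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff)
    return close / len(model_ca)


# Invented coordinates (angstroms) for a four-residue fragment.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 4.0), (11.4, 2.0, 2.0)]

print(f"fraction within 3.5 A: {fraction_within_cutoff(model, native):.2f}")
```

Unlike RMSD, this measure is not dominated by a few badly placed residues, which makes it a convenient summary when part of a model (for example a loop) is wrong but the core is right.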
Typical errors in comparative models
• Incorrect template
• Misalignment
• Region without a template
• Distortion/shifts in aligned regions
• Sidechain packing
[Figure legend: MODEL, X-RAY, TEMPLATE]
Marti-Renom et al. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325, 2000.
Protein structure models can be useful, despite errors
D. Baker & A. Sali. Science 294, 93, 2001.
Loop Modeling in Protein Structures
 barrel: flavodoxin
IG fold: immunoglobulin
antiparallel β-barrel
A. Fiser, R. Do & A. Šali, Prot. Sci. 9, 1753, 2000.
Loop modeling strategies
Database search
Conformational search
• even in a database (DB) search, the different conformations must be ranked
• loops longer than 4 residues need extensive optimization
• the DB method is efficient for specific families (e.g., canonical loops in Ig’s, β-hairpins)
Model Evaluation Methods
Is the fold correct?
How correct is the overall structure?
What regions are modeled incorrectly?
What is the best model in the set of alternative models?
Does the model satisfy the restraints used to calculate it?
Stereochemistry test (PROCHECK)
Residue environment test (Profiles3D)
Statistical potential tests (PROSAII)
Other statistical tests, including tests with multiple criteria
(GA341).
Molecular mechanics force field tests.
Structural Genomics
Sali. Nat. Struct. Biol. 5, 1029, 1998.
Sali et al. Nat. Struct. Biol., 7, 986, 2000.
Sali. Nat. Struct. Biol. 7, 484, 2001.
Baker & Sali. Science 294, 93, 2001.
Goal: Characterize most protein sequences based
on related known structures.
The number of “families” is
much smaller than the
number of proteins.
Any one of the members of
a family is fine.
Eswar et al. Nucl. Acids Res. 31, 3375–3380, 2003.
MODPIPE: Automated Large-Scale Comparative Modeling
START
For each target sequence:
1. Get profile for sequence (SP/TrEMBL)
For each template profile:
2. Align sequence profile with multiple structure profile using local dynamic programming (MODELLER)
3. Select templates using permissive E-value cutoff
4. Build models for target segment by satisfaction of spatial restraints (MODELLER)
5. Evaluate models
END
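The alignment step relies on local dynamic programming. As a rough illustration of that technique only, the sketch below computes a Smith-Waterman local alignment score for two plain sequences. MODPIPE aligns profiles rather than single sequences, and the scoring values used here (match +2, mismatch -1, gap -2) are arbitrary assumptions rather than the pipeline's parameters.

```python
def smith_waterman_score(seq_a: str, seq_b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between two sequences (linear gap penalty)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] = best score of a local alignment ending at seq_a[i-1], seq_b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + step  # align the two residues
            up = score[i - 1][j] + gap             # gap in seq_b
            left = score[i][j - 1] + gap           # gap in seq_a
            # The floor at 0 is what makes the alignment local.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("RGAEEVWWPILG", "RGRGEVWWPILK"))
```

Because scores never drop below zero, a well-matching segment is found even when the flanking regions are unrelated, which suits template searches where only one domain of the target matches a known structure.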
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA
95, 13597, 1998.
Eswar et al. Nucl. Acids Res. 31, 3375–3380,
2003.
Pieper et al., Nucl. Acids Res. 32, 2004.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan,
B. John, A. Fiser, R. Sánchez, F. Melo, N.
Mirkovic, B. Webb, M.-Y. Shen, A. Šali.
MODBASE: models for domains in ~1.6 million sequences
http://salilab.org/modbase
[Screenshots: Search Page, Sequence Overview, Model Overview, Model Details]
Pieper et al. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Research, 2006.
Seminal papers
• Sippl, M. J. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213, 859-83 (1990).
• Ponder, J. W. & Richards, F. M. Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J Mol Biol 193, 775-91 (1987).
• Sali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815 (1993).
• Bowie, J. U., Luthy, R. & Eisenberg, D. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164-70 (1991).
ModBase Activity
• Uploaded document
Other resources
• An Interactive NCBI Mini-Course
• http://www.ncbi.nlm.nih.gov/Class/minicourses/quickstructure.html
• Identify conserved domains
• Search for other proteins with conserved
domains
• Explore modeling template
• Find distant homologs
Understanding the impact of human genetic variation is a key challenge
• Disease predispositions: Hemoglobin E6V, sickle cell anemia
• Response to medications: Cytochrome P450 R144C, warfarin-induced bleeding
Most disease-associated human genetic
variants are missense mutants
AGTGAC
AGTGTC
ALFLDVSDQTPINSIIFSHED
ALFLDVSVQTPINSIIFSHED
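The link between the nucleotide change and the amino acid change can be checked with a short sketch. It builds the standard genetic code, translates the two six-letter fragments above (accepting either T or U in the input), and reports where the two protein sequences differ in the usual "wild type, position, variant" notation. The helper names `translate` and `missense_changes` are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides: str) -> str:
    """Translate a DNA or RNA string (length divisible by 3) to one-letter amino acids."""
    dna = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def missense_changes(wild_type: str, variant: str) -> list:
    """List substitutions such as 'D8V' between two equal-length protein sequences."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        f"{w}{position}{v}"
        for position, (w, v) in enumerate(zip(wild_type, variant), start=1)
        if w != v
    ]


print(translate("AGTGAC"), "->", translate("AGTGTC"))
print(missense_changes("ALFLDVSDQTPINSIIFSHED", "ALFLDVSVQTPINSIIFSHED"))
```

A single base change in the second codon (GAC to GTC) is enough to replace aspartate with valine, which is exactly what defines a missense variant.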
There are many ways that missense
mutants can impact protein function
• Protein aggregates and does not
fold
• Protein is destabilized
• Binding interfaces are disrupted
• Active sites are disrupted
There are many ways that the functional
impact of a missense mutant can be
assessed
• Biochemical experiments
• Physics
• Epidemiology
• Bioinformatics
Predicting functional impact of nsSNP
A knowledge-based bioinformatics prediction approach
begins with sets of missense mutants that have been
characterized biochemically or clinically
• 35,000+ clinically or biochemically characterized nsSNPs
can be found in web databases and literature
Clinically characterized
.0092 CYSTIC FIBROSIS [CFTR, ARG352GLN]
In a systematic study of 133 CF individuals in northern Italy, Gasparini et al. (1993) identified an arg352-to-gln mutation.
Biochemically characterized
p53 is a transcription factor that regulates the cell cycle
Kato et al. PNAS 100:14, 2003
A pipeline for large-scale non-synonymous SNP
annotation
D. Haussler, University of California, Santa Cruz (Hsu et al. 2006):
• Genomic DNA
• Predicted Genes
• mRNA (RefSeq)
• Proteins (UNIPROT)
• SNPs (dbSNP)
MODPIPE (Eswar et al. 2003):
• Protein structure modeling by homology
A pipeline for large-scale nsSNP annotation
• SVM: predicted functional impact (SVM Annotations)
• Homology Transfer Annotations
LS-SNP
http://salilab.org/LS-SNP
• SVM annotations for 24,000 nsSNPs
• 20.3% classified as deleterious
• Previous estimates 20-30%
• Homology transfer annotations for 19,000 nsSNPs
• Queryable by multiple data types
Karchin, Diekhans, Kelly, Thomas, Pieper, Eswar, Haussler and Sali, Bioinformatics. 21(12):2814-20, 2005
Results of an example LS-SNP query
http://salilab.org/LS-SNP
Hemoglobin Exploration
• Uploaded activity
Resources
• LS-SNP - http://alto.compbio.ucsf.edu/LSSNP/