Align sequence to structure - Computational Bioscience Program


Protein Structure and Prediction
Michael Strong, Ph.D.
Integrated Center for Genes, Environment, and Health
National Jewish Health
From Sequence to Structure
HIV Protease
With Inhibitor
HIV Protease
PQITLWKRPLVTIRIGGQL
KEALLDTGADDTVLEEM
NLPGKWKPKMIGGIGGF
IKVRQYDQIPIEICGHKAI
GTVLVGPT
PVNIIGRNLLTQIGCTLNF
Experimental
Approach
From Sequence to Structure
H1N1 NA
MNPNQKIITIGSVCMTIGMANLILQIG
NIISIWISHSIQLGNQN
QIETCNQSVITYENNTWVNQTYVNISN
TNFAAGQSVVSVKLAGNSSLCPVSGW
AIYSK
DNSVRIGSKGDVFVIREPFISCSPLECRT
FFLTQGALLNDKHSNGTIKDRSPYRTL
MS
CPIGEVPSPYNSRFESVAWSASACHDGI
NWLTIGISGPDNGAVAVLKYNGIITDTI
KS
WRNNILRTQESECACVNGSCFTVMTD
GPSNGQASYKIFRIEKGKIVKSVEMNAP
NYHY
EECSCYPDSSEITCVCRDNWHGSNRP
WVSFNQNLEYQIGYICSGIFGDNPRPN
DKTGS
CGPVSSNGANGVKGFSFKYGNGVWIG
RTKSISSRNGFEMIWDPNGWTGTDN
NFSIKQD
IVGINEWSGYSGSFVQHPELTGLDCIRP
CFWVELIRGRPKENTIWTSGSSISFCGV
NS DTVGWSWPDGAELPFTIDK
Computational
Approach
Protein Building Blocks
Typical Protein Sequence MNPNQKIITIGSVCMTIGMANLILQIGNIISIWISHSIQLGNQN
Amino Acid Side Chain (R groups)
Disulfide Bonds
acidic
+
basic
Most Proteins Spontaneously Fold
DNA
Transcribed by RNA polymerase
RNA
Translated by Ribosome
Folded Protein
Some proteins need chaperones for correct folding
Christian Anfinsen’s Experiment (1950s)
Folded protein, placed under denaturing conditions, becomes unfolded protein.
Returned to native conditions, the unfolded protein refolds to the native state by spontaneous self-organisation (~1 second).
This is important to computational biologists because it suggests that all information relating to the correct folding of a protein is contained in its primary amino acid sequence.
But proteins lack easy rules for folding, as compared to DNA.
Many Factors Influence Protein Folding
Proteins Assume the Lowest Energy Structure
Factors that influence folding include:
1. Hydrophobic interactions / collapse (particularly within the core)
2. Hydrogen bonds, which lead to secondary structures
3. Disulfide bonds (cysteine residues)
4. Salt bridges / ionic interactions (among charged residues)
5. Multimeric interactions with the same type of protein or with other proteins
Common Secondary Structures
Alpha helix
Beta sheet
Loop regions
Example - Hemoglobin
Diversity of Protein Structures
• Isoniazid activating enzyme, KatG: crystal structure
• Streptomycin resistance, gidB: homology model
• Pyrazinamide activating enzyme, pncA: crystal structure
• Rifampin target, rpoB: homology model
• Isoniazid target, inhA: crystal structure
• Fluoroquinolone target, gyrA: crystal structure
• Ethionamide target, inhA: crystal structure
• Streptomycin resistance, rpsL: homology model
http://www.proteopedia.org/wiki/index.php/User:Michael_Strong/TB
Experimental Methods of Structure Determination
X-ray crystallography
High resolution structure determination
• Grow a protein crystal
• Intensities and phases of all reflections are combined in a Fourier transform to provide maps of electron density
• Phases are determined by using heavy metals or selenomethionine (MAD)
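The Fourier synthesis mentioned above is commonly written as follows. The symbols are standard crystallographic notation rather than anything defined on the slides: V is the unit-cell volume, |F(hkl)| is the structure-factor amplitude obtained from the measured intensity I(hkl), and α(hkl) is the phase that has to be determined separately.

```latex
\rho(x,y,z) \;=\; \frac{1}{V} \sum_{h}\sum_{k}\sum_{l}
  \lvert F(hkl) \rvert \, e^{\,i\alpha(hkl)} \, e^{-2\pi i\,(hx + ky + lz)},
\qquad \lvert F(hkl) \rvert \propto \sqrt{I(hkl)}
```

Only the amplitudes come directly from the diffraction experiment, which is why the phases need heavy-metal or selenomethionine methods.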
NMR – Nuclear Magnetic Resonance
High resolution structure determination
• Smaller proteins than X-ray
• Distances between pairs of hydrogen atoms: a pair gives an NOE cross-peak if the two atoms are within 5.0 Å
• Lots of information about dynamics
• Requires soluble, non-aggregating material
• Assignment sometimes difficult
Cryo Electron Microscopy
Low to medium resolution structure determination (~10-15 Å)
• Limited information about dynamics
• Can be used for very large molecules and complexes
Database of Protein Structures
PDB – Protein Data Bank
95,113 protein structures as of 10/31/2013
Even so, the growth in solved structures lags far behind the rate at which new genes are being sequenced (GenBank sequences). Solution: Computational Structural Methods
PDB – Protein Data Bank Files
• Atoms in PDB files are defined by their Cartesian coordinates (x, y, z).
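As an illustration of how those coordinates are stored, here is a minimal sketch that reads one fixed-width ATOM record. The column positions follow the PDB file format; the `Atom` type, the function name and the example record are made up for this sketch and are not from the lecture.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM/HETATM record (column numbers are 1-based)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain=line[21],                # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue number
        x=float(line[30:38]),          # columns 31-38: x in Angstroms
        y=float(line[38:46]),          # columns 39-46: y in Angstroms
        z=float(line[46:54]),          # columns 47-54: z in Angstroms
    )


# Hypothetical record, built with a format string so the columns line up.
example = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "PRO", "A", 1, 29.361, 39.686, 5.862
)
atom = parse_atom_line(example)
print(atom.name, atom.res_name, atom.chain, atom.res_seq, (atom.x, atom.y, atom.z))
```

In practice a structure library would handle the many edge cases (alternate locations, insertion codes, multiple models); the point here is only that each atom is a row of x, y, z values.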
Visualization of PDB files
PyMOL, Jmol, Chimera, etc.
DALI Structural Alignments
Align protein structures (structure superposition)
Generates a comparison matrix: each protein is transformed into a 2D array of distances between C-alpha atoms. The Z-score reflects the reliability of a match, and the superposition with the lowest RMSD is identified.
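The distance-matrix representation can be sketched in a few lines. This is not DALI's algorithm (its scoring and Z-score statistics are not reproduced here); it only shows why a matrix of C-alpha distances is convenient: it does not change when a structure is rotated or translated. The function names and the toy coordinates are assumptions for illustration.

```python
import math


def distance_matrix(coords):
    """Return the N x N matrix of pairwise distances between C-alpha positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def matrix_difference(m1, m2):
    """Mean absolute difference between two equally sized distance matrices."""
    n = len(m1)
    total = sum(abs(m1[i][j] - m2[i][j]) for i in range(n) for j in range(n))
    return total / (n * n)


# Toy C-alpha trace, and the same trace rotated 90 degrees about z and shifted.
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
trace_b = [(-y + 10.0, x + 5.0, z + 1.0) for x, y, z in trace_a]

# The two matrices agree even though the raw coordinates differ.
print(matrix_difference(distance_matrix(trace_a), distance_matrix(trace_b)))
```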
From Sequence to Structure: Computational Approach
H1N1 NA sequence (as above)
Secondary Structure Prediction: alpha helix, beta strand, or other
Tertiary Predictions:
1. Homology Modeling
2. Fold Recognition
3. De Novo Protein Structure Prediction
Secondary Structure Prediction
1st and 2nd generation: looked at the probability of each amino acid being in a helix, strand, or other (coil/loop) based on known structures. Chou-Fasman (short runs of amino acids), GOR (Bayesian, takes neighbors into account)
- helices: no prolines, periodicity of 3.6 residues/turn
- strands: alternating hydropathy, or hydrophilic ends and a hydrophobic center
- other: small, polar, flexible residues, and prolines
But accuracy stalled at 55-60%
3rd generation: also used position-specific profiles based on multiple sequence alignments (evolutionary information; i.e., insertions/deletions are more likely to be in coil/turn), PSI-BLAST and hidden Markov models (HMM), neural networks (NN) and support vector machines (SVM) (improved to about 75-80%)
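The periodicity rules above can be turned into a small calculation. The sketch below is neither Chou-Fasman nor GOR; it uses a hydrophobic moment, which measures how strongly hydropathy repeats at a given angle per residue. A helix with 3.6 residues/turn corresponds to 360/3.6 = 100 degrees per residue, and strictly alternating hydropathy corresponds to 180 degrees. The Kyte-Doolittle scale and the test peptide are assumptions chosen for illustration.

```python
import math

# Kyte-Doolittle hydropathy scale (an assumption: the slides do not name a scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

HELIX_ANGLE = 360.0 / 3.6  # 100 degrees per residue for an alpha helix
STRAND_ANGLE = 180.0       # idealised strict alternation along a strand


def hydrophobic_moment(seq, degrees_per_residue):
    """Per-residue hydrophobic moment of seq at the given angular period."""
    step = math.radians(degrees_per_residue)
    s = sum(KD[aa] * math.sin(i * step) for i, aa in enumerate(seq))
    c = sum(KD[aa] * math.cos(i * step) for i, aa in enumerate(seq))
    return math.hypot(s, c) / len(seq)


# Hypothetical peptide with alternating hydrophobic (V) and charged (K) residues.
peptide = "VKVKVKVK"
strand_like = hydrophobic_moment(peptide, STRAND_ANGLE)
helix_like = hydrophobic_moment(peptide, HELIX_ANGLE)
print(f"strand period: {strand_like:.2f}, helix period: {helix_like:.2f}")
```

For this alternating peptide the moment at the strand period is much larger than at the helix period, which is the signal a rule such as "alternating hydropathy" relies on.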
Secondary Structure Prediction
But we really want to know how the protein
folds in three dimensions
CASP - Critical Assessment of Techniques for Protein Structure Prediction
• Started in 1994; helped push the field of structure prediction
• “Contest-like” setup
• Categories include:
  • Homology Modeling / Comparative Modeling
  • Fold Recognition / Threading
  • Ab Initio, De novo
  • Partially automated vs. fully automated methods (now quite similar results)
Goal: predict structures that have been solved but not yet published or released; the experimental structures are then used to evaluate the predictions. With each round, predictions and algorithms get better.
Comparative Modeling “Homology Modeling”
• Proteins that have similar sequences (i.e., related by evolution) are likely to have similar three-dimensional structures
• BLAST the sequence of interest against the PDB to identify a template
• Multiple templates can be used if desired
• Templates with ligands bound can be used to identify binding sites and interacting residues in the homology model
• The sequence identity required depends on protein length. A good rule of thumb is to have at least 40% sequence identity; higher is better. Below 25% is not reliable (zone of uncertainty)
• Above 75% sequence identity, the homology model is usually quite reliable
• Accurate sequence alignments are very important
• Programs include MODELLER and SWISS-MODEL
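The identity thresholds above can be applied mechanically once a query-template alignment is available. The following is a minimal sketch: identity is counted over aligned, non-gap columns (one of several conventions in use), the label for the 25-40% band is an assumption since the slide does not name it, and the template string is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def template_quality(identity):
    """Map percent identity to the rule-of-thumb categories from the slide."""
    if identity >= 75:
        return "usually quite reliable"
    if identity >= 40:
        return "meets the 40% rule of thumb"
    if identity >= 25:
        return "below the rule of thumb, use with caution"  # assumed label
    return "not reliable"


# Query: start of the HIV protease sequence; the template row is made up.
aligned_query = "PQITLWKRPL"
aligned_template = "PQVTLW-RPL"
identity = percent_identity(aligned_query, aligned_template)
print(f"{identity:.1f}% identity: {template_quality(identity)}")
```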
Comparative Modeling “Homology Modeling”
Steps include:
1. Template recognition and initial alignment
2. Alignment correction (a multiple sequence alignment can help)
3. Backbone generation (transfer coordinates from the template)
4. Loop modeling (loops are hard to predict where there are insertions)
5. Side chain modeling (torsion angles are usually similar at high sequence identity)
6. Model optimization (minor energy minimization steps, or restrain some atom positions)
7. Model validation (higher identity is usually more accurate; calculate the energy, or a normality index such as bond lengths and torsion angles)
8. Iteration (to refine)
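Steps 3 and 4 can be sketched as follows: coordinates are copied from the template wherever the alignment pairs a target residue with a template residue, and target residues that face a gap are left without coordinates, to be built later by loop modeling. This is a simplified illustration using one C-alpha position per residue, not how MODELLER or SWISS-MODEL work internally; the function name, sequences and coordinates are hypothetical.

```python
def transfer_backbone(aligned_target, aligned_template, template_coords):
    """Copy template coordinates onto the target according to an alignment.

    Returns one (residue, coordinate) pair per target residue. The coordinate
    is None where the target has an insertion relative to the template.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    model = []
    template_index = 0
    for target_res, template_res in zip(aligned_target, aligned_template):
        coord = None
        if template_res != "-":
            if target_res != "-":
                coord = template_coords[template_index]
            template_index += 1  # consume a template residue either way
        if target_res != "-":
            model.append((target_res, coord))
    return model


# Toy alignment: the target has a two-residue insertion (TA) and lacks the template's G.
aligned_target = "MKTAL-V"
aligned_template = "MK--LGV"
template_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
               (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]

for residue, coord in transfer_backbone(aligned_target, aligned_template, template_ca):
    print(residue, coord if coord is not None else "needs loop modeling")
```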
Protein Threading, Fold Recognition
Often, seemingly unrelated proteins adopt similar folds (through divergent or convergent evolution). Threading is used for sequences with low or no sequence homology.
Protein Threading
§ Generalization of the homology modeling method
• Homology modeling: align sequence to sequence
• Threading: align sequence to structure (templates)
For each alignment, the probability that each amino acid residue would occur in such an environment is calculated based on observed preferences in determined structures.
§ Rationale:
• Limited number of basic folds found in nature
• Amino acid preferences for different structural environments provide sufficient information to choose the best-fitting protein fold (structure)
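A toy version of the scoring idea: estimate how often each residue type is seen in each structural environment, convert that to a log-odds preference, and sum the preferences along a sequence placed on a template. Real threading programs use richer environment classes, gapped alignments and pair terms; the two environments ("buried", "exposed"), the observation counts and the function names below are all invented for illustration.

```python
import math
from collections import Counter


def environment_log_odds(observations, pseudocount=1.0):
    """log(P(residue | environment) / P(residue)) from (residue, environment) pairs."""
    pair_counts = Counter(observations)
    residue_counts = Counter(res for res, _ in observations)
    env_counts = Counter(env for _, env in observations)
    residues = sorted(residue_counts)
    total = len(observations)
    k = len(residues)
    table = {}
    for env in env_counts:
        for res in residues:
            p_res_given_env = (pair_counts[(res, env)] + pseudocount) / (
                env_counts[env] + pseudocount * k)
            p_res = (residue_counts[res] + pseudocount) / (total + pseudocount * k)
            table[(res, env)] = math.log(p_res_given_env / p_res)
    return table


def threading_score(sequence, environments, table):
    """Sum of log-odds for an ungapped placement of sequence on a template."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and template must have equal length")
    return sum(table[(res, env)] for res, env in zip(sequence, environments))


# Invented counts: leucine is mostly buried, lysine mostly exposed.
observed = ([("L", "buried")] * 8 + [("L", "exposed")] * 2
            + [("K", "buried")] * 1 + [("K", "exposed")] * 9)
prefs = environment_log_odds(observed)

template_env = ["buried", "exposed", "buried", "exposed"]
print(threading_score("LKLK", template_env, prefs))  # fits the template
print(threading_score("KLKL", template_env, prefs))  # fits poorly
```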
Fold recognition
• The number of possible protein structures/folds is limited: there is a large number of sequences but relatively few folds (some estimate ~1000). This became most apparent when 50% of solved structures with no sequence homology turned out to have folds similar to known structures
• 90% of new structures deposited in the PDB have folds similar to those already known
• Proteins that do not have similar sequences sometimes have similar three-dimensional structures (such as the TIM barrel fold)
  Example: NK-lysin (1nkl) and Bacteriocin T102/as48 (1e68): 3.6 Å, 5% ID
• A sequence whose structure is not known is fitted directly (or “threaded”) onto a known structure, and the “goodness of fit” is evaluated using a discriminatory function
• Need ways to move the model closer to the native structure
Ab initio prediction of protein structure – concept
Goal: predict a structure given only its amino acid sequence
In theory: the lowest energy conformation
Difficult because the conformational search space is huge
• Go from sequence to structure by sampling the conformational space in a reasonable manner and selecting a native-like conformation using a good discrimination function
Difficult for sequences larger than 150 aa
Rosetta (David Baker lab) is one of the best (CASP evaluation)
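A back-of-the-envelope estimate, in the spirit of Levinthal's argument, shows how large the search space is. Both numbers below are assumptions chosen only for illustration: three backbone states per residue and 10^13 conformations sampled per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def conformations(n_residues, states_per_residue=3):
    """Number of backbone conformations if every residue has a few discrete states."""
    return states_per_residue ** n_residues


def exhaustive_search_years(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Years needed to enumerate every conformation at a fixed sampling rate."""
    seconds = conformations(n_residues, states_per_residue) / samples_per_second
    return seconds / SECONDS_PER_YEAR


for length in (50, 100, 150):
    print(length, f"{conformations(length):.2e}", f"{exhaustive_search_years(length):.2e} years")
```

Even under these generous assumptions, enumerating a 150-residue chain is hopeless, whereas real proteins fold in about a second. This is why sampling has to be guided (for example by fragments from known structures) rather than exhaustive.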
Rosetta structure prediction
2 phases
1. Low-resolution phase: statistical scoring function and fragment assembly
   A. Local structure conformations using information from the PDB (3- and 9-mer stretches)
   B. Multiple fragment substitution with simulated annealing to find the best arrangement of the fragments (Monte Carlo search)
   C. Low-resolution ensemble of decoy conformations
2. Atomic refinement phase using rotamers and small backbone angle moves (in populated regions of the Ramachandran plot)
   A. Refinement
   B. Structures are then clustered based on RMSD
   C. Centers of the largest clusters are chosen as representative folds (likely to be the correct fold)
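The final clustering step can be illustrated with a simple neighbour-count scheme: the decoy with the most other decoys within an RMSD cutoff is taken as the center of the largest cluster. This is a sketch of the idea rather than Rosetta's own implementation. It assumes the decoys are already superimposed (no optimal superposition is performed), and the decoy coordinates and the cutoff are made up.

```python
import math


def rmsd(a, b):
    """RMSD between two equally long coordinate lists, assumed already superimposed."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def largest_cluster_center(decoys, cutoff):
    """Return (center index, member indices) for the decoy with the most neighbours."""
    best_center, best_members = None, []
    for i, decoy in enumerate(decoys):
        members = [j for j, other in enumerate(decoys) if rmsd(decoy, other) <= cutoff]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members


# Five toy two-atom decoys: three near the origin, two far away.
decoys = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (1.2, 0.0, 0.0)],
    [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(5.1, 0.0, 0.0), (6.1, 0.0, 0.0)],
]
center, members = largest_cluster_center(decoys, cutoff=0.15)
print("center decoy:", center, "cluster members:", members)
```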
Quality Assessment
Ramachandran plot (phi/psi angles): used to identify residues that may be in the wrong conformation
Programs: PROCHECK, WHAT_CHECK
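The phi and psi angles plotted in a Ramachandran diagram are backbone dihedral angles, each defined by four consecutive atoms: phi by C(i-1), N(i), CA(i), C(i) and psi by N(i), CA(i), C(i), N(i+1). A minimal sketch of the calculation follows, using the common atan2 form of the dihedral formula. The helper names and test points are illustrative, and this is not the code used by PROCHECK or WHAT_CHECK.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond, in (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone phi and psi of one residue from five atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)


# Simple geometric checks with points around the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```

Computing these two angles for every residue and comparing them with the populated regions of the plot is what flags residues with unusual backbone conformations.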