Predicting Protein Structure and Beyond


PREDICTING PROTEIN STRUCTURE
AND BEYOND ….
P. V. Balaji
Biotechnology Center
I.I.T., Bombay
Organization of the talk
1. Why predict the structure?
2. Methods for structure prediction
3. What next?
Genome Size is not Proportional to
the Complexity of the Organism
[Plot: complexity vs. size of the genome]
Molecular Logic of Life is Same
English: 26-letter alphabet; only one grammar; extremely diverse literature
Genome: 4-letter alphabet; only one grammar; extremely diverse organisms
Biochemically, all things living –
animals, plants, bacteria, viruses, etc. –
are remarkably similar
Genome Sequencing and Analysis:
One of the Key Steps in Deciphering
the Logic of Life
Even minute details have to be analyzed
Hang him, not let him go
Hang him not, let him go
Humans: NeuNAc (–CH3)
Chimpanzees: NeuNGc (–CH2OH)
Innovations in Technology Have Made
Genome Sequencing a Routine Affair
Genome sequencing
Completed: ~70 organisms
In the pipeline: Several more
“ … it is unlikely that the base sequence of more
than a few percent of such a complex DNA will
ever be determined …”
C W Schmid & W R Jelinek, Science, June 1982
One Aspect of Genome Sequence Analysis
is to Assign Functions to Proteins
(Reverse Genetics)
Proteins are workhorses of the cell
Are involved in every aspect of living systems
Function of a Protein can be Defined
at Different Levels
Example: Lysozyme
Biochemical level: Hydrolyzes C—O bond
Physiological level: Breaks down the cell wall
Cellular level: Defense against infection
Different Analysis Tools Provide
Functions at Different Levels
Hallmark of Proteins: Specificity
Know exactly which small molecule (ligand)
they should bind to or interact with
Also know which part of a macromolecule
they should bind to
Origin of Specificity
Function is critically dependent on structure
1ruv.pdb
Structure – Key to Dissect Function
Location of mutants, conserved residues, SNPs
Dynamics (breathing)
Crystal packing
Functional oligomerization
Interaction interfaces
Clefts (active sites)
Surface shape & charge: antigenic sites, surface patches
Relative juxtaposition: catalytic clusters, motifs, catalytic mechanism
Fold: evolutionary relationships
Sequence Determines Structure
1 KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHES
LADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTT
QANKHIIVACEGNPYVPVHFDASV 124
1ruv.pdb
Christian B. Anfinsen: Nobel Prize in Chemistry (1972)
How Does Sequence Specify Structure?
Sequence → ? → Structure → Function (functional genomics)
The Protein Folding Problem
(second half of the genetic code)
Structure has to be determined experimentally
Experimental Methods of
Structure Determination
X-ray crystallography
Provides a static picture
Bottlenecks: solubilization of the over-expressed protein; obtaining crystals that diffract
Nuclear Magnetic Resonance spectroscopy
Provides a dynamic picture
Bottlenecks: solubilization of the over-expressed protein; size limit is a major factor
Limitations of Experimental Methods:
Consequences
Annotated proteins in the databank: ~ 100,000
Total number including ORFs: ~ 700,000
Proteins with known structure: ~5,000! (dataset for analysis)
An ORF, or Open Reading Frame, is a region of the genome that
can code for a protein
ORFs have been identified by whole-genome sequencing efforts
ORFs with no known function are termed orphans
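The ORF definition above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a forward-strand scan, ATG as the sole start codon, and TAA, TAG and TGA as stop codons; the names find_orfs and min_codons are hypothetical, and real gene finders do considerably more (reverse strand, alternative starts, statistical scoring).

```python
# Assumption: an ORF runs from an ATG to the first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATTTGGGTAACC"):
        print(start, end, frame)
```

An ORF found this way with no detectable similarity to any characterized protein would be an orphan in the sense used above.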
Structural Biology Consortia:
Brute Force Approach Towards
Structure Elucidation
Aim to solve about 400 structures a year
Employ battalions of Ph.D.s & postdoctoral researchers
Large-scale expression & crystallization attempts
Basic strategies remain the same
No (known) new tricks
“Unrelenting” proteins (those that resist expression or crystallization) will be ignored
Enhances the statistical base for inferring
sequence – structure relationships
Predicting Protein Structure:
1. Comparative Modeling
(formerly, homology modeling)
KQFTKCELSQNLYDIDGYGRIALPELICTMF
HTSGYDTQAIVENDESTEYGLFQISNALWCK
SSQSPQSRNICDITCDKFLDDDITDDIMCAK
KILDIKGIDYWIAHKALCTEKLEQWLCEKE
Structure: ? (1alc)
Homologous: share similar sequence
Use as template & model
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK
FESNFNTQATNRNTDGSTDYGILQINSRWWCND
GRTPGSRNLCNIPCSALLSSDITASVNCAKKIV
SDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
8lyz
Comparative Modeling
Basis
Structure is much more conserved than
sequence during evolution
The higher the similarity, the higher the
confidence in the modeled structure
Limited applicability
A large number of proteins and ORFs have no
similarity to proteins with known structure
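Since confidence in a comparative model tracks sequence similarity, a rough way to quantify that similarity is sketched below. It is a toy global alignment (Needleman-Wunsch) followed by a percent-identity count. The function names and the match, mismatch and gap scores are illustrative assumptions; real modeling pipelines use substitution matrices and more careful gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a, aln_b):
    """Identical aligned residues as a percentage of the alignment length."""
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * same / len(aln_a)


if __name__ == "__main__":
    aligned = global_align("ACGT", "AGT")
    print(aligned, percent_identity(*aligned))
```

Applied to the 1alc query and the 8lyz template sequences shown earlier, percent_identity(*global_align(query, template)) would give a first, crude estimate of how much to trust the template.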
Predicting Protein Structure:
Alternative Methods
Threading or Fold Recognition
Ab initio
In addition, establishing sequence – structure
relationships is also important
Both these methods depend heavily on
the analysis of known protein structures
Input from people trained in statistics,
pattern recognition and related areas of
computer science is critical
Statistical Analysis of Protein Structures:
Microenvironment Characterization
Describe structures at multiple levels of detail using
a comprehensive set of properties
Atom based properties
Type, Hydrophobicity, Charge
Residue based properties
Type, Hydrophobicity
Chemical group
Hydroxyl, Amide, Carbonyl, etc.
Secondary structure
α-Helix, β-Strand, Turn, Loop
Other properties
VDW volume, B-factor,
Mobility, Solvent accessibility
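One way to picture a microenvironment description is as a summary of the atoms near a point in the structure. The sketch below is an assumption-laden illustration, not the method used in the talk: the Atom class, the microenvironment function and the 7.5 Å default radius are hypothetical, and hydrophobicity is taken from the Kyte-Doolittle scale as one possible choice.

```python
from dataclasses import dataclass
from math import dist

# Kyte-Doolittle hydropathy values (one possible hydrophobicity scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


@dataclass
class Atom:
    element: str   # atom type, e.g. "C", "N", "O"
    residue: str   # one-letter code of the parent residue
    charge: float  # formal or partial charge (assumed to be given)
    xyz: tuple     # coordinates in angstroms


def microenvironment(atoms, center, radius=7.5):
    """Summarize atom-based and residue-based properties near `center`."""
    near = [a for a in atoms if dist(a.xyz, center) <= radius]
    elements = {}
    for a in near:
        elements[a.element] = elements.get(a.element, 0) + 1
    mean_h = sum(KD[a.residue] for a in near) / len(near) if near else 0.0
    return {
        "n_atoms": len(near),
        "elements": elements,
        "net_charge": sum(a.charge for a in near),
        "mean_hydrophobicity": mean_h,
    }


if __name__ == "__main__":
    atoms = [Atom("C", "L", 0.0, (0.0, 0.0, 0.0)),
             Atom("N", "K", 1.0, (3.0, 0.0, 0.0)),
             Atom("O", "D", -1.0, (20.0, 0.0, 0.0))]
    print(microenvironment(atoms, (0.0, 0.0, 0.0), radius=5.0))
```

Further properties from the list above (chemical group, secondary structure, B-factor, solvent accessibility) could be added as extra fields in the same way.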
Predicting Protein Structure:
2. Threading or Fold Recognition
Basis
It is estimated that there are only around 1,000 to
10,000 stable folds in nature
Irrespective of its amino acid sequence, a
protein has to adopt one of these folds
Fold recognition is essentially finding the best
fit of a sequence to a set of candidate folds
Select the best sequence-fold alignment using a
fitness scoring function
NP-complete problem
Fold of a Protein
Refers to the spatial arrangement of its secondary
structural elements (α-helices and β-strands)
1l45.pdb
α/β-barrel
4bcl.pdb
β-barrel
1mbl.pdb
α/β-sandwich
Threading: Basic Strategy
Query: dhgakdflsdfjaslfkjsdlfjsdfjasd
Library of folds
Scoring & selection
Spatial interactions
Template
Sequence
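The query, library, scoring and selection steps can be illustrated with a deliberately simplified toy. The assumptions here are mine, not the talk's: each fold is reduced to a string of buried (B) and exposed (E) positions, the fitness function rewards hydrophobic residues in buried positions using the Kyte-Doolittle scale, and the alignment is gapless. Real threading allows gaps and pairwise interaction terms, which is what makes the general problem so hard.

```python
# Kyte-Doolittle hydropathy values (one possible hydrophobicity scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def thread_score(seq, env):
    """Fitness of `seq` on an equal-length environment string of B/E."""
    # Hydrophobic residues score well when buried, polar ones when exposed.
    return sum(KD[r] if e == "B" else -KD[r] for r, e in zip(seq, env))


def rank_folds(seq, library):
    """Return (best_score, fold_name) pairs, best fit first.

    `library` maps a hypothetical fold name to its B/E environment string.
    The query is slid along each template without gaps.
    """
    results = []
    for name, env in library.items():
        if len(env) < len(seq):
            continue  # template too short to hold the query
        best = max(thread_score(seq, env[o:o + len(seq)])
                   for o in range(len(env) - len(seq) + 1))
        results.append((best, name))
    results.sort(reverse=True)
    return results


if __name__ == "__main__":
    library = {"fold_a": "BEBE", "fold_b": "EBEB"}
    print(rank_folds("LKLK", library))
```

The top-ranked entry plays the role of the selected template; a model would then be built on it much as in comparative modeling.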
Predicting Protein Structure:
3. Ab Initio Methods
Sequence
Secondary structure prediction
Tertiary structure
Energy minimization (mean field potentials)
Low energy structures
Validation
Predicted structure
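The idea of searching for low energy structures can be shown with a classic toy that is much simpler than the mean field potentials named above: the 2D HP lattice model. Residues are either hydrophobic (H) or polar (P), the chain is placed on a square grid, and each non-bonded H-H contact lowers the energy by one. The sketch below enumerates every self-avoiding conformation, so it is only practical for short chains; the function names are hypothetical.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def energy(seq, coords):
    """-1 for every pair of H residues adjacent on the lattice but not in sequence."""
    index_at = {c: i for i, c in enumerate(coords)}
    e = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        # Look only right and up so that each lattice edge is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                e -= 1
    return e


def fold(seq):
    """Exhaustively search lattice conformations; return (min_energy, coords)."""
    if not seq:
        return 0, []
    best_energy, best_coords = None, None

    def extend(coords, occupied):
        nonlocal best_energy, best_coords
        if len(coords) == len(seq):
            e = energy(seq, coords)
            if best_energy is None or e < best_energy:
                best_energy, best_coords = e, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # keep the walk self-avoiding
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best_energy, best_coords


if __name__ == "__main__":
    print(fold("HHPPHH"))
```

The number of conformations grows exponentially with chain length, which is why practical ab initio methods rely on sampling and minimization rather than enumeration.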
Small molecules and/or metal ions
are an integral part of certain proteins
1a6g.pdb
Predicting the structure of such proteins
is an entirely different challenge
Proof of the Pudding: CASP Meetings
Community Wide Experiment on the Critical Assessment
of Techniques for Protein Structure Prediction – 4
Predictions; not Post-dictions
Easy and medium targets: ~100% success
Hard targets: ~50% success
Significant increase from CASP3
OK, I can predict the structure correctly! Is that it?
Well, no!!
Detailed biochemical characterization is required
Strict structure – function correlation exists only for a
subset of proteins
Some folds (ferredoxin, TIM barrel, …) are very
popular – several protein families, with diverse
functions, adopt these folds
Despite high similarity in sequence and structure, proteins may
act on different substrates (hence have different functions)
due to subtle changes in the active site (β1,3-GalT and
β1,3-GlcNAcT)
Inferring Function from Structure: Caveats
Similar structure, mutually exclusive function: Lysozyme
& α-lactalbumin (8lyz.pdb, 1alc.pdb)
Same function, completely different structures: Carbonic
anhydrases from M. thermophila and mouse (1thj.pdb, 1dmx.pdb)
“Moonlighting” proteins – one structure(?), multiple functions
Glyceraldehyde 3-phosphate dehydrogenase
– Glycolysis
– Binding protein for plasmin, fibronectin and lysozyme
– Transcriptional control of gene expression, DNA replication and repair
– Flocculation
Gal1p
– Kinase as well as regulator of Gal-gene expression
Gal3p
– 70% similar; does not have kinase activity
Same fold, different oligomerization
Dimerization
Tetramerization
ConA
ConA
PNA
PNA,
GSIV
Ligand Induced Conformational
Changes are Quite Common
Binding of first substrate redefines the active site and creates
the binding pocket for the second substrate and the metal ion
[Figure: flexible loop, before and after]
Take Home Message
Predicting Protein Structure is a key
component of genome sequence analysis
Structure is a very important link in
deciphering the function
Are new tools required? Or is a larger training
dataset required?
Acknowledgement
Organizers for giving me this opportunity
Sujatha and Jayadeva Bhat for helping me put
together this talk
A Few Useful Links
http://guitar.rockefeller.edu/modeller/modeller.html
http://www.biochem.ucl.ac.uk/bsm/cath-new/index.html
http://predictioncenter.llnl.gov/
http://insulin.brunel.ac.uk