CS 177
Proteins part 1: Structure-function relationships
Review of protein structures
Need for analyses of protein structures
Sources of protein structure information
Computational Modeling
Need for analyses of protein structures
A protein performs metabolic, structural, or regulatory functions in a cell.
Cellular biochemistry works based on interactions between 3-D molecular
structures
The 3-D structure of a protein determines its function
Therefore, the relationship of sequence to function is primarily concerned with
understanding the 3-D folding of proteins and inferring protein functions from these
3-D structures
(e.g. binding sites, catalytic activities, interactions with other molecules)
The study of protein structure is not only of fundamental scientific interest in
terms of understanding biochemical processes, but also produces very
valuable practical benefits
Medicine
The understanding of enzyme function allows the design of new and improved drugs
Agriculture
Therapeutic proteins and drugs for veterinary purposes and for treatment of plant diseases
Industry
Protein engineering has potential for the synthesis of enzymes to carry out various industrial
processes on a mass scale
Need for analyses of protein structures
Protein 3-D structure has direct medical implications:
an incorrectly folded protein will not function properly
Examples:
- Adult-onset diabetes
Protein misfolding may be responsible for blood-vessel damage, blindness and
other debilitating effects of the disease
- Cystic Fibrosis
The most common mutation underlying cystic fibrosis hinders the dissociation of the
transport-regulator protein from one of its chaperones. Thus, the final steps in normal
folding cannot occur, and normal amounts of active protein are not produced.
Need for analyses of protein structures
Examples of diseases associated with protein misfolding (cont.):
- Alzheimer's disease
New studies indicate that Alzheimer's disease may be caused by small clumps of wrongly
folded proteins. Scientists have found that misfolded amyloid beta protein molecules
hinder memory processes in rat brains by blocking synapses
References
1. Walsh, D. M. et al. Naturally secreted oligomers of amyloid β protein
potently inhibit hippocampal long-term potentiation in vivo. Nature,
416, 535 - 539, (2002).
2. Bucciantini, M. et al. Inherent toxicity of aggregates implies a
common mechanism for protein misfolding diseases. Nature, 416, 507
- 511, (2002).
Figure: CT scan of the brain of an Alzheimer's patient showing widespread destruction (pink) of brain tissue (green)
Need for analyses of protein structures
Examples of diseases associated with protein misfolding (cont.):
- Transmissible Spongiform Encephalopathies (TSEs)
(such as mad cow disease or the human version, Creutzfeldt-Jakob disease)
The infectious agent is probably a small misfolded protein called a prion. Prions occur
naturally in the brain; their normal function is unknown. Infectious prions can cause
correctly folded proteins to misfold. Domino effect: large numbers of misfolded prions
cause neural degeneration
- Other non-infectious brain diseases such as Parkinson’s, Huntington’s, and
Lou Gehrig’s.
Sources of protein structure information
3-D macromolecular structures stored in databases
The most important database: the Protein Data Bank (PDB)
The PDB is maintained by the Research Collaboratory for Structural Bioinformatics
(RCSB) and can be accessed at three different sites (plus a number of mirror sites
outside the USA):
- http://rcsb.rutgers.edu/pdb (Rutgers University)
- http://www.rcsb.org/pdb/ (San Diego Supercomputer Center)
- http://tcsb.nist.gov/pdb/ (National Institute for Standards and Technology)
It is the very first “bioinformatics” database ever built
The Protein Data Bank (PDB)
PDB: 20,254 structures (4 March 2003)
SwissProt: 122,564 entries (5 March 2003)
Ratio: 1:6 (structure of more than 83% of proteins still unknown)
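The ratio quoted above follows directly from the two database counts. A short check, with illustrative variable names and the simplifying assumption that every PDB entry is a distinct protein (many PDB entries describe the same protein, so the real coverage is lower):

```python
# Counts quoted on the slide (March 2003).
pdb_structures = 20_254
swissprot_entries = 122_564

# Assumption: each PDB entry counts as one distinct protein structure.
known_fraction = pdb_structures / swissprot_entries
sequences_per_structure = swissprot_entries / pdb_structures
unknown_percent = 100 * (1 - known_fraction)

print(f"about 1 structure per {sequences_per_structure:.1f} sequences")
print(f"structure unknown for about {unknown_percent:.1f}% of entries")
```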
Sources of protein structure information
Experimental structure determination
In practice, most biomolecular structures (>99% of structures in PDB) are
determined using three techniques:
- X-ray crystallography (low to very high resolution)
Problem: requires crystals; it is difficult to crystallize proteins while maintaining their
native conformation; not all proteins can be crystallized
- Nuclear magnetic resonance (NMR) spectroscopy of proteins in solution
(medium to high resolution)
Problem: Works only with small and medium size proteins (~50% of proteins
cannot be studied with this method); requires high solubility
- Electron microscopy and crystallography (low to medium resolution)
Problem: (still) relatively low resolution
Experimental methods are still very time consuming and expensive;
in most cases the experimental data will contain errors and/or be
incomplete. Thus the initial model needs to be refined and rebuilt
Sources of protein structure information
Computational Modeling
Researchers have been working for decades to develop procedures for
predicting protein structure that are not so time consuming and not hindered
by size and solubility constraints.
As protein sequences are encoded in DNA, in principle, it should therefore be
possible to translate a gene sequence into an amino acid sequence, and to
predict the three-dimensional structure of the resulting chain from this amino
acid sequence
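The first half of this idea, translating a gene sequence into an amino acid sequence, can be sketched in a few lines. This is a minimal illustration that assumes the standard genetic code, a coding sequence already in the correct reading frame, and no introns; the function name `translate` is illustrative only.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))
```

The second half, predicting the 3-D structure of the resulting chain, is the hard part and is not addressed by this sketch.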
Some common terminology used in homology modeling
Motif (sequence context): conserved pattern of amino acids
that is found in two or more proteins
Motif (structural context): combination of several secondary structure elements
(also referred to as super-secondary structures and folds)
Fold: (also referred to as a folding motif) larger combination of secondary structure units in
the same configuration. Thus, proteins sharing the same fold have the same
combination of secondary structures that are connected by similar loops
Domain (sequence context): (also referred to as homologous domain) extended
sequence patterns, generally found by sequence alignment methods, that
indicate a common evolutionary origin. A domain is generally longer than a motif
(may include all of a given protein sequence)
Domain (structural context): segment of a protein that can fold into a 3-D structure;
domains are considered elementary units of molecular function
Family (sequence context): group of proteins of similar biochemical function that are
more than 50% identical when aligned
Family (structural context): structures that have a significant level of structural similarity
but not necessarily significant sequence similarity
Superfamily: group of protein families that are related by distant yet detectable
sequence similarities
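The 50% identity criterion in the sequence-context definition of a family can be made concrete. The sketch below assumes the two sequences are already aligned (gaps written as "-") and counts identity over gap-free columns only; other denominators are also in use, so this is one convention among several. The function name and the example sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


identity = percent_identity("MKT-AILV", "MKSGAIIV")
same_family_by_sequence = identity > 50.0  # threshold taken from the definition above
print(f"{identity:.1f}% identical, same family: {same_family_by_sequence}")
```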
Computational modeling
Gene finding
Identification of protein coding regions within DNA sequences (ORFs)
This is one of the biggest challenges facing the bioinformatics specialists
working on Genome Projects
Existing software is only about 90% accurate in predicting genes in large
stretches of genomic DNA
The problem is compounded in eukaryotic genomes by the common occurrence of
pseudogenes that are highly similar to real genes, but are not transcribed
Computational modeling
How to find genes?
Similarity search against the expressed sequence tag (EST) database (e.g. dbEST)
Translation and similarity search against the protein databanks (e.g.
SWISS-PROT and GenPept)
- automatic translate and search functions implemented in BLASTX and TFASTA
- if a protein (or EST sequence) matches, it can be aligned with the unknown genomic
sequence; start and stop codons should line up nicely and the introns should be obvious
- small error rate remains
If there are no handy template sequences in the databanks, one must rely on
knowledge of the DNA code
- the translation initiation site is generally an ATG codon; in eukaryotes, transcription
usually starts about 30 bp downstream from a TATAAA sequence (or some close approximation)
- a graphic map of all six reading frames can be produced to search for a long open reading frame
- problem: none of those programs is perfect; errors will occur
- several software packages are available that map ORFs (e.g. FRAMES, GeneWorks,
MacVector, DNA Strider, GRAIL, ORF finder, DNA translation, BCM GeneFinder)
- confirming evidence can be collected by looking for regulatory sequences (promoters,
enhancers, transcription factor binding sites; also known as signal sequences) that generally
occur near ORFs. Several databases for signal sequences are available (e.g. TransFac) and
several software tools make use of these databases (e.g. Signal Scan, FindPatterns)
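The six-reading-frame search described above can be sketched as follows. This is a minimal illustration, not how any of the listed packages work: it assumes an intron-free sequence, treats every ATG as a possible start, and reports the stretch up to the first in-frame stop codon. The function names and the `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int):
    """Return (start, end) index pairs of ATG...stop stretches in one frame."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:  # codons before the stop
                orfs.append((start, i + 3))
            start = None
    return orfs


def six_frame_orfs(dna: str, min_codons: int = 2):
    """Scan three frames on each strand; return (strand, frame, ORF) tuples."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, seq[start:end]))
    return results


for hit in six_frame_orfs("CCATGAAATTTGGGTAACC"):
    print(hit)
```

In real genomic DNA a much larger `min_codons` would be needed, since short ATG-to-stop stretches occur by chance in every frame.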
Computational modeling
How to predict the protein structure?
Ab initio prediction of protein structure from sequence: not yet.
Problem: the information contained in protein structures lies essentially in the
conformational torsion angles. Even if we only assume that every amino-acid residue
has three such torsion angles, and that each of these three can only assume one
of three "ideal" values (e.g., 60, 180 and -60 degrees), this still leaves us with 27
possible conformations per residue.
For a typical 200-amino acid protein, this
would give 27^200 (roughly 1.87 x 10^286)
possible conformations!
Q: Can’t we just generate all these
conformations, calculate their energy
and see which conformation has the
lowest energy?
If we were able to evaluate 10^9 conformations per second, this would still keep us
busy for 4 x 10^259 times the current age of the universe
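The numbers in this argument can be reproduced directly. The sketch below uses the slide's own assumptions (three torsion angles per residue, three values each, 200 residues, 10^9 evaluations per second); the age of the universe is not given on the slide, so about 13.8 billion years is assumed here.

```python
import math

conformations_per_residue = 3 ** 3   # three torsion angles, three values each
residues = 200
total = conformations_per_residue ** residues  # exact integer, 27^200

rate_per_second = 10 ** 9            # evaluations per second, from the slide
# Assumption: age of the universe of about 13.8 billion years.
age_of_universe_s = 13.8e9 * 365.25 * 24 * 3600

log10_total = math.log10(total)
log10_ages = (log10_total
              - math.log10(rate_per_second)
              - math.log10(age_of_universe_s))


def sci(log10_value: float) -> str:
    """Format 10**log10_value in scientific notation without overflow."""
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return f"{mantissa:.2f} x 10^{exponent}"


print("possible conformations:", sci(log10_total))
print("ages of the universe needed:", sci(log10_ages))
```

Working with logarithms avoids converting the 287-digit integer to a floating-point number, which could not represent it.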
There are optimized ab initio prediction algorithms available as well as fold recognition
algorithms that use threading (compares protein folds with known fold structures from
databases), but the results are still very poor
Computational modeling
Solution: homology modeling
Homology (comparative) modeling attempts to predict structure on the strength
of a protein’s sequence similarity to another protein of known structure
Basic idea: a significant alignment of the query sequence with a target sequence from
PDB is evidence that the query sequence has a similar 3-D structure (current threshold
~ 40% sequence identity). Then multiple sequence alignment and pattern analysis can
be used to predict the structure of the protein
Computational modeling
Flow chart for protein structure prediction (from Mount, 2001)
Computational modeling
Protein sequence
- partial or full sequences; predicted through gene finding
Computational modeling
Database similarity search
- sequence is used as a query in a database similarity search against proteins in PDB
Computational modeling
Does the sequence align with a protein of known structure?
- Yes: if the database similarity search reveals a significant alignment between the query
sequence and a PDB target sequence, the alignment can be used to position the
amino acids of the query sequence in the same approximate 3-D structure
- No: proceed to protein family analysis
Computational modeling
Protein family analysis/relationship to known structure
- Family (structural context): structures that have a significant level of structural similarity
but not necessarily significant sequence similarity
- the goal is to exploit these structure-sequence relationships; two questions: 1) is the new
protein a member of a family, 2) does the family have a predicted structural fold?
- analyze sequence for family-specific profiles and patterns (available databases: 3D-Ali,
3D-PSSM, BLOCKS, eMOTIF, INTERPRO, Pfam, …)
- if the family analysis reveals that the query protein is a member of a family with a
predicted structural fold, multiple alignment can be used for structural modeling
Computational modeling
Protein family analysis/relationship to known structure
- if the family analysis is unsuccessful, proceed to structural analyses
Computational modeling
Structural analysis
- several different types of analyses to infer structural information
- presence of small amino acid motifs in a protein can be an indicator of a biochemical
function associated with a particular structure. Motifs are available from the Prosite catalog
- spacing and arrangement of amino acids (e.g. hydrophobic amino acids) provide
important structural clues that can be used for modeling
- certain amino acid combinations tend to occur in certain types of secondary structure
- These structural analyses can provide clues as to the presence of active sites and regions
of secondary structure. This information can help to identify a new protein as a member
of a known structural class
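Scanning a sequence for a small Prosite-style motif can be sketched with regular expressions. The converter below handles only a subset of the Prosite pattern syntax (single residues, [..] alternatives, {..} exclusions, x wildcards, and repeat counts); anchors and other features are not covered. The example pattern N-{P}-[ST]-{P} is the Prosite N-glycosylation site pattern; the test sequence and function names are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a subset of Prosite pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        # [ST]-style alternatives and single letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (0-based start, matched text) for non-overlapping motif hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))
```

A hit from a short pattern like this is only a hint: such motifs also occur by chance, so matches are usually weighed together with the other structural clues listed above.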
Computational modeling
3-D structural analysis in lab
- proteins that fail to show any relationship to proteins of known structure are candidates for
structural analyses (X-ray crystallography, NMR). There are about 600 known fold families,
and new structures are frequently found to have an already known structural fold.
Accordingly, protein families with no relatives of known structure may represent a novel fold
Computational modeling: summary
- Protein sequence: partial or full sequences predicted through gene finding
- Database similarity search: similarity search against proteins in PDB
- If the sequence aligns with a protein of known structure: alignment can be used to position
the amino acids of the query sequence in the same approximate 3-D structure
- If not, protein family analysis: find structures that have a significant level of structural
similarity (but not necessarily significant sequence similarity); if member of a family with a
predicted structural fold, multiple alignment can be used for structural modeling
- If not, structural analysis: infer structural information (e.g. presence of small amino acid
motifs; spacing and arrangement of amino acids; certain typical amino acid combinations
associated with certain types of secondary structure); can provide clues as to the presence of
active sites and regions of secondary structure
- If no relationship to proteins of known structure is found: structural analyses in the lab
(X-ray crystallography, NMR)
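The decision logic of the flow chart summarized above can be written down as a small function. This is only a sketch of the branching order: the `Evidence` fields and the returned labels are hypothetical, and the 40% identity threshold is the approximate value quoted earlier for homology modeling.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical summary of what the earlier analysis steps found."""
    best_pdb_identity: float       # % identity of the best PDB alignment, 0 if none
    family_has_known_fold: bool    # family analysis found a predicted structural fold
    structural_clues_found: bool   # motifs or secondary-structure patterns detected


def next_step(evidence: Evidence, identity_threshold: float = 40.0) -> str:
    """Walk the flow chart from top to bottom and return the applicable step."""
    if evidence.best_pdb_identity >= identity_threshold:
        return "homology modeling from the PDB alignment"
    if evidence.family_has_known_fold:
        return "structural modeling from a multiple alignment of the family"
    if evidence.structural_clues_found:
        return "assignment to a known structural class from structural clues"
    return "experimental structure determination (X-ray crystallography, NMR)"


print(next_step(Evidence(55.0, False, False)))
print(next_step(Evidence(20.0, True, False)))
print(next_step(Evidence(0.0, False, False)))
```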
Computational modeling: summary
How to predict the protein structure?
Ab initio prediction of protein structure from sequence
Homology (comparative) modeling attempts to predict structure on the strength
of a protein’s sequence similarity to another protein of known structure
Experimental structure determination
Computational modeling: summary
Ab initio prediction
Homology modeling
Experimental structure determination
Computational modeling
Viewing protein structures
A number of molecular viewers are freely available and run on most computer platforms
and operating systems
Examples:
Cn3D 4.0 (stand-alone)
Rasmol (stand-alone)
Chime (Web browser plug-in based on Rasmol)
Swiss 3D viewer Spdbv (stand-alone)
All these viewers can use the PDB identification code or the structural file from PDB
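The structural file from PDB is a fixed-column text format in which each atom occupies one ATOM record. The sketch below reads the coordinate fields of such a record by column position, following the published PDB format layout. To stay self-contained it first builds an example record with a helper; the helper, the function names, and the coordinate values are illustrative, and occupancy and temperature factor are filled with placeholder values.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column ATOM record (placeholder occupancy and B-factor)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


def parse_atom(line: str) -> dict:
    """Extract selected fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
    }


record = format_atom(1, " CA ", "MET", "A", 1, 27.34, 24.43, 2.614)
atom = parse_atom(record)
print(record)
print(atom)
```

Molecular viewers read every ATOM (and HETATM) record in the file this way and draw the atoms at the parsed x, y, z positions.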