Transcript Lecture 6
Protein Structure
Obtaining 3-D structure
Obtaining 3-D structure (NMR)
Obtaining 3-D structure (Bioinformatics)
3-D structure (dynamics / computation)
Subdomain Rearrangement in HIV-1 Reverse Transcriptase
Protein Databases
UniProt is the universal protein database, a central repository of protein
data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the
world's most comprehensive resource on protein information.
The Protein Information Resource (PIR), located at Georgetown
University Medical Center (GUMC), is an integrated public bioinformatics
resource to support genomic and proteomic research, and scientific
studies.
Swiss-Prot is a curated biological database of protein sequences from
different species created in 1986 by Amos Bairoch during his PhD and
developed by the Swiss Institute of Bioinformatics and the European
Bioinformatics Institute.
Pfam is a large collection of multiple sequence alignments and hidden
Markov models covering many common protein domains and families.
PDB
NCBI
http://proteome.nih.gov/links.html
PubMed – Protein Databases
The Protein database contains sequence data from the translated
coding regions from DNA sequences in GenBank, EMBL, and DDBJ
as well as protein sequences submitted to Protein Information
Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF),
and Protein Data Bank (PDB) (sequences from solved structures).
The Structure database or Molecular Modeling Database (MMDB)
contains experimental data from crystallographic and NMR structure
determinations. The data for MMDB are obtained from the Protein
Data Bank (PDB). The NCBI has cross-linked structural data to
bibliographic information, to the sequence databases, and to the
NCBI taxonomy. Use Cn3D, the NCBI 3D structure viewer, for easy
interactive visualization of molecular structures from Entrez.
Tutorial: http://www.pdb.org/pdbstatic/tutorials/tutorial.html
Example – PDB
http://www.pdb.org
Only proteins with known structures are included.
Protein Visualization Software
• Cn3d
• RasMol
• TOPS
• Chime
• DSSP
• Molscript
• Ribbons
• MSMS
• Surfnet
• …
PubMed Structure Database
Protein Structure Classification - SCOP
• Structural Classification of Proteins database
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Hierarchical Clustering
• Family – clear evolutionary relationship
• Superfamily – probable common evolutionary origin
• Fold – major structural similarity
• Boundaries between levels are more or less
subjective
• Conservative evolutionary classification leads to
many new divisions at the family and superfamily
levels, therefore it is recommended to first focus
on higher levels in the classification tree.
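The hierarchy above can be sketched in a few lines. This is a minimal illustration, assuming SCOP's concise classification strings (sccs) of the form class.fold.superfamily.family; the function name and the example identifiers are hypothetical and chosen only for illustration.

```python
# Levels of a SCOP concise classification string (sccs), left to right.
LEVELS = ["class", "fold", "superfamily", "family"]


def deepest_shared_level(sccs_a, sccs_b):
    """Return the deepest SCOP level at which two domains agree, or None.

    Hypothetical helper: compares two strings such as "c.37.1.8"
    field by field and stops at the first mismatch.
    """
    shared = None
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break
        shared = level
    return shared


if __name__ == "__main__":
    # Illustrative identifiers only.
    print(deepest_shared_level("c.37.1.8", "c.37.1.20"))
    print(deepest_shared_level("c.37.1.8", "c.2.1.1"))
```

Two domains that share only the fold level have major structural similarity; sharing the family level suggests a clear evolutionary relationship.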
Protein Structure Classification - SCOP
• a/a
• a+b
• b/b
• Misc
• a/b
Protein Structure Classification - SCOP
SCOP Classification Statistics
SCOP: Structural Classification of Proteins. 1.69 release
25973 PDB Entries (1 Oct 2004). 70859 Domains. 1 Literature Reference
(excluding nucleic acids and theoretical models)

Class                                Number of folds   Number of superfamilies   Number of families
All alpha proteins                               218                       376                  608
All beta proteins                                144                       290                  560
Alpha and beta proteins (a/b)                    136                       222                  629
Alpha and beta proteins (a+b)                    279                       409                  717
Multi-domain proteins                             46                        46                   61
Membrane and cell surface proteins                47                        88                   99
Small proteins                                    75                       108                  171
Total                                            945                      1539                 2845
Protein Structure Classification - CATH
• CATH Protein Structure Classification
• http://www.cathdb.info/latest/index.html
• CATH is a hierarchical classification of protein domain structures, which
clusters proteins at four major levels: Class (C), Architecture (A), Topology (T)
and Homologous superfamily (H).
• Class, derived from secondary structure content, is assigned for
more than 90% of protein structures automatically.
• Architecture, which describes the gross orientation of secondary
structures, independent of connectivities, is currently assigned
manually.
• The topology level clusters structures into fold groups according
to their topological connections and numbers of secondary
structures.
• The homologous superfamilies cluster proteins with highly
similar structures and functions. The assignments of structures
to fold groups and homologous superfamilies are made by
sequence and structure comparisons.
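The four levels map directly onto the dotted CATH identifier. The sketch below assumes identifiers of the form C.A.T.H (for example "3.40.50.300") and the usual names for classes 1 to 4; the helper names are hypothetical.

```python
from typing import NamedTuple

# Assumed names for the main CATH classes.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha beta",
    4: "few secondary structures",
}


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath(code):
    """Split a dotted CATH identifier into its four hierarchy levels."""
    parts = [int(p) for p in code.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers, got %r" % code)
    return CathCode(*parts)


def hierarchy_nodes(code):
    """Return the node identifier at each level, from class down to superfamily."""
    fields = code.split(".")
    return [".".join(fields[: i + 1]) for i in range(len(fields))]


if __name__ == "__main__":
    parsed = parse_cath("3.40.50.300")
    print(parsed, CLASS_NAMES.get(parsed.class_, "other"))
    print(hierarchy_nodes("3.40.50.300"))
```

Two domains belong to the same fold group when their first three fields agree, and to the same homologous superfamily when all four agree.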
Protein Structure Classification - CATH
CATH vs. SCOP
Secondary Structure Prediction
AGADIR - An algorithm to predict the helical content of peptides
APSSP - Advanced Protein Secondary Structure Prediction Server
GOR - Garnier et al, 1996
HNN - Hierarchical Neural Network method (Guermeur, 1997)
Jpred - A consensus method for protein secondary structure prediction
at University of Dundee
JUFO - Protein secondary structure prediction from sequence (neural
network)
nnPredict - University of California at San Francisco (UCSF)
Porter - University College Dublin
PredictProtein - PHDsec, PHDacc, PHDhtm, PHDtopology, PHDthreader,
MaxHom, EvalSec from Columbia University
Prof - Cascaded Multiple Classifiers for Secondary Structure Prediction
PSA - BioMolecular Engineering Research Center (BMERC) / Boston
PSIpred - Various protein structure prediction methods at Brunel
University
SOPMA - Geourjon and Deléage, 1995
SSpro - Secondary structure prediction using bidirectional recurrent
neural networks at University of California
DLP - Domain linker prediction at RIKEN
http://us.expasy.org/tools/#secondary
Secondary Structure Prediction - HNN
• http://npsa-pbil.ibcp.fr/cgi-bin/secpred_hnn.pl
• >gi|78099986|sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2
(Cytochrome d ubiquinol oxidase subunit II) (Cytochrome bd-I oxidase
subunit II)
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPIILLY
TAWCYWKMFGRITKEDIERNTHSLY
Secondary Structure Prediction - HNN
Sequence length : 379
HNN :
Alpha helix (Hh) : 209 is 55.15%
310 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 14.51%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 115 is 30.34%
Ambigous states (?) : 0 is 0.00%
Other states : 0 is 0.00%
        10        20        30        40        50        60        70
         |         |         |         |         |         |         |
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGNQVWLITAGGA
ccchhhhhhhhhhhhhhheeeeehccchhcchhhhhheecccccceeeeeeccccccccceeeeeeccch
LFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWDWGIFIGSFVPPLVIGVAFGN
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccchhhhhhhhhhcceeehccchccheehhhhhc
LLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMIITQGATYLQMRTVGELHLRTRATAQVAALV
hhcccccchhhhheeeeccchhhhhcchceccceeeeeeeeeccchhhhhhhchhhhhhchhhhhhhhhh
TLVCFALAGVWVMYGIDGYVVKSTMDHYAASNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTI
hhhhhhccceeeeeeccceeeeeccccccccccchhhhhhhhhhhheeccccceeeeccchhhhhhhhhh
LTARMDKAAWAFVFSSLTLACIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPI
hhhhhhhhhhhhhhhhhhhhhhhhhcchhhcccccccchhhccccchhcccchhhhhhhhhhhhhhhhhh
ILLYTAWCYWKMFGRITKEDIERNTHSLY
hhhhhhhhhhhhhhhcchhhhhhhccccc
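The summary percentages in the HNN report are simple state counts over the prediction string. A minimal sketch of that arithmetic, assuming the one-letter state codes shown above (h, e, c); the function name is hypothetical and this is not the HNN server's own code.

```python
from collections import Counter


def state_composition(prediction):
    """Return {state: (count, percent)} for a secondary structure string."""
    counts = Counter(prediction)
    total = len(prediction)
    return {
        state: (n, round(100.0 * n / total, 2))
        for state, n in counts.items()
    }


if __name__ == "__main__":
    # Short made-up prediction string, for illustration only.
    for state, (n, pct) in sorted(state_composition("ccchhhhhheeec").items()):
        print(state, n, pct)
```

With the counts reported above (209, 55 and 115 out of 379 residues), the same arithmetic gives 55.15%, 14.51% and 30.34%.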
Motifs Readily Identified from Sequence
• Zinc finger – characteristic order and spacing of cysteine and
histidine residues.
• Leucine zippers – two parallel alpha helices held together by
interactions between hydrophobic leucine residues at every
seventh position in each helix.
• Coiled coils – 2-3 helices coiled around each other in a left-handed
supercoil (3.5 residues per turn instead of 3.6, i.e. 7 residues per two
turns); the first and fourth residues of each heptad are usually
hydrophobic, the others mostly hydrophilic; typically 5-10 heptads.
• Transmembrane-spanning proteins – alpha helices comprising
amino acids with hydrophobic side chains, typically 20-30
residues.
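Two of these motifs can be scanned for with very little code. The sketch below is a simplified illustration, not a validated predictor: the leucine pattern (four leucines spaced seven residues apart) and the sliding-window settings (19 residues, mean Kyte-Doolittle hydropathy above 1.6) are assumptions chosen for the example, and the function names are hypothetical.

```python
import re

# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Lookahead so that overlapping candidate sites are all reported.
LEUCINE_HEPTADS = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_heptads(seq):
    """Start positions (0-based) of L-x6-L-x6-L-x6-L patterns."""
    return [m.start() for m in LEUCINE_HEPTADS.finditer(seq)]


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Merge all windows whose mean hydropathy exceeds the threshold.

    Returns half-open (start, end) intervals; unknown residues score 0.
    """
    covered = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        mean = sum(KD.get(aa, 0.0) for aa in seq[i:i + window]) / window
        if mean > threshold:
            for j in range(i, i + window):
                covered[j] = True
    segments, start = [], None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    return segments


if __name__ == "__main__":
    toy = "D" * 20 + "L" * 25 + "K" * 20  # made-up sequence
    print(hydrophobic_segments(toy))
    print(find_leucine_heptads("LAAAAAA" * 3 + "L"))
```

Because windows are merged, the reported segment is wider than the hydrophobic core itself; dedicated tools such as those listed under Topology Prediction refine the boundaries.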
Topology Prediction
PSORT - Prediction of protein subcellular localization
TargetP - Prediction of subcellular location
DAS - Prediction of transmembrane regions in prokaryotes using the Dense
Alignment Surface method (Stockholm University)
HMMTOP - Prediction of transmembrane helices and topology of proteins
(Hungarian Academy of Sciences)
PredictProtein - Prediction of transmembrane helix location and topology
(Columbia University)
SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
TMAP - Transmembrane detection based on multiple sequence alignment
(Karolinska Institutet; Sweden)
TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
TopPred - Topology prediction of membrane proteins (France)
http://us.expasy.org/tools
Tertiary Structure Prediction
Comparative modeling
SWISS-MODEL - An automated knowledge-based protein modelling server
3Djigsaw - Three-dimensional models for proteins based on homologues of
known structure
CPHmodels - Automated neural-network based protein modelling server
ESyPred3D - Automated homology modeling program using neural networks
Geno3d - Automatic modeling of protein three-dimensional structure
SDSC1 - Protein Structure Homology Modeling Server
Threading
3D-PSSM - Protein fold recognition using 1D and 3D sequence profiles
coupled with secondary structure information (Foldfit)
Fugue - Sequence-structure homology recognition
HHpred - Protein homology detection and structure prediction by HMM-HMM
comparison
Libellula - Neural network approach to evaluate fold recognition results
LOOPP - Sequence to sequence, sequence to structure, and structure to
structure alignment
SAM-T02 - HMM-based Protein Structure Prediction
Threader - Protein fold recognition
ProSup - Protein structure superimposition
SWEET - Constructing 3D models of saccharides from their sequences
Ab initio
HMMSTR/Rosetta - Prediction of protein structure from sequence
http://us.expasy.org/tools
Tertiary Structure Prediction
Comparative modeling
3Djigsaw - Three-dimensional models for proteins based on homologues of
known structure
Contreras-Moreira,B., Bates,P.A. (2002)
Domain Fishing: a first step in protein
comparative modelling. Bioinformatics
18: 1141-1142.
Tertiary Structure Prediction
Threading
The term threading was coined by Jones, Taylor and Thornton
in 1992, and originally referred specifically to the use of a full 3-D
atomic representation of the protein template structure in fold
recognition. Today, the terms threading and fold recognition are
frequently (though somewhat incorrectly) used interchangeably.
The basic idea is that the target sequence (the protein sequence for
which the structure is being predicted) is threaded through the
backbone structures of a collection of template proteins (known as
the fold library) and a “goodness of fit” score calculated for each
sequence-structure alignment. This goodness of fit is often derived
in terms of an empirical energy function, based on statistics derived
from known protein structures, but many other scoring functions
have been proposed and tried over the years.
Threading methods share some of the characteristics of both
comparative modelling methods (the sequence alignment aspect)
and ab initio prediction methods (predicting structure based on
identifying low-energy conformations of the target protein).
http://en.wikipedia.org/wiki/Threading_%28protein_sequence%29
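The basic idea can be sketched with a toy fold library. Everything here is an assumption made for illustration: templates are reduced to strings of buried (b) and exposed (e) positions, the energy function simply rewards hydrophobic residues in buried positions and polar residues in exposed ones, and alignments are ungapped. Real threading programs use full 3-D templates, gapped alignments, and statistically derived potentials.

```python
# Residues treated as hydrophobic in this toy scoring scheme (an assumption).
HYDROPHOBIC = set("AVLIMFWC")


def residue_energy(aa, env):
    """-1 if the residue suits its environment, +1 otherwise (lower is better)."""
    hydrophobic = aa in HYDROPHOBIC
    buried = env == "b"
    return -1.0 if hydrophobic == buried else 1.0


def thread(sequence, template_env):
    """Slide the sequence along one template; return (best_energy, offset).

    Returns None when the template is shorter than the sequence.
    """
    best = None
    for offset in range(len(template_env) - len(sequence) + 1):
        energy = sum(
            residue_energy(aa, template_env[offset + i])
            for i, aa in enumerate(sequence)
        )
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best


def rank_templates(sequence, fold_library):
    """Score the sequence against every template, best (lowest energy) first."""
    results = []
    for name, env in fold_library.items():
        hit = thread(sequence, env)
        if hit is not None:
            results.append((hit[0], name, hit[1]))
    return sorted(results)


if __name__ == "__main__":
    library = {"foldA": "eebbbbee", "foldB": "bbeeeebb"}  # made-up templates
    print(rank_templates("DLLLLK", library))
```

The target with a hydrophobic core flanked by charged residues fits the template whose buried positions sit in the middle, which is the "goodness of fit" ranking described above in miniature.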
Tertiary Structure Prediction
Ab initio (de novo)
• From scratch – using physical properties instead of known
structures
• Mimics the folding process – minimizes an energy function,
often by stochastic search (e.g., simulated annealing)
• Computationally expensive – requires large clusters, large
machines (e.g., IBM BlueGene) or distributed computing;
currently only works for small peptides
• Big potential in the future – understanding folding dynamics,
improving accuracy, and applications in drug development
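Energy minimization by simulated annealing can be illustrated on a toy problem. The energy function below is purely hypothetical (each backbone-like angle simply prefers -60 degrees); it stands in for a physical force field only to show the accept/reject logic and the cooling schedule.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: zero when every angle equals -60 degrees."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def simulated_annealing(n_angles=10, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Metropolis search with an exponentially decreasing temperature.

    Returns (start_energy, best_energy, best_angles).
    """
    rng = random.Random(seed)
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    energy = toy_energy(angles)
    start_energy = energy
    best_energy, best_angles = energy, list(angles)
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(n_angles)
        old = angles[i]
        angles[i] = old + rng.gauss(0.0, 20.0)  # the "move"
        new_energy = toy_energy(angles)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability so the search can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_energy, best_angles = energy, list(angles)
        else:
            angles[i] = old  # reject: restore the previous conformation
    return start_energy, best_energy, best_angles


if __name__ == "__main__":
    start, best, _ = simulated_annealing()
    print("start energy:", start, "best energy:", best)
```

Even in this toy the cost grows with the number of angles and steps, which hints at why realistic ab initio runs need clusters or distributed computing.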
Tertiary Structure Prediction
Ab initio (de novo)
Prediction Scoring with Rosetta
Rosetta uses a scoring function to judge different
conformations. The process consists of making
'moves' (changing the bond angles of a particular
group of amino acids) and then scoring the new
conformation.
The Rosetta score is a weighted sum of component
scores, where each component score is judging a
different aspect of protein structure.
Environment score: Here, hydrophobic residues are
represented as orange stars, so the left
conformation is good (all the hydrophobics
together) while the rightmost conformation is
bad (with the hydrophobic amino acids not
touching).
Pair-score: Two conformations of a polypeptide are
shown, one (top) where the chain is folded back
on itself, bringing two cysteines together
(yellow+yellow = possible disulphide bond) and
forming a salt-bridge (blue+red = opposites
attract). The conformation at bottom does not
make these pairings and the pair-score would,
thus, favor the top conformation.
http://www.grid.org/projects/hpf/howitworks_scoring.htm
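The weighted-sum idea can be written down directly. This is a sketch only: the component names follow the two examples above, but the weights, the component values, and the convention that lower totals are better are assumptions for illustration, not Rosetta's actual parameters or API.

```python
def total_score(components, weights):
    """Weighted sum of component scores for one conformation."""
    return sum(weights[name] * value for name, value in components.items())


if __name__ == "__main__":
    # Hypothetical weights and component scores (lower = more favorable).
    weights = {"environment": 1.0, "pair": 0.5}
    compact = {"environment": -4.0, "pair": -2.0}   # hydrophobics buried, pairs formed
    extended = {"environment": 3.0, "pair": 0.0}    # hydrophobics exposed, no pairs
    print(total_score(compact, weights), total_score(extended, weights))
```

After each move the new conformation is rescored this way, and the comparison between old and new totals decides whether the move is kept.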
Evaluation - CASP
CASP (Critical Assessment of Techniques for Protein Structure Prediction) is a
community-wide experiment (though it is commonly referred to as a
competition) in protein structure prediction that has taken place every two
years since 1994. (http://predictioncenter.org/)
The main goal of CASP is to obtain an in-depth and objective assessment of
our current abilities and inabilities in the area of protein structure
prediction. To this end, participants will predict as much as possible about
a set of soon to be known structures. These will be true predictions, not
‘post-dictions’ made on already known structures. CASP7 will particularly
address the following questions:
1. Are the models produced similar to the corresponding experimental
structure?
2. Is the mapping of the target sequence onto the proposed structure (i.e. the
alignment) correct?
3. Have similar structures that a model can be based on been identified?
4. Are comparative models more accurate than can be obtained by simply
copying the best template?
5. Has there been progress from the earlier CASPs?
6. What methods are most effective?
7. Where can future effort be most productively focused?
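Question 1 is usually answered numerically by comparing model and experimental coordinates. The sketch below assumes the two structures are already superimposed and given as equal-length lists of C-alpha (x, y, z) coordinates. It computes the RMSD and a GDT_TS-style score (average fraction of residues within 1, 2, 4 and 8 Å); the real GDT_TS also searches over superpositions, which is omitted here, and the function names are hypothetical.

```python
import math


def ca_distances(model, reference):
    """Per-residue distances between two pre-superimposed coordinate lists."""
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same length")
    return [math.dist(p, q) for p, q in zip(model, reference)]


def rmsd(model, reference):
    """Root-mean-square deviation over all residues."""
    d = ca_distances(model, reference)
    return math.sqrt(sum(x * x for x in d) / len(d))


def gdt_ts_like(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Mean percentage of residues within each distance cutoff (0-100)."""
    d = ca_distances(model, reference)
    fractions = [sum(x <= c for x in d) / len(d) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Made-up coordinates: four residues on a line, last one displaced by 3 Å.
    reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 3.0, 0.0)]
    print(rmsd(model, reference), gdt_ts_like(model, reference))
```

RMSD is dominated by the worst-placed residues, whereas the cutoff-based score rewards getting most of the structure right, which is why both kinds of measure are useful.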
Evaluation - CASP
Evaluation of the results is carried out in the following prediction categories:
• tertiary structure prediction (all CASPs)
• secondary structure prediction (dropped after CASP5)
• prediction of structure complexes (CASP2 only; a separate experiment, CAPRI, carries on this subject)
• residue-residue contact prediction (starting CASP4)
• disordered regions prediction (starting CASP5)
• domain boundary prediction (starting CASP6)
• function prediction (starting CASP6)
• model quality assessment (starting CASP7)
• model refinement (starting CASP7)
The tertiary structure prediction category was further subdivided into:
• homology modelling
• fold recognition (also called protein threading; note that this is
incorrect, as threading is a method)
• de novo structure prediction, now referred to as 'New Fold', because many
methods apply evaluation (scoring) functions that are biased by
knowledge of native protein structures; an artificial neural network is
one example.
Evaluation - CASP
Number of human expert groups registered    207
Number of targets released                  104
Number of prediction servers registered      98
Targets canceled                              4
Valid targets                               100
Refinement targets                            9

Prediction format                 Number of groups   Number of models   Total number
                                  contributing       designated as 1    of models
3D coordinates                    180                12393              48339
Alignments to PDB structures      15                 966                3896
Residue-residue contacts          17                 1473               1561
Structural domains assignments    27                 2258               2515
Disordered regions                19                 1801               1801
Function prediction               22                 1317               1930
Quality assessment                29                 2326               3228
Model refinement                  26                 136                447
All                               255 (unique)       22670              63717
Proteomics
The term proteome was coined by Mark Wilkins in 1995 and is used to
describe the entire complement of proteins in a given biological
organism or system at a given time, i.e. the protein products of the
genome. The term has been applied to several different types of
biological systems. A cellular proteome is the collection of proteins
found in a particular cell type under a particular set of environmental
conditions such as exposure to hormone stimulation.
Proteomics vs. Genomics
The proteome is larger than the genome, especially in eukaryotes, in the
sense that there are more proteins than genes. This is due to alternative
splicing of genes and post-translational modifications
like glycosylation or phosphorylation.
The proteome has at least two levels of complexity lacking in the genome.
Whereas the genome is defined by the sequence of nucleotides, the proteome
cannot be limited to the sum of the sequences of the proteins present.
Knowledge of the proteome requires knowledge of (1) the structure of the
proteins in the proteome and (2) the functional interaction between the
proteins.
Proteomics Techniques – 2D Gel
Proteomics, the study of the proteome, has largely been practiced through
the separation of proteins by two dimensional gel electrophoresis. In the
first dimension, the proteins are separated by isoelectric focusing, which
resolves proteins on the basis of charge. In the second dimension, proteins
are separated by molecular weight using SDS-PAGE. The gel is dyed with
Coomassie Blue or silver to visualize the proteins. Spots on the gel are
proteins that have migrated to specific locations.
Matching spots between gels is a big issue.
Proteomics Techniques – Mass Spec
Peptide mass fingerprinting identifies a protein by cleaving it into short
peptides and then deduces the protein's identity by matching the observed
peptide masses against a sequence database. Tandem mass
spectrometry, on the other hand, can get sequence information from
individual peptides by isolating them, colliding them with a nonreactive gas,
and then cataloging the fragment ions produced.
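Peptide mass fingerprinting can be sketched as an in-silico digest followed by mass matching. The example assumes trypsin as the protease (cleavage after K or R, but not before P), monoisotopic residue masses, neutral peptide masses, and a fixed mass tolerance; the function names, the tolerance, and the toy database are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def fingerprint_score(observed, protein, tolerance=0.2):
    """Number of observed masses explained by the protein's tryptic peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(m - t) <= tolerance for t in theoretical) for m in observed
    )


def identify(observed, database, tolerance=0.2):
    """Name of the database entry with the most matching peptide masses."""
    return max(
        database,
        key=lambda name: fingerprint_score(observed, database[name], tolerance),
    )


if __name__ == "__main__":
    database = {"protA": "AAKGGRPLKCC", "protB": "MMMMKWWWW"}  # made-up sequences
    observed = [peptide_mass("AAK"), peptide_mass("CC")]
    print(tryptic_digest(database["protA"]), identify(observed, database))
```

Real search engines score matches statistically and allow for missed cleavages and modifications, but the core step is this comparison of observed and theoretical peptide masses.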
Proteomics Techniques – Microarray
A microarray measures mRNA levels. No change in mRNA does not necessarily
mean no change in protein expression and function, because of the effects of
post-translational modulation.