Transcript Document

Protein structure prediction:
The holy grail of bioinformatics
Proteins: Four levels of structural
organization:
Primary structure
Secondary structure
Tertiary structure
Quaternary structure
Primary structure = the linear amino acid sequence
Secondary structure = spatial arrangement of amino-acid
residues that are adjacent in the primary structure
a helix = A helical structure, whose chain coils tightly as
a right-handed screw with all the side chains sticking
outward in a helical array. The tight structure of the a
helix is stabilized by same-strand hydrogen bonds
between -NH groups and -CO groups spaced at four
amino-acid residue intervals.
The b-pleated sheet is made of loosely
coiled b strands are stabilized by
hydrogen bonds between -NH and -CO
groups from adjacent strands.
An antiparallel β sheet. Adjacent β strands run in opposite
directions. Hydrogen bonds between NH and CO groups
connect each amino acid to a single amino acid on an adjacent
strand, stabilizing the structure.
A parallel β sheet. Adjacent β strands run in the same
direction. Hydrogen bonds connect each amino acid on one
strand with two different amino acids on the adjacent strand.
Silk fibroin
a helix
b sheet (parallel and antiparallel)
tight turns
flexible loops
irregular elements (random coil)
Tertiary structure = three-dimensional structure of protein
The tertiary structure is formed by
the folding of secondary structures
by covalent and non-covalent forces,
such as hydrogen bonds,
hydrophobic interactions, salt
bridges between positively and
negatively charged residues, as well
as disulfide bonds between pairs of
cysteines.
Quaternary structure = spatial arrangement of subunits
and their contacts.
Holoproteins & Apoproteins
Holoprotein
Prosthetic group
Apoprotein
Holoprotein
Prosthetic group
Apohemoglobin = 2a + 2b
Prosthetic group
Heme
Hemoglobin = Apohemoglobin + 4Heme
Christian B. Anfinsen
1916-1995
Sela M, White FH, & Anfinsen CB. 1959. The reductive cleavage
of disulfide bonds and its application to problems of protein
structure. Biochim. Biophys. Acta. 31:417-426.
Not all proteins fold independently.
Chaperones.
Reducing agents:
Ammonium thioglycolate (alkaline) pH 9.0-10
Glycerylmonothioglycolate (acid) pH 6.5-8.2
Oxidant
What do we need to know in order to
state that the tertiary structure of a
protein has been solved?
Ideally: We need to determine the position of all
atoms and their connectivity.
Less Ideally: We need to determine the position
of all Cabackbone structure).
Protein structure: Limitations and caveats
• Not all proteins or parts of proteins assume a welldefined 3D structure in solution.
• Protein structure is not static, there are various
degrees of thermal motion for different parts of the
structure.
• There may be a number of slightly different
conformations in solution.
• Some proteins undergo conformational changes
when interacting with STUFF.
Experimental Protein Structure
Determination
• X-ray crystallography
–
–
–
–
most accurate
in vitro
needs crystals
~$100-200K per structure
• NMR
–
–
–
–
fairly accurate
in vivo
no need for crystals
limited to very small proteins
• Cryo-electron-microscopy
– imaging technology
– low resolution
Why predict protein
structure?
• Structural knowledge = some understanding of function
and mechanism of action
• Predicted structures can be used in structure-based drug
design
• It can help us understand the effects of mutations on
structure and function
• It is a very interesting scientific problem (still unsolved
in its most general form after more than 50 years of
effort)
Secondary structure
prediction
Secondary structure prediction
• Historically first structure prediction methods
predicted secondary structure
• Can be used to improve alignment accuracy
• Can be used to detect domain boundaries within
proteins with remote sequence homology
• Often the first step towards 3D structure prediction
• Informative for mutagenesis studies
Protein Secondary Structures
(Simplifications)
a-HELIX
b-STRAND
COIL (everything else)
Assumptions
• The entire information for forming secondary structure is
contained in the primary sequence
• side groups of residues will determine structure
• examining windows of 13-17 residues is sufficient to predict
secondary structure
 a-helices 5–40 residues long
 b-strands 5–10 residues long
Predicting Secondary Structure
From Primary Structure
• accuracy 64-75%
• higher accuracy for a-helices than for bsheets
• accuracy is dependent on protein family
• predictions of engineered (artificial) proteins
are less accurate
A surprising result!
Chameleon
sequences
The “Chameleon” sequence
sequence 1
sequence 2
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
Replace both sequences with
an engineered peptide (“chameleon”)
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
a -helix
b-strand
Source: Minor and Kim. 1996. Nature 380:730-734
Measures of prediction accuracy
• Qindex and Q3
• Correlation coefficient
Qindex
Qindex: (Qhelix, Qstrand, Qcoil, Q3)
- percentage of residues correctly predicted as a-helix, b-strand,
coil, or for all 3 conformations.
Npredicted
Q3 
 100
Nobserved
Drawbacks:
- even a random assignment of structure can achieve a high score
(Holley & Karpus 1991)
Correlation coefficient
pa na -ua oa

a
([na  ua ][na  oa ][ pa  ua ][ pa  oa ])
C
Ca = 1 (=100%)
True positive
False positive
(overpredicted)
pa
oa
True negative
False negative
(underpredicted)
na
ua
Methods of secondary structure
prediction
First generation methods:
single residue statistics
Chou & Fasman (1974 & 1978) :
Some residues have particular secondary-structure
preferences. Based on empirical frequencies of residues
in a-helices, b-sheets, and coils.
Examples: Glu
Val
α-helix
β-strand
Chou-Fasman method
Name
P (H)
P (E )
P (t urn)
f (i)
f (i+ 1)
f (i+ 2)
f (i+ 3)
Alanine
142
83
66
0.06
0.076
0.035
0.058
Arginine
98
93
95
0.07
0.106
0.099
0.085
101
54
146
0.147
0.11
0.179
0.081
Asparagine
67
89
156
0.161
0.083
0.191
0.091
Cysteine
70
119
119
0.149
0.05
0.117
0.128
Glutamic Acid
151
37
74
0.056
0.06
0.077
0.064
Glutamine
111
110
98
0.074
0.098
0.037
0.098
Glycine
57
75
156
0.102
0.085
0.19
0.152
Histidine
100
87
95
0.14
0.047
0.093
0.054
Isoleucine
108
160
47
0.043
0.034
0.013
0.056
Leucine
121
130
59
0.061
0.025
0.036
0.07
Lysine
114
74
101
0.055
0.115
0.072
0.095
Methionine
145
105
60
0.068
0.082
0.014
0.055
Phenylalanine
113
138
60
0.059
0.041
0.065
0.065
Proline
57
55
152
0.102
0.301
0.034
0.068
Serine
77
75
143
0.12
0.139
0.125
0.106
Threonine
83
119
96
0.086
0.108
0.065
0.079
108
137
96
0.077
0.013
0.064
0.167
69
147
114
0.082
0.065
0.114
0.125
106
170
50
0.062
0.048
0.028
0.053
Aspartic Acid
Tryptophan
Tyrosine
Valine
Chou-Fasman Method
• Accuracy: Q3 = 50-60%
Second generation methods:
segment statistics
• Similar to single-residue methods, but
incorporating additional information (adjacent
residues, segmental statistics).
• Problems:
– Low accuracy - Q3 below 66% (results).
– Q3 of b-strands (E) : 28% - 48%.
– Predicted structures were too short.
The GOR method
• developed by Garnier, Osguthorpe & Robson
• build on Chou-Fasman Pij values
• evaluate each residue PLUS adjacent 8 Nterminal and 8 carboxyl-terminal residues
• sliding window of 17 residues
• underpredicts b-strand regions
• GOR method accuracy Q3 = ~64%
Third generation methods
• Third generation methods reached 77% accuracy.
• They consist of two new ideas:
1. A biological idea –
Using evolutionary information based on
conservation analysis of multiple sequence
alignments.
2. A technological idea –
Using neural networks.
Artificial Neural Networks
An attempt to imitate the human brain (assuming
that this is the way it works).
Neural network models
- machine learning approach
- provide training sets of structures (e.g. a-helices, non
a -helices)
- computers are trained to recognize patterns in known
secondary structures
- provide test set (proteins with known structures)
- accuracy ~ 70 –75%
Reasons for improved accuracy
• Align sequence with other related proteins
of the same protein family
• Find members that has a known structure
• If significant matches between structure and
sequence assign secondary structures to
corresponding residues
New and Improved ThirdGeneration Methods
Exploit evolutionary information. Based on conservation
analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%)
Rost B, Sander, C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%)
Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
Arguably remains the top secondary structure prediction method
(won all CASP competitions since 1998).
Secondary Structure Prediction
Summary
1st Generation - 1970s
• Q3 = 50-55%
• Chou & Fausman, GOR
2nd Generation -1980s
• Q3 = 60-65%
• Qian & Sejnowski, GORIII
3rd Generation - 1990s
• Q3 = 70-80%
• PhD, PSIPRED
Many 3rd+ generation methods exist:
PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/
JPRED - http://www.compbio.dundee.ac.uk/~www-jpred/
PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
NNPRED - http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
The sequence-structure gap
More than 13,137,813 known protein sequences,
76,495 experimentally determined structures.
The sequence-structure gap
The gap is getting bigger.
Sequences
200000
180000
160000
140000
120000
100000
80000
60000
40000
20000
0
1985
1990
1995
2000
2005
Structures
2000000
1800000
1600000
1400000
1200000
1000000
800000
600000
400000
200000
0
1980
Protein Secondary Structures
(Simplifications)
a-HELIX
b-STRAND
COIL (everything else)
Beyond Secondary Structure
Before Tertiary Structure
Supersecondary structures (motifs): small, discrete,
commonly observed aggregates of secondary structures
helix-loop-helix
 b-a-b
Domains: independent units of structure
 b barrel
 four-helix bundle
The terms “domain” and “motif” are
sometimes used interchangeably.
Helix-loop-helix
Beyond Secondary Structure
Before Tertiary Structure
Folds: Compact folding arrangements of a polypeptide
chain (a protein or part of a protein).
The terms “domain” and “fold” are
sometimes used interchangeably.
EF Fold
Found in Calcium binding proteins such as Calmodulin
Leucine Zipper
Rossman Fold
•The beta-alpha-beta-alpha-beta subunit
•Often present in nucleotide-binding proteins
b sandwich
b barrel
a/b horseshoe
Four helix bundle
•24 amino acid peptide with a hydrophobic surface
•Assembles into 4 helix bundle through hydrophobic regions
•Maintains solubility of membrane proteins
TIM Barrel
PDB New Fold Growth
Old fold
New fold
• The number of unique folds in nature is fairly small
(possibly a few thousands)
• 90% of new structures submitted to PDB in the past three
years have similar structural folds in PDB
Protein data bank
• http//:www.rcsb.org/pdb/
Protein 3D structure data:
The structure of a protein consists of the 3D (X,Y,Z) coordinates
of each non-hydrogen atom of the protein.
Some protein structure also include coordinates of covalently
linked prosthetic groups, non-covalently linked ligand
molecules, or metal ions.
For some purposes (e.g. structural alignment) only the Cα
coordinates are needed.
Example of PDB format:
ATOM
ATOM
ATOM
ATOM
18
19
20
21
N
CA
C
O
GLY
GLY
GLY
GLY
27
27
27
27
X
Y
Z
occupancy / temp. factor
40.315
39.049
38.729
39.507
161.004
160.737
159.239
158.484
11.211
10.462
10.784
11.404
1.00
1.00
1.00
1.00
10.11
14.18
20.75
21.88
Note: the PDB format provides no information about connectivity between atoms. The
last two numbers (occupancy, temperature factor) relate to disorders of atomic
positions in crystals.
Protein structure: Some computational tasks
• Building a protein structure model from X-ray data
• Building a protein structure model from NMR data
• Computing the energy for a given protein structure (conformation)
• Energy minimization: Finding the structure with the minimal energy according to
some empirical “force fields”.
• Simulating the protein folding process (molecular dynamics)
• Structure visualization
• Computing secondary structure from atomic coordinates
• Protein superposition, structural alignment
• Protein fold classification
• Threading: finding a fold (prototype structure) that fits to a sequence
• Docking: fitting ligands onto a protein surface by molecular dynamics or energy
minimization
• Protein 3D structure prediction from sequence
Viewing protein structures
When looking at a protein structure, we may ask the following types of
questions:
• Is a particular residue on the inside or outside of a protein?
• Which amino acids interact with each other?
• Which amino acids are in contact with a ligand (DNA, peptide
hormone, small molecule, etc.)?
• Is an observed mutation likely to disturb the protein structure?
Standard capabilities of protein structure software:
• Display of protein structures in different ways (wireframe, backbone,
sticks, spacefill, ribbon.
• Highlighting of individual atoms, residues or groups of residues
• Calculation of interatomic distances
• Advanced feature: Superposition of related structures
Example: c-abl oncoprotein SH2 domain, display wireframe
Example: c-abl oncoprotein SH2 domain, display sticks
Example: c-abl oncoprotein SH2 domain, display backbone
Example: c-abl oncoprotein SH2 domain, display spacefill
Example: c-abl oncoprotein SH2 domain, display ribbons
Predicting protein 3d
structure
Goal: 3d structure from 1d sequence
An existing fold
Fold recognition
Homology modeling
A new fold
ab-initio
Homology modeling
Based on the two major observations
(and some simplifications):
1. The structure of a protein is
uniquely defined by its amino acid
sequence.
2. Similar sequences adopt similar
structures. (Distantly related
sequences may still fold into similar
structures.)
Homology modeling needs three
items of input:
• The sequence of a protein with unknown 3D
structure, the "target sequence."
• A 3D “template” – a structure having the
highest sequence identity with the target
sequence ( >30% sequence identity)
• An sequence alignment between the target
sequence and the template sequence
Homology Modeling: How it works
o Find template
o Align target sequence
with template
o Generate model:
- add loops
- add sidechains
o Refine model
Two zones of homology modeling
[Rost, Protein Eng. 1999]
Automated Web-Based Homology
Modelling
 SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html
 WHAT IF : http://www.cmbi.kun.nl/swift/servers/
 The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/
 3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/
 SDSC1 : http://cl.sdsc.edu/hm.html
 EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/
Fold recognition = Protein Threading
Which of the known folds is likely to be
similar to the (unknown) fold of a new
protein when only its amino-acid
sequence is known?
Protein Threading
• The goal: find the “correct” sequence-structure
alignment between a target sequence and its native-like
fold in PDB
MTYKLILN …. NGVDGEWTYTE
• Energy function – knowledge (or statistics) based rather
than physics based
– Should be able to distinguish correct structural folds from
incorrect structural folds
– Should be able to distinguish correct sequence-fold alignment
from incorrect sequence-fold alignments
Protein Threading
• Basic premise
The number of unique structural (domain) folds in
nature is fairly small (possibly a few thousand)
• Statistics from Protein Data Bank (~2,000 structures)
90% of new structures submitted to PDB in the past
three years have similar structural folds in PDB
• Chances for a protein to have a structural fold that
already exists in PDB are quite good.
Protein Threading
Basic components:
–
–
–
–
Structure database
Energy function
Sequence-structure alignment algorithm
Prediction reliability assessment
Protein Threading – structure database
• Build a template database
Process
• Threading - A protein fold recognition
technique that involves incrementally replacing
the sequence of a known protein structure with a
query sequence of unknown structure. The new
“model” structure is evaluated using a simple
heuristic measure of protein fold quality. The
process is repeated against all known 3D
structures until an optimal fit is found.
Fold recognition methods
• 3D-PSSM
http//:www.sbg.bio.ic.ac.uk/~3dpssm/
• Fugue
http://www-cryst.bioc.cam.ac.uk/~fugue/
• HHpred
http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hh
pred
ab-initio folding
Goal: Predict structure from “first
principles”
Requires:
– A free energy function, sufficiently close to
the “true potential”
– A method for searching the conformational
space
Advantages:
– Works for novel folds
– Shows that we understand the process
Disadvantages:
– Applicable to short sequences only
Rosetta [Simons et al. 1997]
http//:www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php
Qian et al. (Nature: 2007) used
distributed computing* to predict the
3D structure of a protein from its
amino-acid sequence. Here, their
predicted structure (grey) of a protein
is overlaid with the experimentally
determined crystal structure (color) of
that protein. The agreement between
the two is excellent.
*70,000 home computers for about two
years.
Overall Approach
Multiple Sequence
Alignment
Database Searching
No
Homologue
in PDB
Protein Sequence
Secondary
Structure
Prediction
Fold
Recognition
Yes
Homology
Modelling
3-D Protein Model
Sequence-Structure
Alignment
Ab-initio
Structure
Prediction
Yes
Predicted
Fold
No
ExPASy Proteomics Server:
Expert Protein Analysis System
links to lots of protein prediction resources
http://expasy.org/
RMSDmin
The root mean square deviation (RMSD) is the measure of the
average distance between the backbones of superimposed proteins.
In the study of globular protein conformations, one customarily
measures the similarity in three-dimensional structure by the RMSD
of the Cα atomic coordinates after optimal rigid body superposition.
A widely used way to compare the structures of biomolecules or
solid bodies is to “translate” or rotate one structure with respect to
the other to minimize the RMSD. This RMSDmin can be used as a
distance measure between two proteins.