Predicting and Classifying Protein Structures
Michel Dumontier, Ph.D.
Carleton University
[email protected]
Lecture 3.2
Outline
• 3D Structure Determination
– Validation
• Structure Classification
• Structure Prediction
– Secondary Structure
Structure Validation
• A structure can (and often does)
have mistakes
• A poor structure will lead to poor
models of mechanism or relationship
• Unusual parts of a structure may
indicate something important (or an
error)
Famous “bad” structures
• Azotobacter ferredoxin (wrong space group)
• Zn-metallothionein (mistraced chain)
• Alpha bungarotoxin (poor stereochemistry)
• Yeast enolase (mistraced chain)
• Ras P21 oncogene (mistraced chain)
• Gene V protein (poor stereochemistry)
Structure Validation
• Assess experimental fit
– look at Resolution, R-Factor or RMSD
• Assess correctness of overall fold
– look at disposition of hydrophobic
residues
• Assess structure quality
– packing
– stereochemistry
– contacts...
X-Ray Resolution
Resolution (Å) and meaning:
• >4.0: Coordinates meaningless.
• 3.0 - 4.0: Fold possibly correct, but errors are very likely. Many sidechains placed with wrong rotamer.
• 2.5 - 3.0: Fold likely correct except that some surface loops might be mis-modelled. Several long, thin sidechains (lys, glu, gln, etc) and small sidechains (ser, val, thr, etc) likely to have wrong rotamers.
• 2.0 - 2.5: As 2.5 - 3.0, but number of sidechains in wrong rotamer is considerably less. Many small errors can normally be detected. Fold normally correct and number of errors in surface loops is small.
• 1.5 - 2.0: Few residues have wrong rotamer. Many small errors can normally be detected. Fold always correct, also in surface loops.
• 0.5 - 1.5: Threonines may have wrong chirality on the C-beta.
A Good Protein Structure..
X-ray structure
• R = 0.59 random chain
• R = 0.45 initial structure
• R = 0.35 getting there
• R = 0.25 typical protein
• R = 0.15 best case
• R = 0.05 small molecule
NMR structure
• RMSD = 4 Å random
• RMSD = 2 Å initial fit
• RMSD = 1.5 Å OK
• RMSD = 0.8 Å typical
• RMSD = 0.4 Å best case
• RMSD = 0.2 Å dream on
A Good Protein Structure..
• Minimizes disallowed
torsion angles
• Maximizes number of
hydrogen bonds
• Maximizes buried
hydrophobic ASA
• Maximizes exposed
hydrophilic ASA
• Minimizes interstitial
cavities or spaces
A Good Protein Structure..
• Minimizes number of
“bad” contacts
• Minimizes number of
buried charges
• Minimizes radius of
gyration
• Minimizes covalent
and noncovalent (van
der Waals and
coulombic) energies
Structure Validation Servers
• WHAT IF
– http://swift.cmbi.kun.nl/WIWWWI/
• Verify3D
– http://www.doe-mbi.ucla.edu/Services/Verify_3D/
• VADAR
– http://redpoll.pharmacy.ualberta.ca
Structure Validation Programs
• PROCHECK
– http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
• VADAR
– http://www.pence.ca/software/vadar/latest/vadar.html
• DSSP
– http://www.cmbi.kun.nl/gv/dssp/
Procheck
Outline
• 3D Structure Determination
– Validation
• Structure Classification
• Structure Prediction
– Secondary Structure
Domains are ubiquitous in
proteins
Large proteins are composed of compact,
semi-independent units - domains.
Reason:
Modularity
Folding efficiency
2MCP.PDB
Protein Domains – an alphabet of
functional modules
14-3-3, ANK3, Death, DED, PH, PTB, ARM, EFH, SAM, BH1, C1, EH, EVH, SH2, C2, SH3, FYVE, WD40, CARD, PDZ, WW
SCOP
• The SCOP database aims to provide a detailed and
comprehensive description of the structural and
evolutionary relationships between all proteins
whose structure is known.
• Created by manual inspection and aided by
automated methods
• Consists of four hierarchical categories:
– Class, Fold, Superfamily and Family.
• http://scop.mrc-lmb.cam.ac.uk/scop
Structural classification
The eight most frequent SCOP superfolds
Semi-automated consensus domain definition: Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA:
http://www.biochem.ucl.ac.uk/bsm/cath/
Jones S et al. (1998) Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Science 7:233-242
CATH - Class
• Class 1: Mainly Alpha
• Class 2: Mainly Beta
• Class 3: Mixed Alpha/Beta
• Class 4: Few Secondary Structures
Secondary structure content (automatic)
CATH - Architecture
• Roll
• Super Roll
• Barrel
• 2-Layer Sandwich
Orientation of secondary structures (manual)
CATH - Topology
• L-fucose Isomerase
• Serine Protease
• Aconitase, domain 4
• TIM Barrel
Topological connection and number of secondary structures
CATH - Homology
• Alanine racemase
• Dihydropteroate (DHP) synthetase
• FMN dependent fluorescent proteins
• 7-stranded glycosidases
Superfamily clusters of similar structures & functions
Conserved Domain Database
Automated (objective) domain definition using sequence.
• CDD: from Smart and Pfam
• CDART: from CDD and GenBank
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
Homologous domains have similar structures
• 1PLS - PH domain (Human pleckstrin)
• 2DYN - PH domain (Human dynamin)
• 1PLS/2DYN: 23% ID
Homology and Structural
Similarity
Proteins that diverge in evolution maintain their global fold!
Russell et al. (1997) J Mol Biol 269: 423-439
Superposition
• Important as a
means to identify
protein motifs and
fold families
• Non-evolutionary
structural
relationships
Structural similarity between
Calmodulin and Acetylcholinesterase
RMSD metric
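$$\mathrm{RMSD}_{coord}(A, B) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d(a_i, b_i)^2}$$
where d(a_i, b_i) is the distance between the corresponding points a_i in A and b_i in B.
To calculate the RMSD, a pairwise correspondence of points has to be defined first.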
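The RMSD metric above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two structures are given as equal-length lists of (x, y, z) tuples whose i-th entries already correspond to each other; the function name rmsd_coord is chosen here for illustration and no superposition is performed.

```python
import math

def rmsd_coord(a, b):
    """Root-mean-square deviation between paired points a[i] and b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point sets of equal length")
    # Sum of squared Euclidean distances d(a_i, b_i)^2 over all pairs.
    total = 0.0
    for p, q in zip(a, b):
        total += sum((x - y) ** 2 for x, y in zip(p, q))
    return math.sqrt(total / len(a))

# Two points; the second pair is 2 units apart, the first pair coincides.
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd_coord(A, B))
```

Because the coordinates are used as given, the value depends on how the two structures happen to be placed in space.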
RMSDopt
$$\mathrm{RMSD}_{opt} = \min(\mathrm{RMSD}_{coord})$$
$$\mathrm{RMSD}_{opt} = \mathrm{RMSD}_{coord}(A, R_s \times (B - T_s))$$
The translation vector T_s and the rotation matrix R_s define a superposition of the vector set B on A.
An analytic solution of the superposition problem is available, but not straightforward (involves an eigenvalue problem).
Superposition in practice
Pre-aligned structures
– VAST www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
– FSSP www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
– Homstrad www-cryst.bioc.cam.ac.uk/~homstrad/
– PDBsum www.biochem.ucl.ac.uk/bsm/pdbsum/
– DALI www.ebi.ac.uk/dali/
On the fly
– CE cl.sdsc.edu/ce.html
– FAST biowulf.bu.edu/FAST/
Outline
• 3D Structure Determination
– Validation
• Structure Classification
• Structure Prediction
– Secondary Structure
Secondary (2°) Structure
Table 10. Phi & Psi angles for Regular Secondary Structure Conformations
Structure               Phi (Φ)   Psi (Ψ)
Antiparallel β-sheet    -139      +135
Parallel β-sheet        -119      +113
Right-handed α-helix    -64       -40
3₁₀ helix               -49       -26
π helix                 -57       -70
Polyproline I           -83       +158
Polyproline II          -78       +149
Polyglycine II          -80       +150
Secondary Structure Prediction
• One of the first fields to emerge in
bioinformatics (~1967)
• Grew from a simple observation that
certain amino acids or combinations
of amino acids seemed to prefer to
be in certain secondary structures
• Subject of hundreds of papers and
dozens of books, many methods…
2° Structure Prediction
• Statistical (Chou-Fasman, GOR)
• Homology or Nearest Neighbor (Levin)
• Physico-Chemical (Lim, Eisenberg)
• Pattern Matching (Cohen, Rooman)
• Neural Nets (Qian & Sejnowski, Karplus)
• Evolutionary Methods (Barton, Niemann)
• Combined Approaches (Rost, Levin, Argos)
Secondary Structure Prediction
Chou-Fasman Statistics
Chou & Fasman Secondary Structure Propensity of the Amino Acids
Residue   Pa (α-helix)   Pb (β-strand)   Pc (coil)
A         1.42           0.83            0.75
C         0.70           1.19            1.11
D         1.01           0.54            1.45
E         1.51           0.37            1.12
F         1.13           1.38            0.49
G         0.57           0.75            1.68
H         1.00           0.87            1.13
I         1.08           1.60            0.32
K         1.16           0.74            1.10
L         1.21           1.30            0.49
M         1.45           1.05            0.50
N         0.67           0.89            1.44
P         0.57           0.55            1.88
Q         1.11           1.10            0.79
R         0.98           0.93            1.09
S         0.77           0.75            1.48
T         0.83           1.19            0.98
V         1.06           1.70            0.24
W         1.08           1.37            0.45
Y         0.69           1.47            0.84
Simplified C-F Algorithm
• Select a window of 7 residues
• Calculate average Pa over this window and
assign that value to the central residue
• Repeat the calculation for Pb and Pc
• Slide the window down one residue and
repeat until sequence is complete
• Analyze the resulting “plot” and assign each
residue the secondary structure (H, B, C)
with the highest value.
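The sliding-window steps above can be sketched in Python as follows. This is only an illustration of the simplified procedure, not the full Chou-Fasman method: the function name is hypothetical, the propensities are copied from the Chou & Fasman table, and the handling of the sequence ends (using a shorter window) and of ties (helix, then beta, then coil) are assumptions the slides do not specify.

```python
# (Pa, Pb, Pc) propensities for helix, beta and coil, from the table above.
PROPENSITY = {
    "A": (1.42, 0.83, 0.75), "C": (0.70, 1.19, 1.11), "D": (1.01, 0.54, 1.45),
    "E": (1.51, 0.37, 1.12), "F": (1.13, 1.38, 0.49), "G": (0.57, 0.75, 1.68),
    "H": (1.00, 0.87, 1.13), "I": (1.08, 1.60, 0.32), "K": (1.16, 0.74, 1.10),
    "L": (1.21, 1.30, 0.49), "M": (1.45, 1.05, 0.50), "N": (0.67, 0.89, 1.44),
    "P": (0.57, 0.55, 1.88), "Q": (1.11, 1.10, 0.79), "R": (0.98, 0.93, 1.09),
    "S": (0.77, 0.75, 1.48), "T": (0.83, 1.19, 0.98), "V": (1.06, 1.70, 0.24),
    "W": (1.08, 1.37, 0.45), "Y": (0.69, 1.47, 0.84),
}

def simplified_chou_fasman(sequence, window=7):
    """Assign H, B or C to each residue from window-averaged propensities."""
    half = window // 2
    states = "HBC"  # same order as the (Pa, Pb, Pc) tuples
    prediction = []
    for i in range(len(sequence)):
        # Assumption: near the ends the window is truncated, not padded.
        start, stop = max(0, i - half), min(len(sequence), i + half + 1)
        rows = [PROPENSITY[aa] for aa in sequence[start:stop]]
        averages = [sum(column) / len(rows) for column in zip(*rows)]
        # Ties go to the first state in "HBC" (an arbitrary choice).
        prediction.append(states[averages.index(max(averages))])
    return "".join(prediction)

print(simplified_chou_fasman("AAAAAAAAAA"))
print(simplified_chou_fasman("VVVVVVVVVV"))
```

A homopolymer is a convenient sanity check because every window average equals that residue's own propensities, so poly-alanine should come out as helix and poly-valine as beta.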
Simplified C-F Algorithm
[Figure: helix, beta and coil profiles plotted against residue number (10 to 60)]
Limitations of Chou-Fasman
• Does not take into account
– long range information (>3 residues away)
– structure class
• Does not include
– related sequences or alignments in prediction
process
• Only about 55% accurate
The PHD Algorithm
• Search the SWISS-PROT
database and select high scoring
homologues
• Create a sequence “profile” from
the resulting multiple alignment
• Include global sequence info in
the profile
• Input the profile into a trained
two-layer neural network to
predict the structure and to
“clean-up” the prediction
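The idea of turning a multiple alignment into a sequence “profile” can be illustrated with per-column residue frequencies. This is a minimal sketch only: the function name and the gap character "-" are assumptions, and the encoding PHD actually feeds to its neural network is not described in these slides and may differ.

```python
from collections import Counter

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        # Assumption: gaps are written as "-" and are ignored in the counts.
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile

# Three aligned toy sequences (hypothetical data).
profile = column_frequencies(["ACD", "ACE", "A-E"])
print(profile)
```

Each position is then described by how often every residue type occurs there among the homologues, rather than by a single letter.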
Prediction Performance
[Figure: bar chart of Scores (%), axis from 45 to 75, for PHD, ZHANG, GOR III, JASEP7, PTIT, LEVIN, LIM, GOR I and CF]
Best of the Best
• PredictProtein-PHD (72%)
– http://cubic.bioc.columbia.edu/predictprotein/
• Jpred (73-75%)
– http://www.compbio.dundee.ac.uk/~wwwjpred/submit.html
• SAM-T02 (75%)
– http://www.cse.ucsc.edu/research/compbio/HMMapps/T02-query.html
• PSIpred (77%)
– http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
Evaluating Secondary
Structure Predictions
• Historically problematic due to tester
bias (developer trains and tests their
own predictions)
• Some predictions were up to 10% off
• Move to make testing independent
and test sets as large as possible
• EVA – evaluation of protein
secondary structure prediction
EVA
• >10 different
methods evaluated
as new structures are
deposited in the PDB
• Results posted on
the web and updated
weekly
• http://cubic.bioc.columbia.edu/eva
EVA
Secondary Structure
Evaluation
• Q3 score
– standard method in evaluating
performance, 3 states (H,C,B) evaluated
like a multiple choice exam with 3
choices. Same as % correct
• SOV (segment overlap score)
– more useful measure of how segments
overlap and how much overlap exists
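The Q3 score described above amounts to the percentage of residues whose predicted state matches the observed one. A minimal sketch, assuming both assignments are given as equal-length strings over H, B and C (the function name is hypothetical); SOV is not shown because its segment-based definition is not spelled out here.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Two of the four residues are predicted correctly.
print(q3_score("HHHH", "HHCC"))
```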
Homology Modeling
• Similar sequences usually share the
same fold.
• Structure models can be constructed
from alignments with proteins having
a 3D structure.
• When no suitable template structure
can be found, possible templates are
found using “threading”
• More with Boris in 3.3 and 3.5
ab initio
Protein Structure Prediction
• Predicting the 3D structure without any
“prior knowledge”
• Used when homology modeling or
threading have failed (no homologues are
evident)
• Equivalent to solving the “Protein Folding
Problem”
• Still an active research problem
• Howard’s Lecture 5.2
Conclusions
• Protein structures are now sufficiently
abundant and well defined that they can
be classified using well-developed rules
of taxonomy
• Distant relationships and common rules
of folding can be uncovered through
fold classification & comparison
Conclusions
• Structure prediction is still one of the key
areas of active research in bioinformatics
and computational biology
• Significant strides have been made over
the past decade through the use of larger
databases, machine learning methods
and faster computers