From Motifs to Structures

The Structure
Lectures
Boris Steipe
[email protected]
http://biochemistry.utoronto.ca/steipe
Departments of Biochemistry and Molecular and Medical Genetics
Program in Proteomics and Bioinformatics
University of Toronto
9.0
1
Lecture 9.0:
Use of Protein Structure
( Some slides have been adapted from material by Chris Hogue, Toronto, prepared for CBW in 2002)
Concepts
1. "Sequence" and "structure" are abstractions of biopolymers.
2. Structure can be determined experimentally.
3. Structure abstractions can be stored, retrieved and visualized.
4. Knowledge of structure allows mechanistic explanations.
5. Structure is not arbitrary, but comes in units - motifs, helices, strands, domains and complexes.
6. Domains are folding units, functional units and units of inheritance.
Concept 1: "Sequence" and "structure" are abstractions of biopolymers.
Physical Amino Acids and Amino Acid Abstractions
Formula: C9H9NO2
SMILES string†: [CH]([NH][R])([C](=[O])[R])[CH2][c]1([cH][cH][c]([cH][cH]1)[OH])
Name: Tyrosine
3-Letter: Tyr
1-Letter: Y
ATOM   1091  N   TYR   145     -35.676 -13.136  50.622  1.00 10.36
ATOM   1092  CA  TYR   145     -36.931 -13.763  51.019  1.00 10.63
ATOM   1093  C   TYR   145     -37.676 -12.879  52.016  1.00 11.16
ATOM   1094  O   TYR   145     -37.061 -12.316  52.926  1.00 13.91
ATOM   1095  CB  TYR   145     -36.660 -15.140  51.638  1.00  9.52
ATOM   1096  CG  TYR   145     -37.845 -15.737  52.361  1.00  6.36
ATOM   1097  CD1 TYR   145     -38.144 -15.357  53.663  1.00  3.30
ATOM   1098  CD2 TYR   145     -38.691 -16.652  51.727  1.00  6.14
ATOM   1099  CE1 TYR   145     -39.248 -15.856  54.311  1.00  5.57
ATOM   1100  CE2 TYR   145     -39.804 -17.165  52.376  1.00  4.89
ATOM   1101  CZ  TYR   145     -40.076 -16.757  53.670  1.00  4.35
ATOM   1102  OH  TYR   145     -41.170 -17.231  54.345  1.00  4.44
† http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
The Concept of Abstract Amino Acids Allows Highly Compressed Information
Y
H-bond Donor
Nucleophile
Bulky
Phospho-Acceptor
Hydrophobic
H-Bond Acceptor
Aromatic
2° side chain rotational freedom
The Concept of Abstract Amino Acid Similarity is Lossy
Y
Bulky (FILQRYW)
H-bond Donor (CHKNQRSTWY)
Nucleophile (CDESTY)
Phospho-Acceptor (STY)
Hydrophobic (FAMILYVW)
H-Bond Acceptor (DEHNQSTY)
Aromatic (FWH)
2° side chain rotational freedom (CDFHSW)
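The property sets on this slide can be written down directly. The following is a minimal Python sketch, assuming the sets exactly as listed on the slide; the names PROPERTIES, profile and shared are illustrative and not part of the lecture. It shows one way the abstraction is lossy: residues that are chemically different can end up with identical property profiles.

```python
# Property sets copied from the slide; keys are the slide's own labels.
# PROPERTIES, profile() and shared() are illustrative names only.
PROPERTIES = {
    "Bulky": set("FILQRYW"),
    "H-bond Donor": set("CHKNQRSTWY"),
    "Nucleophile": set("CDESTY"),
    "Phospho-Acceptor": set("STY"),
    "Hydrophobic": set("FAMILYVW"),
    "H-Bond Acceptor": set("DEHNQSTY"),
    "Aromatic": set("FWH"),
    "2° side chain rotational freedom": set("CDFHSW"),
}


def profile(residue):
    """Return the labels of all property sets containing a one-letter residue."""
    return {label for label, members in PROPERTIES.items() if residue in members}


def shared(a, b):
    """Return the properties two residues have in common under this abstraction."""
    return profile(a) & profile(b)


print(sorted(profile("Y")))          # every set on the slide that lists Y
print(sorted(shared("Y", "F")))      # what Tyr and Phe share
print(profile("I") == profile("L"))  # True: Ile and Leu cannot be told apart
```

Under these sets, Ile and Leu (and likewise Ala, Met and Val) collapse onto the same profile, which is the information loss the slide title refers to.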
Structure Contextualizes Sequence
… V V I Y T T G …
(Tyr262 in 1ERQ.pdb)
Structural Abstraction
To store structures we need:
- coordinate,
- topology, and
- chemical type
information.
[Figure: methionine (Met) with atoms labelled α, β, γ, δ, ε, a colour key for Carbon, Nitrogen, Oxygen and Sulphur, and x, y, z coordinate axes]
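The three kinds of information named on this slide map naturally onto a small data structure. The sketch below is one possible layout in Python, not the representation used by any particular database; the class and method names (Atom, Molecule, add_bond, neighbours) are hypothetical. The coordinates are those of CA, CB and CG of Tyr 145 from the ATOM records shown earlier in this lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str      # atom name within the residue, e.g. "CA"
    element: str   # chemical type, e.g. "C"
    x: float       # coordinates in Å
    y: float
    z: float


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)
    bonds: set = field(default_factory=set)  # topology: pairs of atom indices

    def add_bond(self, i, j):
        # Store each bond once, with the smaller index first.
        self.bonds.add((min(i, j), max(i, j)))

    def neighbours(self, i):
        """Indices of all atoms bonded to atom i."""
        return sorted(b if a == i else a for a, b in self.bonds if i in (a, b))


mol = Molecule()
mol.atoms.append(Atom("CA", "C", -36.931, -13.763, 51.019))
mol.atoms.append(Atom("CB", "C", -36.660, -15.140, 51.638))
mol.atoms.append(Atom("CG", "C", -37.845, -15.737, 52.361))
mol.add_bond(0, 1)  # CA-CB
mol.add_bond(1, 2)  # CB-CG
print(mol.neighbours(1))  # [0, 2]
```

Keeping topology separate from coordinates means the same bond list can be reused for every model of an ensemble.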
Concept 2: Structure can be determined experimentally.
Experimental sources of structure: X-ray
• Crystallization required
• Diffraction → data collection
• The phase problem: MAD, heavy-metal isomorphous derivatives ...
• ... or "molecular replacement" give phase approximations
• Model building in electron density maps
• Refinement
Experimental sources of structure: X-ray
Crystallization is limiting.
Diffraction is not imaging!
Refinement is required.
[Figure: data and model]
http://www-structure.llnl.gov/Xray/101index.html
Experimental sources of structure: NMR
• High concentration required (~ 1 mM)
• Assignment of peaks ...
• ... determination of crosspeaks → distance constraints
• Calculation of models from distance constraints
• Refinement
Experimental sources of structure: NMR
1DRO.PDB
Ensemble of structures that are compatible with experimental distance constraints
Consensus model
Concentration/Solubility
Assignment and NOEs
Refinement
Assessing structure quality
Metrics:
• Resolution, R-factor and R-free
• Bond length and angle deviations
• Coordinate error can be estimated from diffraction data
http://www.sci.sdsu.edu/TFrey/Bio750/Bio750X-Ray.html
Programs Whatcheck and Procheck calculate quality metrics:
http://swift.cmbi.kun.nl/WIWWWI//fullcheck.html
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html (also NMR)
Rules of thumb for "good structures":
Resolution 2 Å, R-factor 20%, mean coordinate error 0.2 Å, RMSD bond lengths 0.02 Å
Concept 3: Structure abstractions can be stored, retrieved and visualized.
The PDB
The PDB is the primary repository of protein structure data.
http://www.rcsb.org/pdb
What’s in a Structure File?
• Population experiments
• X-ray, 1 structure
• NMR - sometimes many structures
• Incomplete - not all “atoms” are there
• Hydrogens, parts of the protein in motion
• Crystallographic “space”
• correct, but not always relevant
The PDB format
•Flat file, column oriented
•Human readable
•Human editable
•Huge legacy problems
Flat file: a data file without indexing structure or hierarchy, in contrast to a relational database or a data grammar.
Header
HEADER    IMMUNOGLOBULIN                          01-MAR-93   2IMM      2IMM   2
COMPND    IMMUNOGLOBULIN VL DOMAIN (VARIABLE DOMAIN OF KAPPA LIGHT      2IMM   3
COMPND   2 CHAIN) OF MCPC603                                            2IMM   4
SOURCE    HUMAN (HOMO $SAPIENS) RECOMBINANT SYNTHETIC M603 GENE         2IMM   5
AUTHOR    B.STEIPE,R.HUBER                                              2IMM   6
REVDAT   1   15-JUL-93 2IMM    0                                        2IMM   7
REMARK   1                                                              2IMM   8
REMARK   1 REFERENCE 1                                                  2IMM   9
REMARK   1  AUTH   B.STEIPE,A.PLUCKTHUN,R.HUBER                         2IMM  10
REMARK   1  TITL   REFINED CRYSTAL STRUCTURE OF A RECOMBINANT           2IMM  11
REMARK   1  TITL 2 IMMUNOGLOBULIN DOMAIN AND A                          2IMM  12
REMARK   1  TITL 3 COMPLEMENTARITY-DETERMINING REGION 1-GRAFTED MUTANT  2IMM  13
REMARK   1  REF    J.MOL.BIOL.                   V. 225   739 1992      2IMM  14
REMARK   1  REFN   ASTM JMOBAK  UK ISSN 0022-2836                 070   2IMM  15
[...]
REMARK   2                                                              2IMM  23
REMARK   2 RESOLUTION. 2.00  ANGSTROMS.                                 2IMM  24
REMARK   3                                                              2IMM  25
[...]
Seqres
[...]
SEQRES   1    114  ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL  2IMM  35
SEQRES   2    114  SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER  2IMM  36
SEQRES   3    114  GLN SER LEU LEU ASN SER GLY ASN GLN LYS ASN PHE LEU  2IMM  37
SEQRES   4    114  ALA TRP TYR GLN GLN LYS PRO GLY GLN PRO PRO LYS LEU  2IMM  38
SEQRES   5    114  LEU ILE TYR GLY ALA SER THR ARG GLU SER GLY VAL PRO  2IMM  39
SEQRES   6    114  ASP ARG PHE THR GLY SER GLY SER GLY THR ASP PHE THR  2IMM  40
SEQRES   7    114  LEU THR ILE SER SER VAL GLN ALA GLU ASP LEU ALA VAL  2IMM  41
SEQRES   8    114  TYR TYR CYS GLN ASN ASP HIS SER TYR PRO LEU THR PHE  2IMM  42
SEQRES   9    114  GLY ALA GLY THR LYS LEU GLU LEU LYS ARG              2IMM  43
[...]
Explicit (above) and implicit sequence may differ!
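Reading the explicit sequence out of SEQRES records is a matter of slicing fixed columns and translating three-letter codes. The sketch below is a minimal Python illustration; the function name seqres_to_string is hypothetical, the column range (20 to 70) follows the published PDB format, and unknown residue names are mapped to "X" as an assumption of this sketch. The two input lines are the first two SEQRES records of 2IMM.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_string(lines):
    """Concatenate the residues of all SEQRES records into a one-letter string."""
    residues = []
    for line in lines:
        if line.startswith("SEQRES"):
            # Residue names occupy columns 20-70 (string indices 19..69).
            residues.extend(line[19:70].split())
    # Anything outside the 20 standard residues becomes "X" (sketch assumption).
    return "".join(THREE_TO_ONE.get(name, "X") for name in residues)


lines = [
    "SEQRES   1    114  ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL  2IMM  35",
    "SEQRES   2    114  SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER  2IMM  36",
]
print(seqres_to_string(lines))  # DIVMTQSPSSLSVSAGERVTMSCKSS
```

The implicit sequence would be collected the same way from the residue names of the ATOM records; comparing the two strings reveals residues that are present in the construct but missing from the coordinates.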
Pitfalls: Atom
ATOM    119  CA  ARG    18       8.386  51.105  35.847  1.00  7.30      2IMM 179
Fields, left to right: record type, atom number, atom name, amino acid type, sequence number, X, Y, Z, Occ (occupancy), B (temperature factors).
Atom name is a mix of chemical element and bond topology: "CA.." ≠ ".CA."
Sequence number is actually a string. Chain and insertion code are required to make it unique (e.g. B 123A).
PDB format is strictly column oriented!
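Because the format is column oriented, a record should be sliced by position rather than split on whitespace. The following Python sketch illustrates this for the ATOM record on this slide; parse_atom_line and the dictionary keys are hypothetical names, and the column positions are those of the published PDB format. The example line is assembled with explicit field widths so that its columns do not depend on hand-typed spaces.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record by fixed columns (1-based columns in comments)."""
    return {
        "record": line[0:6].strip(),     # columns 1-6
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16],             # columns 13-16, deliberately not stripped
        "res_name": line[17:20],         # columns 18-20
        "chain": line[21].strip(),       # column 22
        "res_seq": line[22:26].strip(),  # columns 23-26, kept as a string
        "i_code": line[26].strip(),      # column 27, insertion code
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }


# The slide's example record, built field by field.
line = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
    "ATOM", 119, " CA ", "", "ARG", "", 18, "",
    8.386, 51.105, 35.847, 1.00, 7.30)

atom = parse_atom_line(line)
print(repr(atom["name"]))      # ' CA ' is an alpha carbon
print(atom["name"] == "CA  ")  # False: 'CA  ' would be a calcium atom
print(atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Keeping the atom name unstripped preserves the distinction between the two "CA" cases, and keeping the sequence number as a string leaves room for the insertion code.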
Hetero Atoms
[...]
HETATM  877  O   HOH     1      -4.169  60.050  40.145  1.00  3.00      2IMM 937
[...]
http://xray.bmc.uu.se/hicup/
The crystallographic asymmetric unit does not necessarily contain a functional molecule
1qpi.pdb Tet-repressor/operator complex
The contents of a crystal lattice unit cell can be generated from the asymmetric unit by applying the required symmetry operations for the crystallographic space group. But this is not trivial for the non-crystallographer, nor is it obvious which of the symmetry replicates might make physiological contacts.
... Biological Unit
PQS reasons automatically about how a monomer might be correctly completed to a functional biomolecular complex (and is often correct).
http://pqs.ebi.ac.uk/
NCBI structure group
MMDB - very well integrated but somewhat impenetrable.
NDB
http://ndbserver.rutgers.edu/NDB/
urx035.pdb
(Hammerhead Ribozyme)
PDBsum - and "secondary" structure databases
http://www.biochem.ucl.ac.uk/bsm/pdbsum/
PDBsum - Information
Others
Macromolecular Structure Database at EBI (Relibase, PQS ...)
http://www.ebi.ac.uk/msd/
Macromolecular structure related resources at the PDB
http://www.rcsb.org/pdb/links.html
Structure links at the Southwestern Biotechnology and Informatics Center
http://www.swbic.org/links/1.19.2.5.php
Molecular Models from Chemistry
http://people.ouc.bc.ca/woodcock/molecule/molecule.html
Molecular Library
http://www.nyu.edu/pages/mathmol/library/
.... many, many more.
Concept 4: Knowledge of structure allows mechanistic explanations.
Structure as an integrated map - Example questions
• Which part of my structure appears to be conserved?
• Are two functionally important residues possibly in contact?
• Where is Asn220 relative to the active site?
• May the mutation E123A possibly have something to do with protein stability?
• Is Leu234 on the surface, or in the core?
• I want to clone my protein into a yeast two-hybrid system: should I fuse the DNA binding domain to the N- or the C-terminus?
Geometric relationships
• Bonds
• Angles, plane and dihedral
• Surfaces
• Chemical potential, amino acid functions
• Static and dynamic disorder
• Structural similarity
• Electrostatics
• Conservation patterns (structural and functional)
• Quaternary structure
• Posttranslational modification sites
• Unexpected homology
• [...]
Distances from coordinates
XYZ coordinates are vectors in an orthogonal coordinate system, in Å. All the rules of analytical geometry apply.
[...]
ATOM    687  OH  TYR    86       7.415  62.584  32.900  1.00  3.37
[...]
ATOM    651  O   ASP    82       9.996  62.571  32.488  1.00  5.18
[...]
$d = \sqrt{(9.996-7.415)^2 + (62.571-62.584)^2 + (32.488-32.900)^2}$
$= \sqrt{(2.581)^2 + (-0.013)^2 + (-0.412)^2}$
$= \sqrt{6.661561 + 0.000169 + 0.169744}$
$= \sqrt{6.831474}$
$= 2.614\ \text{Å} = 0.2614\ \text{nm} = 2.614 \times 10^{-10}\ \text{m}$
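The same calculation is a one-liner in code. This is a small Python sketch using the two atoms from the slide (OH of Tyr 86 and O of Asp 82); the function and variable names are illustrative.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z) tuples, in Å."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


tyr86_oh = (7.415, 62.584, 32.900)
asp82_o = (9.996, 62.571, 32.488)

d = distance(tyr86_oh, asp82_o)
print(f"{d:.3f} Å")  # 2.614 Å
```

At about 2.6 Å between a hydroxyl oxygen and a carbonyl oxygen, this pair is within the range usually considered for a hydrogen bond.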
Dihedral angles
[Figure: four consecutively bonded atoms i, i+1, i+2, i+3 and the positive dihedral angle +φ]
Single bonds: Freely rotatable, but constrained by steric overlap. Small energetic barrier, preference for staggered conformations.
Double bonds: Constrained to planar geometry. Large energetic barrier to isomerization.
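A dihedral angle can be computed from the coordinates of the four atoms i, i+1, i+2 and i+3. The sketch below is a plain-Python version of the standard atan2 formulation, under the usual sign convention that a clockwise rotation of the front bond onto the rear bond, viewed along the central bond, is positive; the helper names are illustrative and the test points are made-up unit-length geometries, not protein coordinates.

```python
import math


def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = sub(p1, p0)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    n1 = cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = dot(n1, n2)
    y = math.sqrt(dot(b1, b1)) * dot(b0, n2)
    return math.degrees(math.atan2(y, x))


# Central bond along z; the first atom points along +x.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # 0.0 (eclipsed, cis)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # 90.0
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))  # 180.0 (trans)
```

Applied to consecutive backbone atoms, the same function yields the backbone dihedral angles of a residue.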
Backbone dihedral angles: Ramachandran plots
Rotatable bonds in the backbone are named φ, ψ and ω.
Due to steric overlap, not all combinations of (φ, ψ) are allowed.
Allowed and forbidden regions of (φ, ψ) space are shown on the Ramachandran plot.
Observed (φ, ψ) values reflect the theoretical boundaries well.
Sidechain rotamers
100 randomly chosen Phe-residues superimposed.
Ponder & Richards (1987) J. Mol. Biol. 193, 775-791
http://dunbrack.fccc.edu/bbdep/
H-bond patterns
Example: TYR - Side Chain Donor
OH can donate a single hydrogen
(The OH-H bond is 1.00 Å long and lies in the plane of CE1, CE2, CZ and OH, forming an angle of 110 degrees with the CZ-OH bond.)
Distribution of H-bond counts in all and buried residues, D-A distances, H-A distances and D-H-A angles in Tyr sidechains.
Tyr-Thr sidechain H-bond: despite canonical geometry, correct topology may be ambiguous!
McDonald & Thornton (1994) J. Mol. Biol. 238, 777-793
http://www.biochem.ucl.ac.uk/bsm/atlas/
Molecular surface
Chain "A" of
1AON.PDB GroEL/ES complex
Surface rendering
of GroEL/ES
complex
(D. Goodsell)
Molecular surface
Surface provides a visual metaphor, and a useful tool to map properties. But how can a molecular surface be defined? Obviously, the hard-sphere surface is chemically not very relevant.
Van der Waals surface
Molecular surface
Probe !
Van der Waals surface
Molecular surface
Contact surface
Accessible surface
"Accessible"
Van der Waals surface
"Buried"
Reentrant surface
Calculating solvent accessible surfaces
1. Draw a sphere around each atom, with a radius of (VdW + solvent probe).
2. Erase all overlapping sphere surfaces.
3. The remaining area is the accessible surface.
VdW radii: C: 1.75 Å, N: 1.55 Å, O: 1.4 Å, H: 1.17 Å
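The three steps above can be approximated numerically by covering each expanded sphere with test points and counting those not buried inside a neighbouring sphere, a point-counting approach commonly attributed to Shrake and Rupley. The Python sketch below is an illustration only: the radii are the ones on the slide, but the probe radius of 1.4 Å, the number of points and the golden-spiral placement of points are assumptions of this sketch, and the function names are hypothetical.

```python
import math

VDW = {"C": 1.75, "N": 1.55, "O": 1.4, "H": 1.17}  # radii from the slide, in Å
PROBE = 1.4  # assumed water-sized probe radius in Å (not given on the slide)


def sphere_points(n):
    """n roughly evenly spaced points on the unit sphere (golden-spiral placement)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden_angle * k),
                       r * math.sin(golden_angle * k), z))
    return points


def accessible_area(atoms, n=500):
    """atoms: list of (element, x, y, z). Returns per-atom accessible area in Å^2."""
    unit = sphere_points(n)  # placed once, then scaled and translated per atom
    areas = []
    for i, (el_i, xi, yi, zi) in enumerate(atoms):
        ri = VDW[el_i] + PROBE
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + ri * ux, yi + ri * uy, zi + ri * uz
            # A point is buried if it lies inside any other expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (VDW[el_j] + PROBE) ** 2
                for j, (el_j, xj, yj, zj) in enumerate(atoms) if j != i)
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * ri * ri * exposed / n)
    return areas


print(accessible_area([("C", 0.0, 0.0, 0.0)]))                        # isolated atom
print(accessible_area([("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.0)]))  # bonded pair
```

The loop over all atom pairs is quadratic; a real implementation would restrict the inner test to spatial neighbours.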
Parameters and assumptions
Problem: Analytical solution inefficient.
Solution: Numerical solution with probe points
Problem: Regular placement of n probe points
Solution: Stochastic placement
Problem: Stochastic placement quite irregular
Solution: Enforce minimum separation
Problem: Efficiency
Solution: Place points only once, translate as needed
Problem: What is a good value for n?
Solution: Try different n, evaluate standard deviation
Problem: Should n be constant per atom, or per area?
Solution: dots/area - need to scale dots with r VdW
Problem: Hydrogens - where to get united atom radii?
Solution: Literature search.
Problem: Reference areas for relative SAA needed
Solution: Model explicitly, as tripeptides
[...]
Stochastic placement of points on a sphere: $u, v \in [0, 1]$, $\theta = 2\pi u$, $\phi = \cos^{-1}(2v - 1)$
http://mathworld.wolfram.com/SpherePointPicking.html
Even a straightforward algorithm has its hidden parameters and assumptions. Results are meaningful only in this context. Any comparison is problematic.
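Two of the solutions listed on this slide, stochastic placement and an enforced minimum separation, can be combined in a few lines. The Python sketch below uses the sphere point picking formula from the slide; the function names, the fixed random seed, the rejection strategy and the cap on attempts are assumptions made for illustration.

```python
import math
import random


def random_sphere_point(rng):
    """One uniformly distributed point on the unit sphere."""
    u, v = rng.random(), rng.random()
    theta = 2.0 * math.pi * u       # azimuth
    phi = math.acos(2.0 * v - 1.0)  # polar angle
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))


def pick_points(n, min_sep, seed=0, max_tries=100000):
    """Stochastic placement; candidates closer than min_sep to an accepted point are rejected."""
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        candidate = random_sphere_point(rng)
        if all(math.dist(candidate, q) >= min_sep for q in points):
            points.append(candidate)
    return points  # may hold fewer than n points if min_sep is too large


points = pick_points(50, 0.2)
print(len(points))
```

Both n and min_sep are exactly the kind of hidden parameter the slide warns about: areas computed with different settings are not directly comparable.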
Mapping properties on surfaces
• Properties of atoms (B-factors)
• Ensemble properties of residues (hydrophobicity, conservation)
• Geometry (local curvature)
• Fields and potentials (isosurfaces, binding potential)
AChE (1ACL.PDB) color coded by electrostatic potential with GRASP.
(http://trantor.bioc.columbia.edu/grasp/)
Concept 5: Structure is not arbitrary, but contains recurring units.
Basic building blocks of structure:
E.g. PROMOTIF - as used in PDBsum
But: classical descriptions of structural building blocks are as much based on idealized concepts of geometry as on observations of nature. An unbiased analysis may arrive at significantly different classifications!
Unbiased structure motifs: alignment with added value
Motif alignments ... Why are particular amino acids conserved? What is essential in a sequence?
A structure motif consensus sequence, compiled from unrelated segments, averages out features of conservation that are only due to incomplete divergence (homology).
A consensus sequence, taken from different structural contexts, averages out features of sequence that are due to specific functional (binding, catalysis) or non-local structural requirements (packing, interaction).
What remains is information about sequence propensities of local structural elements.
A schematikon motif example: complex loop
Motif: 1icf 215
Length: 7
Support: 7
Unique: 7
Rank: 399
A schematikon motif example: strand N-cap
Motif: 1whi 35
Length: 4
Support: 7
Unique: 7
Rank: 444
Concept 6: Domains are folding units, functional units, and units of inheritance.
Domains are ubiquitous in proteins
Large proteins are composed of compact, semi-independent units - domains.
Reason:
Modularity
Folding efficiency
2MCP.PDB
Domains in proteins:
Number of domains in 787 representative proteins used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
Domains in proteins:
Non-random relationship between domain number and chain length in the 787 representative proteins used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
Domains in proteins:
Domain size in the 787 representative proteins used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
There is no universal definition of "domains"
Possible definitions are based on independently inherited (sub)sequences (sequence domain), modular protein functions (functional domain), folding unit or atomic contacts (structural domain).
Domain: A part of structure that can fold irrespective of the presence of other parts of structure
But: what is measured is commonly sequence, function, or structure - NOT FOLDING!
Further complications:
Analogous structure, Domain insertions, Circular permutations, Domain swapping.
Domain insertion
1A2J.PDB Protein disulfide isomerase
2TRX.PDB Thioredoxin
Further complications:
Analogous structure, Domain insertions, Circular permutations, Domain swapping.
Circular permutation
253
1ERQ.PDB beta lactamase
1ALQ.PDB beta lactamase
Further complications:
Analogous structure, Domain insertions, Circular permutations, Domain swapping.
Domain swapping
11BG.PDB Bull seminal ribonuclease
Domains can be elusive:
The separation of a structure into
domains requires the arbitrary
definition of thresholds in a
continuum of possibilities.
informed
Why care?
Function: Evolution works on sequence, but selects function.
Definition of domains in structure can uncover functional units that may evolve independently. Sequence searches, alignments etc. with domains are much more specific.
Once structural domains have been defined, sequence profiles, HMMs or other computational procedures can be used to pick out more members of the domain family from the database.
Domains can be defined from sequence patterns, or from the analysis of structure.
Automated (objective) domain definition: - Sequence (CDD)
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
CDD from Smart and Pfam
CDART from CDD and Genbank
Semi-automated consensus domain definition: - Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA:
Jones S et al. (1998) Domain assignment for protein structures using a consensus approach: Characterization and analysis. Protein Science 7:233-242
SCOP & CATH: structural classification
The eight most frequent SCOP Superfolds
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.biochem.ucl.ac.uk/bsm/cath/
CATH - Class
Class 1: Mainly Alpha
Class 2: Mainly Beta
Class 3: Mixed Alpha/Beta
Class 4: Few Secondary Structures
CATH - Architecture
Roll
Super Roll
Barrel
2-Layer Sandwich
CATH - Topology
L-fucose Isomerase
Serine Protease
Aconitase, domain 4
TIM Barrel
CATH - Homology
Alanine racemase
Dihydropteroate (DHP) synthetase
FMN dependent fluorescent proteins
7-stranded glycosidases
CATH Entry
(Example)
IV: Open Issues
I: Integration into processes, scriptable APIs
II: Sequence based identification of domains
III: Analysing domains in context
IV: Defining modular domain functions
Bioinformaticians apparently do not like structure!
Sequence:
• Discrete alphabet
• Easy to manipulate
• Well developed datastructures
• Well developed libraries
Structure:
• Continuous space
• Linear algebra, complicated energy functions
• Databases and datastructures are difficult
• Paucity of libraries
Meet the challenge!