Lecture 1 – Classificationx - LCQB


Structural
Bioinformatics
Elodie Laine
Master BIM-BMC Semestre 3, 2016-2017
Laboratoire de Biologie Computationnelle et Quantitative (LCQB)
e-documents: http://www.lcqb.upmc.fr/laine/STRUCT
Lecture 1 – Classification
Elodie Laine – 20.09.2016
General principles
Comparing structures: structural alignment, secondary structure elements, structural similarity, segment shapes
Classifying structures: functional motifs, evolutionary relationships
A brief history of protein structures
Linus Pauling (1901-1994)
Pauling & Corey (1951) PNAS
Nobel prize in 1954,
CALTECH
A brief history of protein structures
John Kendrew (1917-1997)
Kendrew et al. (1958) Nature
Nobel prize in 1962 with Max Perutz,
Cavendish Laboratory
Protein structures are puzzling
Kendrew et al. (1958) Nature
3-Dimensional structure of myoglobin (1958, 2 Å resolution)
Perhaps the most remarkable features of the molecule are its complexity and its lack of symmetry.
The arrangement seems to be almost totally lacking in the kind of regularities which one instinctively
anticipates, and it is more complicated than has been predicted by any theory of protein structure.
BACK TO BASICS
General principle driving protein folding
Hydrophobic core / hydrophilic surface
Observation: The main driving force for folding water-soluble globular
proteins is to pack the hydrophobic side chains into the interior of the molecule.
Problem: To pack side chains inside the protein core, the main chain must also
fold into the interior, but it is highly polar and thus hydrophilic.
Solution: formation of secondary structures characterized by hydrogen bonding between the main-chain NH and C’=O groups

Secondary structure: alpha (α) helix
A few facts about α-helices:
3.6 residues per turn
C’=O (n) ---- NH (n+4)
Length from 4-5 to more than 40 residues, about 10 residues on average
Rise of 1.5 Å per residue: 1.5 Å × 10 residues = 15 Å for an average helix
Space-filling model of the α-helix
Pauling & Corey (1951) PNAS
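The helix figures above can be checked with a few lines of arithmetic. This is a minimal sketch that only uses the two numbers quoted on the slide (3.6 residues per turn, 1.5 Å rise per residue); the function names are illustrative.

```python
# Geometry of an ideal alpha-helix, using the values quoted on the slide.
RESIDUES_PER_TURN = 3.6   # residues per helical turn
RISE_PER_RESIDUE = 1.5    # axial rise per residue, in angstroms


def helix_length(n_residues):
    """Length along the helix axis (angstroms) for n_residues."""
    return n_residues * RISE_PER_RESIDUE


def helix_pitch():
    """Axial distance covered by one full turn (angstroms)."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE


def helix_turns(n_residues):
    """Number of turns made by a helix of n_residues."""
    return n_residues / RESIDUES_PER_TURN


if __name__ == "__main__":
    print(f"10-residue helix: {helix_length(10):.1f} A long")
    print(f"pitch: {helix_pitch():.1f} A per turn")
    print(f"turns in 10 residues: {helix_turns(10):.2f}")
```

The pitch (rise per residue times residues per turn) is the quantity that follows directly from the two slide values.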
Secondary structure: beta (β) sheet
anti-parallel
parallel
A few facts about β-strands:
C’=O (n) ---- NH (n_adj), with n_adj a residue on an adjacent strand
Length from 5 to 10 residues
form pleated β-sheets
Secondary structure: loop regions
Type I
Hairpin loops
A few facts about loop regions:
C’=O ---- water ; NH ---- water (main-chain groups H-bond to the solvent)
various lengths & irregular shape
highly flexible
involved in binding sites
Type II
…
Secondary structure assignment
Given the 3D coordinates of the atoms of a protein, it is possible to assign
secondary structure to each amino acid residue.
The DSSP (Define Secondary Structure of Proteins) algorithm
Kabsch & Sander (1983) Biopolymers
- Identifies intra-backbone H-bonds using an electrostatic definition
- Defines eight types of secondary structures:
• 3₁₀ helix (G), α helix (H) and π helix (I) are recognized by a
repetitive sequence of H-bonds between residues 3, 4 or 5 positions apart
• β-structures are either isolated β-bridges (B), with a few H-bonds, or extended
strands (E), defined by longer sets of H-bonds and possibly containing β-bulges
• turns (T) feature H-bonds typical of helices
• bends (S) are regions with high curvature: the angle between Cα(i-2)→Cα(i) and Cα(i)→Cα(i+2) exceeds 70°
• loops (a blank or space) where no other rule applies
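The electrostatic H-bond definition can be illustrated with a short sketch. It assumes the constants of the Kabsch & Sander paper (partial charges 0.42e and 0.20e, factor 332, cutoff of -0.5 kcal/mol); the function names and the test geometry are illustrative, not part of DSSP itself.

```python
import math

# Assumed constants from Kabsch & Sander (1983)
Q1 = 0.42          # partial charge on C'=O, in units of e
Q2 = 0.20          # partial charge on N-H, in units of e
F = 332.0          # dimensional factor giving kcal/mol for distances in angstroms
CUTOFF = -0.5      # kcal/mol; below this value an H-bond is assigned


def hbond_energy(c, o, n, h):
    """Electrostatic energy between a C'=O group (c, o) and an N-H group (n, h).

    Each argument is an (x, y, z) tuple in angstroms.
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """True if the electrostatic energy is below the cutoff."""
    return hbond_energy(c, o, n, h) < CUTOFF


if __name__ == "__main__":
    # Idealized collinear geometry C=O ... H-N with an O...H distance of 1.9 A
    c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
    h, n = (3.13, 0.0, 0.0), (4.14, 0.0, 0.0)
    print(round(hbond_energy(c, o, n, h), 2), is_hbond(c, o, n, h))
```

Because the criterion is an energy rather than a fixed distance and angle, slightly distorted H-bonds are still accepted as long as the energy stays below the cutoff.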
Topology Diagrams
nucleoplasmin (1K5J)
auxin binding protein 1 (1LRH)
Structural motifs
Structural motifs are simple combinations of a
few secondary structure elements with a specific
geometrical arrangement. Some motifs may be
associated with a particular function.
Example: the EF-hand motif in calmodulin
Protein domains
The fundamental unit of tertiary structure is the domain. A domain can fold independently.
Receptor
tyrosine
kinases
What is a domain ?!
Protein domains are stable units of protein structure that can fold
autonomously.
In the past, protein domains have been described in terms of structure
compactness, function and evolution, or folding.
Small proteins and most medium sized ones have just one domain.
Often the different domains of a protein are associated with different
functions.
Domains are formed by different combinations of secondary structure
elements and motifs.
A JUNGLE OF SHAPES
All α domains
The first globular protein structure to be solved, myoglobin, belongs to the class
of α-domain structures.
In α-domain structures, α-helices are packed against each other to
produce a stable globular structure whose hydrophobic core is protected
from the solvent.
Alpha-helices are sufficiently versatile to produce many very different
classes of structures. In membrane-bound proteins, the regions inside the
membranes are frequently α-helices.
All α domains
Alpha helices are sufficiently versatile to produce many very different classes of structures.
Example: molecular chaperone prefoldin (PDB 1FXK), Siegert et al. (2000) Cell
An isolated α-helix in solution is marginally stable. F. Crick showed (1953) that the side-chain
interactions are maximized for helices wound around each other in a coiled-coil arrangement.
All α domains
The four-helix bundle is a common domain structure in α-proteins.
Example: growth hormone (PDB 1HWH), Sundstrom et al. (1996) J. Biol. Chem. Figure label: hydrophobic core.
Four-helix bundles are formed by four α-helices packed against each other with their helical axes
almost parallel to each other. Hydrophobic side chains are buried between the helices.
All α domains
The globin fold is present in myoglobin and hemoglobin.
Myoglobin (PDB 1MBN)
• 153 aa residues + heme group
• compact structure with 8 α-helical parts
• used for oxygen storage in the muscles
• has to bind oxygen reversibly at low pressures
Hemoglobin
• 4 subunits: 2 α-chains (141 aa) and 2 β-chains (146 aa)
• each subunit has one heme group
• used for oxygen transport
• can bind 4 oxygen molecules reversibly
• allosteric protein
α/β domains
The most frequent of the domain structures are the alpha/beta (α/β)
domains.
Alpha/beta domains consist of a central parallel or mixed β-sheet
surrounded by α-helices.
Parallel β-strands are arranged in barrels or sheets, according to three
main classes: the TIM barrel, the Rossmann fold and the horseshoe fold.
α/β domains
The TIM barrel has a central cylinder or barrel of β-sheet formed from 8 parallel β-strands.
This very common fold is found in many proteins with diverse functions and no detectable sequence identity (convergent evolution). The active sites are all formed by loop regions at the carboxyl ends of the β-strands that connect to the α-helices.
Example: triose phosphate isomerase (PDB 5CSS)
α/β domains
The Rossmann fold is composed of six parallel β-strands linked to two pairs of α-helices.
This fold is common in proteins that bind nucleotides. It was named after Michael Rossmann, Purdue University, who first discovered it in the enzyme lactate dehydrogenase in 1970.
Example structure: PDB 3QVO
α/β domains
The horseshoe fold is characteristic of leucine-rich repeats.
This fold is composed of repeating 20–30 aa stretches that are unusually rich in leucine. The β-strands form a curved parallel β-sheet with all the helices on the outside. One face of the β-sheet and one side of the helix array are exposed to solvent.
Example structure: PDB 1DFJ
All β domains
Antiparallel beta (β) structures represent the most functionally diverse
group of protein structures; it includes enzymes, transport proteins,
antibodies, cell surface proteins, and virus coat proteins.
The cores of these structures are built up by β-strands, from 4-5 to
over 10. The β-strands are arranged in a predominantly antiparallel
fashion.
They usually form two β-sheets (twisted by definition) joined together
and packed against each other, resulting in a barrel-like structure.
All β domains
The plasma-borne retinol-binding protein, RBP, is an up-and-down β-barrel.
The structure can be viewed as two β-sheets (green and blue) packed against each other. Red β-strands participate in both β-sheets. A retinol molecule, vitamin A (yellow), binds inside the barrel and is transported from the liver to various tissues before RBP is degraded.
Example structure: PDB 5HBS
All β domains
The influenza virus neuraminidase soluble head is a homotetramer (4 × 400 aa).
In this up-and-down β-sheet structure, the β-sheets do not form a simple barrel but instead 6 small sheets, each with 4 strands, arranged like the blades of a 6-bladed propeller. The active site is in the middle of one side of the propeller.
PDB 7NN9, P. Colman (1991), 2.9 Å resolution
All β domains
The γ-crystallin molecule has two domains, each built from 2 Greek key motifs.
The crystallins are lens-specific proteins responsible for the transparency and refractive power of the lenses in our eyes. The 4 Greek key motifs are evolutionarily related (2 events of duplication and fusion).
PDB 1A45, T. Blundell (1981), 1.9 Å resolution
All β domains
2 Greek key motifs can be found in the jelly roll barrel, very common in subunits of spherical viruses.
This complex nonlocal structure contains 4 pairs of antiparallel β-strands, only one of which is adjacent in sequence, "wrapped" in 3D to form a barrel shape.
Example structure: PDB 1QW9
All β domains
Up-and-down
γ-crystallin-like
jelly-roll
LET'S PUT SOME
ORDER
Hierarchical taxonomy
Levels, by increasing similarity:
• class: secondary structure content
• fold/topology: global shape
• superfamily: homology & similar function
The structural classification of proteins is centered on the notion of domains.
Protein structure classification resources
CATH/Gene3D
16 million protein domains classified into 2,626 superfamilies
http://www.cathdb.info/
SCOP/SCOPe
59,514 PDB entries representing 167,547 domains.
http://scop.berkeley.edu/
Superfamily-level annotations, based on a collection of hidden Markov models, for 2,478 completely sequenced genomes
SCOP & CATH – the standard of truth
The SCOP database is mainly based on expert knowledge.
• class: ’all α’, ’all β’, ’α/β’, ’α+β’
• fold: sec. struct. + connectivity
• superfamily: low seq. id, high struct. similarity
• family: high seq. similarity or functional evidence
The building process of CATH contains more automatic steps and less
human intervention.
• class: ’all α’, ’all β’, ’α/β’
• architecture: general shape
• topology: sec. struct. + connectivity
• homologous superfamily: low seq. id, high struct. similarity
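The four CATH levels map naturally onto a small data structure. The sketch below assumes the dotted numeric identifiers used by CATH (class.architecture.topology.homologous superfamily); the helper names are hypothetical, and the identifiers in the usage example are only illustrative inputs.

```python
from typing import NamedTuple, Optional

# Level names, from the most general to the most specific
CATH_LEVELS = ("class", "architecture", "topology", "homologous superfamily")


class CathNode(NamedTuple):
    klass: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath_id(identifier: str) -> CathNode:
    """Parse a dotted identifier such as '3.40.50.720' into its four levels."""
    parts = identifier.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {identifier!r}")
    return CathNode(*(int(p) for p in parts))


def deepest_shared_level(a: CathNode, b: CathNode) -> Optional[str]:
    """Name of the most specific level at which two domains are classified together."""
    shared = None
    for name, x, y in zip(CATH_LEVELS, a, b):
        if x != y:
            break
        shared = name
    return shared


if __name__ == "__main__":
    d1 = parse_cath_id("3.40.50.720")
    d2 = parse_cath_id("3.40.50.300")
    print(deepest_shared_level(d1, d2))  # same topology, different superfamily
```

Comparing two domains then reduces to finding the longest common prefix of their identifiers, which mirrors the hierarchy on the slide.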
CATH example
Classes
Class level: secondary structure content
Levitt and Chothia (MRC lab):
All alpha (α)
All beta (β)
Alpha and beta – mixed (α/β)
Alpha and beta proteins – segregated (α+β)
…
Folds and superfamilies
Domains belonging to the same fold have the same major secondary structures in
the same arrangement with the same topological connections.
Ex: Globin-like, Long alpha-hairpin, Type I dockerin domain…
The domains within a fold are further classified into superfamilies. Domains
belonging to the same superfamily have structural evidence to support a common
evolutionary ancestor but may not have detectable sequence homology.
Ex: Globin-like and Alpha-helical ferredoxin are the two superfamilies of the Globin-like fold.
Superfamilies
Example (figure): PA superfamily
SCOP and CATH comparison

A very large number of domain pairs are not classified consistently in the two hierarchies.
level          consistent     inconsistent   folds   superfamilies
family         7,970,415      133,335        70      102
superfamily    8,208,965      713,181        121     159
fold           10,879,564     2,389,191      84      500
class          268,747,988    62,849,692     745     1,258
other class    962,011,672    249,897,353    745     1,258
Csaba G, Birzele F, Zimmer R. (2009)
BMC Struct Biol.
Protein structure space
Choi and Kim (2006) PNAS.
The protein structure space is sparsely populated, and all of the proteins of known structures cluster mostly into four elongated regions, which correspond approximately to four SCOP classes (all-α, all-β, α+β, and α/β).
NOW, LET'S GO BACK
THROUGH TIME
Protein evolution
The evolution of proteins is different from the evolution of
organisms. It does not need to follow the evolutionary path of
organismic reproduction. Rather, the evolution of proteins is directly
related to improved, unaltered or diversified molecular functions, and
the protein function is directly related to protein structure.
Structures tend to diverge less than sequences. Proteins displaying
a certain degree of sequence similarity adopt similar shapes. Generally
above 40% sequence identity, the structures are very much alike.
No remains of primitive proteins exist. All information about
protein structures is derived from the proteins of present-day
organisms, and the current protein universe represents a time-sliced
view of all proteins at their various stages of evolution.
Evolutionary processes
Do all proteins displaying identical folds share a common ancestor?
Divergent evolution: homology
Convergent evolution: analogy
• Above a certain level of structural similarity
• Conservation of rare structural characteristics, e.g. βαβ left
• Low sequence identity, yet significant
• Key residues in the active site
• Transitivity: if A & B are homologous, B & C also, then A & C are homologous
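The transitivity argument amounts to grouping proteins into connected sets of pairwise homology assignments. The following is a minimal union-find sketch under that reading; the function name and the protein labels are hypothetical.

```python
def homology_clusters(homologous_pairs):
    """Group proteins into clusters using transitivity of homology.

    homologous_pairs: iterable of (protein_a, protein_b) pairs judged homologous.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        # Return the representative of x's cluster, shortening paths on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    pairs = [("A", "B"), ("B", "C"), ("D", "E")]
    print(homology_clusters(pairs))
```

A and C end up in the same cluster although no direct A-C evidence was supplied, which is exactly what the transitivity rule states.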
Can protein structures evolve?
α-amylase (1bpl)
G4-amylase (2amg)
GPK (1phk)
HIN recomb. (1hcr)
L11 (1fow)
biotin repressor (1bia)
sonic hedgehog (1vhh)
CAP (1cgp)
Protein structure evolutionary ages
Choi and Kim (2006) PNAS.
One can estimate the evolutionary ages of protein structures:
1) from a representative protein structure, retrieve all homologous sequences
2) map these sequences on the tree of life
3) find the most recent common ancestor of the organisms that contain these homologous sequences
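Step 3 can be sketched as a longest-common-prefix search over taxonomic lineages. This assumes each organism is described by its lineage written from the root of the tree down to the species; the lineages below are simplified toy inputs, and the function name is hypothetical.

```python
def most_recent_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    lineages: list of lists, each ordered from the root of the tree of life
    down to the organism. Returns None if the lineages share nothing.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    ancestor = None
    # zip stops at the shortest lineage, so we never index past its end
    for taxa_at_depth in zip(*lineages):
        if all(taxon == taxa_at_depth[0] for taxon in taxa_at_depth):
            ancestor = taxa_at_depth[0]
        else:
            break
    return ancestor


if __name__ == "__main__":
    human = ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"]
    yeast = ["cellular organisms", "Eukaryota", "Fungi", "Saccharomyces cerevisiae"]
    ecoli = ["cellular organisms", "Bacteria", "Escherichia coli"]
    print(most_recent_common_ancestor([human, yeast]))
    print(most_recent_common_ancestor([human, yeast, ecoli]))
```

The deeper the common ancestor sits in the tree, the younger the structure is estimated to be; a fold found in all three domains of life would map to the root.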
When homology is difficult to assess
Decarboxylases: convergent or divergent evolution?
Benzoylformate decarboxylase (BFD) and pyruvate decarboxylase (PDC) share a common fold
and overall biochemical function, but they recognize different substrates and have low (21%)
sequence identity.
When homology modeling fails
The K homology (KH) module is a widespread RNA-binding motif.
The type I and II KH domains belong to different protein folds. Thus KH motif proteins provide a rare example of protein domains that share significant sequence similarity in the motif regions but possess globally distinct structures.
HOW DO WE DO IN
PRACTICE?
How can we compare 2 structures?
Root Mean Square Deviation (RMSD) is a measure of structural similarity.
$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ (x_{ia} - x_{ib})^2 + (y_{ia} - y_{ib})^2 + (z_{ia} - z_{ib})^2 \right]}$$
It expresses the minimal global mean distance between the n corresponding atoms
of the superimposed structures a and b, where (x,y,z) are the atomic cartesian
coordinates.
The RMSD can be computed on a selection of atoms (backbone, heavy atoms…).
The RMSD computation requires that exactly n atoms from structure a correspond to n atoms from structure b.
The RMSD is generally computed after superimposition of structures a and b.
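The definition above translates directly into code. This is a minimal sketch that assumes the two coordinate lists are already superimposed and listed in corresponding order; it does not perform the superposition itself, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two lists of (x, y, z) tuples.

    The i-th atom of coords_a must correspond to the i-th atom of coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must provide the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(a, b))  # every atom is displaced by 1 A
```

Without a prior superposition the value also reflects any rigid-body shift between the two coordinate frames, which is why the superposition step matters.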
How can we establish the correspondence?
Structural alignment algorithms are used to establish the correspondence
between 2 structures and superimpose them.
1/ Identification of segments with similar shapes
2/ Combination of segment pairs
3/ Alignment extension in 3D
Structural alignment: an example
1/ Identification of segments with similar shapes
Each protein is subdivided in:
• segments of n residues that can overlap
• secondary structure elements
• parts of secondary structure elements
Similarity between segments is estimated by using:
• a filter on the end-to-end distance
• a filter on the distance from the N-terminus
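The first step can be illustrated with overlapping segments and the end-to-end distance filter. This is only a sketch: the segment length, the tolerance and the function names are assumptions chosen for illustration, and the second filter from the slide is not implemented.

```python
import math


def overlapping_segments(ca_coords, n):
    """All segments of n consecutive residues, as (start_index, coords) pairs."""
    return [(start, ca_coords[start:start + n])
            for start in range(len(ca_coords) - n + 1)]


def end_to_end_distance(segment):
    """Distance between the first and last C-alpha of a segment."""
    return math.dist(segment[0], segment[-1])


def candidate_segment_pairs(coords_a, coords_b, n=8, tolerance=1.0):
    """Pairs (i, j) of segment start indices whose end-to-end distances agree.

    tolerance is a hypothetical threshold in angstroms.
    """
    pairs = []
    for i, seg_a in overlapping_segments(coords_a, n):
        for j, seg_b in overlapping_segments(coords_b, n):
            if abs(end_to_end_distance(seg_a) - end_to_end_distance(seg_b)) <= tolerance:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    extended_x = [(3.8 * k, 0.0, 0.0) for k in range(5)]
    extended_y = [(0.0, 3.8 * k, 0.0) for k in range(5)]
    print(candidate_segment_pairs(extended_x, extended_y, n=3))
```

The end-to-end distance is independent of orientation, so segments can be compared before any superposition is attempted.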
Structural alignment: an example
2/ Combination of segment pairs
Final optimal alignment
Structural alignment: an example
3/ Extension of the alignment in 3D
Extend the current alignment by one residue in the forward or backward direction
Compute RMSDs after superposition
The extension with the smallest RMSD is retained
Distance-matrix ALIgnment (DALI)
DALI is a structural alignment method based on intra-molecular distances.
Background idea:
• Represent each protein as a 2D matrix storing all Cα-Cα distances.
• Slide one matrix onto the other to find the common sub-matrix with the best match
Implementation (Greedy):
• Break each matrix into elementary contact patterns
• Pair-up similar contact patterns (one from each protein)
• Assemble pairs in the correct order to yield the overall alignment
Distance matrix for Protein A
     1     2     3     4
1    0     d12   d13   d14
2    d12   0     d23   d24
3    d13   d23   0     d34
4    d14   d24   d34   0
Distance matrix pair
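Building such a matrix from Cα coordinates is straightforward. The sketch below is a minimal illustration with made-up coordinates; the function name is hypothetical.

```python
import math


def distance_matrix(ca_coords):
    """Symmetric matrix of all pairwise C-alpha distances.

    ca_coords: list of (x, y, z) tuples, one per residue.
    """
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    for row in distance_matrix(coords):
        print(["%.1f" % d for d in row])
```

The matrix depends only on internal distances, so it is unchanged by rotating or translating the protein, which is what makes it usable without a prior superposition.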
Distance-matrix ALIgnment (DALI)
Similarity measure:
$$S = \sum_{i \in \mathrm{core}} \sum_{j \in \mathrm{core}} \left( \theta - \Delta(d^A_{ij}, d^B_{ij}) \right) \, \omega(d^A_{ij}, d^B_{ij})$$
θ: similarity threshold
Δ: deviation from arithmetic mean
$\omega = \exp(-d^2 / r^2)$: envelope function
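The similarity measure can be sketched in code as follows. This assumes the form used in the Holm & Sander paper: the deviation is the absolute distance difference divided by the arithmetic mean of the two distances, the threshold is 0.2, the envelope radius is 20 Å, and diagonal terms (i = j) contribute the threshold itself. Function and parameter names are illustrative.

```python
import math


def distance_matrix(coords):
    """Pairwise distances between (x, y, z) tuples."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def dali_score(dist_a, dist_b, theta=0.2, radius=20.0):
    """Sum of pairwise similarity terms over matched residues.

    dist_a, dist_b: distance matrices restricted to the matched (core)
    residues, listed in corresponding order.
    """
    n = len(dist_a)
    score = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                score += theta          # assumed convention for diagonal terms
                continue
            mean = 0.5 * (dist_a[i][j] + dist_b[i][j])
            deviation = abs(dist_a[i][j] - dist_b[i][j]) / mean
            envelope = math.exp(-(mean / radius) ** 2)  # down-weights long distances
            score += (theta - deviation) * envelope
    return score


if __name__ == "__main__":
    a = [(3.8 * k, 0.0, 0.0) for k in range(4)]
    stretched = [(7.6 * k, 0.0, 0.0) for k in range(4)]
    print(dali_score(distance_matrix(a), distance_matrix(a)))
    print(dali_score(distance_matrix(a), distance_matrix(stretched)))
```

Pairs whose distances deviate by more than the threshold contribute negatively, so adding badly matching residues to the core lowers the score instead of raising it.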
Assembly of alignment:
• Non-trivial combinatorial problem
• The new alignment has one overlapping segment with the previous one: (S1S2) – (S1’S2’), (S2S3) – (S2’S3’),…
• Available alignment methods: Monte Carlo optimization, branch-and-bound, neighbor walk
Figure: myoglobin distance matrix
Distance-matrix ALIgnment (DALI)
3D (Spatial)
2D (Distance Matrix)
1D (Sequence)
Holm & Sander (1993) J. Mol. Biol.
Structural alignment tools
What is an optimal solution?
There is always a tradeoff between:
• maximizing the number n of corresponding atoms
• minimizing the overall root mean square deviation RMSD
$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ (x_{ia} - x_{ib})^2 + (y_{ia} - y_{ib})^2 + (z_{ia} - z_{ib})^2 \right]}$$
Structural alignment algorithms can produce incorrect solutions due to:
• structural divergence / flexibility despite high sequence homology
Vorolign
The Vorolign method is based on the assumption that the environments of two
structurally equivalent residues are similar due to positive selection in order to ensure the
structural integrity of the protein.
Birzele et al. (2007) Bioinformatics
Representation of the protein
Each residue cell contains by definition all points that are closer to the corresponding Cβ atom
(Cα for glycines) than to all other input nodes. The common face shared by two polyhedra in
the Voronoi decomposition corresponds to a contact between the two input points (residues).
Vorolign
Workflow of the algorithm
The similarity of 2 neighbour sets is calculated using dynamic programming (low-level matrix).
The score of the low level matrix is used to fill the high-level dynamic programming matrix.
Vorolign
Similarity of 2 nearest-neighbours
The similarity between 2 residues k and l belonging to the nearest-neighbour sets of
residues i and j is expressed as:
$$\mathrm{Sim}(x_{ik}, y_{jl}) = w_1 \cdot \mathrm{AA}(x_{ik}, y_{jl}) + w_2 \cdot \mathrm{SSE}(x_{ik}, y_{jl})$$
where AA is the amino acid exchange score and SSE the secondary structure score.
Similarity of nearest-neighbours sets
The similarity of two nearest-neighbour sets can be computed using dynamic
programming with respect to the similarity function of two nearest-neighbour
residues and an additional penalty p_u for unmatched residues.
$$S(k, l) = \max \begin{cases} S(k-1, l-1) + \mathrm{Sim}(x_{ik}, y_{jl}) \\ S(k-1, l) - p_u \\ S(k, l-1) - p_u \end{cases}$$
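The low-level dynamic programming can be sketched as below. Several parts are assumptions made for illustration: residues are represented as (amino acid, secondary structure) pairs, the exchange and secondary structure scores are simple identity tests standing in for real scoring matrices, the weights and penalty values are arbitrary, and the boundary rows charge the penalty for each leading unmatched residue.

```python
def residue_similarity(res_a, res_b, w1=1.0, w2=1.0):
    """Weighted sum of an amino acid score and a secondary structure score.

    res_a, res_b: (amino_acid, secondary_structure) tuples.
    Identity tests are placeholders for real exchange matrices.
    """
    aa_a, ss_a = res_a
    aa_b, ss_b = res_b
    aa_score = 1.0 if aa_a == aa_b else 0.0
    sse_score = 1.0 if ss_a == ss_b else 0.0
    return w1 * aa_score + w2 * sse_score


def neighbour_set_similarity(set_a, set_b, penalty=0.5, sim=residue_similarity):
    """Best alignment score of two ordered nearest-neighbour sets."""
    n, m = len(set_a), len(set_b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        S[k][0] = S[k - 1][0] - penalty      # leading unmatched residues of set_a
    for l in range(1, m + 1):
        S[0][l] = S[0][l - 1] - penalty      # leading unmatched residues of set_b
    for k in range(1, n + 1):
        for l in range(1, m + 1):
            S[k][l] = max(
                S[k - 1][l - 1] + sim(set_a[k - 1], set_b[l - 1]),  # match k with l
                S[k - 1][l] - penalty,                              # k unmatched
                S[k][l - 1] - penalty,                              # l unmatched
            )
    return S[n][m]


if __name__ == "__main__":
    a = [("L", "H"), ("A", "H"), ("G", "C")]
    b = a + [("K", "C")]
    print(neighbour_set_similarity(a, a), neighbour_set_similarity(a, b))
```

In the method described on the slides, the resulting score for residues i and j would then fill one cell of the high-level dynamic programming matrix.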
Conclusion
• Structure classification can help to understand the function of
proteins, and detect divergent or convergent evolutionary processes
• Root Mean Square Deviation (RMSD) is a very popular measure of
the global drift between two protein conformations
• Structural alignment methods can rely on secondary structure, Cα-Cα distances, fragments… to find the best match between two protein
structures
• Structure comparison and structural alignment are very complex
problems that remain unresolved. Concepts regarding methods and measures
are still disputed
SCOP & CATH – the standard of truth
The SCOP database is mainly based on expert knowledge: (1) classes are
’all α’, ’all β’, ’α/β’ as well as ’α+β’, (2) two proteins with a common fold have the
same major secondary structures in the same arrangement with the same
topological connections, (3) low sequence identities but structures suggesting a
common evolutionary origin define a superfamily, (4) domains in the same
family are likely to have a common evolutionary origin based on sequence
similarity or functional evidence.
The building process of CATH contains more automatic steps and less
human intervention: (1) classes are ’all α’, ’all β’ and ’α/β’, (2) two proteins
have the same architecture if they share common general features with respect to
the overall protein fold shape without considering connectivity, (3) the topology
level is analogous to the fold level of SCOP, (4) in a homologous superfamily,
proteins have a high structural similarity and similar functions, suggesting they
may have evolved from a common ancestor.