Protein structure hierarchical levels


CENTRE FOR INTEGRATIVE BIOINFORMATICS VU
Lecture 16:
Domains, their prediction and
domain databases
Introduction to Bioinformatics
Sequence-Structure-Function
• Sequence → Structure: ab initio prediction and folding (impossible but for the smallest structures); threading
• Sequence → Function: homology searching (BLAST)
• Structure → Function: function prediction from structure (very difficult)
Functional Genomics – Systems
Biology
Genome
Expressome
Proteome
Metabolomics, fluxomics
Metabolome
Systems Biology
is the study of the interactions between the components
of a biological system, and how these interactions give
rise to the function and behaviour of that system (for
example, the enzymes and metabolites in a metabolic
pathway). The aim is to quantitatively understand the
system and to be able to predict the system’s time
processes
• the interactions are nonlinear
• the interactions give rise to emergent properties, i.e. properties that
cannot be explained by the individual components of the system alone
• Biological processes include many time-scales, many
compartments and many interconnected network levels (e.g.
regulation, signalling, expression,..)
Systems Biology
understanding is often achieved through
modeling and simulation of the system’s
components and interactions.
Many times, the ‘four Ms’ cycle is adopted:
Measuring
Mining
Modeling
Manipulating
‘The silicon cell’ (some people think ‘silly-con’ cell)
A system response
Apoptosis: programmed cell death
Necrosis: accidental cell death
‘Comparative metabolomics’: Human versus Yeast
We need to be able to do automatic pathway comparison (pathway alignment)
(Figure annotation: important difference with human pathway)
This pathway diagram shows a comparison of pathways in (left) Homo sapiens
(human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in
controlling enzymes (square boxes in red) and the pathway itself have occurred
(yeast has one altered (‘overtaking’) path in the graph)
Experimental
• Structural genomics
• Functional genomics
• Protein-protein interaction
• Metabolic pathways
• Expression data
Issues when elucidating function
experimentally
• Partial information (indirect interactions) and
subsequent filling of the missing steps
• Negative results (elements that have been
shown not to interact, enzymes missing in an
organism)
• Putative interactions resulting from
computational analyses
Protein function categories
• Catalysis (enzymes)
• Binding – transport (active/passive)
– Protein-DNA/RNA binding (e.g. histones, transcription factors)
– Protein-protein interactions (e.g. antibody-lysozyme), experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening
– Protein-fatty acid binding (e.g. apolipoproteins)
– Protein – small molecules (drug interaction, structure decoding)
• Structural component (e.g. crystallin)
• Regulation
• Signalling
• Transcription regulation
• Immune system
• Motor proteins (actin/myosin)
Catalytic properties of enzymes
Michaelis-Menten equation:
$V = \frac{V_{max} \times [S]}{K_m + [S]}$
Reaction scheme: E + S ⇌ ES → E + P (kcat is the rate constant of the second step)
(Plot: V in moles/s against [S]; the curve levels off at Vmax and reaches Vmax/2 at [S] = Km.)
• E = enzyme
• S = substrate
• ES = enzyme-substrate complex (transition state)
• P = product
• Km = Michaelis constant
• kcat = catalytic rate constant (turnover number)
• kcat/Km = specificity constant (useful for comparison)
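As a minimal sketch, the Michaelis-Menten relation can be evaluated directly. The function name and the parameter values below are hypothetical and chosen only for illustration; they do not describe any particular enzyme.

```python
def michaelis_menten(s, vmax, km):
    """Reaction velocity V = Vmax * [S] / (Km + [S])."""
    if s < 0 or km <= 0:
        raise ValueError("substrate concentration must be >= 0 and Km > 0")
    return vmax * s / (km + s)


# Hypothetical parameters, for illustration only.
VMAX = 10.0  # e.g. moles/s
KM = 2.0     # same concentration units as [S]

for s in (0.0, KM, 10 * KM, 1000 * KM):
    v = michaelis_menten(s, VMAX, KM)
    print(f"[S] = {s:8.1f}   V = {v:.3f}")
```

At [S] = Km the velocity is half of Vmax, and at very high [S] it approaches Vmax, which is what the plot on the slide shows.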
Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
Energy difference upon binding
Examples of protein interactions (and of functional
importance) include:
• Protein – protein
(pathway analysis);
• Protein – small molecules
(drug interaction, structure decoding);
• Protein – peptides, DNA/RNA
The change in Gibbs free energy of the protein-ligand
binding interaction can be monitored and expressed by
the following equation:
$\Delta G = \Delta H - T \Delta S$
(H = enthalpy, S = entropy and T = temperature)
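A small worked example of the free energy equation, as a sketch only: the function name is hypothetical, and the enthalpy and entropy values are invented for illustration rather than measured for any real complex.

```python
def gibbs_free_energy_change(delta_h, delta_s, temperature):
    """Return dG = dH - T * dS.

    delta_h in J/mol, delta_s in J/(mol*K), temperature in K.
    """
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
delta_g = gibbs_free_energy_change(delta_h=-50_000.0, delta_s=-100.0,
                                   temperature=298.15)
verdict = "favourable" if delta_g < 0 else "unfavourable"
print(f"dG = {delta_g:.1f} J/mol ({verdict})")
```

A negative change in free energy means binding is thermodynamically favourable at that temperature.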
Protein-protein interaction networks
Protein function
• Many proteins combine functions
• Some immunoglobulin structures are thought to
have more than 100 different functions (and
active/binding sites)
• Alternative splicing can generate (partially)
alternative structures
Protein function & Interaction
Active site /
binding cleft
Shape complementarity
Protein function evolution
Chymotrypsin
How to infer function
• Experiment
• Deduction from sequence
– Multiple sequence alignment – conservation
patterns
– Homology searching
• Deduction from structure
– Threading
– Structure-structure comparison
– Homology modelling
Cholesterol Biosynthesis:
Cholesterol biosynthesis primarily occurs in
eukaryotic cells. It is necessary for membrane
synthesis, and is a precursor for steroid hormone
production as well as for vitamin D. While the
pathway had previously been assumed to be
localized in the cytosol and ER, more recent
evidence suggests that a good deal of the
enzymes in the pathway exist largely, if not
exclusively, in the peroxisome (the enzymes
listed in blue in the pathway to the left are
thought to be at least partly peroxisomal).
Patients with peroxisome biogenesis disorders
(PBDs) have a variable deficiency in cholesterol
biosynthesis
Cholesterol Biosynthesis:
from acetyl-CoA to mevalonate
Mevalonate plays a role in epithelial cancers:
it can inhibit EGFR
Epidermal Growth Factor as a
Clinical Target in Cancer
A malignant tumour is the product of uncontrolled cell proliferation.
Cell growth is controlled by a delicate balance between growth-promoting and growth-inhibiting factors. In normal tissue the
production and activity of these factors results in differentiated cells
growing in a controlled and regulated manner that maintains the
normal integrity and functioning of the organ. The malignant cell has
evaded this control; the natural balance is disturbed (via a variety of
mechanisms) and unregulated, aberrant cell growth occurs. A key
driver for growth is the epidermal growth factor (EGF) and the
receptor for EGF (the EGFR) has been implicated in the
development and progression of a number of human solid tumours
including those of the lung, breast, prostate, colon, ovary, head and
neck.
Energy housekeeping:
Adenosine diphosphate (ADP) – Adenosine triphosphate (ATP)
Chemical Reaction
Add Enzymatic Catalysis
Add Gene Expression
Add Inhibition
Metabolic Pathway: Proline
Biosynthesis
Proline as end product effects a negative feedback loop
Transcriptional Regulation
Methionine Biosynthesis in E. coli
Shortcut Representation
High-level Interaction representation
Levels of Resolution
SREBP Pathway
Signal Transduction
Important signalling pathways:
MAP kinase (MAPK) signalling
pathway, or TGF-β pathway
Transport
Phosphate Utilization in Yeast
Multiple Levels of Regulation
• Gene expression
• Protein posttranslational modification
• Protein activity
• Protein intracellular location
• Protein degradation
• Substrate transport
Graphical Representation –
Gene Expression
Protein interaction domains
http://pawsonlab.mshri.on.ca/index.php?option=com_content&task=view&id=30&Itemid=63
Domain function
Active site / binding cleft
Protein-protein (domain-domain) interaction
Shape complementarity
A domain is a:
• Compact, semi-independent unit
(Richardson, 1981).
• Stable unit of a protein structure that can
fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary
module (Bork, 1992).
“Nature is a tinkerer and not an inventor” (Jacob, 1977).
• Smallest unit of function
Delineating domains is essential for:
• Obtaining high resolution structures (x-ray but
particularly NMR – size of proteins)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary
structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of
a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
Domain connectivity
linker
Structural domain organisation can be nasty…
Pyruvate kinase
Phosphotransferase
• β barrel regulatory domain
• α/β barrel catalytic substrate binding domain
• α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Domain size
• The size of individual structural domains varies
widely
– from 36 residues in E-selectin to 692 residues in
lipoxygenase-1 (Jones et al., 1998)
– the majority (90%) having fewer than 200 residues
(Siddiqui and Barton, 1995)
– with an average of about 100 residues (Islam et al.,
1995).
• Small domains (fewer than 40 residues) are often
stabilised by metal ions or disulphide bonds.
• Large domains (more than 300 residues) are
likely to consist of multiple hydrophobic cores (Garel,
1992).
Analysis of chain hydrophobicity in
multidomain proteins
Domain characteristics
Domains are genetically mobile units, and
multidomain families are found in all three
kingdoms (Archaea, Bacteria and Eukarya)
underlining the finding that ‘Nature is a tinkerer and
not an inventor’ (Jacob, 1977).
The majority of genomic proteins, 75% in unicellular
organisms and more than 80% in metazoa, are
multidomain proteins created as a result of gene
duplication events (Apic et al., 2001).
Domains in multidomain structures are likely to
have once existed as independent proteins, and
many domains in eukaryotic multidomain proteins
can be found as independent proteins in
prokaryotes (Davidson et al., 1993).
Protein function evolution
- Gene (domain) duplication Active site
Chymotrypsin
Pyruvate phosphate dikinase
• 3-domain protein
• Two domains catalyse 2-step reaction
A → B → C
• Third so-called ‘swivelling domain’ actively
brings intermediate enzymatic product (B)
over 45 Å from one active site to the other
The DEATH Domain
http://www.mshri.on.ca/pawson
• Present in a variety of Eukaryotic
proteins involved with cell death.
• Six helices enclose a tightly
packed hydrophobic core.
• Some DEATH domains form
homotypic and heterotypic dimers.
Detecting Structural Domains
• A structural domain may be detected as a
compact, globular substructure with more
interactions within itself than with the rest of the
structure (Janin and Wodak, 1983).
• Therefore, a structural domain can be determined
by two shape characteristics: compactness and
its extent of isolation (Tsai and Nussinov, 1997).
• Measures of local compactness in proteins have
been used in many of the early methods of
domain assignment (Rossmann et al., 1974;
Crippen, 1978; Rose, 1979; Go, 1978) and in
several of the more recent methods (Holm and
Sander, 1994; Islam et al., 1995; Siddiqui and
Barton, 1995; Zehfus, 1997; Taylor, 1999).
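The idea of "more interactions within itself than with the rest of the structure" can be illustrated by counting contacts for a proposed domain assignment. This is only a sketch: the function name, the 8 Å contact cutoff and the toy coordinates are assumptions made here, not parameters taken from any of the cited methods.

```python
from itertools import combinations
from math import dist


def contact_counts(coords, labels, cutoff=8.0):
    """Count residue pairs within `cutoff`, split by domain assignment.

    Returns (intra, inter): contacts between residues sharing a domain
    label, and contacts between residues with different labels.
    """
    intra = inter = 0
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            if labels[i] == labels[j]:
                intra += 1
            else:
                inter += 1
    return intra, inter


# Toy structure: two tight clusters of three "residues", far apart.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0),
          (30, 0, 0), (33, 0, 0), (30, 3, 0)]
labels = ["A", "A", "A", "B", "B", "B"]

intra, inter = contact_counts(coords, labels)
print(f"intra-domain contacts: {intra}, inter-domain contacts: {inter}")
```

A good assignment maximises the first number (compactness) relative to the second (isolation); real methods use more refined measures than a single distance cutoff.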
Detecting Structural Domains
• However, these approaches encounter problems
when faced with discontinuous or highly
associated domains, and many definitions
will require manual interpretation.
• Consequently there are discrepancies
between assignments made by domain
databases (Hadley and Jones, 1999).
Detecting Domains using
Sequence only
• Even more difficult than prediction from structure!
Integrating protein multiple sequence
alignment, secondary and tertiary structure
prediction in order to predict structural domain
boundaries in sequence data
SnapDRAGON
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
1. Input: Multiple sequence alignment (MSA)
and predicted secondary structure
2. Generate 100 DRAGON 3D models for the
protein structure associated with the MSA
3. Assign domain boundaries to each of the 3D
models (Taylor, 1999)
4. Sum proposed boundary positions within 100
models along the length of the sequence,
and smooth boundaries using a weighted
window
George R.A. and Heringa J.(2002) SnapDRAGON - a method to delineate protein structural domains from
sequence data, J. Mol. Biol. 316, 839-851.
SnapDRAGON
(Figure: multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE → folds generated by DRAGON → boundary recognition (Taylor, 1999) → summed and smoothed boundaries)
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
1. Input: Multiple sequence alignment (MSA)
– Sequence searches using PSI-BLAST (Altschul et al., 1997)
– followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992)
– and alignment by PRALINE (Heringa, 1999)
and predicted secondary structure
– PREDATOR secondary structure prediction program
Domain prediction using DRAGON
Distance Regularisation Algorithm for
Geometry OptimisatioN
(Aszodi & Taylor, 1994)
• Fold proteins based on the requirement that
(conserved) hydrophobic residues cluster together.
• First construct a random high dimensional Cα
distance matrix.
• Distance geometry is used to find the 3D
conformation corresponding to a prescribed target
matrix of desired distances between residues.
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
– DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
– (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand)
– It first constructs a random high dimensional Cα (and pseudo Cβ) distance matrix
– Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN
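To illustrate how a predicted secondary structure string can yield distance estimates between residues, here is a minimal sketch. It is not DRAGON's own parametrisation: the function name is hypothetical, and the rise per residue (about 1.5 Å in an α-helix, about 3.3 Å in a β-strand) is a textbook approximation assumed here.

```python
from itertools import groupby

# Approximate rise per residue in Angstrom (textbook values, assumed here;
# not taken from the DRAGON paper).
RISE_PER_RESIDUE = {"H": 1.5, "E": 3.3}


def segment_end_distances(ss):
    """Estimate first-to-last residue distances for helix/strand segments.

    `ss` is a secondary structure string over H (helix), E (strand) and
    C (coil). Returns tuples (start, end, state, distance) with 0-based,
    inclusive residue indices.
    """
    segments = []
    pos = 0
    for state, run in groupby(ss):
        length = len(list(run))
        if state in RISE_PER_RESIDUE and length > 1:
            distance = RISE_PER_RESIDUE[state] * (length - 1)
            segments.append((pos, pos + length - 1, state, distance))
        pos += length
    return segments


for start, end, state, distance in segment_end_distances("CCHHHCCEEE"):
    print(f"{state} segment {start}-{end}: about {distance:.1f} Angstrom")
```

Estimates of this kind could serve as entries of a target distance matrix, alongside the hydrophobic clustering requirement described above.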
(Figure: the input data, i.e. the multiple alignment and the predicted secondary structure (e.g. CCHHHCCEEE), define an N × N target matrix; 100 randomised initial N × N Cα distance matrices lead to 100 predictions, each giving 3D coordinates for the N residues.)
• The Cα distance matrix is divided into smaller clusters.
• Separately, each cluster is embedded into a local centroid.
• The final predicted structure is generated from full
embedding of the multiple centroids and their
corresponding local structures.
(Figure: PDB structure compared with DRAGON model for Lysozyme (4lzm), Methyltransferase (1sfe) and Phosphatase (2hhm-A))
Taylor method (1999)
DOMAIN-3D
3. Assign domain boundaries to each of
the 3D models (Taylor, 1999)
• Easy and clever method
• Uses a notion of spin glass theory (disordered magnetic
systems) to delineate domains in a protein 3D structure
• Steps:
1. Take sequence with residue numbers (1..N)
2. Look at neighbourhood of each residue (first shell)
3. If (“average neighbourhood residue number” > resno) resno = resno+1
else resno = resno-1
4. If (convergence) then take regions with identical “residue
number” as domains and terminate
Taylor, W.R. (1999) Protein structural domain identification. Protein Engineering 12: 203-216
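The steps above can be sketched in a few lines. This is a simplified illustration, not Taylor's published implementation: the function name, the 8 Å neighbourhood cutoff, the synchronous update and the iteration cap are all assumptions. One deliberate deviation is labelled in the code: a label equal to its neighbourhood average is left unchanged, so that the loop can reach a fixed point.

```python
from collections import defaultdict
from math import dist


def taylor_like_domains(coords, cutoff=8.0, max_iter=1000):
    """Label-propagation sketch of the residue-number update rule."""
    n = len(coords)
    # First-shell neighbours: all other residues within `cutoff` (assumed).
    neighbours = [
        [j for j in range(n)
         if j != i and dist(coords[i], coords[j]) <= cutoff]
        for i in range(n)
    ]
    labels = list(range(1, n + 1))  # step 1: residue numbers 1..N
    for _ in range(max_iter):
        new_labels = labels[:]
        for i, nbrs in enumerate(neighbours):
            if not nbrs:
                continue
            mean = sum(labels[j] for j in nbrs) / len(nbrs)
            if mean > labels[i]:
                new_labels[i] += 1
            elif mean < labels[i]:
                new_labels[i] -= 1
            # mean == label: unchanged (assumption; the slide decrements)
        if new_labels == labels:  # step 4: convergence
            break
        labels = new_labels
    domains = defaultdict(list)
    for residue, label in enumerate(labels, start=1):
        domains[label].append(residue)
    return sorted(domains.values())


# Toy structure: two compact clusters of five residues, 50 Angstrom apart.
cluster = [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0), (1, 1, 2)]
coords = cluster + [(x + 50, y, z) for x, y, z in cluster]
print(taylor_like_domains(coords))
```

On real structures the labels may oscillate rather than settle exactly, which is why the sketch caps the number of iterations; grouping by identical label is then only approximate.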
Taylor method (1999)
(Figure: residue 41 with first-shell neighbours 5, 6, 56, 78 and 89)
if 41 < (5+6+56+78+89)/5
then Res 41 → 42 (up 1)
else Res 41 → 40 (down 1)
repeat until convergence
Taylor method (1999)
(Figure: examples of continuous and discontinuous domains)
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
4. Sum proposed boundary positions within 100
models along the length of the sequence,
and smooth boundaries using a weighted
window (assign central position)
Window score = $\sum_{1 \le i \le l} S_i \times W_i$
where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n+1)$, with n the window length (n = l).
It follows that $\sum_{1 \le i \le l} W_i = 1$
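The weighted window can be sketched as follows. The function names are hypothetical, the window length is assumed to be odd (which is when the weights sum exactly to 1), and the zero padding at the sequence ends is a choice made here, not something stated on the slide.

```python
def triangular_weights(l):
    """Weights W_i = (p - |p - i|) / p**2 for i = 1..l, with p = (l + 1) / 2."""
    if l < 1 or l % 2 == 0:
        raise ValueError("window length must be a positive odd integer")
    p = (l + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, l + 1)]


def smooth(scores, l):
    """Assign each position the weighted window score centred on it.

    `scores` holds the summed boundary counts S per sequence position.
    Positions beyond the sequence ends count as zero (an assumption).
    """
    weights = triangular_weights(l)
    half = l // 2
    padded = [0.0] * half + list(scores) + [0.0] * half
    return [
        sum(s * w for s, w in zip(padded[k:k + l], weights))
        for k in range(len(scores))
    ]


# Toy boundary counts: nine models place a boundary at the third position.
print(smooth([0, 0, 9, 0, 0], 5))
```

The triangular shape gives the central position the largest weight, so an isolated peak is spread over its neighbours while its maximum stays at the same position.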
SNAPDRAGON
Statistical significance:
• Convert peak scores to Z-scores using
z = (x-mean)/stdev
• If z > 2 then assign domain boundary
Statistical significance using random models:
• Test hydrophobic collapse given distribution of
hydrophobicity over sequence
• Make 5 scrambled multiple alignments (MSAs) and
predict their secondary structure
• Make 100 models for each MSA
• Compile mean and stdev from the boundary
distribution over the 500 random models
• If observed peak z > 2.0 stdev (from random models)
then assign domain boundary
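A minimal sketch of the Z-score test, assuming the smoothed boundary scores of the random models are available as a flat list. The function name and the toy numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the slide does not say which estimator was used.

```python
from statistics import mean, stdev


def significant_boundaries(observed, random_scores, threshold=2.0):
    """Return 1-based positions whose Z-score exceeds `threshold`.

    `observed` holds smoothed boundary scores per sequence position;
    `random_scores` holds scores collected from the random models.
    """
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumed)
    if sigma == 0:
        raise ValueError("random scores have zero variance")
    return [
        pos for pos, x in enumerate(observed, start=1)
        if (x - mu) / sigma > threshold
    ]


# Toy numbers, for illustration only.
random_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [3.0, 7.0, 6.0]
print(significant_boundaries(observed, random_scores))
```

Only the second position is reported here: its score lies more than two standard deviations above the mean of the random scores, while the third falls just short.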
SnapDRAGON prediction
assessment
• Test set of 414 multiple alignments: 183 single and
231 multiple domain proteins.
• Boundary predictions are compared to the region
of the protein connecting two domains (maximally
10 residues from true boundary)
SnapDRAGON prediction assessment
• Baseline method I:
• Divide sequence in equal parts based on number of
domains predicted by SnapDRAGON
• Baseline method II:
• Similar to Wheelan et al., based on domain length
partition density function (PDF)
• PDF derived from 2750 non-redundant structures
(deposited at NCBI)
• Given sequence, calculate probability of one-domain, two-domain, .., protein
• Highest probability taken and sequence split equally
as in baseline method I
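The equal split used by both baselines is simple enough to sketch directly. The function name is hypothetical, and integer division is one assumed way of rounding the split points.

```python
def equal_split_boundaries(sequence_length, n_domains):
    """Boundary positions that divide a sequence into equal parts."""
    if sequence_length < 1 or n_domains < 1:
        raise ValueError("sequence_length and n_domains must be positive")
    return [k * sequence_length // n_domains for k in range(1, n_domains)]


print(equal_split_boundaries(300, 3))  # a three-domain prediction
print(equal_split_boundaries(150, 1))  # a single-domain protein has none
```

For baseline method II the number of domains would come from the domain length probability density function instead of from SnapDRAGON; the split itself is the same.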
Average prediction results per protein

Method       Measure    Continuous set   Discontinuous set   Full set
SnapDRAGON   Coverage   63.9 (± 43.0)    35.4 (± 25.0)       51.8 (± 39.1)
SnapDRAGON   Success    46.8 (± 36.4)    44.4 (± 33.9)       45.8 (± 35.4)
Baseline 1   Coverage   43.6 (± 45.3)    20.5 (± 27.1)       34.7 (± 40.8)
Baseline 1   Success    34.3 (± 39.6)    22.2 (± 29.5)       29.6 (± 36.6)
Baseline 2   Coverage   45.3 (± 46.9)    22.7 (± 27.3)       35.7 (± 41.3)
Baseline 2   Success    37.1 (± 42.0)    23.1 (± 29.6)       31.2 (± 37.9)
Coverage is the % of linkers predicted: TP/(TP+FN)
Success is the % of correct predictions made: TP/(TP+FP)
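Both measures can be sketched for a single protein as follows. This is a simplification: each true linker is represented by one boundary position, and a prediction counts as correct when it lies within 10 residues of a true boundary, following the tolerance mentioned in the assessment. The function name and the toy positions are hypothetical.

```python
def coverage_and_success(true_boundaries, predicted, tolerance=10):
    """Return (coverage %, success %) for one protein.

    Coverage = TP / (TP + FN): fraction of true boundaries that are hit.
    Success  = TP / (TP + FP): fraction of predictions that are correct.
    """
    hit_true = sum(
        any(abs(t - p) <= tolerance for p in predicted)
        for t in true_boundaries
    )
    correct_predictions = sum(
        any(abs(t - p) <= tolerance for t in true_boundaries)
        for p in predicted
    )
    coverage = 100.0 * hit_true / len(true_boundaries) if true_boundaries else 0.0
    success = 100.0 * correct_predictions / len(predicted) if predicted else 0.0
    return coverage, success


# Toy example: two true boundaries, three predictions, one of them correct.
coverage, success = coverage_and_success([100, 250], [95, 180, 400])
print(f"coverage = {coverage:.1f}%, success = {success:.1f}%")
```

Averaging these per-protein percentages over a test set would give figures of the kind reported in the table, with the large standard deviations reflecting how much individual proteins differ.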