Lect 6 - BIDD - National University of Singapore
BL5203: Molecular Recognition & Interaction
Lecture 6: Modeling Protein Structure and
Protein-Protein Interaction
Y.Z. Chen
Department of Pharmacy
National University of Singapore
Tel: 65-6616-6877; Email: [email protected] ; Web: http://bidd.nus.edu.sg
Content
• Protein fold and structure
• Homology modeling
• Protein-protein docking
Sizes of protein databases
[Figure: bar chart on a logarithmic axis (1 to 10,000,000,000) of approximate database sizes: 500M protein residues, 1.6M protein sequences, 26K protein structures, 1K protein complexes. Labelled "Swiss-Prot database".]
Protein structure classification
Protein world
Protein fold
Protein superfamily
Protein family
New Fold
[Figure: "PDB New Fold Growth", showing new PDB structures split into old folds and new folds]
• The number of unique folds in nature is fairly small (possibly a few
thousand)
• 90% of new structures submitted to PDB in the past three years
have similar structural folds in PDB
Protein classification
• The number of protein sequences grows exponentially
• The number of solved structures grows exponentially
• The number of new folds identified is very small (and close
to constant)
• Protein classification can
– Generate overview of structure types
– Detect similarities (evolutionary relationships) between
protein sequences
Problems in Protein
Bioinformatics
• 20,000 entries of proteins in the PDB
• 1000 - 2000 distinct protein folds in nature
• Thought to be only several thousand
unique folds in all
• Prediction of structure from sequence
– Fold recognition
– Fragment construction
• Proteome annotation
• Protein-protein docking
Protein folding code
[Diagram labels: protein sequence, protein folding code, protein structure]
Prediction of correct fold
[Diagram labels: query sequence, fold recognition, matched fold]
Match sequence against library of known folds
Eisenberg et al.
Jones, Taylor, Thornton
Computational Requirements
• 1 sequence search takes 12 mins (3 GHz)
• Benchmarking on 100 proteins with 100
runs for a simplex search of parameter
space = 80 days
• 30 approaches explored = 7 years (on 1
cpu)
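The figures on this slide follow from simple multiplication, and a short sketch makes the rounding visible. The function name `cpu_days` and the 365-day year are assumptions made for illustration; the inputs (12 minutes, 100 proteins, 100 runs, 30 approaches) are the ones quoted above.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365  # assumption: calendar year, no leap days


def cpu_days(minutes_per_run, n_proteins, runs_per_protein):
    """Total single-CPU time, in days, for one full benchmark."""
    total_minutes = minutes_per_run * n_proteins * runs_per_protein
    return total_minutes / MINUTES_PER_DAY


# One simplex search of parameter space: 100 proteins x 100 runs x 12 min.
benchmark_days = cpu_days(12, 100, 100)

# Thirty different approaches, each needing its own benchmark.
all_approaches_years = 30 * benchmark_days / DAYS_PER_YEAR

print(f"one benchmark: {benchmark_days:.1f} CPU days")
print(f"30 approaches: {all_approaches_years:.1f} CPU years")
```

The exact values are about 83 CPU days and 6.8 CPU years, which the slide rounds to 80 days and 7 years.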
Types of Structure Prediction
• De novo protein modeling
– methods seek to build three-dimensional
protein models "from scratch"
– Example: Rosetta
• Comparative protein modeling
– uses previously solved structures as
starting points, or templates.
– Example: protein threading
Factors that Make Protein Structure
Prediction a Difficult Task
• The number of possible structures that proteins
may possess is extremely large, as highlighted
by the Levinthal paradox
• The physical basis of protein structural stability
is not fully understood.
• The primary sequence may not fully specify the
tertiary structure.
– chaperones
• Direct simulation of protein folding is not
generally tractable for both practical and
theoretical reasons.
Homology Modeling
• Homolog: a protein related to the target by
divergent evolution from a
common ancestor
• When the target has 40% amino-acid identity with its
homolog and NO large insertions or deletions
– modeling produces a predicted structure
equivalent to that of a medium-resolution
experimentally solved structure
• 25 % of known protein sequences
fall in a safe area implying they
can be modeled reliably
Homology Modeling Defined
• Homology modeling
– Based on the reasonable assumption that two
homologous proteins will share very similar
structures.
– Given the amino acid sequence of an unknown
structure and the solved structure of a homologous
protein, each amino acid in the solved structure is
mutated computationally, into the corresponding
amino acid from the unknown structure.
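The residue-by-residue replacement described above can be sketched from a pairwise alignment. This is only an illustration of the bookkeeping, not a modeling program: the alignment is assumed to be given as two equal-length gapped strings, percent identity is assumed to be counted over columns where both sequences have a residue, and the function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(t, s) for t, s in zip(aligned_target, aligned_template)
             if t != "-" and s != "-"]
    identical = sum(1 for t, s in pairs if t == s)
    return 100.0 * identical / len(pairs)


def substitutions(aligned_target, aligned_template):
    """List (template_position, template_residue, target_residue) for every
    aligned column where the template residue must be mutated."""
    result = []
    position = 0  # 1-based position in the ungapped template sequence
    for t, s in zip(aligned_target, aligned_template):
        if s == "-":
            continue  # insertion in the target: no template residue to mutate
        position += 1
        if t != "-" and t != s:
            result.append((position, s, t))
    return result


target = "ACDEFGHIK"    # sequence of unknown structure (toy example)
template = "ACDQFGHLK"  # sequence of the solved structure (toy example)
print(round(percent_identity(target, template), 1))
print(substitutions(target, template))
```

In this toy case two of nine template residues (positions 4 and 8) would be mutated; gapped columns are skipped here because insertions and deletions need loop modeling, which the sketch does not attempt.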
Homology Modeling Limitations
• Cannot study conformational changes
• Cannot find new catalytic/binding sites
• Cannot explain lack of activity vs activity
– Chymotrypsinogen, trypsinogen and plasminogen
– 40% identical
– 2 active, 1 with no activity; the models cannot explain why
• Large bias towards structure of template
• Models cannot be docked together
Why Homology Modeling?
• Value in structure based drug design
• Find common catalytic sites/molecular
recognition sites
• Use as a guide to planning and interpreting
experiments
• 70-80% chance that the target protein has a fold similar to
a structure already solved by X-ray crystallography or
NMR spectroscopy
• Sometimes it’s the only option or best guess
Protein Threading
• A target sequence is threaded through the
backbone structure of a collection of template
proteins (fold library)
• Quantitative measure of how well the sequence
fits the fold
• Based on assumptions
– 3-D structures of proteins have characteristics that
are semi-quantitatively predictable
– reflect the physical-chemical properties of amino
acids
– Limited types of interactions allowed within folding
Fold Recognition Methods
• Bowie, Lüthy and Eisenberg (1991)
• 2 approaches to recognition methods
• Derive a 1-D profile for each structure in the fold
library and align the target sequence to these
profiles
– Identify amino acids based on core or external
positions
– Part of secondary structure
• Consider the full 3-D structure of the protein
template
– Modeled as a set of inter-atomic distances
– NP-hard (if interactions of multiple residues are included)
Protein Threading
• The word threading implies that one drags the
sequence (ACDEFG...) step by step through
each location on each template
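A toy version of this "dragging" step is sketched below, without gaps. Everything specific in it is an assumption for illustration: each template is reduced to a string of environments ("B" for buried, "E" for exposed), the hydrophobic residue set is a simplification, and the +1/-1 score is a stand-in for a real threading potential.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classification


def position_score(residue, environment):
    """+1 if a hydrophobic residue sits in a buried position or a polar
    residue in an exposed one, otherwise -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread_ungapped(sequence, template_env):
    """Slide the sequence along the template one position at a time and
    return a list of (offset, score)."""
    results = []
    for offset in range(len(template_env) - len(sequence) + 1):
        score = sum(position_score(res, template_env[offset + i])
                    for i, res in enumerate(sequence))
        results.append((offset, score))
    return results


def best_fold(sequence, fold_library):
    """Return (fold_name, offset, score) for the best-scoring placement."""
    best = None
    for name, env in fold_library.items():
        for offset, score in thread_ungapped(sequence, env):
            if best is None or score > best[2]:
                best = (name, offset, score)
    return best


library = {"foldA": "EEBEBEBE", "foldB": "BBBBEEEE"}  # hypothetical folds
print(best_fold("LKVEL", library))
```

The sequence LKVEL alternates hydrophobic and polar residues, so it fits best where the template alternates buried and exposed positions. Real threading methods also allow gaps and use much richer score functions.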
Generalized Threading Score
• Want to correctly recognize arrangements of residues
• Building a score function
– potentials of mean force
– from an optimization calculation.
• G(r_AB) = -kT ln(ρ_AB / ρ°_AB)
– G: free energy
– k and T: Boltzmann's constant and temperature, respectively
– ρ_AB: the observed frequency of AB pairs at distance r
– ρ°_AB: the frequency of AB pairs at distance r you would expect to
see by chance
– with the minus sign, pairs seen more often than chance get a negative (favourable) energy
• Z-score = (E_nat - <E_alt>) / σ(E_alt)
– Energy of the native structure minus the mean energy of all the wrong
(alternative) structures, divided by the standard deviation of those energies
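Both quantities on this slide are easy to compute once the frequencies and energies are available. The sketch below uses the usual inverse-Boltzmann convention with a leading minus sign, so that pairs observed more often than chance get a negative (favourable) energy. The constant in kcal/(mol·K), the default temperature of 298 K, and the use of the population standard deviation are assumptions chosen for illustration.

```python
import math
import statistics

K_B = 0.0019872  # Boltzmann constant in molar units, kcal/(mol*K)


def pmf(observed_freq, expected_freq, temperature=298.0):
    """Potential of mean force for one residue pair at one distance bin."""
    return -K_B * temperature * math.log(observed_freq / expected_freq)


def z_score(native_energy, alternative_energies):
    """How many standard deviations the native energy lies from the mean
    energy of the alternative (wrong) structures."""
    mean_alt = statistics.mean(alternative_energies)
    sd_alt = statistics.pstdev(alternative_energies)  # population SD (assumed)
    return (native_energy - mean_alt) / sd_alt


print(pmf(0.2, 0.1))                     # pair seen twice as often as chance
print(z_score(-10, [-2, 0, 2, -4, 4]))   # toy decoy energies
```

A strongly negative Z-score means the native structure is well separated from the decoys, which is the property a threading score function is meant to have.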
Scoring Different Folds
• Goodness of fit score
– Based on empirical energy
function
– Modify to take into account
pairwise interactions and
solvation terms
– High score means good fit
– Low score means nothing
learned
Some Threading Programs
• 3D-pssm (ICNET). Based on sequence profiles, solvation potentials and
secondary structure.
• TOPITS (PredictProtein server) (EMBL). Based on coincidence of
secondary structure and accessibility.
• UCLA-DOE Structure Prediction Server (UCLA). Executes various threading
programs and reports a consensus.
• 123D+. Combines substitution matrix, secondary structure prediction, and
contact capacity potentials.
• SAM/HMM (UCSC). Based on Markov models of alignments of crystallized
proteins.
• FAS (Burnham Institute). Based on profile-profile matching algorithms of the
query sequence with sequences from clustered PDB database.
• PSIPRED-GenThreader (Brunel)
• THREADER2 (Warwick). Based on solvation potentials and contacts
obtained from crystallized proteins.
• ProFIT CAME (Salzburg)
Process of 3D Structure
Prediction by Threading
• Does this protein have sequence similarity to others with a
known structure?
• Structure related information in the databases
• Results from threading programs
• Predicted folding comparison
• Threading on the structure and mapping of the
known data
• A comparison between the threading predicted
structure and the actual one
Protein Threading Based on Multiple Protein
Structure Alignment
Tatsuya Akutsu and Kim Lan Sim
Human Genome Center, Institute of Medical Science,
University of Tokyo
• NP-hard if interactions between 2 or
more amino acids are included
• Determine multiple structural alignments based
on pair wise structure alignments
– Center Star Method
Center Star Method
• Let I_0 be the maximum number of gap symbols placed before the
first residue of S0 in any of the alignments A(S0, S1), ..., A(S0, SN).
Let I_|S0| be the maximum number of gaps placed after the last
character of S0 in any of the alignments, and let I_i be the maximum
number of gaps placed between characters S0,i and S0,i+1, where Sj,i
denotes the i-th letter of string Sj.
• Create a string S0' by inserting I_0 gaps before S0, I_|S0| gaps after S0,
and I_i gaps between S0,i and S0,i+1.
• For each Sj (j > 0), create a pairwise alignment A(S0', Sj) between S0'
and Sj by inserting gaps into Sj so that deletion of the columns
consisting only of gaps from A(S0', Sj) results in the same alignment as
A(S0, Sj).
• Simply arrange the A(S0', Sj)'s into a single matrix A (note that all A(S0',
Sj)'s have the same length).
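The merging step of the center star method can be sketched as follows. The sketch assumes the pairwise alignments of the center S0 against each other sequence are already available as pairs of equal-length gapped strings; how those alignments are computed (here, from structure alignments) is outside its scope, and the function names are hypothetical.

```python
def gaps_per_slot(aligned_center):
    """Count gaps before the first residue, between consecutive residues,
    and after the last residue of the center. Length is len(S0) + 1."""
    counts = [0]
    for ch in aligned_center:
        if ch == "-":
            counts[-1] += 1
        else:
            counts.append(0)
    return counts


def center_star_merge(pairwise):
    """pairwise: list of (aligned_center, aligned_other), all sharing the
    same center sequence. Returns the rows of the merged alignment, with
    the padded center first."""
    slot_counts = [gaps_per_slot(center) for center, _ in pairwise]
    # Maximum number of gaps needed in each slot over all alignments.
    max_gaps = [max(column) for column in zip(*slot_counts)]

    center = pairwise[0][0].replace("-", "")
    padded_center = "-" * max_gaps[0]
    for i, residue in enumerate(center, start=1):
        padded_center += residue + "-" * max_gaps[i]

    rows = [padded_center]
    for (aligned_center, aligned_other), counts in zip(pairwise, slot_counts):
        row, slot = [], 0
        for c_char, o_char in zip(aligned_center, aligned_other):
            if c_char != "-":
                # Leaving a slot: pad it up to the common width.
                row.append("-" * (max_gaps[slot] - counts[slot]))
                slot += 1
            row.append(o_char)
        row.append("-" * (max_gaps[slot] - counts[slot]))  # final slot
        rows.append("".join(row))
    return rows


merged = center_star_merge([("A-CD", "ABCD"), ("ACD-", "ACDE")])
print("\n".join(merged))
```

For the toy input the rows are A-CD-, ABCD- and A-CDE, all of equal length, and removing the all-gap columns from the first row paired with any other row recovers the original pairwise alignment.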
Simple Threading Algorithm
• Apply simple score function based on structure alignment algorithm
– Let X = x1……xN (input amino acid sequence)
– Ci ( i-th column in A)
• Test and analyze results and/or apply constraints
Protein Threading with Constraints
• Assume part of the input sequence xi…xi+k must
correspond to part of the structure alignment
cj…cj+k
• Apply constraints
Prediction Power
• Entered in CASP3 competition
• 17 predictions made
• 3 targets evaluated as similar to correct folds
• Only team to create a nearly correct model for
structure T0043
• Best in competition
– 8 evaluated as similar to correct
Next time….
• In depth detail of
– Multiple structural alignment program
• Multiprospector
– Global Optimum Protein Threading with
Gapped Alignment
• Quality measures for protein threading
models
• Improvements on threading-based models
Fragment based method
1 - Predict structure of segment
Trial structures for a local sequence are taken from a database of
segments of known 3D structure.
2 - Construct trial model from segments
3 - Identify good trial structures
– Low resolution energy function used in initial search through
conformational space
– Side chains represented by single "centroid" pseudoatom
– Major contributions from hydrophobic burial, beta strand pairing,
steric overlap, and specific residue pair interactions
– Models then refined using explicit rotamer based side chain
representation and potential from design method
Fragment-based protein folding
[Figure: Cro repressor (1orc), with a panel labelled "observed"]
Computational Requirements
• Methodology performs numerous simulations
and looks for clusters
• One simulation takes 3 mins (3 GHz)
• Require 1,000 simulations per protein = 2 days
• Benchmark on 50 proteins = 100 days
3D-GENOMICS - proteome annotation
[Diagram of the annotation procedure, with the labels: proteome sequences, database sequences, database structures, functional data, MySQL database, new research, WWW]
Types of annotation
• No similar sequence - orphan
• E. coli Protein325 - homology but no function
• Enzyme ABC, EC 1.2.3.4 - function suggested
• structure
• membrane protein
3D-Genomics database - structural and functional annotation
Computational requirements
• Today 800,000 protein sequences.
• Each sequence 15 mins to annotate on 2.5GHz cpu.
• Time today = 8,000 cpu days = 2.5 months with 100 processor farm.
• Need to update every 6 months.
• The number of sequences will double in 2-3 years, and so will keep pace with
the increase in compute power.
Modelling protein-protein
docking
Protein-protein docking
[Flowchart: coordinates of mol 1 and coordinates of mol 2 → rigid body search → list of possible complexes → evaluate association energy → flexibility to refine → list of complexes; "experimental information" is a further input]
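Step 1 - Generating Complexes
Shape complementarity
[Figure: the two molecules on grids A(i,j,k) and B(l,m,n), with grid values +1 and -15; an overlap contributes +1 x -15, a match contributes +1 x +1]
C = ΣΣΣ A(i,j,k) x B(l,m,n)
Electrostatic complementarity
[Figure: grids with values +1 and -1; charge in molecule 1 = Q(i,j,k), potential outside molecule 2 = V(l,m,n)]
E = ΣΣΣ Q(i,j,k) x V(l,m,n)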
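The shape-complementarity sum can be illustrated with a direct loop over grid cells. This is a minimal sketch under stated assumptions: each grid is a dictionary from (i, j, k) to a value, molecule A carries -15 in its interior and +1 on its surface layer, molecule B carries +1, and only translations are searched (no rotations). Production docking codes usually evaluate this correlation with fast Fourier transforms rather than the direct loop used here.

```python
from itertools import product


def shape_score(grid_a, grid_b, shift):
    """Sum of A x B over all cells of B after translating B by `shift`.
    Cells missing from a grid count as 0 (empty space)."""
    dx, dy, dz = shift
    total = 0
    for (i, j, k), b_value in grid_b.items():
        total += grid_a.get((i + dx, j + dy, k + dz), 0) * b_value
    return total


def best_translation(grid_a, grid_b, max_shift):
    """Try every integer translation within +/- max_shift on each axis and
    return (best_score, shift)."""
    shifts = product(range(-max_shift, max_shift + 1), repeat=3)
    return max((shape_score(grid_a, grid_b, s), s) for s in shifts)


# Toy molecule A: two interior cells (-15) and one surface cell (+1).
# The surface is on one side only, to keep the optimum unique.
grid_a = {(0, 0, 0): -15, (1, 0, 0): -15, (2, 0, 0): 1}
# Toy molecule B: two cells, each +1.
grid_b = {(0, 0, 0): 1, (1, 0, 0): 1}

print(best_translation(grid_a, grid_b, 3))
```

Placing B so that it touches the surface cell of A scores +1, while any placement that penetrates the interior is penalised by -15 per overlapping cell. The electrostatic term has the same form, with Q and V in place of A and B.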
Step 2 - Modelling residue-residue interactions
Empirical residue pair potentials
• Analyse residue packing across 90 hetero-protein interfaces
• A pair of residues (a, b) packs if there is at least one atom-atom contact
within the distance cut-off (4.5 Å)
• Score(a,b) = log10( (Observed no. of a/b pairs) / (Expected no. of a/b pairs) )
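A log-odds score of this form can be computed from a list of observed interface contacts. The slide does not say how the expected number of pairs is obtained, so the sketch below assumes random pairing in proportion to how often each residue type occurs in the contact list; the function name and the toy contact list are also assumptions.

```python
import math
from collections import Counter


def pair_scores(contacts):
    """contacts: list of (residue_a, residue_b) pairs that pack across an
    interface. Returns {(a, b): log10(observed / expected)} with a <= b."""
    observed = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter(res for pair in contacts for res in pair)
    total_pairs = len(contacts)
    total_residues = 2 * total_pairs

    scores = {}
    for (a, b), n_observed in observed.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # Unordered pairs: a/b and b/a are the same contact.
        probability = freq_a * freq_b if a == b else 2 * freq_a * freq_b
        expected = probability * total_pairs
        scores[(a, b)] = math.log10(n_observed / expected)
    return scores


# Toy data: E/K and V/I contacts are common, E/V contacts are rare.
contacts = [("E", "K")] * 6 + [("V", "I")] * 3 + [("E", "V")]
for pair, score in sorted(pair_scores(contacts).items()):
    print(pair, round(score, 3))
```

Pairs that occur more often than expected get a positive score and pairs that occur less often get a negative one, so summing the scores over an interface rewards the kinds of contacts seen in known complexes.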
Step 3 - Including information
about functional residues
From literature
Step 4 - Refinement by multicopy
• Search for optimal combination of side-chain rotamers
by energy calculation
• Limited rigid-body shifts
CAPRI - blind test of docking
[Figure labels: bound Ab - X-ray; bound Ab - predicted; unbound; amylase]
Prediction / Actual: difference = 0.6 Å
Computational Requirements
• 1 run of the procedure takes 2 days on one 3 GHz
processor
• Development tested on 30 protein complexes
takes 60 days for one parameter set
• Applications
– extension to predict which protein interacts
with another requires 1000s of docking
simulations
Application area
• Protein structure prediction
– fold recognition
– simulation
• Proteome annotation
• Protein-protein docking
Computing cost
• Modelling algorithm on one protein takes 10
mins to 2 days on one 3 GHz cpu
• But algorithm development requires
consideration of several structures (50-100) with different parameter sets.
• Hence years of cpu time required
Structure prediction & sequence space
[Figure: schematic blocks of placeholder sequences]
Multiple sequence alignments aid
comparative protein modeling
• 1 in 3 sequences are recognizably related to at least one
protein structure.
• A significant fraction of the remaining 2/3 have solved
structural homologues, but they are not recognized through
sequence similarity searching techniques. (Marti-Renom et al., 2000)
• Multiple sequence alignments greatly improve the efficacy and
accuracy of almost all phases of comparative modeling. (Venclovas, 2001)
Computational protein design
[Diagram labels: native structure, iterative refinement, new sequence]
Large scale sequence generation
"Reverse BLAST" study:
• Total structures: 264
• Total backbone variants: 26,400
• Total time of data collection: 80 days
• Processors available: 4,000
• Total sequences generated: 200,000
"Reverse BLAST": finding templates for comparative modeling
Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure,
Function, and Genetics
Experiment: Sequence quality
[Diagram labels: design; a set of designed sequences; BLAST, E<0.01]
Results: Sequence quality
[Plot with axis labels: designed sequence profile (ranked by E-value); E-value of best PDB hit; average identity to native sequence (%)]
Method: “Reverse BLAST”
[Diagram labels: designed sequences; hypothetical proteins; BLAST, E<0.01; structural templates]
Do the designed sequences help?
[Plot: correctly identified structural templates; series labelled fold-increase in # of templates, fold-increase in # of genes, total hits]
Remote homology detection
Optimizing structural diversity
[Plot: x-axis "RMSD of structural ensemble (Angstroms)"; y-axes in % and sequence entropy; series labelled sequence entropy, prediction accuracy, prediction coverage, mean pairwise %ID, mean native %ID]