Protein analysis 3

Protein analysis 7
Prediction of three-dimensional structure
http://www.bioinfo.biocenter.helsinki.fi/downloads/teaching/spring2006/proteiinianalyysi
From sequence to structure
• comparative modelling
• prediction of a 1-dimensional state (class) from sequence
• recognition of the 3-dimensional structure from a given library (fold recognition)
• ab initio prediction of 3-dimensional structure
Motivation
• Protein structure determines protein
function
• For the majority of proteins the structure is
not known
Structural coverage
[Figure: number of known sequences vs. number of known structures; axis from 0 to 1,500,000]
Divergence of common cores
• fraction in core decreases with increasing sequence divergence
[Figure: rmsd of main chain atoms [Å] vs. fraction of mutated residues; curve fitted to data for homologous families]
Chothia & Lesk (1986)
Steps in comparative modelling
• Find suitable template(s)
• Build alignment between target and
template(s)
• Build model(s)
– Replace sidechains
– Resolve conflicts in the structure
– Model loops (regions without an alignment)
• Evaluate and select model(s)
State of the art in homology
modelling
• Template search
– (iterative) sequence database searches
(PSIBLAST)
• Alignment step
– multiple alignment of close to fairly distant
homologues
• Modelling step
– rigid body assembly
– segment matching
– satisfaction of spatial constraints
An alignment defines structurally
equivalent positions!
[Figure: template structure and template sequence, aligned to the target sequence, give the model]
The crucial importance of the
alignment
[Figure: template sequence and template structure, aligned to the target sequence, give the model]
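How an alignment defines equivalent positions can be sketched in a few lines. This is only an illustration under stated assumptions: the sequences, the coordinates and the function name `transfer_coordinates` are hypothetical, and real modelling programs do far more than copy coordinates.

```python
def transfer_coordinates(target_aln, template_aln, template_coords):
    """Give each target residue the coordinates of its aligned template residue.

    target_aln, template_aln: aligned strings of equal length, '-' marks a gap.
    template_coords: one (x, y, z) tuple per template residue (e.g. C-alpha atoms).
    Returns one entry per target residue; None marks residues that have no
    template equivalent and would have to be built as loops.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model = []
    template_index = 0
    for target_char, template_char in zip(target_aln, template_aln):
        if template_char != "-":
            coord = template_coords[template_index]
            template_index += 1
        else:
            coord = None  # insertion in the target: nothing to copy
        if target_char != "-":
            model.append(coord)
    return model


# Hypothetical 4-residue template (ACGD) and 4-residue target (ACDE).
template_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = transfer_coordinates("AC-DE", "ACGD-", template_coords)
```

A shifted alignment would copy the wrong coordinates to every residue after the shift, which is why the alignment step matters so much.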
Modelling by spatial restraints
• Generate many constraints:
– Homology derived constraints
• Distances and angles between aligned positions
should be similar
– Stereochemical constraints
• Bond lengths, bond angles, dihedral angles,
nonbonded atom-atom contacts
• Model derived by minimizing restraints
Modeller: Sali & Blundell (1993)
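The idea of deriving a model by minimizing restraint violations can be illustrated with a toy example. This is not Modeller's objective function or optimizer: it assumes simple harmonic distance restraints and plain gradient descent, and all names and numbers are hypothetical.

```python
import math


def restraint_energy(coords, restraints):
    """Sum of harmonic penalties (d - d0)^2 over (i, j, d0) distance restraints."""
    return sum((math.dist(coords[i], coords[j]) - d0) ** 2 for i, j, d0 in restraints)


def minimize(coords, restraints, steps=2000, learning_rate=0.05):
    """Plain gradient descent on the restraint energy; returns new coordinates."""
    coords = [list(c) for c in coords]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i, j, d0 in restraints:
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # gradient is undefined for coincident points
            factor = 2.0 * (d - d0) / d
            for k in range(3):
                g = factor * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for c, g in zip(coords, grads):
            for k in range(3):
                c[k] -= learning_rate * g[k]
    return coords


# Three points with target distances, e.g. taken from aligned template positions.
restraints = [(0, 1, 3.8), (1, 2, 3.8), (0, 2, 6.0)]
start = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 0.0, 1.0)]
model = minimize(start, restraints)
```

In a real application the homology-derived and stereochemical restraints would be combined in one objective function, and the optimizer would be more elaborate.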
Loop modelling
• Exposed loop regions usually more
variable than protein core
• Often very important for protein function
• Loops longer than 5 residues are difficult to
build
• Mini-protein folding problem
Model evaluation
• Check of stereochemistry
– bond lengths & angles, peptide bond planarity,
side-chain ring planarity, chirality, torsion
angles, clashes
• Check of spatial features
– hydrophobic core, solvent accessibility,
distribution of charged groups, atom-atom distances, atomic volumes, main-chain
hydrogen bonding
• 3D profiles/mean force potentials
– residue environment
Knowledge-based mean force potentials
• Compute typical atomic/residue
environments based on known protein
structures
Melo & Feytmans (1997)
Modelling a transcription factor
Ligand
DNA
• Sequence
from different
species
• Is binding to
ligand
conserved?
Ligand binding domain
hydrogen bonds to ligand
homo-serine lactone moiety binding
acyl moiety binding
DNA binding domain
Linker
DNA binding domain
New Loop
Template
Target
Variable loops
MODELLER output
Ligand binding pocket
Errors in comparative modelling
a) Side chain packing
b) Distortions and shifts
c) Loops
d) Misalignments
e) Incorrect template
True structure
Template
Model
Marti-Renom et al. (2000)
Modelling accuracy
Marti-Renom et al. (2000)
Applications of
homology
modelling
Marti-Renom et al. (2000)
Structural genomics
• Post-genomics:
– many new sequences, no function
• Aim: a structure for every protein
• High-throughput structure determination
– robotics
– standard protocols for
cloning/expression/crystallization
Structural coverage
high quality models
Complete models
Total = 43 %
Vitkup et al. (2001)
Target selection
Fold recognition - Assumption
• Native structure is the global minimum
energy conformation
• So, need
– Discriminating energy function
– Conformation generator
• Backbone from homologous template (comparative
modelling)
• Backbone from analogous template (fold
recognition)
• Comprehensive sampling (ab initio)
Fold recognition steps
• Template library
– Known structures from Protein Data Bank
– Fold classification suggests a limited number of fold
types
• Score = sequence-structure fitness
– Environmental preferences of amino acids
– Boltzmann engine
• Search problem = alignment
– Complicated with pair potentials
• Significance of best score in database search
– Reference state
Potentials of mean force
• “Boltzmann engine”
• In thermodynamic equilibrium, particles
are partitioned between states
proportionally to exp(-ΔG/kT)
• Effective energy = negative logarithm of
the equilibrium constant
– Count occurrences per state
– Radial distribution of aa pairs (Sippl)
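The "count occurrences per state" step can be sketched as an inverse Boltzmann calculation. The counts below are invented for illustration. The reference state (amino acid and state treated as independent) and the pseudocount are assumptions of this sketch, not something stated on the slide.

```python
import math
from collections import Counter


def mean_force_potential(observations, pseudocount=1.0):
    """E(aa, state) = -ln(P_observed / P_expected), in units of kT.

    observations: list of (amino_acid, state) pairs counted in known structures.
    P_expected assumes amino acid type and state are independent.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    state_counts = Counter(state for _, state in observations)
    n = len(observations)
    n_cells = len(aa_counts) * len(state_counts)
    energy = {}
    for aa in aa_counts:
        for state in state_counts:
            p_obs = (pair_counts[(aa, state)] + pseudocount) / (n + pseudocount * n_cells)
            p_exp = (aa_counts[aa] / n) * (state_counts[state] / n)
            energy[(aa, state)] = -math.log(p_obs / p_exp)
    return energy


# Hypothetical counts: leucine is mostly buried, lysine is mostly exposed.
observations = (
    [("L", "buried")] * 80 + [("L", "exposed")] * 20
    + [("K", "buried")] * 10 + [("K", "exposed")] * 90
)
energy = mean_force_potential(observations)
```

Frequently observed combinations get negative (favourable) energies, rarely observed ones positive energies.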
Structural environment
• Single-residue preferences 20 x 3 x 3 x 3
– Helix, strand, coil
– Accessibility
– Contact area (indirectly codes for aa type)
• Contact pair potentials
– Atomic contacts within 4 Å
– C-beta atoms within 7 Å
– Secondary structure of residues i and j
• 3 x 20 x 3 x 20 = 3600 preferences
Information content
Arg-Asp helix-helix
(dashed)
Arg-Asp strand-strand
(solid)
Arg-Asp (dotted)
Threading algorithms
• Dynamic programming
– Simple
– “frozen approximation”
• Read sequence-dependent environment from template (1st
round), then from aligned target sequence
• Stochastic optimization (Monte Carlo)
– Pair potentials
• Exhaustive search
– Simplify search space (e.g., ignore loops)
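The simple dynamic programming case (single-residue environment preferences only, no pair potentials) can be sketched as a global alignment between a sequence and a string of structural environments. The environment alphabet, the preference values and the linear gap penalty are hypothetical choices made for this sketch.

```python
def thread(sequence, environments, preference, gap=-1.0):
    """Best global alignment score of a sequence against template environments.

    environments: one letter per template position (here B = buried, E = exposed).
    preference: dict mapping (amino_acid, environment) to a fitness score.
    """
    n, m = len(sequence), len(environments)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            fit = preference.get((sequence[i - 1], environments[j - 1]), 0.0)
            score[i][j] = max(
                score[i - 1][j - 1] + fit,  # residue i placed at position j
                score[i - 1][j] + gap,      # residue i left unaligned
                score[i][j - 1] + gap,      # template position j skipped
            )
    return score[n][m]


preference = {("L", "B"): 1.0, ("L", "E"): -1.0, ("K", "E"): 1.0, ("K", "B"): -1.0}
good = thread("LLKK", "BBEE", preference)
poor = thread("KKLL", "BBEE", preference)
```

With pair potentials the score of a position depends on which residues occupy its neighbours, so this recursion no longer applies directly. That is where the frozen approximation or stochastic search comes in.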
Prospect model (Xu & Xu)
$E_{total} = v_{mutate} E_{mutate} + v_{single} E_{single} + v_{pair} E_{pair} + v_{gap} E_{gap}$
Weights v optimized on training set
Prospect - segmentation
- Finds optimal threading fairly efficiently
- Topological complexity
- No gaps in secondary structure elements
- Pair energy term only evaluated between
secondary structure elements
Prospect - observations
• Mutation energy is the most important
• Single-residue terms with profile
information generate reasonably good
alignments for ~2/3 of test cases
• The pairwise energy term can thus be
ignored during the search for optimal
alignment, but is used in evaluating the
fold recognition
Performance comparison
Method        Family only     Superfamily     Fold only
              Top 1   Top 5   Top 1   Top 5   Top 1   Top 5
Using pair potential
PROSPECT      84.1    88.2    52.6    64.8    27.7    50.3
Using dynamic programming, structural environment
FUGUE         82.2    85.8    41.9    53.2    12.5    26.8
THREADER      49.2    58.9    10.8    24.7    14.6    37.7
Using sequence similarity only
PSI-BLAST     71.2    72.3    27.4    27.9     4.0     4.7
HMMER         67.7    73.5    20.7    31.3     4.4    14.6
SAMT98        70.1    75.4    28.3    38.9     3.4    18.7
BLASTLINK     74.6    78.9    29.3    40.6     6.9    16.5
SSEARCH       68.6    75.7    20.7    32.5     5.6    15.6
Threading score - significance
• Target sequence – fold library
– Each threading aligns a different subsequence
• Compute Z-score for each by ungapped threading
on large decoy (Sippl)
• “Reverse threading”
– Design optimal sequence for a given fold
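The Z-score step can be written down directly: the score of a threading is compared with the distribution of scores obtained on decoys. The energies below are invented, and the use of the population standard deviation is a choice of this sketch.

```python
import statistics


def z_score(score, decoy_scores):
    """Number of standard deviations between a score and the decoy mean."""
    mean = statistics.mean(decoy_scores)
    sd = statistics.pstdev(decoy_scores)
    return (score - mean) / sd


# Hypothetical energies: lower is better, so a good hit has a strongly negative Z.
decoy_scores = [-10.0, -12.0, -8.0, -11.0, -9.0]
z = z_score(-17.0, decoy_scores)
```

Because each threading aligns a different subsequence, raw scores are not comparable between folds. Normalizing against decoys puts them on a common scale.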
Incorrect
self-threading
Fold recognized
Fold recognized
Poor alignment of residues
Ab initio prediction
• HMMSTR/I-sites/Rosetta
HMMSTR is a Hidden Markov Model based on protein
STRucture. Each Markov state in this model represents
a position in one of the I-sites motifs. HMMSTR can
predict local structure (as backbone angles), secondary
structure, and supersecondary structure (edge versus
middle strand, hairpin versus diverging turn).
• I-sites Library
I-sites is a library of folding initiation site motifs, which are
sequence motifs that correlate with particular local
structures such as beta hairpins and helix caps. I-sites
can be used to predict local structure, or to predict which
parts of a protein are likely to fold early, initiating folding.
Folding is 2-state
Unfolded
Folded
Intermediates are not observed, but something happens first...
Nucleation sites
Early folding events might be
recorded in the database
Non-homologous proteins
Short, recurrent sequence patterns could be folding initiation sites
recurrent part: MQTIFF
HDFPIEGGDSPMQTIFFWSNANAKLSHGY
CPYDNIWMQTIFFNQSAAVYSVLHLIFLT
IDMNPQGSIEMQTIFFGYAESA
ELSPVVNFLEEMQTIFFISGFTQTANSD
INWGSMQTIFFEEWQLMNVMDKIPS
IFNESKKKGIAMQTIFFILSGR
PPPMQTIFFVIVNYNESKHALWCSVD
PWMWNLMQTIFFISQQVIEIPS
MQTIFFVFSHDEQMKLKGLKGA
Nature has selected for these patterns because they
speed folding.
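As a toy illustration of finding a recurrent pattern, the sketch below searches for the longest substring shared by all of the example sequences from the slide. This is a deliberate simplification: a real motif library would be built from sequence profiles and local structure, not from exact substring matches. The function name is hypothetical.

```python
def longest_shared_pattern(sequences):
    """Return the longest substring present in every sequence ('' if none)."""
    shortest = min(sequences, key=len)
    for k in range(len(shortest), 0, -1):
        for start in range(len(shortest) - k + 1):
            candidate = shortest[start:start + k]
            if all(candidate in seq for seq in sequences):
                return candidate
    return ""


sequences = [
    "HDFPIEGGDSPMQTIFFWSNANAKLSHGY",
    "CPYDNIWMQTIFFNQSAAVYSVLHLIFLT",
    "IDMNPQGSIEMQTIFFGYAESA",
    "ELSPVVNFLEEMQTIFFISGFTQTANSD",
    "INWGSMQTIFFEEWQLMNVMDKIPS",
    "IFNESKKKGIAMQTIFFILSGR",
    "PPPMQTIFFVIVNYNESKHALWCSVD",
    "PWMWNLMQTIFFISQQVIEIPS",
    "MQTIFFVFSHDEQMKLKGLKGA",
]
pattern = longest_shared_pattern(sequences)  # "MQTIFF"
```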
How to read an I-sites motif
profile
Backbone angles and sequence
pattern for Amphipathic alpha-helix
Superposition of the top scoring 30
true-positives
Conserved polar (green) and nonpolar (purple) sidechains
Serine alpha-N-cap
HMMSTR
A Markov state. A hidden Markov model
consists of Markov states connected by
directed transitions. Each state emits an
output symbol, representing sequence or
structure. There are four categories of
emission symbols in our model: b, d, r, and
c, corresponding to amino acid residues,
three-state secondary structure, backbone
angles (discretized into regions of phi-psi
space) and structural context (e.g. hairpin
versus diverging turn, middle versus end-strand), respectively.
Bystroff C, Thorsson V & Baker D. (2000). HMMSTR: A
hidden Markov model for local sequence-structure
correlations in proteins. Journal of Molecular Biology
301, 173-90.
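A Markov state with the four emission categories could be represented roughly as follows. This is a data-structure sketch only, not the HMMSTR implementation. The symbol alphabets, the probabilities and the assumption that the four categories are emitted independently given the state are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MarkovState:
    b: dict  # amino acid residue -> probability
    d: dict  # three-state secondary structure -> probability
    r: dict  # backbone angle region of phi-psi space -> probability
    c: dict  # structural context -> probability
    transitions: dict = field(default_factory=dict)  # next state name -> probability

    def emission_probability(self, aa=None, ss=None, region=None, context=None):
        """Product over the observed categories; unobserved ones are skipped."""
        p = 1.0
        for table, symbol in ((self.b, aa), (self.d, ss), (self.r, region), (self.c, context)):
            if symbol is not None:
                p *= table.get(symbol, 0.0)
        return p


# Hypothetical state inside a turn motif.
state = MarkovState(
    b={"G": 0.6, "N": 0.4},
    d={"L": 0.9, "H": 0.1},
    r={"alpha-left": 0.7, "beta": 0.3},
    c={"hairpin": 0.8, "diverging turn": 0.2},
    transitions={"next_state": 1.0},
)
p = state.emission_probability(aa="G", ss="L")
```

When only the sequence is known, only the `b` emissions are scored. The other categories can then be predicted from the most probable states.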
Merging of two I-sites motifs to form an HMM.
Sequence alignment
VIVAANRSA
VIVSAARTA
VIASAVRTA
VIVDAGRSA
VIASGVRTA
VIVAAKRTA
VIVSAVRTP
VIVSAARTA
VIVSAVRTP
VIVDAGRTA
VIVDAGRTA
VIVSGARTP
VIVDFGRTP
VIVSATRTP
VIVSATRTP
VIVGALRTP
VIVSATRTP
VIVSATRTP
VIASAARTA
VIVDAIRTP
VIVAAYRTA
VIVSAARTP
VIVDAIRTP
VIVSAVRTA
VIVAAHRTA
Sequence Profiles
Sequence profile
$P_{ij} = \frac{\sum_{k \in seqs} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in seqs} w_k}$
Red = high prob ratio (LLR>1)
Green = background prob ratio (LLR≈0)
Blue = low prob ratio (LLR<-1)
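The profile formula amounts to a weighted count per alignment column. The sketch below computes it for the first four sequences of the alignment shown earlier. Uniform sequence weights are an assumption made here, since the slide does not say how the weights w_k are chosen.

```python
from collections import defaultdict


def sequence_profile(alignment, weights=None):
    """P[j][aa] = sum_k w_k * [s_kj == aa] / sum_k w_k for every column j."""
    if weights is None:
        weights = [1.0] * len(alignment)  # assumption: all sequences weigh the same
    total = sum(weights)
    profile = []
    for j in range(len(alignment[0])):
        column = defaultdict(float)
        for seq, w in zip(alignment, weights):
            column[seq[j]] += w
        profile.append({aa: value / total for aa, value in column.items()})
    return profile


alignment = ["VIVAANRSA", "VIVSAARTA", "VIASAVRTA", "VIVDAGRSA"]
profile = sequence_profile(alignment)
```

Dividing each P_ij by the background frequency of the amino acid and taking the logarithm gives the log-likelihood ratios that the colour scale refers to.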
I-sites motifs
diverging type-2
turn
Serine
hairpin
Proline helix C-cap
Backbone
angles:
ψ=green,
φ=red
Amino
acids
arranged
from nonpolar to
polar
alpha-alpha corner
Type-I
hairpin
Frayed
helix
glycine helix N-cap
Why do I-sites exist?
1. They are ancient conserved
regions?
2. They fold independently?
Patterns of conservation suggest
independent folding
1. backbone angle constraints
2. sidechain contacts
3. negative design
NMR structures confirm independent folding
diverging turn
motif
NMR structure of a 7-residue I-sites motif in isolation (Yi et al, J. Mol. Biol, 1998)
Fold prediction – Rosetta
method
• Knowledge based scoring function
Bayes' law:
$P(\text{structure} \mid \text{sequence}) = \frac{P(\text{structure}) \, P(\text{sequence} \mid \text{structure})}{P(\text{sequence})}$
P(sequence|structure) = f(residue contacts in native structures)
sequence consistent
local structure
protein-like
structures
near-native structures
P(structure) = probability of a protein-like structure
(no clashes, globular shape)
Simons et al. (1997)
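Since P(sequence) is the same for every candidate structure of a given sequence, candidates can be ranked by the numerator alone. A minimal sketch with invented probabilities for two hypothetical candidates:

```python
import math


def log_posterior_score(p_structure, p_sequence_given_structure):
    """ln P(structure | sequence) up to the constant term -ln P(sequence)."""
    return math.log(p_structure) + math.log(p_sequence_given_structure)


# Hypothetical candidates: (P(structure), P(sequence | structure)).
candidates = {
    "compact": (1e-2, 1e-6),   # protein-like, contacts fit the sequence moderately
    "extended": (1e-4, 1e-5),  # not protein-like, although the local fit is better
}
scores = {name: log_posterior_score(*p) for name, p in candidates.items()}
best = max(scores, key=scores.get)
```

The negative of this log score plays the role of an energy: the lower it is, the more probable the structure.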
Rosetta
(1) A stone with three ancient languages on it.
(2) A program (David Baker) that simulates the folding of a
protein, using statistical energies and moves.
The “Folding Problem”
Two parts:
(1) The “Search Problem”
Is the true structure one of my 2 million guesses?
(2) The “Discrimination Problem”
If it’s one of these 2 million, which one is it?
Rosetta
Fragment insertion Monte Carlo
[Flow diagram: backbone torsion angles → choose fragment from moveset → change backbone angles → convert angles to 3D coordinates → energy function → accept or reject]
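The loop in the diagram can be sketched as a Metropolis Monte Carlo over backbone torsion angles. This is not the Rosetta code. The toy energy below works directly on the angles and skips the conversion to 3D coordinates, and the moveset, the temperature and all names are hypothetical.

```python
import math
import random


def fragment_insertion_mc(angles, moveset, energy, steps=1000, temperature=1.0, seed=0):
    """Metropolis Monte Carlo with fragment insertion moves.

    angles: list of (phi, psi) tuples, one per residue.
    moveset: list of (start, fragment), fragment being a list of (phi, psi).
    energy: function mapping a list of (phi, psi) to a float (lower is better).
    Returns the lowest-energy conformation seen and its energy.
    """
    rng = random.Random(seed)
    current = list(angles)
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        start, fragment = rng.choice(moveset)            # choose fragment from moveset
        trial = list(current)
        trial[start:start + len(fragment)] = fragment    # change backbone angles
        trial_e = energy(trial)                          # evaluate energy function
        delta = trial_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e          # accept, otherwise reject
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


HELIX, STRAND = (-60.0, -45.0), (-120.0, 130.0)


def toy_energy(angles):
    """Toy stand-in for a real energy function: favours helical angles."""
    return sum((phi - HELIX[0]) ** 2 + (psi - HELIX[1]) ** 2 for phi, psi in angles)


start_angles = [STRAND] * 9
moveset = [(s, [kind] * 3) for s in range(7) for kind in (HELIX, STRAND)]
best, best_e = fragment_insertion_mc(start_angles, moveset, toy_energy)
```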
Rosetta
Backbone angles are restrained in I-sites regions
regions of high-confidence I-sites prediction
moveset
backbone torsion angles
Fragments that deviate from the paradigm (>90° in φ or ψ) are removed from the moveset.
Generally, about one-third of the
sequence has an I-sites prediction with
confidence > 0.75, and is restrained.
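The filtering rule can be sketched as follows, assuming that "paradigm" means a representative list of (φ, ψ) angles for the predicted motif and that a fragment is rejected if any position deviates by more than 90° in either angle. Both are interpretations made for this sketch, and angle differences are taken around the circle.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def filter_moveset(fragments, paradigm, max_deviation=90.0):
    """Keep fragments whose phi and psi stay within max_deviation of the paradigm."""
    kept = []
    for fragment in fragments:
        within = all(
            angular_difference(phi, p_phi) <= max_deviation
            and angular_difference(psi, p_psi) <= max_deviation
            for (phi, psi), (p_phi, p_psi) in zip(fragment, paradigm)
        )
        if within:
            kept.append(fragment)
    return kept


# Hypothetical helical paradigm and three candidate fragments.
paradigm = [(-60.0, -45.0), (-60.0, -45.0)]
fragments = [
    [(-70.0, -40.0), (-55.0, -50.0)],   # close to the paradigm: kept
    [(-120.0, 130.0), (-60.0, -45.0)],  # psi off by 175 degrees: removed
    [(-60.0, -45.0), (60.0, 40.0)],     # phi off by 120 degrees: removed
]
kept = filter_moveset(fragments, paradigm)
```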
Rosetta
Sequence dependent features
Rosetta
Sequence-independent features
Current structure
vector representation
Probabilities from the database
The energy score for a contact between secondary structures
is summed using database statistics.
Rosetta
CASP4 predictions
31 target sequences. Ab initio prediction,
i.e. sequence homolog data was ignored if present.
61% “topologically correct”
60% “locally correct”
73% secondary structure correct
Rosetta
T0116 262-322 (61 residues)
prediction
true structure
Topologically correct (rmsd=5.9Å) but helix is mispredicted as loop.
Rosetta
T0121 126-199 (66 residues)
prediction
true structure
Topologically correct (rmsd=5.9Å) but loop is mispredicted as helix.
Rosetta
T0122 57-153 (97 residues)
prediction
true structure
...contains a 53 residue stretch with max deviation = 96°
Rosetta
T0112
153-213
prediction
true structure
Low rmsd (5.6Å) and all angles correct ( mda = 84°),
but topologically wrong!
(this is rare)
Rosetta
What needs to be fixed?
Turns
8% of the residues in the targets have φ > 0.
44% of these are at glycine residues.
7% of the residues in the predictions have φ > 0,
but only 16% of these are at glycines.
Contact order
$CO = \frac{1}{L \cdot N} \sum^{N} \Delta S_{ij}$
True structure: 0.252
Predictions: 0.119
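Contact order can be computed from a contact list. The sketch uses the usual reading of the symbols, which the slide does not spell out and is therefore an assumption here: L is the number of residues, N the number of contacts, and ΔS_ij the sequence separation of contacting residues i and j. The contact list is invented.

```python
def relative_contact_order(contacts, length):
    """CO = (1 / (L * N)) * sum of |i - j| over the N contacting pairs.

    contacts: iterable of (i, j) residue index pairs that are in contact.
    length: number of residues L in the chain.
    """
    contacts = list(contacts)
    if not contacts:
        return 0.0
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (length * len(contacts))


# Hypothetical 10-residue chain with three contacts (separations 4, 8 and 3).
co = relative_contact_order([(0, 4), (1, 9), (2, 5)], length=10)  # 0.5
```

A lower contact order in the predictions than in the true structures means the predicted contacts are too local in sequence.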
Prediction of protein structure
• The ROSETTA program is the best known
• different models are used to treat the local and nonlocal interactions
• sequence-dependent local interactions bias segments of the chain
to sample distinct sets of local structures
– the distribution of local structures in known three-dimensional structures
is used as an approximation to the distribution of structures sampled by
isolated peptides with the corresponding sequences
• nonlocal interactions select the lowest free-energy tertiary structures
from the many conformations compatible with these local biases
– the primary nonlocal interactions considered are hydrophobic burial,
electrostatics, main-chain hydrogen bonding and excluded volume
• structures are generated by minimizing the nonlocal interaction energy
in the space defined by the local structure distributions, using Monte Carlo
simulated annealing
Using NMR to guide Rosetta
• We have extended the ROSETTA ab initio structure
prediction strategy to the problem of generating models
of proteins using limited experimental data. By
incorporating chemical shift and NOE information and
more recently dipolar coupling information into the
Rosetta structure generation procedure, it has been
possible to generate much more accurate models than
with ab initio structure prediction alone or using the same
limited data sets with conventional NMR structure
generation methodology. An exciting recent development
is that the Rosetta procedure can also take advantage of
unassigned NMR data and hence circumvent the difficult
and tedious step of assigning NMR spectra.
Rosetta in comparative modelling
• We have also developed a method for comparative
modeling that was one of the top performing methods in
the CASP4 experiment. The method utilizes a new
protein sequence-structure alignment method, and
structurally variable regions, such as long loops not
present in the structure of a homologue, are built using a
modification of the Rosetta ab initio structure prediction
methodology. Both the ab initio and the comparative
modeling methods have been implemented in a server
called ROBETTA, which was one of the best all-around
fully automated structure prediction servers in the
CASP5 test.
Prediction algorithms have underlying principles
Darwin = protein evolution.
Principle: Proteins that evolved from common
ancestor have the same fold.
Boltzmann = protein folding
Principle: Proteins search conformational space,
minimizing the free energy.
Summary
• Most prediction methods depend on sequence
homology (Darwin).
• Folding predictions combine statistics and simulations.
• Putative folding initiation sites can be found using database
statistics.
• Knowledge-based energy functions are derived from database
statistics.
• The folding problem is really two problems: the search problem
and the discrimination problem.
• If we knew how proteins fold, we could predict their structures.
• We don’t know how proteins fold.
CASP6 – current status
• Comparative modelling extended to distant
homologues
– Easy: PSI-Blast neighbours
– Hard: indirect PSI-Blast neighbours
• Fold recognition merged with comparative
modelling
• Ab initio methods based on fragment assembly
generate models (among top N predictions) that
have some resemblance to the real structure