ppt - Ronald M. Levy

Structural Bioinformatics II
Protein Structure Modelling
R.S.K. Vijayan
Overview of today's lecture
Levels of Protein structure
Protein Structure Prediction
Secondary Structure Prediction
Chou-Fasman Method
GOR Method
NN based methods
Tertiary Structure Prediction
ab initio based methods
Challenges
Limitations
Overview of Rosetta Method
Overview of CASP and CAMEO
Levels of Protein Structure
There are four levels of protein structure.
Primary structure (1°)
Secondary structure (2°)
Super secondary structure, folds and domains
Tertiary structure (3°)
Quaternary structure (4°)
The primary structure of a protein refers to the amino acid sequence of the polypeptide
chain.
Secondary Structure in Proteins
Secondary structure is the general three-dimensional form of local segments of
proteins.
The Dictionary of Protein Secondary Structure (DSSP) is commonly used to describe
protein secondary structure with single-letter codes.
There are eight different types of secondary structure:
G = 3-turn helix (3₁₀ helix). Min length 3 residues.
H = 4-turn helix (α helix). Min length 4 residues.
I = 5-turn helix (π helix). Min length 5 residues (Extremely rare)
T = hydrogen bonded turn (3, 4 or 5 turn)
E = extended β strand (parallel and/or anti-parallel). Min length 2 residues.
B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
S = bend (the only non-hydrogen-bond based assignment).
C = coil (residues which are not in any of the above conformations).
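The eight DSSP codes listed above are often collapsed to the three states (helix, strand, coil) that prediction methods report. The sketch below shows one way to do this. The particular reduction used here (G, H, I to helix; E, B to strand; everything else to coil) is an assumption, one common convention rather than something stated in this lecture, and the function name is illustrative.

```python
# Descriptions of the eight DSSP single-letter codes, as listed on the slide.
DSSP_CODES = {
    "G": "3-turn helix (3-10 helix)",
    "H": "4-turn helix (alpha helix)",
    "I": "5-turn helix (pi helix)",
    "T": "hydrogen bonded turn",
    "E": "extended beta strand",
    "B": "residue in isolated beta-bridge",
    "S": "bend",
    "C": "coil",
}

# ASSUMPTION: one common 8-to-3 state reduction; other conventions exist.
REDUCTION = {
    "G": "H", "H": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C",   # everything else is treated as coil
}


def reduce_to_three_states(dssp_string):
    """Map a string of 8-state DSSP codes to a 3-state H/E/C string."""
    return "".join(REDUCTION[code] for code in dssp_string)


print(reduce_to_three_states("GGGHHHHTTEEBSC"))
```

Because different reductions move residues between states, accuracy figures reported with different conventions are not directly comparable.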
Protein Tertiary Structure
Tertiary structure refers to the three-dimensional structure of the entire polypeptide
chain.
The tertiary structure is defined by its atomic coordinates and is determined using
techniques such as X-ray crystallography, NMR spectroscopy, and cryo-EM.
The function of a protein depends on its tertiary structure.
Sequence → Structure → Function
Quaternary Structure
Many proteins are made up of a single, continuous polypeptide chain (monomeric).
Some proteins contain two or more polypeptide chains called subunits/chains
(multimeric).
Quaternary structure describes the arrangement of two or more subunits/chains to
form one integral structure in a multiunit protein.
The arrangement of the subunits gives rise to a stable structure.
It includes organizations from simple dimers to large homo-oligomers and complexes.
Subunits may be identical (homo) or different (hetero).
GABA-A ion channel: heteropentamer
HIV protease: homodimer
Levels of Protein Structure
Deciphering the Protein Folding Code
The protein folding problem is the "holy grail" of modern biological research.
Given an amino acid sequence, predict its 3D structure (forward folding problem).
How do proteins fold so quickly? (Levinthal's paradox)
What happens when this process goes awry (when proteins misfold)?
It has been studied for more than four decades.
It is still very much an open problem.
"Inverse Folding" Problem
Given a particular 3D structure (fold), identify amino acid sequences that can adopt this
fold.
There will be a number of sequences compatible with a particular target because
homologous proteins are known to adopt the same fold.
Protein design: rational design of new protein molecules, with the ultimate goal of
designing novel function and/or behavior.
Bioengineering and biomedical applications.
Protein Secondary Structure Prediction
Predicting protein secondary structure from amino acid sequence has been attempted
since the late 1950s.
Secondary structure prediction methods aim to predict the local secondary
structures of proteins based only on knowledge of their primary sequence.
Assigning regions of the amino acid sequence as likely alpha helices, beta strands,
or turns.
The principle behind most secondary structure predictions is to look for patterns of
residue conservation that are indicative of secondary structures like those shown
above.
The early methods suffered from a lack of data.
To date, over 20 different secondary structure prediction methods have been
developed.
Current methods can achieve up to 80% overall accuracy for globular proteins.
The accuracy of current protein secondary structure prediction methods is assessed in
Amino-acids Propensity Values
•The main criterion for alpha helix preference is
that the amino acid side chain should cover and
protect the backbone H-bonds in the core of the
helix.
• Residues that favor helices: Ala, Leu, Met, Phe, Glu, Gln, His, Lys, Arg
Helix breakers:
Gly: side chain (H) is too small to protect the backbone H-bond
Pro: rigid structure (phi = -60°); side chain is linked
to the alpha N
Asp, Asn, Ser: H-bonding side chains compete
directly with backbone H-bonds
Large aromatic residues (Tyr, Phe and Trp) and β-branched amino acids (Thr, Val, Ile) are favored
in β strands in the middle of β sheets.
Because every other side chain in a sheet points in the opposite direction, there is room for
β-branched side chains to pack.
Guzzo AV: The influence of amino acid sequence on protein structure. Biophys J 1965, 5:809–822.
PSSP Applications
Prediction of protein secondary structure provides information that is useful:
a) for ab initio structure prediction
b) as an additional constraint for fold-recognition algorithms
c) to help design site-directed or deletion mutants that preserve the native protein
structure (where and how to subclone protein fragments for expression)
d) for refinement of sequence alignments
e) as a step toward the goal of understanding protein folding (a hierarchical approach
to solving the protein folding problem); secondary structure elements start to form at
specific nucleation points during folding
f) for identifying protein function
The quality of secondary structure prediction is measured by the Q3 score.
The Q3 score combines the three per-state scores Qi (i = helix, sheet, loop), where Qi is the
percentage of correctly predicted residues in state i relative to the total number of experimentally
observed residues in state i. Over all three states:
$$Q_3 = \frac{N_{\text{predicted}}}{N_{\text{observed}}} \times 100$$
where $N_{\text{predicted}}$ is the number of correctly predicted residues and $N_{\text{observed}}$ is the
number of experimentally observed residues.
PSSP Algorithms
There are three generations in PSSP algorithms:
First Generation:
Based on statistical information of single amino acids; limited by the small
number of proteins with solved structures.
Chou-Fasman, 1974 (first approach): uses a combination of statistical and heuristic rules.
GOR, 1978: information-theoretic framework.
Second Generation:
Larger databases and use of statistics based on windows (segments) of amino acids.
Typically a window contains 11-21 amino acids.
The second-level approximation, involving pairs of residues, provides a better model
of local dependencies (GOR3 algorithm).
Third Generation:
Based on the use of evolutionary information.
Incorporates multiple sequence alignments to obtain additional information based on the observed
patterns in sequence variability and the location of insertions and deletions.
Chou and Fasman Algorithm
• Start by computing amino acid propensities to belong to a given type of
secondary structure:
P(α) = P(i | Helix) / P(i),  P(β) = P(i | Beta) / P(i),  P(turn) = P(i | Turn) / P(i)
Propensities > 1 indicate that the residue favors that conformation.

Amino acid   α-Helix   β-Sheet   Turn   Group
Ala          1.29      0.90      0.78   Favors α-helix
Cys          1.11      0.74      0.80   Favors α-helix
Leu          1.30      1.02      0.59   Favors α-helix
Met          1.47      0.97      0.39   Favors α-helix
Glu          1.44      0.75      1.00   Favors α-helix
Gln          1.27      0.80      0.97   Favors α-helix
His          1.22      1.08      0.69   Favors α-helix
Lys          1.23      0.77      0.96   Favors α-helix
Val          0.91      1.49      0.47   Favors β-strand
Ile          0.97      1.45      0.51   Favors β-strand
Phe          1.07      1.32      0.58   Favors β-strand
Tyr          0.72      1.25      1.05   Favors β-strand
Trp          0.99      1.14      0.75   Favors β-strand
Thr          0.82      1.21      1.03   Favors β-strand
Gly          0.56      0.92      1.64   Favors turn
Ser          0.82      0.95      1.33   Favors turn
Asp          1.04      0.72      1.41   Favors turn
Asn          0.90      0.76      1.23   Favors turn
Pro          0.52      0.64      1.91   Favors turn
Arg          0.96      0.99      0.88
Chou and Fasman Algorithm (cont...)
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) >1.
- extension: extend helix in both directions until a set of 4 contiguous
residues has an average P(α) < 1 (breaker).
- if average P(α) over whole region is >1, it is predicted to be helical.
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1.
- extension: extend strand in both directions until a set of 4 contiguous
residues has an average P(β) < 1 (breaker).
- if average P(β) over whole region is > 1, it is predicted to be a strand.
Any region containing overlapping α-helical and β-sheet assignments is taken to be
helical if the average P(α) > P(β) for that region.
It is a β-sheet if the average P(β) > P(α) for that region.
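The nucleation, extension and overlap rules above can be sketched in a few lines of Python. This is a simplified illustration, not the original Chou-Fasman program: the propensities are the helix and sheet columns of the table in this lecture, converted to one-letter codes, while the function names, the half-open interval bookkeeping and the way overlaps are resolved run by run are assumptions made for the sketch. Turn prediction is left out.

```python
from statistics import mean

# Helix and sheet propensities from the lecture's table (one-letter codes).
P_ALPHA = dict(A=1.29, C=1.11, L=1.30, M=1.47, E=1.44, Q=1.27, H=1.22,
               K=1.23, V=0.91, I=0.97, F=1.07, Y=0.72, W=0.99, T=0.82,
               G=0.56, S=0.82, D=1.04, N=0.90, P=0.52, R=0.96)
P_BETA = dict(A=0.90, C=0.74, L=1.02, M=0.97, E=0.75, Q=0.80, H=1.08,
              K=0.77, V=1.49, I=1.45, F=1.32, Y=1.25, W=1.14, T=1.21,
              G=0.92, S=0.95, D=0.72, N=0.76, P=0.64, R=0.99)


def predict_mask(seq, prop, window, min_formers):
    """Mark residues covered by a nucleated and extended region."""
    values = [prop[aa] for aa in seq]
    n = len(values)
    mask = [False] * n
    for i in range(n - window + 1):
        # Nucleation: enough residues in the window with propensity > 1.
        if sum(v > 1 for v in values[i:i + window]) < min_formers:
            continue
        start, end = i, i + window  # half-open interval [start, end)
        # Extension: stop when the 4 residues at the growing edge average < 1.
        while end < n and mean(values[end - 3:end + 1]) >= 1:
            end += 1
        while start > 0 and mean(values[start - 1:start + 3]) >= 1:
            start -= 1
        # Keep the region only if its overall average propensity exceeds 1.
        if mean(values[start:end]) > 1:
            for k in range(start, end):
                mask[k] = True
    return mask


def chou_fasman(seq):
    """Return a string of H (helix), E (strand) and C (other) per residue."""
    helix = predict_mask(seq, P_ALPHA, window=6, min_formers=4)
    strand = predict_mask(seq, P_BETA, window=5, min_formers=3)
    states = ["H" if h else "E" if e else "C" for h, e in zip(helix, strand)]
    i, n = 0, len(seq)
    while i < n:
        if helix[i] and strand[i]:
            # Overlap: compare average propensities over the overlapping run.
            j = i
            while j < n and helix[j] and strand[j]:
                j += 1
            avg_a = mean(P_ALPHA[aa] for aa in seq[i:j])
            avg_b = mean(P_BETA[aa] for aa in seq[i:j])
            states[i:j] = ["H" if avg_a > avg_b else "E"] * (j - i)
            i = j
        else:
            i += 1
    return "".join(states)


print(chou_fasman("AEAEAEAEAE"))
print(chou_fasman("VIVIVIVI"))
```

The two toy sequences are chosen so that only one rule fires: Ala and Glu have helix propensities above 1 and sheet propensities below 1, and the reverse holds for Val and Ile.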
Chou and Fasman Algorithm (cont...)
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
- PTurn (average turn propensity over all 4 residues)
- P(t) = f(i)*f(i+1)*f(i+2)*f(i+3), where each f is a position-specific turn parameter
- If the averages for the tetrapeptide obey the inequality
PTurn > P(α) and PTurn > P(β) and PTurn > 1 and P(t) > 0.000075
then the tetrapeptide is considered a turn.
Each position has distinct amino acid preferences.
Examples:
- At position 2, Pro is highly preferred; Trp is disfavored
Beware of Q3 Values
It is important to be aware that the Q3 score can give an overoptimistic estimate of
accuracy.
Because there are only 3 states, even random guessing would yield a 3-state accuracy
(Q3) of about 33%, assuming that all structures are equally likely.
The numbers of residues in helices, strands, and loops in the database are frequently not
evenly distributed, with loops usually comprising the greatest proportion.
Amino acid sequence:         ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure:  hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:                ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76%
Prediction 2:                hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76%
Secondary structure assignment in real proteins is uncertain to about 10%
(disagreement between DSSP and STRIDE); Therefore, a “perfect” prediction
would have Q3 = 90%.
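The two predictions in the example on this slide can be checked with a short script. The sketch below computes the overall Q3 (correctly predicted residues over all residues) and the per-state scores for the strings shown above. The function names are illustrative, not taken from any particular package.

```python
def q3(observed, predicted):
    """Overall three-state accuracy in percent."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state_accuracy(observed, predicted):
    """Percent of residues observed in each state that are predicted correctly."""
    result = {}
    for state in sorted(set(observed)):
        n_observed = observed.count(state)
        n_correct = sum(o == p == state for o, p in zip(observed, predicted))
        result[state] = 100.0 * n_correct / n_observed
    return result


actual = "hhhhhooooeeeeoooeeeooooohhhhh"
prediction_1 = "ohhhooooeeeeoooooeeeooohhhhhh"
prediction_2 = "hhhhhoooohhhhooohhhooooohhhhh"

for name, pred in [("prediction 1", prediction_1), ("prediction 2", prediction_2)]:
    print(name, round(q3(actual, pred)), per_state_accuracy(actual, pred))
```

Both predictions get 22 of 29 residues right, but the per-state scores differ sharply: the second prediction has a strand accuracy of 0% because it calls every strand residue a helix. This is the kind of difference a single Q3 value hides.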
Chou and Fasman Algorithm (cont...)
Advantages of Chou-Fasman:
Propensity for a specific conformation is evaluated in the “context” of the
flanking residues using simple rules.
Disadvantages of Chou-Fasman:
Correlations between different positions in the sequence are based entirely
on empirical rules.
Ambiguity in the assignment of overlapping regions.
Accuracy below 60% (remember 33.3% is the lower limit).
GOR Method
GOR method (Garnier-Osguthorpe-Robson) is an information theory-based
method.
GOR method is also based on probability parameters derived from empirical
studies of known experimental structures.
GOR method takes into account not only the propensities of
individual amino acids to form particular secondary structures, but also
the conditional probability of the amino acid to form a secondary structure
given that its immediate neighbors have already formed that structure.
Evaluates each residue plus the 8 adjacent N-terminal and 8 adjacent C-terminal
residues (a sliding window of 17 residues).
Underpredicts β-strand regions.
GOR method accuracy: Q3 ≈ 64%
GOR Method
Position-dependent propensities for helix, sheet or turn have been calculated for all
residue types.
For each position j in the sequence, eight residues on both sides of the actual position
are considered.
Statistical information derived from proteins of known structure is stored in three
17 × 20 matrices, one each for α, β, and coil.
A helix propensity table contains information about the propensity for certain residues at 17
positions when the conformation of residue j is helical.
The predicted state of residue aj is the one with the largest sum of the position-dependent
propensities of all residues around aj.
Suppose aj is the amino acid that we are trying to categorize.
GOR looks at the residues a(j−8), a(j−7), ..., a(j), ..., a(j+7), a(j+8).
Intuitively, it assigns position-dependent probabilities based on what it has calculated
from protein databases.
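The sliding-window sum described on this slide can be sketched as follows. The window logic (eight residues on each side, one 17 × 20 table per state, highest sum wins) follows the text. The numbers in the tables are toy values invented for the illustration and are NOT real GOR parameters, which would be derived from proteins of known structure. The names `toy_matrix`, `TOY_MATRICES` and `gor_predict` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HALF_WINDOW = 8  # eight residues on each side, 17 positions in total


def toy_matrix(favored):
    """Hypothetical 17 x 20 table: +1 for favored residues, -1 otherwise."""
    row = {aa: (1.0 if aa in favored else -1.0) for aa in AMINO_ACIDS}
    return [dict(row) for _ in range(2 * HALF_WINDOW + 1)]


# Toy stand-ins for the three statistical tables (helix, strand, coil).
TOY_MATRICES = {
    "H": toy_matrix("AELMQK"),
    "E": toy_matrix("VIFYWT"),
    "C": toy_matrix("GPSDN"),
}


def gor_predict(seq, matrices):
    """Assign each residue the state with the largest window sum."""
    prediction = []
    for j in range(len(seq)):
        totals = {}
        for state, matrix in matrices.items():
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                k = j + offset
                if 0 <= k < len(seq):  # window positions off the ends are skipped
                    total += matrix[offset + HALF_WINDOW][seq[k]]
            totals[state] = total
        prediction.append(max(totals, key=totals.get))
    return "".join(prediction)


print(gor_predict("AAAAAAAAAA", TOY_MATRICES))
print(gor_predict("VIVIVIVI", TOY_MATRICES))
```

In a real implementation each of the 17 rows would hold different values, so that a residue four positions upstream contributes differently from the same residue one position downstream. The toy tables here use the same row everywhere to keep the example short.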
Third Generation Methods
Use evolutionary information based on multiple sequence alignment and expert
methods (neural networks) for prediction.
The most important algorithms of today:
PHD
NNPREDICT
PSIPRED
Due to the improvement of protein information in databases, i.e. better evolutionary
information, today’s predictive accuracy is ~80%.
It is believed that the maximum reachable accuracy is 88%.
An artificial neural network is composed of many
artificial neurons that are linked together
according to a specific network architecture. The
goal of the neural network is to transform the
inputs into meaningful outputs.
Tertiary Structure Prediction
Major Techniques
Template Based Modeling
• Homology Modeling
• Threading
Template-Free Modeling
Prediction from sequence using first principles
• ab initio Methods
• Physics-Based
• Knowledge-Based
Synonyms: de novo modelling, physics-based.
Overview of ab initio method
Typically ab initio modelling conducts a conformational search under the
guidance of a designed energy function.
This procedure usually generates a number of possible conformations
(structure decoys), and final models are selected from them.
Therefore, successful ab initio modelling depends on three factors:
(1) an accurate energy function with which the native structure of a protein
corresponds to the most thermodynamically stable state, compared to all
possible decoy structures;
(2) an efficient search method which can quickly identify the low-energy states
through conformational search;
(3) selection of native-like models from a pool of decoy structures.
Overview of ab initio method
Disadvantages:
Ab initio prediction is not practical for large sequences (in practice limited to < 100 aa).
Computationally very expensive.
Currently, the accuracy of ab initio modelling is low and success is limited to
small proteins.
Advantages:
Can give insights into folding mechanism.
Understanding protein misfolding
Doesn’t require homologs
Only way to model new folds
Useful for de novo protein design
Challenges in Protein folding
Energetics
We don’t know all the forces involved in detail
Too computationally expensive BY FAR! (Folding takes place
on the order of microseconds to milliseconds)
Conformational search impossibly large
100 a.a. protein, 2 moving dihedrals per residue, 2 possible positions for each
dihedral: 2^200 conformations!
Levinthal’s Paradox
Proteins fold in a couple of seconds??
Multiple-minima problem
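The size of the search space quoted on this slide can be turned into a rough time estimate. The sketch below counts the conformations for the slide's example (100 residues, 2 dihedrals each, 2 positions per dihedral) and asks how long an exhaustive search would take. The sampling rate of 10^13 conformations per second is an assumption chosen for illustration and does not come from the lecture.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# ASSUMPTION: illustrative sampling rate, not a value from the lecture.
SAMPLING_RATE = 1e13  # conformations tried per second


def levinthal_estimate(residues=100, dihedrals_per_residue=2,
                       states_per_dihedral=2, rate=SAMPLING_RATE):
    """Return (number of conformations, years needed to enumerate them)."""
    conformations = states_per_dihedral ** (residues * dihedrals_per_residue)
    years = conformations / rate / SECONDS_PER_YEAR
    return conformations, years


conformations, years = levinthal_estimate()
print(f"{conformations:.2e} conformations, about {years:.1e} years to enumerate")
```

Even with this generous sampling rate the enumeration would take vastly longer than the age of the universe, while real proteins fold in microseconds to seconds. That gap is Levinthal's paradox: folding cannot be a random exhaustive search.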
Understanding protein folding via molecular simulation
Advances in computer hardware, software and algorithms have now made it possible to
simulate protein folding.
Atomistic models have been used for decades to address the protein folding
problem (M. Levitt, A. Warshel 1975).
The first long-timescale study of protein folding using MD simulation (Peter
Kollman 1998)
Time scale for protein folding
Challenges
Accurate force fields
Adequate sampling
Robust data analysis
Rosetta Approach
The Rosetta Approach (David Baker lab, Univ. of Washington).
Performs a Monte Carlo search through the space of conformations to find the minimal-energy
conformation.
Overview of the Rosetta Approach
• Rosetta searches structure space by replacing the torsion angles of a fragment
in the current model with torsion angles from known structure fragments
The Rosetta Approach
Given: protein sequence P
for each window of length 9 in P
    assemble a set of structure fragments (using PSI-BLAST)
M = initial structure model of P (fully extended conformation)
S = score(M)
while stopping criteria not met
    randomly select a fixed-width “window” of amino acids from P
    randomly select a fragment from the list for this window
    M’ = M with torsion angles in window replaced by angles from fragment
    S’ = score(M’)
    if Metropolis criterion(S, S’) satisfied
        M = M’
        S = S’
Return: predicted structure M
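The loop above can be made concrete with a toy version in Python. This is a minimal sketch of fragment-insertion Monte Carlo with a Metropolis acceptance test, not Rosetta itself. Everything beyond the loop structure is an assumption made for illustration: the model is a plain list of (phi, psi) pairs, the fragment library is filled with random angles instead of PSI-BLAST hits, the score is a toy distance to an arbitrary target angle pair, and the temperature and step count are arbitrary.

```python
import math
import random

FRAG_LEN = 9                 # window length used on the slide
EXTENDED = (-180.0, 180.0)   # (phi, psi) of a fully extended residue
TARGET = (-60.0, -45.0)      # toy "native" angles, purely illustrative


def score(model):
    """Toy energy: squared distance of every (phi, psi) pair from TARGET."""
    return sum((phi - TARGET[0]) ** 2 + (psi - TARGET[1]) ** 2
               for phi, psi in model)


def make_fragment_library(n_residues, per_window, rng):
    """Hypothetical fragment lists: random torsion angles for each window."""
    return {
        start: [[(rng.uniform(-180, 180), rng.uniform(-180, 180))
                 for _ in range(FRAG_LEN)]
                for _ in range(per_window)]
        for start in range(n_residues - FRAG_LEN + 1)
    }


def metropolis_accept(s_old, s_new, temperature, rng):
    """Always accept downhill moves; accept uphill moves with Boltzmann odds."""
    if s_new <= s_old:
        return True
    return rng.random() < math.exp(-(s_new - s_old) / temperature)


def fragment_assembly(n_residues, n_steps=2000, temperature=500.0, seed=0):
    """Run the fragment-insertion loop and return (model, score)."""
    rng = random.Random(seed)
    library = make_fragment_library(n_residues, per_window=25, rng=rng)
    model = [EXTENDED] * n_residues          # M: fully extended start
    s = score(model)                         # S
    for _ in range(n_steps):                 # stopping criterion: step budget
        start = rng.randrange(n_residues - FRAG_LEN + 1)
        fragment = rng.choice(library[start])
        trial = model[:start] + fragment + model[start + FRAG_LEN:]   # M'
        s_trial = score(trial)                                        # S'
        if metropolis_accept(s, s_trial, temperature, rng):
            model, s = trial, s_trial
    return model, s


final_model, final_score = fragment_assembly(20)
print(len(final_model), final_score)
```

Because moves only swap in angles that already exist in the fragment lists, the search can only be as good as the library. This is why the real method draws fragments from known structures with similar local sequence rather than from random angles.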
The Rosetta Scoring Approach
Rosetta scoring function takes into account
residue environment (solvation)
residue pair interactions (electrostatics, disulfides)
strand pairing (hydrogen bonding)
strand arrangement into sheets
helix-strand packing
steric repulsion
the scoring function progressively adds terms during the search
initially only the steric overlap term is used
then all but “compactness” terms are used
search is initiated from different random seeds
for some applications, an atomic-level scoring function is used
Critical Assessment of protein Structure Prediction (CASP)
A community-wide, worldwide experiment for protein structure prediction that has been held every
two years since 1994.
Evaluation of the results is carried out in the following prediction categories:
Tertiary structure prediction (all CASPs; divided into template-based and template-free
methods)
Secondary structure prediction (dropped after CASP5)
Prediction of structure complexes (CASP2 only; a separate experiment CAPRI)
residue-residue contact prediction (starting CASP4)
disordered regions prediction (starting CASP5)
domain boundary prediction (CASP6–CASP8)
function prediction (starting CASP6)
model quality assessment (starting CASP7)
model refinement (starting CASP7)
high-accuracy template-based prediction (starting CASP7)