
Prediction to Protein Structure
CSC 487/687 Computing for Bioinformatics
Fall 2005
Protein structure prediction
• Knowing the structure of a protein is a prerequisite for gaining a thorough understanding of its function.
• Experimental methods of structure determination are highly labor intensive.
• Proteins are capable of folding into their unique functional 3D structures without any additional genetic mechanisms.
Where can I learn more?
Protein Structure Prediction Center
Biology and Biotechnology Research Program
Lawrence Livermore National Laboratory, Livermore, CA
http://predictioncenter.org/
Why do we want to predict secondary structure?
• Prediction of secondary structure is a step toward 3D structure prediction.
• It can be used in threading methods to identify distantly related proteins.
• It may provide insights into function.
What is secondary structure?
Three major types:
• Alpha-helical regions
• Beta-sheet regions
• Coils, turns, extended (anything else)
Some Prediction Methods
• Ab initio methods
  – based on physical properties of amino acids and bonding patterns
• Statistics of amino acid distributions in known structures
  – Chou-Fasman
• Position of amino acid and distribution
  – Garnier, Osguthorpe-Robson (GOR)
• Neural networks
Ab initio methods
• A mixture of science and engineering
• The challenges (a toy sketch follows below):
  – Devise a scoring function that can distinguish correct (native or native-like) structures from incorrect (non-native) ones.
  – Devise a search method to explore the conformation space.
• The problems:
  – A reliable and general scoring function
  – A reliable and general search method that can sample the conformation space adequately
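As a rough sketch of this recipe (not any published method), the Python below pairs a toy scoring function over backbone (phi, psi) torsion angles with a Metropolis-style Monte Carlo search. Both the score (which simply favours helix-like angles) and the move set are illustrative placeholders, not a real energy function or sampler.

```python
# Toy ab initio sketch: a placeholder scoring function plus a Metropolis
# search over a conformation represented as a list of (phi, psi) angles.
import math
import random

HELIX_PHI, HELIX_PSI = -60.0, -45.0   # idealised alpha-helical torsion angles

def score(conformation):
    """Toy score: squared deviation from ideal helical (phi, psi) angles."""
    return sum((phi - HELIX_PHI) ** 2 + (psi - HELIX_PSI) ** 2
               for phi, psi in conformation)

def perturb(conformation):
    """Toy move set: nudge one randomly chosen residue's torsion angles."""
    new = list(conformation)
    i = random.randrange(len(new))
    phi, psi = new[i]
    new[i] = (phi + random.uniform(-10, 10), psi + random.uniform(-10, 10))
    return new

def metropolis_search(start, n_steps=5000, temperature=50.0):
    """Keep the best-scoring conformation seen during a Metropolis walk."""
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(n_steps):
        candidate = perturb(current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or random.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

# Start from random torsion angles for a 20-residue chain.
start = [(random.uniform(-180, 180), random.uniform(-180, 180)) for _ in range(20)]
best, best_score = metropolis_search(start)
```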
Chou-Fasman
• The first widely used procedure
• Developed by Chou & Fasman in 1974 & 1978
• Based on the frequencies of residues in α-helices, β-sheets and turns
• Accuracy ~50-60%
• Output: helix, strand or turn
Chou-Fasman Pij-values
(P(H) = helix propensity, P(E) = sheet propensity, P(turn) = turn propensity; f(i)..f(i+3) = frequencies of the residue at positions 1-4 of a turn)

Name            P(H)   P(E)   P(turn)   f(i)     f(i+1)   f(i+2)   f(i+3)
Alanine          142     83      66     0.060    0.076    0.035    0.058
Arginine          98     93      95     0.070    0.106    0.099    0.085
Aspartic Acid    101     54     146     0.147    0.110    0.179    0.081
Asparagine        67     89     156     0.161    0.083    0.191    0.091
Cysteine          70    119     119     0.149    0.050    0.117    0.128
Glutamic Acid    151     37      74     0.056    0.060    0.077    0.064
Glutamine        111    110      98     0.074    0.098    0.037    0.098
Glycine           57     75     156     0.102    0.085    0.190    0.152
Histidine        100     87      95     0.140    0.047    0.093    0.054
Isoleucine       108    160      47     0.043    0.034    0.013    0.056
Leucine          121    130      59     0.061    0.025    0.036    0.070
Lysine           114     74     101     0.055    0.115    0.072    0.095
Methionine       145    105      60     0.068    0.082    0.014    0.055
Phenylalanine    113    138      60     0.059    0.041    0.065    0.065
Proline           57     55     152     0.102    0.301    0.034    0.068
Serine            77     75     143     0.120    0.139    0.125    0.106
Threonine         83    119      96     0.086    0.108    0.065    0.079
Tryptophan       108    137      96     0.077    0.013    0.064    0.167
Tyrosine          69    147     114     0.082    0.065    0.114    0.125
Valine           106    170      50     0.062    0.048    0.028    0.053
How it works
1. Assign every residue its set of parameters from the table above.
2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(alpha-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(alpha-helix) < 100 is reached; that is the end of the helix. If the segment defined by this procedure is longer than 5 residues and its average P(alpha-helix) > average P(beta-sheet), the segment is assigned as a helix (a code sketch of this scanning step follows the list).
3. Repeat this procedure to locate all of the helical regions in the sequence.
4. Scan through the peptide and identify regions where 3 out of 5 contiguous residues have P(beta-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(beta-sheet) < 100 is reached; that is the end of the beta-sheet. Any segment located by this procedure is assigned as a beta-sheet if its average P(beta-sheet) > 105 and its average P(beta-sheet) > average P(alpha-helix).
5. Any region with overlapping alpha-helix and beta-sheet assignments is taken to be helical if its average P(alpha-helix) > P(beta-sheet), and a beta-sheet if its average P(beta-sheet) > P(alpha-helix).
6. To identify a turn at residue j, calculate p(t) = f(i) · f(i+1) · f(i+2) · f(i+3), using the f(i) value of residue j, the f(i+1) value of residue j+1, the f(i+2) value of residue j+2 and the f(i+3) value of residue j+3. If (1) p(t) > 0.000075, (2) the average P(turn) over the tetrapeptide is > 1.00 (i.e. > 100 on the scale of the table above), and (3) the tetrapeptide averages satisfy P(alpha-helix) < P(turn) > P(beta-sheet), then a beta-turn is predicted at that location.
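As a concrete illustration, here is a minimal Python sketch of the helix-nucleation scan from steps 1-2 only. It uses the P(H) values from the table above, keyed by one-letter amino acid codes; the extension, sheet and turn rules are omitted, and the sequence in the final line is an arbitrary example.

```python
# Minimal sketch of the Chou-Fasman helix-nucleation scan: slide a 6-residue
# window and flag windows where at least 4 residues have P(helix) > 100.
P_HELIX = {
    'A': 142, 'R': 98, 'D': 101, 'N': 67, 'C': 70,
    'E': 151, 'Q': 111, 'G': 57, 'H': 100, 'I': 108,
    'L': 121, 'K': 114, 'M': 145, 'F': 113, 'P': 57,
    'S': 77, 'T': 83, 'W': 108, 'Y': 69, 'V': 106,
}

def helix_nucleation_sites(sequence, window=6, min_formers=4):
    """Return start indices of windows where >= 4 of 6 residues favour helix."""
    sites = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(1 for aa in segment if P_HELIX.get(aa, 0) > 100)
        if formers >= min_formers:
            sites.append(start)
    return sites

print(helix_nucleation_sites("MKAELLQEAGATNDGSPW"))
```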
GOR (Garnier-Osguthorpe-Robson)
• Developed by Garnier, Osguthorpe & Robson
• Uses a sliding window of 17 residues
• Tends to underpredict β-strand regions
• The GOR III method reaches an accuracy of ~64%
GOR (Garnier-Osguthorpe-Robson)
• Position-dependent propensities for helix, sheet or turn have been calculated for all residue types.
• For each position j in the sequence, eight residues on either side of the actual position are considered.
• A helix propensity table contains information about the propensity of particular residues at the 17 window positions when the conformation of residue j is helical. Each propensity table therefore has 20 x 17 entries.
• The predicted state of residue j is calculated as the sum of the position-dependent propensities of all residues around it (see the sketch below).
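A minimal sketch of the GOR scoring rule just described, under one loud assumption: the propensity tables below are filled with random numbers purely to show the mechanics, whereas the real method uses 20 x 17 tables per state estimated from proteins of known structure. For each position j, the propensities of the residues found at each of the 17 window offsets are summed per state, and the state with the highest total is predicted.

```python
# GOR-style scoring sketch with placeholder (random) propensity tables.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = ("H", "E", "C")
HALF_WINDOW = 8          # eight residues on each side of position j

random.seed(0)
PROPENSITY = {
    state: {(aa, offset): random.uniform(-1, 1)
            for aa in AMINO_ACIDS
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1)}
    for state in STATES
}

def predict_state(sequence, j):
    """Predict residue j's state by summing propensities over the 17-residue window."""
    scores = dict.fromkeys(STATES, 0.0)
    for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
        k = j + offset
        if 0 <= k < len(sequence):
            for state in STATES:
                scores[state] += PROPENSITY[state][(sequence[k], offset)]
    return max(scores, key=scores.get)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
prediction = "".join(predict_state(sequence, j) for j in range(len(sequence)))
print(prediction)
```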
PSI-BLAST-based Secondary Structure Prediction (PSIPRED)
• Three stages:
  1) Generation of a sequence profile
  2) Prediction of initial secondary structure
  3) Filtering of the predicted structure
PSIPRED
• Uses multiple aligned sequences for prediction.
• Uses a training set of folds with known structure.
• Uses a two-stage neural network to predict structure based on position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).
  – The first network converts a window of 15 amino acids into a raw score for h (helix), e (sheet), c (coil) or terminus.
  – The second network filters the output of the first; for example, an output of hhhhehhhh might be converted to hhhhhhhhh (see the sketch below).
• Can obtain a Q3 value of 70-78% (which may be near the highest achievable).
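The second stage in PSIPRED is itself a neural network. As a rough stand-in that shows what the filtering step accomplishes, the sketch below applies a simple majority vote over a sliding window to the first-stage string of per-residue states; this is an illustration of the hhhhehhhh to hhhhhhhhh type of correction, not PSIPRED's actual filter.

```python
# Illustrative smoothing of a per-residue state string by windowed majority vote.
from collections import Counter

def smooth_prediction(states, window=5):
    """Replace each state by the majority state in the window centred on it."""
    half = window // 2
    smoothed = []
    for i in range(len(states)):
        neighbourhood = states[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighbourhood).most_common(1)[0][0])
    return "".join(smoothed)

print(smooth_prediction("hhhhehhhh"))   # -> "hhhhhhhhh"
```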
Neural networks
• Computer neural networks are based on a simulation of adaptive learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.
Neural Networks (cont. 1)
• The computer version of a neural network involves identifying a set of inputs (the amino acids in the sequence), which are transmitted through a network of connections.
• At each layer, the inputs are numerically weighted and the combined result is passed to the next layer.
• Ultimately a final output, a decision of helix, sheet or coil, is produced (see the sketch below).
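A minimal sketch of the kind of feedforward network just described: a window of residues is one-hot encoded, each layer multiplies its input by a weight matrix and applies a nonlinearity, and the final layer gives a score for helix, sheet and coil. The 15-residue window, the layer sizes and the random (untrained) weights W1 and W2 are illustrative assumptions; a real predictor would learn the weights from proteins of known structure.

```python
# Untrained feedforward network over a one-hot encoded residue window.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 15                               # residues per input window
N_IN = WINDOW * len(AMINO_ACIDS)          # one-hot input units
N_HIDDEN = 75
STATES = ["H", "E", "C"]

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_IN, N_HIDDEN))         # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, len(STATES)))  # hidden -> output weights

def one_hot_window(window):
    """Encode a window of residues as a flat one-hot vector."""
    x = np.zeros(N_IN)
    for i, aa in enumerate(window):
        x[i * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return x

def predict_centre(window):
    """Predict the state of the window's central residue."""
    hidden = np.tanh(one_hot_window(window) @ W1)   # weighted sum plus nonlinearity
    scores = hidden @ W2                            # one score per output state
    return STATES[int(np.argmax(scores))]

print(predict_centre("MKTAYIAKQRQISFV"))
```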
Neural Networks (cont. 2)
• 90% of the training set (proteins of known structure) was used for training.
• 10% was used to evaluate the performance of the neural network during the training session.
Neural Networks (cont. 3)
• During the training phase, selected sets of proteins of known structure are scanned, and if the decisions are incorrect, the input weightings are adjusted by the software to produce the desired result.
• Training runs are repeated until the success rate is maximized.
• Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible, without duplications of structural types that may bias the decisions.
Neural Networks (cont. 4)
• An additional component of the PSIPRED procedure involves sequence alignment with similar proteins.
• The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids: some positions are remarkably tolerant of substitution, while others have unique requirements.)
• To predict secondary structure accurately, one should place less weight on the tolerant positions, which clearly contribute little to the structure, and more weight on the intolerant positions.
[Figure: PSIPRED network architecture. The input consists of 15 groups of 21 units (one unit for each amino acid plus one specifying the end of the chain); each row corresponds to an amino acid position and provides information on tolerant or intolerant positions. A filtering network produces three outputs: helix, strand or coil.]
Example of Output from PSIPRED
Workshop
• http://bioinf.cs.ucl.ac.uk/psipred/psiform.html