Protein Structure Prediction_2
Prediction to Protein Structure
Fall 2005
CSC 487/687 Computing for Bioinformatics
PSI-BLAST-based Secondary Structure Prediction (PSIPRED)
Three stages:
1) Generation of a sequence profile
2) Prediction of an initial secondary structure
3) Filtering of the predicted structure
PSIPRED
Uses multiple aligned sequences for prediction.
Uses a training set of folds with known structure.
Uses a two-stage neural network to predict structure from the position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).
The first network converts a window of 15 amino acids into raw scores for h (helix), e (sheet) and c (coil); an extra unit per position marks a chain terminus.
The second network filters the output of the first. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value (the percentage of residues assigned to the correct one of the three states) of 70-78%, which may be the highest achievable.
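The Q3 value quoted above can be computed directly from a predicted and an observed string of states. The sketch below is a minimal illustration; the function name `q3_score` and the two ten-residue strings are made up for the example and are not PSIPRED output.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label (h, e, c) matches."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("cannot score an empty prediction")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(predicted)


# Hypothetical example: the two strings agree at 8 of 10 positions.
predicted = "hhhhhcceee"
observed = "hhhhccceec"
print(q3_score(predicted, observed))  # 80.0
```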
Neural networks
• Computer neural networks are based on simulation of adaptive learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.
Neural Networks (cont. 1)
• The computer version of the neural network starts from a set of inputs (the amino acids in the sequence), which are transmitted through a network of connections.
• At each layer, the inputs are numerically weighted and the combined result is passed to the next layer.
• Ultimately a final output is produced: a decision of helix, sheet or coil.
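The layer-by-layer weighting described above can be sketched as a small feed-forward pass. This is only an illustration: the network size, the logistic squashing function and every weight value below are assumptions chosen for the example, not the PSIPRED network.

```python
import math


def layer(inputs, weights, biases):
    """Compute each unit's weighted sum of the inputs and squash it to (0, 1)."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(unit_weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def predict_state(inputs, hidden_w, hidden_b, out_w, out_b):
    """Pass the inputs through one hidden layer and pick the strongest output."""
    hidden = layer(inputs, hidden_w, hidden_b)
    scores = layer(hidden, out_w, out_b)
    labels = "hec"  # helix, sheet, coil
    return labels[scores.index(max(scores))], scores


# Hypothetical toy network: 3 inputs -> 2 hidden units -> 3 outputs.
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
hidden_b = [0.0, 0.1]
out_w = [[5.0, 5.0], [-5.0, -5.0], [0.0, 0.0]]
out_b = [0.0, 0.0, -1.0]

state, scores = predict_state([1.0, 0.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(state, scores)
```

With these made-up weights the first output unit always receives a positive sum, so the decision is helix; real weights come from training.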
Neural Networks (cont. 2)
90% of the set of known structures was used for training.
The remaining 10% was used to evaluate the performance of the neural network during the training session.
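A 90/10 split like the one above can be sketched as follows. The helper name `split_training_set`, the fixed random seed and the protein identifiers are assumptions for illustration only.

```python
import random


def split_training_set(proteins, holdout_fraction=0.1, seed=0):
    """Shuffle the proteins and set aside a fraction for evaluation."""
    shuffled = list(proteins)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical identifiers standing in for proteins of known structure.
protein_ids = [f"protein_{i:03d}" for i in range(100)]
train_ids, holdout_ids = split_training_set(protein_ids)
print(len(train_ids), len(holdout_ids))  # 90 10
```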
Neural Networks (cont. 3)
• During the training phase, selected sets of proteins of known structure are scanned, and if the decisions are incorrect, the input weightings are adjusted by the software to produce the desired result.
• Training runs are repeated until the success rate is maximized.
• Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible without duplications of structural types that may bias the decisions.
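The idea of adjusting weights only when a decision is wrong, and repeating runs until no further improvement, can be illustrated with a single-unit perceptron. This is a deliberately simpler stand-in, not the training method used for the PSIPRED networks; the function names and the toy data are assumptions.

```python
def decide(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(examples, n_inputs, max_epochs=100, rate=1.0):
    """Repeat training runs, adjusting weights after every incorrect decision."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in examples:
            delta = target - decide(weights, bias, inputs)
            if delta != 0:  # incorrect decision: nudge weights toward the target
                errors += 1
                weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
                bias += rate * delta
        if errors == 0:  # every training example is now decided correctly
            break
    return weights, bias


# Toy, linearly separable data (logical OR), standing in for real training windows.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_perceptron(examples, n_inputs=2)
print([decide(weights, bias, x) for x, _ in examples])
```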
Neural Networks (cont. 4)
• An additional component of the PSIPRED procedure involves sequence alignment with similar proteins.
• The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.)
• To predict secondary structure accurately, one should place less weight on the tolerant positions, which contribute little to the structure, and more weight on the intolerant positions.
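One simple way to see which positions are tolerant or intolerant is to score each column of a multiple alignment by how variable it is. The sketch below uses Shannon entropy for this; that measure is a stand-in chosen for illustration (PSIPRED itself takes its position-specific scores from PSI-BLAST), and the four aligned sequences are invented.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each column from 0 (maximally variable) to 1 (fully conserved)."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # 20 possible amino acids
    scores = []
    for col in range(n_cols):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:  # column of gaps only: no information
            scores.append(0.0)
            continue
        counts = Counter(residues)
        entropy = -sum(
            (n / len(residues)) * math.log2(n / len(residues))
            for n in counts.values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Hypothetical alignment: columns 1 and 3 are intolerant, column 2 is tolerant.
alignment = ["GACW", "GSCW", "GTCF", "GVCY"]
print(column_conservation(alignment))
```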
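[Figure: PSIPRED network architecture. Input: each row specifies an amino acid position; the window is 15 groups of 21 units (1 unit for each amino acid plus one specifying the end). The profile provides information on tolerant or intolerant positions. The first network feeds a filtering network; the three outputs are helix, strand or coil.]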
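The input layout of 15 groups of 21 units can be sketched as a function that flattens a window of profile rows into 315 numbers. The function name `encode_window`, the zero padding beyond the chain ends and the toy profile are assumptions; any scaling of the profile scores is left out.

```python
WINDOW = 15
UNITS_PER_POSITION = 21  # 20 amino-acid scores plus one end-of-chain flag


def encode_window(profile, center):
    """Flatten the window around `center` into 15 * 21 = 315 input values."""
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            inputs.extend(profile[pos])  # 20 profile scores for this residue
            inputs.append(0.0)           # end flag off: position is inside the chain
        else:
            inputs.extend([0.0] * 20)    # no residue here (assumed zero padding)
            inputs.append(1.0)           # end flag on: window runs past a terminus
    return inputs


# Hypothetical 30-residue profile; row i holds twenty copies of the value i.
profile = [[float(i)] * 20 for i in range(30)]
first = encode_window(profile, 0)
middle = encode_window(profile, 15)
print(len(first), sum(first[20::UNITS_PER_POSITION]))  # 315 7.0
```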
Example of Output from PSIPRED
Workshop
http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
3D structure data
The largest 3D structure database is the Protein Data Bank (PDB).
It contains over 33,000 records.
Each record contains 3D coordinates for macromolecules.
80% of the records were obtained from X-ray diffraction studies, 15% from NMR, and the rest from other methods and theoretical calculations.
Part of a record from the PDB
ATOM      1  N   ARG A  14      22.451  98.825  31.990  1.00 88.84           N
ATOM      2  CA  ARG A  14      21.713 100.102  31.828  1.00 90.39           C
ATOM      3  C   ARG A  14      22.583 101.018  30.979  1.00 89.86           C
ATOM      4  O   ARG A  14      22.105 101.989  30.391  1.00 89.82           O
ATOM      5  CB  ARG A  14      21.424 100.704  33.208  1.00 93.23           C
ATOM      6  CG  ARG A  14      20.465 101.880  33.215  1.00 95.72           C
ATOM      7  CD  ARG A  14      20.008 102.147  34.637  1.00 98.10           C
ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30           N
ATOM      9  CZ  ARG A  14      18.344 103.507  35.833  1.00100.29           C
ATOM     10  NH1 ARG A  14      18.580 102.835  36.952  1.00 99.51           N
ATOM     11  NH2 ARG A  14      17.441 104.479  35.827  1.00100.79           N
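ATOM lines in a PDB file use fixed column positions, which is why values such as 1.00 and 100.30 can run together without a space. A minimal sketch of reading one line by column slices is shown below; the function name `parse_atom_line` and the dictionary keys are choices made for this example, and only the fields visible in the record above are extracted.

```python
def parse_atom_line(line):
    """Extract fields from one fixed-width PDB ATOM line using column slices."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
        "element": line[76:78].strip(),    # columns 77-78
    }


# Atom 8 from the record above, where occupancy and B-factor touch.
SAMPLE_LINE = (
    "ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30"
    "           N"
)
atom = parse_atom_line(SAMPLE_LINE)
print(atom["name"], atom["x"], atom["y"], atom["z"], atom["b_factor"])
```

Splitting on whitespace would fail on this line, because "1.00100.30" would be read as a single token.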
Steps to tertiary structure prediction
Comparative protein modeling
Extrapolates new structure based on related
family members
Steps
1. Identification of modeling templates
2. Alignment
3. Model building
Identification of modeling templates
One chooses a cutoff value for the FASTA or BLAST search (an E-value of 10^-5).
Up to ten templates can be used, but the one with the highest sequence similarity to the target sequence (lowest E-value) is the reference template.
The Cα atoms of the templates are selected for superimposition.
This generates a structurally corrected multiple sequence alignment.
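The template selection rule on this slide (apply an E-value cutoff, keep at most ten hits, take the lowest E-value as the reference) can be sketched as below. The function name, the treatment of hits exactly at the cutoff and the hit list are assumptions; the template identifiers are invented.

```python
def select_templates(hits, e_cutoff=1e-5, max_templates=10):
    """Return (reference, templates) from a list of (template_id, e_value) hits."""
    passing = sorted(
        (hit for hit in hits if hit[1] <= e_cutoff),  # keep hits at or below the cutoff
        key=lambda hit: hit[1],                       # lowest E-value first
    )
    templates = passing[:max_templates]
    reference = templates[0] if templates else None
    return reference, templates


# Hypothetical search results: (template identifier, E-value).
hits = [
    ("tmpl_A", 3e-12),
    ("tmpl_B", 2e-40),
    ("tmpl_C", 0.02),
    ("tmpl_D", 8e-6),
]
reference, templates = select_templates(hits)
print(reference, [t[0] for t in templates])
```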
Alignment
The “common core” of the target sequence is threaded onto the template structure using only the alpha carbons.
Building the model
Framework construction
The position of each atom in the target is the average of the positions of the corresponding atoms in the templates.
Portions of the target sequence that do not match the template are constructed with a “spare part” algorithm.
Each loop is defined by its length and by the Cα atom coordinates of the four amino acids preceding and following the loop.
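The averaging step of framework construction can be sketched as follows, assuming the templates have already been superimposed and aligned position by position. An unweighted mean is used here as an assumption; the function names and coordinates are invented, and `None` marks a position missing from a template.

```python
def average_position(atoms):
    """Return the mean (x, y, z) of a list of coordinate tuples."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in range(3))


def build_framework(templates):
    """Average corresponding atoms across templates, position by position."""
    n_positions = len(templates[0])
    framework = []
    for pos in range(n_positions):
        present = [t[pos] for t in templates if t[pos] is not None]
        # Positions covered by no template are left for loop building.
        framework.append(average_position(present) if present else None)
    return framework


# Hypothetical Cα coordinates for three aligned positions in three templates.
templates = [
    [(1.0, 2.0, 3.0), (4.0, 4.0, 4.0), None],
    [(2.0, 4.0, 6.0), (6.0, 6.0, 6.0), None],
    [(3.0, 6.0, 9.0), None, None],
]
print(build_framework(templates))
```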
Building the model
Completing the backbone: a library of PDB entries is consulted to add carbonyl groups and amino groups. The 3-D coordinates come from a separate library of pentapeptide backbone fragments. These backbone fragments are fitted onto the target alpha carbons (Cα). The coordinates of the central tripeptide are averaged for each backbone atom (N, C, C(O)).
Side chains are added from a table of the most probable rotamers, which depend on backbone conformation.
Model refinement: minimization of energy.