Transcript: Lecture 22

Neural Networks for Protein
Structure Prediction
Brown, JMB 1999
CS 466
Saurabh Sinha
Outline
• Goal is to predict “secondary structure”
of a protein from its sequence
• Artificial Neural Network used for this
task
• Evaluation of prediction accuracy
What is Protein Structure?
http://academic.brooklyn.cuny.edu/biology/bio4fv/page/3d_prot.htm
http://matcmadison.edu/biotech/resources/proteins/labManual/images/220_04_114.png
Protein Structure
• An amino acid sequence “folds” into a
complex 3-D structure
• Finding out this 3-D structure is a crucial
and challenging task
• Experimental methods (e.g., X-ray
crystallography) are very tedious
• Computational predictions are a
possibility, but very difficult
What is “secondary structure”?
“Strand”
“Helix”
http://www.wiley.com/college/pratt/0471393878/student/structure/secondary_structure/secondary_structure.gif
“Helix”
“Strand”
http://www.npaci.edu/features/00/Mar/protein.jpg
Secondary structure prediction
• Well, the whole 3-D “tertiary” protein structure
may be hard to predict from sequence
• But can we at least predict the secondary
structural elements such as “strand”, “helix”
or “coil”?
• This is what this paper does
• … as do many other papers (it is a hard
problem!)
A survey of structure prediction
• The most reliable technique is “comparative
modeling”
– Find a protein P whose amino acid sequence is
very similar to your “target” protein T
– Hope that this other protein P does have a known
structure
– Predict a structure similar to that of P, after
carefully considering how the sequences of P and
T differ
A survey of structure prediction
• Comparative modeling fails if we don’t have a
suitable homologous “template” protein P for our
protein T
• “Ab initio” tertiary methods attempt to predict the
structure without using a known template structure
– Incorporate basic physical and chemical principles into the
structure calculation
– Gets very hairy, and highly computationally intensive
• The other option is prediction of secondary structure
only (i.e., making the goal more modest)
– These may be used to provide constraints for tertiary
structure prediction
Secondary structure prediction
• Early methods were based on stereochemical
principles
• Later methods realized that we can do better
if we use not only the one sequence T (our
sequence), but also a family of “related
sequences”
• Search for sequences similar to T, build a
multiple alignment of these, and predict
secondary structure from the multiple
alignment of sequence
What’s multiple alignment
doing here ?
• Most conserved regions of a protein
sequence are either functionally important or
buried in the protein “core”
• More variable regions are usually on surface
of the protein,
– there are few constraints on what type of amino
acids have to be here (apart from bias towards
hydrophilic residues)
• Multiple alignment tells us which portions are
conserved and which are not
hydrophobic core
http://bio.nagaokaut.ac.jp/~mbp-lab/img/hpc.png
What’s multiple alignment
doing here ?
• Therefore, by looking at multiple alignment,
we could predict which residues are in the
core of the protein and which are on the
surface (“solvent accessibility”)
• Secondary structure then predicted by
comparing the accessibility patterns
associated with helices, strands etc.
• This approach (Benner & Gerloff) was mostly
manual
• Today’s paper suggests an automated method
The PSI-PRED algorithm
• Given an amino-acid sequence, predict
secondary structure elements in the protein
• Three stages:
1. Generation of a sequence profile (the
“multiple alignment” step)
2. Prediction of an initial secondary structure
(the neural network step)
3. Filtering of the predicted structure (another
neural network step)
Generation of sequence profile
• A BLAST-like program called “PSI-BLAST”
used for this step
• We saw BLAST earlier -- it is a fast way to
find high scoring local alignments
• PSI-BLAST is an iterative approach
– an initial scan of a protein database using the
target sequence T
– align all matching sequences to construct a
“sequence profile”
– scan the database using this new profile
• Can also pick out and align distantly related
protein sequences for our target sequence T
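The iterative loop above can be sketched in a few lines. This is a deliberately toy version, not the real algorithm: actual PSI-BLAST uses gapped local alignment, position-specific scores, and E-value statistics, whereas this sketch just rescans an aligned database with raw column frequencies.

```python
from collections import Counter

def build_profile(seqs):
    """Column-wise residue counts over equal-length aligned sequences."""
    return [Counter(s[i] for s in seqs) for i in range(len(seqs[0]))]

def profile_match(profile, seq, n_seqs, threshold=0.5):
    """Average per-column frequency of seq's residues under the profile."""
    score = sum(col[c] for col, c in zip(profile, seq)) / (len(seq) * n_seqs)
    return score >= threshold

def psi_blast_sketch(target, database, rounds=2):
    """Toy PSI-BLAST loop: scan, rebuild the profile from hits, rescan."""
    hits = [target]
    for _ in range(rounds):
        profile = build_profile(hits)
        n = len(hits)
        hits = [target] + [s for s in database
                           if s != target and profile_match(profile, s, n)]
    return hits
```

Each round the profile absorbs the previous round's hits, so sequences only distantly similar to T can be pulled in once their closer relatives are part of the profile.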
The sequence profile looks like this
• Has 20 x M numbers, where M is the sequence length
• The numbers are log-likelihoods of each residue at each position
Preparing for the second step
• Feed the sequence profile to an artificial
neural network
• But before feeding, do a simple “scaling”
to bring the numbers to a 0–1 scale,
using the logistic function:

f(x) = 1 / (1 + e^(-x))
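The scaling step is just the logistic function applied element-wise; a minimal sketch (the toy profile values here are arbitrary):

```python
import math

def scale(x):
    """Squash a raw profile score into (0, 1) via 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# toy fragment of a profile: two positions, two scores each
raw = [[-3.0, 0.0], [1.0, 5.0]]
scaled = [[scale(x) for x in row] for row in raw]
```

Large negative log-likelihoods map near 0, zero maps to exactly 0.5, and large positive scores map near 1, which keeps all inputs to the network on a common scale.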
Intro to Neural nets
(the second and third steps of
PSIPRED)
Artificial Neural Network
• Supervised learning algorithm
• Training examples. Each example has a
label
– “class” of the example, e.g., “positive” or
“negative”
– “helix”, “strand”, or “coil”
• Learns how to predict the class of an
example
Artificial Neural Network
• Directed graph
• Nodes or “units” or “neurons”
• Edges between units
• Each edge has a weight (not known a
priori)
Layered Architecture
http://www.akri.org/cognition/images/annet2.gif
Input here is a four-dimensional vector. Each dimension goes
into one input unit
Layered Architecture
http://www.geocomputation.org/2000/GC016/GC016_01.GIF
What a unit (neuron) does
• Unit i receives a total input xi from the
units connected to it, and produces an
output yi = fi(xi) where fi() is the “transfer
function” of unit i
x_i = Σ_{j ∈ N(i)} w_ij · y_j + w_i

y_i = f_i(x_i) = f_i( Σ_{j ∈ N(i)} w_ij · y_j + w_i )
wi is called the “bias” of the unit
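A single unit's computation is small enough to write out directly. This sketch uses a sigmoid as the transfer function, a typical (assumed) choice:

```python
import math

def unit_output(weights, inputs, bias,
                f=lambda x: 1.0 / (1.0 + math.exp(-x))):
    """y_i = f_i(sum_j w_ij * y_j + w_i): weighted sum of the incoming
    outputs plus the bias, passed through the transfer function."""
    x = sum(w_ij * y_j for w_ij, y_j in zip(weights, inputs)) + bias
    return f(x)
```

With zero weighted input and zero bias the sigmoid unit outputs exactly 0.5; passing an identity transfer function makes the unit purely linear.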
Weights, bias and transfer function
• Unit takes n inputs
• Each input edge has a weight wi
• Bias b
• Output a
• Transfer function f(): linear, sigmoidal, or other
Weights, bias and transfer function
• Weights wij and bias wi of each unit are
“parameters” of the ANN.
– Parameter values are learned from input data
• Transfer function is usually the same for
every unit in the same layer
• Graphical architecture (connectivity) is
decided by you.
– Could use fully connected architecture: all units in
one layer connect to all units in “next” layer
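A fully connected layered forward pass can be sketched as follows (the weight layout is an illustrative assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """weights[i][j] is the weight from input j to output unit i;
    every input feeds every unit (fully connected)."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def network_forward(inputs, layers):
    """layers: list of (weights, biases) pairs, applied in order,
    so each layer's outputs become the next layer's inputs."""
    for weights, biases in layers:
        inputs = layer_forward(inputs, weights, biases)
    return inputs
```

This is the layered architecture in miniature: the outputs of one layer are exactly the inputs of the “next” layer.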
Where’s the algorithm?
• It’s in the training of parameters !
• Given several examples and their labels: the
training data
• Search for parameter values such that output
units make correct predictions on the training
examples
• “Back-propagation” algorithm
– Read up more on neural nets if you are interested
Back to PSIPRED …
Step 2
• Feed the sequence profile to the input layer of
an ANN
• Not the whole profile, only a window of 15
consecutive positions
• For each position, there are 20 numbers in the
profile (one for each amino acid)
• Therefore ~ 15 x 20 = 300 numbers fed
• Therefore, ~ 300 “input units” in ANN
• 3 output units, for “strand”, “helix”, “coil”
– each number is confidence in that secondary
structure for the central position in the window of 15
[Figure: a window of 15 profile positions → input layer → hidden layer → output units: helix 0.18, strand 0.09, coil 0.67]
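The windowing just described can be sketched as follows (assumed details: the profile is an M x 20 list of lists, and window positions hanging off either end of the sequence are zero-padded):

```python
WINDOW = 15
HALF = WINDOW // 2          # 7 positions on each side of the center

def input_vector(profile, center, n_residues=20):
    """Flatten the 15-position window around `center` into the
    15 * 20 = 300 numbers fed to the first network."""
    vec = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(profile):
            vec.extend(profile[pos])
        else:
            vec.extend([0.0] * n_residues)   # off the end: zero padding
    return vec
```

Sliding the window one residue at a time yields one 300-number input vector, and hence one helix/strand/coil prediction, per position of the sequence.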
Step 3
• Feed the output of 1st ANN to the 2nd ANN
• Each window of 15 positions gave 3
numbers from the 1st ANN
• Take 15 successive windows’ outputs and
feed them to 2nd ANN
• Therefore, ~ 15 x 3 = 45 input units in ANN
• 3 output units, for “strand”, “helix”, “coil”
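Step 3's input assembly can be sketched the same way (layout assumed): for each central position, concatenate the first network's three outputs at the 15 surrounding positions.

```python
def second_stage_input(stage1_outputs, center, window=15):
    """stage1_outputs[pos] is the first network's [helix, strand, coil]
    confidences at `pos`; returns the 15 * 3 = 45 inputs for the
    second network."""
    half = window // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(stage1_outputs):
            vec.extend(stage1_outputs[pos])    # 3 confidences per position
        else:
            vec.extend([0.0, 0.0, 0.0])        # zero-pad off the ends
    return vec
```

Because the second network sees a whole neighborhood of first-stage predictions, it can smooth out isolated, structurally implausible calls (e.g., a lone helix residue inside a strand).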
Test of performance
Cross-validation
• Partition the training data into “training set”
(two thirds of the examples) and “test set”
(remaining one third)
• Train PSIPRED on training set, test predictions
and compare with known answers on test set.
• What is an answer?
– For each position of sequence, a prediction of what
secondary structure that position is involved in
– That is, a sequence over “H/S/C” (helix/strand/coil)
• How to compare answer with known answer?
– Number of positions that match
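The comparison just described is the standard per-residue accuracy (commonly called Q3); a minimal sketch:

```python
def q3(predicted, actual):
    """Fraction of positions whose predicted H/S/C state matches the
    known structure; both arguments are equal-length strings."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

For example, a prediction that gets 3 of 4 positions right scores 0.75.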