A novel Method of Protein Secondary Structure Prediction


A novel Method of Protein
Secondary Structure Prediction
with High Segment Overlap
Measure: SVM approach
by S. Hua & Z. Sun
Presenter: Sancar Adali
Outline
Problem Statement
SVM primer
Methods and Parameters
Results
Conclusion
Discussion
Problem Statement
Predicting protein secondary structure
from amino acid sequences
Protein secondary structure: the local shape of the polypeptide chain, determined by the hydrogen bonds between different amino acids.
In the big picture, the goal is to solve the tertiary/quaternary structure in 3D. Solving the more easily tackled secondary structure problem in 2D is an easier first step toward that goal.
Some important terminology
 residue: When two or more amino acids combine to
form a peptide, the elements of water are removed, and
what remains of each amino acid is called an amino acid
residue.
 segment: a group of consecutive residues that form a
secondary structure such as helix, coil, sheet
 C-terminus: The residue that has a free carboxyl
group…
 N-terminus: The residue in a peptide that has an amino
group that is free…
[International Union of Pure and Applied Chemistry
definitions]
Problem Statement
Our main goal is to find the sequence-to-structure relationship. We usually know the amino acid sequence of a protein. The problem is complicated because of the many interactions/bonds between amino acid residues, so we start at a lower level of complexity and try to predict secondary structure.
Amino acids to Proteins
[Diagram: two amino acids, each with an amino (NH3) group, a carboxyl (COO) group, and a side chain (R1, R2), join via a peptide bond with loss of water; the resulting chain runs from the N-terminus to the C-terminus.]
Linear Decision Boundary
 # of classes: 2
 y: output in {−1, +1}; x ∈ R^N; training set L = ((x1, y1), (x2, y2), …, (xn, yn))
 w ∈ R^N
 w∙x = Σj wj xj  (sum over j = 1, …, N)
Hyperplane: H = {x : w∙x+b = 0}
SVM (Support Vector Machine)
Linearly Separable Training Samples
SVM (Support Vector Machine)
Original Idea
Choose H by
Maximizing min{ ||x − xi|| : w∙x + b = 0, i = 1, …, n } with respect to w and b,
i.e. maximizing the margin (the distance from the hyperplane to the closest positive and negative training points)
Solving Linearly Separable Case
Hyperplane: H = {x : w∙x+b = 0}
M=d+ + dChoose w,b to maximize margin. Then
fSVM=sign(w∙x+b)
Since data points are linearly separable,
there exist such w and b
yi(w∙xi +b)>0 for all i =1, …k
Solving Linearly Separable Case
Rescale w, b so that the points closest to the hyperplane satisfy |w∙xi + b| = 1 (standardizing the distance); then
yi(w∙xi + b) ≥ 1 for all i = 1, 2, …, k
Now we have two planes parallel to H:
H1: w∙x + b = +1
H2: w∙x + b = −1
Linearly Separable Case Soln.
The distance from a point x0 to H is |w∙x0 + b| / ||w||, and points on H1 and H2 satisfy |w∙x + b| = 1, so each plane lies 1/||w|| from H. The distance between H1 and H2 is therefore 2/||w||.
Linearly Separable Case Soln.
 We want to maximize the distance between hyperplanes H1 and H2, which means minimizing ||w||.
 The problem has become an optimization problem with objective
min_{w,b} ||w||²
such that
yi(xi∙w + b) ≥ 1, i = 1, …, n
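To make this optimization concrete, here is a minimal sketch that solves the hard-margin primal problem with cvxpy on a tiny toy data set (cvxpy and the toy points are my illustration; the paper does not solve the problem this way):

import numpy as np
import cvxpy as cp

# Tiny linearly separable toy set: points in R^2 with labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)   # normal vector of the separating hyperplane
b = cp.Variable()    # offset

# Hard-margin primal: minimize ||w||^2 subject to y_i (x_i . w + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("predictions:", np.sign(X @ w.value + b.value))  # decision function sign(w.x + b)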
 The decision boundary depends only on a subset of
training samples. These are called “Support Vectors”.
 Introduce nonnegative Lagrange multipliers αi ≥ 0 to solve the problem.
 The solution and the decision function are
w = Σi=1…n αi yi xi = Σ{i : xi ∈ S} αi yi xi, where S is the set of support vectors
fSVM(x) = sign( Σi=1…n αi yi (xi∙x) + b )
Error bound on SVM using LOOCV
Leave-one-out cross-validation gives the bound E[test error] ≤ E[# support vectors] / (# training samples).
Not Linearly Separable Case
 Introduce slack variables
ξi≥0 to take classification
errors into account
 If ξi > 1, we make an error
 Add error term ξi to the
objective function
min_{w,b} ||w||² + C Σi ξi
such that
yi(xi∙w + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, …, n
Not Linearly Separable Case
The solution is again
w = Σi=1…n αi yi xi = Σ{i : xi ∈ S} αi yi xi
fSVM(x) = sign( Σi=1…n αi yi (xi∙x) + b )
but now with the box constraint 0 ≤ αi ≤ C
Nonlinear SVM
 Important observation
Training samples only appear in dot products: xi∙x or xi∙xj
Map the training samples to a higher-dimensional space where they become linearly separable and solve the decision problem there
Nonlinear SVM
Φ : R^N → H (possibly dim H = ∞)
L = {(Φ(xi), yi)}, i = 1, …, k
We don't need to know Φ explicitly, just the inner products
Φ(xi)∙Φ(xj) = ⟨Φ(xi), Φ(xj)⟩_H
which can be represented by a kernel function
K : R^N × R^N → R
⟨Φ(xi), Φ(xj)⟩_H = K(xi, xj)
Nonlinear SVM
f(x) = sign( Σi=1…NS αi yi Φ(si)∙Φ(x) + b )
     = sign( Σi=1…NS αi yi K(si, x) + b )
where s1, …, sNS are the support vectors.
Φ appears explicitly in neither the training step nor the decision function.
Typical Kernels
Polynomial: K(x, x′) = (x∙x′ + 1)^d
Gaussian RBF (the one used in the article): K(x, x′) = exp( −||x − x′||² / (2σ²) )
Arctangent: K(x, x′) = arctan(x∙x′)
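A minimal sketch of these kernels as plain functions (numpy only; the parameter names d and sigma are generic defaults, not values from the paper):

import numpy as np

def polynomial_kernel(x, xp, d=2):
    """K(x, x') = (x . x' + 1)^d"""
    return (np.dot(x, xp) + 1.0) ** d

def gaussian_rbf_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) -- the kernel family used in the article."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def arctan_kernel(x, xp):
    """K(x, x') = arctan(x . x')"""
    return np.arctan(np.dot(x, xp))

x, xp = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(polynomial_kernel(x, xp), gaussian_rbf_kernel(x, xp), arctan_kernel(x, xp))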
Methods
Use of SVMs with a radial basis function kernel
Three classes: H (α-helix), E (β-sheet), C (coil, the remaining states)
Construction of tertiary classifiers from multiple binary classifiers (SVMs are binary classifiers)
SVM Implementation
Gaussian RBF with r = 0.10
The authors claim this parameter is not critical, and they have tried other values.
C: the penalty for classification errors. It must be chosen carefully to find a balance between overfitting (too high a penalty) and poor generalization (too low a penalty). The optimal value of C was determined using 7-fold cross-validation in this paper.
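A minimal sketch of this setup with scikit-learn (not the software the authors used; treating the paper's r = 0.10 as sklearn's gamma is my assumption, and the data and the candidate values for C are placeholders):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder data: binary feature vectors of dimension 21 * window_length, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 21 * 11)).astype(float)
y = rng.choice([-1, 1], size=200)

# RBF-kernel SVM; gamma stands in for the paper's width parameter r (assumed mapping)
param_grid = {"C": [0.1, 0.5, 1.0, 2.0, 5.0]}             # illustrative grid for the error penalty C
search = GridSearchCV(SVC(kernel="rbf", gamma=0.10), param_grid, cv=7)  # 7-fold CV as in the paper
search.fit(X, y)
print("best C:", search.best_params_["C"])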
Encoding Amino Acid Sequence
Information
20 amino acids + 1 for the N- or C-terminus: 21 binary unit vectors (1,0,…,0), (0,1,…,0), …
Concatenate these vectors over each sliding window.
For window length ℓ, the dimension of the feature vector is 21∙ℓ.
Encoding Example
 Assume ℓ = 5 and that there are only 3 amino acids (for our purposes), named Arg (1,0,0,0), Lys (0,1,0,0), Met (0,0,1,0). A position beyond either end of the chain is encoded as (0,0,0,1).
 The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met (window of 5 centered on Arg*, covering Arg, Lys, Arg*, Met, Arg) would be encoded as
xi = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0)
 If the window extended past the N- or C-terminus, the out-of-chain positions would be encoded as (0,0,0,1).
 This is our input to the SVM.
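A minimal sketch of this sliding-window orthogonal encoding, using the 3-letter toy alphabet from the example (the function name and structure are mine, for illustration):

import numpy as np

ALPHABET = ["Arg", "Lys", "Met"]      # toy alphabet; the real scheme uses 20 amino acids + 1 terminus unit
UNITS = len(ALPHABET) + 1             # one extra unit for positions beyond the chain ends
TERMINUS = len(ALPHABET)              # index of that extra unit

def encode_window(sequence, center, window=5):
    """Concatenate one-hot vectors for the window centered on `center` (0-based index)."""
    half = window // 2
    vec = np.zeros(window * UNITS)
    for k, pos in enumerate(range(center - half, center + half + 1)):
        if 0 <= pos < len(sequence):
            vec[k * UNITS + ALPHABET.index(sequence[pos])] = 1.0
        else:
            vec[k * UNITS + TERMINUS] = 1.0   # window position falls outside the chain
    return vec

seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
print(encode_window(seq, center=2))   # feature vector for the third residue (Arg*)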
Determination of Optimal Window Length
Determines dimension of feature vector
Determined using 7-fold cross-validation
Data Sets
 RS126 set: 126 non-homologous proteins
 non-homologous: no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues
 CB513 set: 513 non-homologous proteins. Almost all of the sequences in the RS126 set are included in the CB513 set.
 non-homologous here is defined by the SD score: an SD score of ≥ 5 is regarded as homology. The SD score (the distance of the alignment score from the mean score of randomized sequences, in units of standard deviations) is a more stringent measure than percentage identity. In fact, 11 pairs of proteins in the RS126 set are sequence-similar when the SD score is used instead of percentage identity. The CB513 set contains 16 chains of fewer than 30 residues, and short chains usually decrease accuracy.
Protein Secondary Structure Definition
The eight DSSP states are reduced to three classes:
H ← H (α-helix), G (3₁₀-helix), I (π-helix)
E ← E (β-strand), B (isolated β-bridge)
C ← T (turn), S (bend), - (rest)
Two reduction schemes (Reduction 1 and Reduction 2) are considered in the paper.
Reliability Index (of single classifications)
Defined using a discretized function of the sample's distance to the separating hyperplane.
The authors compare RI with the accuracy measure to show that accuracy increases monotonically with increasing RI.
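A minimal sketch of one plausible discretization of the decision value into a reliability index (the scaling and the number of bins here are illustrative assumptions, not the paper's exact definition):

import numpy as np

def reliability_index(decision_value, scale=2.0, max_ri=9):
    """Map |distance to the separating hyperplane| to an integer RI in 0..max_ri.
    `scale` and `max_ri` are illustrative choices, not taken from the paper."""
    return int(min(np.floor(abs(decision_value) * scale), max_ri))

for d in (0.1, 0.8, 1.7, 5.0):
    print(d, "->", reliability_index(d))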
Accuracy Measures
Estimate of p(Ŷ = I | Y = I) = p(Ŷ = I, Y = I) / p(Y = I)
Estimate of p(Y = I | Ŷ = I) = p(Y = I, Ŷ = I) / p(Ŷ = I)
Y is the class/state variable and Ŷ is the output of the classifier.
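A minimal sketch of these per-class estimates computed from arrays of true and predicted states, together with the per-residue Q3 accuracy mentioned later in the talk (labels and arrays are toy examples):

import numpy as np

def per_class_measures(y_true, y_pred, label):
    """Empirical p(Yhat=I | Y=I) (sensitivity) and p(Y=I | Yhat=I) (precision) for one class."""
    true_is = (y_true == label)
    pred_is = (y_pred == label)
    sens = np.sum(true_is & pred_is) / np.sum(true_is)   # p(Yhat = I | Y = I)
    prec = np.sum(true_is & pred_is) / np.sum(pred_is)   # p(Y = I | Yhat = I)
    return sens, prec

y_true = np.array(list("HHHEEECCCHHH"))
y_pred = np.array(list("HHHEECCCCHHC"))
print("Q3 =", np.mean(y_true == y_pred))                 # overall per-residue three-state accuracy
for cls in "HEC":
    print(cls, per_class_measures(y_true, y_pred, cls))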
Segment Overlap Measure
 Segment Overlap Measure: a more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to the 3D structure
 Problems with per-residue accuracy measurement:
 Proteins with the same 3D fold differ by about 12% in secondary structure
 This means the maximum achievable Q3 is about 88%
 The ends of segments may vary between proteins with the same 3D structure (so their classification is less relevant to determining protein structure)
 SOV accounts for the following:
 Type and position of secondary structure segments rather than per-residue assignments
 Variation of the residues at the ends of the segments
Comparison of SOV and per-residue
Accuracy Measures
Correlation Measure
Cα = (pα nα − uα oα) / √[ (nα + uα)(nα + oα)(pα + uα)(pα + oα) ]
where pα: # of true positives
nα: # of true negatives
oα: # of false positives
uα: # of false negatives
α stands for H, E, or C
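A minimal sketch of this per-class correlation coefficient computed from the four counts (the function name and the example counts are mine):

import math

def class_correlation(tp, tn, fp, fn):
    """C = (tp*tn - fn*fp) / sqrt((tn+fn)(tn+fp)(tp+fn)(tp+fp)), i.e. the Matthews correlation."""
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return (tp * tn - fn * fp) / denom if denom > 0 else 0.0

# Example counts for one class (say alpha = H)
print(class_correlation(tp=50, tn=120, fp=15, fn=10))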
Tertiary Classifier Design
Three one-vs-rest binary classifiers: H/~H, E/~E, C/~C
Choose the class with the maximum distance from its separating hyperplane: SVM_MAX_D
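A minimal sketch of the SVM_MAX_D combination: train one one-vs-rest SVM per class and predict the class whose classifier reports the largest decision value (scikit-learn and the placeholder data are my choices, not the authors' implementation):

import numpy as np
from sklearn.svm import SVC

classes = ["H", "E", "C"]
rng = np.random.default_rng(0)
X_train = rng.random((300, 21 * 11))             # placeholder feature vectors
y_train = rng.choice(classes, size=300)          # placeholder class labels
X_test = rng.random((10, 21 * 11))

# One binary SVM per class (class vs. rest)
binary_svms = {}
for cls in classes:
    svm = SVC(kernel="rbf", gamma=0.10)
    svm.fit(X_train, (y_train == cls).astype(int))   # 1 = this class, 0 = the rest
    binary_svms[cls] = svm

# SVM_MAX_D: pick the class whose classifier gives the largest (signed) distance to its hyperplane
decision_values = np.column_stack([binary_svms[cls].decision_function(X_test) for cls in classes])
print([classes[i] for i in decision_values.argmax(axis=1)])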
Tertiary Classifier Design
Binary classifiers: the one-vs-rest classifiers H/~H, E/~E, C/~C and the one-vs-one classifiers H/C, E/H, E/C
SVM_VOTE: combine the binary classifiers by voting
SVM_NN: combine the binary classifier outputs with a neural network
SVM_JURY: combine all of the tertiary classifiers designed so far to form a jury
Results
Results (Comparison with PHD)
Results, Details – Support Vectors
In typical pattern recognition problems, the ratio of support vectors to training samples is much lower.
Interpretation: recall that the LOOCV error bound is E[# SVs] / (# training samples), so the error bound here is very high compared to typical pattern recognition problems.
Conclusions
 Application of the SVM method to secondary structure prediction with good results (an improvement over PHD)
 Importance of the accuracy measure used to evaluate algorithms and of the class reduction method
 Use of a machine learning method to attack the protein structure problem at a lower level of complexity
 The general protein folding problem in three dimensions requires too much computational power, so this is a cheap alternative
Critique of article
The number of support vectors is too large, which puts the performance of the SVM into question.
The article does not present the machine learning aspects clearly; it is too focused on results.