A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: SVM Approach
by S. Hua & Z. Sun
Presenter: Sancar Adali
Outline
Problem Statement
SVM primer
Methods and Parameters
Results
Conclusion
Discussion
Problem Statement
Predicting secondary protein structure from amino acid sequences
Secondary protein structure: the local shape of the polypeptide chain, determined by the hydrogen bonds between different amino acids
In the big picture, the goal is to solve the tertiary/quaternary structure in 3D. Predicting secondary structure tackles an easier, lower-complexity version of that problem first.
Some important terminology
residue: When two or more amino acids combine to
form a peptide, the elements of water are removed, and
what remains of each amino acid is called an amino acid
residue.
segment: a group of consecutive residues that form a
secondary structure such as helix, coil, sheet
C-terminus: The residue that has a free carboxyl
group…
N-terminus: The residue in a peptide that has an amino
group that is free…
[International Union of Pure and Applied Chemistry
definitions]
Problem Statement
Our main goal is to find the sequence-to-structure relationship. We usually know the amino acid sequence of proteins. The problem is complicated because of the many interactions/bonds between amino acid residues. We start at a lower level of complexity and try to predict the secondary structure.
Amino acids to Proteins
[Diagram: two amino acids, each with an amino group (NH3), a carboxyl group (COO) and a side chain (R1, R2), join through a peptide bond with loss of water; the resulting chain has a free amino group at the N-terminus and a free carboxyl group at the C-terminus.]
Linear Decision Boundary
Number of classes: 2
Output y ∈ {−1, 1}; input x ∈ R^N; training set L = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n))
Weight vector w ∈ R^N
w·x = Σ_{j=1}^{N} w_j x_j
Hyperplane: H = {x : w·x + b = 0}
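As a minimal illustration of a linear decision boundary (not from the article; the weight vector and bias below are made up), the class of a point is simply the sign of w·x + b:

    import numpy as np

    # Hypothetical weight vector and bias defining the hyperplane H = {x : w.x + b = 0}
    w = np.array([2.0, -1.0])
    b = -0.5

    def linear_classify(x):
        """Return +1 or -1 depending on which side of the hyperplane x falls."""
        return 1 if np.dot(w, x) + b >= 0 else -1

    print(linear_classify(np.array([1.0, 0.0])))   # +1 (w.x + b = 1.5)
    print(linear_classify(np.array([0.0, 1.0])))   # -1 (w.x + b = -1.5)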
SVM (Support Vector Machine)
Linearly Separable Training Samples
SVM (Support Vector Machine)
Original Idea
Choose H by
Maximizing min{ ||x − x_i|| : w·x + b = 0, i = 1, …, n } with respect to w and b
Maximizing the margin (the distance from the hyperplane to the closest positive and negative training points)
Solving Linearly Separable Case
Hyperplane: H = {x : w∙x+b = 0}
Margin M = d₊ + d₋. Choose w, b to maximize the margin; then
f_SVM(x) = sign(w·x + b)
Since the data points are linearly separable, there exist w and b such that
y_i(w·x_i + b) > 0 for all i = 1, …, k
Solving Linearly Separable Case
Rescale w, b so that the points closest to the hyperplane satisfy |w·x_i + b| = 1 (standardizing the distance)
y_i(w·x_i + b) ≥ 1 for all i = 1, 2, …, k
Now we have two planes parallel to H:
H1: w·x + b = +1
H2: w·x + b = −1
Linearly Separable Case Soln.
The distance between H1 and H2 is 2/||w||.
Linearly Separable Case Soln.
We want to maximize the distance between the hyperplanes H1 and H2, which means minimizing ||w||.
The problem becomes an optimization problem with objective
min_{w,b} ||w||²
such that
y_i(x_i·w + b) ≥ 1, i = 1, …, n
The decision boundary depends only on a subset of
training samples. These are called “Support Vectors”.
Introduce Lagrange (nonnegative) multipliers α_i ≥ 0 to solve the problem.
The solution and the decision function are
w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i: x_i ∈ S} α_i y_i x_i   (S = the set of support vectors)
f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i x_i·x + b )
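A small numpy sketch of this decision function (illustrative only; the multipliers α_i, labels y_i, support vectors x_i and bias b are assumed to have already come out of the optimization):

    import numpy as np

    def svm_decision(x, alphas, ys, xs, b):
        """f_SVM(x) = sign(sum_i alpha_i * y_i * (x_i . x) + b).
        Only samples with alpha_i > 0 (the support vectors) contribute."""
        s = sum(a * y * np.dot(xi, x) for a, y, xi in zip(alphas, ys, xs))
        return np.sign(s + b)

    # Toy values (assumed, not from the article): two support vectors in 2D
    alphas = np.array([0.5, 0.5])
    ys     = np.array([+1, -1])
    xs     = np.array([[1.0, 1.0], [-1.0, -1.0]])
    b      = 0.0
    print(svm_decision(np.array([0.5, 0.8]), alphas, ys, xs, b))  # +1.0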
Error bound on SVM using leave-one-out cross-validation (LOOCV): the expected test error is bounded by E[# support vectors] / (number of training samples)
Not Linearly Separable Case
Introduce slack variables
ξi≥0 to take classification
errors into account
If ξi > 1, we make an error
Add error term ξi to the
objective function
y_i(w·x_i + b) ≥ 1 − ξ_i
min_{w,b} ||w||² + C ( Σ_{i=1}^{n} ξ_i )
such that
y_i(x_i·w + b) ≥ 1 − ξ_i, i = 1, …, n
Not Linearly Separable Case
The solution is again
w = Σ_{i=1}^{n} α_i y_i x_i = Σ_{i: x_i ∈ S} α_i y_i x_i
f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i x_i·x + b )
but now 0 ≤ α_i ≤ C
Nonlinear SVM
Important observation
Training samples appear only in dot products: x_i·x or x_i·x_j
Map the training samples to a higher-dimensional
space where they become linearly separable and
solve the decision problem there
Nonlinear SVM
Φ: R^N → H (possibly dim H = ∞)
L = {(Φ(x_i), y_i)}_{i=1}^{k}
We don't need to know Φ explicitly, just the inner products
Φ(x_i)·Φ(x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H
They can be represented by a kernel function K: R^N × R^N → R with
Φ(x_i)·Φ(x_j) = K(x_i, x_j)
Nonlinear SVM
f(x) = sign( Σ_{i=1}^{N_S} α_i y_i Φ(s_i)·Φ(x) + b )
     = sign( Σ_{i=1}^{N_S} α_i y_i K(s_i, x) + b )
where the s_i are the N_S support vectors.
Φ never appears explicitly, neither in the decision function nor in the training step.
Typical Kernels
K(x, x′) = (x·x′ + 1)^d
K(x, x′) = exp( −||x − x′||² / 2σ² )   ← THE ONE USED IN THE ARTICLE
K(x, x′) = arctan(x·x′)
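For concreteness, the three kernels written out in Python (the degree d and width σ below are illustrative defaults, not values from the article):

    import numpy as np

    def poly_kernel(x, xp, d=3):
        """Polynomial kernel K(x, x') = (x . x' + 1)^d."""
        return (np.dot(x, xp) + 1.0) ** d

    def rbf_kernel(x, xp, sigma=1.0):
        """Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) -- the type used in the article."""
        return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

    def arctan_kernel(x, xp):
        """Sigmoid-type kernel K(x, x') = arctan(x . x')."""
        return np.arctan(np.dot(x, xp))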
Methods
Use of SVMs with a radial basis function kernel
Three classes: H (α-helix), E (β-sheet), C (coil, the remaining part)
Construction of tertiary (three-class) classifiers from multiple binary classifiers (SVMs are binary classifiers)
SVM Implementation
Gaussian RBF kernel with γ = 0.10
The authors claim this parameter is not critical and that they tried other values
C: the penalty for classification errors. It must be chosen carefully to balance overfitting (penalty too high) against underfitting (penalty too low). The optimal value of C was found using 7-fold cross-validation in this paper.
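A hedged sketch of this setup with scikit-learn, assuming the article's RBF parameter maps onto scikit-learn's gamma; the placeholder data, the window length of 11 and the candidate C values are illustrative assumptions, not values from the article (which tunes C by 7-fold cross-validation):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    # X: (n_samples, 21*window) binary feature vectors, y: +1/-1 labels -- placeholder data
    X = np.random.randint(0, 2, size=(200, 21 * 11)).astype(float)
    y = np.random.choice([-1, 1], size=200)

    # Gaussian RBF kernel with gamma = 0.10, as in the article; C chosen by 7-fold CV
    grid = GridSearchCV(
        SVC(kernel="rbf", gamma=0.10),
        param_grid={"C": [0.5, 1.0, 1.5, 2.0, 5.0]},  # candidate penalties (illustrative)
        cv=7,
    )
    grid.fit(X, y)
    print(grid.best_params_)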
Encoding Amino Acid Sequence Information
20 amino acids + 1 for the N- or C-terminus: 21 binary unit vectors (1,0,…,0), (0,1,…,0), …
Concatenate these vectors for each sliding window
For window length ℓ, the dimension of the feature vector is 21·ℓ
Encoding Example
Assume ℓ = 5 and that there are only 3 amino acids (for our purposes), named Arg (1,0,0,0), Lys (0,1,0,0), Met (0,0,1,0). The two ends of the chain are encoded separately as (0,0,0,1).
The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met is
x_i = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0)
(the terminus code (0,0,0,1) appears whenever the window extends past either end of the chain).
This is our input to the SVM.
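A sketch of this window encoding in Python; the helper name encode_window and the tiny three-letter alphabet are assumptions for illustration only (the article's scheme uses 21 units per position):

    import numpy as np

    def encode_window(sequence, center, window=5, alphabet=("Arg", "Lys", "Met")):
        """One-hot encode the residues in a window centered on `center`.
        Positions falling outside the chain get the extra 'terminus' unit."""
        n_units = len(alphabet) + 1          # +1 unit marks the N- or C-terminus
        half = window // 2
        vec = np.zeros(window * n_units)
        for k, pos in enumerate(range(center - half, center + half + 1)):
            if 0 <= pos < len(sequence):
                vec[k * n_units + alphabet.index(sequence[pos])] = 1.0
            else:
                vec[k * n_units + n_units - 1] = 1.0   # terminus marker
        return vec

    seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
    print(encode_window(seq, center=2))   # feature vector for the Arg* residue (index 2)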
Determination of Optimal Window Length
Determines dimension of feature vector
Determined using 7-fold cross-validation
Data Sets
RS126 set: 126 non-homologous proteins
non-homologous: no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues
CB513 set: 513 non-homologous proteins. Almost all the sequences in the RS126 set are included in the CB513 set.
non-homologous data set: here an SD score of ≥ 5 is regarded as homology. The SD score (the distance of the alignment score from the mean score of randomized sequences, in standard deviations) is a more stringent measure than percentage identity. In fact, 11 pairs of proteins in the RS126 set are sequence-similar when the SD score is used instead of percentage identity. The CB513 set contains 16 chains of fewer than 30 residues, and short chains usually decrease accuracy.
Protein Secondary Structure Definition
The eight DSSP states are reduced to the three prediction classes:
H ← H (α-helix), G (3₁₀-helix), I (π-helix)
E ← E (β-strand), B (isolated β-bridge)
C ← T (turn), S (bend), – (rest)
Two class-reduction schemes (Reduction 1 and Reduction 2) were used.
Reliability Index (of single classifications)
Computed from a discretized function of the sample's distance to the separating hyperplane
Comparing RI with the accuracy measure shows that accuracy increases monotonically with RI
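One plausible way to discretize the distance into a reliability index (the binning below is an assumption for illustration, not necessarily the article's exact formula):

    def reliability_index(distance, scale=0.3, max_ri=9):
        """Map |distance to hyperplane| to an integer RI in 0..max_ri.
        Larger distances (more confident classifications) give a larger RI."""
        return int(min(abs(distance) / scale, max_ri))

    for d in (0.05, 0.4, 1.2, 5.0):
        print(d, reliability_index(d))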
Accuracy Measures
Estimates of
p(Ŷ = I | Y = I) = p(Ŷ = I, Y = I) / p(Y = I)
p(Y = I | Ŷ = I) = p(Y = I, Ŷ = I) / p(Ŷ = I)
Y is the class/state variable and Ŷ is the output of the classifier
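These conditional probabilities are estimated from simple counts over the test residues; a small illustrative helper (not the article's code):

    def per_class_accuracy(y_true, y_pred, cls):
        """Estimate p(Yhat = cls | Y = cls) and p(Y = cls | Yhat = cls) from counts."""
        both      = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        observed  = sum(1 for t in y_true if t == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        q_obs  = both / observed  if observed  else 0.0   # fraction of observed cls residues predicted correctly
        q_pred = both / predicted if predicted else 0.0   # fraction of predicted cls residues that are correct
        return q_obs, q_pred

    print(per_class_accuracy("HHEC", "HHEE", "H"))  # (1.0, 1.0)
    print(per_class_accuracy("HHEC", "HHEE", "E"))  # (1.0, 0.5)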
Segment Overlap Measure
Segment overlap measure (SOV): a more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to the 3D structure
Problems with per-residue accuracy measurement:
Proteins with the same 3D fold differ by about 12% in secondary structure, so the maximum achievable Q3 is about 88%
The ends of segments may vary between proteins with the same 3D structure, so their classification is less relevant to determining the protein structure
SOV accounts for the following:
Type and position of secondary structure segments rather than per-residue assignment
Variation of the residues at the ends of the segments
Comparison of SOV and Per-Residue Accuracy Measures
Correlation Measure
C_α = (p_α n_α − u_α o_α) / √( (n_α + u_α)(n_α + o_α)(p_α + u_α)(p_α + o_α) )
where p_α = # of true positives
n_α = # of true negatives
o_α = # of false positives
u_α = # of false negatives
α stands for H, C or E
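The same correlation written in Python, computed per class from the four counts (illustrative helper, not from the article):

    from math import sqrt

    def matthews_correlation(p, n, o, u):
        """C = (p*n - u*o) / sqrt((n+u)(n+o)(p+u)(p+o)),
        with p = true positives, n = true negatives, o = false positives, u = false negatives."""
        denom = sqrt((n + u) * (n + o) * (p + u) * (p + o))
        return (p * n - u * o) / denom if denom else 0.0

    print(matthews_correlation(p=50, n=100, o=10, u=5))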
Tertiary Classifier Design
Three one-vs-rest binary classifiers: H/~H, E/~E, C/~C
Choose the class whose classifier gives the largest distance from the hyperplane: SVM_MAX_D
Tertiary Classifier Design
The one-vs-rest classifiers (H/~H, E/~E, C/~C) and the one-vs-one classifiers (H/C, E/H, E/C) are combined in two further ways:
by voting: SVM_VOTE
by feeding their outputs into a neural network: SVM_NN
Combine all of the tertiary classifiers designed so far to form a jury: SVM_JURY
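A sketch of the SVM_MAX_D idea with scikit-learn (the placeholder data, parameters and helper names are assumptions): train one one-vs-rest binary SVM per class and assign each sample to the class whose classifier returns the largest decision value.

    import numpy as np
    from sklearn.svm import SVC

    classes = ["H", "E", "C"]

    def train_max_d(X, y):
        """Train three one-vs-rest RBF SVMs (H/~H, E/~E, C/~C)."""
        return {c: SVC(kernel="rbf", gamma=0.10, C=1.0).fit(X, (y == c).astype(int)) for c in classes}

    def predict_max_d(models, X):
        """SVM_MAX_D: pick the class with the largest decision value (distance to hyperplane)."""
        scores = np.column_stack([models[c].decision_function(X) for c in classes])
        return [classes[i] for i in scores.argmax(axis=1)]

    # Placeholder data: 300 residues with 21*11-dimensional window encodings
    X = np.random.randint(0, 2, size=(300, 231)).astype(float)
    y = np.random.choice(classes, size=300)
    models = train_max_d(X, y)
    print(predict_max_d(models, X[:5]))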
Results
Results (Comparison with PHD)
Results, Details – Support Vectors
In typical pattern recognition problems, the ratio # SVs / # training samples is much lower.
Interpretation: recall that the LOOCV error bound was E[# SVs] / (# training samples). This means the error bound is very high compared to typical pattern recognition problems.
Conclusions
Application of the SVM method to prediction, with good results (an improvement over PHD)
Importance of the accuracy measure used to evaluate algorithms and of the class-reduction method
Use of a machine learning method to attack the protein structure problem at a lower level of complexity: the general protein folding problem in three dimensions requires too much computational power, so this is a cheap alternative.
Critique of article
The number of support vectors is too large, which puts the performance of the SVM into question.
The article does not present the machine learning aspects in a clear form; it is too focused on results.