Mismatch String Kernals for SVM Protein Classification
Download
Report
Transcript Mismatch String Kernals for SVM Protein Classification
Mismatch String Kernals for
SVM Protein Classification
Christina Leslie, Eleazar Eskin,
Jason Weston, William Stafford Noble
Presented by
Pradeep Anand Ravindranath
Mei Sze Lam
Introduction
Problem in Computational Biology
Classification of Proteins into functional
and structural classes based on homology
of protein sequence data
Methods for Protein Classification
and Homology Detection
Pairwise sequence alignment
Profiles for protein families
Consensus patterns using motifs
Profiles HMMs
Focus
Remote Homology Detection
How is the problem handled
currently?
Fisher-SVM
One of the successful discriminative
techniques for protein classification and
Best performing method for remote
homology detection
Fisher-SVM
Build a profile HMM for the positive
training sequences, defining loglikelihood
function [log P(x/θ)] for any protein
sequence x.
θ0 - maximum likelihood for model
parameters
Gradient vector d(log P(x/θ)/θ=θ0)/dθ
assigns to each (positive or negative)
training sequence x an explicit vector
feature called fisher scores.
This feature mapping defines a kernel
function, called the fisher kernel.
This Fisher kernel can then be used to
train a SVM classifier.
Strengths
Combines biological information encoded
in a HMM with the discriminative power of
the SVM algorithm.
Negatives
Needs lots of data or sophisticated priors
to train the HMM.
It is expensive to compute the kernel
matrix, as calculating the fisher scores
requires computing forward and backward
probabilities from the Baun-Welch
algorithm.
Mismatch-SVM
The (k,m)-mismatch kernel is based on a
feature map to a vector space indexed by
all possible subsequence of amino acids
of a fixed length k.
Each instance of a fixed k-length
subsequence in an input sequence
attributes to all feature coordinates
differing from it by at most m mismatches.
Thus, the mismatch kernel adds the
biologically important idea of mismatching
to the computationally simpler
spectrum kernel
In this paper, it is described how to
compute the new kernel efficiently using a
mismatch tree data structure
for values of (k,m) useful in this
application. Mismatch kernel
Advantages
By using mismatch tree data structure the
kernel is fast enough to use on real
datasets.
Considerabily less expensive than the
fisher kernel.
Performance equal to Fisher-SVM.
Outperforms other methods.
This kernel does not depend on any
generative model and can be used for
other sequence based classification
problems.
Feature Maps for Strings
(k,m)-mismatch kernel is based on a
feature map from the space of all finite
sequences from an alphabet A of size
|A| = l to the lk –dimensional vector space
indexed by the set of k-length
subsequences (“k-mers”) from A.
where,
A - alphabet representing amino acids.
l - no. of amino acids.
If α is a k-mer
β is all k length sequences
N(k,m) (α) – set of all k length sequences
differing from α by at most m
mismatches.
we define our feature map Φ(k,m) as
Φ(k,m) (α) = (φβ(α))βЄAk
where φβ(α) = 1if β belongs to N(k,m) (α) ,
φβ(α) = 0 otherwise.
For a sequence x of any length, we
extends the map additively by summing
the feature vectors for all the k-mers in x:
Φ(k,m) (x) = Σ(k-mers α in x) Φ(k,m) (α)
The (k,m)-mismatch kernal is given by
K (k,m) (x,y) = ‹Φ(k,m) (x), Φ(k,m) (y)›.
For m = 0, we retrieve the k-spectrum kernal.
Fisher Scores and Spectrum
Kernel
Even though the spectrum and mismatch
feature maps are defined without any
reference to a generative model, there is
some similarity between the k-spectrum
feature map and the fisher scores
associated to an order k-1 markov chain
model.
Efficient computation of the Mismatch
Kernel: Mismatch Tree Data Structure
Mismatch tree data structure is used to
represent the feature space(the set of all
k-mers) and perform a lexical traversal of
all k-mers occurring in the sample dataset
match with up to m of mismatches.
Example: Traversing the
Mismatch Tree
Traversal for input sequence:
AVLALKAVLL, k=8, m=1
Example: Computing the Kernel
for Pair of Sequences
Traversal of trie for k=3 (m=0)
A
S1:
EADLALGKAVF
S2:
ADLALGADQVFNG
D
L
Update kernel value for
K(s1,s2) by adding
contribution for feature ADL
Efficiency Issues for Kernel
Computation
Depth first
search
Recursive
function
efficient use of
memory
no problem for
large data sets
Computational Cost
Theoretical
Computational
Cost, O
Total length of
the sample
data, N
Number of mismatches, m. This
increases, computational cost
increases exponentially
Number of different amino acids in the body
The number of characters in the sequence, k. The classifier breaks up
proteins into lengths of 5~6. (Longer strings are broken down by summing
the feature vectors)
Computational Cost (cont’d)
Worst case scenario for M sequences ?
M 2n
=
MxMxn
=
MXN
2
M
n is just the M number of
Where
sequences to be processed.
Supposing M number of sequences are all
equal with max. no of non zero entities
Training and Test Data
33 Classes of Superfamily Proteins (taken from SCOP database)
Class we are interested in
Any other class
other than Class 1
-
+
-
Class 1
-
+
+
Class 3 ..
Class 2
+
-
-
-
Families of Proteins, each belonging to 1 of 33 Superfamilies
For each class, we want to know whether a given
protein sequence belongs to that class – Y/N?
160 experiments were performed on 33 classes.
-
Implementation and Comparison of
Methods
We test the mismatch kernel with a publicly available
SVM implementation
4 methods
Mismatch Kernel
Fisher Kernel
SAM-T98
PSI-BLAST
Uses SVM implementation
HMM
Alignment Scoring
Peformance Measurement - ROC
ROC
ROC50
100 1 99
1
100
100
10 5 1
10
2
Show on board:
The closer to 1, the
better the score –
more true positives
to false positives
Performance Comparison
Many of the
Mismatch
SVM and
Fisher Kernel
classifications
fall close to 1,
meaning
there is a low
FP error rate;
Both
classifiers
manage to
classify
almost all of
the 33
classes with
ROC score >
0.85
threshold
Comparison of four homology detection methods
Mismatch VS Spectrum
ROC
ROC50
Mismatch kernel outperforms the Spectrum kernel
Mismatch VS Fisher
ROC
ROC50
No Significant Difference!
Discussion & Conclusion
What was it for?
What was achieved?
Constructing kernel for homology detection
A kernel that was equal in performance to the best known
classifier but with a lower computational cost
Future Work
Since does not depend on generative model (unlike Fisher),
can be easily used for other stuff, eg. Splice site prediction
Since it is computationally cheaper (ie. faster), can be used for
practical biological purposes, eg. multiclass prediction