Transcript Slides

Remote Homology detection:
A motif based approach
CS 6890: Bioinformatics - Dr. Yan
Swati Adhau
04/14/06
Outline of the Presentation
• Motivation
• Introduction
• Description (Remote Homology Detection)
• Methods
• Results & Discussion
• Q and A
Motivation
• Remote homology detection is the problem of
•
•
detecting homology in case of low sequence
similarity.
A method based on presence of discrete
sequence motifs for detecting remote homology.
The motif content of a pair of sequences is used
to define a similarity that is used as a kernel for
support vector machine (SVM) classifier.
• Testing of method is done upon two remote
homology detection tasks
1) Prediction of previously unseen SCOP
family (Structural classification of Proteins).
2) Prediction of an Enzyme class given other
enzymes that have a similar function on
other substrates.
Introduction
• Protein Homology detection is one of the most
•
•
•
important problems in computational biology.
Homology is generally established by sequence
similarity.
Two established methods
1) Smith Waterman algorithm
2) Blast
Protein sequence motifs are an alternative
method of detecting sequence similarity
Intro(continued)
• By focussing on limited highly conserved regions of
•
•
•
proteins, motifs can often reveal important clues to
a proteins role.
Motifs often represent functionally important
regions such as catalytic sites, binding sites and
structural motifs.
The Blocks+ database combines various databases
such as pFAM, PRINTs, ProDom, DOMO and
InterPro. eMotif database contains discrete
sequence motifs constructed from blocks of
BLOCKS+.
This paper uses discrete sequence motifs extracted
from the eBLOCKS database using the eMOTIF
method.
Intro(Continued)
• Based upon the motif content of a pair of
•
•
•
•
sequence we introduce sequence similarity
measure.
This paper uses an SVM method.
SVM method is shown to perform better than
methods for Fisher-Kernel method, SAM T-98 and
PSI-BLAST.
When a sequence similarity is shown to be a dot
product in some space it is called the kernel.
In this paper we use protein motifs to construct a
kernel that can be computed efficiently which
performs better than a kernel based on BLAST or
Smith-Waterman scores.
Remote Homology Detection
This method was tested on the following two
tasks:1) Prediction of a SCOP family when trained
on other families in that family’s fold.
2) Prediction of the function of an enzyme
when the training set contains enzyme that
have same general functions but different
substrates.
BackGround of the first dataset
• The first dataset is composed of sequences
of domains from the SCOP database.
Objective:- To detect homology at the SCOP
superfamily level. Recognizing a SCOP
family when the training set contains other
families in the family’s superfamily.
…contd
• This specifies the +ve examples in the test
set and training set.
• The –ve examples are taken from outside
of the family’s fold.
• A random family is chosen to belong to
-ve test set & rest of the families in it’s
superfamily are added to negative training
set.
The second dataset
• We use the classification of Enzymes to simulate
•
•
remote homology.
The function of an enzyme is given by EC
number given it to by Enzyme Commision.
EC number is like n1.n2.n3.n4
For eg 1.1.3.13 for alcohol oxidase.
n1 – 1-6 :- indicates the type of chemical
reaction catalyzed by the enzyme.
n2 – specifies donor molecule.
n3 – specifies the acceptor.
n4 – specifies the substrate.
…contd
• In this paper author concentrates on
oxidoreductase (n1 = 1).
• A classifier is trained to predict oxidoreductases
with a certain function
(n2 & n3).
• The classifier will be tested on oxidoreductases
with adifferent substrate (n4) than those it was
trained on.
For eg.
EC class 1.14.13.8  Positive examples of training
set.
EC class 1.14.13.39  Positive examples of test
set.
• So the similarity between the +ve training &
test may not be very high.
• Negative test & training set are defined
analogusly.
Methods
The Motif kernel
• When the similarity is a dot product it is called a
kernel.
The method is as follows:- Each position in the
motif represents the variability in the column in
a block from multiple sequence alignment.
For eg the motif
[as].dkf[filmv]..[filmv]…l[ast].
[filmv] is a substitution group.
….contd
• If the pattern of amino acids that appear in a
column of a block does not match any
substitution group, then the motif contains the
wild card symbol ‘.’ .
• A sequence will or match above motif if it has
either an a an s in some position, then any
character, then d, k, f & so on, matching until
the end of motif
• A sequence x contains a motif m, if x
contains m at some position.
• A sequence x can be represented in vector
space indexed by a set of motifs M as
follows:F(x) = (fm(x))mЄM
where fm(x) is the number of occurences
of the motif m in x.
We can define motif kernel as
K(x, x’) = F(x) F(x’)
•
As in the most cases a motif appears only once in
sequence, this kernel will count the number of motifs
that are common to both sequence.
Q Why are we using eBlocks database over other motif
databases to define a motif kernel?
Ans:1) Usage of databases like PROSITE & the eMOTIF
presents a problem in the evaluation of performance of
the kernel.
2) The eBLOCKS database are generated in an
unsupervised way
3) Increased coverage of eBLOCKS set of BLOCKS.
Computing the Motif Kernel
Computing the Motif kernel
• To compute the motif content of each sequence; the
•
subsequent computation of the kernel is simply a dot product
between the vectors.
To facilitate the efficient computation of the motif content of a
sequence, the motif database is stored in TRIE which is
defined as follows.
__
 Let m be a motif over the alphabet A U S U {.}
Every prefix of m has a node.
Let m1 and m2 be prefixes of m; there is an edge from
m1 to m2 if lm2l = lm1l +1.
To compute all the motifs that are contained in x at any
position, this search is started at each position of x.
The Blast kernel
• A query sequence by its BLAST scores
against the training set is represented.
• This representation in conjuction with
SVMs was used to address the problem of
remote homology detection
• Results were better than Fisher-kernel
method.
Classification Methods
• We report results using two classification
methods:1) SVMs
2) K-Nearest-Neighbour.
SVM
f(x) = w.x + b
w  weight vector
b  constant bias
Query is classified according to the sign of f.
• As a consequence of optimization process, the
weight vector can be expressed as a weighted
sum of the Support Vectors(SV):-
w=
S bixi
iSV
• The decision function is now written as
f(x) =
S bixi * x + b
iSV
• In terms of kernel function, the decision is
expressed as:f(x) =
S bi K(xi , x) + b
iSV
KNN classifier
• We use a KNN classifier with a continuous
•
valued decision functions.
A score for class j is defined as
fj(x) =
S K(xi , x)
ikNNj(x)
•
kNNj(x) is the set of k nearest neighbors of x
in class j.
Metrics
•
1)
2)
•
•
•
We consider two metrics for asessing the performance
of a classifier
ROC (area under receiver operator characteristic).
RFP (the median rate of false positive)
The ROC curve describes the tradeoff between
sensitivity and specificity.
More specifically we use ROC50 curve, which counts
true positives only up to the first 50 false positives.
The RFP score of a positive test sequence x is the
fraction of negative test sequences that have a value
of the decision function that is at least as high as the
value of the decision function of x.
Results
• Use of astral database to obtain protein domain sequences of the
SCOP database.
• Retained only superfamilies having atleast two families that have
atleast 10 members in each family.
• A dataset with1639 domains in 23 superfamilies & 56 families
was yielded.
• Protein sequences annotated with EC numbers were extracted
from SwissProt database.
• The extracted dataset has 2187 enzymes in 65 classes.
• To generate Blast kernel, authors ran an all vs all BLAST on two
datasets using default parameters & E value cut off 0.1.
• To generate motif kernel, datasets were computed with eBLOCKS
sequence motifs using the TRIE method.
Results …contd
• A family by family comparison of classification
•
•
•
•
performance of the motif-SVM & BLAST-SVM methods is
provided in figure in next slide.
On the SCOP task the motif-SVM method performs
significantly better than BLAST-SVM method with a pvalue of 3.9 * 10-9. in a wilcoxon signed rank test for the
ROC50 score.
In enzyme classification task there is no significant
difference in ROC50 scores.
Similar behavior is observed in the median RFP and
RFP50.
The results were similar when Smith-Waterman
algorithm was used instead of BLAST.
Results …contd
Results …contd
Results …contd
• The motif kernel in figure 4 shows the
similarity between the families in
superfamily whereas none is detected by
the BLAST kernel.
• Increased sensitivity of motif kernel.
Results …contd
Results …contd
• Figure 5 shows the comparison of the
SVM-based method to the one that uses
KNN as a classifier.
• In both the motif and BLAST kernels, SVM
based classifier performs significantly
better than corresponding KNN classifier.
Discussion
• This paper showed that an SVM classifier that uses motif
•
•
•
kernel performs significantly better than SVM that uses a
BLAST/Smith-Waterman kernel on a remote homology
detection problem derived from SCOP database.
Both methods performed equally well on the task of
Enzyme detection.
BLAST kernel & motif kernel worked significantly better
when used in conjunction with an SVM rather than a
Nearest Neighbor classifier.
Despite the relative success of motif method, there
were many SCOP families & EC classes that were not
detected using this method.
Questions?? Comments!!
Thank you !!!