Transcript: Wu Dongyin (吴冬茵)

Application of latent semantic analysis
to protein remote homology detection
Wu Dongyin
4/13/2015
ABSTRACT
Related Work on Remote Homology Detection
LSA
LSA-based SVM and Data set
Result and Discussion
CONCLUSION
ABSTRACT
Motivation
 Remote homology detection: a central problem in computational biology, the classification of proteins into
functional and structural classes given their amino acid sequences
 Discriminative methods such as SVMs are among the most effective approaches
 Explicit feature sets are usually large and may introduce noisy data, which leads to the peaking phenomenon
Results
 Introduce LSA, an efficient feature extraction technique from NLP
 The LSA model significantly improves the performance of remote homology detection in comparison with the basic
formalisms; its performance is comparable with complex kernel methods such as SVM-LA and better than
other sequence-based methods
Related Work on Remote Homology Detection
Structure is more conserved than sequence -- so detecting very subtle sequence similarities,
i.e. remote homology, is important.
Most methods can detect homology at a high level of similarity, while remote homologs are often
difficult to separate from pairs of proteins that share similarities owing to chance -- the 'twilight
zone'.
Three groups of methods:
 pairwise sequence comparison algorithms: dynamic programming algorithms; BLAST, FASTA, PSI-BLAST, etc.
 generative models for protein families: HMM, etc.
 discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.
The success of an SVM classification method depends on the choice of the feature set used to describe each protein. Most of these research efforts focus on finding useful representations of protein sequence data for SVM training, using either explicit feature vector representations or kernel functions.
LSA
 Latent semantic analysis (LSA) is a theory and method for extracting and representing the
contextual-usage meaning of words by statistical computations applied to a large corpus of text.
 LSA analyzes the relationships between a set of documents and the terms they contain by
producing a set of concepts related to the documents and terms. LSA assumes that words
that are close in meaning will occur in similar pieces of text.
LSA
bag-of-words model

Word-Document Matrix (M×N): M words in total, N documents

            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
tree         0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minor        0   0   0   0   0   0   0   1   1

(This representation does not recognize synonymous or related words, and the dimensions are too large.)
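Such a count matrix is straightforward to build from word counts. A minimal sketch in Python, using a hypothetical toy corpus rather than the slide's example:

```python
from collections import Counter

# Hypothetical toy corpus; real documents would be tokenized the same way.
docs = {
    "c1": ["human", "interface", "computer"],
    "c2": ["computer", "user", "system", "response", "time", "survey"],
    "c3": ["interface", "user", "system", "EPS"],
}

counts = {d: Counter(words) for d, words in docs.items()}
vocab = sorted({w for words in docs.values() for w in words})  # M words
doc_ids = sorted(docs)                                         # N documents

# W[i][j] = count of word i in document j: the M x N word-document matrix
W = [[counts[d][w] for d in doc_ids] for w in vocab]

for w, row in zip(vocab, W):
    print(f"{w:>10}", row)
```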
LSA

W ≈ U S Vᵀ, where U is M×K, S is K×K, and Vᵀ is K×N (K ≤ R = min(M, N)).
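A sketch of this truncated SVD step with NumPy; the number of retained dimensions K is a free parameter chosen well below R = min(M, N), and the matrix below is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((12, 9))            # stand-in for an M x N word-document matrix
K = 2                              # number of latent dimensions to keep

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # exact SVD: W = U @ diag(s) @ Vt
U_k, S_k, Vt_k = U[:, :K], np.diag(s[:K]), Vt[:K, :]

W_k = U_k @ S_k @ Vt_k             # best rank-K approximation of W
# The N columns of S_k @ Vt_k are the K-dimensional latent document vectors.
print(np.linalg.norm(W - W_k))     # reconstruction error shrinks as K grows
```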
LSA + sequences of proteins

The same word-document construction is applied to proteins: each protein sequence is treated as a document, and its building blocks as the words.
[Slide repeats the word-document matrix example above, with the columns now labeled as protein sequences (documents).]
LSA
For a new document (sequence) that is not in the training set, one would in principle have to add the unseen
document (sequence) to the original training set and recompute the LSA model. Instead, the new vector
t can be approximated as
t = dU
where d is the raw word-count vector of the new document, analogous to a column of the matrix W.
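A sketch of this folding-in step, continuing the NumPy SVD sketch above (the word counts in d are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 12, 2
# U_k (M x K): left singular vectors from the truncated SVD of the training matrix W.
U_k = np.linalg.svd(rng.random((M, 9)), full_matrices=False)[0][:, :K]

d = rng.integers(0, 3, size=M)     # raw word-count vector of the unseen sequence
t = d @ U_k                        # t = dU: its K-dimensional latent vector
print(t)
```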
LSA-based SVM and Data set
 Structural Classification of Proteins (SCOP) 1.53
 sequences from the ASTRAL database
 54 families
 4352 distinct sequences
 remote homology is simulated by holding out all
members of a target family from a given
superfamily
 3 basic building blocks of proteins (an N-gram sketch follows this list)
 N-grams
N = 3, giving 20³ = 8000 words
 Patterns
alphabet Σ ∪ {'.'}, where Σ is the set of the 20
amino acids and '.' matches any amino acid;
χ² selection, 8000 patterns
 Motifs
limited, highly conserved regions of
proteins; 3231 motifs
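A minimal sketch of the N-gram extraction (the amino acid sequence below is made up):

```python
from collections import Counter

def ngrams(seq: str, n: int = 3):
    """Overlapping n-grams of a protein sequence, read as the 'words' of a document."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

seq = "MKVLAAGIVKVLAAGIV"            # hypothetical amino acid sequence
counts = Counter(ngrams(seq))        # one column of the word-document matrix
print(counts.most_common(5))         # drawn from at most 20^3 = 8000 possible words
```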
Result and Discussion
Two methods are used to evaluate the experimental results (a sketch of both follows this list):
 the receiver operating characteristic (ROC) score
 the median rate of false positives (M-RFP) score: the fraction of negative test
sequences that score as high as or better than the median score of the positive sequences
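A sketch of both scores for one family, assuming higher SVM output means more likely positive (the score values below are made up):

```python
import numpy as np

def roc_score(pos, neg):
    """ROC (AUC): fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def mrfp_score(pos, neg):
    """M-RFP: fraction of negatives scoring as high as or better than the median positive."""
    return float(np.mean(np.asarray(neg) >= np.median(pos)))

pos = [2.1, 1.4, 0.9, 1.7]           # hypothetical scores of positive test sequences
neg = [0.2, 1.5, -0.3, 0.1, 0.8]     # hypothetical scores of negative test sequences
print(roc_score(pos, neg), mrfp_score(pos, neg))
```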
Result and Discussion
When a family lies in the upper-left area, the method labeled on the y-axis
outperforms the method labeled on the x-axis for that family.
Result and Discussion
1. Family level
fold1
    superfamily1.1
        family1.1.1 -- positive train (20)
        family1.1.2 -- positive test (13)
        family1.1.3
    superfamily1.2
        family1.2.1
        family1.2.2
fold2
    superfamily2.1
        family2.1.1
negative train & negative test: 3033 & 1137
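A sketch of how the family-level split can be derived from SCOP-style labels; the (fold, superfamily, family) tuples below are hypothetical, and negatives are drawn from outside the target family's fold, the usual convention for this benchmark:

```python
# Hypothetical SCOP-style labels: sequence id -> (fold, superfamily, family).
seqs = {
    "s1": (1, 1, 1), "s2": (1, 1, 1),   # family1.1.1
    "s3": (1, 1, 2), "s4": (1, 1, 2),   # family1.1.2 (target family)
    "s5": (1, 2, 1),                    # family1.2.1
    "s6": (2, 1, 1),                    # family2.1.1
}
fold, sf, fam = 1, 1, 2                  # hold out family1.1.2

pos_test  = [s for s, lab in seqs.items() if lab == (fold, sf, fam)]
pos_train = [s for s, lab in seqs.items() if lab[:2] == (fold, sf) and lab[2] != fam]
negatives = [s for s, lab in seqs.items() if lab[0] != fold]
print(pos_test, pos_train, negatives)
```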
Result and Discussion
2. Superfamily level
fold1
    superfamily1.1 -- positive test (33)
        family1.1.1
        family1.1.2
        family1.1.3
    superfamily1.2 -- positive train (88)
        family1.2.1
        family1.2.2
fold2
    superfamily2.1
        family2.1.1
negative train & negative test: 3033 & 1137
Result and Discussion
3. Fold level
fold1
    superfamily1.1 -- positive test (33)
        family1.1.1
        family1.1.2
        family1.1.3
    superfamily1.2 -- positive train (61)
        family1.2.1
        family1.2.2
fold2
    superfamily2.1
        family2.1.1
negative train & negative test: 3033 & 1137
Result and Discussion
computational efficiency

Method            Vectorization step   Optimization step
SVM-pairwise      O(n²l²)              O(n³)
SVM-LA            O(n²l²)              O(n²p)
SVM-Ngram         O(nml)               O(n²m)
SVM-Pattern       O(nml)               O(n²m)
SVM-Motif         O(nml)               O(n²m)
SVM-Ngram-LSA     O(nmt)               O(n²R)
SVM-Pattern-LSA   O(nmt)               O(n²R)
SVM-Motif-LSA     O(nmt)               O(n²R)
In efficiency, the LSA-based methods are better than SVM-pairwise and SVM-LA,
but worse than the corresponding methods without LSA and than PSI-BLAST.
n: the number of training examples
l: the length of the longest training sequence
m: the total number of words
t: min(m, n)
p: the length of the feature vector
    p = n in SVM-pairwise
    p = m in the methods without LSA
    p = R in the methods with LSA
CONCLUSION
In this paper, the LSA model from natural language processing is successfully used in
protein remote homology detection, and improved performance has been achieved in
comparison with the basic formalisms.
Each document is represented as a linear combination of hidden abstract concepts,
which arise automatically from the SVD mechanism.
LSA defines a transformation between high-dimensional discrete entities (the
vocabulary) and a low-dimensional continuous vector space S, the R-dimensional
space spanned by the columns of U, leading to noise removal and an efficient
representation of the protein sequence. As a result, the LSA model achieves better
performance than the methods without LSA.