Transcript 王红刚

BIOINFORMATION
A new taxonomy-based protein fold
recognition approach based on
autocross-covariance transformation
- - 王红刚 14S051054
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusions
Introduction
 Importance
Protein fold recognition is one of the most important tasks in computational biology
It helps fill the gap between known protein sequences and known structures
Detecting homologies with low sequence identity remains a challenging problem
 Some solutions
General sequence comparison methods
Threading the query sequence onto template structures
Improving prediction performance by either incorporating new features or developing novel algorithms
 Problem
Traditional sequence comparison methods fail to identify reliable homologies with low sequence identity
Taxonomic methods are effective alternatives, but their prediction accuracy is around 70%, which is still relatively low for practical use.
 Author's solution
A protein sequence has a single direction, from beginning to end
This makes it analogous to a time sequence of process data
Autocross-covariance (ACC) transformation
Support vector machine (SVM)
Workflow:
each sequence → PSI-BLAST → PSSM → ACC → fixed-length vector → SVM → classification results
PSSM: position-specific score matrix
The element S_{i,j} of the matrix reflects the probability of amino acid i occurring at position j
Feature:
Only the evolutionary information represented in the form of the PSSM is used
It alone can achieve promising results
Materials and methods
To evaluate the proposed method and compare it with existing methods, five datasets are used:
the D-B dataset
the extended D-B dataset
the F86 dataset
the F199 dataset
the Lindahl dataset
The D-B dataset:
311 proteins for training
383 proteins for testing
<40% sequence identity
Each fold has at least seven members
<35% sequence identity in the training set
Classes: all α, all β, α/β, α+β and small proteins.
The extended D-B dataset:
27 folds
<40% sequence identity
Contains 3202 sequences
[Table: fold names and the number of proteins in each fold]
F86 and F199:
 The F86 dataset
contains 86 folds and 5671 sequences
each fold has at least 25 members
 The F199 dataset
contains 199 folds and 7437 sequences
each fold has at least 10 members
The Lindahl dataset:
 Used as a benchmark to compare taxonomic fold recognition methods with threading methods
 Contains 976 sequences
 Sequence identity <40%
ACC transformation
 ACC transforms PSSMs of different lengths into fixed-length vectors by measuring the correlation between any two properties
 ACC results in two kinds of variables
auto-covariance (AC): between the same property
cross-covariance (CC): between two different properties
AC variable: the correlation of the same property between two residues separated by a distance of lg along the sequence, which can be calculated as:
AC(i, lg) = \sum_{j=1}^{L-lg} (S_{i,j} - \bar{S}_i)(S_{i,j+lg} - \bar{S}_i) / (L - lg)
i: one of the residues
L: the length of the protein sequence
S_{i,j}: the PSSM score of amino acid i at position j
\bar{S}_i: the average score for amino acid i along the whole sequence
The number of AC variables is therefore 20∗LG, where LG is the maximum of lg (lg = 1, 2, ..., LG).
CC variable: the correlation of two different properties between two residues separated by lg along the sequence, which can be calculated as:
CC(i1, i2, lg) = \sum_{j=1}^{L-lg} (S_{i1,j} - \bar{S}_{i1})(S_{i2,j+lg} - \bar{S}_{i2}) / (L - lg)
i1, i2: two different amino acids
The total number of CC variables is 380∗LG.
Each protein sequence is represented as a vector of either AC variables or ACC variables, where ACC is the combination of AC and CC.
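As an illustration of the transformation described on this slide, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the usual ACC definition in which each covariance is averaged over the L - lg residue pairs; the function name `acc_transform`, the row-per-position layout of the PSSM, and the toy scores are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def acc_transform(pssm: List[List[float]], max_lag: int) -> Tuple[List[float], List[float]]:
    """Return (AC, CC) variables for one PSSM.

    pssm[j][i] is the score of amino acid i at position j, so the matrix
    has L rows (sequence positions) and, for proteins, 20 columns.
    """
    length = len(pssm)
    n_props = len(pssm[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be between 1 and L - 1")

    # Average score of each amino acid along the whole sequence.
    means = [sum(row[i] for row in pssm) / length for i in range(n_props)]
    centred = [[row[i] - means[i] for i in range(n_props)] for row in pssm]

    ac: List[float] = []
    cc: List[float] = []
    for lg in range(1, max_lag + 1):
        span = length - lg  # number of residue pairs separated by lg
        for i1 in range(n_props):
            for i2 in range(n_props):
                cov = sum(centred[j][i1] * centred[j + lg][i2] for j in range(span)) / span
                # Same property -> auto-covariance, different properties -> cross-covariance.
                (ac if i1 == i2 else cc).append(cov)
    return ac, cc


# Toy PSSM (made-up scores): 8 positions x 20 amino acids.
toy_pssm = [[float((j * (i + 1)) % 7) for i in range(20)] for j in range(8)]
ac_vars, cc_vars = acc_transform(toy_pssm, max_lag=3)
print(len(ac_vars), len(cc_vars))  # 60 1140, i.e. 20*LG and 380*LG for LG = 3
```

Whatever the sequence length, the output length depends only on LG, which is what lets proteins of different lengths share one feature space.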
 Support vector machine
 Performance metrics
Overall accuracy (Q), sensitivity (Sn) and specificity (Sp)
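The formulas for these metrics are not shown in the transcript, so the sketch below rests on assumptions: Q is the fraction of correctly predicted proteins, and for each fold Sn = TP / (TP + FN) and Sp = TP / (TP + FP), a convention often seen in fold recognition work. The function name `fold_metrics` and the toy labels are hypothetical.

```python
from typing import Dict, List, Tuple


def fold_metrics(true_folds: List[str], predicted_folds: List[str]) -> Tuple[float, Dict[str, Tuple[float, float]]]:
    """Return overall accuracy Q and a per-fold (Sn, Sp) dictionary.

    Assumed definitions (not taken from the slides):
      Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    """
    pairs = list(zip(true_folds, predicted_folds))
    q = sum(t == p for t, p in pairs) / len(pairs)

    per_fold: Dict[str, Tuple[float, float]] = {}
    for fold in sorted(set(true_folds) | set(predicted_folds)):
        tp = sum(t == fold and p == fold for t, p in pairs)
        fn = sum(t == fold and p != fold for t, p in pairs)
        fp = sum(t != fold and p == fold for t, p in pairs)
        sn = tp / (tp + fn) if tp + fn else 0.0
        sp = tp / (tp + fp) if tp + fp else 0.0
        per_fold[fold] = (sn, sp)
    return q, per_fold


# Toy labels, made up for illustration only.
truth = ["a", "a", "b", "b", "c"]
preds = ["a", "b", "b", "b", "c"]
q_value, per_fold_scores = fold_metrics(truth, preds)
print(q_value)               # 0.8
print(per_fold_scores["a"])  # (0.5, 1.0)
```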
Results and Discussion
 The impact of LG
 Performance comparison with existing taxonomy-based methods
 Performance comparison with threading methods
The impact of LG
 The maximum value of LG is the length of the shortest sequence minus one.
 D-B dataset: the optimal values of LG for AC and …
 Extended D-B dataset: the best value of LG is 10.
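The upper bound on LG and the resulting feature dimensions can be checked with a few lines of arithmetic. This is a sketch only; the helper names `max_allowed_lag` and `feature_dimensions` and the example sequences are hypothetical.

```python
from typing import Dict, List


def max_allowed_lag(sequences: List[str]) -> int:
    """Largest usable LG: length of the shortest sequence minus one."""
    return min(len(seq) for seq in sequences) - 1


def feature_dimensions(max_lag: int, n_amino_acids: int = 20) -> Dict[str, int]:
    """Number of AC, CC and combined ACC variables for a given LG."""
    ac = n_amino_acids * max_lag                        # 20 * LG
    cc = n_amino_acids * (n_amino_acids - 1) * max_lag  # 380 * LG
    return {"AC": ac, "CC": cc, "ACC": ac + cc}


# Made-up sequences, for illustration only.
example_sequences = ["MKVLAAGIVGLLLAQ", "MSTNPKPQRKTKRNTNRRPQDV", "MKTAYIAKQRQ"]
print(max_allowed_lag(example_sequences))  # 10
print(feature_dimensions(10))              # {'AC': 200, 'CC': 3800, 'ACC': 4000}
```

With LG = 10, the value reported as best on the extended dataset, the ACC vector has 4000 components and the AC vector 200.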
Performance comparison with existing taxonomy-based
methods
Results on the D-B dataset
The detailed results are given in the Supplementary Material.
To give a more comprehensive comparison, we consider several
other methods in the literature.
The proposed ACCFold method outperforms these methods by 2–14%.
Results on the extended D-B dataset
 Extended D-B dataset:
The same folds as the D-B dataset
More sequences: 3202
 All the methods improve
 ACCFold is higher than the other methods by 9–17%.
In particular, performance on the folds in the α/β, α+β and small protein classes is significantly improved.
Results on the F86 and F199 datasets
 More folds:
86 folds, 199 folds
 Time complexity:
SWPSSM: O(n^2 * L^2)
ACC: O(n * L^2 + n^2 * L)
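To get a feel for the two bounds, the sketch below plugs numbers into them. It assumes n is the number of sequences and L a typical sequence length, which the slide does not state. The value n = 7437 is the F199 dataset size; L = 200 is an arbitrary illustrative choice. These are asymptotic operation counts, not measured runtimes.

```python
def swpssm_cost(n: int, length: int) -> int:
    """Operation count suggested by the O(n^2 * L^2) bound."""
    return n ** 2 * length ** 2


def acc_cost(n: int, length: int) -> int:
    """Operation count suggested by the O(n * L^2 + n^2 * L) bound."""
    return n * length ** 2 + n ** 2 * length


n_sequences = 7437     # size of the F199 dataset
typical_length = 200   # assumed sequence length, for illustration only
ratio = swpssm_cost(n_sequences, typical_length) / acc_cost(n_sequences, typical_length)
print(f"SWPSSM / ACC cost ratio: {ratio:.1f}")
```

Algebraically the ratio simplifies to n * L / (n + L), so the gap widens as either the dataset or the sequences grow.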

The results indicate that the proposed method can be applied to cases with a large number of folds without significantly affecting its performance, as long as the number of samples in each fold is not too small.
Performance comparison with threading methods
Threading methods: use sequence–template alignments to detect remote homologies of proteins.
Results on the Lindahl dataset
At the family level, we select the families that contain at least two samples
 Taxonomic methods are not as good as threading methods
 They are difficult to apply to practical fold recognition
 However
the total number of folds is limited
the number of proteins with known structure is increasing
so there is more room and opportunity to exploit taxonomic methods to develop an effective fold recognition system.
Conclusions
 A method that combines SVM with ACC is introduced for taxonomic protein fold recognition.
 ACC transformation is used to convert the PSSMs into fixed-length vectors.
 The results obtained here represent the state-of-the-art performance of taxonomic protein fold recognition.