Ensemble classifier for
protein fold pattern recognition
Zhao Yahui
14S051028
Contents
1 Abstract
2 Introduction
3 Materials and Methods
4 Results and Discussion
5 Conclusions
Abstract
Prediction of protein folding patterns (27 fold types) is
one level deeper than that of protein structural classes,
and hence is much more complicated and difficult.
An ensemble classifier: a set of basic classifiers,
each trained in a different parameter system.
The operation engine for the constituent
individual classifiers was the OET-KNN rule.
Introduction
Current difficulties:
the local minimum problem
the query has no structure-known homologous protein in the existing database
Here:
resort to the taxonomic approach (based on the assumption
that the number of protein folds is limited)
Materials and Methods
Dataset:
training set: 311 proteins
testing set: 385 proteins
None of the proteins in these datasets has >35%
sequence identity to any other, and most of the
proteins in the testing dataset have <25%
sequence identity with those in the training
dataset.
Materials and Methods
Features:
amino acid composition (replaced)
predicted secondary structure
hydrophobicity
van der Waals volume
polarity
polarizability
Improvement:
pseudo-amino acid composition (to avoid completely
ignoring the sequence-order effects)
Materials and Methods
PseAA
adopt the alternate correlation functions between
hydrophobicity and hydrophilicity (to reflect the
sequence-order effects).
Suppose a protein P with a sequence of L amino acid residues:
$$P = R_1 R_2 R_3 \cdots R_L$$
$h^1(R_i)$ is the hydrophobicity of residue $R_i$, and $h^2(R_i)$ is the
hydrophilicity of $R_i$; $H^1_{i,j}$ and $H^2_{i,j}$ are the hydrophobicity
and hydrophilicity correlation functions.
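In the standard amphiphilic PseAA formulation (assumed here; the slide's exact expression may differ), these correlation functions are the products:
$$H^{1}_{i,j} = h^{1}(R_i)\,h^{1}(R_j), \qquad H^{2}_{i,j} = h^{2}(R_i)\,h^{2}(R_j).$$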
Materials and Methods
$\tau_{2i-1}$ and $\tau_{2i}$ are the corresponding $i$th-tier
correlation factors that reflect the sequence-order
correlation between all the $i$th most contiguous
residues.
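In that same standard formulation (again an assumption), the tier factors average the correlations of residues separated by i positions along the chain:
$$\tau_{2i-1} = \frac{1}{L-i}\sum_{j=1}^{L-i} H^{1}_{j,\,j+i}, \qquad
\tau_{2i} = \frac{1}{L-i}\sum_{j=1}^{L-i} H^{2}_{j,\,j+i}, \qquad i = 1, 2, \ldots, \lambda \ (\lambda < L).$$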
Materials and Methods
Standard conversion of $h^1(R_i)$ and $h^2(R_i)$:
the converted values remain unchanged if put through the
same conversion procedure again.
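A zero-mean, unit-standard-deviation conversion over the 20 native amino acids has exactly this idempotence property; the form below is the usual choice and is assumed here, with $h^1_0$ denoting the original hydrophobicity scale and $A_1, \ldots, A_{20}$ the 20 amino acids:
$$h^{1}(R_i) = \frac{h^{1}_{0}(R_i) - \frac{1}{20}\sum_{k=1}^{20} h^{1}_{0}(A_k)}
{\sqrt{\frac{1}{20}\sum_{k=1}^{20}\left[h^{1}_{0}(A_k) - \frac{1}{20}\sum_{l=1}^{20} h^{1}_{0}(A_l)\right]^{2}}}$$
and likewise for $h^{2}(R_i)$. Because the converted values already have zero mean and unit standard deviation over the 20 amino acids, applying the conversion a second time changes nothing.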
Materials and Methods
By fusing the 2λ amphiphilic correlation factors into
the classical amino acid composition, we have the
following augmented discrete form to represent a
protein sample P:
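The standard (20+2λ)-D amphiphilic PseAA representation (assumed here; the weight factor $w$ and the exact normalization may differ from the slide) is:
$$P = [\,p_1, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+2\lambda}\,]^{T}, \qquad
p_u = \begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{2\lambda}\tau_j}, & 1 \le u \le 20,\\[2ex]
\dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{2\lambda}\tau_j}, & 20 < u \le 20+2\lambda,
\end{cases}$$
where $f_u$ is the normalized occurrence frequency of the $u$th amino acid in P, $\tau_j$ the $j$th amphiphilic correlation factor, and $w$ a weight factor.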
OET-KNN Algorithm
optimized evidence-theoretic k-nearest neighbors
ET-KNN:
based on the idea that each neighbor of a pattern to be classified is
considered as an item of evidence supporting certain
hypotheses concerning the class membership of that pattern.
ET-KNN Algorithm
Let us consider the problem of classifying N entities into
27 classes (fold types).
The available information is assumed to consist of a
training dataset.
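A minimal way to write this setting (the notation below is assumed, not taken from the slide):
$$\Omega = \{\omega_1, \omega_2, \ldots, \omega_{27}\}, \qquad
T = \{(P_1, c_1), (P_2, c_2), \ldots, (P_N, c_N)\}, \quad c_n \in \Omega.$$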
ET-KNN Algorithm
Suppose P is a query protein to be classified, and
the set of its k-nearest neighbors is $N_k(P)$.
For any neighbor $P_j \in N_k(P)$, the knowledge that $P_j$ belongs to class $\omega_q$
can be considered as a piece of evidence that
increases our belief that P also belongs to $\omega_q$; this
item of evidence can be formulated by a basic belief assignment (mass function).
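A commonly cited form of this evidence in the evidence-theoretic k-NN rule (assumed here; the slide's exact expression may differ) is, for a neighbor of class $\omega_q$ at distance $d_j$ from P:
$$m_j(\{\omega_q\}) = \alpha\,e^{-\gamma_q d_j^{2}}, \qquad m_j(\Omega) = 1 - \alpha\,e^{-\gamma_q d_j^{2}},$$
with $0 < \alpha < 1$ and class-dependent scale parameters $\gamma_q$; tuning the $\gamma_q$ on the training set is what distinguishes the optimized variant (OET-KNN).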
OET-KNN Algorithm
The belief function of P belonging to class $\omega_q$ is a
combination of the evidence from its k nearest neighbors.
A decision is made by assigning the query
protein P to the class with the maximal belief.
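As an illustration only, a minimal Python sketch of this kind of evidence-theoretic k-NN decision, combining the neighbors' masses with Dempster's rule and assigning the class with the largest belief (Euclidean distance, a single global gamma, and alpha = 0.95 are assumptions; the training-set optimization of the gamma parameters used by OET-KNN is omitted):

import numpy as np

def et_knn_predict(X_train, y_train, x_query, n_classes, k=5, alpha=0.95, gamma=1.0):
    """Simplified evidence-theoretic k-NN decision for a single query pattern."""
    # Euclidean distances from the query to all training samples.
    d = np.linalg.norm(X_train - x_query, axis=1)
    neighbors = np.argsort(d)[:k]

    # Start from total ignorance: all mass on the whole frame Omega.
    m_class = np.zeros(n_classes)   # mass on each singleton {omega_q}
    m_omega = 1.0                   # mass on Omega

    for j in neighbors:
        q = y_train[j]
        s = alpha * np.exp(-gamma * d[j] ** 2)   # support for class q from this neighbor

        # Dempster's rule for masses whose focal elements are singletons and Omega.
        new_class = m_class * (1.0 - s)
        new_class[q] += (m_class[q] + m_omega) * s
        new_omega = m_omega * (1.0 - s)
        conflict = s * (m_class.sum() - m_class[q])

        m_class = new_class / (1.0 - conflict)
        m_omega = new_omega / (1.0 - conflict)

    # Assign the query to the class with the largest final belief (mass).
    return int(np.argmax(m_class))

# Tiny illustrative run on random data (3 hypothetical fold classes, 8-D features).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = rng.integers(0, 3, size=30)
print(et_knn_predict(X, y, rng.normal(size=8), n_classes=3))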
Materials and Methods
Input feature selection:
(20+2λ)-D PseAA
21-D predicted secondary structure
21-D hydrophobicity
21-D normalized van der Waals volume
21-D polarity
21-D polarizability
Combining all of these into a single [(21×5)+(20+2λ)]-D input
would introduce too many parameters, thereby reducing the
cluster-tolerant capacity and the cross-validation success rate.
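A quick dimension check for that combined input (λ is kept as a parameter because its value is not given in this transcript; λ = 4 below is purely illustrative):

def combined_dim(lam):
    pseaa = 20 + 2 * lam        # (20 + 2*lambda)-D pseudo-amino acid composition
    per_property = 21           # each predicted/physicochemical property vector is 21-D
    n_properties = 5            # secondary structure, hydrophobicity, vdW volume, polarity, polarizability
    return per_property * n_properties + pseaa   # [(21*5) + (20 + 2*lambda)]-D in total

print(combined_dim(4))   # illustrative lambda = 4 -> 133 dimensions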
It is instructive to introduce a new concept called the
cluster-tolerant capacity for the dataset studied here.
If such a capacity is high for a dataset, then the removal of
any entry from the dataset will not significantly change the
original clustered picture
(such as the distribution of the standard vectors for each
subset, the spatial gaps between the boundaries of any two
subsets, and the original attribution of the removed entry
to a subset).
Conversely, if the cluster-tolerant capacity is low for a
dataset, the removal of some entry from it will have a
significant impact on the original clustered picture.
Materials and Methods
Framework of ensemble classifier:
Why choose this framework:
it reduces the variance caused by the peculiarities of a single
training set and hence can learn a more expressive
concept in classification than a single classifier.
Materials and Methods
Suppose the ensemble classifier C is expressed by the fusion of
the nine basic classifiers C(i) (i = 1, 2, ..., 9).
The process of how the ensemble classifier C works by fusing
these nine basic classifiers can be formulated as follows: the
query protein P is predicted as belonging to the fold type whose
fused score is the highest.
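A minimal sketch of such a fusion by voting (equal weights and the dummy classifiers are assumptions; the paper's actual fusion rule may weight the nine votes differently):

from collections import Counter

def ensemble_predict(basic_classifiers, protein):
    """Fuse the fold predictions of the basic classifiers C(1)..C(9) by voting."""
    # Each callable stands in for one basic OET-KNN classifier; it extracts its own
    # input features (e.g. a PseAA of some dimension or a 21-D property vector)
    # from `protein` and returns a fold label in {1, ..., 27}.
    votes = Counter(clf(protein) for clf in basic_classifiers)
    best_fold, _ = votes.most_common(1)[0]   # fold type with the highest score
    return best_fold

# Illustrative run with dummy classifiers (placeholders, not the paper's models).
dummy = [lambda p, c=c: c % 2 + 1 for c in range(9)]
print(ensemble_predict(dummy, protein="MKV..."))   # -> 1 (five of the nine votes)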
Results and Discussion
Conclusions
It is shown through the present study that the
ensemble classifier formed by fusing
different input types, particularly different
dimensions of pseudo-amino acid
composition, is very promising for enhancing
the prediction quality of protein fold patterns.