Transcript 赵亚辉

Ensemble classifier for
protein fold pattern recognition
Zhao Yahui
14S051028
Contents
1 Abstract
2 Introduction
3 Materials and Methods
4 Results and Discussion
5 Conclusions
Abstract
 Prediction of protein folding patterns (27) is one level deeper than that of protein structural classes, and hence is much more complicated and difficult.
 An ensemble classifier: a set of basic classifiers, each trained in a different parameter system (6)
 The operation engine for the constituent individual classifiers was the OET-KNN rule.
Introduction
Current difficulties:
 local minimum problem
 no structure-known homologous protein is available in the existing database
Here:
 resort to the taxonomic approach (based on the assumption that the number of protein folds is limited)
Materials and Methods
 Dataset:
 training set: 311 proteins
 testing set: 385 proteins
 None of the proteins in these datasets has >35% sequence identity to any other, and most of the proteins in the testing dataset have <25% sequence identity with those in the training dataset.
Materials and Methods
 Features:
 amino acid composition (replaced)
 predicted secondary structure
 hydrophobicity
 van der Waals volume
 polarity
 polarizability
Improvement:
 pseudo-amino acid composition (avoids completely ignoring the sequence-order effects)
Materials and Methods
 PseAA
 adopt the alternate correlation function between hydrophobicity and hydrophilicity (to reflect the sequence-order effects).
 Suppose a protein P with a sequence of L amino acid residues:
$$P = R_1 R_2 R_3 \cdots R_L$$
 $h^1(R_i)$ is the hydrophobicity of $R_i$, and $h^2(R_i)$ is the hydrophilicity of $R_i$. $H^1_{i,j}$ and $H^2_{i,j}$ are the hydrophobicity and hydrophilicity correlation functions.
Materials and Methods
 $\tau_i$, $\tau_{i+1}$ are the corresponding ith-tier correlation factors that reflect the sequence-order correlation between all the ith most contiguous residues
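The tier-wise correlation factors can be illustrated with a minimal Python sketch. It assumes the usual amphiphilic definition, in which the tier-k factor is the average of the products of property values over all residue pairs that are k positions apart. The function name `correlation_factors`, the output ordering (hydrophobicity factor first, then hydrophilicity factor, tier by tier), and the toy property values are assumptions made for illustration, not taken from the slides.

```python
def correlation_factors(seq, h1, h2, lam):
    """Return amphiphilic correlation factors for tiers 1..lam.

    Assumed ordering: [h1 tier 1, h2 tier 1, h1 tier 2, h2 tier 2, ...].
    h1 and h2 map each residue letter to its (already converted) property value.
    """
    length = len(seq)
    if not 1 <= lam < length:
        raise ValueError("lam must satisfy 1 <= lam < len(seq)")
    factors = []
    for k in range(1, lam + 1):
        # All residue pairs that are k positions apart along the chain.
        pairs = [(seq[i], seq[i + k]) for i in range(length - k)]
        factors.append(sum(h1[a] * h1[b] for a, b in pairs) / (length - k))
        factors.append(sum(h2[a] * h2[b] for a, b in pairs) / (length - k))
    return factors


# Toy two-letter alphabet with made-up property values (not real scales).
toy_h1 = {"A": 1.0, "G": -1.0}
toy_h2 = {"A": 0.5, "G": 2.0}
print(correlation_factors("AGAG", toy_h1, toy_h2, lam=2))
# prints [-1.0, 1.0, 1.0, 2.125]
```

In the toy sequence, neighbouring residues always differ, so the tier-1 hydrophobicity factor is negative, while residues two positions apart are identical, so the tier-2 factor is positive.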
Materials and Methods
 Standard conversion of $h^1(R_i)$, $h^2(R_i)$:
 the converted values remain unchanged if they go through the same conversion procedure again
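The "unchanged under repeated conversion" property can be checked with a short Python sketch. It assumes the standard conversion is a zero-mean, unit-deviation rescaling over the amino acid alphabet (using the population standard deviation); the function name `standard_conversion` and the four input values are hypothetical and chosen only for illustration.

```python
import math


def standard_conversion(values):
    """Rescale a {residue: value} table to zero mean and unit deviation."""
    n = len(values)
    mean = sum(values.values()) / n
    # Population standard deviation over the residues in the table.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values.values()) / n)
    if sd == 0:
        raise ValueError("all values are identical; cannot rescale")
    return {res: (v - mean) / sd for res, v in values.items()}


# Made-up raw values for four residues (not a real hydrophobicity scale).
raw = {"A": 1.0, "C": 2.0, "D": 4.0, "E": 7.0}
once = standard_conversion(raw)
twice = standard_conversion(once)

# Converting a second time leaves the values unchanged (up to rounding).
print(all(math.isclose(once[r], twice[r], abs_tol=1e-9) for r in raw))
```

After the first pass the table already has mean 0 and deviation 1, so the second pass subtracts 0 and divides by 1.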
Materials and Methods
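 By fusing the 2λ amphiphilic correlation factors into the classical amino acid composition, we have the following augmented discrete form to represent a protein sample P:
$$P = [p_1, \dots, p_{20}, p_{20+1}, \dots, p_{20+2\lambda}]^T$$
where the first 20 components reflect the amino acid composition and the remaining 2λ components reflect the amphiphilic sequence-order correlation.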
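As a rough illustration of how the composition and the correlation factors can be fused into one vector, here is a minimal Python sketch. It assumes the commonly used normalisation in which every component is divided by the sum of the 20 residue frequencies plus a weight `w` times the sum of the correlation factors. The weight value, the function name `pse_aa_vector`, and the example factors are assumptions, not values from the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pse_aa_vector(seq, factors, w=0.5):
    """Build a (20 + len(factors))-dimensional pseudo-amino acid vector.

    factors: precomputed sequence-order correlation factors.
    w: weight of the sequence-order part (hypothetical default).
    """
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    denom = sum(freqs) + w * sum(factors)
    if denom == 0:
        raise ValueError("normalisation constant is zero")
    composition_part = [f / denom for f in freqs]
    order_part = [w * t / denom for t in factors]
    return composition_part + order_part


# Two made-up correlation factors, i.e. a single tier.
vec = pse_aa_vector("ACDA", [0.2, 0.4], w=0.5)
print(len(vec))  # prints 22
```

With this normalisation the components of the vector sum to 1, so the weight `w` directly controls how much of the vector is devoted to sequence-order information.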
OET-KNN Algorithm
 OET-KNN: optimized evidence-theoretic k-nearest neighbors
 ET-KNN:
 Based on the idea that each neighbor of a pattern to be classified is considered as an item of evidence supporting certain hypotheses concerning the class membership of that pattern.
ET-KNN Algorithm
 Let us consider a problem of classifying N entities into
27 classes (fold types), which can be formulated as
 The available information is assumed to consist of a
training dataset
ET-KNN Algorithm
 Suppose P is a query protein to be classified, together with the set of its k nearest neighbors in the training dataset.
 For any neighbor in this set, the knowledge that the neighbor belongs to a given class can be considered as a piece of evidence that increases our belief that P also belongs to that class. This item of evidence can be formulated by
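For orientation, the evidence contributed by one neighbor is commonly written in the evidence-theoretic k-NN literature as a mass function of the following form. The notation is an assumption: $P_i$ is a neighbor with class label $C_q$, $d(P, P_i)$ is the distance between the query and that neighbor, $C$ is the full set of classes, and $\alpha_0$ and $\gamma_q$ are parameters of the rule.

```latex
% Assumed notation; part of the belief goes to the neighbor's class,
% the remainder stays uncommitted on the whole set of classes C.
m(\{C_q\} \mid P_i) = \alpha_0 \exp\!\left(-\gamma_q^{2}\, d^{2}(P, P_i)\right),
\qquad
m(C \mid P_i) = 1 - \alpha_0 \exp\!\left(-\gamma_q^{2}\, d^{2}(P, P_i)\right)
```

A close neighbor therefore commits most of its mass to its own class, while a distant neighbor leaves almost all of its mass uncommitted.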
OET-KNN Algorithm
 The belief function of P belonging to a given class is a combination of the evidence from its k nearest neighbors, and can be formulated as:
 A decision is made by assigning the query protein P to the class with the highest belief:
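The combine-then-decide procedure can be sketched in Python as follows. This is a minimal sketch of a plain evidence-theoretic k-NN rule, not the optimized OET-KNN: it assumes an exponential distance-based mass for each neighbor, Euclidean distance, Dempster's rule for combination, and fixed parameters `alpha0` and `gamma` (in the optimized variant such parameters would be tuned). All function names and the toy data are hypothetical.

```python
import math


def neighbor_mass(label, distance, n_classes, alpha0=0.95, gamma=1.0):
    """Mass function of one neighbor: (mass per single class, mass on the whole set)."""
    support = alpha0 * math.exp(-(gamma ** 2) * distance ** 2)
    singles = [0.0] * n_classes
    singles[label] = support
    return singles, 1.0 - support


def dempster_combine(m1, m2):
    """Dempster's rule for masses focused on single classes plus the whole set."""
    s1, o1 = m1
    s2, o2 = m2
    # A class keeps mass when both agree on it, or one names it and the other is uncommitted.
    singles = [a * b + a * o2 + o1 * b for a, b in zip(s1, s2)]
    omega = o1 * o2
    total = sum(singles) + omega  # equals 1 minus the conflicting mass
    return [x / total for x in singles], omega / total


def et_knn_predict(query, train_x, train_y, n_classes, k=3):
    """Combine the evidence of the k nearest neighbors and pick the best class."""
    nearest = sorted((math.dist(query, x), y) for x, y in zip(train_x, train_y))[:k]
    combined = ([0.0] * n_classes, 1.0)  # start from total ignorance
    for distance, label in nearest:
        combined = dempster_combine(combined, neighbor_mass(label, distance, n_classes))
    singles, omega = combined
    best = max(range(n_classes), key=singles.__getitem__)
    return best, singles, omega


# Toy 2-D data with two classes (made-up points).
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 1, 1]
best, singles, omega = et_knn_predict((0.2, 0.1), train_x, train_y, n_classes=2, k=3)
print(best)  # prints 0
```

The two nearby class-0 neighbors reinforce each other, while the distant class-1 neighbor contributes almost no committed mass, so the query is assigned to class 0.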
Materials and Methods
Input feature selection:
(20+2λ)D PseAA
21D predicted secondary structure
21D hydrophobicity
21D normalized van der Waals volume
21D polarity
21D polarizability
 Combining all of them into one input vector gives [(21*5)+(20+2λ)]D.
 Doing so would introduce too many parameters into the input, thereby reducing the cluster-tolerant capacity and cross-validation success rate.
 It is instructive to introduce a new concept called the cluster-tolerant capacity for the dataset studied here.
 If such a capacity is high for a dataset, then the removal of any entry from the dataset will not significantly change the original clustered picture (such as the distribution of the standard vectors for each subset, the spatial gaps between the boundaries of any two subsets, and the original attribution of the removed entry to a subset).
 Conversely, if the cluster-tolerant capacity is low for a dataset, the removal of some entry from it will have a significant impact on the original clustered picture.
Materials and Methods
 Framework of ensemble classifier:
Why choose this framework:
 It reduces the variance caused by the peculiarities of a single training set and hence is able to learn a more expressive concept in classification than a single classifier.
Materials and Methods
 Suppose the ensemble classifier C is expressed by
 Thus, the process of how the ensemble classifier C works by fusing the nine basic classifiers C(i) (i = 1, 2, ..., 9) can be formulated as follows. Suppose
 Thus the query protein P is predicted as belonging to the fold type whose score is the highest:
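The fusion step can be sketched as a voting scheme in Python. This is a minimal sketch under stated assumptions: each of the nine basic classifiers outputs one fold index, each vote adds that classifier's weight to the score of the predicted fold, and the fold with the highest score wins. The function name `fuse_votes`, the uniform default weights, and the example predictions are hypothetical.

```python
def fuse_votes(predictions, weights=None, n_classes=27):
    """Fuse the fold predictions of several basic classifiers by weighted voting.

    predictions: one predicted fold index (0-based) per basic classifier.
    weights: one weight per classifier; uniform if omitted (an assumption).
    Ties are broken in favour of the lowest fold index.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    scores = [0.0] * n_classes
    for fold, weight in zip(predictions, weights):
        scores[fold] += weight
    best = max(range(n_classes), key=scores.__getitem__)
    return best, scores


# Made-up outputs of nine basic classifiers for one query protein.
votes = [3, 3, 5, 5, 5, 3, 3, 7, 3]
best_uniform, _ = fuse_votes(votes)
best_weighted, _ = fuse_votes(votes, weights=[1, 1, 3, 3, 3, 1, 1, 1, 1])
print(best_uniform, best_weighted)  # prints 3 5
```

With uniform weights fold 3 wins with five votes; giving more weight to the three classifiers that voted for fold 5 changes the outcome, which shows how the weights shape the fused decision.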
Results and Discussion
Conclusions
 It is shown through the present study that the
ensemble classifier formed by fusing
different input types, particularly different
dimensions of pseudo-amino acid
composition, is very promising for enhancing