poster - Computer Science and Engineering

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science
Using Global Sequence Similarity Improves Biological Site-Specific Classifiers
Jivko Sinapov, Cornelia Caragea, Drena Dobbs and Vasant Honavar
Biological Motivation:
Many problems in bioinformatics involve the prediction of class labels for each element in a
protein sequence. Examples include:
• Prediction of RNA- and DNA-binding protein residues
• Prediction of post-translational modification sites
• Prediction of secondary structure elements in sequences
Typically, classifiers are trained on local features of each site in the training set of protein
sequences, so no global sequence information is used when making predictions on the sites
of the sequence to be annotated. In this work we seek to improve such classifiers by taking into
account the global sequence similarity between the test sequence and the sequences in the
training set.
Example Problems:
Protein-RNA binding site prediction: [figure]
Glycosylation site prediction: [figure: a peptide chain drawn from H3N+ to COO-, with residues marked "O-Glycosylated?" or "Non O-Glycosylated?"]
Datasets:
1. Prediction of O-linked glycosylation sites
2. Prediction of RNA-binding protein residues
3. Prediction of protein-protein interface residues
Dataset            Number of Sequences    Number of + Instances    Number of - Instances
O-GlycBase         216                    2168                     12147
Protein-RNA        147                    4336                     27988
Protein-Protein    42                     2350                     9204
Naïve Bayes (NB):
Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be an n-dimensional test data point.
• Apply Bayes rule:
\[ P(C \mid f_1, f_2, \ldots, f_n) = \frac{P(f_1, f_2, \ldots, f_n \mid C)\, P(C)}{P(f_1, f_2, \ldots, f_n)} \]
• Independence assumption:
\[ P(f_1, f_2, \ldots, f_n \mid C) = \prod_{i} P(f_i \mid C) \]
• Assign class that maximizes:
\[ P(C) \prod_{i} P(f_i \mid C) \]
Hierarchical Mixture of Naïve Bayes Experts (HME-NB):
Let $S_1, S_2, \ldots, S_N$ be a dataset of protein sequences.
1. Compute an N by N pair-wise similarity matrix using Global Alignment scores with the BLOSUM62
substitution matrix.
2. Using the Spectral Clustering algorithm, recursively partition the set of
training sequences to obtain a Hierarchical Clustering of the
Sequences.
3. Use the structure of the hierarchical partitioning to learn a
Hierarchical Mixture of Experts model such that:
Let $V^l_1, V^l_2, \ldots, V^l_M$ be the leaf nodes in the hierarchical partitioning T.
Let $\theta_1, \theta_2, \ldots, \theta_M$ be the parameters for the trained Naïve Bayes models at each leaf node in T.
Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be the input features for some residue in sequence $S_{test}$.
Each leaf node computes the class probability for $x_{test}$ according to:
\[ P_{V^l_j}(C \mid x_{test}, S_{test}) \propto P(C \mid \theta_j) \prod_{i} P(f_i \mid C, \theta_j) \]
Each non-leaf node combines the predictions from its children:
\[ P_{V^g}(C \mid x_{test}, S_{test}) = \sum_{V_i \in child(V^g)} P_{V_i}(C \mid x_{test}, S_{test})\, P(S_{test} \in V_i \mid S_{test} \in par(V_i)) \]
[Figure: hierarchical clustering tree; node labels 147, 122, 25, 94, 28, 49, 45, 26, 23]
Results:
1. Performed 10-fold sequence-based cross-validation
2. Compared Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB)
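To make the two compared classifiers concrete, here is a minimal sketch of the prediction step only. A Naïve Bayes expert scores a window with P(C) ∏ P(f_i | C), and a non-leaf node mixes its children's predictions using the weights P(S_test ∈ V_i | S_test ∈ par(V_i)). The names `NaiveBayesExpert`, `Node` and `predict`, the `unseen` fallback probability, and every number in the toy usage are illustrative assumptions. The poster does not say here how the parameters or the mixing weights are estimated, so both are simply passed in.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class NaiveBayesExpert:
    """One leaf model theta_j: class priors and per-position conditionals."""
    prior: Dict[int, float]                   # class -> P(C | theta_j)
    cond: Dict[int, List[Dict[str, float]]]   # class -> [{value: P(f_i = value | C, theta_j)}]
    unseen: float = 1e-6                      # assumed fallback for unseen feature values

    def posterior(self, x: Sequence[str]) -> Dict[int, float]:
        # Score each class with P(C) * prod_i P(f_i | C), then normalize over classes.
        scores = {
            c: self.prior[c]
            * prod(self.cond[c][i].get(f, self.unseen) for i, f in enumerate(x))
            for c in self.prior
        }
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}


@dataclass
class Node:
    """A node of the hierarchical partitioning; only leaves carry an expert."""
    expert: Optional[NaiveBayesExpert] = None
    # (weight, child) pairs; weight stands for P(S_test in V_i | S_test in par(V_i)).
    children: List[Tuple[float, "Node"]] = field(default_factory=list)


def predict(node: Node, x: Sequence[str]) -> Dict[int, float]:
    """Leaf: Naive Bayes posterior. Non-leaf: weighted sum of the children's outputs."""
    if not node.children:
        return node.expert.posterior(x)
    combined: Dict[int, float] = {}
    for weight, child in node.children:
        for c, p in predict(child, x).items():
            combined[c] = combined.get(c, 0.0) + weight * p
    return combined


# Toy usage: one-residue "windows" and made-up parameters.
leaf_a = NaiveBayesExpert(
    prior={+1: 0.5, -1: 0.5},
    cond={+1: [{"K": 0.8, "G": 0.2}], -1: [{"K": 0.2, "G": 0.8}]},
)
leaf_b = NaiveBayesExpert(
    prior={+1: 0.5, -1: 0.5},
    cond={+1: [{"K": 0.5, "G": 0.5}], -1: [{"K": 0.5, "G": 0.5}]},
)
root = Node(children=[(0.75, Node(expert=leaf_a)), (0.25, Node(expert=leaf_b))])
print(predict(root, ["K"]))
```

As long as the weights at every non-leaf node sum to 1, the mixed output remains a probability distribution over the classes.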
               O-Glycosylation          Protein-RNA interactions    Protein-Protein interface
Metric         Naïve Bayes   HME-NB     Naïve Bayes   HME-NB        Naïve Bayes   HME-NB
Accuracy       0.89          0.89       0.83          0.84          0.79          0.81
MCC            0.57          0.58       0.32          0.37          0.08          0.25
Sensitivity    0.61          0.65       0.24          0.31          0.06          0.18
Specificity    0.65          0.63       0.65          0.66          0.38          0.60
AUC            0.88          0.91       0.74          0.76          0.62          0.72
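For readers who want to compute the threshold-based metrics in the table, the sketch below derives them from confusion-matrix counts. The function names and the toy labels are illustrative assumptions. The poster does not define Specificity, so the sketch returns both the true-negative rate TN/(TN+FP) and the ratio TP/(TP+FP), since either convention may be intended. AUC is left out because it needs real-valued scores rather than hard predictions.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    # Return 0.0 instead of failing when a denominator is empty.
    return num / den if den else 0.0


def metrics(tp, fp, tn, fn):
    """Accuracy, Matthews correlation coefficient, sensitivity, and two
    candidate readings of 'specificity' (see the note above the code)."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
        "sensitivity": _ratio(tp, tp + fn),
        "true_negative_rate": _ratio(tn, tn + fp),
        "tp_over_predicted_pos": _ratio(tp, tp + fp),
    }


# Toy usage with made-up labels (+1 = positive site, -1 = negative site).
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
print(metrics(*confusion_counts(y_true, y_pred)))
```

On strongly imbalanced data such as these datasets, accuracy can stay high even when few positives are found, which is why MCC and sensitivity are worth reading alongside it.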
[Figure panels: a) O-Glycosylation, b) Protein-RNA interaction sites, c) Protein-Protein interface sites]
Feature Representation:
• A window of 21 amino acids centered on the target residue:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Label:    1111110011111110011111001011111100000001111101000000
Data points used for training and testing a classifier (window around the target residue, class label):
. . .
VKKFGGEVVKAGNIL,-1
KKFGGEVVKAGNILV,-1
KFGGEVVKAGNILVR,+1
FGGEVVKAGNILVRQ,+1
. . .
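The windowing above can be sketched as follows. The function name `extract_windows` is hypothetical, and skipping residues too close to either end of the sequence to have a full window is an assumption, since the poster does not say how termini are handled. The text states a window of 21, but the illustrated data points are 15 residues wide, so the usage passes `window=15` to reproduce them.

```python
def extract_windows(sequence, labels, window=21):
    """Return (window_string, class_label) pairs, one per residue that has a
    full window around it. Label '1' maps to +1 and '0' maps to -1."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered on a residue")
    half = window // 2
    points = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        cls = 1 if labels[center] == "1" else -1
        points.append((segment, cls))
    return points


# Sequence and label strings taken from the example above.
seq = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
lab = "1111110011111110011111001011111100000001111101000000"
for segment, cls in extract_windows(seq, lab, window=15)[8:12]:
    print(f"{segment},{cls:+d}")
```

Each window is labeled by its central residue only, so two overlapping windows that differ by one position can carry different class labels.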
A qualitative comparison of Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB) on the task of predicting
protein-protein interface sites of the Anionic trypsin-2 precursor of Rattus norvegicus (shown in spheres) interfaced with the Ecotin precursor of
E. coli (in green). Each residue of the Anionic trypsin-2 precursor is colored according to whether its prediction is a True Positive (red),
True Negative (gray), False Positive (blue), or False Negative (yellow). For both methods, the False Positive Rate (FPR) is fixed at 0.1.
HME-NB achieves a higher True Positive Rate (TPR) of 0.88, compared with 0.56 for NB, at the same FPR.
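Comparing two classifiers at a fixed FPR means choosing, for each one, a score threshold that lets through at most the allowed fraction of negatives and then reading off the TPR. The sketch below shows one way to do this, assuming each classifier emits a real-valued score per residue. The function name `tpr_at_fpr` and the toy scores are illustrative, and the poster does not describe how its thresholds were actually chosen.

```python
from math import floor


def tpr_at_fpr(scores, labels, max_fpr=0.1, positive=1):
    """Pick the lowest threshold whose false positive rate stays <= max_fpr
    (predicting positive when score > threshold); return (tpr, fpr, threshold)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = sorted((s for s, y in zip(scores, labels) if y != positive), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives allowed above the threshold (small epsilon guards
    # against floating-point error in the product).
    allowed_fp = floor(max_fpr * len(neg) + 1e-9)
    if allowed_fp >= len(neg):
        threshold = float("-inf")
    else:
        # At most `allowed_fp` negatives score strictly above this value.
        threshold = neg[allowed_fp]
    tp = sum(s > threshold for s in pos)
    fp = sum(s > threshold for s in neg)
    return tp / len(pos), fp / len(neg), threshold


# Toy usage with made-up scores: 4 positives followed by 10 negatives.
scores = [0.95, 0.8, 0.5, 0.3,
          0.9, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01]
labels = [1] * 4 + [-1] * 10
print(tpr_at_fpr(scores, labels, max_fpr=0.1))
```

Running the same function on each classifier's scores gives TPR values that are directly comparable, because both are measured at the same cost in false positives.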
Conclusion:
• Developed a classifier, HME-NB, that uses global sequence similarity to improve the site-specific labeling of biological sequence data
Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs