Transcript Document
Improving Protein-RNA Interface Prediction by Combining a Sequence
Homology-based Method with a Naïve Bayes Classifier: Preliminary Results
Li Xue1,2, Rasna Walia1,2, Yasser El-Manzalawy2,4, Drena Dobbs1,3, Vasant Honavar1,2
1 Bioinformatics & Computational Biology Program; 2 Dept. of Computer Science; 3 Dept. of Genetics, Development & Cell Biology,
Iowa State University; 4 Dept. of Systems & Computer Engineering, Al-Azhar University, Cairo, Egypt
Method
Protein-RNA interactions play important roles in cellular
processes including protein synthesis, RNA processing,
and gene expression regulation. Reliable identification of
the interfaces involved in protein-RNA interactions is
essential for comprehending their mechanisms and
functional implications and provides a valuable guide for
rational drug discovery and design.
• NR216 – for analyzing protein interface conservation
• RB199 – for testing the prediction performance of
HomPRIP & its combination with a NB classifier
• nr_RNAprot_s2c – for searching for putative
sequence homologs using BLASTP
Experimental determination of interfaces in protein-RNA
complexes is time-consuming and expensive. Thus
computational techniques for predicting RNA-binding
sites on proteins are valuable. Here we propose a novel
family of sequence homology-based methods:
Query protein sequence
Search nr_RNAprot_s2c to find
homologous sequences
• HomPRIP uses interface information from putative
homologs of a query protein to predict interface
residues in the query protein.
• When no sequence homologs for the query protein
can be found, HomPRIP-NB uses a Naïve Bayes
(NB) classifier trained on evolutionary information
derived from protein sequences in the NCBI nr
database to return interface predictions.
Homologous
sequences found?
Yes
Safe
zone
Twilight
zone
Dark
zone
No
HomPRIPNB
returns
predicted
interface
residues
HomPRIP
returns
predicted
interface
residues
http://einstein.cs.iastate.edu/HomPRIP-NB
Protein-RNA Interface Conservation
Homology
Zones
Results & Conclusion
• Support Vector Machine & Naïve Bayes classifiers
were trained using three different features:
• amino acid identity
• PSSM profiles
• smoothed PSSM profiles
and evaluated using five-fold cross-validation.
ICscore Cutoff
Safe Zone
0.70
Twilight Zone 1
0.50
Twilight Zone 2
0.40
Twilight Zone 3
0.20
Dark Zone
0.15
Classifiers
CC
Sensitivity
F1
PPV
Accuracy
Sequence ID
0.24
0.64
0.37
0.26
0.67
Sequence PSSM
0.32
0.70
0.43
0.31
0.72
Smoothed Sequence PSSM
0.30
0.70
0.41
0.29
0.70
Sequence ID
0.24
0.65
0.37
0.26
0.68
Sequence PSSM
0.34
0.72
0.44
0.32
0.73
Smoothed Sequence PSSM
0.32
0.70
0.43
0.31
0.73
Sequence ID
0.28
0.17
0.27
0.62
0.86
Sequence PSSM
0.25
0.60
0.38
0.27
0.71
Smoothed Sequence PSSM
0.25
0.66
0.38
0.27
0.68
HomPRIP
0.69
0.70
0.74
0.78
0.91
HomPRIP-NB
0.51
0.67
0.59
0.53
0.86
SVM with Polynomial
Kernel (default
parameters)
SVM with RBF Kernel
(default parameters)
An interface conservation score (ICscore) is calculated as
a measurement of the similarity of a homolog’s interface
residues to those of the query protein. A regression
model is used to calculate the ICscore, based on BLAST
sequence alignment statistics.
Naïve Bayes
• Performance of HomPRIP is reported only for 71% of
complexes in the RB199 dataset (those for which
homologs could be found); HomPRIP-NB returned
predictions for the entire RB199 dataset.
Future Directions
Safe zone: a high degree of conservation (red data
points)
Twilight zone: moderate conservation of interfaces
(yellow & orange data points)
Dark zone: poor conservation of interfaces (blue data
points)
Acknowledgements
Funding provided by:
NIH GM 066387
Features
Ongoing work is aiming at comparing HomPRIP-NB with
other publically available servers that predict RNAbinding sites on proteins (e.g., BindN, PiRaNha, PRIP,
RNABindR), using an independent test set.
1.
2.
B.A. Lewis, R.R. Walia, M. Terribilini, J. Ferguson, C. Zheng, V. Honavar, and D. Dobbs. PRIDB: a protein–RNA
interface database. Nucleic Acids Research, 39(suppl 1):D277, 2011.
L.C. Xue, D. Dobbs, and V. Honavar. HOMPPI: A class of sequence homology based protein-protein interface
prediction methods. BMC Bioinformatics, 12:244, 2011.