EPPdb: a Database for Proteomic Analysis of Extracytosolic Plant


Identifying Extracellular Plant
Proteins Based on Frequent
Subsequences of Amino Acids
Y. Wang, O. Zaiane, R. Goebel
Introduction
Protein: linear sequence of amino acids
Protein subcellular localization

Plant: nuclear, cytoplasmic, mitochondrial,
extracellular, …
Intracellular vs. Extracellular



Sequence information alone
Class imbalance
Transparency
2
Related Work
N-terminal sorting signals
Amino acid composition
Lexical analysis
Integrative approach
Subsequence methods
3
Predicting Extracellular
Proteins
Feature Extraction
Support Vector Machine
Boosting
Frequent Pattern Method
4
Feature Extraction
Frequent subsequences: subsequences
that occur in more than a certain
percentage of extracellular proteins



Strong discriminative power
Perform similar functions via related
biochemical mechanisms
Capture local similarity
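A minimal sketch of the frequent-subsequence idea, assuming that "subsequence" means a contiguous run of amino acids and that the support threshold is inclusive. It enumerates substrings by brute force rather than with the generalized suffix tree named on the next slide, so it only suits small inputs. The function name, the `max_len` cap, and the toy sequences are hypothetical.

```python
from collections import Counter

def frequent_subsequences(proteins, min_sup=0.05, min_len=3, max_len=6):
    """Return contiguous substrings found in at least min_sup of the proteins.

    Brute-force stand-in for a generalized suffix tree (assumption:
    subsequences are contiguous; max_len is an illustrative cap).
    """
    counts = Counter()
    for seq in proteins:
        seen = set()  # count each substring at most once per protein
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)
    threshold = min_sup * len(proteins)
    return {s for s, c in counts.items() if c >= threshold}

# Toy extracellular set (hypothetical sequences, not from the dataset)
toy = ["MKTLLA", "AKTLLG", "MKTAAA", "GGGGGG"]
frequent = frequent_subsequences(toy, min_sup=0.5, min_len=3, max_len=4)
print(sorted(frequent))
```

Support is counted per protein, not per occurrence, so a substring repeated many times inside one sequence still contributes only once.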
5
Generalized Suffix Tree
6
Support Vector Machine
Input data represented as feature
vectors
Find a linear separator that separates the
data and maximizes the margin
Kernel function: nonlinear separator
7
SVM for extracellular protein
prediction
Data Transformation (sequence → vector)


Frequent subsequences as features
Transform protein sequences into binary
vectors
Kernel Functions



Linear kernel
Polynomial kernel
RBF kernel
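The transformation and the three kernels on this slide can be sketched as follows. This is an illustration in plain Python, not the SVM package used in the experiments. The feature list, the example sequences, and the kernel parameters (`degree`, `coef0`, `gamma`) are assumptions chosen for the example.

```python
import math

def to_binary_vector(sequence, features):
    """Component i is 1 if frequent subsequence i occurs in the sequence, else 0."""
    return [1 if f in sequence else 0 for f in features]

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree; degree and coef0 are illustrative defaults."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); gamma is an illustrative default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

features = ["MKT", "KTL", "TLL"]          # hypothetical frequent subsequences
x = to_binary_vector("MKTLLA", features)  # contains all three features
y = to_binary_vector("MKTAAA", features)  # contains only "MKT"
print(x, y, linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

On binary vectors the linear kernel simply counts the frequent subsequences two proteins share, which is one reason these features remain easy to interpret.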
8
Boosting
Iterative algorithm that improves a weak
classifier
Different weighted distribution of
examples in each iteration
Increase the weights of incorrectly
classified examples, and decrease the
weights of correctly classified ones
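The reweighting step described above can be sketched with the standard AdaBoost update. This assumes labels and predictions in {-1, +1} and weights that sum to 1; whether the experiments used exactly this variant is an assumption, and the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """One round of the standard AdaBoost update.

    Returns (alpha, new_weights). Assumes labels and predictions are
    +1 or -1 and that the incoming weights sum to 1.
    """
    # Weighted error of the weak classifier
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Clamp to avoid division by zero for a perfect or always-wrong classifier
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified examples (y * h = -1) grow, correct ones shrink
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # only the last example is misclassified
alpha, new_weights = adaboost_reweight(weights, labels, predictions)
print(alpha, new_weights)
```

After the update the misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.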
9
AdaBoost
10
Frequent Pattern Method
Frequent pattern: *X1*X2*…*Xn* → extracellular


X1,X2,…Xn are frequent subsequences
“*” can be replaced by zero up to
MaxGap amino acids when matching a
protein sequence
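One way to test whether a protein matches such a pattern is to translate it into a regular expression. The sketch below reads the slide literally, so the leading and trailing "*" are bounded by MaxGap as well; that reading and the helper names are assumptions.

```python
import re

def pattern_to_regex(subsequences, max_gap):
    """Build a regex for *X1*X2*...*Xn*, where each * spans 0..max_gap residues."""
    gap = ".{0,%d}" % max_gap
    body = gap.join(re.escape(s) for s in subsequences)
    return re.compile(gap + body + gap)

def matches(subsequences, max_gap, protein):
    """True if the whole protein sequence fits the gapped pattern."""
    return pattern_to_regex(subsequences, max_gap).fullmatch(protein) is not None

print(matches(["AB", "CD"], 2, "XABYCDZ"))  # gaps of length 1, 1, 1
print(matches(["AB", "CD"], 2, "XXXABCD"))  # leading gap of 3 exceeds MaxGap=2
```

The order of the subsequences matters: `["CD", "AB"]` does not match a protein in which AB precedes CD.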
11
FOIL algorithm
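The slide gives only the title. For orientation, FOIL grows a rule greedily by adding the condition with the highest gain, and the standard FOIL gain measure can be sketched as below. Whether the authors use exactly this form, and the example counts, are assumptions.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Standard FOIL gain of extending a rule with one more condition.

    p0, n0: positive / negative examples covered before the extension
    p1, n1: positive / negative examples covered after the extension
    """
    if p1 == 0:
        return 0.0  # nothing positive left to cover, so no gain
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# Hypothetical counts: the rule narrows from 10 of 100 positives to 8 of 10
print(foil_gain(p0=10, n0=90, p1=8, n1=2))
```

The gain rewards extensions that raise the share of positives among covered examples while still covering many positives.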
12
Z-number
:support of rule R
:accuracy of rule R
13
14
Experiments
Dataset(PASub project at UofA)

Plant: 3293 proteins, 171 extracellular
Five-fold cross-validation
15
Evaluation Metrics
Overall accuracy alone is not informative
under class imbalance
F-measure
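A short sketch of why accuracy misleads on this dataset, using the counts from the Experiments slide (3293 plant proteins, 171 extracellular) and assuming the F-measure is the balanced harmonic mean of precision and recall. The function name is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-measure for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

total, extracellular = 3293, 171

# A classifier that labels every protein intracellular is right on all negatives
baseline_accuracy = (total - extracellular) / total
# but it finds no extracellular protein at all
baseline_scores = precision_recall_f(tp=0, fp=0, fn=extracellular)

print(round(baseline_accuracy, 3), baseline_scores)
```

The trivial classifier reaches about 94.8% accuracy with an F-measure of 0, which is why the F-measure on the extracellular class is the more useful score here.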
16
Result(SVM with subsequence)
17
Result(Boosting with subsequence)
18
Result(Frequent Pattern)
MinLen=3
Min_gain=0.1
  0.03
  0.8
MinSup=5%
MinConf=80%
MaxGap=300
19
Result(SVM with composition)
20
Result(Boosting with composition)
21
Cross Comparison
22
SVM with combined features
23
Boosting with combined
features
24
Effects of MinLen on SVM
25
Effects of MinLen on boosting
26
Conclusion
Presented three methods for identifying
extracellular proteins based on frequent
subsequences of amino acids
SVM achieves the best result
FSP method provides easily
interpretable rules
27
Future Work
Use more information about proteins (e.g.,
structure, function, …)
Integrate amino acid composition into
FSP method
Incorporate more biological knowledge
28