IvDimitrov_slides
Download
Report
Transcript IvDimitrov_slides
School of Pharmacy
Medical University of Sofia
Application of machine learning
techniques for allergenicity
prediction
Ivan Dimitrov
2nd Regional Conference
“Supercomputing Applications in Science and Industry”
Rodopi Hotel, Sunny Beach, Bulgaria,
September 20-21, 2011
Allergen processing pathways
C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283
FAO and WHO Codex alimentarius guidelines for
evaluating potential allergenicity for novel proteins
A query protein is potentially allergenic if it:
has an identity of 6 to 8 contiguous amino acids
or
has > 35% sequence similarity over a window of 80 amino acids
when compared with known allergens.
Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food
Standards Programme, Food and Agriculture Organization.
Bioinformatics approaches to allergen prediction
1. Sequence-alignment search of query protein
Extensive databases of known allergen proteins and the FAO/WHO
guidelines
- Structural Database of Allergenic Proteins
- Allermatch
Characteristics:
-High sensitivity (true positives/(true positives + false negatives))
- Produce many false positives and low precision
(true positives/(true positives + false positives))
- Discovery of novel antigens is restricted by their lack of
similarity to known allergens.
Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133
Bioinformatics approaches to allergen prediction
2. Identification of conserved allergenicity-related linear motifs
- Comparing allergens to non-allergens by MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov
model
- Automated Selection of Allergen-Representative Peptides (DASARP).
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE
epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as
the discriminating engine
Both approaches are based on the assumption that the allergenicity is a
linearly coded property.
Stadler and Stadler FASEB J. 2003, 17, 1141-1143
Li et al. Bioinformatics 2004, 20, 2572-2578.
Björklund et al. Bioinformatics. 2005, 21, 39–50
Saha and Raghava Nucleic Acids Research,2006,34, 202-209
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
AIM of the study
To create an alignment-free method for in silico
identification of allergens based on the main
chemical properties of amino acid sequences and
implement it to a web server.
Obstacles:
The choice of an appropriate descriptors to represent
the physicochemical properties of amino acid sequences.
Allergens are proteins with different length.
The z-scales
…Phe – Arg – Trp…
z1
z2
hydrophobicity molecular size
z3
polarity
z1
z2
z3
-4.22 1.94 1.08
Hellberg et al. J. Med. Chem. 1987; 30, 1126-1135
z1
z2
z3
3.62 2.60 -3.60
z1
z2
z3
-4.36 3.94 0.69
ACC transformation
Auto-covariance
ACC jj (lag )
n lag
Z j ,i Z j ,i lag
i
n lag
Cross-covariance
ACC jk j k (lag )
n lag
i
Z j ,i Z k ,i lag
n lag
j, k are the zscales (j=1,2,3);
i is the amino acid positions;
n is the number of amino acids in the sequence;
protein
Phe – Arg – Trp – Phe – Arg – Trp
z1 z2 z3 - z1 z2 z3 - z1 z2 z3 – z1 z2 z3 - z1 z2 z3 – z1 z2 z3
/5
ACC11(1)
z1 z2 z3 - z1 z2 z3 - z1 z2 z3 – z1 z2 z3 - z1 z2 z3 – z1 z2 z3
/5
Wold et al. Anal. Chim. Acta 1993, 277:239-225
ACC13(1)
Preliminary study
595 food allergens from CSL allergen database
595 non-allergens from NCBI database
Training set
475 food allergens
475 non-allergens
ACC transformation
of z descriptors
matrix with 45 variables (32 x 5)
and 950 observations
statistical methods,
machine learning
PLS - discriminant analysis
Logistic regression
Naïve - Bayes algorithm
Decision tree algorithm
k Nearest Neighbours
http://allergen.csl.gov.uk
http://www.ncbi.nlm.nih.gov/
Test set
120 food allergens
120 non-allergens
external validation
Sensitivity
Specificity
Accuracy
Results from preliminary study
sensitivit y
TP
TP FN
specificit y
TN
TN FP
accuracy
TP TN
TP FP TN FN
TP – true positive, FP – false positive
TN – true negative, FN – false negative
100
90
80
70
%
60
Sensitivity,%
50
Specificity,%
Accuracy,%
40
30
20
10
0
PLS-DA
Logistic
regression
Decision tree
Naïve-Bayes
Algorithm
kNN(k=3)
kNN(k=5)
Web servers on the test set
Test set
120 food allergens
120 non-allergens
Algpred
Sensitivity
Specificity
Accuracy
- SVM with single aa composition
- SVM with dipeptide composition
Evaller
APPEL
Allerhunter
100
90
80
70
%
60
Sensitivity,%
50
Specificity,%
40
Accuracy,%
30
20
10
0
ALGPRED
(svm, single aa
composition)
ALGPRED
(svm, dipeptide
composition)
EVALLER
APPELL
ALLERHUNTER
kNN(5)
Server
Saha and Raghava Nucleic Acids Research,2006,34, 202-209.
Barrio et al., Nucleic Acids Research 2007, 35, 694-700
http://jing.cz3.nus.edu.sg/cgi-bin/APPEL
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
Conclusions from the preliminary study
1. The model developed by the k Nearest Neighbors method shows
the best performance on the test set comparing to the other methods.
It has a good balance between specificity and sensitivity, and the
highest accuracy. kNN was used further in the study.
2. The server Allerhunter is the best performing among the
known servers for allergen prediction. kNN needs some more
improvements.
3. A great misbalance exists between sensitivity and specificity
for almost all servers. This indicates that the dataset needs
some improvement too.
The kNN algorithm
Training set
475 allergens, 475 non-allergens
Unknown protein
ACC transformation
of z descriptors
ACC transformation
of z descriptors
vector with 45 variables (32 x 5)
matrix of 45 variables (32 x 5)
and 950 observations
Calculate the Euclidian distance
between the vector and each observation
Sort the distance by
value in ascending order
Determine the k nearest
neighbours
Determine the class of
unknown allergen
according to the majority
of nearest neighbours
Next: Extend the data sets
CSL allergen database, FARRP allergen database
SDAP database, ADFS database
684 food, 1157 inhalant,
553 toxins, venom or salivary allergens
Allergen species
NCBI database
Create local
database
Proteins from allergen species
Blasts search against
all allergens
684 non-allergen from food origin
1157 non-allergens from inhalant origin
553 non-allergens from species with toxins,
venom or salivary allergens
http://allergen.csl.gov.uk
http://www.allergenonline.org/
http://fermi.utmb.edu/SDAP/
http://allergen.nihs.go.jp/ADFS/index.jsp
http://www.ncbi.nlm.nih.gov/
Next: kNN optimization
684 food allergens
684 non-allergens
100
528 allergens
528 non-allergens
machine
learning
k nearest
neighbours
Test set
95
156 allergens
156 non-allergens
90
85
%
Training set
external validation
80
sensitivity
75
specificity
70
accuracy
65
60
55
50
3
Sensitivity
Specificity
Accuracy
5
7
9
11
13
k nearest neigbours
15
17
19
kNN models
684 food allergens
684 non-allergens
Test set
156 allergens
156 non-allergens
1157 inhalant allergens
1157 non-allergens
Training set
528 allergens
528 non-allergens
external
validation
Training set
933 allergens
933 non-allergens
external
validation
external
validation
k NN
k=3
k NN
k=3
Sensitivity
Specificity
Accuracy
Test set
224 allergens
224 non-allergens
kNN models
100
90
80
70
60
sensitivity
50
specificity
accuracy
40
30
20
10
0
kNN, food training and kNN, food training set kNN, inhalant training kNN inhalant training
test set
on inhalant test set
and test set
set on food test set
kNN aggregated
training and test set
AllerTOP
web tool for allergenicity prediction
Training set
1952 food, inhalant and others
allergens and 1952 non-allergens
ACC transformation
of z descriptors
kNN model
external validation
AllerTOP
http://www.pharmfac.net/alletop
Servers performance on united testset
United test set of 441 food and inhalant allergens and 441 non-allergens
100
90
80
70
60
sensitivity
50
specificity
accuracy
40
30
20
10
0
AllerTOP(KNN, K=3)
Allerhunter
AlgPred, svm amino
acid decomposition
AlgPred, svm
dipeptide
decomposition
AlgPred (ARP)
Two of the servers from preliminary studies: Appel and Evaller were not available during recent study.
The results for Allerhunter server are achieved with smaller testset due to its incapability to work
with short sequences (<21 amino acids)
Conclusions
1. An alignment-free method for in silico prediction of allergens based on
the main physicochemical properties of proteins was developed.
2. The method uses z descriptors for representation of amino acids in the
protein sequences and ACC transformation for conversion of proteins into
uniform vectors.
3. The k Nearest Neighbours clustering method showed the best performance
among the other algorithms for classification tested in the study: PLS discriminant analysis, Logistic regression, Naïve - Bayes and Decision Tree
algorithm.
4. The k NN algorithm was optimized and its performance was compared to
the freely available web servers for prediction of allergens.
5. The kNN algorithm was implemented on a web server, freely available on:
http://www.pharmfac.net/allertop
Drug Design Group
School of Pharmacy
Medical University of Sofia
Irini Doytchinova
Ivan Dimitrov
Mariyana Atanasova
Panaiot Garnev
Acknowledgements
Darren R. Flower
Aston University, Birmingham, UK
Funding:
National Research Fund,
Ministry of Education and Science,
Bulgaria, Grant 02-1/2009