Prediction of Lysine Acetylation Sites with the Use of

Associating Biomedical Terms:
Case Study for Acetylation
Aaron Buechlein
Indiana University School of Informatics
Advisor: Dr. Predrag Radivojac
Overview
• Background
• Previous Work
• Methods
• Results
Central Dogma
http://www.accessexcellence.org/RC/VL/GG/images/central.gif
Post-Translational Modifications
(PTMs)
Acetylation
• Acetylation involves the substitution of an acetyl group
(-COCH3) for hydrogen
• Typically occurs on N-terminal tails and lysine residues
(Lys or K)
Previous Predictors
• Several PTM predictors have been created prior to this work
• Acetylation predictors also existed prior to this work
• NetAcet is a predictor for N-terminal sites only
• AutoMotif Server is a predictor for various PTMs and includes an acetylation portion
• PAIL is a lysine acetylation predictor
Methods
• Create Dataset
  • Download articles relevant to acetylation and extract sites
  • Rank articles in order to elucidate sites quickly
  • SwissProt and Human Protein Reference Database (HPRD)
• Create Predictors
  • Leave-one-protein-out validation
  • Matlab
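Leave-one-protein-out validation can be illustrated with a short sketch. The slides name Matlab as the tool used; the Python below is only an illustration, and the function name `leave_one_protein_out` and the toy protein labels are assumptions, not taken from the original work. The idea shown is that all lysines of one protein are held out together, so no protein contributes to both training and testing in the same fold.

```python
from collections import defaultdict


def leave_one_protein_out(protein_ids):
    """Yield (held_out_protein, train_indices, test_indices), one fold per protein.

    protein_ids[i] is the protein that example i (one lysine) belongs to.
    """
    by_protein = defaultdict(list)
    for idx, pid in enumerate(protein_ids):
        by_protein[pid].append(idx)

    for pid, test_idx in by_protein.items():
        # Every example from any other protein goes into the training set.
        train_idx = [i for i, other in enumerate(protein_ids) if other != pid]
        yield pid, train_idx, test_idx


# Toy labels (hypothetical): ten lysines spread over three proteins.
protein_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
folds = list(leave_one_protein_out(protein_ids))
```

Each fold would then train a predictor on `train_idx` and evaluate it on `test_idx`; how the per-fold results are pooled is not stated in the slides.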
Article Retrieval
• Searched individual journal sites for articles relevant to acetylation
• Saved resultant html pages for each journal
• These pages were then used as the input for a web crawler to download articles
• Due to varying journal site construction, each journal required a unique regular expression to extract links for articles
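A minimal sketch of per-journal link extraction follows. The journal key `"example-journal"`, the URL pattern, and the sample HTML are all hypothetical; the slides do not give the actual expressions used, only that each journal needed its own.

```python
import re

# Hypothetical pattern table: one regular expression per journal site.
# Real sites would each need their own expression.
LINK_PATTERNS = {
    "example-journal": re.compile(r'href="(/content/[^"]+\.full)"'),
}


def extract_article_links(html, journal):
    """Return unique article links found in a saved search-result page."""
    pattern = LINK_PATTERNS[journal]
    seen = set()
    links = []
    for match in pattern.finditer(html):
        url = match.group(1)
        if url not in seen:  # the same article may be linked more than once
            seen.add(url)
            links.append(url)
    return links


sample = (
    '<a href="/content/1/2/3.full">Article</a> '
    '<a href="/about">About</a> '
    '<a href="/content/1/2/3.full">Same article again</a>'
)
links = extract_article_links(sample, "example-journal")
```

The extracted links would then be handed to the crawler that downloads the articles.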
Rank Articles
• First locate occurrences of first phrase: “phrase 1”
  • $A = \{a_1, a_2, \ldots, a_{|A|}\}$
• Next locate occurrences of second phrase: “phrase 2”
  • $R = \{r_1, r_2, \ldots, r_{|R|}\}$
• Each occurrence r is scored with $f(x) = c\,e^{-dx}$
  • c and d are constants
  • x is the distance in characters between r and the nearest word a
An example: acetylation
1. Word “acetylat”
   $A = \{a_1, a_2, \ldots, a_m\}$
2. Regular expression
   (k | lys | lysine)(space)*(digit)+
   $R = \{r_1, r_2, \ldots, r_n\}$
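The two patterns above can be written with Python's `re` module as a rough sketch. Case-insensitive matching and the leading word boundary (`\b`, so that a "k" inside a longer word is not taken as a site) are assumptions added here; the slide only gives the pattern shape. The sample sentence is invented.

```python
import re

# Phrase 1: the word stem "acetylat" (covers acetylation, acetylated, ...).
WORD_A = re.compile(r"acetylat", re.IGNORECASE)

# Phrase 2: (k | lys | lysine)(space)*(digit)+
# Longer alternatives come first so "lysine" is not cut short at "lys".
SITE_R = re.compile(r"\b(?:lysine|lys|k)\s*\d+", re.IGNORECASE)

text = "Acetylation of Lys 382 and K120 was observed; lysine320 is also acetylated."

# Character offsets of each occurrence: the sets A and R on the slide.
a_positions = [m.start() for m in WORD_A.finditer(text)]
r_positions = [m.start() for m in SITE_R.finditer(text)]
r_matches = [m.group(0) for m in SITE_R.finditer(text)]
```

The character offsets in `a_positions` and `r_positions` are what a distance-based article score would consume.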
An example: acetylation
Score for article S:
$$S = \sum_{i=1}^{n} \mathrm{score}(r_i, A)$$
where:
$$\mathrm{score}(r_i, A) = f\left(\left|\mathrm{position}(r_i) - \mathrm{position}(a_k)\right|\right)$$
and
$$k = \arg\min_{j = 1, \ldots, m} \left|\mathrm{position}(r_i) - \mathrm{position}(a_j)\right|$$
with
$$f(x) = 10\,e^{-0.005x}$$
[Figure: plot of f(x) (vertical axis, 0 to 10) against distance in characters (horizontal axis)]
Papers with S > 100 are rich in sites; papers with S < 30 fall in the “twilight” zone
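The article score can be sketched in a few lines of Python. The constants 10 and 0.005 come from the slide; the function names, the toy positions, and the choice to return 0 when the article contains no occurrence of the first phrase are assumptions made for illustration.

```python
import math


def f(x, c=10.0, d=0.005):
    """Distance weight f(x) = c * exp(-d * x); x is a distance in characters."""
    return c * math.exp(-d * x)


def score_site(r_pos, a_positions):
    """score(r_i, A): weight of the distance from r_i to its nearest a_k."""
    nearest = min(abs(r_pos - a_pos) for a_pos in a_positions)
    return f(nearest)


def score_article(r_positions, a_positions):
    """S = sum of score(r_i, A) over all site mentions r_i."""
    if not a_positions:
        return 0.0  # assumption: an article without phrase 1 scores zero
    return sum(score_site(r_pos, a_positions) for r_pos in r_positions)


# Toy character offsets (hypothetical).
a_positions = [0, 400]
r_positions = [0, 200, 1000]
S = score_article(r_positions, a_positions)
```

With these toy offsets the three mentions lie 0, 200 and 600 characters from their nearest "acetylat" occurrence, so the nearest mention contributes the full weight of 10 and the farthest contributes less than 1.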
Elucidate Sites
• Sites were manually extracted from articles beginning
with the highest rank
• The original experimental paper for these sites was
verified for traceable evidence
• Sites were extracted from SwissProt
• Sites were extracted from HPRD
Predictors
• Support Vector Machine
• Artificial Neural Network
• Decision Tree
Predictor Input
• Positives taken as all lysines found to be acetylated
• Negatives taken as all lysines not found to be acetylated
• Features created based on characteristics surrounding lysines
  • Amino acid content, hydrophobicity, charge, disorder, etc.
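As a rough illustration of window-based features, the sketch below computes amino acid content and a simple charge count around each lysine. The window size, the feature names, and the toy sequence are assumptions; the slides do not specify how the features were computed, and hydrophobicity and disorder are left out because they need external scales or predictors.

```python
def window_features(sequence, pos, half_width=5):
    """Simple features for the lysine at 0-based index pos.

    The window is clipped at the sequence ends rather than padded.
    """
    if sequence[pos] != "K":
        raise ValueError("position is not a lysine")
    start = max(0, pos - half_width)
    end = min(len(sequence), pos + half_width + 1)
    window = sequence[start:end]
    n = len(window)

    positive = sum(window.count(aa) for aa in "KR")  # basic residues
    negative = sum(window.count(aa) for aa in "DE")  # acidic residues
    return {
        "frac_K": window.count("K") / n,
        "frac_positive": positive / n,
        "frac_negative": negative / n,
        "net_charge": positive - negative,
    }


# Toy sequence (hypothetical); every lysine becomes one example.
seq = "MSGRGKGGKGLGKGGAKRHR"
lysine_positions = [i for i, aa in enumerate(seq) if aa == "K"]
features = [window_features(seq, i, half_width=3) for i in lysine_positions]
```

Each feature dictionary would be paired with a 0/1 acetylation label to form one row of the predictor input.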
Predictor Input
Protein  Features                                         Acetylated
1         8  1  0.48609  0.001767  0.48979  0.51508      1
1         7  1  0.92146  0.03019   0.96423  0.79416      1
1         0  0  0.50622  0.015251  0.52335  0.51855      0
2        10  2  0.2008   0.038708  0.25441  0.36071      1
2         1  0  0.62016  0.009772  0.62846  0.67525      0
2         0  0  0.27783  0.028957  0.32162  0.34207      0
3        11  1  0.89239  0.018354  0.91884  0.88125      1
3        12  2  0.87354  0.022307  0.90349  0.87446      1
3         8  1  0.81549  0.025339  0.85289  0.85702      1
3         2  0  0.84588  0.024766  0.88219  0.86599      0
Article and Ranking Results
• 4888 articles from 10 sites were searched
  • Nature provided 2147 articles
  • Science Direct provided 1519 articles
• The highest-ranking article was obtained from the Journal of Biological Chemistry
  • Score of 151.87
  • Contained 10 acetylation sites
• When histones are excluded, the highest-ranking article was obtained from Nature
  • Previously ranked at #5
  • Score of 116.36
  • Contained 9 unique acetylation sites
Top 25
Rank  Score     Sites  Article Source
1     151.8667  10     Journal of Biological Chemistry
2     123.2314  12     Cell / Science Direct
3     121.9031   6     Nature
4     117.7988   9     Journal of Proteome Research
5     116.3582   9     Nature
6     111.1745  14     Biochemistry
7     104.4652   6     Cell / Science Direct
8     104.0166   7     Nature
9     102.0683  13     Molecular Cell / Science Direct
10    98.80812   6     Journal of Biological Chemistry
11    97.64634   6     Biochemistry
12    96.76536   6     Journal of Biological Chemistry
13    96.0845    9     International Journal of Mass Spectrometry / Science Direct
14    88.12967   9     Biochemistry
15    86.17157   6     Journal of Biological Chemistry
16    81.78705   5     Nucleic Acids Research
17    81.30967   6     Biochemistry
18    81.06128   6     Molecular Cell / Science Direct
19    80.74899   9     Journal of Biological Chemistry
20    80.16261   9     Nature
21    79.65658   6     Molecular Cell / Science Direct
22    77.9022    4     Cell / Science Direct
23    77.88304   5     Nucleic Acids Research
24    77.60087   8     Gene / Science Direct
25    77.44198   6     Journal of the American Society for Mass Spectrometry
Ranking Results
• Articles with scores greater than 30 had potential for
providing at least one site
• As scores approached 30, articles became less fruitful
Dataset Results
• Dataset included 1442 total sites and 1085 nonredundant sites
  • HPRD contributed 90 sites
  • Swiss-Prot contributed 825 sites
  • Our study contributed 527 sites
Sensitivity, Specificity, and Precision
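• Sensitivity (sn): $\mathrm{sn} = \frac{TP}{TP + FN}$
• Specificity (sp): $\mathrm{sp} = \frac{TN}{TN + FP}$
• Precision (pr): $\mathrm{pr} = \frac{TP}{TP + FP}$
• TP, FP, TN, FN are the counts of true positives, false positives, true negatives, and false negatives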
Accuracy and AUC
• Accuracy (acc): $\mathrm{acc} = (\mathrm{sn} + \mathrm{sp}) / 2$
• Area Under Curve (AUC)
  • Refers to the area under the Receiver Operating Characteristic (ROC) curve
  • The ROC curve is the graphical plot of sensitivity vs. 1 - specificity
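The measures can be computed as in the sketch below. Sensitivity, specificity and precision use their standard definitions. Accuracy is taken here as (sn + sp) / 2, an inference from the sn, sp and acc columns of the results tables in this deck rather than something the slides state in words. AUC is computed by pairwise comparison of scores, which equals the area under the ROC curve. The function names and toy data are assumptions, and the code assumes both classes are present.

```python
def classification_metrics(y_true, y_pred):
    """Return sn, sp, pr and acc for binary labels (1 = acetylated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pr = tp / (tp + fp)
    return {"sn": sn, "sp": sp, "pr": pr, "acc": (sn + sp) / 2}


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half; this equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels, thresholded predictions, and raw scores (hypothetical).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8]
m = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Because negatives (non-acetylated lysines) typically outnumber positives, averaging sn and sp keeps the accuracy figure from being dominated by the majority class.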
SVM Predictor
Polynomial kernel
Degree       sn    sp    pr    acc   AUC
p = 1        52.3  71.0  24.6  61.6  65.2
p = 2        46.1  69.8  20.3  57.9  62.8
p = 3        31.6  80.8  23.5  56.2  60.3

Gaussian kernel
Parameter    sn    sp    pr    acc   AUC
σ = 10^-2    43.8  75.8  24.9  59.8  64.3
σ = 10^-3    54.1  72.1  25.9  63.1  68.1
σ = 10^-6    52.8  70.7  24.6  61.8  65.3
Artificial Neural Network
Hidden Neurons  sn    sp    pr    acc   AUC
1               68.0  47.7  20.7  57.8  61.9
3               65.2  47.7  19.4  56.4  58.9
5               65.0  47.2  19.1  56.1  57.5
Decision Tree
Algorithm      sn    sp    pr    acc   AUC
Decision Tree  61.7  45.9  18.3  53.8  42.1
Algorithm Comparison
Algorithm       sn    sp    pr    acc   AUC
SVM             54.1  72.1  25.9  63.1  68.1
Neural Network  68.0  47.7  20.7  57.8  61.9
Decision Tree   61.7  45.9  18.3  53.8  42.1
I would like to acknowledge those who have helped me throughout the duration of this project: Dr. Predrag Radivojac, Dr. Haixu Tang, and Wyatt Clark.
I welcome your questions and/or comments.