Transcript Slide 1
A Neural Network Predictor for Peptide
Fragmentation in Mass Spectrometry
Arunima Ram
Advisor : Dr. Predrag Radivojac
Co-Advisor : Dr. Haixu Tang
Co-Advisor : Dr. Randy J. Arnold
Indiana University, Bloomington, Indiana
Outline
Introduction to Proteomics
Introduction to Neural Networks
Objective
Data and Process
Results
Future Work
Acknowledgments
Introduction to Proteomics
Proteins are the molecules of life, made up of
chains of amino acids. There are 20 standard
amino acids, each represented by a
single letter
The proteome is the sum of all proteins in an
organism, tissue, or sample under study
Amino Acid
Introduction to Proteomics
Proteomics is the study of the protein
composition of an organelle, cell,
or entire organism, with the
following goals
Identification
Quantification
Expression changes
Modifications
Interactions with other proteins and molecules
Mass spectrometers are the instruments
used for proteomics studies.
Introduction to Proteomics
Protein separation
Protein digestion by specific
enzyme trypsin into peptides
Peptides are separated and
charged
Mass spectrometer selects
peptide based on mass
Mass spectrum (MS) of
peptides is recorded
Each peptide is fragmented
and sent through a second
MS to record MS/MS data
Ruedi Aebersold & Matthias Mann, Nature, vol. 422, 198-207 (2003)
Introduction to Proteomics
Fragmentation of peptides
follows certain rules
b ion – N-terminal fragment
y ion – C-terminal fragment
Most abundant are b and y
ions
A multiply charged peptide can
generate multiply charged
fragment ions
Certain residues lose water or
ammonia or both to generate
less abundant ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
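The b/y nomenclature above can be made concrete with a short calculation. The sketch below is an illustration and not part of the original slides: it computes b- and y-ion m/z values from standard monoisotopic residue masses, and the function name `fragment_mz` is hypothetical. Neutral losses (water, ammonia) are left out.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, carried by every y ion
PROTON = 1.00728   # mass of one charge carrier


def fragment_mz(peptide, charge=1):
    """Return (b_ions, y_ions) as lists of m/z values (hypothetical helper).

    b_ions[i-1] is b_i (first i residues, N-terminal fragment);
    y_ions[i-1] is y_i (last i residues, C-terminal fragment).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b_ions, y_ions = [], []
    for i in range(1, n):  # a peptide of n residues has n-1 cleavage sites
        b_neutral = sum(masses[:i])
        y_neutral = sum(masses[n - i:]) + WATER
        b_ions.append((b_neutral + charge * PROTON) / charge)
        y_ions.append((y_neutral + charge * PROTON) / charge)
    return b_ions, y_ions
```

Calling `fragment_mz("LNVWGK")` returns five b ions and five y ions; passing `charge=2` gives the doubly charged series (b++, y++).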
Introduction to Proteomics
b ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
Introduction to Proteomics
y ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
Introduction to Artificial Neural Networks
Neural networks are composed of
interconnected neurons working in
unison to solve specific problems,
analogous to animal brains
ANNs learn from examples to extract
patterns and detect trends too
complex to be noticed otherwise
Benefits
Can learn real-valued and discrete-valued
functions
Robust to noise in data
[Figures: Components of a neuron cell; Single neuron]
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
Introduction to Artificial Neural Networks
Training examples fed to input layer
Weights associated with each input in each
layer
Weighted input combined at each layer to
give an output
Hidden layer computes output using a
logistic function and feeds to output layer
Determine the error between the network output
and the desired output
Adjust the weights in each layer accordingly
and iterate
2-layer feed-forward
neural network
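The training loop described on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the work: the logistic output unit, cross-entropy error, learning rate, and the names `train_network` and `predict` are all choices made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_network(X, y, n_hidden=4, lr=0.5, epochs=2000, seed=0):
    """Train a 2-layer feed-forward network with full-batch gradient descent.

    Returns the learned parameters and the loss recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    eps = 1e-12  # guards the logarithms against log(0)
    for _ in range(epochs):
        # Forward pass: weighted inputs are combined at each layer.
        h = sigmoid(X @ W1 + b1)            # hidden layer (logistic)
        out = sigmoid(h @ W2 + b2).ravel()  # output layer (assumed logistic)
        # Error between network output and desired output (cross-entropy).
        losses.append(-np.mean(y * np.log(out + eps)
                               + (1 - y) * np.log(1 - out + eps)))
        # Backward pass: propagate the error, then adjust weights in each layer.
        d_out = (out - y)[:, None] / len(y)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses


def predict(params, X):
    """Run the forward pass only and return outputs in (0, 1)."""
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel()
```

The recorded `losses` list makes it easy to check that the error shrinks as the weights are adjusted over the iterations.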
Objective
Identifying peptides by matching fragmentation spectra through
“Database Matching” relies on ad-hoc rules or probabilistic models
and cannot identify proteins not present in the database
Aim – Use machine learning to learn peptide fragmentation
rules from a set of examples, predict fragmentation
spectra, and use the predictions to better identify peptides and proteins
Dataset
Organism     Charge 1        Charge 2         Charge 3        Search
             Total/Unique    Total/Unique     Total/Unique    Engine
Shewanella   7175 / 7175     17647 / 17647    3489 / 3489     Sequest
Rat          150 / 58        3047 / 1782      421 / 305       Mascot
Human        4433 / 472      63012 / 2261     14384 / 775     Sequest
Drosophila   -------         2331 / 1234      28 / 25         Mascot
Mouse        1562 / 419      77030 / 8779     31974 / 3961    Sequest
Process
202 features extracted for 8 ions in charge 1 and 10 ions in charge 2
and charge 3
Amino acids in the peptide
Amino acids on both sides
of cleavage
Amino acid at N-terminal
Amino acid at C-terminal
Number of arginine and
lysine in peptide
Mass of peptide and mass
of fragment ion
Basicity, hydrophobicity, isoelectric point, helix
propensity for peptide, for the charged ion,
and for neighboring amino acids
[Figure: example peptide LNVWGK with cleavage point (b-3 ion)]
R. J. Arnold, N. Jayasankar, D. Aggarwal, H. Tang, P. Radivojac. A machine learning approach
to predicting peptide fragmentation spectra. PSB 2006, pp. 219-230
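To illustrate how a peptide and a cleavage point turn into features, here is a small sketch covering only the sequence-derived items listed on this slide. It is not the 202-feature encoding from the cited paper: the function name `extract_features`, the dictionary layout, and the count-based composition vector are assumptions, and the mass and physicochemical features are omitted.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_features(peptide, cleavage):
    """Describe one cleavage site (hypothetical helper).

    `cleavage` is the number of residues in the b ion, so cleavage=3
    splits the peptide between its 3rd and 4th residues (the b-3 ion).
    """
    if not 1 <= cleavage < len(peptide):
        raise ValueError("cleavage must fall between two residues")
    return {
        # Amino acids in the peptide, as counts in a fixed order.
        "composition": [peptide.count(aa) for aa in AMINO_ACIDS],
        # Amino acids on both sides of the cleavage.
        "left_of_cleavage": peptide[cleavage - 1],
        "right_of_cleavage": peptide[cleavage],
        # Amino acids at the N-terminal and C-terminal.
        "n_terminal": peptide[0],
        "c_terminal": peptide[-1],
        # Number of arginine and lysine residues in the peptide.
        "n_arg_lys": peptide.count("R") + peptide.count("K"),
    }
```

For the peptide LNVWGK with `cleavage=3`, the residues flanking the cleavage are V and W, the termini are L and K, and the peptide contains one basic residue (K).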
Process
Target values
intensity >= 1% of total intensity: target = 1
intensity < 1% of total intensity: target = 0
Number of positives much smaller than
negatives, hence create a class-balanced dataset
10-fold cross-validation
Input data partitioned into 10 disjoint sets
One set becomes the test set and the remaining 9 become the training
set
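The labeling, balancing, and cross-validation steps can be sketched as follows. The slides do not say how the class-balanced dataset was built, so undersampling the larger class is an assumption here, as are the function names and the random seeds.

```python
import numpy as np


def label_ions(intensities, threshold=0.01):
    """1 if an ion carries >= 1% of the spectrum's total intensity, else 0."""
    intensities = np.asarray(intensities, dtype=float)
    return (intensities >= threshold * intensities.sum()).astype(int)


def balance_classes(X, y, seed=0):
    """Undersample the larger class so both classes end up the same size.

    Undersampling is an assumption; the slides only say "class-balanced".
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]


def ten_fold_splits(n_examples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) for each of n_folds disjoint partitions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), n_folds)
    for k in range(n_folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train, folds[k]
```

Each example appears in exactly one test fold, so every prediction reported from cross-validation comes from a network that never saw that example during training.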
Feature Set Reduction
Unrelated features removed using t-test
Dimensionality Reduction
Correlated features removed using Principal
Component Analysis
Learning task reduced to
classification problem – ion
exists or not
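A rough sketch of the two reduction steps, using NumPy only. The cutoffs are assumptions: a threshold on the Welch t statistic stands in for whatever significance level was actually used, and the fraction of variance kept by PCA is illustrative. Both function names are hypothetical.

```python
import numpy as np


def t_test_filter(X, y, t_min=2.0):
    """Return indices of features whose class means differ (Welch |t| >= t_min).

    Features with zero variance in both classes get t = 0 and are dropped.
    """
    pos, neg = X[y == 1], X[y == 0]
    se = np.sqrt(pos.var(axis=0, ddof=1) / len(pos)
                 + neg.var(axis=0, ddof=1) / len(neg))
    t = (pos.mean(axis=0) - neg.mean(axis=0)) / np.where(se > 0, se, np.inf)
    return np.flatnonzero(np.abs(t) >= t_min)


def pca_reduce(X, var_kept=0.95):
    """Project X onto the leading principal components.

    Keeps the smallest number of components whose cumulative explained
    variance reaches `var_kept`; correlated features collapse together.
    """
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, var_kept) + 1)
    return Xc @ vt[:k].T
```

Applied in sequence, `t_test_filter` removes features unrelated to the class label and `pca_reduce` then merges the remaining correlated features into fewer uncorrelated inputs for the network.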
Process
Train an ensemble of 10 neural networks
with the best-performing number of hidden
neurons for EACH ion in EACH
charge
Report statistics on each cross-validation fold
and the average across folds
Sensitivity (Sn) - % of correctly identified
positive examples
Specificity (Sp) - % of correctly identified
negative examples
Accuracy – ( Sn + Sp ) / 2
AUC – Area under the ROC curve
Ensemble of Neural Networks
Acta Chim. Slov. 2005, 52, 440–449
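The four statistics defined on this slide can be computed as in the sketch below. The 0.5 decision cutoff and the function name `evaluate` are assumptions; the AUC is computed through its rank interpretation (the probability that a random positive scores higher than a random negative), which equals the area under the ROC curve. Values are returned as fractions rather than percentages.

```python
import numpy as np


def evaluate(y_true, scores, cutoff=0.5):
    """Return sensitivity, specificity, accuracy (Sn + Sp) / 2, and AUC."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    sn = np.mean(pos >= cutoff)   # fraction of positives correctly identified
    sp = np.mean(neg < cutoff)    # fraction of negatives correctly identified
    acc = (sn + sp) / 2           # accuracy as defined on the slide
    # Compare every positive score with every negative score;
    # ties count one half.
    diff = pos[:, None] - neg[None, :]
    auc = np.mean((diff > 0) + 0.5 * (diff == 0))
    return sn, sp, acc, auc
```

Because accuracy is the mean of Sn and Sp, it is not inflated by the imbalance between positive and negative examples, unlike the plain fraction of correct predictions.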
Process - Predictor
Final training done with ALL data
Neural Net architecture saved for future use
Steps –
Input peptide with charge to predictor
Peptide decomposed into features
Load saved ANN architecture for each ion in each charge
Predict with the 10 ensemble networks and output the averaged prediction
Score intensities as such –
$$\text{score} = \frac{p \cdot o}{1 - p + o(2p - 1)}$$
p = prior probability
o = predicted output
R. J. Arnold, N. Jayasankar, D. Aggarwal, H. Tang, P. Radivojac. A machine learning approach
to predicting peptide fragmentation spectra. PSB 2006, pp. 219-230
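The last two predictor steps can be sketched as below, under one explicit assumption: that the score on this slide is the standard Bayes correction that maps the output o of a model trained on class-balanced data back to the true class prior p. Its denominator p·o + (1 − p)(1 − o) expands to 1 − p + o(2p − 1). The function names are hypothetical.

```python
import numpy as np


def ensemble_output(member_outputs):
    """Average the outputs of the ensemble members (one row per network)."""
    return np.mean(np.asarray(member_outputs, dtype=float), axis=0)


def prior_corrected_score(o, p):
    """Rescale output o of a balanced-class model to class prior p.

    Assumed form: p*o / (p*o + (1 - p)*(1 - o)).
    """
    o = np.asarray(o, dtype=float)
    return p * o / (p * o + (1.0 - p) * (1.0 - o))
```

With p = 0.5 the score equals the network output unchanged; with a small prior (ions that rarely appear), an uncertain output of 0.5 is pulled down to the prior itself.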
Reproducibility Analysis
Among mouse liver replicates, treat one spectrum as the actual spectrum and
another as the prediction, and compute AUC values
This estimates the maximum accuracy that any
fragmentation predictor can achieve
Results - Reproducibility Analysis
Ion Name     Charge 1 AUC   Charge 2 AUC   Charge 3 AUC
b            97.91          90.90          95.53
b-H2O        96.95          89.66          95.45
b-NH3        93.38          87.97          93.65
b-H2O-NH3    96.70          91.35          96.26
b++          ---------      93.80          92.13
y            97.27          92.90          95.76
y-H2O        91.89          89.57          93.21
y-NH3        94.41          86.79          91.11
y-H2O-NH3    96.41          93.67          96.94
y++          ---------      93.66          93.18
Results – Cross validation Accuracies
Ion             Charge 1                     Charge 2                     Charge 3
                Sn / Sp       Acc / AUC      Sn / Sp       Acc / AUC      Sn / Sp       Acc / AUC
precursor-H2O   57.6 / 60.2   58.9 / 61.1    64.9 / 66.1   65.5 / 71.4    59.6 / 58.5   59 / 62.7
b               83.2 / 78.8   81 / 89        78.7 / 75.4   77 / 84.6      81.1 / 75.8   78.4 / 85.9
b-H2O           79.7 / 76.4   78.1 / 86.1    76.8 / 75.8   76.3 / 83.9    82.5 / 65.5   74 / 81.6
b-NH3           77.2 / 75.1   76.1 / 83.5    73.1 / 76.9   75 / 82.6      83.2 / 61.2   72.2 / 78.8
b-H2O-NH3       74.3 / 76.1   75.2 / 82      72.3 / 64.2   68.2 / 75.2    84.2 / 59.2   71.7 / 77.8
b++             ---           ---            77.4 / 72.8   75.1 / 83.1    80.6 / 72.2   76.4 / 84.4
y               82.6 / 82.3   82.4 / 90.1    84.4 / 79.6   82 / 89.7      84.4 / 81.1   82.8 / 90
y-H2O           79.1 / 77.8   78.4 / 86.1    77.8 / 73.2   75.5 / 82.6    82.5 / 64.7   73.6 / 80.5
y-NH3           76.5 / 68.3   72.4 / 79.5    69.6 / 66.9   68.3 / 75.3    81.7 / 61.9   71.8 / 78.4
y-H2O-NH3       70.4 / 76.4   73.4 / 80.7    74.2 / 64.3   69.3 / 75.7    82.5 / 62.4   72.4 / 77.3
y++             ---           ---            86.6 / 81.5   84 / 90.9      85.9 / 74.9   80.4 / 87.9
Sensitivity-Specificity and Accuracy-AUC for all ions in all charges on cross-validation
Results – Cross Testing Accuracies on Drosophila data for charge 2
Ion          Sn      Sp      ACC     AUC     Combined AUC
b            69.4    78.5    73.9    82.1    84.6
b-H2O        79.4    72.8    76.1    83.6    83.9
b-NH3        71.1    75.0    73.1    81.8    82.6
b-H2O-NH3    69.9    66.0    67.9    76.8    75.2
b++          65.7    77.0    71.4    77.2    83.1
y            68.0    86.8    77.4    86.8    89.7
y-H2O        63.0    70.4    66.7    72.5    82.6
y-NH3        54.1    75.2    64.6    72.7    75.3
y-H2O-NH3    53.9    70.2    62.1    67.5    75.7
y++          90.3    82.2    86.2    92.6    90.9
MassAnalyzer – Peptide Fragmentation tool
Uses a mathematical model to predict fragmentation
Uses one model for charge 1 and charge 2
and a separate model for higher charges
Z. Zhang, Anal. Chem. 2004, 76(14), 3908-3922
Z. Zhang, Anal. Chem. 2005, 77(19), 6364-6373
Results – Prediction Comparison
Ion Name     Charge 1            Charge 2            Charge 3
             AUC MA    AUC ANN   AUC MA    AUC ANN   AUC MA    AUC ANN
b            90.44     91.74     85.61     86.24     90        90.24
b-H2O        89.37     91.87     86.84     85.85     88.99     86.97
b-NH3        85.82     89.01     85.40     85.17     83.61     86.51
b-H2O-NH3    61.31     90.12     71.20     80.15     77.45     85.58
b++          --------  --------  85.34     87.20     86.01     88.56
y            86.65     88.20     85.98     88.57     91.72     91.82
y-H2O        85.96     81.38     77.58     72.95     82.88     87.89
y-NH3        87.62     78.43     78.30     75.73     85.10     85.03
y-H2O-NH3    64.82     77.50     69.51     77.28     76.55     84.47
y++          --------  --------  90.92     93.34     85.60     87.55
MA = MassAnalyzer, ANN = neural network predictor
Results – Prediction Comparison
ROC figures – Charge 2
Results – Spectrum Comparison
Future Work
Reproducibility analysis on various other datasets, accounting
for replicate size (number of replicates for each spectrum)
Use predicted spectra to build another predictor that would learn to
score a given spectrum
Acknowledgements
Dr. Predrag Radivojac
Dr. Haixu Tang
Dr. Randy J. Arnold
Lab mates –
Amrita Mohan
Nils Schimmelmann
Wyatt Clark
Yong Li
Linda Hostetter
Bioinformatics faculty at SOI
School of Informatics