Transcript Slide 1

A Neural Network Predictor for Peptide Fragmentation in Mass Spectrometry
Arunima Ram
Advisor: Dr. Predrag Radivojac
Co-Advisor: Dr. Haixu Tang
Co-Advisor: Dr. Randy J. Arnold
Indiana University, Bloomington, Indiana
Outline
• Introduction to Proteomics
• Introduction to Neural Networks
• Objective
• Data and Process
• Results
• Future Work
• Acknowledgments
Introduction to Proteomics
• Proteins are the molecules of life, made up of chains of amino acids. There are 20 standard amino acids, each represented by a single letter.
• The proteome is the sum of all proteins in an organism, tissue, or sample under study.
[Figure: Amino Acid]
Introduction to Proteomics
• Proteomics is the study of the protein composition of an organelle, a cell, or an entire organism, with the following goals:
    - Identification
    - Quantification
    - Expression changes
    - Modifications
    - Interaction with other proteins and molecules
• Mass spectrometers are the instruments used for proteomics studies.
Introduction to Proteomics
• Proteins are separated
• Proteins are digested into peptides by a specific enzyme, trypsin
• Peptides are separated and charged
• The mass spectrometer selects a peptide based on mass
• A mass spectrum (MS) of the peptides is recorded
• Each peptide is fragmented and sent through a second MS to record MS/MS data
Ruedi Aebersold & Matthias Mann, Nature, vol. 422, 198-207
Introduction to Proteomics
• Fragmentation of peptides follows certain rules
• b ion – N-terminal fragment
• y ion – C-terminal fragment
• The most abundant ions are b and y ions
• A multiply charged peptide can generate multiply charged fragment ions
• Certain residues lose water, ammonia, or both to generate less abundant ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
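The b and y ion definitions above can be turned into a small m/z calculation. The sketch below is an illustration and not part of the original slides: it assumes an unmodified peptide, standard monoisotopic residue masses, and the usual conventions that a b ion is the sum of its residues plus the charge-carrying protons and a y ion additionally carries one water. The function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of one charge-carrying proton


def fragment_ions(peptide, charge=1):
    """Return {ion_name: m/z} for every b and y ion of an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    ions = {}
    prefix = 0.0
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]  # mass of the first i residues
        # b ion: N-terminal fragment made of the first i residues.
        ions[f"b{i}"] = (prefix + charge * PROTON) / charge
        # y ion: complementary C-terminal fragment, which keeps one water.
        ions[f"y{len(peptide) - i}"] = (
            total - prefix + WATER + charge * PROTON
        ) / charge
    return ions


ions = fragment_ions("LNVWGK")
for name in ("b1", "b3", "y1", "y3"):
    print(name, round(ions[name], 3))
```

Under the same assumptions, the less abundant neutral-loss ions would sit roughly 18.011 (water) or 17.027 (ammonia) mass units below the corresponding singly charged b or y ion.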
Introduction to Proteomics
b ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
Introduction to Proteomics
y ions
http://www.ionsource.com/tutorial/DeNovo/nomenclature.htm
Introduction to Artificial Neural Networks
• Neural networks are composed of interconnected neurons working in unison to solve specific problems, analogous to animal brains
• ANNs learn from examples to extract patterns and detect trends too complex to be noticed otherwise
• Benefits
    - Can learn real-valued and discrete-valued functions
    - Robust to noise in data
[Figures: Components of a Neuron cell; Single Neuron]
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
Introduction to Artificial Neural Networks
• Training examples are fed to the input layer
• A weight is associated with each input in each layer
• Weighted inputs are combined at each layer to give an output
• The hidden layer computes its output using a logistic function and feeds it to the output layer
• The error between the network output and the desired output is determined
• Weights in each layer are adjusted accordingly, and the process is iterated
[Figure: 2-layered Feed-forward Neural Network]
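The training loop described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration and not the network used in the talk: the squared-error loss, the learning rate, the number of hidden neurons, the toy OR dataset, and the function names `forward` and `train` are all assumptions made for the example.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Return (hidden activations, network output) for one example x."""
    hidden = [
        logistic(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    output = logistic(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def train(X, y, n_hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a 2-layer feed-forward net by per-example gradient descent."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    history = []  # mean squared error per epoch
    for _ in range(epochs):
        squared_error = 0.0
        for x, target in zip(X, y):
            hidden, out = forward(x, W1, b1, W2, b2)
            err = out - target  # error between network and desired output
            squared_error += err * err
            # Error signals for the output and hidden layers
            # (derivative of the logistic function is a * (1 - a)).
            delta_out = err * out * (1.0 - out)
            delta_hidden = [
                delta_out * W2[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(n_hidden)
            ]
            # Adjust the weights in each layer.
            for j in range(n_hidden):
                W2[j] -= lr * delta_out * hidden[j]
                b1[j] -= lr * delta_hidden[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hidden[j] * x[i]
            b2 -= lr * delta_out
        history.append(squared_error / len(X))
    return (W1, b1, W2, b2), history


# Toy problem (logical OR), used only to exercise the code.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
params, history = train(X, y)
print("first epoch error:", round(history[0], 4))
print("last epoch error:", round(history[-1], 4))
```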
Objective
• Matching peptide fragmentation spectra through “Database Matching” uses ad-hoc rules or probabilistic models and cannot match proteins not present in the database
• Aim – Use machine learning to learn peptide fragmentation rules from a set of examples, predict the fragmentation spectra, and use the predictions to better identify peptides and proteins
Dataset
Organism     Charge 1        Charge 2         Charge 3        Search
             Total/Unique    Total/Unique     Total/Unique    Engine
Shewanella   7175 / 7175     17647 / 17647    3489 / 3489     Sequest
Rat          150 / 58        3047 / 1782      421 / 305       Mascot
Human        4433 / 472      63012 / 2261     14384 / 775     Sequest
Drosophila   -------         2331 / 1234      28 / 25         Mascot
Mouse        1562 / 419      77030 / 8779     31974 / 3961    Sequest
Process
• 202 features extracted for 8 ions in charge 1 and 10 ions in charge 2 and charge 3
    - Amino acids in the peptide
    - Amino acids on both sides of the cleavage
    - Amino acid at the N-terminal
    - Amino acid at the C-terminal
    - Number of Arginine and Lysine in the peptide
    - Mass of the peptide and mass of the fragment ion
    - Basicity, hydrophobicity, isoelectric point, and helix propensity for the peptide, for the charged ion, and for neighboring amino acids
[Figure: example peptide L N V W G K with the cleavage point marked (b-3 ion)]
R. J. Arnold, N. Jayasankar, D. Aggarwal, H. Tang, P. Radivojac. A machine learning approach
to predicting peptide fragmentation spectra. PSB 2006, pp. 219-230
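A few of the sequence-based features listed on this slide can be sketched as follows. This is only an illustration of the idea: it covers a small subset of the 202 features, leaves out the mass and physicochemical features, and the function name `extract_features` and the feature names are hypothetical rather than taken from the talk or the cited paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_features(peptide, cleavage_index):
    """Describe one cleavage site of a peptide.

    cleavage_index is the number of residues in the N-terminal (b) fragment,
    so cleavage_index=3 corresponds to the b-3 ion.
    """
    if not 0 < cleavage_index < len(peptide):
        raise ValueError("cleavage must fall between two residues")
    features = {
        "n_terminal": peptide[0],
        "c_terminal": peptide[-1],
        # Amino acids on both sides of the cleavage point.
        "left_of_cleavage": peptide[cleavage_index - 1],
        "right_of_cleavage": peptide[cleavage_index],
        # Number of Arginine (R) and Lysine (K) residues in the peptide.
        "num_arg_lys": peptide.count("R") + peptide.count("K"),
    }
    # Amino acid composition of the whole peptide.
    for aa in AMINO_ACIDS:
        features[f"count_{aa}"] = peptide.count(aa)
    return features


# Example peptide from the slide, cleaved at the b-3 ion: L N V | W G K
features = extract_features("LNVWGK", 3)
print(features["left_of_cleavage"], features["right_of_cleavage"])
```

Categorical entries such as the terminal residues would still need a numeric encoding (for example one indicator per amino acid) before being fed to a neural network; that step is omitted here.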
Process
• Target values
    - intensity >= 1% of total intensity = 1
    - intensity < 1% of total intensity = 0
    - Number of positives is much smaller than negatives, hence create a class-balanced dataset
• 10-fold cross-validation
    - Input data partitioned into 10 disjoint sets
    - One set becomes the test set and the remaining 9 become the training set
• Feature Set Reduction
    - Unrelated features removed using T-Test
    - Correlated features removed using Principal Component Analysis (Dimensionality Reduction)
• Learning task reduced to a classification problem – ion exists or not
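The labeling, class balancing, and 10-fold partitioning steps on this slide can be sketched as below. The slide does not say how the class-balanced dataset was built, so random undersampling of the larger class is an assumption here, as are the function names. The t-test and PCA steps are not shown.

```python
import random


def label_ion(intensity, total_intensity, threshold=0.01):
    """Return 1 if the ion carries at least 1% of the total intensity, else 0."""
    return 1 if intensity >= threshold * total_intensity else 0


def balance_classes(examples, labels, rng):
    """Undersample the larger class so positives and negatives are equal (assumed method)."""
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    n = min(len(positives), len(negatives))
    keep = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(keep)
    return [examples[i] for i in keep], [labels[i] for i in keep]


def k_fold_splits(n_examples, rng, k=10):
    """Yield (train_indices, test_indices) for k disjoint folds."""
    order = list(range(n_examples))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


rng = random.Random(0)
labels = [label_ion(x, 100.0) for x in [5.0, 0.2, 0.1, 1.0, 0.5, 0.3, 2.0, 0.05]]
examples = list(range(len(labels)))
balanced_examples, balanced_labels = balance_classes(examples, labels, rng)
print(sum(balanced_labels), "positives out of", len(balanced_labels))
for train, test in k_fold_splits(25, rng):
    print(len(train), "training examples,", len(test), "test examples")
```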
Process
• Train an ensemble of 10 neural networks, using the best-performing number of hidden neurons, for EACH ion in EACH charge
• Report statistics on each cross-validation fold and average across folds
• Sensitivity (Sn) – % of correctly identified positive examples
• Specificity (Sp) – % of correctly identified negative examples
• Accuracy – ( Sn + Sp ) / 2
• AUC – Area under the ROC curve
[Figure: Ensemble of Neural Networks]
Acta Chim. Slov. 2005, 52, 440–449
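The four statistics defined on this slide can be computed as in the sketch below. The function names and the toy labels are illustrative. The AUC is computed through its rank interpretation (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half), which equals the area under the ROC curve.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (Sn, Sp) as fractions for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def accuracy(sn, sp):
    """Accuracy as defined on the slide: the mean of Sn and Sp."""
    return (sn + sp) / 2


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
sn, sp = sensitivity_specificity(y_true, y_pred)
print(sn, sp, accuracy(sn, sp), auc(y_true, scores))
```

Because accuracy is the mean of Sn and Sp, it is not dominated by the larger class; for example, Sn = 83.2 and Sp = 78.8 give an accuracy of 81.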
Process - Predictor
• Final training done with ALL data
• Neural net architecture saved for future use
• Steps –
    - Input peptide with charge to the predictor
    - Peptide decomposed into features
    - Extract the saved ANN architecture for each ion in each charge
    - Predict with the ensemble of 10 networks and output the averaged prediction
    - Score intensities as: $\text{score} = \dfrac{p \cdot o}{1 - o + o(2p - 1)}$
        p = prior probability
        o = predicted output
R. J. Arnold, N. Jayasankar, D. Aggarwal, H. Tang, P. Radivojac. A machine learning approach
to predicting peptide fragmentation spectra. PSB 2006, pp. 219-230
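The ensemble-averaging step of the predictor can be sketched as follows. The data layout is an assumption made for illustration: `ensembles` is a hypothetical dictionary mapping an (ion, charge) pair to a list of trained networks, each represented here as a plain callable from a feature vector to an output in [0, 1]. Feature extraction and the final scoring step are left out.

```python
from statistics import mean


def predict_ion_outputs(features, ensembles, charge):
    """Average the outputs of every network in each ion's ensemble.

    ensembles: dict mapping (ion_name, charge) -> list of callables,
    each taking the feature vector and returning a value in [0, 1].
    """
    predictions = {}
    for (ion, z), networks in ensembles.items():
        if z != charge:
            continue  # use only the ensembles trained for this charge state
        predictions[ion] = mean(net(features) for net in networks)
    return predictions


# Stand-in "networks" that ignore their input, used only to exercise the code.
ensembles = {
    ("b", 2): [lambda f: 0.8, lambda f: 0.6],
    ("y", 2): [lambda f: 0.9, lambda f: 0.7, lambda f: 0.8],
    ("b", 1): [lambda f: 0.1],
}
outputs = predict_ion_outputs([0.0, 1.0], ensembles, charge=2)
print(sorted(outputs))
```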
Reproducibility Analysis
• Among the Mouse liver replicates, treat one replicate as the actual spectrum and another as the prediction, and compute AUC values
• This estimates the maximum accuracy that any fragmentation predictor can achieve
Results - Reproducibility Analysis
Ion Name     Charge 1 AUC   Charge 2 AUC   Charge 3 AUC
b            97.91          90.90          95.53
b-H2O        96.95          89.66          95.45
b-NH3        93.38          87.97          93.65
b-H2O-NH3    96.70          91.35          96.26
b++          ---------      93.80          92.13
y            97.27          92.90          95.76
y-H2O        91.89          89.57          93.21
y-NH3        94.41          86.79          91.11
y-H2O-NH3    96.41          93.67          96.94
y++          ---------      93.66          93.18
Results – Cross validation Accuracies
Ion             Charge 1                    Charge 2                    Charge 3
                Sn / Sp       Acc / AUC     Sn / Sp       Acc / AUC     Sn / Sp       Acc / AUC
precursor-H2O   57.6 / 60.2   58.9 / 61.1   64.9 / 66.1   65.5 / 71.4   59.6 / 58.5   59 / 62.7
b               83.2 / 78.8   81 / 89       78.7 / 75.4   77 / 84.6     81.1 / 75.8   78.4 / 85.9
b-H2O           79.7 / 76.4   78.1 / 86.1   76.8 / 75.8   76.3 / 83.9   82.5 / 65.5   74 / 81.6
b-NH3           77.2 / 75.1   76.1 / 83.5   73.1 / 76.9   75 / 82.6     83.2 / 61.2   72.2 / 78.8
b-H2O-NH3       74.3 / 76.1   75.2 / 82     72.3 / 64.2   68.2 / 75.2   84.2 / 59.2   71.7 / 77.8
b++             ---------     ---------     77.4 / 72.8   75.1 / 83.1   80.6 / 72.2   76.4 / 84.4
y               82.6 / 82.3   82.4 / 90.1   84.4 / 79.6   82 / 89.7     84.4 / 81.1   82.8 / 90
y-H2O           79.1 / 77.8   78.4 / 86.1   77.8 / 73.2   75.5 / 82.6   82.5 / 64.7   73.6 / 80.5
y-NH3           76.5 / 68.3   72.4 / 79.5   69.6 / 66.9   68.3 / 75.3   81.7 / 61.9   71.8 / 78.4
y-H2O-NH3       70.4 / 76.4   73.4 / 80.7   74.2 / 64.3   69.3 / 75.7   82.5 / 62.4   72.4 / 77.3
y++             ---------     ---------     86.6 / 81.5   84 / 90.9     85.9 / 74.9   80.4 / 87.9
Sensitivity/Specificity and Accuracy/AUC for all ions in all charge states under cross-validation
Results – Cross Testing Accuracies on Drosophila data for charge 2
Ion          Sn      Sp      ACC     AUC     Combined AUC
b            69.4    78.5    73.9    82.1    84.6
b-H2O        79.4    72.8    76.1    83.6    83.9
b-NH3        71.1    75.0    73.1    81.8    82.6
b-H2O-NH3    69.9    66.0    67.9    76.8    75.2
b++          65.7    77.0    71.4    77.2    83.1
y            68.0    86.8    77.4    86.8    89.7
y-H2O        63.0    70.4    66.7    72.5    82.6
y-NH3        54.1    75.2    64.6    72.7    75.3
y-H2O-NH3    53.9    70.2    62.1    67.5    75.7
y++          90.3    82.2    86.2    92.6    90.9
MassAnalyzer – Peptide Fragmentation tool
• Uses a mathematical model to predict fragmentation
• Uses one model for charge 1 and charge 2 and a separate model for higher charges
Z. Zhang, Anal. Chem. 2004, 76(14), 3908-3922
Z. Zhang, Anal. Chem. 2005, 77(19), 6364-6373
Results – Prediction Comparison
Ion Name     Charge 1              Charge 2              Charge 3
             AUC MA     AUC ANN    AUC MA     AUC ANN    AUC MA     AUC ANN
b            90.44      91.74      85.61      86.24      90         90.24
b-H2O        89.37      91.87      86.84      85.85      88.99      86.97
b-NH3        85.82      89.01      85.40      85.17      83.61      86.51
b-H2O-NH3    61.31      90.12      71.20      80.15      77.45      85.58
b++          ---------  --------   85.34      87.20      86.01      88.56
y            86.65      88.20      85.98      88.57      91.72      91.82
y-H2O        85.96      81.38      77.58      72.95      82.88      87.89
y-NH3        87.62      78.43      78.30      75.73      85.10      85.03
y-H2O-NH3    64.82      77.50      69.51      77.28      76.55      84.47
y++          --------   --------   90.92      93.34      85.60      87.55
Results – Prediction Comparison
ROC figures – Charge 2
Results – Spectrum Comparison
Future Work
• Reproducibility analysis on additional datasets, accounting for replicate size (number of replicates for each spectrum)
• Use the predicted spectra to build another predictor that learns to score a given spectrum
Acknowledgements
• Dr. Predrag Radivojac
• Dr. Haixu Tang
• Dr. Randy J. Arnold
• Lab mates –
    - Amrita Mohan
    - Nils Schimmelmann
    - Wyatt Clark
    - Yong Li
    - Linda Hostetter
• Bioinformatics faculty at SOI
• School of Informatics