Transcript 投影片 1
Algorithmic Problems in Peptide
Sequencing
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identifications
De Novo Sequencing
An Algorithm for Perfect Spectra
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
2/54
Briefings
• We mainly focus on the following result:
– Ting Chen, Ming-Yang Kao, Matthew Tepel, John Rush and
George Church, A Dynamic Programming Approach to
De Novo Peptide Sequencing via Tandem Mass
Spectrometry, Journal of Computational Biology, 8(3):
325-337, 2001.
– Its preliminary version also appears in The 11th Annual
SIAM-ACM Symposium on Discrete Algorithms (SODA
2000), page 389-398, 2000.
• One of the most-cited algorithm articles in the
computational proteomics community.
De Novo Sequencing for Peptide Identificaiton
3/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identifications
De Novo Sequencing
An Algorithm for Perfect Spectra
An Improved Version
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
4/54
Anatomy of Protein Molecules
• Residue (of the peptides)
• Neutral peptide
H
H
O
NH
C
C
OH
NH
Rx
H
O
C
C
Rx
Stable state in nature
Basic building blocks
De Novo Sequencing for Peptide Identificaiton
5/54
Proteins and Peptides
H2 N
H
O
C
C
146.19
128.17
R
174.13
156.11
N
H
R1
K
H
H
H
O
H
H
O
N
C
C
N
C
C
C
C
R2 O
R3
R4
H
H2 N
C
C
R1
C
H
R5
H
H
O
H
H
O
N
C
C
N
C
C
R3
O
N
COOH
trypsin + H2O
arginine (R)
or lysine (K)
H
H
R4
H
N
C
H
R5
COOH
H
N
H
C
C
OH
Rectangles stand for amino acid residues
R2
O
De Novo Sequencing for Peptide Identificaiton
6/54
Amino Acid Molecules
• Please visit http://www.ionsource.com/ for more
information.
De Novo Sequencing for Peptide Identificaiton
7/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identifications
De Novo Sequencing
An Algorithm for Perfect Spectra
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
8/54
Tandem Mass Spectrometry
• Mass Spectrometers measure the mass of charged ions.
– A mass spectrometer has 3 major components.
Sample
+
_
Ionizer
Mass Analyzer
Detector
Adapted from Nathan Edwards’ slides
De Novo Sequencing for Peptide Identificaiton
9/54
Proteomics via Mass Spectrometers
Enzymatic Digest
and
Fractionation
First stage MS
MS/MS
Precursor selection
and dissociation
Adapted from Nathan Edwards’ slides
De Novo Sequencing for Peptide Identificaiton
10/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identification
De Novo Sequencing
An Algorithm for Perfect Spectra
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
11/54
Peptide Identification
• Given:
• A MS/MS spectrum (m/z, intensity, possibly along with its
retention time)
• The precursor mass
• Output:
• The amino-acid sequence of the peptide
• Imagine a deck of cards that you can cut many times
and obtains the sums of the upper or lower half
De Novo Sequencing for Peptide Identificaiton
12/54
Peptide Fragmentation Mechanism
N-Terminus
C-Terminus
b-ions
y-ions
y-ions
R
E
G
L
b-ions
L
G
E
De Novo Sequencing for Peptide Identificaiton
R
m/z
13/54
Peaks in a Spectrum
• Peptide: L – G – E – R
Weight
Ion
Amino
Amino
Acids
Acids
114.2
b1
L
171.2
b2
LG
300.3
b3
LGE
Ion
Weight
GER y3
361.3
ER y2
304.3
R y1
175.2
De Novo Sequencing for Peptide Identificaiton
14/54
Manual De Novo Sequencing
667.27-536.24=131.03
Molecular weight of
M
128.09 ≈147.11-19
Molecular weight of
K
De Novo Sequencing for Peptide Identificaiton
15/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identification
De Novo Sequencing
An Algorithm for Perfect Spectra
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
16/54
De Novo Sequencing
•
De Novo: From the beginning in Latin.
– Database search tools match against known
peptides.
•
Problem Definitions:
Given a spectrum ( a set of real intervals ),
a mass value M,
compute a sequence P, ( a set of real number with specific order)
s.t. m(P)=M, and the matching score is maximized.
m(P) is the sum of residue mass.
M
De Novo Sequencing for Peptide Identificaiton
17/54
De Novo Sequencing: An Ideal Case
• An ideal tandem mass spectrum is noise-free and
contains only b- and y-ions, and every mass peak has
the same height.
The task is to find paths connecting two endpoints on a
directed acyclic graph.
The problem is : how to construct the ion ladder?
M
De Novo Sequencing for Peptide Identificaiton
18/54
Ion Ladders in an Ideal Case
Based on an ideal ion ladder, we can determine the
sequence by concatenating prefixes (or suffixes) in
order.
However, we cannot determine the ion type of a peak
before identifying it.
R
L
y1 E
G
y2 G
E
y3
Given only
L+ , ER+,
LGE+, R+
L
R
De Novo Sequencing for Peptide Identificaiton
m/z
19/54
NC-Spectrum Model
•
We generate a (superset of ) ladder of ions.
– A Trick: Even if we cannot determine the ion types, we know
that an ion is either b-ion or y-ion.
1. Assume that we want to generate b-ion ladder.
2. If a peak is a b-ion, add the peak value to the list.
3. If a peak is a y-ion, add the complementary b-ion value to
the list.
•
This phase doubles the number of peaks.
De Novo Sequencing for Peptide Identificaiton
20/54
NC-Spectrum Model
• For the peptide sequence LGRE, we construct all
possible b-ions with respect to current spectrum.
• {P1, Q3, P4} or {P2, P3, Q1} are both complete ladders.
GER
LG
Q2
Q4 Q3
0
Q1
Pi: observed peaks
Qi: artificial peaks
m
m/2
P1
P2
P3 P4
L
R
ER LGE
De Novo Sequencing for Peptide Identificaiton
21/54
NC-Spectrum Model
•
Given a peak list = {P1,P2,P3, … , Pk}
•
The coordinates of all points along the line:
•
1. Pk – 1
Since the ion loses a Hydrogen
2. Qk = M – Pk+1 (why?)
(M – (Pk – 1 ) ) - 1
We still have to add two endpoints:
1. 0
2. M – 18
De Novo Sequencing for Peptide Identificaiton
22/54
NC Spectrum Model: A Summary
•
We are given k peaks.
– Now we have at most 2k+2 vertices.
•
Two vertices are adjacent if their coordinates differ by
the weight of some amino acid.
– The spectrum graph can be constructed in O(n2). (Why?)
•
The de novo sequencing is to search a path (or paths)
representing a good path from coordinate 0 to M-18.
– Such a path is not necessarily an ion ladder, though.
De Novo Sequencing for Peptide Identificaiton
23/54
Dynamic Programming Strategy
Dynamic Programming can solve this problem
efficiently.
•
–
Uni-directional (forward) DP does not work since it could
produce a solution containing both candidates for each
peak.
Q2
Q4 Q3
0
Q1
m
m/2
P1
P2
P3
De Novo Sequencing for Peptide Identificaiton
P4
24/54
Dynamic Programming Strategy (Cont’d)
Dynamic Programming can solve this problem
efficiently using a different encoding scheme.
•
–
We approach the middle part from both end sides.
Q2
Q4 Q3
0
Q1
m
m/2
P1
P2
P3
De Novo Sequencing for Peptide Identificaiton
P4
25/54
Dynamic Programming Strategy (Cont’d)
• Mass(b-ion) + Mass(y-ion) = PrecursorMass +2
– These b-ion candidates are nested pairs in the spectrum
graph.
0
m/2
De Novo Sequencing for Peptide Identificaiton
m
26/54
Relabeling the Vertices
•
To encode the spectrum graph by the nested pairs, we
need to relabel the vertex number.
1. {0 = x0, x1, x2, …, xk, yk, …, y2, y1, y0 = m}
2. xi and yi are both generated from the same peak.
3. We go one level further in each iteration.
0
m/2
x0
xk
yk
De Novo Sequencing for Peptide Identificaiton
m
y0
27/54
How Dynamic Programming Works
We design the |V|×|V| matrix M for representing
partial path candidates.
•
1.
M(i, j) = 1 iff [xo, xi] and [yj, yo] can occur simultaneouly in a legal
path.
2. For 1≦ s ≦ i, 1 ≦ s ≦ j, s occurs exactly once in the determined
partial path.
?
0
xi
yj
m/2
De Novo Sequencing for Peptide Identificaiton
m
28/54
How Dynamic Programming Works
(Cont’d)
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
m/2
m
0
M(0,0) = 1
x0
M(0,1) = 1
x0
M(1,0) = 1
x0
y0
y1
x1
De Novo Sequencing for Peptide Identificaiton
y0
y0
29/54
How Dynamic Programming Works
(Cont’d)
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
m/2
m
0
M(0,1) = 1
x0
M(1,0) = 1
x0
M(2,0) = 0
x0
y1
x1
x1
y0
y0
x2
y0
•M(1,0) =1 , but we cannot reach x2 from x0 nor x1.
M(2,1) = 1
x0
x2
y1
y0
•M(0,1) =1 , and we can reach x2 from x0.
De Novo Sequencing for Peptide Identificaiton
30/54
How Dynamic Programming Works
(Cont’d)
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
m/2
m
0
M(0,1) = 1
x0
M(1,0) = 1
x0
M(0,2) = 0
y1
x1
x0
y0
y0
y2
y1
y0
•M(0,1) =1 , but we cannot reach y2 from y0 nor y1.
M(1, 2) = 1
x0
x1
y2
y0
•M(1,0) =1 , and we can reach y2 from y0.
De Novo Sequencing for Peptide Identificaiton
31/54
Dynamic Programming: Preview
• In the i-th iteration, we determine and record all
possible (partial) paths in [0, xi] and [ yi, m].
0
m
m/2
x0
x0
…
xi-1
yt
xi or yi?
…
xi-1
xi
t < i-1
yi
De Novo Sequencing for Peptide Identificaiton
yt
…
…
y0
y0
32/54
Dynamic Programming: Preview(Cont’d)
Path extension
• How can we reach yi?
• To calculate M(xj, yi) for all j < i,
• For every j < i, check if yi is adjacent to yt and M(xj, yt)
= 1, for some t < i
x0
…
xj
yi
yt
…
y0
– Then M(xj, yi) = 1. Otherwise, it is 0.
x0
…
xj
yi
De Novo Sequencing for Peptide Identificaiton
yt
…
y0
33/54
Dynamic Programming: Preview(Cont’d)
Path extension
• Similarly, how can we reach xi?
• To calculate M(xi, yj) for all j < i,
• For every j < i, check if xi is adjacent to xt and M(xt, yj)
= 1, for some t < i
x0
…
xt
xi
yj
…
y0
– Then define M(xi, yj) =1.
x0
…
xt
xi
De Novo Sequencing for Peptide Identificaiton
yj
…
y0
34/54
Dynamic Programming
m/2
m
0
x0
x1
x2
x3
x4
y4
M
y3
y0
y2
y1
y1
y2
y3
y0
y4
x0
x1
x2
x3
x4
De Novo Sequencing for Peptide Identificaiton
35/54
Dynamic Programming: Initialization
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
M
y0
x0
1
y2
y1
y0
y1
y2
y3
y4
x1
0
0
0
0
x2
0
0
0
0
x3
0
0
0
0
x4
0
0
0
0
De Novo Sequencing for Peptide Identificaiton
36/54
Dynamic Programming: 1st iteraton
We then compute M(1,0) and M(0,1).
m/2
m
0
x0
x1
x2
x3
x4
Check the arcs (x0, x1) and
(y1, y0)
y4
y3
y2
M
y0
y1
x0
1
1
x1
1
y1
y0
y2
y3
y4
0
0
0
0
x2
0
0
0
0
x3
0
0
0
0
x4
0
0
0
0
De Novo Sequencing for Peptide Identificaiton
37/54
Dynamic Programming: Recursion (a)
For j = 2 to k
For i = 0 to j-2
(a) If M(i, j-1) = 1 and edge(Xi, Xj) = 1, then M(j, j-1) = 1.
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
y1
M
y0
y1
x0
1
1
x1
1
0
y0
y2
y3
y4
0
0
0
x3
0
0
0
x4
0
0
0
x2
Can we adjust the leftmost endpoint to xj?
y2
De Novo Sequencing for Peptide Identificaiton
1
38/54
Dynamic Programming: Recursion (b)
For j = 2 to k
For i = 0 to j-2
(b) If M(i, j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i, j) = 1.
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
y1
M
y0
y1
y2
x0
1
1
0
x1
1
0
y0
y3
y4
0
0
0
x3
0
0
0
x4
0
0
0
x2
Can we adjust the rightmost endpoint to yj?
y2
De Novo Sequencing for Peptide Identificaiton
1
39/54
Dynamic Programming: Recursion (c)
For j = 2 to k
For i = 0 to j-2
(c) If M(j-1,i) = 1 and edge(Xj-1, Xj) = 1, then M(j, i) = 1.
m/2
m
0
x0
x1
x2
x3
x4
y4
Can we adjust the leftmost endpoint to xj?
y3
y2
y1
M
y0
y1
y2
x0
1
1
0
x1
1
0
x2
0
1
y0
y3
y4
0
0
0
x3
0
0
0
x4
0
0
0
De Novo Sequencing for Peptide Identificaiton
40/54
Dynamic Programming: Recursion (d)
For j = 2 to k
For i = 0 to j-2
(d) If M(j-1, i) = 1 and edge(Yi, Yj) = 1, then M(j-1, j) = 1.
m/2
m
0
x0
x1
x2
x3
x4
y4
Can we adjust the rightmost endpoint to yj?
y3
y2
y1
M
y0
y1
y2
x0
1
1
0
x1
1
0
1
x2
0
1
y0
y3
y4
0
0
0
x3
0
0
0
x4
0
0
0
De Novo Sequencing for Peptide Identificaiton
41/54
Dynamic Programming (Cont’d)
Now for j = 3
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
M
y0
y1
y2
y3
x0
1
1
0
0
x1
1
0
1
1
x2
0
1
0
1
0
x3
0
0
1
0
0
0
0
0
x4
De Novo Sequencing for Peptide Identificaiton
y4
42/54
Dynamic Programming (Cont’d)
Now for j = 4
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
M
y0
y1
y2
y3
y4
x0
1
1
0
0
0
x1
1
0
1
1
0
x2
0
1
0
1
0
x3
0
0
1
0
0
x4
0
0
0
1
0
De Novo Sequencing for Peptide Identificaiton
43/54
Dynamic Programming: Constructing the
Answer
• Legal path: Starting our search from the outermost regions ( the last
row/column):
– [x4, y4] -> [x3, y3] -> [x2, y2] ->[x1, y1]
– We backtrack M to search each edge corresponding to the feasible solution
m/2
m
0
x0
x1
x2
x3
x4
y4
y3
y2
y1
y0
M
y0
y1
y2
y3
y4
x0
1
1
0
0
0
x1
1
0
1
1
0
x2
0
1
0
1
0
x3
0
0
1
0
0
x4
0
0
0
1
0
De Novo Sequencing for Peptide Identificaiton
44/54
Dynamic Programming: Review
• Chen et al. create a new NC-specturm graph G=(V, E),
where V=2k+2 and k is the number of mass peaks
(ions).
• Given the NC-spectrum graph, we can solve the ideal
de novo peptide sequencing problem in O(|V|2) time
and O(|V|2) space.
– M construction : O(|V|2) time
– Constructing a feasible solution : O(|V|) time
• Therefore we find a feasible solution in O(|V|2) time
and O(|V|2) space.
De Novo Sequencing for Peptide Identificaiton
45/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identification
De Novo Sequencing
An Algorithm for Perfect Spectra
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
46/54
Noises in Real Spectra
•
The de novo strategy is too fragile to handle frequent
errors.
1. False negative peaks
•
Missing ions will break the path. The algorithms may find wrong
paths by concatenating two partial paths.
2. False positive peaks
•
The main critique of de novo strategy
3. Peak value is not the ion mass
•
Peak values represent the mass over charge value of ions.
•
It relies on the vendor. (Applied Biosystem)
De Novo Sequencing for Peptide Identificaiton
47/54
False Positives in Real Spectra
•
Different types of ions
– a-x, b-y, c-z
– Internal fragments/immonium ions
•
Neutral losses
– Neutral loss of water (~18Da)
– Neutral loss of ammonia (~17Da)
•
PTM (like adding new letters)
– Phosphorylation, glycopeptides
•
•
Isotopes
Unpurified samples
De Novo Sequencing for Peptide Identificaiton
48/54
Database Search Tools
• MASCOT: http://www.matrixscience.com/
• The de facto identification tool
De Novo Sequencing for Peptide Identificaiton
49/54
Database Search Tools (Cont’d)
• Brian Searle of Proteome Software informs us:
De Novo Sequencing for Peptide Identificaiton
50/54
Peptide and Protein Identification
• A brief comparison of popular tools
Scoring Strategy
Representatives
Correlation, Z-score, posterior
probabilities
SEQUEST, MS-Tag, Scope, CIDentify, Popitam,
ProbID, and PepSearch
Statistical significance: E-values or Pvalues
Mascot, Sonar, InsPecT,
OMSSA, and X!Tandem
De Novo Sequencing
Pseudo-peaks
PEAKS
Spectrum graphs
Lutefisk, PepNovo, AUDENS
Statistical models
NovoHMM
De Novo Sequencing for Peptide Identificaiton
51/54
Outline
Basics of Proteomics
Roles and Anatomy of Proteins
Tandem Mass Spectrometry
Algorithms for Peptide Identification
De Novo Sequencing
An Algorithm for Perfect Spectra
An Improved Version
Peptide Identification in Real World
Discussions
De Novo Sequencing for Peptide Identificaiton
52/54
Wrap Up
• The MS/MS measures the mass of fragment ions.
– A single stage MS measures a collection of peptide.
• We generate ion ladders to reconstruct peptide
sequence.
– Masses are more reliable than intensities.
• De novo sequencing is an elegant strategy, but not
robust.
– We need some signal preprocessing strategies.
• Database search tools cannot handle novel proteins,
and results from different tools are often inconsistent.
– Integration of the two above methods may be a possible way.
De Novo Sequencing for Peptide Identificaiton
53/54
Some Guys You May Wish to Know
Affiliation Principal Investigators
Topics
ETH at Zurich
Ruedi Aebersold
Peptide-atlas, statistical significance
estimation
UCSD
Pavel Pevzner, Vineet Bafna
De novo sequencing: Multi-spectra
alignment
Waterloo
Bin Ma
De novo sequencing: SPIDER,
PEAKS
NIH
Yi-Kuo Yu
Signal calibration, statistical
significance estimation
Xerox
Andrew Goldberg, Marshall Bern
PTM
Georgetown
Nathan Edwards
Peptide identification
USC
Tim Chen
De Novo Sequencing
De Novo Sequencing for Peptide Identificaiton
54/54