Presentation (PowerPoint File)
Download
Report
Transcript Presentation (PowerPoint File)
PEAKS: De Novo Sequencing
using Tandem Mass Spectrometry
Bin Ma
Dept. of Computer Science
University of Western Ontario
Outline
• Background
• Sandwich algorithm for de novo sequencing
• Software implementation – PEAKS
Background
• Diseases are closely related to the abnormal
proteins.
• Given a tissue, the identification of the proteins
(and their posttranslational modifications) in it
is a fundamental problem in proteomics.
• MS/MS is the most common way for protein
identification.
Sample Preparation
tissue
fraction
gel
GTDIMR
PAK
MPSER
……
……
Add trypsin
peptides
HPLC
To
MS/MS
Tandem Mass Spectrometer
QTOF
ions
MPSER +
PAK +
SG… +
…
ESI
parent ions
Quadrupole
mass
analyzer
PAK +
PAK +
PAK +
PAK +
detector
fragment ions
collision
P + AK
P
AK +
PA + K
PA
K+
TOF
mass
analyzer
+
AK
PA
+
+
K
P+
peptide sequence:
LGSSEVEQVQLVVDGVK
tandem mass spectrometry:
MS/MS spectrum
database
de novo sequencing:
LGSSEVEQVQLVVDGVK
How Does a Peptide Fragment?
m(b1)=1+m(A1)
m(b2)=1+m(A1)+m(A2)
m(b3)=1+m(A1)+m(A2)+m(A3)
m(y1)=19+m(A4)
m(y2)=19+m(A4)+m(A3)
m(y3)=19+m(A4)+m(A3)+m(A2)
Matching Sequence with Spectrum
De Novo Sequencing
• De Novo Sequencing (Dancik et al., JCB
6:327-342.)
– Given a spectrum, a mass value M,
compute a sequence P, s.t. m(P)=M, and
the matching score is maximized.
• We consider the matching score of P is the
sum of the scores of the matched peaks.
– We use intensity of a peak as its score to
illustrate PEAKS’ algorithm.
Spectrum Graph Approach
• Convert the peak list to a graph. A peptide
sequence corresponds to a path in the graph.
• Bartels (1990), Biomed. Environ. Mass Spectrom
19:363-368.
• Taylor and Johnson (1997). Rapid Comm. Mass
Spec. 11:1067-1075. (Lutefisk)
• Dancik et al. (1999), JCB 6:327-342.
• Chen et al. (2001), JCB 8:325-337.
• ……
Warm up – Counting Only Y-ions
The Score of a Suffix
19
y1
y2
y3
Let Q be a suffix of the peptide. It can
determine some y-ions.
score(Q) are the sum of scores of those
y-ions of Q.
DP(m) max score(Q)
m(Q ) m
19
Recursive Computation of DP(m)
Q’
a
Suppose Q is such that DP(m)=score(Q).
score (Q) h(m 19) score (Q' )
score(Q’)=DP(m(Q’))
DP (m) h(m 19) DP (m m(a))
Do not know a?
DP (m) h(m 19) max DP (m m(a))
a
Dynamic Programming
1. for m from 0 to M
DP (m) h(m 19) max a DP (m a)
2. backtracking
Counting Both y and b Ions
Good News
y1
y2
y3
bn-3 bn-2 bn-1
Bad News
Ions Determined By a Pair
P=LGEY
Q=LLVR
score(P,Q) is
the sum of
matched peak
intensities. A
peak can only
count once.
Chummy Pairs
• Two strings P and Q are called chummy pairs, iff.
either of the following two is true:
1 m( P ) 19 m(Q) 1 m( P)
(C1)
P LG ,
Q VR,
P LGE
(C2)
19 m(Q ) 1 m( P) 19 m(Q)
Q VR,
P LGE,
Q LVR
Recursive Computation of score(P,Q)
P=LGEY
Q=LLVR
Q LVR
u=m(P), v=m(Q)
score ( P, Q)
f (u , v) score ( P, Q )
Chummy pairs
• Lemma 1 – Suppose P and Q are a
chummy pair. u=m(P), v=m(Q). If (C1) is
true,
score( P,Q) score( P ,Q) g (u, v)
If (C2) is true,
score( P,Q) score( P,Q ) f (u, v)
Chummy Pairs
• Lemma 2 – Let (P,Q) be a chummy pair, a be a
letter.
– (C1) (P,aQ) is a chummy pair but (Pa,Q) is not.
– (C2) (Pa,Q) is a chummy pair but (P,aQ) is not.
• Lemma 3 – Let S be the optimal solution.
Then there is a chummy pair (P,Q) and a letter
a such that S=PaQ. Also, there is a chummy
pair series such that
( , ) ( P1 , Q1 ) ( Pn , Qn ) ( P, Q)
Dynamic Programming
• Combining Lemma 1, 2, 3, we can compute
DP (u, v)
max
m ( P ) u , m ( Q ) v
( P ,Q ) chummy
score ( P, Q)
• Suppose (P,Q) is the pair maximizing DP(u,v)
under the condition m(P)+m(Q)+m(a)=M. Then
PaQ is the optimal peptide.
Algorithm Sandwich
•
•
DP(0,0) = 0; DP(u,v) = -infinity for (u,v)!=(0,0);
for u from 1 to M/2 step d do
for v from u-m(W) to u+m(W) step d do
for a in Σ do
if u<v then
DP(u a, v) max DP(u, v) f (u, v), DP(u a, v)
else
DP(u, v a) max DP(u, v) g (u, v), DP(u, v a)
• find u,v,a, s.t. u+v+m(a)=M and DP(u,v) maximized;
• backtracking;
Time:
O(
M m( W ) | |
)
2
d
PEAKS – The Software
Comparison
• LCQ data (Iontrap instrument):
– Generously provided by Dr. Richard Johnson.
144 spectra.
• Micromass Q-Tof data:
– Measured in UWO’s Protein ID lab. 61 spectra
• Sciex Q-Star data:
– Provided by U. Victoria’s Genome BC
Proteomics Centre. 13 good/okay spectra.
PEAKS v.s. Lutefisk
• completely correct sequences:
– 38/144 v.s. 15/144
• correct amino acids:
– 1067/1702 v.s. 767/1702 v.s.
• partially correct sequences with 5 or
more contiguous correct amino acids:
– 94/144 v.s. 64/144
PEAKS v.s. Micromass PLGS
• completely correct sequences:
– 13/61 v.s. 7/61
• correct amino acids:
– 456/764 v.s. 232/764
• partially correct sequences with 5 or
more contiguous correct amino acids:
– 38/61 v.s. 24/61
PEAKS v.s. Sciex BioAnalyst
• completely correct sequences:
– 7/13 v.s. 1/13
• correct amino acids:
– 115/150 v.s. 86/150
• partially correct sequences with 5 or
more contiguous correct amino acids:
– 12/61 v.s. 7/61
Users
The company logos have been deleted from the original
presentation.
Please visit http://www.bioinformaticssolutions.com for a
list of users.
Other Techniques Used by PEAKS
• Preprocess the MS/MS spectra
– Deconvolution, noise reduction, and signal
enhancement.
– It does a better job than spectrometer vendor’s
software.
• Recalibration
– compress/stretch the spectrum for calibration error
• Positional Confidence
– Estimate the confidence level of individual amino acids.
Sophisticated Ion Matching Score
• Score of one peak matching b ion
log( h) e
error
tolerance
2
supp( b 17, b 18, a, c)
PEAKS 2.x’s Additional Feature
• Identify the proteins by matching the de
novo (partial) sequences.
• Then further match the spectra with the
peptides of the proteins.
Collaborators and References
• Sandwich algorithm:
– B. Ma, K. Zhang, C. Liang, CPM’03. (sandwich
algorithm)
• PEAKS:
– B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A.
Doherty-Kirby, G. Lajoie, Rapid Comm. Mass Spec.
(software feature, score function, experiments)
• Acknowledgement:
– PEAKS development team. (Bioinformatics Solutions
Inc.).