Protein Sequencing and Identification by Mass Spectrometry

Download Report

Transcript Protein Sequencing and Identification by Mass Spectrometry

An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Sequencing and
Identification by Mass
Spectrometry
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Outline
• Tandem Mass Spectrometry
• De Novo Peptide Sequencing
• Spectrum Graph
• Protein Identification via Database Search
• Identifying Post Translationally Modified Peptides
• Spectral Convolution
• Spectral Alignment
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Masses of Amino Acid Residues vary:
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Backbone
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
N-terminus
Ri-1
AA residuei-1
Ri
AA residuei
Ri+1
AA residuei+1
C-terminus
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Fragmentation
Collision Induced Dissociation
H+
H...-HN-CH-CO
Ri-1
Prefix Fragment
. . . NH-CH-CO-NH-CH-CO-…OH
Ri
Ri+1
Suffix Fragment
• Peptides tend to fragment along the backbone.
• Fragments can also lose neutral chemical groups
like NH3 and H2O.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Fragmentation
b2-H2O
a2
b3- NH3
b2
a3
b3
HO
NH3+
|
|
R1 O
R2 O
R3 O
R4
|
||
|
||
|
||
|
H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH
|
|
|
|
|
|
|
H
H
H
H
H
H
H
y3
y2
y3 -H2O
y1
y2 - NH3
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Breaking Protein into Peptides and
Peptides into Fragment Ions
• Proteases, e.g. trypsin, break protein into
peptides.
• A Tandem Mass Spectrometer further breaks
the peptides down into fragment ions and
measures the mass of each piece.
• the Mass Spectrometer accelerates the
fragmented ions; heavier ions accelerate slower
than lighter ones.
• Mass Spectrometer measures the mass/charge
ratio of an ion.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
given a little protein:
...we can fragment it into...
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Terminal peptides and ion types
Peptide
Mass (D)
Peptide
Mass (D)
57 + 97 + 147 + 114 = 415
without
57 + 97 + 147 + 114 – 18 = 397
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides by mass
486
71
415
301
185
154
332
57
429
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
486
71
415
301
185
154
332
57
429
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
486
71
415
oops. nothing left.
301
185
154
332
57
429
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
N- and C-terminal Peptides
486
71
415
301
Our goal is to reconstruct the peptide from the set of masses
of fragment ions
185
(the mass-spectrum)
154
332
57
429
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Mass Spectra
57 Da =K‘G’
D
D
V
99 Da = ‘V’
L
L
H2O
G
D
K
V
G
mass
0
• The peaks in the mass spectrum are any and all of:
• Prefix and Suffix Fragments
• Fragments with neutral losses (-H2O, -NH3)
• Noise and missing peaks
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Identification with MS/MS
G
V
D
K
Peptide
Identification
Intensity
MS/MS
L
mass
00
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass-Spectrometry
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Breaking Proteins into Peptides
MPSERGTDIMRPAKID......
protein
*High
Performance Liquid Chromatography
GTDIMR
PAKID
MPSER
……
……
peptides
HPLC*
To
MS/MS
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Mass Spectrometry
Matrix-Assisted Laser Desorption/Ionization (MALDI)
From lectures by Vineet Bafna (UCSD)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass Spectrometry
e
c
n
a
d
n
u
b
A
e
v
i
t
a
l
e
R
S#: 1707 RT: 54.44 AV: 1 NL: 2.41E7
F: + c Full ms [ 300.00 - 2000.00]
RT: 0.01 - 80.02
100
90
80
638.0
100
1389
1409
LC
NL:
1.52E8
Base Peak F: +
c Full ms [
300.00 2000.00]
1991
1615
2149
1621
1411
2147
1611
70
1387
60
50
1593
1995
1655
1435
1987
1445
1661
40
20
1095
1307 1313
1105
95
e
v
i
t
a
l
e
R
70
MS
90
85
80
75
65
60
55
801.0
50
2001 2177
45
40
1937
1779
30
2155
e
c
n
a
d
n
u
b
A
2205
2135
2017
35
Scan 1707
638.9
30
25
2207
2329
872.3
1275.3
15
1707
687.6
10
2331
10
1173.8
20
944.7
783.3
1048.3
1212.0
1413.9
1617.7
1400
1600
1742.1
1884.5
5
0
200
0
5
10
15
20
25
30
35
40 45
Time (min)
50
55
60
65
70
75
400
600
800
1000
m/z
1200
1800
2000
80
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
850.3
100
collision
MS-2
MS-1
cell
Ion
Source
e
c
n
a
d
n
u
b
A
95
e
v
i
t
a
l
e
R
70
687.3
90
85
588.1
80
75
MS/MS
65
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m/z
1200
Scan 1708
1400
1600
1800
2000
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Protein Identification by Tandem
Mass Spectrometry
S
e
q
u
e
n
c
e
MS/MS instrument
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
850.3
100
e
c
n
a
d
n
u
b
A
95
e
v
i
t
a
l
e
R
70
687.3
90
85
588.1
80
75
65
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
Database search
•Sequest
de Novo interpretation
•Sherenga
30
25
20
226.9
589.2
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m/z
1200
1400
1600
1800
2000
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full ms2 638.00 [ 165.00 - 1925.00]
850.3
100
e
c
n
a
d
n
u
b
A
95
e
v
i
t
a
l
e
R
70
687.3
90
85
588.1
80
75
65
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m/z
1200
1400
1600
1800
2000
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Tandem Mass Spectrum
• Tandem Mass Spectrometry (MS/MS): mainly
generates partial N- and C-terminal peptides
• Spectrum consists of different ion types because
peptides can be broken in several places.
• Chemical noise often complicates the spectrum.
• Represented in 2-D: mass/charge axis vs. intensity
axis (for the remaining computational problem we
will ignore charge z, which is generally 1 or 2)
An Introduction to Bioinformatics Algorithms
Theoretical Spectrum
www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectrum (cont’d)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Theoretical Spectrum (cont’d)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Building Spectrum Graph
• How to create vertices (from masses)
• How to create edges (from mass differences)
• How to score paths
• How to find best path
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
b
suppose we know the masses for the peptides
S E
Q U E
N
Mass/Charge (M/Z)
C E
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
and we know the masses for a set of ions of those peptides
SE
Q U
E
N
Mass/Charge (M/Z)
C
a
E
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
a is an ion type shift in b
S E
Q U E
Mass/Charge (M/Z)
N
C E
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
y
oh and we have the suffix masses too
E C
N
E
U Q
Mass/Charge (M/Z)
E S
An Introduction to Bioinformatics Algorithms
Intensity
here they all are!
Mass/Charge (M/Z)
www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Intensity
but we don't really know whether a given mass is from the N or C end...
Mass/Charge (M/Z)
An Introduction to Bioinformatics Algorithms
and there's probably some noise...
Mass/Charge (M/Z)
www.bioalgorithms.info
noise
An Introduction to Bioinformatics Algorithms
MS/MS Spectrum
Intensity
yuck.
Mass/Charge (M/z)
www.bioalgorithms.info
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Some Mass Differences between Peaks
Correspond to Amino Acids
u
q
s
e
s
e
e
c
e
u
q
e
n
n
q
u
e
n
c
c
e
e
s
e
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ion Types
• Some masses correspond to fragment
ions, others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} lets us
distinguish fragment ions from noise
• We can learn ion types δi and their
probabilities qi by analyzing a large test
sample of annotated spectra.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Example of Ion Type
• Ion types Δ={δ1, δ2,…, δk}
• Ion types
{b, b-NH3, b-H2O}
correspond to
Δ={0, 17, 18}
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Match between Spectra and the
Shared Peak Count
• The match between two spectra is the number of
masses (peaks) they share (Shared Peak Count
or SPC)
• In practice, mass-spectrometrists use the
weighted SPC that reflects intensities of the peaks
• The match between experimental and theoretical
spectra is defined similarly
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an
experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
• m: parent mass
Output:
• P: peptide with mass m, whose theoretical spectrum
matches the experimental S spectrum the best
We will solve this by converting the input into a
spectrum graph.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Vertices of Spectrum Graph
• We have masses of potential N-terminal peptides
• Vertices are also generated by reverse shifts corresponding to ion
types
Δ={δ1, δ2,…, δk}
• Every N-terminal peptide can generate up to k ions
m-δ1, m-δ2, …, m-δk
• Every mass s in an MS/MS spectrum generates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
corresponding to potential N-terminal peptides
• Vertices of the spectrum graph:
{initial vertex}  V(s1)  V(s2)  ...  V(sm)  {terminal vertex}
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Edges of Spectrum Graph
• If we have two vertices with mass difference
corresponding to an amino acid A,
• Connect with an edge labeled by A
• We will insert "gap" edges for di- and tripeptides (c.f. sequence alignment)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Paths
• Paths in the labeled graph spell out amino
acid sequences
• There are many paths, so how shall we find
the correct one?
• We need some scoring to evaluate paths
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Path Score
• Let p(P, S) be the probability that peptide P
produces spectrum S = {s1,s2,…sq}
• where p(P, s) = the probability that peptide P
generates a peak s
• therefore Scoring = computing probabilities
p(P,S)   pP,s
sS
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peak Score
• For a position t that represents ion type dj :
qj, if peak is generated at t
p(P,st) =
1-qj , otherwise
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peak Score (cont’d)
• For a position t that is not associated with an
ion type:
qR, if peak is generated at t
pR(P,st) =
1-qR , otherwise
• qR = the probability of a "noisy peak" that
does not correspond to any ion type

An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a
peptide P’ maximizing p(P,S) over all
possible peptides P:
p(P,S)  max P p(P,S)
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ions and Probabilities
• Tandem mass spectrometry is characterized
by a set of ion types {δ1,δ2,..,δk} and their
probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced
independently with probabilities qi
(remember that the qi are estimated
externally)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ions and Probabilities
• A peptide has all k peaks with probability
k
q
i1
k
• and no peaks with probability  (1 qi )
i1

• A peptide also produces a "random noise''

with uniform probability qR in any position.
i
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions
and penalties for missing ions.
• Example: for k=4, assume that for a partial
peptide P’ we only see ions δ1,δ2,δ4.
The score is calculated as:
q1 q2 (1 q3 ) q4
 

qR qR (1 qR ) qR
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Scoring Peptides
• Given T set of all positions.
• Ti ={tδ1,, tδ2,..., ,tδk,} - the set of positions that
represent ions of partial peptides Pi.
• A peak at position tδj is generated with
probability qj.
• R=T- U Ti - set of positions that are not
associated with any partial peptides (noise).

An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Probabilistic Model
• For a position t δj  Ti let p(t, P,S) be the probability
that peptide P produces a peak at position t.
 q j
p(t,P,S)  
1 q j
if a peak is generated at position t
j
otherwise
• Similarly, for tR, the probability that P produces a
random noise peak at t is:
 qR
pR (t)  
1 qR
if a peak is generated at position t
otherwise
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Probabilistic Score
• For a peptide P with n amino acids, the score
for the whole peptides is expressed by the
following ratio test:
n
k
p(t i j ,P,S)
p(P,S)
 
pR (S)
pR (t i j )
i1 j1

An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo vs. Database Search
S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6
T: + c d Full m s 2 638.00 [ 165.00 - 1925.00]
850.3
100
Database
Search
e
c
n
a
d
n
u
b
A
e
v
i
t
a
l
e
R
95
687.3
90
85
588.1
80
De Novo
75
70
65
60
55
851.4
425.0
50
45
949.4
40
326.0
35
524.9
30
25
20
589.2
226.9
1048.6
1049.6
397.1
489.1
15
10
629.0
5
0
200
400
600
800
1000
m /z
1200
1400
1600
1800
2000
W
Database of
known peptides
R
V
A
A
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT,
ALKIIMNVRT,AVGELTK
AVGELTK, ,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
C
G
G
L
P
L
L
E
K
K
W
D
T
AVGELTK
T
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• de novo algorithms are much faster, even though their
search space is much larger.
• A database search scans all peptides in the search
space to find best one.
• De novo eliminates the need to scan all peptides by
modeling the problem as a graph search.
Why not sequence de novo?
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Why Not Sequence De Novo?
• De novo sequencing is still not very accurate.
Algorithm
Lutefisk
(Taylor and Johnson, 1997).
SHERENGA (Dancik et. al., 1999).
Peaks
(Ma et al., 2003).
PepNovo (Frank and Pevzner, 2005).
Amino
Acid
Accuracy
Whole Peptide
Accuracy
0.566
0.189
0.690
0.673
0.727
0.289
0.246
0.296
• Less than 30% of the peptides sequenced
were completely correct!
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Pros and Cons of de novo Sequencing
• Advantage:
• Gets the sequences that are not necessarily in the
database.
• An additional similarity search step using these sequences
may identify the related proteins in the database.
• Disadvantage:
• Requires higher quality data.
• Often contains errors.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequencing Problem revisited:
• Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
• Input:
• S: experimental spectrum
• Δ: set of possible ion types
• m: parent mass
• Output:
• A peptide with mass m, whose theoretical spectrum
matches the experimental S spectrum the best
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Sequencing Problem revisited:
• Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
• Input:
•
•
•
•
S: experimental spectrum
database of peptides
Δ: set of possible ion types
m: parent mass
• Output:
• A peptide of mass m from the database whose theoretical
spectrum matches the experimental S spectrum the best
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
MS/MS Database Search
Database search in mass-spectrometry has been very
successful in identification of already known proteins.
Experimental spectrum can be compared with theoretical
spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable algorithms for identification of modified
peptides is a much more difficult problem.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
The dynamic nature of the proteome
•
•
•
•
•
The proteome of the cell
is changing
Various extra-cellular,
and other signals
activate pathways of
proteins.
A key mechanism of
protein activation is
post-translational
modification (PTM)
These pathways may
lead to other genes
being switched on or off
Mass spectrometry is
key to probing the
proteome and detecting
PTMs
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Almost all protein sequences are posttranslationally modified and 200 types of
modifications of amino acid residues are
known.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Search for Modified Peptides:
Virtual Database Approach
Yates et al.,1995: an exhaustive search in a
virtual database of all modified peptides.
Exhaustive search leads to a large combinatorial
problem, even for a small set of modifications
types.
Problem (Yates et al.,1995). Extend the virtual
database approach to a large set of
modifications.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Exhaustive Search for modified peptides.
•
YFDSTDYNMAK
Oxidation?
• For each peptide,
generate all
modifications.
• Score each
modification.
Phosphorylation?
• 25=32 possibilities, with 2 types
of modifications!
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Identification Problem Revisited
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• m: parent mass
Output:
• A peptide of mass m from the database whose
theoretical spectrum matches the
experimental S spectrum the best
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• m: parent mass
• Parameter k (# of mutations/modifications)
Output:
• A peptide of mass m that is at most k
mutations/modifications apart from a database peptide
and whose theoretical spectrum matches the
experimental S spectrum the best
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Database Search:
Sequence Analysis vs. MS/MS Analysis
Sequence analysis:
similar peptides (that are a few mutations apart) have similar
sequences
MS/MS analysis:
similar peptides (that are a few mutations apart) have dissimilar
spectra
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Deficiency of the Shared Peaks Count
Shared Peaks Count (SPC): intuitive measure of
spectral similarity.
Problem: SPC diminishes very quickly as the number
of mutations increases.
Only a small portion of correlations between the
spectra of mutated peptides is captured by SPC.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
SPC Diminishes Quickly
no mutations
SPC=10
1 mutation
SPC=5
2 mutations
SPC=2
S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Convolution
• the spectral convolution is defined by
S1  S2  s2  s1 : s1  S1,s2  S2
• the number of pairs (s1, s2) with s1 - s2 = x is
(S1  S2 )(x)
 • the Shared Peaks Count (SPC) is therefore
(S1  S2 )(0)


An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Elements of S2 * S1 represented as elements of a difference matrix. The
elements with multiplicity >2 are colored; the elements with multiplicity 2
are circled. The SPC takes into account only the red entries
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Convolution: An Example
5
4
3
2
1
0
-150
150
-100
-50
0
x
50
100
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Comparison: Difficult Case
S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Which of the spectra
S’ = {10, 20, 30, 40, 50, 55, 65, 75,85, 95}
or
S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}
fits the spectrum S the best?
SPC: both S’ and S” have 5 peaks in common with S.
Spectral Convolution: reveals the peaks at 0 and 5.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Comparison: Difficult Case
S * S’
S * S’’
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Limitations of the Spectrum Convolutions
Spectral convolution does not reveal that
spectra S and S’ are similar, while spectra S
and S” are not.
Clumps of shared peaks: the matching
positions in S’ come in clumps while the
matching positions in S” don't.
This important property was not captured by
spectral convolution.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Shifts
A = {a1 < … < an} : an ordered set of natural
numbers.
A shift (i,) is characterized by two parameters,
the position (i) and the length ().
The shift (i,) transforms
{a1, …., an}
into
{a1, ….,ai-1,ai+,…,an+  }
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Shifts: An Example
The shift (i,) transforms
{a1, …., an}
into
{a1, ….,ai-1,ai+,…,an+  }
e.g.
10 20 30 40 50 60 70 80 90
shift (4, -5)
10 20 30 35 45 55 65 75 85
shift (7,-3)
10 20 30 35 45 55 62 72 82
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment Problem
• Find a series of k shifts that make the sets
A ={a1, …., an} and B ={b1,…., bn}
as similar as possible.
• This leads to the concept of k-similarity
between sets
• D(k) - the maximum number of elements in
common between sets after k shifts.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Representing Spectra in 0-1 Alphabet
• Convert spectrum to a 0-1 string with 1s
corresponding to the positions of the peaks.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Comparing Spectra=Comparing 0-1 Strings
• A modification with positive offset corresponds to
inserting a block of 0s
• A modification with negative offset corresponds to
deleting a block of 0s
• Comparison of theoretical and experimental spectra
(represented as 0-1 strings) corresponds to a
(somewhat unusual) edit distance/alignment
problem where elementary edit operations are
insertions/deletions of blocks of 0s
• Use sequence alignment algorithms!
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment vs. Sequence Alignment
• Manhattan-like graph with different alphabet
and scoring.
• Movement can be diagonal (matching
masses) or horizontal/vertical
(insertions/deletions corresponding to PTMs).
• At most k horizontal/vertical moves.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Product
A ={a1, …., an} and B ={b1,…., bn}
Spectral product AB: two-dimensional matrix with
nm 1s corresponding to all pairs of10 20 30 40 50 55 65 75 85 95
indices (ai,bj) and remaining
elements being 0s.
SPC: the number of 1s at
the main diagonal.
-shifted SPC: the number
of 1s on the diagonal (i,i+ )
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1 1
1
1
1
1

An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment: k-similarity
k-similarity between spectra: the maximum number
of 1s on a path through this graph that uses at most
(k+1) diagonals.
k-optimal spectral
alignment = a path.
The spectral alignment
allows one to detect
more and more subtle
similarities between
spectra by increasing k.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Use of k-Similarity
SPC reveals only
D(0)=3 matching
peaks.
Spectral
Alignment reveals
more hidden
similarities
between spectra:
D(1)=5 and D(2)=8
– and detects
corresponding
mutations.
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Black line represent the path for k=0
Red lines represent the path for k=1
Blue lines (right) represents the path for k=2
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Convolution’s Limitation
The spectral convolution considers diagonals
separately without combining them into feasible
mutation scenarios.
10 20 30 40 50 55 65 75 85 95
10 15 30 35
10
10
20
20
30
30
40
40
50

50
60
60
70
70
80
80
90
90
100
100
D(1) =10
shift function score = 10
50 55
70 75 90 95

D(1) =6
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Dynamic Programming for
Spectral Alignment
Dij(k): the maximum number of 1s on a path to
(ai,bj) that uses at most k+1 diagonals.
Di ' j ' (k )  1, if (i ' , j ' ) ~ (i, j )
Dij (k )  max {
(i ', j ')(i , j ) Di ' j ' ( k  1)  1, otherwise
D (k )  max Dij ( k )
ij
Running time: O(n4 k)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Edit Graph for Fast Spectral Alignment
diag(i,j) – the position
of previous 1 on the
same diagonal as (i,j)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Fast Spectral Alignment Algorithm
M ij (k ) 
max Di ' j ' (k )
(i ', j ')(i , j )
 Ddiag(i , j ) (k )  1
Dij (k )  max 
M i 1, j 1 (k  1)  1
 Dij (k )

M ij (k )  max M i 1, j (k )
M
 i , j 1 (k )
Running time: O(n2 k)
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
Spectral Alignment: Complications
Spectra are combinations of an increasing (Nterminal ions) and a decreasing (C-terminal
ions) number series.
These series form two diagonals in the
spectral product, the main diagonal and the
perpendicular diagonal.
The described algorithm deals with the main
diagonal only.