Transcript Slide 1
Error tolerant search
• A large number of spectra remain without a significant
score; a considerable number of their fragment ion peaks
may not have matched. Possible reasons:
– Underestimated mass measurement error (should be seen
in the peptide view graphs),
– Incorrect determination of the precursor charge state,
– The peptide sequence is not in the database,
– Missed cleavage & unexpected cleavage,
– Unexpected chemical & post-translational modification.
• The biological structure, function and activity of a protein can be
determined by the modifications of that protein.
• An increasing share of the proteins that have been mapped to e.g.
different diseases change not only in expression level but also, or
exclusively, in their level of post-translational modification.
1
Post-Translational Modifications (PTMs)
• A PTM alters the mass of the modified amino acid and of the
peptide, which results in peak shifts in the spectrum:
[Figure: b- and y-ion fragmentation ladder of the peptide H Q S V M V G M V Q K (b1 to b10, y1 to y10) above the corresponding spectrum; x-axis: m/z, with ticks at 200, 400 and 1000.]
b1: H / y10: QSVMVGMVQK
b2: HQ / y9: SVMVGMVQK
b3: HQS / y8: VMVGMVQK
b4: HQSV / y7: MVGMVQK
b5: HQSVM / y6: VGMVQK
…
b10: HQSVMVGMVQ / y1: K
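The peak shift can be made concrete with a small sketch. It computes singly charged b- and y-ion m/z values for HQSVMVGMVQK from standard monoisotopic residue masses, once unmodified and once with an oxidized first methionine. The function name `fragment_ions`, its `mods` argument and the choice of oxidation as the example PTM are assumptions made for illustration, not part of the slides.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE_MASS = {
    "H": 137.05891, "Q": 128.05858, "S": 87.03203, "V": 99.06841,
    "M": 131.04049, "G": 57.02146, "K": 128.09496,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide, mods=None):
    """Return singly charged b- and y-ion m/z values, keyed by ion number.

    mods maps a 0-based residue index to a mass delta in Da (hypothetical
    helper argument, used here to place a PTM on one residue).
    """
    mods = mods or {}
    masses = [RESIDUE_MASS[aa] + mods.get(i, 0.0) for i, aa in enumerate(peptide)]
    n = len(masses)
    b = {k: sum(masses[:k]) + PROTON for k in range(1, n)}
    y = {k: sum(masses[n - k:]) + WATER + PROTON for k in range(1, n)}
    return b, y


peptide = "HQSVMVGMVQK"
OXIDATION = 15.994915  # assumed example PTM: oxidation of methionine

b_plain, y_plain = fragment_ions(peptide)
b_mod, y_mod = fragment_ions(peptide, {4: OXIDATION})  # first M, index 4

# Print unmodified -> modified m/z for each fragment ion.
for k in range(1, len(peptide)):
    print(f"b{k}: {b_plain[k]:9.4f} -> {b_mod[k]:9.4f}   "
          f"y{k}: {y_plain[k]:9.4f} -> {y_mod[k]:9.4f}")
```

Only the fragments that contain the modified residue move: b5 to b10 and y7 to y10 shift by the modification mass, while b1 to b4 and y1 to y6 stay where they were.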
2
PTMs
• Complete modifications (chemical
modifications)
• Variable modifications
3
PTMs
• Obstacles
– Complexity (means longer execution time)
• Can increase the search space by a factor of 1, 10, ..., 10,000
– Significance
4
Obstacles - Complexity
• Let the theoretical peptide be:
– HQSVMVGMVQK (11 amino acids)
– Each amino acid can be modified by, let’s say, 5 PTMs
# included PTMs   # modified theoretical spectra       time
0                 1                                    1 sec
1                 11*5 = 55                            55 seconds (about 1 min)
2                 55*25 = 1375                         about 23 mins
3                 11*15*125 = 20625                    5.7 hours
10                11*5^10 = 11*9765625 = 107421875     29839 hours (about 3.4 years)
In general:
Peptide length = L
Included PTMs = K
PTMs/aa = M
# modified theoretical spectra = $\binom{L}{K} M^K$
...
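The growth can be checked with a few lines of Python. This is a minimal sketch of the general count, choosing K of the L positions and one of M PTMs for each chosen position, and it assumes one second per theoretical spectrum as in the first row. The function name is made up for this example.

```python
from math import comb


def modified_spectra_count(length, included_ptms, ptms_per_residue):
    """Number of modified theoretical spectra: C(L, K) * M**K."""
    return comb(length, included_ptms) * ptms_per_residue ** included_ptms


SECONDS_PER_SPECTRUM = 1  # assumption taken from the 0-PTM row

for k in (0, 1, 2, 3, 10):
    n = modified_spectra_count(11, k, 5)
    hours = n * SECONDS_PER_SPECTRUM / 3600
    print(f"K={k:2d}  spectra={n:10d}  time={hours:10.2f} hours")
```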
5
– Inserting many PTMs makes the theoretical spectra
too flexible, so that in the end every theoretical spectrum
can be aligned to the experimental spectra.
6
Significance
• Increases the random matches
– Inserting many PTMs makes the theoretical spectra
too flexible, so that in the end every theoretical spectrum
can be aligned to the experimental spectra.
[Figure: frequency versus score for the probability distribution of random scores and the probability distribution of correct scores, with threshold T, regions A and B, and the p-value of hit h marked.]
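The effect on significance can be sketched as follows. The first function estimates the p-value of a hit from a sample of random scores. The second gives the chance that at least one of N random candidates scores as high as the hit, assuming the candidates are independent. Both function names and the independence assumption are illustrative choices, not something the slides specify.

```python
def empirical_p_value(hit_score, random_scores):
    """Estimate P(random score >= hit_score) from a sample of random scores.

    One is added to numerator and denominator so the estimate is never zero.
    """
    at_least = sum(1 for s in random_scores if s >= hit_score)
    return (at_least + 1) / (len(random_scores) + 1)


def chance_of_random_hit(p_value, n_candidates):
    """Probability that at least one of n independent random candidates
    scores as high as the hit."""
    return 1.0 - (1.0 - p_value) ** n_candidates


# The same per-candidate p-value becomes much less convincing as the
# number of candidate (modified) theoretical spectra grows.
for n in (1, 55, 20625):
    print(n, chance_of_random_hit(1e-4, n))
```

With a fixed per-candidate p-value, enlarging the search space raises the chance that some random candidate reaches the same score, which is why more included PTMs mean more random matches.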
7
Computational Identification of PTMs
• 3 approaches:
– Targeted,
– Untargeted, also called restricted,
– Unrestricted (de novo, blind search).
8
Targeted approach
• Almost all search engines support it.
– The experimenter needs to guess the PTMs in the
sample.
• Two-pass strategy
– Two rounds, refinement on a smaller
– Sequest, Mascot
9
Targeted approach – X!Tandem
10
Targeted approach – InsPecT
11
Untargeted approaches
• Uses a large list of modifications from PTM databases
– The search space is finite but can be very large.
– If we allow 5 of the 10 most frequent
modifications to occur in a peptide at the same
time, the search space grows by 3 orders of
magnitude.
– The growth is more dramatic if, instead of 10 types
of modifications, we wish to consider all of the roughly
500 known types.
12
Database of PTMs
• Unimod
– http://www.unimod.org
– Contains 906 modifications
• Resid
– http://www.ebi.ac.uk/RESID
– 559 Entries
13
Untargeted
• PILOT_PTM
– Uses a large dataset of modifications.
– Binary Linear programming.
• Objective function is the number of the matched peaks
• Linear constraint functions guarantee meaningful
modifications of the peptide.
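A toy version of this idea is sketched below. The slides do not give PILOT_PTM's actual formulation, so everything here is an assumption: the two-entry modification list, the peptide "SVM", the tolerance, and the use of exhaustive enumeration in place of a real binary linear programming solver. Each residue carries one yes/no decision per allowed modification, the constraints keep the assignment meaningful, and the objective is the number of matched peaks.

```python
from itertools import product

RESIDUE_MASS = {"S": 87.03203, "V": 99.06841, "M": 131.04049}
WATER, PROTON = 18.010565, 1.007276
# Hypothetical, tiny modification list: name -> (target residue, mass delta).
MODS = {"oxidation": ("M", 15.994915), "phospho": ("S", 79.966331)}


def ions(peptide, deltas):
    """Singly charged b- and y-ion m/z values for per-residue mass deltas."""
    masses = [RESIDUE_MASS[aa] + d for aa, d in zip(peptide, deltas)]
    n = len(masses)
    b = [sum(masses[:k]) + PROTON for k in range(1, n + 1)]
    y = [sum(masses[n - k:]) + WATER + PROTON for k in range(1, n + 1)]
    return b + y


def matched_peaks(theoretical, observed, tol=0.01):
    """Objective: number of observed peaks explained by a theoretical ion."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)


def best_assignment(peptide, observed, max_mods=2):
    # Each residue gets either no modification or one that targets it; this
    # encodes the constraints "at most one modification per residue" and
    # "a modification only sits on a residue it can modify".
    options = [[None] + [m for m, (aa, _) in MODS.items() if aa == res]
               for res in peptide]
    best = (-1, None)
    for choice in product(*options):
        if sum(c is not None for c in choice) > max_mods:
            continue  # constraint: limited number of modifications
        deltas = [MODS[c][1] if c else 0.0 for c in choice]
        score = matched_peaks(ions(peptide, deltas), observed)
        if score > best[0]:
            best = (score, choice)
    return best


# Simulated observation: "SVM" with an oxidized methionine.
observed = ions("SVM", [0.0, 0.0, 15.994915])
print(best_assignment("SVM", observed))
```

Enumeration only works for very small cases; a solver is needed once the modification list and the peptide grow.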
14
Unrestricted
• No a priori information about PTMs.
• De novo identification of PTMs.
• Search space is infinite.
• In practice no more than one or two PTMs can
be identified on the same peptide.
15
TwinPeaks approach
• Based on the Sequest idea.
• Shifts the experimental spectra over a range,
and plots the similarity score as a function of
the mass shift.
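The shifting step can be sketched with toy data. The score used here is the sum of matched intensity, the quantity plotted on the next slide; the peak lists, the integer shift grid, the tolerance and the function names are all assumptions for illustration and not the TwinPeaks implementation.

```python
def matched_intensity(theoretical_mz, observed_peaks, shift, tol=0.5):
    """Sum the intensity of observed peaks that land on a theoretical peak
    after the observed spectrum is moved down by `shift` Da."""
    total = 0.0
    for mz, intensity in observed_peaks:
        if any(abs((mz - shift) - t) <= tol for t in theoretical_mz):
            total += intensity
    return total


def shift_profile(theoretical_mz, observed_peaks, shifts):
    """Similarity score as a function of the mass shift."""
    return {s: matched_intensity(theoretical_mz, observed_peaks, s)
            for s in shifts}


# Toy spectra: two observed peaks are unshifted, two are shifted by +16.
theoretical = [100.0, 200.0, 300.0, 400.0]
observed = [(100.0, 1.0), (200.0, 2.0), (316.0, 3.0), (416.0, 4.0)]

profile = shift_profile(theoretical, observed, range(-50, 51))
for shift, score in profile.items():
    if score > 0:
        print(shift, score)
```

In this toy example the profile is non-zero at two shifts only: 0, from the unmodified fragments, and 16, from the fragments that carry the modification.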
16
TwinPeaks approach
[Figure: plot with y-axis "Sum of matched intensity".]
17
MS-Alignment
• Based on the alignment of the theoretical
spectra to the experimental spectra
18
[Figure: experimental spectrum and theoretical spectrum.]
19
MS-alignment
20
Comparison of targeted and unrestricted results
X!Tandem (targeted)
Scan ID   log(E)   Peptide
3.1.1     -13.8    fqyr295 ILTAAALCHF TSIEVVK 311kasg (130)
6.1.1     -6.6     rihr159 FVEKPQVFVS NK 170inag
11.1.1    -3.4     rtcr30 SPEPGPSSSI GSPQASSPPR PN 51hyll
12.1.1    -4.0     dvtr473 TMHFGTPTAY EK 484ecft
13.1.1    -10.0    ietk133 FFDDDLLVST SR 144vrlf
24.1.1    -4.2     pskr237 QTNGCLNGYT PSR 249krqa
25.1.1    -2.5     ntpr149 KNGGLGHMNI ALLSDLTK 166qisr
27.1.1    -7.4     pqgr19 IHQIEYAMEA VK 30qgsa
31.1.1    -2.0     kefk80 DREDLVPYTG EK 91rgkv
34.1.1    -1.6     dyhr131 YLAEFATGND R 141keaa (9406)
35.1.1    -7.0     grar16 QYTSPEEIDA QLQAEK 31qkar (2754)
36.1.1    -2.0     rlar172 QDPQLHPEDP ER 183raai (644)
37.1.1    -8.1     iflh92 ISDVEGEYVP VEGDEVTYK 110mcsi (73)
38.1.1    -3.9     mrsr328 TASGSSVTSL DGTR 341srsh (2698)
40.1.1    -3.7     lgnk29 YVQLNVGGSL YYTTVR 44altr (71)
42.1.1    -1.9     dlqk183 EGEFSTCFTE LQR 195dflk (239)
45.1.1    -2.9     pkek135 QPVAGSEGAQ YR 146kkql (694)
46.1.1    -10.3    lsar446 ASNAWILQQH IATVPSLTHL CR 467leir
53.1.1    -6.8     evyr175 NSMPASSFQQ QK 186lrvc
57.1.1    -4.7     iygk81 QFEDELHPDL K 91ftga
Parenthesized values of the remaining X!Tandem rows (row assignment not recoverable): (471), (306), (176), (112), (10317), (137), (48), (7099), (491), (1776), (107)

MS-Alignment (unrestricted, de novo)
Scan ID   P-value       Peptide
3         1.00E-05      R.ILTAAALCHFTSIEVVK.K
6         1.00E-05      R.FVEKPQVFVSNK.I
13        1.00E-05      K.FFDDDLLVSTSR.V
27        1.00E-05      R.IHQIEYAMEAVK.Q
47        0.028806584   A.V+172LTAFANGR.S
57        1.00E-05      K.QFEDELHPDLK.F
58        0.004739336   R.ETFY+18LAQDFFDR.F
59        1.00E-05      R.TCLSQLLDIMK.S
71        1.00E-05      K.EYFSTFGEVLM+16VQVK.K
75        1.00E-05      K.QH-18LENDPGSNEDTDIPK.G
97        0.004672897   Q.L+128GVSHVFEYIR.S
98        0.004830918   C.T+160EDMTEDELR.E
99        1.00E-05      R.EFFD-18SNGNFLYR.I
100       1.00E-05      R.LVLESPAPVEVNLK.L
105       1.00E-05      K.LQEFAYVTDGAC+14SEEDILR.M
108       1.00E-05      K.SFDENGFDYLLTYSDNPQTVFP+156.R
115       1.00E-05      R.GPATVEDLPSAFEEK.A
119       1.00E-05      Y.ITD+163VLTEEDALEILQK.G
147       1.00E-05      R.IYSYQMALTPVVVTLWYR.A
21
Validate your results
22
Summary
• What you should remember:
– PTM identification is computationally expensive
– 3 approaches (targeted, untargeted, unrestricted)
– Always examine the results and omit implausible PTMs
– Including more PTMs decreases the statistical significance
– The more you look for, the less you get (due
to significance)
23