Dynamic programming based on multiple tags

Download Report

Transcript Dynamic programming based on multiple tags

2015. 05. 01
이원엽
Contents
I. Introduction
II. Algorithm
III. Experimental procedure
IV. Results
2
Introduction
• Tandem mass spectrometry(MS/MS)
• A powerful tool for rapid identification of post-translational
modifications (PTMs)
• Accurate computational identification of modified peptides
• Remains a difficult problem addressed with restrictive approaches
• Require ‘guessed’ lists of possible PTMs to be provided in advance.
3
Introduction
• Tandem mass spectrometry(MS/MS)
• A powerful tool for rapid identification of PTMs
• Accurate computational identification of modified peptides
• Remains a difficult problem addressed with restrictive approaches
• Require ‘guessed’ lists of possible PTMs to be provided in advance.
• Unrestrictive or blind approaches
• Search MS/MS spectra for all known and even possibly unknown types of PTMs
• Derive the list of modifications directly from MS/MS data
4
Introduction
• Unrestrictive search
• Improve sensitivity
• In the detection of peptides with unexpected modifications
• But, face serious challenges
• Increased run time
• Decreased identification of unmodified peptides
5
Introduction
• Unrestrictive searches’s challenge
• The first challenge is the increase in
• False positives (incorrect identifications)
• False negatives (missed identifications of peptides)
• Maintaining a fixed 1% FDR
• Requires a much higher score threshold for peptide identification
• Conversely results in a large number of false negatives.
6
Introduction
• Unrestrictive searches’s challenge
• The second challenge
• Is the number of modified sites allowed per peptide
• A critical analytical capability
• The modified site can be determined
• In time proportional to the length of P
• For multiple modifications
• The time complexity grows exponentially.
AALF
+42
A+42ALF
AA+42LF
AAL+42F
AALF+42
7
Introduction
• MODa (MODification via alignment)
• Enabling fast ‘multi-blind’ unrestrictive PTM search
• With an order of magnitude speedup
• No limitation on the number of variable modifications per peptide.
8
Introduction
• Multiple sequence tags
• Advantages in identifying modifications
1.
2.
Dramatically reduce the number of database peptides matched to
each spectrum
Effectively localize modified regions within a spectrum
9
Introduction
• Multiple sequence tags
• Several different Δ masses are discovered from multiple
matched tags
 multiple modifications
• Dynamic programming to identify triply modified peptide and its
MS/MS spectrum of ‘A+42ALFC+48LESAW+16K’,
10
Algorithms
• The MODa algorithms
• Sequence tags
• Select candidate peptides
• Dynamic programming algorithm
• Find an optimal spectrum/peptide
11
Algorithms
• MODa score and probability computation
• To compute the score
Experimental
spectrum S
converted
Prefix residue mass
(PRM) spectrum
1. S is separated into windows of 100 m/z units.
2. Top 10 peaks are retained
• according to their intensity
3. Each peak is given a weight
• according to its ranking
4. For every retained peak’s mass m
• nodes of masses (m-1) and (PrecursorMass(S)-(m-1)) are added to the
PRM spectrum
12
Algorithms
• MODa score and probability computation
• To compute the score
• The PRM node’s score
• the sum of weights of expected ion peaks from the PRM
• b, b-H2O, b-NH3, y, y-H2O, and y-NH3 and isotopes
13
Algorithms
• MODa score and probability computation
• The PRM-score
• To rank the candidate identifications
• From a single spectrum during dynamic programming
• Not sufficient to assess the quality of the top-scoring match
14
Algorithms
• MODa score and probability computation
• Probability
• that the top identification is correct
• Various properties
• represent the quality of match between the peptide and the MS/MS
spectrum
15
Algorithms
• MODa score and probability computation
• Probability
• that the top identification is correct
• Various properties
1) PRM-score
2) mass errors of matched fragment ions
3) the fractions of b and y ions found
4) the propensity to a particular ion type – tryptic peptide features
a stronger y-ion ladder than b-ion ladder
• The four components are combined by a logistic regression
• The weights were trained and validated
16
Algorithms
• Dynamic programming based on multiple tags
• Nodes: scores
• Arrows: possible jumps
• Score of path: the sum of node scores on the path
17
Algorithms
• Dynamic programming based on multiple tags
Derived from the spectrum
Special tags of length zero
• To identify triply modified peptide
‘A+42ALFC+48LESAW+16K’
18
Algorithms
• Dynamic programming based on multiple tags
Peptide
• To identify triply modified peptide
‘A+42ALFC+48LESAW+16K’
19
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• Maximum score path on the diagonal defined by a tag t
• p : position
• t : tag
• A tag t of length n is defined by n+1 masses
• start(t) = 0 and end(t) = 1 mean that only a1 is matched
Database peptide
Tag
Spectrum
20
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• Maximum score path on the diagonal defined by a tag t
• p : position
• t : tag
• A tag t of length n is defined by n+1 masses
• start(t) = 0 and end(t) = 1 mean that only a1 is matched
a
a2 a3 4 a5 a
6
a1
a2a3a4 = AKM
a1 = 0~1
A K M
0
1
2
3
4
5
6
21
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• s : whether the position on the diagonal
• before the tag: 0
• after the tag: 1
• The main goal of the third dimension(M[p][t][s])
• Avoid having to decompose each tag into multiple “subtags”.
22
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• Initialized to zero
• Defined recursively as score(p,t) + the maximum of jump
1. Amino acid jump
• within same row
2.
Modification jump
• to other row
23
Algorithms
• Dynamic programming based on multiple tags
24
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• defined recursively as score(p,t) + the maximum of jump
1. Amino acid jump
• before the tag : M [p-1][t][0]
• inside the tag : max(M[p-1][t][0], M[p-1][t][1])
• after the tag : M [p-1][t][1]
A+42ALFC+48LESAW+16K
25
Algorithms
• Dynamic programming based on multiple tags
• M[p][t][s]
• defined recursively as score(p,t) + the maximum of jump
2. Modification jump
• M[p-1][q][1] + pf(Δ, ap)
• iff s=0, p<end(t), over all tags q such that start(q)<start(t)
• pf(Δ, ap) ≤ 0 is a penalty function for a modification Δ on the amino
acid aa.
• In this work, we considered pf(Δ, ap) as a constant C for all the
modifications.
26
Algorithms
• Dynamic programming based on multiple tags
A+42ALFC+48LESAW+16K
27
Algorithms
• Dynamic programming based on multiple tags
• Time and space complexity
• O(max(|S|, |T|2 * |P|))
• S : MS/MS spectrum
• P : a peptide selected from the database(P = a1…an)
• T : a set of tags matched to P
• scan the spectrum once at the start to compute every score(p, t)
28
Algorithms
• The MODa algorithms
• The most prominent difference
• Use of parameter k(the number of modification)
• MODa determines k automatically
• by using dynamic programming
• MODa is 40 times faster than MS-Alignment
29
Experimental Procedures
• Three proteomics data sets
Human plasma(A)
HEK 293 cell line(B)
Human lens(C)
Unmodified peptide
/one modified peptide
large scale
complex mixture
multi-modified
peptide
30
Experimental Procedures
• Human plasma
• Consists of 67,648 MS/MS spectra
• To demonstrate the performance of MODa
• Against the other engines SEQUEST, Mascot, InsPecT,
and MS-Alignment
• The HEK293 data set
• Consisits of 363,807 MS/MS spectra
• Represents an analysis of a large scale complex mixture
• From whole cell lysate
31
Experimental Procedures
• Human lens
• Consists of 381,224 MS/MS spectra
• 3-day and 2-, 18-, 35-, and 70-yearold normal lens and 70- and 93year-old cataract lens
• Many lens proteins do not turn over and tend to become
substantially modified over time
• PTM identification for this sample has been extensively studied by
others
• Their results were compared with the results from MODa anlaysis.
32
Experimental Procedures
• MODa Parameters
•
•
•
•
•
peptide ions: ±2.5 Da mass tolerances
fragment ions : ± 0.5 Da mass tolerances
no enzyme specificity
200 Da for modification mass size
‘multi-mod’ mode
(allowing arbitrary number of modifications per peptide)
33
Experimental Procedures
• Database
• IPI human database + the shuffled sequences
• mutDB: mutated sequences + their shuffled sequences
• Peptide identification
• FDR 1% using target-decoy approach
• FDR = (2×D)/(T+D)
• T : the number of target hits above score threshold
• D : the number of decoy hits above score threshold
34
Experimental Procedures
• Established restrictive/unrestrictive searches for
high-throughput complex mixture data
• To test MODa performance in various aspects.
• Data : human plasma data (A)
• Database
• IPI human database + the shuffled sequences
35
Experimental Procedures
• Established restrictive/unrestrictive searches for
high-throughput complex mixture data
• Compare to other tools
• Restrictive search tools
• SEQUEST
• InsPecT
• Unrestrictive search tool
• MS-Alignment
• InsPecT with blind option allowing one modification per
peptide
36
Experimental Procedures
• Unrestrictive searches for PTM-rich data
• Data
• The human ocular lens tissue(C)
• Consists of a small number of crystallin proteins
37
Experimental Procedures
• Simulation test for modified peptides
• To evaluate how sensitive MODa
• modified region in an MS/MS spectrum
• Data
• 2,423 peptide sequences from 14,623 SEQUEST PSMs
• Identified in human plasma data
• about 50% of sequences were mutated
• by changing one residue of sequences to a random residue
38
Experimental Procedures
• Simulation test for modified peptides
• Peptide identification
• Identify the original peptides with one amino acid mutation
• Compare another tool
• MS-Alignment search
• Mutated database(mutDB)
• consisted of mutated sequences and their shuffled sequences
39
Results
• Overall PTM identification
• Summarize the types of modification found in three samples
40
Results
• Overall PTM identification
• Peptides containing alkylated and oxidized cysteine
• C* represents carbamidomethylated cysteine
41
Results
• Mutation analysis
• The mutations discovered by MODa analysis of Plasma
and HEK293 data sets
• MODa found
• 13 previously unreported mutations
42
Results
• Mutation analysis
• MS/MS spectra of multiply mutated peptides identified in
cataract lens sample
43
Results
• Mutation analysis
• The Pro → Ser substitution of this peptide was not
observed alone but only jointly with the Ser → Gly
substitution
44
Results
• Rare and unknown modifications
• MODa
• detect a form of
• Glycosylation
• The enzymatic process that attaches glycans to proteins
• C-linked glycosylation or C-mannosylation
• A rare form of glycosylation
45
Results
• Rare and unknown modifications
• Mannosylated peptides found in human plasma data
• The recognition motif ‘WXXW’
46
Results
• Rare and unknown modifications
47
Results
• Rare and unknown modifications
• The modification was characterized by a mass shift of 166
Da at Trp(W) and was supported by 15 unique peptides
listed in Figure c
48
Results
• Rare and unknown modifications
kynurenine(+4 Da) + C-mannosylation(+162 Da)
• An annotated MS/MS spectrum for the 166 Da modification
49
Results
• Competence in peptide identifications
• Sensitivity of unmodified peptides
50
Results
• Competence in peptide identifications
• Sensitivity of unmodified peptides
• The numbers of identified PSMs
51
Results
• Competence in peptide identifications
• Sensitivity of unmodified peptides
• MODa is very robust and does not lose unmodified peptide
identifications
52
Results
• Competence in peptide identifications
• PTM-rich data(93-year-old cataract lens)
53
Results
• Sensitivity over modified peptides
• Simulation test for modified peptides
• MODa correctly identified 90% of nonmutated PSMs and 83% of
mutated PSMs
• MS-Alignment identified 48% of nonmutated PSMs and 31% of
mutated PSMs
The numbers of PSMs matched with the 14,623 PSMs of SEQUEST
54