Transcript MODa

Modification identification –
Blind search (MODa)
Department of Electronics and
Computer Engineering
2011501682, Kyung Soo Kim
2
MODa
• A novel ‘multi-blind’ spectral alignment algorithm
▫ Fast unrestrictive PTM searches
▫ No limitation on the number of modifications per peptide
▫ Sensitive on human shotgun proteomics data
▫ The first unrestrictive search tool with the potential to fully replace conventional restrictive identification
3
MODa
• MODa uses sequence tags to select candidate peptides.
• It then uses dynamic programming to find an optimal peptide among the candidates.
4
MODa
• Multiple Sequence Tags
▫ Using multiple sequence tags is expected to have two effects:
▪ It can dramatically reduce the number of database peptides matched to each spectrum.
▪ It can also effectively localize modified regions within a spectrum.
5
MODa
• Multiple Sequence Tags (Example)
A+42ALFC+48LESAW+16K
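To make the notation in the example concrete, here is a minimal sketch that splits such an annotated peptide into its plain sequence and its mass shifts. It assumes that each signed number is a mass shift in Da attached to the residue immediately before it; the slide does not state this convention explicitly, and the function name `parse_modified_peptide` is hypothetical.

```python
import re

def parse_modified_peptide(annotated):
    """Split an annotated peptide into (sequence, {position: mass_shift}).

    Assumption: a signed number follows the residue it modifies.
    Positions are 1-based.
    """
    sequence = []
    shifts = {}
    # Each match is one residue letter plus an optional signed mass shift.
    for residue, delta in re.findall(r"([A-Z])([+-]\d+(?:\.\d+)?)?", annotated):
        sequence.append(residue)
        if delta:
            shifts[len(sequence)] = float(delta)
    return "".join(sequence), shifts

seq, mods = parse_modified_peptide("A+42ALFC+48LESAW+16K")
print(seq)   # AALFCLESAWK
print(mods)  # {1: 42.0, 5: 48.0, 10: 16.0}
```

Under this reading, the example peptide carries three modifications, which is the multi-modified case that the tags are meant to localize.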
6
MODa
• Score computation method
▫ This score is used to rank the candidate identifications from a single spectrum during dynamic programming.
▫ The experimental spectrum S is converted into a PRM spectrum:
1. S is separated into windows of 100 m/z units.
2. In each window, the top 10 peaks are retained according to their intensities.
3. Each retained peak is given a weight according to its rank.
4. For every retained peak of mass m, nodes of masses (m-1) and (PrecursorMass(S)-(m-1)) are added to the PRM spectrum.
▫ The score of a PRM node: the sum of the weights of the expected ion peaks from the PRM.
▫ The score of a candidate peptide: the sum of the scores of its PRM nodes.
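The conversion and scoring steps can be sketched roughly as follows. This is an illustration under stated assumptions, not MODa's implementation: the top-10 filter is applied per window, the rank weight is a simple linear one (the slide does not give the weighting), peaks are treated as singly charged, and a node matches a candidate prefix mass if it lies within a tolerance. The names `to_prm_nodes` and `score_candidate` are hypothetical.

```python
from collections import defaultdict

def to_prm_nodes(peaks, precursor_mass, window=100.0, top_n=10):
    """Convert (mz, intensity) peaks into a list of (node_mass, weight)."""
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // window)].append((mz, intensity))

    nodes = []
    for members in windows.values():
        # Keep the top_n most intense peaks of this window.
        ranked = sorted(members, key=lambda peak: peak[1], reverse=True)[:top_n]
        for rank, (mz, _) in enumerate(ranked):
            weight = float(top_n - rank)  # assumed linear rank weight
            nodes.append((mz - 1, weight))                     # node at m-1
            nodes.append((precursor_mass - (mz - 1), weight))  # complementary node
    return nodes

def score_candidate(prefix_masses, nodes, tol=0.5):
    """Sum the weights of nodes within tol of any candidate prefix mass."""
    return sum(
        weight
        for mass, weight in nodes
        if any(abs(mass - prefix) <= tol for prefix in prefix_masses)
    )

peaks = [(101.0, 5.0), (150.0, 9.0), (250.0, 1.0)]
nodes = to_prm_nodes(peaks, precursor_mass=400.0)
print(score_candidate([100.0, 249.2], nodes))
```

Adding both m-1 and its complement lets a single peak support the same prefix mass whether it came from a prefix or a suffix fragment.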
7
MODa
• Probability computation method
▫ The probability is evaluated by taking into account various properties representing the quality of the match between the peptide and the MS/MS spectrum.
▫ Thus, we compute the probability that the top identification is correct.
▫ Four properties:
1) PRM score
2) Mass errors of matched fragment ions
3) The fractions of b and y ions found
4) The propensity toward a particular ion type: a tryptic peptide features a stronger y-ion ladder than b-ion ladder
▫ The four components are combined by logistic regression.
▫ The weights were trained and validated on correct and incorrect matches from ISB’s standard protein mixture dataset.
8
MODa
• Probability computation method
▫ To construct the training dataset, top-ranked matches to one of the standard proteins or contaminants were classified as correct.
▫ Their second-ranked matches were classified as incorrect.
▫ Finally, the weights were obtained separately for each instrument type and precursor charge state.
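The way the four properties combine into a probability can be sketched as a plain logistic model with one weight set per (instrument type, charge state) pair. Everything numeric here is a placeholder: the trained weights are not given on the slides, and the feature scaling, the instrument labels, and the names `WEIGHTS` and `match_probability` are hypothetical.

```python
import math

# Hypothetical weight sets keyed by (instrument type, precursor charge).
# Each entry is (bias, [w_prm, w_mass_error, w_ion_fraction, w_ion_propensity]).
WEIGHTS = {
    ("ion_trap", 2): (-3.0, [0.05, -2.0, 4.0, 1.0]),
    ("ion_trap", 3): (-3.5, [0.04, -2.0, 4.5, 0.8]),
}

def match_probability(features, instrument, charge, weights=WEIGHTS):
    """Logistic combination of the four match-quality features."""
    bias, coefs = weights[(instrument, charge)]
    z = bias + sum(w * x for w, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Features: PRM score, mean fragment mass error, b/y ion fraction, y-over-b propensity.
p = match_probability([60.0, 0.1, 0.5, 0.7], "ion_trap", 2)
print(round(p, 3))
```

Keeping a separate weight set per instrument and charge state mirrors the training described above, since fragmentation behavior differs across both.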
9
MODa
• Dynamic Programming
M[p][t][s]: the maximum score of a path on the diagonal defined by a tag t
p: position
t: tag
s: where position p lies on the diagonal relative to the tag
before the tag: 0
in the tag: 1
after the tag: 2
10
MODa
• Dynamic Programming
M[p][t][s] is defined recursively as score(p,t) plus the maximum over two kinds of jump:
1. Amino acid jump within the same row
2. Modification jump to another row
11
MODa
• Dynamic Programming
1. Amino acid jump
before the tag: M[p-1][t][0], iff p <= start(t)
inside the tag: max(M[p-1][t][0], M[p-1][t][1]), iff s=1 and start(t) < p <= end(t)
after the tag: M[p-1][t][1], iff p > end(t)
12
MODa
• Dynamic Programming
2. Modification jump
M[p-1][q][1] + pf(Δ, ap)
iff s=0 and p < end(t), over all tags q such that start(q) < start(t)
pf(Δ, ap) ≤ 0 is a penalty function for a modification Δ on the amino acid ap.
In this work, we considered pf(Δ, ap) as a constant C for all modifications.
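The two jump types can be transcribed into code as below. This is only a sketch of the recurrence terms as written on the slides, not a full alignment: the table M is a dictionary keyed by (p, t, s), missing entries count as minus infinity, the "before" and "after" cases are assumed to correspond to s=0 and s=2, and the helper names are hypothetical.

```python
NEG_INF = float("-inf")

def amino_acid_jump(M, p, t, s, start, end):
    """Best predecessor on the same diagonal (same tag t)."""
    if s == 0 and p <= start[t]:                 # before the tag
        return M.get((p - 1, t, 0), NEG_INF)
    if s == 1 and start[t] < p <= end[t]:        # inside the tag
        return max(M.get((p - 1, t, 0), NEG_INF),
                   M.get((p - 1, t, 1), NEG_INF))
    if s == 2 and p > end[t]:                    # after the tag
        return M.get((p - 1, t, 1), NEG_INF)
    return NEG_INF

def modification_jump(M, p, t, s, start, end, C):
    """Best predecessor on another diagonal q, plus the constant penalty C <= 0."""
    if s != 0 or p >= end[t]:
        return NEG_INF
    best = max(
        (M.get((p - 1, q, 1), NEG_INF) for q in start if start[q] < start[t]),
        default=NEG_INF,
    )
    return best + C

def cell_value(M, p, t, s, start, end, C, score):
    """M[p][t][s] = score(p, t) + max(amino acid jump, modification jump)."""
    return score(p, t) + max(
        amino_acid_jump(M, p, t, s, start, end),
        modification_jump(M, p, t, s, start, end, C),
    )
```

Because the penalty is a constant, the number of modifications is not fixed in advance; each modification jump simply costs C, and the maximization decides how many are worth taking.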
13
MODa
• Time and space complexity
O(max(|S|, |T|² × |P|))
▫ The most prominent difference
▪ Use of the parameter k (the number of modifications)
▪ MODa determines k automatically by using dynamic programming
▫ MODa is 40 times faster than MS-Alignment
14
Experiment setting
• Proteomics dataset:
▫ Human plasma (A)
▫ HEK 293 cell line (B)
▫ Human lens (C)
▫ Data characteristics: unmodified peptide / one modified peptide; large scale complex mixture; multi-modified peptide
• Parameter setting
▫ Peptide ions: ±2.5 Da mass tolerance
▫ Fragment ions: ±0.5 Da mass tolerance
▫ No enzyme specificity
▫ 200 Da for modification mass size
▫ ‘multi-mod’ mode
15
Experiment setting
• Database
▫ IPI human database + the shuffled sequences
▫ mutDB: mutated sequences + their shuffled sequences
• Peptide identification
▫ FDR 1% using target-decoy approach
▫ FDR = (2 × D) / (T + D)
▪ T: the number of target hits above score threshold
▪ D: the number of decoy hits above score threshold
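The FDR formula above can be applied directly to pick a score threshold. The following is a small sketch, assuming each PSM is given as a (score, is_decoy) pair and that higher scores are better; tied scores are not treated specially, and the function names are hypothetical.

```python
def target_decoy_fdr(target_hits, decoy_hits):
    """FDR = (2 x D) / (T + D) for hits above a score threshold."""
    total = target_hits + decoy_hits
    return 0.0 if total == 0 else 2.0 * decoy_hits / total

def score_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose FDR is within max_fdr, or None."""
    best = None
    targets = decoys = 0
    # Walk down the ranked list, counting hits above each candidate cutoff.
    for score, is_decoy in sorted(psms, key=lambda psm: psm[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if target_decoy_fdr(targets, decoys) <= max_fdr:
            best = score
    return best

print(target_decoy_fdr(995, 5))  # 0.01
```

The factor of 2 reflects that the search database is half shuffled sequences, so each decoy hit above the threshold is taken to imply one false hit among the targets as well.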
16
Experiment setting
• Established restrictive/unrestrictive searches for
high-throughput complex mixture data
▫ Data : human plasma data (A)
▫ Database :
IPI human database + the shuffled sequences
▫ Compared tools
▪ Restrictive search tools: SEQUEST, InsPecT
▪ Unrestrictive search tool: MS-Alignment
▫ InsPecT with blind option allowing one modification per peptide
17
Experiment setting
• Unrestrictive searches for PTM-rich data
▫ Data
▪ The human ocular lens tissue (C)
▫ Unrestrictive search tools
▪ Mascot error-tolerant search (known modifications)
▪ Protein Prospector (unknown modifications)
18
Experiment setting
• Simulation test for modified peptides
▫ Evaluate how sensitive MODa is to the modified region in an MS/MS spectrum
▫ Data
▪ 2,423 peptide sequences from 14,623 SEQUEST PSMs identified in human plasma data
▫ Database: mutDB
▪ The database consists of mutated sequences and their shuffled sequences
▫ Peptide identification
▪ Identify the original peptides with one amino acid mutation
▫ Compared tool
▪ MS-Alignment search
19
Results
• Overall PTM identification
Summary of frequent modification types observed from (a) human plasma, (b) human HEK293,
and (c) human lens data sets.
20
Results
• Overall PTM identification
▫ MODa highlighted modifications on alkylated cysteine
▫ Peptides containing alkylated and oxidized cysteine,
identified in plasma data, and an MS/MS spectrum.
▫ C* represents carbamidomethylated cysteine
21
Results
• Mutation analysis
▫ The mutations discovered by MODa analysis of Plasma
and HEK293 data sets
▫ MODa found
▪ 13 previously unreported mutations
▪ 12 of which could possibly be explained by a SNP (Single Nucleotide Polymorphism)
22
Results
• Mutation analysis
▫ Identified peptides with two mutations in the cataract lens sample
23
Results
• Mutation analysis
Listed in dbSNP
Novel
▫ The Pro → Ser substitution of this peptide was not observed alone but only jointly with the Ser → Gly substitution
24
Results
• Competence in peptide identifications
Comparison of identifications by (a) MODa/one-mod, (b) MODa/multi-mod, and (c)
MS-Alignment with identifications from SEQUEST and InsPecT searches against
human plasma data. All identifications were obtained at FDR 1%. Numeric figure
below each tool represents the number of its overall identifications.
25
Results
• Competence in peptide identifications
▫ MODa
▪ Searched against whole Swiss-Prot human proteins
▫ Mascot and Protein Prospector
▪ Searched against a subset of proteins identified from their initial searches
26
Results
• Competence in peptide identifications
Comparison among identifications from MODa/multi-mod, Protein Prospector, and
Mascot/error-tolerant searches against human 93-year-old cataract lens data. All
identifications were obtained at FDR 1%. Numeric figure below each tool represents
the number of its overall identifications.
27
Results
• Sensitivity over modified peptides
▫ Simulation test for modified peptides
▪ How many spectra could be identified as the original peptides with one amino acid mutation
▪ Searched against mutDB
MODa performance on
mutated database.
Thank you for listening to my presentation.