Transcript MODa
Modification identification –
Blind search (MODa)
Department of Electronics and
Computer Engineering
2011501682, Kyung Soo Kim
2
MODa
• A novel ‘multi-blind’ spectral alignment algorithm
Fast unrestrictive PTM searches
No limitation on the number of modifications per
peptide
Sensitive on human shotgun proteomics data
▫ The first unrestrictive search tool with the potential
to fully replace conventional restrictive
identification
3
MODa
• MODa uses sequence tags to select candidate
peptides.
• It then uses dynamic programming to find the
optimal peptide among those candidates.
4
MODa
• Multiple Sequence Tags
▫ Using multiple sequence tags has two expected
effects:
It can dramatically reduce the number of database
peptides matched to each spectrum.
It also can effectively localize modified regions
within a spectrum.
5
MODa
• Multiple Sequence Tags (Example)
A+42ALFC+48LESAW+16K
6
MODa
• Score computation method
▫ This score is used to rank the candidate identifications
from a single spectrum during dynamic programming.
Experimental spectrum S is converted into a PRM spectrum:
1. S is separated into windows of 100 m/z units
2. The top 10 peaks are retained according to their intensities
3. Each peak is given a weight according to its rank
4. For every retained peak of mass m, nodes of masses (m-1) and
(PrecursorMass(S)-(m-1)) are added to the PRM spectrum
The PRM node's score:
the sum of weights of expected ion peaks from the PRM
The score of a candidate peptide:
the sum of scores of PRM nodes
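The four conversion steps can be sketched in Python. This is a minimal illustration, not MODa's actual code: the function name `to_prm_nodes`, the assumption that the top 10 peaks are taken per window, and the linear rank-to-weight mapping are all hypothetical choices, since the slide does not give the exact weighting.

```python
from collections import defaultdict

def to_prm_nodes(peaks, precursor_mass, window=100.0, top_n=10):
    """Convert (mz, intensity) peaks into weighted PRM nodes.

    Returns a list of (node_mass, weight) pairs.
    """
    # Step 1: split the spectrum into windows of `window` m/z units.
    windows = defaultdict(list)
    for mz, intensity in peaks:
        windows[int(mz // window)].append((mz, intensity))

    nodes = []
    for key in sorted(windows):
        # Step 2: keep the top_n most intense peaks (assumed: per window).
        ranked = sorted(windows[key], key=lambda pk: pk[1], reverse=True)[:top_n]
        for rank, (mz, _) in enumerate(ranked):
            # Step 3: weight by rank (assumed linear mapping, best rank -> 1.0).
            weight = (top_n - rank) / top_n
            # Step 4: add nodes at (m-1) and PrecursorMass(S)-(m-1).
            nodes.append((mz - 1.0, weight))
            nodes.append((precursor_mass - (mz - 1.0), weight))
    return nodes
```

Each retained peak contributes two nodes, one for each direction in which the fragment could be read, so a later scoring step can sum the weights of the nodes a candidate peptide explains.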
7
MODa
• Probability computation method
▫
The probability is evaluated by taking into account various
properties representing the quality of match between the peptide
and the MS/MS spectrum.
▫ Thus, we compute the probability that the top identification is
correct.
Four properties:
1) PRM-score
2) mass errors of matched fragment ions
3) the fractions of b and y ions found
4) the propensity toward a particular ion type: tryptic peptides feature a
stronger y-ion ladder than b-ion ladder
▫ The four components are combined by a logistic regression
▫ The weights were trained and validated over correct and incorrect
matches from ISB’s standard protein mixture dataset.
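As an illustration of how four match properties can be combined by logistic regression, here is a minimal Python sketch. The function name `match_probability`, the feature order, and any weight or bias values passed in are hypothetical; the trained weights are not given on the slide.

```python
import math

def match_probability(features, weights, bias):
    """Logistic combination of match-quality features.

    features: [prm_score, mass_error, ion_fraction, ion_type_propensity]
    weights, bias: regression parameters (hypothetical; the real ones
    would come from training on correct and incorrect matches).
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With all features at zero and a zero bias the function returns 0.5, and a positive weight on a feature means that larger values of that feature raise the estimated probability that the top identification is correct.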
8
MODa
• Probability computation method
▫ To construct the training dataset, top-ranked matches to one of
the standard proteins and contaminants were classified as correct.
▫ Their second-ranked matches were classified as incorrect.
▫ Finally, the weights were obtained separately according to
instrument types and charge states of precursor ions.
9
MODa
• Dynamic Programming
M[p][t][s]
Maximum score path on the diagonal defined by a tag t
p: position
t: tag
s: where position p lies on the diagonal relative to the tag
before the tag: 0
in the tag: 1
after the tag: 2
10
MODa
• Dynamic Programming
M[p][t][s] is defined recursively as score(p,t) plus the maximum over two kinds of jump:
1. Amino acid jump within same row
2. Modification jump to other row
11
MODa
• Dynamic Programming
1. Amino acid jump
before the tag: M[p-1][t][0]
iff p <= start(t)
inside the tag: max(M[p-1][t][0], M[p-1][t][1])
iff s=1, start(t) < p <= end(t)
after the tag: M[p-1][t][1]
iff p > end(t)
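The three cases of the amino acid jump can be written as a small lookup function. This is a sketch under assumptions: `M` is taken to be a nested list indexed as M[p][t][s], `start` and `end` are lists of tag boundaries, and the "before" and "after" cases are assumed to apply to s=0 and s=2 respectively, which the slide states explicitly only for s=1.

```python
NEG_INF = float("-inf")

def amino_acid_jump(M, p, t, s, start, end):
    """Predecessor score for an amino acid jump into cell (p, t, s).

    Returns -inf when no amino acid jump is allowed into that cell.
    Requires p >= 1 so that M[p-1] exists.
    """
    if s == 0 and p <= start[t]:
        # Before the tag (assumed s=0): stay in the "before" state.
        return M[p - 1][t][0]
    if s == 1 and start[t] < p <= end[t]:
        # Inside the tag: either just entered it or already inside.
        return max(M[p - 1][t][0], M[p - 1][t][1])
    if s == 2 and p > end[t]:
        # After the tag (assumed s=2): predecessor as given on the slide.
        return M[p - 1][t][1]
    return NEG_INF
```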
12
MODa
• Dynamic Programming
2. Modification jump
M[p-1][q][1] + pf(Δ, a_p)
iff s=0, p<end(t), over all tags q such that start(q)<start(t)
pf(Δ, a_p) ≤ 0 is a penalty function for a modification Δ on the amino acid a_p.
In this work, we considered pf(Δ, a_p) as a constant C for all the modifications.
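A modification jump with a constant penalty can be sketched as follows. The names `modification_jump`, `start`, `end`, and the default penalty value of -1.0 are hypothetical; the slide only says the penalty is a constant C ≤ 0 and does not give its value. `M` is assumed to be a nested list indexed as M[p][t][s].

```python
NEG_INF = float("-inf")

def modification_jump(M, p, t, s, start, end, penalty=-1.0):
    """Best predecessor score for a modification jump into cell (p, t, s).

    penalty plays the role of the constant C (assumed value, must be <= 0).
    Returns -inf when no modification jump is allowed into that cell.
    Requires p >= 1 so that M[p-1] exists.
    """
    if s != 0 or not p < end[t]:
        return NEG_INF
    # Consider every tag q that starts before tag t, leaving q from
    # its "in the tag" state and paying the modification penalty.
    candidates = [
        M[p - 1][q][1] + penalty
        for q in range(len(start))
        if start[q] < start[t]
    ]
    return max(candidates, default=NEG_INF)
```

Because the penalty is the same for every modification, the jump only has to pick the best-scoring earlier tag; no per-modification lookup is needed.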
13
MODa
• Time and Space complexity
O(max(|S|, |T|^2 * |P|))
▫ The most prominent difference
Use of the parameter k (the number of modifications)
MODa determines k automatically by using dynamic
programming
▫ MODa is 40 times faster than MS-Alignment
14
Experiment setting
• Proteomics dataset:
Human plasma (A)
HEK 293 cell line (B)
Human lens (C)
Dataset characteristics: unmodified/one-modified peptides; large-scale
complex mixture; multi-modified peptides
• Parameter setting
▫ peptide ions: ±2.5 Da mass tolerance
▫ fragment ions: ±0.5 Da mass tolerance
▫ no enzyme specificity
▫ 200 Da for modification mass size
▫ ‘multi-mod’ mode
15
Experiment setting
• Database
▫ IPI human database + the shuffled sequences
▫ mutDB: mutated sequences + their shuffled
sequences
• Peptide identification
▫ FDR 1% using target-decoy approach
▫ FDR = (2xD)/(T+D)
T: the number of target hits above score threshold
D: the number of decoy hits above score threshold
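The FDR formula and the choice of a score threshold can be illustrated in Python. This is a minimal sketch: the function names and the (score, is_decoy) input format are assumptions, not part of any tool named in the slides.

```python
def target_decoy_fdr(target_hits, decoy_hits):
    """FDR = (2 x D) / (T + D) for hits above a score threshold."""
    total = target_hits + decoy_hits
    if total == 0:
        return 0.0
    return 2 * decoy_hits / total

def lowest_threshold_at_fdr(psms, max_fdr=0.01):
    """Lowest score threshold whose accepted hits satisfy FDR <= max_fdr.

    psms: list of (score, is_decoy) pairs. Returns None if no threshold works.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best = None
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Only evaluate once all hits sharing this score are counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and target_decoy_fdr(targets, decoys) <= max_fdr:
            best = score
    return best
```

For example, 99 target hits and 1 decoy hit above a threshold give an FDR of 2 x 1 / 100 = 0.02, which would not meet the 1% cutoff.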
16
Experiment setting
• Established restrictive/unrestrictive searches for
high-throughput complex mixture data
▫ Data : human plasma data (A)
▫ Database :
IPI human database + the shuffled sequences
▫ Compared tools
Restrictive search tools: SEQUEST, InsPecT
Unrestrictive search tool: MS-Alignment
▫ InsPecT with blind option allowing one modification per peptide
17
Experiment setting
• Unrestrictive searches for PTM-rich data
▫ Data
The human ocular lens tissue(C)
▫ Unrestrictive search tools
Mascot error tolerant search (Known modification)
Protein Prospector (Unknown modification)
18
Experiment setting
• Simulation test for modified peptides
▫ Evaluate how sensitive MODa is to a
modified region in an MS/MS spectrum
▫ Data
2,423 peptide sequences from 14,623 SEQUEST PSMs
identified in human plasma data
▫ Database: mutDB
The database consists of mutated sequences and their
shuffled sequences
▫ Peptide identification
Identify the original peptides with one amino acid
mutation
▫ Compared tool
MS-Alignment search
19
Results
• Overall PTM identification
Summary of frequent modification types observed from (a) human plasma, (b) human HEK293,
and (c) human lens data sets.
20
Results
• Overall PTM identification
▫ MODa highlighted modifications on alkylated cysteine
▫ Peptides containing alkylated and oxidized cysteine,
identified in plasma data, and an MS/MS spectrum.
▫ C* represents carbamidomethylated cysteine
21
Results
• Mutation analysis
▫ The mutations discovered by MODa analysis of Plasma
and HEK293 data sets
▫ MODa found
13 previously unreported mutations
12 of which could possibly be explained by a SNP (single
nucleotide polymorphism)
22
Results
• Mutation analysis
Identified peptides with two mutations in cataract
lens sample
23
Results
• Mutation analysis
Listed in dbSNP
Novel
The Pro → Ser substitution of this peptide was not observed
alone but only jointly with the Ser → Gly substitution
24
Results
• Competence in peptide identifications
Comparison of identifications by (a) MODa/one-mod, (b) MODa/multi-mod, and (c)
MS-Alignment with identifications from SEQUEST and InsPecT searches against
human plasma data. All identifications were obtained at FDR 1%. Numeric figure
below each tool represents the number of its overall identifications.
25
Results
• Competence in peptide identifications
MODa
against whole Swiss-Prot human proteins
Mascot and Protein Prospector
against a subset of proteins identified from their initial
searches
26
Results
• Competence in peptide identifications
Comparison among identifications from MODa/multi-mod, Protein Prospector, and
Mascot/error-tolerant searches against human 93-year-old cataract lens data. All
identifications were obtained at FDR 1%. Numeric figure below each tool represents
the number of its overall identifications.
27
Results
• Sensitivity over modified peptides
▫ Simulation test for modified peptides
How many spectra could be identified as the original peptides with one
amino acid mutation
when searched against mutDB.
MODa performance on
the mutated database.
Thank you for listening to my presentation.