modifications and identification

Download Report

Transcript modifications and identification

Popitam,
une méthode tolérante aux
mutations/modifications pour
l'identification de protéines à
partir de données de spectrométrie
de masse (MS/MS)
Patricia Hernandez
Swiss Institute of Bioinformatics
Overview
- proteomics
- proteome
- proteome visualization: 2D gels
- protein identification
- classical workflow
- shared peak count
- modifications and identification
- modified peptides
- SPC
- spectral alignment, de novo sequencing, tag extraction
- Popitam
- overview
- tags
- scoring function, genetic programming
- some results
Proteome
--> Proteomics: science that studies proteins
expressed by a genome --> proteome
--> changes with the state of development,
the tissue or the environmental conditions
-->
-->
-->
-->
-->
-->
...
identification and quantification
3D structure prediction
localisation in the cell
biological function
modifications
interactions with other proteins
proteomics
2d gels
proteomics
--> a simple way to "see" a proteome
--> numerous proteins from a biological sample (example: blood)
are separated according to 2 criteria :
molecular weight of the protein
isoelectric point
--> this method allows separating simultaneously thousands of
proteins and displaying them on a
two-dimensional map
--> spot = (generally) one purified protein
--> we can "see" the proteins, but we don't
know to which protein corresponds a
given spot...
Spots identification: classical workflow
--> identify a spot = give a protein name to a spot
MS/MS
identification
--> protein databases (for example SwissProt)
- records all known proteic sequences
- annotated
cut the aa chain into
peptides (every K
and R aa)
select an unknown
purified protein
MGQGWATAGLPSFRPEPYKCYGHPVP
SQEASQQVTVKTHGTSSQATTSSQK…
select a
peptide
MGQGWATAGLPSFR
PEPYK
CYGHPVPSQEASQQVTVK...
fragment
it
protein identification
MG
MGQ
MGQGWA
WAT
WATA
...
measure the mass of the
fragments by ms
measure the mass of the
peptides by ms
MS
identification
(PMF)
protein identification
Shared peak count
MS spectrum: list of the masses of peptides that constitute the
protein of interest
MS/MS spectrum: list of masses of fragments that constitute a
peptide of the protein of interest
MS: virtually cut the
theo. seq. into peptides
and compute masses
MS/MS: virtually cut
the theo. seq. into
peptides, and further
cut the peptides into
fragments, and compute
the masses
protein
database
p
i
compare the list of
experimental and
theoretical masses in
order to find the best
match between
experimental and virtual
spectra
--> detection
--> ions
--> noise
g
hbb_human
Modified peptides (1)
modifications and identification
PTMs
--> most eukaryote proteins
--> addition of a chemical group : - methylation:+14
- phosphroylation:+80
--> participate to:
- proteic structures - glycosylation: >800 ...
- proteic functions
- control of metabolic pathways
The sequence of the database may differ from the experimental peptide:
CONFLICT (different sources report differing sequences)
--> in about 4'600 human entries
VARIANT (authors report that sequence variants exist) = alleles
--> in about 2'200 human entries
MUTATIONS associated with diseases
--> 187 references to mutations and diseases in COMMENTS section
modifications and identification
Modified peptides (2)
EPYK
PEP
MGQGWATAGLPSFRPEPYKCYGHPVP
SQEASQQVTVKTHGTSSQATTSSQK…
PYK
PEPYK
MS, selection of
the peptide
digestion
fragmentation
intensity
intensity
a modified
protein
m/z
m/z
modified
experimental
MS/MS spectrum
intensity
experimental
MS/MS
spectrum
intensity
SPC and modified peptides
modifications and identification
m/z
intensity
intensity
m/z
m/z
m/z
theoretical peptide
"Shared peak count" algorithms have to introduce
modifications into the theoretical peptide databases.
modifications and identification
Database size (1)
AAIEGKLMQRAPALK
AAIEGK
LMQR
APALK
New database, if the two following modifications are taken
into account
- modification occurring on amino acid A: A->a
- modification occurring on amino acids L: L->l and E: E->e
= all the peptide from the initial database, plus all
modified peptides that can be built from the
initial database
AAIEGK
aAIEGK
AaIEGK
aaIEGK
AAIeGK
aAIeGK
AaIeGK
aaIeGK
LMQR
lMQR
APALK
aPALK
APaLK
aPaLK
APAlK
aPAlK
APalK
aPalK
modifications and identification
Database size (2)
B(L,p,k) gives the probability to have k positions of modification in a
sequence of lenght L, if p is the probability that a position may be
modified
(we assume the positions to be independent)
N0 xxxx
Aim: assess the number of peptides that contain zero, one, two...
"positions" for a possible modification
N k  nB ( L, p, k )  nC p 1  p 
k
L
n
k
Lk
max k
N
k 0
k
L = 10, p = 1/20:
800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c
L= 10, p= 5/20:
800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
N1 oxxx
xoxx
xxox
xxxo
N2 ooxx
oxox
oxxo
xoox
xoxo
xxoo
modifications and identification
Database size (3)
Expected number s of peptides that may contain exactly
M modifications
sM 
L max
C Nk
k M
M
k
Expected size of database when taking into account 0 to
M modifications
M
s0toM   si
i 0
N0 xxxx
N1 oxxx
xoxx
xxox
xxxo
N2 ooxx
ooxx
oxox
oxox
...
Database size (3)
modifications and identification
SwissProt Human, 10'000 proteins
n = 806'787 peptides [300,3000] (=~from 3 to 30 aa)
L = 11 amino acids
0 to 3 modifications occuring on one specific amino acid: p=1/20
P0to3_mod = 1'375'700 + c
0 to 3 modifications that may occur on several loci:
Phosphorylation: H,D,S,T,Y (eucaryotes): p = 5/20
P0to3_mod = 4'865'100 + c
0 to 3 modifications that may occur on every amino acid: p=1
P0to3_mod = 3,97e12 + c
Mutation scenario:
Each amino acid may mutate into one of the remaining 19 amino acids:
All possible words = 19k-1
P1_mut = 1.16e14
Other strategies
modifications and identification
2 major problems:
- size of the database
- a priori knowledge on the deltaMass due to the modification
Solutions:
Define an identification algorithm that is not based on a SPC
--> spectral convolution/alignment
- PEDENTA (2000)
--> de novo sequencing followed by sequence matching
- extraction of one or several complete sequences
LUTEFISK (1997), SHERENGA (1999)...
- extraction of one or several small tags
(PeptideSearch, 1994), Patchwork sequencing...
--> Popitam (2003): "guided" sequencing
modifications and identification
Spectral convolution/alignment
Key idea: k-similarity D(k)
Given Sexp and Stheo, the goal is to find a serie
of k shifts in Sexp that makes Sexp and Stheo as
similar as possible.
D(k) represents the maximum number of
elements in common between a
theoretical and an experimental spectrum after
k shifts
SPC score:
SA score:
if (i',j') and (i,j)
are co-diagonal
otherwise
D(k=0) = 2
D(k=2) = 6
exp. MS/MS spectrum
 Di ' j ' ( k )  1
Dij (k )  max 
( i ', j ')( i , j ) D
 i ' j ' ( k  1)  1
Pevzner PA, Dancik V, Tang CL: Mutationtolerant protein identification by mass
spectrometry. J.Comput.Biol. 2000, 7:777787
theo. MS/MS spectrum
A
A
B
C
D
E
F
B
C
D
E
F
De novo sequencing
modifications and identification
Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass
spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075
Longest path problem in a directed acyclic graph --> dynamic programming
--> complete sequences
--> mutations, but no modifications
4/24
modifications and identification
Tag extraction
Mann M, Wilm M: Error-tolerant
identification of peptides in sequence
databases by peptide sequence tags.
Anal.Chem. 1994, 66:4390-4399
Island of sequence ions
The tags (m1-SEQ-m2) are manually extracted
2 steps: tags as filtering, then SPC
Schlosser A, Lehmann WD: Patchwork
peptide sequencing: Extraction of
sequence information from accurate
mass data of peptide tandem mass
spectra recorded at high resolution*.
Proteomics. 2002, 2:524-533
Based on very accurate masses (10 mDa)
Small tags are extracted from low mass regions (2 aa)
Popitam key's idea
Spectrum graph
--> good way to structure the information contained in
the MS/MS spectrum, allows mutations
Tags
--> modified source peptides
--> fragmented spectra
Search space
--> use dtb information during tag extraction
--> take into account only mutations compatible with
the spectrum (graph)
--> make only modification scenarios compatible with
the current theoretical peptide
Scoring function
--> take into account a lot of parameters
--> genetic programming
Popitam
Popitam
Popitam overview
any source of
biological sequences
initial node
I(P1)
I(P2)
...
P1
P2
...
filter
final node
IDENTIFICATION
MS/MS
Peptide
sequence
database
For each Pi
extractTags();
processTags();
score();
7/12
Popitam
Spectrum graph
y++
measured mass
[m/z]
a+-H20
- selection based on intensity
- for each peak, make
all possible hypotheses
b+
b+-NH3
bMass
(ideal fragmentation)
- # nodes > # peaks
- families
“N-term”: bMass = chargeNb * m/z – (chargeNb-1) – offset
“C-term”: bMass = PM – […]
5/12
Popitam
Tag extraction
peLTE
peLet
peLvm
peITE
peIet
peIvm
petlE
LTE
Let
Lvm
ITE
Iet
Ivm
tlE
ck
TE
et
vm
go
E
V
9 nodes,11 edges
--> 21 tags
Popitam
Tag extraction (2)
LVNELTEFAK (125 peaks)
Pentium, 1.6 GHz
AIGGGLSSVGGSSTIK
(1159 peaks)
1
2
3
4
5
16/97
30/338
44/692
58/1121
72/1667
5.6*104
5.4*106
5.7*107
3.4*108
2.3*109
0m02s
0m27s
3m16s
21m09s
2h17m07s
AHFSISNSAEDPFIAIHADSK
(145 peaks)
1 24/121
2 46/308
3 68/831
6.1*104
0m02s
8
1.9*10
16m15s
2.0*1010 22h06m47s
Popitam
Tag extraction (3)
Recursively extract from the graph all tags that are compatible
with the current theoretical peptide
--> a tag = a path (bMass, edge label, ionic hypothesis…)
ACCACMCAK
-
A
C
CACMCAK
k
CACMCAK
MCAK
k
MCAK
C
A
CMCAK
MCAK
k
Popitam
Tag processing
- discard subtags
- discard tags that begin the theo. peptide,
but not the graph (and vice versa)
- discard tags that finish on the last aa, but
not on the last node
- group "family" tags
AVVQDPALKPLALVYGEATSR
PeakNb : 1260
ParentMass
: 2197.15
NodeNb :
86
EdgeNb :
142 / 1098
29 tags --> 13 subSeqs
KplALVYGE
plALVYGE
ALVYGE
LVYGE
VYGE
YGE
30 39 43 45
39 43 45
43 45
45
paLKplALvy
LKplALvy
KplALvy
plALvy
ALvy
0 4 10 16
4 10 16
10 16
16
LKPla
LKPla
KPla
KPla
10 13 19
10 14 19
13 19
14 19
22
22
22
22
22
22
22
22
22
50
50
50
50
50
58
58
58
58
58
58
26
26
26
26
26
31
31
31
31
63
63
63
63
63
63
31
31
31
31
31
64
64
64
64
64
64
42
42
42
42
42
68
68
68
68
68
68
PLAlv
LAlv
DpaL
LKP
LVY
LVY
PAL
QDP
alkpL
avVqd
dpAL
avVQD
VQD
paLK
29 35
35
65 69
11 15
16 19
44 49
19 22
10 16
54 63
0 5 9
37 43
55 60
60
59 66
40
40
78
20
24
57
26
20
71
18
45
65
65
69
42 48
42 48
84
24
29
62
31
24
75
50
70 75
70 75
75
Popitam
Subsequence processing (1)
Aim:
Find all possible arrangements of subsequences, given the theoretical peptide
BUT
do not include in a same arrangement tags that are incompatible with the others.
Compatibility rules:
--> no peak shared
--> beginMasses must respect positions in the sequences
Compatibility graph
0 1 2 3 4 5
0 x
x
1
x
x
2
x x
3
x x x
x
4
x
5 x
x
x
...
A V V Q D P A L K P L A L V Y G E A T S R
0
5
10
15
...
0 KplALVYGE 794.41
1 LKPla
282.17
2 PLAlv
785.34
3 DpaL
1673.89
4 LKP
284.11
5 LVY
410.26
...
0 1 2 6 15 19 21 27 30
2 7 29 33 41
6 8 19 21 28
14 20 31 36
17 22 32 36
14 22 28 29
Each found clique in the graph is a possible arrangement of subsequences
Here, 91 cliques, but most of them are really uninteresting.
Popitam
Scoring function (1)
--> 2 levels scoring:
AVVQDPALKPLALVYGEATSR
KplALVYGE
LKP
LVY
794.4
284.1
1202.7
AVVQDPALKPLALVYGEATSR
KplALVYGE
LKP
LVY
avVqd
794.4
284.1
1202.7
1.0
AVVQDPALKPLALVYGEATSR
KplALVYGE
LKP
avVqd
794.4
284.1
1.0
AVVQDPALKPLALVYGEATSR
KplALVYGE
LVY
...
794.4
1202.7
- scoring linked to the subsequences (local)
subscores:
number of tags that compose the subsequence
length of the subsequence
occurrence probabilities of the ionic type
hypothesized (geometric/arithmetic mean)
- scoring linked to the arrangement (global)
subscores:
global coverage
linear regression
Scoring function (2)
How can we combine the subscores in order to build an efficient
scoring function ?
--> empirical function (expert knowledge)
--> probabilitic function
--> function built using GENETIC PROGRAMMING
GENETIC PROGRAMMING
population of "programs" : trees
nodes : mathematic operators (+, -, *, /, ^, ...)
bolean operators (AND, OR, NOT...)
conditional operators (if-then-else...)
iterative functions (do-until...)
other specific functions...
leaves : subscores, coefficient
Popitam
Genetic operators (1)
Popitam
Initiation:
Programs are initially randomly determined (structure, functions, values)
Iterations:
At each iteration, the programs are evaluated (fitness function). Only the
best are allowed to reproduce, using genetic operators (permutation,
mutation, crossing-over...).
Genetic operators (2)
Popitam
Popitam
Genetic programming
genetic programming allows testing several scoring functions and
making them "cleverly" evolve in order to find an optimal one
tree population
if (correctId() )
si  ]0.5;1[
(according to the discriminative power)
scoring
function1
else {
if (belongToList() )
si  ]0;0.5]
(according to the position in the list)
Popitam
else
si = 0;
fitness
scoring
function3
scoring
function2
Popitam
fitness
Popitam
fitness
Some results
Popitam