Mass Spectrometry-based Proteomics
Xuehua Shen
(Adapted from slides with textbook)
1
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence
tags)
2
Motivation
• Proteins are the working units of the cell
– The number of known genes is much smaller than the number of expressed proteins
– Proteins are directly related to cell processes and diseases
DNA (~30,000 human genes; SNP)
→ mRNA (alternative splicing; >100,000 RNA messages)
→ Protein (post-translational modification; >1,000,000 distinct protein forms)
3
Tools for Proteomics
• Edman degradation reaction
• NMR (Nuclear Magnetic Resonance)
• X-ray crystallography
• Protein array
• Mass Spectrometry
4
Mass Spectrometry-based Proteomics
• Primary sequence (sequencing, identification)
• Post-translational modification (PTM)
(characterization)
• Quantitative proteomics (quantification)
• Protein-protein interaction
5
6
Components of Mass Spectrometer
• Ion source (ESI and MALDI)
• Mass analyzer (ion traps, TOF, Quadrupole, FT,
etc.)
– Mass-to-charge ratio (m/z)
• Ion detector
7
Peptide and Intact Protein
• Peptide: a fragment of a protein
• Some enzymes, e.g. trypsin, cleave proteins into peptides.
• Some technologies introduce intact proteins into the mass spectrometer
8
Peptide Fragmentation
Collision Induced Dissociation
[Figure: protonated (H+) peptide backbone from N-terminus to C-terminus: H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH]
Peptides tend to fragment along the backbone.
Fragments can also lose neutral chemical groups like NH3 and H2O.
9
Ideal Mass Spectrum
10
Real Mass Spectrum
11
N- and C-terminal Peptides
12
Terminal peptides and ion types
Peptide mass (Da): 57 + 97 + 147 + 114 = 415
Peptide mass (Da) without H2O: 57 + 97 + 147 + 114 – 18 = 397
13
N- and C-terminal Peptides
[Figure: fragment masses 486, 71, 415, 301, 185, 154, 332, 57, 429]
14
N- and C-terminal Peptides
486
71
415
Problem: Reconstruct the peptide from the set of fragment masses:
57, 71, 154, 185, 301, 332, 415, 429, 486
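The forward direction of this problem can be sketched in a few lines. This is a minimal illustration using nominal integer residue masses. The peptide GPFNA is an assumption inferred from the masses on the slide (57, 97, 147, 114, 71), not something the slide states.

```python
# Nominal (integer) residue masses in Da for the residues used in this example.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}

def fragment_masses(peptide):
    """Return the sorted masses of all prefixes and suffixes of the peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    result = {total}                 # the whole peptide
    running = 0
    for m in masses[:-1]:
        running += m
        result.add(running)          # prefix (N-terminal) fragment
        result.add(total - running)  # matching suffix (C-terminal) fragment
    return sorted(result)

# GPFNA is a hypothetical example peptide chosen to match the slide's masses.
print(fragment_masses("GPFNA"))
```

The sequencing problem is the inverse: given only the masses, recover the peptide.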
17
Mass Spectra
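[Figure: mass spectrum of a peptide with peaks labeled by residues (G, V, D, L, K). Peak spacings match residue masses, e.g. 57 Da = 'G' and 99 Da = 'V'; an H2O loss is also marked. Axis: mass.]
• The peaks in the mass spectrum:
– Prefix and suffix fragments.
– Fragments with neutral losses (-H2O, -NH3)
– Noise and missing peaks.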
18
Protein Identification with MS/MS
MS/MS → Peptide Identification
[Figure: MS/MS spectrum (intensity vs. mass) with peaks labeled G, V, D, L, K]
19
Protein Identification by Tandem Mass
Spectrometry
MS/MS instrument → spectrum → sequence
[Figure: MS/MS spectrum, relative abundance vs. m/z (0 to 2000). Scan header: S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6; T: + c d Full ms2 638.00 [165.00 - 1925.00]]
• De novo interpretation: Sherenga
• Database search: Sequest
20
De Novo vs. Database Search
[Figure: MS/MS spectrum (relative abundance vs. m/z) interpreted two ways]
De Novo: read the sequence from the spectrum (mass, score)
Database Search: match against a database of known peptides:
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
Result: AVGELTK
21
Pros and Cons of de novo Sequencing
• Advantage:
– Gets the sequences that are not necessarily in the database.
– An additional similarity search step using these sequences may identify the related proteins in the database.
• Disadvantage:
– Requires higher quality data.
– Often contains errors.
22
Current Status
• Protein sequencing is still an open problem, whether by de novo sequencing or by database search
• The following algorithms deal only with simplified (or ideal) spectra
• Some algorithms combine de novo sequencing and database search
23
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing
• Database search
• Algorithms of real software (e.g., sequence
tags)
24
De novo Peptide Sequencing
[Figure: MS/MS spectrum, relative abundance vs. m/z (0 to 2000). Scan header: S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6; T: + c d Full ms2 638.00 [165.00 - 1925.00]]
→ Sequence
25
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– Δ: set of possible ion types
– m: parent mass
Output:
– P: peptide with mass m, whose theoretical spectrum best matches the experimental spectrum S
26
Procedure of De Novo Sequencing
• Build spectrum graph
– How to create vertices (from masses)
– How to create edges (from mass differences)
• Find best path or rank paths of spectrum graph
– How to find candidate paths
– How to score paths
27
From Sequence to Spectrum
[Figure: b-ions of the peptide SEQUENCE along the mass/charge (m/z) axis]
28
From Sequence to Spectrum (cont.)
[Figure: a-ions of the peptide SEQUENCE along the mass/charge (m/z) axis]
29
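From Sequence to Spectrum (cont.)
Each a-ion is the corresponding b-ion shifted by a fixed, ion-type-specific mass offset.
[Figure: a- and b-ions of the peptide SEQUENCE along the mass/charge (m/z) axis]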
30
From Sequence to Spectrum (cont.)
[Figure: y-ions of the peptide SEQUENCE, read from the C-terminus (E, C, N, E, U, Q, E, S), along the mass/charge (m/z) axis]
31
From Sequence to Spectrum (cont.)
[Figure: peak intensity vs. mass/charge (m/z)]
32
From Sequence to Spectrum (cont.)
[Figure: peak intensity vs. mass/charge (m/z)]
33
From Sequence to Spectrum (cont.)
[Figure: spectrum with noise peaks, mass/charge (m/z) axis]
34
MS/MS Spectrum
[Figure: intensity vs. mass/charge (m/z)]
35
Some Mass Differences between Peaks
Correspond to Amino Acids
[Figure: spectrum in which mass differences between peaks are labeled with amino acid letters, here the letters of "sequence"]
36
Now decoding from spectrum
to sequence…?
Build spectrum graph
37
Vertices of Spectrum Graph
• Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk}
• Every mass s in an MS/MS spectrum generates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
corresponding to potential N-terminal peptides
• Vertices of the spectrum graph:
{initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
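A minimal sketch of the vertex construction, assuming integer masses and treating the initial vertex as mass 0 and the terminal vertex as the parent mass. The function name and the example shifts are illustrative, not taken from any particular software.

```python
def spectrum_graph_vertices(peaks, shifts, parent_mass):
    """Map each peak s to candidate prefix masses s + delta, plus start and end."""
    vertices = {0, parent_mass}      # initial and terminal vertex
    for s in peaks:
        for delta in shifts:
            v = s + delta
            if 0 < v < parent_mass:  # keep only masses inside the peptide
                vertices.add(v)
    return sorted(vertices)

# Hypothetical example: each peak is either a prefix mass itself (shift 0)
# or a prefix mass that lost water (shift +18 recovers it).
print(spectrum_graph_vertices([57, 136], [0, 18], 486))
```

Each peak contributes one vertex per ion type, so most vertices are wrong guesses; path finding and scoring later sort them out.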
38
Reverse Shifts
Shift in H2O
Shift in H2O+NH3
39
Edges of Spectrum Graph
• Two vertices with mass difference corresponding to an amino acid A:
– Connect with an edge labeled by A
• Gap edges for di- and tri-peptides
– Potential sequence tag method (covered later)
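The single-residue edge rule can be sketched as follows. This assumes nominal integer masses and a small residue table; the `tolerance` parameter is an added assumption for real, non-integer data, and gap edges are left out.

```python
# Nominal residue masses (Da) for a small illustrative alphabet.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def spectrum_graph_edges(vertices, tolerance=0):
    """Connect u -> v with label A when v - u matches the mass of residue A."""
    vs = sorted(vertices)
    edges = []
    for i, u in enumerate(vs):
        for v in vs[i + 1:]:
            for aa, mass in RESIDUE_MASS.items():
                if abs((v - u) - mass) <= tolerance:
                    edges.append((u, v, aa))
    return edges

print(spectrum_graph_edges([0, 57, 154, 301, 415, 486]))
```

Because every edge goes from a lower to a higher mass, the resulting graph is a DAG.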
40
Best Path of Spectrum Graph
• How to find candidate paths
• There are many paths, how to find the correct
one?
• We need scoring to evaluate paths
41
Find Candidate Paths
• Heuristics: find a path with maximum number
of edges
• Longest path problem in DAG
• DFS (Depth First Search)
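One way to find a path with the maximum number of edges is sketched below. Note that it uses dynamic programming over vertices in decreasing mass order rather than the DFS named on the slide; both work on a DAG. The edge format `(u, v, label)` is an assumption of this sketch.

```python
def longest_path(vertices, edges, start, end):
    """Return (edge count, label string) of a longest start -> end path, or None."""
    out = {}
    for u, v, label in edges:
        out.setdefault(u, []).append((v, label))
    best = {end: (0, "")}  # vertex -> best (edge count, labels) from it to end
    # Edges go from lower to higher mass, so decreasing mass is a valid order.
    for u in sorted(vertices, reverse=True):
        if u == end:
            continue
        candidates = [(best[v][0] + 1, label + best[v][1])
                      for v, label in out.get(u, []) if v in best]
        if candidates:
            best[u] = max(candidates)
    return best.get(start)

# Hypothetical graph: the true path plus one dead-end branch (57 -> 128).
vertices = [0, 57, 128, 154, 301, 415, 486]
edges = [(0, 57, "G"), (57, 154, "P"), (57, 128, "A"),
         (154, 301, "F"), (301, 415, "N"), (415, 486, "A")]
print(longest_path(vertices, edges, 0, 486))
```

Edge count alone is only a heuristic; real spectra need the probabilistic path scores introduced next.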
42
Path Score
• p(P,S) = probability that peptide P produces
spectrum S= {s1,s2,…sq}
• p(P, s) = the probability that peptide P
generates a peak s
• Scoring = computing probabilities
43
Finding Optimal Paths in the
Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P' maximizing p(P,S) over all possible peptides P:
$$p(P', S) = \max_P p(P, S)$$
• Peptides = paths in the spectrum graph
• P' = the optimal path in the spectrum graph
• Some software ranks paths
44
Ions and Probabilities
• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$
• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$
• A peptide also produces "random noise" with uniform probability $q_R$ in any position.
45
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and
penalties for missing ions.
• Example: for k=4, assume that for a partial peptide P' we only see ions δ1, δ2, δ4.
The score is calculated as:
$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$$
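The ratio test for one partial peptide can be written directly from the formula. This is a small sketch; the probability values in the example are made up for illustration and are not measured ion frequencies.

```python
import math

def ratio_score(observed, q, q_r):
    """Multiply q_i/q_R for observed ions and (1-q_i)/(1-q_R) for missing ones."""
    score = 1.0
    for seen, q_i in zip(observed, q):
        if seen:
            score *= q_i / q_r              # premium for an observed ion
        else:
            score *= (1 - q_i) / (1 - q_r)  # penalty for a missing ion
    return score

# Hypothetical values: k = 4 ion types, ions 1, 2 and 4 observed.
q = [0.7, 0.5, 0.4, 0.3]
print(ratio_score([True, True, False, True], q, q_r=0.1))
```

In practice the logarithm of this ratio is often summed instead, which avoids underflow when many factors are multiplied.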
46
Why Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm                              Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                 0.189
SHERENGA (Dancik et al., 1999)         0.690                 0.289
Peaks (Ma et al., 2003)                0.673                 0.246
PepNovo (Frank and Pevzner, 2005)      0.727                 0.296
• Less than 30% of the peptides sequenced were
completely correct!
47
The End
Thank you !
48
De Novo vs. Database Search
[Figure: MS/MS spectrum (relative abundance vs. m/z) interpreted two ways: de novo, and by search against a database of known peptides that contains the match AVGELTK]
49
De Novo vs. Database Search: A
Paradox
• De novo algorithms are much faster, even though their search space is much larger!
• A database search scans all peptides in the search space to find the best one.
• De novo eliminates the need to scan all peptides by modeling the problem as a graph search.
Why not sequence de novo?
50
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: Mass Spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence
tags)
51
Peptide Identification Problem
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
Output:
– A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S
52
MS/MS Database Search
Database search in mass spectrometry has been very successful at identifying already known proteins.
An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable identification of modified peptides is a much more difficult problem.
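A toy version of the comparison step is sketched below. It filters database peptides by parent mass and scores each by the number of shared peaks. This is not SEQUEST's scoring; the shared-peak score, the nominal integer masses and the function names are simplifying assumptions.

```python
# Nominal (integer) residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def theoretical_spectrum(peptide):
    """Prefix and suffix masses of the peptide (idealized, no ion-type shifts)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total, running, peaks = sum(masses), 0, set()
    for m in masses[:-1]:
        running += m
        peaks.add(running)
        peaks.add(total - running)
    return peaks

def database_search(spectrum, parent_mass, database):
    """Return (peptide, shared peak count) for the best-matching peptide, or None."""
    best = None
    for peptide in database:
        if sum(RESIDUE_MASS[aa] for aa in peptide) != parent_mass:
            continue  # parent mass filter
        score = len(theoretical_spectrum(peptide) & set(spectrum))
        if best is None or score > best[1]:
            best = (peptide, score)
    return best

# Hypothetical experimental peaks: some AVGELTK fragments plus one noise peak.
peaks = [71, 170, 227, 342, 400, 471, 570, 627]
print(database_search(peaks, 698, ["GGPASSDA", "AVGELTK", "HEWAILF"]))
```

The cost grows linearly with the database size, which is the contrast with graph-based de novo search drawn elsewhere in these slides.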
53
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Almost all protein sequences are post-translationally modified, and 200 types of modifications of amino acid residues are known.
54
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
55
Search for Modified Peptides: Virtual
Database Approach
Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.
Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.
Problem (Yates et al., 1995). Extend the virtual database approach to a large set of modifications.
56
Exhaustive Search for Modified Peptides
YFDSTDYNMAK
• For each peptide, generate all modifications.
• Score each modification.
• Oxidation? Phosphorylation?
• 2^5 = 32 possibilities, with 2 types of modifications!
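The count of 32 can be reproduced with a short enumeration. The site rules used here (phosphorylation on S, T and Y; oxidation on M) are a common convention assumed for this sketch, not something the slide states.

```python
from itertools import product

# Assumed site rules: which residues can carry which modification.
MODIFIABLE = {"S": "phospho", "T": "phospho", "Y": "phospho", "M": "oxidation"}

def modified_forms(peptide):
    """Return every on/off combination of modified positions (0-based)."""
    sites = [i for i, aa in enumerate(peptide) if aa in MODIFIABLE]
    forms = []
    for choice in product([False, True], repeat=len(sites)):
        forms.append(tuple(s for s, on in zip(sites, choice) if on))
    return forms

forms = modified_forms("YFDSTDYNMAK")
print(len(forms))  # 5 modifiable sites, so 2**5 forms
```

Each extra modifiable site doubles the number of candidates, which is the combinatorial explosion of the virtual database approach.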
57
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
– Parameter k (# of mutations/modifications)
Output:
– A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S
58
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
59
Spectrum Alignment
• See 8.14 and 8.15 in the textbook for one algorithm
• Complicated for real spectra
60
Quality Measure of Mass Spectrometer
• Sensitivity
• Mass accuracy
• Resolution
• Dynamic range
61
Ion Types
• Some masses correspond to fragment
ions, others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} lets us
distinguish fragment ions from noise
• We can learn ion types δi and their
probabilities qi by analyzing a large test
sample of annotated spectra.
62
Database Search:
Sequence Analysis vs. MS/MS Analysis
Sequence analysis:
similar peptides (that are a few mutations apart) have similar sequences
MS/MS analysis:
similar peptides (that are a few mutations apart) have dissimilar spectra
65
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure of
spectral similarity.
Problem: SPC diminishes very quickly as the
number of mutations increases.
Only a small portion of correlations between the
spectra of mutated peptides is captured by
SPC.
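A small sketch of the shared peaks count and of how fast it drops. The `tolerance` value and the example masses (prefix masses of a peptide and of a single-mutation variant, in nominal Da) are assumptions chosen for illustration.

```python
def shared_peaks_count(spectrum_a, spectrum_b, tolerance=0.5):
    """Count peaks that match between two spectra within a mass tolerance."""
    a, b = sorted(spectrum_a), sorted(spectrum_b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

# Prefix masses of AVGELTK, and of the same peptide with V -> A at position 2.
original = [71, 170, 227, 356, 469, 570]
mutated = [71, 142, 199, 328, 441, 542]
print(shared_peaks_count(original, original))
print(shared_peaks_count(original, mutated))
```

A single substitution shifts every prefix mass after the mutated position, so only the peak before it is still shared.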
66
Ions and Probabilities
• Tandem mass spectrometry is characterized
by a set of ion types {δ1,δ2,..,δk} and their
probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced
independently with probabilities qi
67
De Novo vs. Database Search:
• The database of all peptides is huge: ≈ O(20^n).
• The database of all known peptides is much smaller: ≈ O(10^8).
• However, de novo algorithms can be much faster, even though their search space is much larger!
• A database search scans all peptides in the database of known peptides to find the best one.
• De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
68
Probabilistic Model
• For a position $t_{\delta_j} \in T_i$, the probability $p(t, P, S)$ that peptide P produces a peak at position t is:
$$p(t, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t \\ 1 - q_j & \text{otherwise} \end{cases}$$
• Similarly, for $t \in R$, the probability that P produces a random noise peak at t is:
$$p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases}$$
69
Probabilistic Score
• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
$$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$$
70
Peak Score
• For a position t that represents ion type δj:
$$p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases}$$
71
Peak Score (cont.)
• For a position t that is not associated with an ion type:
$$p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases}$$
• $q_R$ = the probability of a noisy peak that does not correspond to any ion type
72