
Mass Spectrometry-based Proteomics
Xuehua Shen
(Adapted from slides with textbook)
1
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence
tags)
2
Motivation
• Proteins are the working units of the cell
– The number of genes found is much smaller than the
number of expressed proteins
– Proteins are directly related to cell processes and diseases
DNA (~30,000 human genes; SNP)
-> mRNA, via alternative splicing (>100,000 RNA messages)
-> Protein, via post-translational modification (>1,000,000 distinct protein forms)
3
Tools for Proteomics
• Edman degradation reaction
• NMR (Nuclear Magnetic Resonance)
• X-ray crystallography
• Protein array
• Mass Spectrometry
4
Mass Spectrometry-based Proteomics
• Primary sequence (sequencing, identification)
• Post-translational modification (PTM)
(characterization)
• Quantitative proteomics (quantification)
• Protein-protein interaction
5
6
Components of Mass Spectrometer
• Ion source (ESI and MALDI)
• Mass analyzer (ion traps, TOF, Quadrupole, FT,
etc.)
– Mass-to-charge ratio (m/z)
• Ion detector
7
Peptide and Intact Protein
• Peptide: a fragment of protein
• Some enzymes, e.g. trypsin, break protein into
peptides.
• Some technologies introduce the intact protein into the mass
spectrometer
8
Peptide Fragmentation
Collision Induced Dissociation
[Figure: protonated (H+) peptide backbone H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-...OH, with the N-terminus on the left and the C-terminus on the right]
Peptides tend to fragment along the backbone.
Fragments can also lose neutral chemical groups
like NH3 and H2O.
9
Ideal Mass Spectrum
10
Real Mass Spectrum
11
N- and C-terminal Peptides
12
Terminal peptides and ion types
Peptide, mass (Da): 57 + 97 + 147 + 114 = 415
Peptide without H2O, mass (Da): 57 + 97 + 147 + 114 – 18 = 397
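The arithmetic on this slide can be checked with a minimal sketch. It assumes nominal (integer) residue masses and that the four summands are the residues G, P, F and N; that residue assignment, and the names `RESIDUE_MASS` and `fragment_mass`, are illustrative assumptions rather than something stated on the slide.

```python
# Nominal (integer) residue masses in Da. Assumption: the slide's summands
# 57, 97, 147, 114 stand for the residues G, P, F, N.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114}
WATER = 18  # nominal mass of H2O


def fragment_mass(residues, lose_water=False):
    """Sum the residue masses, optionally subtracting one water."""
    total = sum(RESIDUE_MASS[r] for r in residues)
    return total - WATER if lose_water else total


print(fragment_mass("GPFN"))                   # 415
print(fragment_mass("GPFN", lose_water=True))  # 397
```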
13
N- and C-terminal Peptides
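[Figure: a peptide of total mass 486 and its terminal fragments. The fragment masses come in complementary pairs that sum to 486: 57 + 429, 154 + 332, 301 + 185, 415 + 71.]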
14
N- and C-terminal Peptides
Fragment masses: 57, 71, 154, 185, 301, 332, 415, 429 (parent mass 486)
Problem: Reconstruct the peptide from the set of
masses of its fragments.
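For an ideal spectrum like this one, the reconstruction problem can be sketched as a brute-force search. This is only an illustration under stated assumptions: nominal integer masses, a complete set of prefix and suffix masses, and no noise. The names `MASS`, `fragment_masses` and `reconstruct` are hypothetical.

```python
# Nominal residue masses (Da). L stands for I/L and K for K/Q, which share
# a nominal mass and cannot be told apart in this sketch.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
        "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def fragment_masses(peptide):
    """All proper prefix and suffix masses of a peptide."""
    m = [MASS[a] for a in peptide]
    prefixes = {sum(m[:i]) for i in range(1, len(m))}
    suffixes = {sum(m[i:]) for i in range(1, len(m))}
    return prefixes | suffixes


def reconstruct(masses, parent):
    """Peptides whose prefix/suffix masses are exactly `masses`."""
    masses = set(masses)
    found = []

    def extend(peptide, total):
        if total == parent:
            if fragment_masses(peptide) == masses:
                found.append(peptide)
            return
        for aa, m in MASS.items():
            nxt = total + m
            # Only step onto masses that were actually observed.
            if nxt == parent or nxt in masses:
                extend(peptide + aa, nxt)

    extend("", 0)
    return sorted(found)


print(reconstruct([57, 71, 154, 185, 301, 332, 415, 429], 486))
# ['ANFPG', 'GPFNA']
```

The two answers are reversals of each other: without knowing which masses are prefixes and which are suffixes, the ideal spectrum cannot tell a peptide from its mirror image.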
17
Mass Spectra
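[Figure: mass spectrum of the peptide GVDLK, horizontal axis: mass starting at 0. Peaks of the prefix and suffix ladders are labeled with residues (e.g., 57 Da = 'G', 99 Da = 'V'); an H2O loss is also marked.]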
• The peaks in the mass spectrum:
– Prefix and Suffix Fragments.
– Fragments with neutral losses (-H2O, -NH3)
– Noise and missing peaks.
18
Protein Identification with MS/MS
[Figure: MS/MS spectrum (Intensity vs. mass) used for peptide identification; the peaks spell out the peptide GVDLK]
19
Protein Identification by Tandem Mass
Spectrometry
MS/MS instrument -> MS/MS spectrum -> Sequence
[Figure: MS/MS spectrum, Relative Abundance vs. m/z (200 to 2000). Scan header: S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6, T: + c d Full ms2 638.00 [ 165.00 - 1925.00]]
Two routes from spectrum to sequence:
• De Novo interpretation (Sherenga)
• Database search (Sequest)
20
De Novo vs. Database Search
[Figure: one MS/MS spectrum (S#: 1708, Relative Abundance vs. m/z) interpreted in two ways.
De Novo: candidate residues are read directly from the peaks (Mass, Score) and assembled into AVGELTK.
Database Search: the spectrum is compared against a database of known peptides (MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..) and the best match is AVGELTK.]
21
Current Status
• Protein sequencing is still an open problem,
whether de novo sequencing or database
search methods are used
• The following algorithms only deal with simplified
(or ideal) spectra
• Some algorithms combine de novo
sequencing and database search
22
Pros and Cons of de novo Sequencing
• Advantage:
– Gets the sequences that are not necessarily in the database.
– An additional similarity search step using these sequences
may identify the related proteins in the database.
• Disadvantage:
– Requires higher quality data.
– Often contains errors.
23
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing
• Database search
• Algorithms of real software (e.g., sequence
tags)
24
De novo Peptide Sequencing
[Figure: MS/MS spectrum S#: 1708 (Relative Abundance vs. m/z, 200 to 2000) -> Sequence]
25
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– Δ: set of possible ion types
– m: parent mass
Output:
– P: peptide with mass m whose theoretical
spectrum best matches the experimental
spectrum S
26
Procedure of De Novo Sequencing
• Build spectrum graph
– How to create vertices (from masses)
– How to create edges (from mass differences)
• Find best path or rank paths of spectrum graph
– How to find candidate paths
– How to score paths
27
From Sequence to Spectrum
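[Figure: b ions of the peptide SEQUENCE laid out along the Mass/Charge (M/Z) axis]
28
From Sequence to Spectrum (cont.)
[Figure: a ions of SEQUENCE laid out along the Mass/Charge (M/Z) axis]
29
From Sequence to Spectrum (cont.)
The a ion type is a fixed mass shift of the b ion type
[Figure: a and b ions of SEQUENCE along the Mass/Charge (M/Z) axis]
30
From Sequence to Spectrum (cont.)
[Figure: y ions of SEQUENCE, read from the C-terminus, along the Mass/Charge (M/Z) axis]
31
From Sequence to Spectrum (cont.)
[Figure: Intensity vs. Mass/Charge (M/Z)]
32
From Sequence to Spectrum (cont.)
[Figure: Intensity vs. Mass/Charge (M/Z)]
33
From Sequence to Spectrum (cont.)
[Figure: spectrum with noise peaks added, along the Mass/Charge (M/Z) axis]
34
MS/MS Spectrum
[Figure: Intensity vs. Mass/Charge (M/Z)]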
35
Some Mass Differences between Peaks
Correspond to Amino Acids
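[Figure: spectrum in which the gaps between neighboring peaks are labeled with the letters of SEQUENCE, once along each of the two overlapping ladders]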
36
Now decoding from spectrum
to sequence…?
Build spectrum graph
37
Peptide Fragmentation
• Different ion types (b, y, b-NH3, b-H2O)
• Fragmentation at one backbone site (fragmentation at two sites gives internal ions)
[Figure: four-residue peptide backbone H2N-CH(R1)-CO-NH-CH(R2)-CO-NH-CH(R3)-CO-NH-CH(R4)-COOH. N-terminal fragment ions marked above the backbone: a2, b2, b2-H2O, a3, b3, b3-NH3. C-terminal fragment ions marked below: y1, y2, y2-NH3, y3, y3-H2O.]
38
Example of Ion Type
• Δ = {δ1, δ2, …, δk}
• Ion types {b, b-NH3, b-H2O} correspond to Δ = {0, 17, 18}
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
39
Why Is Peptide Sequencing Hard?
• Two ladders of overlapping masses: one cannot
tell whether a peak is a b ion or a y ion
• Incomplete fragmentation
• Chemical noise
• Mass accuracy of the instrument is not good
enough to separate near-identical masses
(Q ≈ K; G+V = 156.090 vs. R = 156.101)
• Q: Is sequencing a shorter or a longer peptide
harder?
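The mass-accuracy point can be made concrete with a small sketch. The residue masses are standard monoisotopic values; the two tolerance values and the helper `distinguishable` are hypothetical choices for illustration, not properties of any particular instrument.

```python
# Monoisotopic residue masses in Da (standard values, 5 decimals).
MONO = {"G": 57.02146, "V": 99.06841, "R": 156.10111,
        "Q": 128.05858, "K": 128.09496}


def distinguishable(m1, m2, tolerance_da):
    """True if two masses differ by more than the mass tolerance."""
    return abs(m1 - m2) > tolerance_da


gv = MONO["G"] + MONO["V"]  # about 156.090, close to R at 156.101

# Hypothetical tolerances: a coarse one and a fine one.
for tol in (0.5, 0.005):
    print(tol,
          distinguishable(MONO["Q"], MONO["K"], tol),
          distinguishable(gv, MONO["R"], tol))
```

With the coarse tolerance neither pair can be separated; with the fine one both can.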
40
Vertices of Spectrum Graph
• Vertices are generated by reverse shifts corresponding to
ion types Δ = {δ1, δ2, …, δk}
• Every mass s in an MS/MS spectrum generates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
corresponding to potential N-terminal peptides
• Vertices of the spectrum graph:
{initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
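A minimal sketch of the vertex construction, assuming integer masses and the simplified Δ = {0, 17, 18} for {b, b-NH3, b-H2O}. The function name, the toy spectrum and the parent mass are made up for illustration.

```python
def spectrum_graph_vertices(spectrum, deltas, parent_mass):
    """Initial vertex, V(s) for every peak s, and terminal vertex."""
    vertices = {0, parent_mass}  # initial and terminal vertex
    for s in spectrum:
        for d in deltas:
            vertices.add(s + d)  # reverse shift for one ion type
    return sorted(vertices)


# Toy spectrum (hypothetical): 137 is 154 after losing NH3 (17),
# so both peaks vote for the same vertex 154.
print(spectrum_graph_vertices([57, 137, 154], (0, 17, 18), 301))
# [0, 57, 74, 75, 137, 154, 155, 171, 172, 301]
```

Every peak yields k candidate vertices, and most of them are wrong guesses about the ion type; the path search and scoring are what sort that out.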
41
Reverse Shifts
Shift in H2O
Shift in H2O+NH3
42
Edges of Spectrum Graph
• Two vertices with mass difference corresponding to
an amino acid A:
– Connect with an edge labeled by A (Directed Graph)
• Gap edges for di- and tri-peptides
– Potential sequence tag method (covered later)
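The edge rule can be sketched as follows, assuming nominal integer masses and exact matching (a real tool would allow a mass tolerance). Gap edges for di- and tri-peptides are left out. The names `MASS` and `spectrum_graph_edges` are illustrative.

```python
# Nominal residue masses (Da); L stands for I/L and K for K/Q.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
        "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186}


def spectrum_graph_edges(vertices):
    """Directed edges (u, v, residue) with v - u equal to a residue mass."""
    by_mass = {m: aa for aa, m in MASS.items()}
    vs = sorted(vertices)
    edges = []
    for i, u in enumerate(vs):
        for v in vs[i + 1:]:
            aa = by_mass.get(v - u)
            if aa is not None:
                edges.append((u, v, aa))
    return edges


print(spectrum_graph_edges([0, 57, 154, 301, 415, 486]))
# [(0, 57, 'G'), (57, 154, 'P'), (154, 301, 'F'), (301, 415, 'N'), (415, 486, 'A')]
```

Because edges always point from lower to higher mass, the graph is acyclic.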
43
Best Path of Spectrum Graph
• How to find candidate paths
• There are many paths, how to find the correct
one?
• We need scoring to evaluate paths
44
Find Candidate Paths
• Heuristics: find a path with maximum number
of edges
• Longest path problem in DAG
• DFS (Depth First Search)
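One way to sketch the maximum-edge heuristic: since every edge goes from a lower to a higher mass, processing edges in sorted order is a topological order, and the longest path follows from dynamic programming. This sketch uses that dynamic program rather than the DFS named on the slide; `longest_path` and the toy edge list are hypothetical.

```python
def longest_path(edges, source, sink):
    """(edge count, label string) of a max-edge path, or None if unreachable.

    Edges are (u, v, label) with u < v, so sorting by u visits every
    vertex only after all edges entering it have been handled.
    """
    best = {source: (0, "")}
    for u, v, label in sorted(edges):
        if u in best:
            cand = (best[u][0] + 1, best[u][1] + label)
            if v not in best or cand[0] > best[v][0]:
                best[v] = cand
    return best.get(sink)


# Toy graph: 0 -> 128 directly as K, or in two steps as G then A.
edges = [(0, 57, "G"), (57, 128, "A"), (0, 128, "K"), (128, 227, "V")]
print(longest_path(edges, 0, 227))  # (3, 'GAV')
```

The heuristic prefers GAV over KV simply because it explains more peaks; it says nothing about which reading is chemically more likely, which is why a path score is needed.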
45
Path Score
• p(P,S) = probability that peptide P produces
spectrum S= {s1,s2,…sq}
• Scoring = computing probabilities
46
Finding Optimal Paths in the
Spectrum Graph
• For a given MS/MS spectrum S, find a
peptide P’ maximizing p(P,S) over all possible
peptides P:
$p(P', S) = \max_{P} p(P, S)$
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
• Some software ranks paths instead of reporting only the best one
47
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and
penalties for missing ions.
• Example: for k=4, assume that for a partial
peptide P’ we only see ions δ1,δ2,δ4.
The score is calculated as:
$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$$
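A minimal sketch of this ratio test. It assumes that q_i is the probability that ion type δi is observed for a correct partial peptide and that q_R is the probability of a random peak at a given position; the transcript does not define these symbols, so that reading, the function name and the numeric values below are all assumptions.

```python
from math import prod


def ratio_score(q, q_random, observed):
    """Product of premiums (ion seen) and penalties (ion missing)."""
    factors = []
    for q_i, seen in zip(q, observed):
        if seen:
            factors.append(q_i / q_random)              # premium
        else:
            factors.append((1 - q_i) / (1 - q_random))  # penalty
    return prod(factors)


# Hypothetical probabilities for k = 4 ion types; ion 3 is missing.
q = (0.8, 0.4, 0.6, 0.2)
q_R = 0.1
print(ratio_score(q, q_R, (True, True, False, True)))
```

A factor above 1 means the observation is more likely under the peptide hypothesis than under random noise; a factor below 1 means the opposite.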
48
Why Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm                             Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)   0.566                 0.189
SHERENGA (Dancik et al., 1999)        0.690                 0.289
Peaks (Ma et al., 2003)               0.673                 0.246
PepNovo (Frank and Pevzner, 2005)     0.727                 0.296
• Less than 30% of the peptides sequenced were
completely correct!
49
De Novo vs. Database Search
[Figure: the MS/MS spectrum S#: 1708 (Relative Abundance vs. m/z) interpreted by De Novo and by Database Search against a database of known peptides; both routes arrive at AVGELTK]
50
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: Mass Spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence
tags)
51
Peptide Identification Problem
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
Output:
– A peptide of mass m from the database whose
theoretical spectrum best matches the
experimental spectrum S
52
Match between Spectra and the Shared
Peak Count
• The match between two spectra is the number of masses
(peaks) they share (Shared Peak Count or SPC)
• In practice mass-spectrometrists use the weighted SPC
that reflects intensities of the peaks
• Match between experimental and theoretical spectra is
defined similarly
53
MS/MS Database Search
Database search in mass-spectrometry has been successful in
identification of already known proteins.
Experimental spectrum can be compared with theoretical spectra of
database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But designing reliable algorithms for identification of peptides is a much more
difficult problem.
Q: Why might a peptide not be identical to any sequence in the database?
54
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure of
spectral similarity.
Problem: SPC diminishes very quickly as the
number of mutations increases.
Only a small portion of correlations between the
spectra of mutated peptides is captured by
SPC.
55
SPC Diminishes Quickly
S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}   no mutations, SPC=10
S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}   1 mutation, SPC=5
S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}   2 mutations, SPC=2
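The three spectra and their shared peak counts can be reproduced with a short sketch. It assumes nominal residue masses, b ions at prefix mass + 1 and y ions at suffix mass + 19; those offsets are inferred from the numbers on the slide rather than stated there. The function names are illustrative.

```python
# Nominal residue masses (Da) for the residues used on this slide.
MASS = {"P": 97, "R": 156, "T": 101, "E": 129,
        "I": 113, "Y": 163, "N": 114, "G": 57}


def theoretical_spectrum(peptide):
    """b ions (prefix mass + 1) and y ions (suffix mass + 19)."""
    m = [MASS[a] for a in peptide]
    b = {sum(m[:i]) + 1 for i in range(1, len(m))}
    y = {sum(m[i:]) + 19 for i in range(1, len(m))}
    return b | y


def spc(p1, p2):
    """Shared peak count of two theoretical spectra."""
    return len(theoretical_spectrum(p1) & theoretical_spectrum(p2))


print(sorted(theoretical_spectrum("PRTEIN")))
print(spc("PRTEIN", "PRTEIN"),   # 10
      spc("PRTEIN", "PRTEYN"),   # 5
      spc("PRTEIN", "PGTEYN"))   # 2
```

A single substitution shifts every prefix mass after it and every suffix mass before it, which is why half of the peaks stop matching after one mutation.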
56
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Almost all protein sequences are post-translationally modified, and 200 types of
modifications of amino acid residues are known.
57
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
58
Search for Modified Peptides: Virtual
Database Approach
Yates et al.,1995: an exhaustive search in a virtual
database of all modified peptides.
Exhaustive search leads to a large combinatorial
problem, even for a small set of modification
types.
Problem (Yates et al.,1995). Extend the virtual
database approach to a large set of
modifications.
59
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
– Parameter k (# of mutations/modifications)
Output:
– A peptide of mass m that is at most k
mutations/modifications apart from a database peptide
and whose theoretical spectrum best matches the
experimental spectrum S
60
Spectrum Alignment
• See 8.14 and 8.15 in the textbook for one
algorithm
• Complicated for real spectra
61
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: Mass Spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence
tags)
62
Combining de novo and Database Search in
Mass-Spectrometry
• So far de novo and database search were presented as
two separate techniques
• Database search is rather slow: many labs generate more
than 100,000 spectra per day. SEQUEST takes
approximately 1 minute to compare a single spectrum
against SWISS-PROT (54Mb) on a desktop.
• It will take SEQUEST more than 2 months to analyze the
MS/MS data produced in a single day.
• Q: Can slow database search be combined with fast de novo
analysis?
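The "more than 2 months" figure follows from the two numbers quoted on the slide, as this back-of-the-envelope sketch shows. It assumes a single desktop running around the clock with no parallelism.

```python
spectra_per_day = 100_000     # spectra generated by a lab in one day
minutes_per_spectrum = 1      # approximate SEQUEST time per spectrum

total_minutes = spectra_per_day * minutes_per_spectrum
total_days = total_minutes / (60 * 24)

print(round(total_days, 1))   # 69.4 days, i.e. more than 2 months
```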
63
What Can be Done with De Novo?
• Given an MS/MS spectrum:
– Can de novo predict the entire peptide sequence? No!
(accuracy is less than 30%).
– Can de novo predict a set of partial sequences that, with
high probability, contains at least one correct tag
(a covering set of tags)? Yes!
Peptide Sequence Tags
• A Peptide Sequence Tag is a short substring of
a peptide.
Example: GVDLK
Tags: GVD, VDL, DLK
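The example can be reproduced by listing all substrings of a fixed length. This only enumerates the tags of a known peptide; deriving tags from a spectrum is a separate problem. The function name and the default length of 3 are illustrative choices.

```python
def sequence_tags(peptide, length=3):
    """All contiguous substrings of the given length, left to right."""
    return [peptide[i:i + length]
            for i in range(len(peptide) - length + 1)]


print(sequence_tags("GVDLK"))  # ['GVD', 'VDL', 'DLK']
```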
65
Filtration with Peptide Sequence Tags
• Peptide sequence tags can be used as filters in
database searches.
• The Filtration: Consider only database peptides that
contain the tag (in its correct relative mass location).
• First suggested by Mann and Wilm (1994).
• Similar concepts also used by:
– GutenTag - Tabb et al. 2003.
– MultiTag - Sunyaev et al. 2003.
– OpenSea - Searle et al. 2004.
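A minimal sketch of tag-based filtration, assuming nominal integer masses and that the tag's "mass location" is the total mass of the residues before it. This is not how any of the named tools implements filtering; `filter_by_tag`, its parameters and the toy database are hypothetical.

```python
# Nominal residue masses (Da) for the 20 standard amino acids.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
        "L": 113, "I": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
        "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163,
        "W": 186}


def filter_by_tag(database, tag, prefix_mass, tol=0):
    """Keep peptides that contain `tag` at the expected prefix mass."""
    hits = []
    for pep in database:
        start = pep.find(tag)
        while start != -1:
            offset = sum(MASS[a] for a in pep[:start])
            if abs(offset - prefix_mass) <= tol:
                hits.append(pep)
                break
            start = pep.find(tag, start + 1)
    return hits


# Tag GEL observed after a prefix of mass 170 (A + V = 71 + 99).
db = ["AVGELTK", "GELAVTK", "ALKIIMNVRT"]
print(filter_by_tag(db, "GEL", 170))  # ['AVGELTK']
```

The second peptide contains the tag but at the wrong mass offset, so it is rejected without any spectrum scoring.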
66
Why Filter Database Candidates?
• Effective filtration can greatly speed up the process, enabling
expensive searches involving post-translational
modifications.
• Goal: generate a small set of covering tags and use them to
filter the database peptides.
67
Summary
• Protein sequencing
• Mass spectrum
• De novo search and database search
• Difficulty of protein sequencing
68
The End
69
Quality Measure of Mass Spectrometer
• Sensitivity
• Mass accuracy
• Resolution
• Dynamic range
70
Exhaustive Search for Modified Peptides
• For each peptide, generate all modifications.
• Score each modification.
Example: YFDSTDYNMAK (Oxidation? Phosphorylation?)
• 2^5 = 32 possibilities, with 2 types
of modifications!
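The count of 32 can be reproduced by enumerating every on/off combination of the modifiable sites. The site rules here (phosphorylation on S, T, Y and oxidation on M) are an assumption chosen because they give five sites in YFDSTDYNMAK; the slide does not spell them out.

```python
from itertools import product

# Assumed site rules for this sketch only.
MODIFIABLE = {"S": "phospho", "T": "phospho", "Y": "phospho",
              "M": "oxidation"}


def modified_forms(peptide):
    """Every subset of modifiable positions, as tuples of indices."""
    sites = [i for i, aa in enumerate(peptide) if aa in MODIFIABLE]
    forms = []
    for mask in product([False, True], repeat=len(sites)):
        forms.append(tuple(s for s, on in zip(sites, mask) if on))
    return forms


forms = modified_forms("YFDSTDYNMAK")
print(len(forms))  # 32
```

Each additional modifiable site doubles the number of candidates to score, which is the combinatorial growth the slide warns about.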
71
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
72
Why Filtration?
[Figure: two sequence alignment pipelines over the same database. Smith-Waterman algorithm: Protein Query + Database -> Scoring -> Sequence matches. BLAST: Protein Query + Database -> Filtration -> Scoring -> Sequence matches.]
• BLAST filters out very few correct
matches and is almost as accurate as
the Smith-Waterman algorithm.
73
Filtration and MS/MS
[Figure: peptide sequencing pipeline (SEQUEST / Mascot): MS/MS spectrum + Database -> Filtration -> Scoring -> Sequence matches]
74
Filtration in MS/MS Sequencing
• Filtration in MS/MS is more difficult than in BLAST.
• Early approaches using Peptide Sequence Tags were not
able to substitute for the complete database search.
• Current filtration approaches are mostly used to generate
additional identifications rather than replace the database
search.
• Can we design a filtration-based search that can replace
the database search, and is orders of magnitude faster?
75
Asking the Old Question Again: Why
Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm                             Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)   0.566                 0.189
SHERENGA (Dancik et al., 1999)        0.690                 0.289
Peaks (Ma et al., 2003)               0.673                 0.246
PepNovo (Frank and Pevzner, 2005)     0.727                 0.296
76