An Algorithmic Approach to Peptide Sequencing via Tandem

Download Report

Transcript An Algorithmic Approach to Peptide Sequencing via Tandem

An Algorithmic Approach to
Peptide Sequencing
via Tandem Mass Spectrometry
Ming-Yang Kao
Department of Computer Science
Northwestern University
Evanston, Illinois
U. S. A.
1
Collaborators of This Project
University of Southern California
•
Ting Chen
Harvard Medical School
•
•
•
George M. Church
John Rush
Matthew Tepel
2
Perspectives
A key goal of bioinformatics: To study biological systems
based on global knowledge of genomes, transcriptomes,
and proteomes.
• Genome: entire sets of materials in the chromosomes.
• Transcriptome: entire sets of gene transcripts.
• Proteome: entire sets of proteins.
Genome (DNA)  Transcriptome (RNA)  Proteome (Protein)
3
Perspectives
A key goal of bioinformatics: To study biological systems
based on global knowledge of genomes, transcriptomes,
and proteomes.
• Genome: entire sets of materials in the chromosomes.
• Transcriptome: entire sets of gene transcripts.
• Proteome: entire sets of proteins.
this talk’s focus
Genome (DNA)  Transcriptome (RNA)  Proteome (Protein)
4
Proteomics
• Proteome: all proteins encoded within a genome
– half millions distinct proteins (temporal, spatial, modifications)
– ~30,000 human genes
– mRNA and protein expressions may not correlate
• Proteomics: study of protein expression by biological systems
– relative abundance and stability; post-translational modifications
– fluctuations as a response to environment and altered cellular needs
– correlations between protein expression and disease state
– protein-protein interactions, protein complexes
• Technologies:
– 2D gel electrophoresis
– mass spectrometry
this talk’s focus
– yeast two-hybrid system
– protein chips
5
A Key Step of Proteomics
• How to sequence proteins?
• How to sequence protein peptides? (this talk’s focus)
6
Outline of This Talk
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
7
Outline of This Talk (1)
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
8
Protein Identification: HPLC-MS-MS
Mass
Spectrometer
HPLC
Proteins
Peptides
Mass
Spectrometer
Fragmentation
& Ionization
B-ions / Y-ions
One Peptide
Mass/Charge
De Novo Peptide Sequencing
Protein Database Search
Mass/Charge
Tandem Mass Spectrum
9
Protein Identification: HPLC-MS-MS
Mass
Spectrometer
HPLC
Proteins
Peptides
Mass
Spectrometer
Fragmentation
& Ionization
B-ions / Y-ions
One Peptide
Mass/Charge
De Novo Peptide Sequencing
Protein Database Search
Mass/Charge
Tandem Mass Spectrum
10
Peptide Fragmentation and Ionization
B-ion
Y-ion
Complementary: Mass(B-ion)+Mass(Y-ion) = Mass(peptide)+4H+O
11
B-ions and Y-ions
Fragmentation
Peptide (  R1  R2  R3   )
All y  ions
All b  ions
b1
b2
b3

(  R1 )
(  R1  R2 ) 
(  R1  R2  R3 ) 

y1
( R3   )
y2
( R2  R3   ) 

y3 ( R1  R2  R3   )
  1,   19
12
Tandem Mass Spectrum
100
175.113
50
361.121
88.033
274.112
200
448.225
430.213
400
Mass / Charge
13
Raw Tandem Mass Spectrum
14
Prediction from Raw Tandem Mass Spectrum
15
Protein Database Search
Find the peptide sequences in a protein database that
optimally fit the spectrum.
• It does not work if the target peptide sequence is not in
the database.
• It does not work if there is an unknown modification at
some amino acid.
• It is very slow because it must search the entire database.
• E.g., SEQUEST, Yates, Univ. of Washington.
16
De Novo Peptide Sequencing Problem
• Input:
(1) the mass W of an unknown target peptide, and
(2) a set S of the masses of some or all b-ions and y-ions
of the peptide.
• Output: a peptide P such that
(1) mass(P)=W and
(2) S is a subset of all the ion masses of P.
Peptide Mass 429.212 Daltons
100
361.121
274.112
50
P = SWR,
Mass(P) = 429.212,
Ions(P) =
{88.033, 175.113, 274.112,
361.121, 430.213, 448.225}
Mass / Charge
17
Tandem Mass Spectrum
100
175.113
50
361.121
88.033
274.112
200
448.225
430.213
400
Mass / Charge
Peptide Mass 429.212 Daltons
18
Amino Acid Mass Table
A
C
D
E
F
G
H
I
K
L
71.08
103.14
115.09
129.12
147.18
57.05
137.14
113.16
128.17
113.16
M
N
P
Q
R
S
T
V
W
Y
131.19
114.1
97.12
128.13
156.19
87.08
101.11
99.13
186.21
163.18
19
Feature 1
All B-ions form a forward mass ladder.
100
175.113
361.121
b1
b2
88.033
50
S
1
b3
274.112
W
448.225
430.213
R
200
400
Mass / Charge
Peptide Mass 429.212 Daltons
20
Feature 2
All Y-ions form a reverse mass ladder.
100
y1
y2
y3
175.113
361.121
448.225
R
88.033
50
S
19
S
W
274.112
W
430.213
R
200
400
Mass / Charge
Peptide Mass 429.212 Daltons
21
Basic Difficulty #1
It is unknown whether an ion is a B-ion or an Y-ion.
100
175.113
50
361.121
88.033
274.112
200
448.225
430.213
400
Mass / Charge
Peptide Mass 429.212 Daltons
22
Basic Difficulty #2
There are missing ions.
100
361.121
274.112
50
Ion 2
Ion 1
200
400
Mass / Charge
Peptide Mass 429.212 Daltons
23
Feature 3 (to our Rescue)
Complementary Ion Pairs: b1/y2 and b2/y1
100
y1
y2
y3
175.113
361.121
448.225
R
50
88.033
S
S
W
b1
W
274.112
b2
430.213
b3
R
200
400
Mass / Charge
Peptide Mass 429.212 Daltons
24
Outline of This Talk (2)
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
25
Formulating the Computational Problem
1. T = an alphabet of 20 characters a1,a2,…,a20.
2. two special characters: alpha and beta.
3. the mass of alpha = 1,
the mass of beta = 19,
the mass of ai is mi.
4. A peptide sequence is x1,x2,x3,…,xn-1,xn, where
each xi is from T.
5. A b-ion is x0,x1,x2,…,xi for some 1 <= i <= n,
where x0 = alpha.
6. A y-ion is xi,…,xn-2,xn-1,xn,xn+1 for some 1 <= i <=
n, where xn+1 = beta.
26
De Novo Peptide Sequencing Problem
• Input:
(1) the mass W of an unknown target peptide, and
(2) a set S of the masses of some or all b-ions and y-ions
of the peptide.
• Output: a peptide P such that
(1) mass(P)=W and
(2) S is a subset of all the ion masses of P.
Peptide Mass 429.212 Daltons
100
361.121
274.112
50
P = SWR,
Mass(P) = 429.212,
Ions(P) =
{88.033, 175.113, 274.112,
361.121, 430.213, 448.225}
Mass / Charge
27
Amino Acid Mass Table
A
C
D
E
F
G
H
I
K
L
71.08
103.14
115.09
129.12
147.18
57.05
137.14
113.16
128.17
113.16
M
N
P
Q
R
S
T
V
W
Y
131.19
114.1
97.12
128.13
156.19
87.08
101.11
99.13
186.21
163.18
28
Outline of This Talk (3)
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
29
Basic Computing Scheme
peptide mass W
tandem mass spectrum S
NC-spectrum graph
Find feasible paths to order the masses in
S to identify all the b-ions and y-ions
consistent with S.
Convert feasible paths into legal peptide sequences
30
NC-Spectrum Graph: Nodes (1)
N0
C0
0
429.22
mass of this peptide
31
NC-Spectrum Graph: Nodes (2)
Assumption 1:
If Ion 1 is an y-ion
C1: a b-ion node
N0
Ion # 1 (274.11)
C1
174.11
0
Assumption 2:
If Ion 1 is a b-ion
N1: a b-ion node
N1
C0
273.11
429.22
mass of this peptide
mass(
) + mass(
) = mass(P) + 18
32
NC-Spectrum Graph: Nodes (3)
Ion # 2 (88.10)
N0
N2
C1
0
87.10
mass(
174.11
) + mass(
N1
273.11
C2
C0
360.12
429.22
) = mass(P) + 18
33
NC-Spectrum Graph: Edges (1)
Mass(S) = 87.08.
S
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
34
NC-Spectrum Graph: Edges (2)
Mass(S) = 87.08.
Mass(W) = 186.21
W
S
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
35
NC-Spectrum Graph: Edges (3)
Mass(S) = 87.08.
Mass(W) = 186.21
W
S
N0
N2
C1
0
87.10
174.11
N1
273.11
C2
C0
360.12
429.22
S+W
Mass(S+W) = 273.29
36
NC-Spectrum Graph: Edges (4)
Mass(S) = 87.08.
Mass(W) = 186.21
Mass(R) = 156.19
W
R
S
N0
N2
C1
0
87.10
174.11
N1
273.11
C2
C0
360.12
429.22
S+W
Mass(S+W) = 273.29
37
NC-Spectrum Graph
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
38
NC-Spectrum Graph: Paths = Sequences
W
R
S
N0
N2
0
87.10
C1
N1
174.11
273.11
C2
C0
360.12
429.22
b-ions
39
NC-Spectrum Graph: A Feasible Path (1)
b-ions
a feasible path
S
N0
N2
0
87.10
W
C1
174.11
R
N1
273.11
C2
C0
360.12
429.22
Definition: A feasible path is a path from N0 to C0 that goes
through exactly one node for each pair (either Nj or Cj).
40
NC-Spectrum Graph: A Feasible Path (2)
y-ions
b-ions
a feasible path
S
N0
N2
0
87.10
S
C1
174.11
N1
273.11
C2
C0
360.12
429.22
GVV
Definition: A feasible path is a path from N0 to C0 that goes
through exactly one node for each pair (either Nj or Cj).
41
NC-Spectrum Graph: Not A Feasible Path (1)
not a feasible path:
(1) miss ion #2
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
Definition: A feasible path is a path from N0 to C0 that goes
through exactly one node for each pair (either Nj or Cj).
42
NC-Spectrum Graph: Not A Feasible Path (2)
not a feasible path:
(2) repeat ion #1
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
Definition: A feasible path is a path from N0 to C0 that goes
through exactly one node for each pair (either Nj or Cj).
43
NC-Spectrum Graph: Not A Feasible Path (3)
not a feasible path:
(1) miss ion #2
(2) repeat ion #1
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
Definition: A feasible path is a path from N0 to C0 that goes
through exactly one node for each pair (either Nj or Cj).
44
Reformulating the De Novo Peptide Sequencing Problem
Input: an NC-spectrum graph G.
Output: a feasible path from N0 to C0.
45
Observations
• A longest path does not always go through exactly
one of each pair of nodes.
• It is an NP-hard problem if the spectrum graph is a
general directed graph.
46
Basic Algorithm
Input: a peptide mass W and a tandem mass spectrum S.
Output: a feasible peptide sequence.
Steps:
1. Compute the nodes of the NC-spectrum graph G.
2. Compute the edges of G.
3. Compute a feasible path P in G.
4. Convert P into a feasible sequence.
47
Basic Algorithm (1)
Input: a peptide mass W and a tandem mass spectrum S.
Output: a feasible peptide sequence.
Steps:
1. Compute the nodes of the NC-spectrum graph G.
2. Compute the edges of G.
3. Compute a feasible path P in G.
4. Convert P into a feasible sequence.
48
Compute the Nodes of the NC-Spectrum Graph
Step 1. Compute the nodes and place them in the increasing order of
masses.
N0
N2
0
87.10
C1
174.11
N1
273.11
C2
C0
360.12
429.22
Step 2. Rename the nodes from left to right as X0,…, Xk,Yk,…,Y0
X0
X1
0
87.10
X2
174.11
Y2
273.11
Y1
Y0
360.12
429.22
Observation: Xi and Yi form a complementary pair of nodes Ni and Ci
for ion i.
Running Time: O(k), where k = # of masses in the spectrum.
49
Basic Algorithm (2)
Input: a peptide mass W and a tandem mass spectrum S.
Output: a feasible peptide sequence.
Steps:
1. Compute the nodes of the NC-spectrum graph G.
2. Compute the edges of G.
inverse of each other
3. Compute a feasible path P in G.
4. Convert P into a feasible sequence.
50
Compute the Edges of the NC-Spectrum Graph
X0
X1
0
87.10
X2
174.11
Y2
273.11
Y1
Y0
360.12
429.22
Basic Question: Given a mass u, is there a protein sequence with that
mass?
Solution: dynamic programming via a Boolean array E( ).
1. precision = 0.01.
2. Boolean array length L = peptide mass W / precision.
3. Boolean array E(u/0.01) = 1 if u is the mass of a peptide; otherwise 0.
4. dynamic programming E(j) = 1 if only E(j – mi) =1 for some amino
acid mass mi.
5. Running Time:
(1) Computing E() takes O(L) time; or O(L/log L) via 4-Russian
preprocessing.
(2) Computing the edges takes O(k^2) time.
51
Basic Algorithm (3)
Input: a peptide mass W and a tandem mass spectrum S.
Output: a feasible peptide sequence.
Steps:
1. Compute the nodes of the NC-spectrum graph G.
2. Compute the edges of G.
3. Compute a feasible path P in G.
4. Convert P into a feasible sequence.
52
X0
X1
0
87.10
X0
X1
0
87.10
Compute a Feasible Path (1)
X2
174.11
Y2
273.11
Y1
Y0
360.12
429.22
Y1
Y0
360.12
429.22
Recursion: Use the feasible paths of X0,…, Xi,Yj,…,Y0 to compute
the feasible paths of X0,…, Xi, Xi+1,Yj+1,Yj,…,Y0.
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X0 to
Xi and a path PR from Yj to Y0 such that PL and PR together contain
exactly one of Xq and Yq for each q = 0, …, max{i,j}.
Observation: There is a feasible path if and only if
(1) for some i and k, there is an edge e from Xi to Yk and M(i,k) = 1, or
(2) for some k and j, there is an edge e from Xk to Yj and M(k,j) = 1
53
Compute a Feasible Path (2)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X0 to
Xi and a path PR from Yj to Y0 such that PL and PR together contain
exactly one of Xq and Yq for each q = 0, …, max{i,j}.
Observation: There is a feasible path if and only if
(1) for some i and k, there is an edge e from Xi to Yk and E(i,k) = 1, or
(2) for some k and j, there is an edge e from Xk to Yj and E(k,j) = 1
X0
Xi
PL
X0
Yk
e
PR
Xk
PL
Y0
Yj
e
Y0
PR
54
Compute a Feasible Path (3)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X0 to
Xi and a path PR from Yj to Y0 such that PL and PR together contain
exactly one of Xq and Yq for each q = 0, …, max{i,j}.
Base Case: M(0,0), M(0,1), M(1,0).
Recurrence:
(1) If M(i,j-1) = 1 and edge(Xi, Xj) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(Xj-1, Xj) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Yj, Yi) = 1, then M(j-1,j) = 1.
Idea: Extend PL and PR by one edge at a time.
55
Compute a Feasible Path (4)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X0 to
Xi and a path PR from Yj to Y0 such that PL and PR together contain
exactly one of Xq and Yq for each q = 0, …, max{i,j}.
Recurrence:
(1) If M(i,j-1) = 1 and edge(Xi, Xj) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(Xj-1, Xj) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Yj, Yi) = 1, then M(j-1,j) = 1.
Xj
Yj
Yj-1
Y0
X0
Xi
e
PL
X0
Xi
PL
PR
Xj
Yj
Yj-1
e
Y0
PR
56
Compute a Feasible Path (5)
Dynamic Programming: M(i,j) = 1 if there exist a path PL from X0 to
Xi and a path PR from Yj to Y0 such that PL and PR together contain
exactly one of Xq and Yq for each q = 0, …, max{i,j}.
Recurrence:
(1) If M(i,j-1) = 1 and edge(Xi, Xj) = 1, then M(j,j-1) = 1.
(2) If M(i,j-1) = 1 and edge(Yj, Yj-1) = 1, then M(i,j) = 1.
(3) If M(j-1,i) = 1 and edge(Xj-1, Xj) = 1, then M(j,i) = 1.
(4) If M(j-1,i) = 1 and edge(Yj, Yi) = 1, then M(j-1,j) = 1.
Computational Complexity: O(k^2).
57
Algorithmic Result #1: Finding a Feasible Path
Input: an NC-Spectrum Graph G=(V,E)
Output: a feasible path in G.
Computational Complexity: O(|V|2) time & O(|V|2) space.
58
Outline of This Talk (4)
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
59
Algorithmic Result #2: Finding a Feasible Path (Improved)
Input: an NC-spectrum graph G=(V,E).
Output: A feasible path can be found in O(|V|+|E|) time.
Idea: Speed up via pre-processing.
60
Amino Acid Modifications
A modification is an amino acid with slightly different atoms
(and thus a different mass) from the typical molecule.
Importance of modifications: Amino acid modifications are related to
functions. For example, a protein is active when phosphorylated and
inactive when de-phosphorylated.
61
Modification in the Tandem Mass Spectrum
100
R
50
S
S
W+d
W+d
200
R
400
Mass / Charge
62
Spectrum Graph: As Before
N0
N1
N2
C2
C1
C0
63
Spectrum Graph: A Modified Feasible Path
Idea: One mass change leads to one missing edge.
N0
N1
N2
C2
C1
C0
64
Algorithmic Result #3: Finding One Modification
A modification is an amino acid with slightly different
atoms (and thus a different mass) from the typical
molecule.
Theorem: Finding the position of the modification
takes O(|V|+|E|) space and O(|V| |E|) time.
65
Algorithmic Result #4: Noisy Data
Define a scoring function s():
– s(edge) = function(mass).
– s(node) = function(abundance).
Redefine the problem: Find the maximum score path
that goes through at most one node for each ion.
Solution: dynamic programming in O(|V|+|E|) space and
O(|V| |E|) time.
66
Outline of This Talk (5)
1. Problem Formulation (Biology)
2. Problem Formulation (Computer Science)
3. Basic Computational Techniques
4. Improved Computational Complexity and More Robust
Algorithms
5. Conclusions
67
Further Difficulties for Tandem Mass Spectrum Interpretation
•
•
•
•
•
•
Each ion has a couple of isotopic forms.
Other ions (a or z) may appear.
Some ions may lose a water or an ammonia.
Multiple ion charges.
Noise.
Amino acid modifications.
68
Further Research Directions
• Efficient algorithms to deal with more modifications in
conjunction with data noise.
• Efficient algorithms to combine de novo peptide sequencing with
peptide database search.
• Efficient algorithms to assess statistical significance of feasible
peptide sequences.
• Efficient algorithms to deal with multiple peptides.
• Practical implementation; speed-up via preprocessing.
• More …
69
Further Research Directions
• Looking for top-rate graduate students for this
project (and other projects).
• Immediate and expedited admission for the coming
fall semester.
70