Fast De Novo Peptide Sequencing and Spectral Alignment via Tree

Download Report

Transcript Fast De Novo Peptide Sequencing and Spectral Alignment via Tree

Tag-based Blind Identification of
PTMs with Point Process Model
1Chunmei
1Dept.
Liu, 2Bo Yan, 1Yinglei Song, 2Ying Xu, 1Liming Cai
of Computer Science
of Biochemistry and Molecular Biology,
University of Georgia, USA
2Dept.
Tandem mass spectra of peptides
e.g., MS/MS of GLSDGEWQQVLNVWGK (www.ionsource.com)
2
Tandem mass spectra of peptides
(a tutorial from
www.cmb.usc.edu)
Massb1 + Massy8 = Masstotal
3
Peptide sequencing

De novo sequencing: directly infer the target peptide from its
MS data
[Fernandez et al 1992; Dancik et al 1999; Chen et al 2001;
Searle et al 2001; Ma et al 2003, Liu et al 2006]
sensitive to MS data; noises; missing peaks; and difficult.

DB search based: compare the target MS with theoretical MS
in a peptide database
[Eng et al, 1994; Perkins et al 1999]
slow; target may not be in the database;
or modified after translation
4
Post-translational modification (PTM)
5
Identifying PTMs in peptide sequencing

Assume a limited set of modification types and
model them as pseudo amino acids
[Yates et al 1995; Wilkins et al 1999; Tanner et al 2005]
regular sequencing tools can apply
may erroneously processing PTMs of unknown types

Blind identification (unlimited modification types)

spectral alignment based (difficult) [Pevnzer et al 2000, Tsur
et al 2005, Yan et al 2006]
de novo sequencing dependent [Han et al 2005]
6
This work

DB search based (point process model, Yan et
al 2006) PTM identification

Yan et al 2006 is comparable to Tsur et al 2005
Both are the best blind PTM identification programs
Yan et al 2006 faster, hits of homologs

Peptide tag-based filtering of database

Graph-theoretic approach to generate tags
7
Our approach details


Input: an experimental spectrum
Output: a peptide sequence and possible PTMs
Steps:
 Construct an extended spectrum graph to
find all maximum weighted anti-symmetric paths and
select tags as the paths

Construct a DFA from the tags to
filter the peptide database to obtain candidate peptides

Apply point process model to the candidates to
identify the peptide and potential PTMs by maximizing spectra
alignment score
8
Spectrum graph
Intensity
is the mass of a single amino acids.
2
4
2i
2n
source
sink
source
1 2 3
b y b
2n-1 2n
b y
A tandem mass spectrum
m/z
sink
1
3
i
2n-1
De novo sequencing
corresponds to finding a longest
directed anti-symmetric path
from source to sink
[Dancik et al 1999, etc.]
9
Extended spectrum graph
Assume a MS/MS spectrum S of a peptide P be a set of mass peaks
{x1 , x2 ,...., x2 k } . xi  x j if i  j; parent mass is M. [Liu et al 2006]
……
0
x1
x2
If x j  xi is a mass of a
single amino acid, connect
the corresponding vertices
with directed edges
……
xi
……
x2 k i 1
x2 k 1
x2 k
M
Connect each pair of
complementary vertices
and x2 k i 1 with a nondirected edge.
xi
10
Extended spectrum graph
parent mass=471
Intensity
113
358
202 269
71
400
A
M
L
0
100
200
300
R
R
L
M
A
400
0
71
113
202
269
358
400
471
Mass/Charge
(a)
(b)
Peptide: AMRL/LRMA
11
Tag selection for the target peptide

Tag: a short sequence of amino acids

Previous work: PepNovo [Frank et al 2005]
apply de novo sequence algorithms first, and
identify tags from the sequenced peptide
Advantage: effective
Disadvantages: the present of noises, missing
peaks, and PTMs make it hard to improve the
effectiveness; slow
12
Tag selection for the target peptide

In this work:
construct an extended spectrum graph (mixed graph) from the
target spectrum
tree-decompose the graph
dynamic programming to find all maximum weighted antisymmetric paths
advantages: fast and effective, tolerating noises and
missing peaks
13
Graph
Tree Decomposition
bag
Graph
g
a
b
d
c
f
e
h
Tree decomposition
ac
b
af
c
aa gf
gh
cd
e
abc
de
acf
gh
14
Properties of tree decomposition
g
a
1. Each vertex is
contained in at least
one bag
b
c
d
ac
b
af
c
h
f
e
aa gf
gh
cd
e
15
Properties of tree decomposition
1. Each vertex is
contained in at least
one bag
2. For any edge {g, f}:
there is a bag
containing both g and f
g
a
b
c
d
ac
b
af
c
h
f
e
a gf
gh
cd
e
16
Properties of tree decomposition
1. Each vertex is
contained in at least
one bag
2. For any edge {g, f}:
there is a bag
containing both g and
f
3. For every vertex c:
the bags that contain
c form a connected
subtree
g
a
b
c
d
ac
b
af
c
h
f
e
a gf
gh
cd
e
17
Tree Width

Tree width of a tree
decomposition:
max iI | bag i | 1

Tree width of a graph:
minimum width over
all tree
decompositions of the
graph
a
b
c
g
d
f
e
ac
b
af
c
c de
abc
de
h
Tree width = 2
ag
f
gh
Tree width = 4
acf
gh
18
Tree bags are separators

Internal tree bags in a
tree decomposition
are separators of the
graph
g h
c
f
e
b
a
d
ac
b
af
c
a gf
gh
cd
e
19
Tree bags are separators


Tree bags in a tree
decomposition are
separators of the graph
This allows efficient
dynamic programming
a
b
c
d
ac
b
g
g h
h
b
f
e
af
c
d
e
a gf
gh
cd
e
20
Dynamic programming
6
4
5
3
1
2

A table is maintained
for each bag

Compute tables
bottom up

Each table contains
partial optimal
solutions; the root
table contains the
optimal one
Time complexity: O(6tn2)
21
Dynamic programming (cont’s)
…
…
…
a b c c1 c2 V L
……
Xi
bottom-up
abc
Xj
ad
Xk
Xk
b e c c1 c2 V L
1 0 1 0 1 3
1 0 1 1  0
……
bec
Xj
b c1 c2 V L
1 1 1 00 13
1 0 1 1  14
……
adb
1 1 1 12
……
Xi
22
Score scheme and reliability of
sequence tags

Assign the score scheme [Dancik et all 1999] as weights to the
edges in spectrum graphs

Overall reliability of a tag ti = w1r1(ti) + w2r2(ti)
r1(ti) - reliability computed from ti’s edge normalized weights
r2(ti) - reliability computed autocorrelation score [Liu et al, 2005]
Refer to the paper for details
23
PTM identification with point process
blind search

Find a set of PTMs to maximize the spectral alignment

Can identify all possible PTMs through one round of
cross-correlation calculation

Computation time is independent of the number of PTMs
24
PTM identification with point process
blind search

Treat a spectrum and the theoretical spectrum of a candidate
N
peptide as one point process:
x(t )    (t  ti )
i 1
where {ti } is a set of mass locations with N peaks, and δ is the
Kronecker delta function:

Assume there is K PTMs, the {ti } can be clustered into K+1 groups:
K
K
k 0
k 0
x(t )   xk (t )  
Nk
(k )

(
t

t

i )
i 1
25
PTM identification with point process
blind search

When a PTM happens, a shift occurs to xk(t) to produce yk(t)
Nk
yk (t )  xk (t   k )    (t   k  ti( k ) )
i 1

Use C[.] to denote the total number of non-zero values in a point
process:
K
C xk (t )  N k , C x(t )   N k
k 0
cxy ( )  C x(t ) y (t   )
26
PTM identification with point process
blind search

For K=1, ∆ represents the mass of a possible PTM, we report
the top candidate with a ∆, and
| PWexp  PW   | tolerance
with the maximum
c xy (0)  c xy ( )
27
Evaluations

Datasets
 2657 annotated yeast ion trap tandem mass spectra
from OPD (Prince et al, 2004) having relatively low mass
resolutions
 2620 modified spectra with one artificially added one
PTM to each spectrum (Yan et al, 2006)

Experiments
 Sequence tag generation
 Database search via DFA based model
 Blind PTM identification
28
Performance in tag selection
w/o
PTM
with
PTM
Tag
length
Algorithm
Top 1
Top 3
Top 5
Top 10
Top 25
Time(s)
3
Ours
Pepnovo
75.8
75.8
89.1
90.1
94.6
94.6
96.9
96.8
98.1
98.8
0.33
3.62
4
Ours
Pepnovo
65.3
65.5
80.5
81.0
88.7
86.6
93.6
92.3
96.4
95.3
0.34
3.69
5
Ours
Pepnovo
56.4
58.4
72.8
71.3
78.3
77.6
85.1
84.0
89.8
88.9
0.33
3.83
6
Ours
Pepnovo
50.2
49.7
62.3
61.5
66.9
67.8
76.6
75.0
82.4
81.8
0.34
4.27
3
Ours
Pepnovo
68.1
62.8
84.8
83.7
90.3
89.7
94.8
84.9
97.1
97.8
0.32
3.59
4
Ours
Pepnovo
53.5
51.1
71.2
71.7
78.6
79.3
84.8
85.8
90.0
91.4
0.32
3.64
Columns: percentages of spectra that have at least one correct tag in top 1, 3,
5, 10, 25. Comparisons based on the sequencing results by SEQUEST [Eng
et al 1994]
29
Performance in tag selection (cont’d)
Time complexity of the tag selection depends on the tree width t
of spectrum graphs: O(6tn2)
About 90% of such graphs have tree width not exceeding 6
More than 10 times faster than PepNovo [Frank et al 2005]
30
Database search for PTM identification

Construct a DFA from the selected sequence tags and
use it to filter a peptide database

Only small portion of peptides will remain

Point process model for PTM identification are applied
to identify the peptide and potential PTMs
31
Performance in PTM identification
Tag
length
Top 1
Top 2
Top 3
Top 4
Top 5
Filtration
Ratio
T(s)
3
76.69
86.01
89.29
90.70
91.62
0.0167
263
4
74.98
80.77
81.71
82.17
84.40
0.0014
34
W/O
Filtration
60.38
72.33
76.64
79.16
81.17
-
3843
Columns: cumulative percentages of search results capturing
the target peptides exactly in Top i; T is the total time for all 2620
experimental spectra. Comparisons with Yan et al 2006 that does
not employ filtration.
32
Summary

A new graph-theoretic approach for peptide tag selection
effective and efficient

In combine with point process model to sequence peptide and
identify PTMs
effective and efficient

More tests are needed (e.g. two PTMs)

Tree decomposition based approaches have not been fully
exploited (e.g., improving tag selection effectiveness)
33
Acknowledgement
CS@UGA
Chunmei Liu
Yinglei Song
BMB@UGA
Bo Yan
Ying Xu
NSF
NIH
34