De Novo Peptide Sequencing via Probabilistic Network Modeling


PepNovo
De Novo Peptide Sequencing via
Probabilistic Network Modeling
Peptide Fragmentation
[Figure: peptide fragmentation by Collision-Induced Dissociation (CID), with the N- and C-termini marked and a fragment of mass PM-m labeled.]
Peptide Fragmentation

A peptide of mass PM that fragments into a prefix
of mass m and a suffix of mass PM-m can produce
different fragment ions:
Prefix ion   Position     Suffix ion    Position
b            m+1          y             PM-m+19
b-H2O        m-17         y-NH3         PM-m+2
b+2          (m+2)/2      y-H2O-H2O     PM-m-17
...          ...          ...           ...

The intensities at the expected offsets from mass m
are used to create an intensity vector:

$$I_m = \{I_b, I_y, I_{b-H_2O}, \ldots\}$$
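
The offset table and intensity vector above can be sketched as follows. This is a minimal illustration, assuming a spectrum given as (m/z, intensity) pairs; the function names, the matching tolerance `tol`, and the choice of taking the strongest peak within tolerance are assumptions for the sketch, not details from the slides.

```python
def expected_positions(m, pm):
    """Expected fragment positions for a cleavage at prefix mass m.

    Offsets are copied from the table in the slides; pm is the peptide mass PM.
    """
    return {
        "b": m + 1,
        "b-H2O": m - 17,
        "b+2": (m + 2) / 2,
        "y": pm - m + 19,
        "y-NH3": pm - m + 2,
        "y-H2O-H2O": pm - m - 17,
    }


def intensity_vector(spectrum, m, pm, tol=0.5):
    """Build I_m: one intensity per fragment type for the putative cleavage m.

    spectrum is a list of (mz, intensity) pairs. The tolerance and the
    "strongest peak wins" rule are assumptions; a missing peak gives 0.0.
    """
    vector = {}
    for ion, pos in expected_positions(m, pm).items():
        hits = [inten for mz, inten in spectrum if abs(mz - pos) <= tol]
        vector[ion] = max(hits, default=0.0)
    return vector
```

For example, `intensity_vector([(101.2, 30.0), (419.1, 55.0)], m=100.0, pm=500.0)` picks up the b and y peaks and reports 0.0 for the other fragment types.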
The Spectrum Graph
Scoring for De Novo Sequencing



All masses in the spectrum range can be
considered putative cleavage sites.

Given the observed intensities I_m, how do we
evaluate whether mass m is a cleavage site?
A common statistical tool used by many
scoring functions is the likelihood ratio
test (Dancik et al. ’99, Havilio et al.
’03, ...).
Dancik et al. ’99 – Hypotheses


The main concept:
Give a premium for present peaks and a penalty for missing
peaks.
Uses a probability table:

Fragmentation hypothesis:
Fragment      Probability
y             0.71 (P1)
b             0.66 (P2)
a             0.26 (P3)
...
y-H2O-H2O     0.09 (Pk)

Random hypothesis:
PR – probability of observing a random peak (~0.1).
Scoring a Cleavage Site (Dancik ’99)


Out of k possible ions for cleavage at m, t are detected
(w.l.o.g. fragments 1,...,t) and k-t are missing (t+1,...,k).
Score using a log ratio test:

$$\mathrm{Score}(m) = \log \frac{P_1 \cdots P_t \, (1 - P_{t+1}) \cdots (1 - P_k)}{\underbrace{P_R \cdots P_R}_{t} \, \underbrace{(1 - P_R) \cdots (1 - P_R)}_{k-t}}$$

The numerator is the probability of cleavage site m according to the Fragmentation hypothesis.
The denominator is the probability of cleavage site m according to the Random hypothesis.
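
A short sketch of this log ratio score, using the probabilities listed in the table (y 0.71, b 0.66, a 0.26, y-H2O-H2O 0.09, PR about 0.1). The function name, the use of the natural logarithm, and the restriction to these four fragment types are assumptions made for illustration.

```python
import math

# Fragment probabilities under the Fragmentation hypothesis (from the slide table).
PROBS = {"y": 0.71, "b": 0.66, "a": 0.26, "y-H2O-H2O": 0.09}


def dancik_score(probs, detected, p_random=0.1):
    """Log ratio score for one putative cleavage site.

    Detected fragments contribute log(P_i / P_R); missing fragments
    contribute log((1 - P_i) / (1 - P_R)). Natural log is an assumption.
    """
    score = 0.0
    for ion, p in probs.items():
        if ion in detected:
            score += math.log(p / p_random)
        else:
            score += math.log((1 - p) / (1 - p_random))
    return score
```

With these numbers, a site where both y and b are detected gets a positive score, while a site with no detected fragments gets a negative one, which is the premium/penalty behaviour described above.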
PepNovo Scoring



PepNovo implements a similar
likelihood ratio test mechanism.
It can be viewed as extending the
scoring model of Dancik et al. ’99.
It includes several factors that are not
sufficiently addressed in current
scoring functions.
Enhancements to Dancik et al. (’99)
1. Several intensity values.
2. Combinations of fragment ions.
3. Incorporation of additional chemical knowledge (e.g., preferred cleavage sites).
4. Positional influence of the cleavage site.
5. Improved random model.
HCID - Fragmentation Network
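[Figure: the HCID fragmentation network. Parent nodes: N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), and pos(m) (region in peptide). Fragment nodes: y, b, a, y2, b2, y-NH3, b-NH3, y-H2O, b-H2O, a-NH3, a-H2O, y-H2O-NH3, b-H2O-NH3, y-H2O-H2O, b-H2O-H2O. Edge types: amino acid influence, ion combinations, positional influence. An inset shows part of the conditional probability table P(y2 | y, pos), with entries including 0.1, 0.22, 0.52, and 0.08.]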
Discrete Intensity Values


Peak intensities are normalized according
to the grass level (the average of the weakest
33% of peaks in the spectrum).
Normalized intensities are discretized
into 4 intensity levels:

zero:    I < 0.05
low:     0.05 ≤ I < 2    (62% of peaks)
medium:  2 ≤ I < 10      (26% of peaks)
high:    I ≥ 10          (12% of peaks)
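
The normalization and discretization steps can be sketched as below. The thresholds are the ones listed above; the function names and the rounding rule used to pick the "weakest 33%" of peaks are assumptions, and the sketch expects a non-empty spectrum with a positive grass level.

```python
def grass_level(intensities, fraction=0.33):
    """Average intensity of the weakest `fraction` of peaks.

    How the 33% cutoff is rounded is an assumption of this sketch.
    """
    ranked = sorted(intensities)
    k = max(1, round(len(ranked) * fraction))
    weakest = ranked[:k]
    return sum(weakest) / len(weakest)


def discretize(normalized):
    """Map a grass-normalized intensity to one of the 4 levels."""
    if normalized < 0.05:
        return "zero"
    if normalized < 2:
        return "low"
    if normalized < 10:
        return "medium"
    return "high"


def discretize_spectrum(intensities):
    """Normalize every peak by the grass level, then discretize it."""
    grass = grass_level(intensities)
    return [discretize(i / grass) for i in intensities]
```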
Combinations of Fragments
[Figure: fragment-ion nodes of the network: a, y, b, y2, b2, y-NH3, b-NH3, b-H2O, y-H2O, a-NH3, a-H2O, b-H2O-NH3, b-H2O-H2O, y-H2O-H2O, y-H2O-NH3.]
Different combinations have
significantly different probabilities:


P(b = high | y = high) = 0.36, vs. P(b = high | y = low) = 0.03.
P(b-H2O > zero | b = high) = 0.5, vs. P(b-H2O > zero | b = zero) = 0.24.
Additional Chemical Knowledge
[Figure: network nodes N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), b, and y.]

The identity of the flanking amino acids
influences the peak intensities:

Increased intensities N-terminal to Proline and Glycine.
Increased intensities C-terminal to Aspartic Acid.

400 amino acid combinations are reduced to
15 equivalence sets (X-P, X-G, etc.).
Positional Influence
[Figure: network node pos(m) (region in peptide) with fragment nodes a, b, y, b2, y2.]

Creates separate models for different locations in the
peptide:

$$pos(m) = m / PM$$

Models phenomena such as:

Weak b/y ions near the ends.
Prevalence of a-ions in the first half of the peptide.
Prevalence of b2 towards the peptide’s C-terminal and y2 near the N-terminal.
Probability under HCID

From the decomposition properties of
probabilistic networks, each node is
independent of the rest of the nodes
given the values of its parents, so:

$$P_{H_{CID}}(I, m) = \prod_{f \in \{y, b, \ldots\}} P_{H_{CID}}(f = i \mid \pi(f))$$

where $\pi(f)$ are the parents of node f.
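
The factorization can be illustrated with a toy two-fragment network. Everything in this sketch is hypothetical: the node set, the two intensity levels, the region names, and every number in the conditional probability tables are invented for illustration and are not PepNovo's trained values.

```python
def network_probability(observed, parents, cpt):
    """Product over fragment nodes f of P(f = i | parents of f).

    observed: node name -> observed value (fragments and parent-only nodes)
    parents:  fragment node -> tuple of its parent node names
    cpt:      fragment node -> {parent values tuple: {value: probability}}
    """
    prob = 1.0
    for node, parent_names in parents.items():
        parent_values = tuple(observed[p] for p in parent_names)
        prob *= cpt[node][parent_values][observed[node]]
    return prob


# Toy network (invented): y depends on the region, b depends on y.
PARENTS = {"y": ("pos",), "b": ("y",)}
CPT = {
    "y": {("middle",): {"zero": 0.2, "high": 0.8},
          ("end",): {"zero": 0.7, "high": 0.3}},
    "b": {("high",): {"zero": 0.6, "high": 0.4},
          ("zero",): {"zero": 0.9, "high": 0.1}},
}
```

Calling `network_probability({"pos": "middle", "y": "high", "b": "high"}, PARENTS, CPT)` multiplies P(y = high | pos = middle) by P(b = high | y = high) from the toy tables.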
HRandom – Regional Density
[Figure: regional peak density. A window of width w on the m/z axis contains peaks at intensity levels 0 to 3; a bin of width 2ε lies within the window.]
Computing the Random Probability


=1-(2ε)/w , is the probability of a
single peak missing the bin.
Let ni , 1≤i≤d, be counts of peaks
with intensity i in window w:
d
 ni
1. PRandom ( I  t | n1 ,...,nd )  (1   nt ) it 1
d
 ni
2. PRandom ( I  0 | n1 ,...,nd )   i1
d
3.
P
i 0
Random
( I  i | n1 ,...,nd )  1
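
A small sketch of formulas 1 to 3, useful for checking that the level probabilities sum to one. The function name and argument layout are assumptions; `counts[i - 1]` holds the number of peaks of intensity level i in the window.

```python
def random_level_probabilities(counts, eps, window):
    """Return [P(I=0), P(I=1), ..., P(I=d)] under the random model.

    counts[i - 1] is n_i, the number of peaks with intensity level i
    in a window of width `window`; the bin has width 2 * eps.
    """
    alpha = 1.0 - (2.0 * eps) / window  # one peak misses the bin
    d = len(counts)
    probs = [alpha ** sum(counts)]  # formula 2: every peak misses the bin
    for t in range(1, d + 1):
        n_t = counts[t - 1]
        higher = sum(counts[t:])  # n_{t+1} + ... + n_d
        # formula 1: some level-t peak hits the bin, no stronger peak does
        probs.append((1.0 - alpha ** n_t) * alpha ** higher)
    return probs
```

For any counts, the returned values telescope to a total of 1, which is formula 3.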
Random Model for HRandom
Peak occurrences are treated as
random independent events:

$$P_{H_{Random}}(I, m) = \prod_{f \in \{y, b, y-H_2O, \ldots\}} P_{H_{Random}}(I_f = i, m)$$

The probability of observing a peak
at random is estimated from the
local density of peaks in the
spectrum.
The Likelihood Ratio Score

A putative cleavage site is scored
according to the log ratio test:

$$\mathrm{Score}(I, m) = \log \frac{P_{H_{CID}}(I, m)}{P_{H_{Random}}(I, m)} = \log \frac{\prod_{f \in \{y, b, \ldots\}} P_{H_{CID}}(f = i \mid \pi(f))}{\prod_{f \in \{y, b, \ldots\}} P_{H_{Random}}(f = i)}$$

This can be used to score a peptide by
summing the scores for the prefix
masses:

$$\mathrm{Score}(P = p_1 p_2 \ldots p_n) = \sum_{i=1}^{n} \mathrm{Score}(I_{m_i}, m_i)$$
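
The two formulas above can be sketched as follows, assuming the per-fragment probabilities under each hypothesis have already been computed. The function names and the dictionary layout are assumptions for illustration, as is the use of the natural logarithm.

```python
import math


def cleavage_score(p_cid, p_random):
    """Log likelihood ratio for one putative cleavage site.

    p_cid[f] and p_random[f] are the probabilities of the observed
    intensity level of fragment f under H_CID and H_Random.
    """
    return sum(math.log(p_cid[f] / p_random[f]) for f in p_cid)


def peptide_score(sites):
    """Sum the cleavage scores over all prefix masses of a peptide.

    sites is a list of (p_cid, p_random) pairs, one per prefix mass.
    """
    return sum(cleavage_score(p_cid, p_random) for p_cid, p_random in sites)
```

Because the score is a sum of logs, a fragment whose observed intensity is equally likely under both hypotheses contributes nothing to the total.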
PepNovo’s De Novo Sequencing



A spectrum graph is created from
the experimental MS/MS spectrum.
The nodes are scored using our
method.
The highest-scoring antisymmetric path
is found using a dynamic
programming algorithm.
Spectrum Graph



Acyclic graph.
Nodes are cleavage sites, each has a mass m
and score s.
Edges connect nodes with mass differences
corresponding to an amino acid.
[Figure: example spectrum graph with edges labeled Q, V, A, S, L, and W, and the following nodes:]

m        s
0        5.0
71.2     4.3
99.1     8.1
113      -1.2
163.2    2.8
199.4    5.6
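
A minimal sketch of building a spectrum graph and finding a highest-scoring path by dynamic programming. It is a simplification: the antisymmetry constraint used by PepNovo is not enforced here, only five residues are included, and the function names, mass tolerance, and example nodes are assumptions chosen for illustration.

```python
# Monoisotopic residue masses (rounded) for a small subset of amino acids.
RESIDUE_MASS = {"G": 57.021, "A": 71.037, "S": 87.032, "V": 99.068, "L": 113.084}


def build_edges(nodes, tol=0.05):
    """nodes: list of (mass, score) sorted by mass.

    Returns {j: [(i, residue), ...]}: incoming edges for each node j,
    one per residue whose mass matches the mass difference within tol.
    """
    incoming = {j: [] for j in range(len(nodes))}
    for i, (mass_i, _) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j][0] - mass_i
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    incoming[j].append((i, residue))
    return incoming


def best_path(nodes, tol=0.05):
    """Highest-scoring path from the first node to the last node.

    Returns (sequence, score); an unreachable last node gives ("", -inf).
    Antisymmetry is NOT checked in this simplified version.
    """
    incoming = build_edges(nodes, tol)
    best = [float("-inf")] * len(nodes)
    back = [None] * len(nodes)
    best[0] = nodes[0][1]
    for j in range(1, len(nodes)):  # nodes are in topological (mass) order
        for i, residue in incoming[j]:
            candidate = best[i] + nodes[j][1]
            if candidate > best[j]:
                best[j] = candidate
                back[j] = (i, residue)
    sequence = []
    j = len(nodes) - 1
    while back[j] is not None:  # trace the best path backwards
        i, residue = back[j]
        sequence.append(residue)
        j = i
    return "".join(reversed(sequence)), best[-1]
```

For the hypothetical nodes `[(0.0, 0.0), (71.04, 4.3), (99.07, 8.1), (170.11, 2.8), (227.13, 5.6)]`, two paths reach the last node (A-V-G and V-A-G), and the dynamic program prefers the one through the higher-scoring node at 99.07.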
Results
Algorithm   Average Accuracy   Sequence Length   Tag 3   Tag 4   Tag 5   Tag 6
PepNovo     0.727              10.30             0.946   0.871   0.800   0.654
Sherenga    0.690              8.65              0.821   0.711   0.564   0.364
Peaks       0.673              10.32             0.889   0.814   0.689   0.575
Lutefisk    0.566              8.79              0.661   0.521   0.425   0.339

Benchmarking reported for 280 spectra.
Q&A