Peptide Identification
Statistics
Pin the tail on the donkey?
US HUPO: Bioinformatics for Proteomics
Nathan Edwards – March 12, 2006
Peptide Identification
• Peptide fragmentation by CID is poorly
understood
• MS/MS spectra represent incomplete
information about amino-acid sequence
• I/L, K/Q, GG/N, …
• Correct identifications don’t come with a
certificate!
Peptide Identification
• High-throughput workflows demand we
analyze all spectra, all the time.
• Spectra may not contain enough
information to be interpreted correctly
• …bad static on a cell phone
• Peptides may not match our assumptions
• …it’s all Greek to me
• “Don’t know” is an acceptable answer!
Peptide Identification
We can’t prove we are right…
…so can we prove we aren’t wrong?
NO!
The best we can do is to show our answer
is better than guessing!
Better than guessing…
• Better implies comparison
• Score or measure of degree of success
• Guessing implies randomness
• Probability and statistics
Pin the tail on the donkey…
Probability Concepts
Throwing darts
• One at a time
• Blindfolded
Identically distributed?
Uniform distribution?
Mutually exclusive?
Independent?
Pr [ Dart hits x ] = 0.05
Probability Concepts
Throwing darts
• One at a time
• Blindfolded
• Three darts
Pr [Hitting 20 3 times]
= 0.05 * 0.05 * 0.05
Pr [Hit 20 at least twice]
= 0.007125 + 0.000125
0 times: 0.857375
1 time: 0.135375
2 times: 0.007125
3 times: 0.000125
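The three-dart numbers above follow from the binomial formula. A minimal sketch, assuming each dart independently hits the 20 with probability 0.05 (one of 20 equal sectors); the function name `binomial_pmf` is introduced here for illustration only:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws,
    each hitting with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_hit = 0.05  # Pr [ Dart hits x ] from the slide
three_darts = [binomial_pmf(k, 3, p_hit) for k in range(4)]
at_least_twice = three_darts[2] + three_darts[3]

for k, prob in enumerate(three_darts):
    print(f"{k} hits: {prob:.6f}")
print(f"at least twice: {at_least_twice:.6f}")
```

Changing `p_hit` to 0.5 gives the "evens" case, where the four outcomes are 0.125, 0.375, 0.375, and 0.125.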
Probability Concepts
[Figure: bar chart of probability (0 to 1) against number of 20’s hit with three darts (0, 1, 2, 3): 0.857375, 0.135375, 0.007125, 0.000125]
Probability Concepts
Throwing darts
• One at a time
• Blindfolded
• Three darts
Pr [Hitting evens 3 times]
= Pr [Hitting 1-10 3 times]
= 0.5 * 0.5 * 0.5
Pr [Evens at least twice]
= 0.5
0 times: 0.125
1 time: 0.375
2 times: 0.375
3 times: 0.125
Probability Concepts
[Figure: bar chart of probability (0 to 0.4) against number of evens hit with three darts (0, 1, 2, 3): 0.125, 0.375, 0.375, 0.125]
Probability Concepts
Throwing darts
• One at a time
• Blindfolded
• 100 darts
Pr [Hitting 20 3 times]
= 0.139575
Pr [Hit 20 at least twice]
= 0.9629188
0 times: 0.005920
1 time: 0.031160
2 times: 0.081181
3 times: 0.139575
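With 100 darts the "at least twice" probability is easiest to get from the complement, 1 – Pr [0 times] – Pr [1 time]. The sketch below computes it exactly and checks it against a simulation, assuming independent throws with hit probability 0.05. The simulation plays the role of R's `rbinom(10000, 100, 0.05)`; the helper names and the seed are illustrative choices, not from the slides.

```python
import random
from math import comb

def binomial_tail(s, n, p):
    """Pr[at least s hits in n throws], as 1 minus the terms below s."""
    below = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))
    return 1.0 - below

def simulate_hits(n, p, rng):
    """Throw n darts once and count how many hit."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(2006)  # fixed seed so the sketch is repeatable
exact = binomial_tail(2, 100, 0.05)
rounds = 10000
simulated = sum(simulate_hits(100, 0.05, rng) >= 2 for _ in range(rounds)) / rounds

print(f"exact     Pr[at least twice] = {exact:.7f}")
print(f"simulated Pr[at least twice] = {simulated:.4f}")
```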
Probability Concepts
[Figure: Histogram of rbinom(10000, 100, 0.05); x-axis rbinom(10000, 100, 0.05) from 0 to 15, y-axis Frequency from 0 to 1500]
Match Score
• Dartboard represents the mass range of the
spectrum
• Peaks of a spectrum are “slices”
• Width of slice corresponds to mass tolerance
• Darts represent
• random masses
• masses of fragments of a random peptide
• masses of peptides of a random protein
• masses of biomarkers from a random class
• How many darts do we get to throw?
Match Score
• What is the probability
that we match at least
5 peaks?
• Same as the
probability of hitting
20 at least 5 times.
[Figure: spectrum, % Intensity (0 to 100) against m/z (0 to 1000), with peaks labeled 270, 330, 550, 580, 755, 870]
Match Score
• Pr [ Match ≥ s peaks ]
= Pr [ X ≥ s ], where X ~ Binomial( n , p )
≈ Pr [ Y ≥ s ], where Y ~ Poisson( p n ), for small p and large n
p is the probability of a random mass / peak match,
n is the number of darts (fragments in our answer)
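To see how close the Poisson approximation is, the sketch below compares the two tail probabilities. The values p = 0.05, n = 100, and s = 5 are reused from the dart example purely for illustration; they are not taken from a real spectrum.

```python
from math import comb, exp, factorial

def binomial_tail(s, n, p):
    """Pr[X >= s] for X ~ Binomial(n, p), via the complement.
    (For extremely small tails, summing the upper terms directly
    would lose less floating-point precision.)"""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s))

def poisson_tail(s, mean):
    """Pr[Y >= s] for Y ~ Poisson(mean), via the complement."""
    return 1.0 - sum(exp(-mean) * mean**k / factorial(k) for k in range(s))

p_match = 0.05     # illustrative chance that one random mass matches a peak
n_fragments = 100  # illustrative number of darts
s = 5              # "at least 5 peaks"

exact = binomial_tail(s, n_fragments, p_match)
approx = poisson_tail(s, p_match * n_fragments)
print(f"Binomial tail: {exact:.4f}")
print(f"Poisson tail:  {approx:.4f}")
```

The agreement improves as p shrinks and n grows with p n held fixed, which is the regime the slide refers to.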
Match Score
Theoretical distribution
• Used by OMSSA
• Proposed, in various forms, by many.
• Probability of random mass / peak
match
• IID (independent, identically distributed)
• Based on match tolerance
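One simple way to turn a match tolerance into a random-match probability is to ask what fraction of the mass range is covered by the tolerance windows around the peaks. This is an assumption made here for illustration (random masses uniform over the range); the slides do not give the formula that OMSSA or any other tool actually uses. The peak list is the one labeled in the example spectrum, and the ±0.5 tolerance is hypothetical.

```python
def random_match_probability(peaks, tolerance, mass_min, mass_max):
    """Fraction of [mass_min, mass_max] covered by +/- tolerance windows
    around the peaks; overlapping windows are merged so they count once."""
    windows = sorted(
        (max(mass_min, m - tolerance), min(mass_max, m + tolerance)) for m in peaks
    )
    covered = 0.0
    current_start = current_end = None
    for start, end in windows:
        if start >= end:
            continue  # window lies entirely outside the mass range
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (mass_max - mass_min)

peaks = [270, 330, 550, 580, 755, 870]  # peak labels from the example spectrum
p = random_match_probability(peaks, tolerance=0.5, mass_min=0, mass_max=1000)
print(f"p = {p}")
```

Under these assumptions a wider tolerance or a more crowded spectrum gives a larger p, so the same number of matched peaks is less surprising.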
Match Score
Theoretical distribution assumptions
• Each dart is independent
• Peaks are not “related”
• Each dart is identically distributed
• Chance of random mass / peak match is
the same for all peaks
Tournament Size
[Figure: four panels (100 people, 1000 people, 10000 people, 100000 people) titled "100 Darts, # 20’s"; x-axis from 0 to about 15, y-axis from 0.00 to 0.15]
Tournament Size
[Figure: four panels (100 people, 1000 people, 10000 people, 100000 people) titled "100 Darts, # 20’s"; x-axis from 10 to 18, y-axis from 0 to 50]
Number of Trials
• Tournament size == number of trials
• Number of peptides tried
• Related to sequence database size
• Probability that at least one random match score is ≥ s
• 1 – Pr [ all match scores < s ]
• 1 – Pr [ match score < s ]^Trials   (*)
• Assumes IID!
• Expect value
• E = Trials * Pr [ match ≥ s ]
• Corresponds to Bonferroni bound on (*)
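The relationship between (*) and the expect value can be checked numerically. In this sketch the single-trial probability `p_single` is a hypothetical value chosen for illustration, and the function names are not from any search engine.

```python
def prob_any_random_match(p_single, trials):
    """(*): chance that at least one of `trials` IID random scores reaches s,
    where p_single = Pr [ match >= s ] for one trial."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single, trials):
    """E = Trials * Pr [ match >= s ]: expected number of random matches."""
    return trials * p_single

p_single = 1e-6  # hypothetical chance that one random peptide scores >= s
for trials in (100, 10_000, 1_000_000):
    prob = prob_any_random_match(p_single, trials)
    e_value = expect_value(p_single, trials)
    print(f"trials={trials:>9}  (*)={prob:.6f}  E={e_value:.6f}")
```

E is never smaller than (*), and the two nearly coincide when E is small, which is why E serves as a Bonferroni-style upper bound. For a fixed score threshold, a larger database (more trials) makes a random match more likely.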
Better Dart Throwers
Better Random Models
• Comparison with completely random
model isn’t really fair
• Match scores for real spectra with real
peptides obey rules
• Even incorrect peptides match with
non-random structure!
Better Random Models
• Want to generate random fragment
masses (darts) that behave more like the
real thing:
• Some fragments are more likely than others
• Some fragments depend on others
• Theoretical models can only incorporate
this structure to a limited extent.
Better Random Models
• Generate random peptides
• Real looking fragment masses
• No theoretical model!
• Must use empirical distribution
• Usually require they have the correct
precursor mass
• Score function can model anything
we like!
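A toy version of this idea is sketched below. Everything in it is an assumption for illustration: the peptide is made up, the "spectrum" is just that peptide's own b-ions, the score is a simple shared-peak count, and random peptides are produced by shuffling residues, which keeps the precursor mass unchanged. It is not the procedure of any particular search engine.

```python
import random

# Monoisotopic residue masses (Da) for the residues used in this toy example.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
                "L": 113.08406, "D": 115.02694, "K": 128.09496,
                "E": 129.04259, "F": 147.06841}
PROTON = 1.00728

def b_ions(peptide):
    """Singly charged b-ion m/z values: prefix residue-mass sums plus a proton."""
    total, ions = 0.0, []
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ions.append(total + PROTON)
    return ions

def match_score(peaks, peptide, tol=0.5):
    """Toy score: number of b-ions lying within `tol` of some peak."""
    return sum(any(abs(ion - peak) <= tol for peak in peaks)
               for ion in b_ions(peptide))

def empirical_p_value(peaks, peptide, n_random=2000, seed=0):
    """Score shuffled versions of the peptide (same precursor mass) and report
    how often a random peptide scores at least as well as the candidate."""
    rng = random.Random(seed)
    observed = match_score(peaks, peptide)
    residues = list(peptide)
    as_good = 0
    for _ in range(n_random):
        rng.shuffle(residues)
        if match_score(peaks, "".join(residues)) >= observed:
            as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_good + 1) / (n_random + 1)

peptide = "DAFLGSFLVEK"   # made-up candidate peptide
peaks = b_ions(peptide)   # idealized spectrum, for illustration only
observed, p_value = empirical_p_value(peaks, peptide)
print(f"observed score = {observed}, empirical p-value = {p_value:.4f}")
```

Because the null distribution is built by scoring, `match_score` could be replaced by any score function without changing the rest of the sketch.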
Better Random Models
Fenyo & Beavis, Anal. Chem., 2003
Better Random Models
• Truly random peptides don’t look much
like real peptides
• Just use peptides from the sequence
database!
• Caveats:
• Correct peptide (non-random) may be
included
• Peptides are not independent
• Reverse sequence avoids only the first
problem
Extrapolating from the
Empirical Distribution
• Often, the empirical shape is
consistent with a theoretical model
Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
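The slides show the extrapolation only as figures. One way to carry it out, assumed here for illustration rather than taken from the cited papers, is to fit a straight line to the logarithm of the survival counts (the number of random scores ≥ s) and read the line off beyond the observed range. The random scores below are simulated dart counts standing in for real random-match scores.

```python
import math
import random

def survival_counts(scores):
    """Map each integer score s to the number of scores that are >= s."""
    return {s: sum(1 for x in scores if x >= s) for s in range(max(scores) + 1)}

def fit_line(points):
    """Ordinary least-squares line through (x, y) points: (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(2003)
# Stand-in empirical distribution: 5000 random "peptides", 100 darts each.
random_scores = [sum(rng.random() < 0.05 for _ in range(100)) for _ in range(5000)]

counts = survival_counts(random_scores)
# Fit only the high-score tail, and only where enough observations remain.
tail = [(s, math.log10(c)) for s, c in counts.items() if s >= 6 and c >= 5]
slope, intercept = fit_line(tail)

def extrapolated_count(score):
    """Expected number of random scores >= score, read off the fitted line."""
    return 10 ** (intercept + slope * score)

print(f"expected random matches with score >= 20: {extrapolated_count(20):.4f}")
```

The extrapolated count plays the role of an expect value for scores that no random peptide actually reached. How well a straight line fits depends on the score function, so the fit should be inspected rather than assumed.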
Peptide Prophet
• From the Institute for Systems Biology
• Keller et al., Anal. Chem. 2002
• Re-analysis of SEQUEST results
• Spectra are trials (NOT peptides!)
• Assumes that many of the spectra are
not correctly identified
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
Peptide Prophet
• Assumes a bimodal distribution of scores,
with a particular shape
• Ignores database size
• …but it is included implicitly
• Like empirical distribution for peptide
sampling, can be applied to any score
function
• Can be applied to any search engine’s results
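The bimodal idea can be sketched as a two-component mixture fitted by expectation-maximization, with each spectrum's probability of being correct read off as a posterior. The slides do not specify the component shapes; two Gaussians are assumed here only to keep the example short, and the scores are synthetic. This is not Peptide Prophet's actual model or code.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_component_mixture(scores, iterations=100):
    """EM for a mixture of an 'incorrect' (low) and a 'correct' (high) Gaussian."""
    low, high = min(scores), max(scores)
    spread = (high - low) / 4.0
    mean_bad, mean_good = low + spread, high - spread
    sd_bad = sd_good = spread
    weight_good = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each score is from the 'correct' peak.
        resp = []
        for x in scores:
            good = weight_good * normal_pdf(x, mean_good, sd_good)
            bad = (1.0 - weight_good) * normal_pdf(x, mean_bad, sd_bad)
            resp.append(good / (good + bad))
        # M-step: re-estimate weights, means, and standard deviations.
        total_good = sum(resp)
        total_bad = len(scores) - total_good
        mean_good = sum(r * x for r, x in zip(resp, scores)) / total_good
        mean_bad = sum((1.0 - r) * x for r, x in zip(resp, scores)) / total_bad
        var_good = sum(r * (x - mean_good) ** 2 for r, x in zip(resp, scores)) / total_good
        var_bad = sum((1.0 - r) * (x - mean_bad) ** 2 for r, x in zip(resp, scores)) / total_bad
        sd_good = max(1e-6, math.sqrt(var_good))
        sd_bad = max(1e-6, math.sqrt(var_bad))
        weight_good = total_good / len(scores)
    return weight_good, (mean_bad, sd_bad), (mean_good, sd_good)

def probability_correct(score, model):
    """Posterior probability that a spectrum with this score is correct."""
    weight_good, (mean_bad, sd_bad), (mean_good, sd_good) = model
    good = weight_good * normal_pdf(score, mean_good, sd_good)
    bad = (1.0 - weight_good) * normal_pdf(score, mean_bad, sd_bad)
    return good / (good + bad)

rng = random.Random(2002)
# Synthetic scores: 700 incorrect identifications, 300 correct ones.
scores = [rng.gauss(0.0, 1.0) for _ in range(700)] + [rng.gauss(4.0, 1.0) for _ in range(300)]
model = fit_two_component_mixture(scores)
for s in (-1.0, 2.0, 5.0):
    print(f"score {s:+.1f}: Pr[correct] = {probability_correct(s, model):.3f}")
```

Database size never appears explicitly: it enters only through where the "incorrect" peak sits, which matches the slide's remark that it is included implicitly.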
Peptide Prophet
• Caveats
• Are spectrum scores sampled from the same
distribution?
• Are there enough correct identifications for a second
peak?
• Are spectra independent observations?
• Are distributions appropriately shaped?
• Huge improvement over raw SEQUEST
results
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
Peptides to Proteins
• A peptide sequence may occur in
many different protein sequences
• Variants, paralogues, protein families
• Separation, digestion and ionization are
not well understood
• Proteins in sequence database are
extremely non-random, and very
dependent
Peptides to Proteins
• Mascot
• Protein score is sum of peptide scores
• Assumes peptide identifications are
independent!
• SEQUEST
• Keeps only one of the proteins for each
peptide?
Peptides to Proteins
• Protein Prophet
• Nesvizhskii et al., Anal. Chem. 2003
• Models probability that a protein is correct
based on
• Probability that its peptides are correct
• Models probability that a peptide is correct
based on
• Probability that its proteins are correct
• Proteins with one high-probability peptide
are not eliminated
• …but are down-weighted
• Assumes identification probabilities from the
same protein are independent (like Mascot)
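Under the independence assumption named above, a protein is wrong only if every one of its peptide identifications is wrong. The sketch below shows that consequence alone. It leaves out the feedback from protein to peptide probabilities and the down-weighting of single-peptide proteins described on the slide, and the probabilities are made up.

```python
from math import prod

def protein_probability(peptide_probabilities):
    """Pr[protein correct] = 1 - product of Pr[each peptide is wrong],
    assuming the peptide identifications are independent."""
    return 1.0 - prod(1.0 - p for p in peptide_probabilities)

print(protein_probability([0.9, 0.8]))       # two good peptides
print(protein_probability([0.5, 0.5, 0.5]))  # three coin-flip peptides
print(protein_probability([0.99]))           # a single strong peptide
```

Three coin-flip peptides already give a protein probability of 0.875. If those identifications are not truly independent, for example repeated spectra of the same peptide, the result overstates the evidence.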
Peptides to Proteins
• Best available method, to date, is Protein
Prophet.
• The problem will only get worse, as we
search variants and isoform sequences
• Proteins do not have a single sequence!
• Peptide identification is not protein
identification!
Publication Guidelines
1. Computational parameters
• Spectral processing
• Sequence database
• Search program
• Statistical analysis
2. Number of peptides per protein
• Each peptide sequence counts once!
• Multiple forms of the same peptide
count once!
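The counting rule in guideline 2 amounts to collapsing identifications to distinct peptide sequences before counting. A minimal sketch, with a hypothetical record layout (protein, sequence, modification, charge) and made-up identifications:

```python
from collections import defaultdict

# Hypothetical identifications: (protein, peptide sequence, modification, charge).
identifications = [
    ("PROT_A", "DAFLGSFLVEK", None, 2),
    ("PROT_A", "DAFLGSFLVEK", None, 3),     # same peptide, different charge
    ("PROT_A", "SAMPLER", "oxidation", 2),  # same peptide, modified form
    ("PROT_A", "SAMPLER", None, 2),
    ("PROT_B", "VEGETK", None, 2),
]

def peptides_per_protein(records):
    """Count distinct peptide sequences per protein, so that charge states and
    modified forms of the same sequence are counted once."""
    sequences = defaultdict(set)
    for protein, sequence, _modification, _charge in records:
        sequences[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in sequences.items()}

print(peptides_per_protein(identifications))
```

Here PROT_A has four identifications but only two distinct peptides, and PROT_B is a single-peptide protein.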
Publication Guidelines
3. Single-peptide proteins must be explicitly
justified by
• Peptide sequence
• N and C terminal amino-acids
• Precursor mass and charge
• Peptide Scores
• Multiple forms of the peptide counted once!
4. Biological conclusions based on single-peptide proteins must show the spectrum
Publication Guidelines
5. More stringent requirements for PMF
data analysis
• Similar to that for tandem mass spectra
6. Management of protein redundancy
• Peptides identified from a different species?
7. Spectra submission encouraged
Summary
• Could guessing be as effective as a
search?
• More guesses improve the best guess
• Better guessers help us be more
discriminating
• Independent observations only count if
they are independent!
• Going from peptides to proteins is not as simple as it
seems
• Publication guidelines reflect sound
statistical principles.