
Improving the Reliability of
Peptide Identification by
Tandem Mass Spectrometry
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Mass Spectrometry for
Proteomics
• Measure mass of many (bio)molecules
simultaneously
• High bandwidth
• Mass is an intrinsic property of all
(bio)molecules
• No prior knowledge required
Mass Spectrometer
[Schematic: Sample → Ionizer (MALDI, Electrospray Ionization (ESI)) → Mass Analyzer (Time-Of-Flight (TOF), Quadrupole, Ion-Trap) → Detector (Electron Multiplier (EM))]
High Bandwidth
[Mass spectrum: % Intensity (0–100) vs. m/z (250–1000)]
Mass is fundamental!
Mass Spectrometry for
Proteomics
• Measure mass of many molecules
simultaneously
• ...but not too many, abundance bias
• Mass is an intrinsic property of all
(bio)molecules
• ...but need a reference to compare to
Mass Spectrometry for
Proteomics
• Mass spectrometry has been around
since the turn of the century...
• ...why is MS-based proteomics so new?
• Ionization methods
• MALDI, Electrospray
• Protein chemistry & automation
• Chromatography, Gels, Computers
• Protein sequence databases
• A reference for comparison
Sample Preparation for
Peptide Identification
Enzymatic Digest
and
Fractionation
Single Stage MS
[Spectrum: single-stage MS scan (m/z axis)]
Tandem Mass Spectrometry
(MS/MS)
[Spectra: full MS scan and precursor selection (m/z axes)]
Tandem Mass Spectrometry
(MS/MS)
[Spectra: precursor selection + collision-induced dissociation (CID) yields the MS/MS spectrum (m/z axes)]
The big picture...
• MS/MS spectra provide evidence for the
amino-acid sequence of functional
proteins.
• Key concepts:
• Spectrum acquisition is unbiased
• Direct observation of amino-acid sequence
• Sensitive to minor sequence variation
• Observed peptides represent folded proteins
Peptide Identification
• For each (likely) peptide sequence (sketched below):
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
• Peptide sequences from protein sequence
databases
• Swiss-Prot, IPI, NCBI’s nr, ...
• Automated, high-throughput peptide identification
in complex mixtures
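Under standard assumptions (singly charged b- and y-ions, a fixed fragment tolerance), that matching loop can be sketched in a few lines of Python. This is an illustrative sketch only, not any particular search engine's implementation: the residue-mass table is abbreviated, and fragment_mz, count_matches, and search are hypothetical helper names.

    # Illustrative sketch: score candidate peptides against one MS/MS spectrum
    # by counting matched b/y fragment ions. Residue masses are abbreviated;
    # tolerance and threshold values are placeholders.
    RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
                    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
                    "K": 128.09496, "E": 129.04259, "F": 147.06841, "R": 156.10111}
    PROTON, WATER = 1.00728, 18.01056

    def fragment_mz(peptide):
        """Singly charged b- and y-ion m/z values for a peptide sequence."""
        masses = [RESIDUE_MASS[aa] for aa in peptide]
        b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
        y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
        return b_ions + y_ions

    def count_matches(peptide, peaks, tol=0.5):
        """Number of predicted fragments within tol Da of an observed peak."""
        return sum(any(abs(frag - mz) <= tol for mz, _ in peaks)
                   for frag in fragment_mz(peptide))

    def search(peaks, candidate_peptides, min_matched=6):
        """Retain and rank candidates whose fragments match the spectrum well."""
        scored = [(count_matches(pep, peaks), pep) for pep in candidate_peptides]
        return sorted([sp for sp in scored if sp[0] >= min_matched], reverse=True)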
Peptide Identification, but...
• What about novel peptides?
• Search compressed ESTs (C3, PepSeqDB)
• What about peak intensity?
• Spectral matching using HMMs (HMMatch)
• Which identifications are correct?
• Unsupervised, model-free, result combiner
with false discovery rate estimation
Why don’t we see more
novel peptides?
• Tandem mass spectrometry doesn’t
discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence
databases biases the results towards
well-understood protein isoforms!
What goes missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
Why should we care?
• Alternative splicing is the norm!
• Only 20-25K human genes
• Each gene makes many proteins
• Proteins have clinical implications
• Biomarker discovery
• Evidence for SNPs and alternative splicing
stops with transcription
• Genomic assays, ESTs, mRNA sequence.
• Little hard evidence for translation start site
Novel Splice Isoform
• Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.
• LIME1 gene:
• LCK interacting transmembrane adaptor 1
• LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
Novel Splice Isoform
Novel Mutation
• HUPO Plasma Proteome Project
• Pooled samples from 10 male & 10 female
healthy Chinese subjects
• Plasma/EDTA sample protocol
• Li, et al. Proteomics 2005. (Lab 29)
• TTR gene
• Transthyretin (pre-albumin)
• Defects in TTR are a cause of amyloidosis.
• Familial amyloidotic polyneuropathy
• late-onset, dominant inheritance
Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
Novel Mutation
Searching ESTs
• Proposed long ago:
• Yates, Eng, and McCormack; Anal Chem, ’95.
• Now:
• Protein sequences are sufficient for protein identification
• Computationally expensive/infeasible
• Difficult to interpret
• Goal: make EST searching feasible for routine discovery of novel peptides.
Searching Expressed
Sequence Tags (ESTs)
Pros:
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples
Cons:
• No frame
• Large (8 Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%
Compressed EST Peptide
Sequence Database
• For all ESTs mapped to a UniGene gene:
• Six-frame translation
• Eliminate ORFs < 30 amino-acids
• Eliminate amino-acid 30-mers observed once
• Compress to C2 FASTA database (sketched below)
• Complete, Correct for amino-acid 30-mers
• Gene-centric peptide sequence database:
• Size: < 3% of naïve enumeration, 20774 FASTA entries
• Running time: ~ 1% of naïve enumeration search
• E-values: ~ 2% of naïve enumeration search results
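A rough sketch of the per-gene compression step is shown below, assuming Biopython is available for translation; the function names are illustrative, and the real pipeline additionally stitches the retained 30-mers back into compressed, gene-centric FASTA entries.

    # Rough sketch of building a compressed EST peptide database for one gene:
    # six-frame translation, ORF length filter, and removal of amino-acid
    # 30-mers observed only once across the gene's ESTs.
    from collections import Counter
    from Bio.Seq import Seq

    def six_frame_orfs(est, min_len=30):
        """Yield ORFs of at least min_len residues from all six reading frames."""
        seq = Seq(est)
        for strand in (seq, seq.reverse_complement()):
            for frame in range(3):
                codons = strand[frame:frame + 3 * ((len(strand) - frame) // 3)]
                for orf in str(codons.translate()).split("*"):
                    if len(orf) >= min_len:
                        yield orf

    def retained_kmers(ests, k=30, min_len=30):
        """Amino-acid k-mers seen more than once; the compressed FASTA entry is
        assembled from the overlapping retained k-mers."""
        counts = Counter(orf[i:i + k]
                         for est in ests
                         for orf in six_frame_orfs(est, min_len)
                         for i in range(len(orf) - k + 1))
        return {kmer for kmer, n in counts.items() if n > 1}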
PepSeq FASTA Databases
• Organisms:
• HUMAN, MOUSE, RAT, ZEBRA FISH
• Peptide Evidence:
• Genbank mRNA, EST, HTC
• RefSeq mRNA, Proteins
• Swiss-Prot/TrEMBL, EMBL, VEGA, H-Inv, IPI
Proteins
• Swiss-Prot variants
• Swiss-Prot signal peptide & init. Met removal
• Single FASTA entry per gene
Spectral Matching for
Peptide Identification
• Detection vs. identification
• Increased sensitivity & specificity
• No novel peptides!
• NIST GC/MS Spectral Library
• Identifies small molecules
• 100,000’s of (consensus) spectra
• Bundled/sold with many instruments
• “Dot-product” spectral comparison (sketched below)
• Current project: Peptide MS/MS
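For reference, the dot-product comparison used by spectral library search can be sketched as below; the unit-m/z binning and square-root intensity weighting are common choices assumed here, not taken from the NIST implementation.

    # Illustrative normalized dot-product between a query and a library spectrum.
    import math
    from collections import defaultdict

    def binned(peaks, bin_width=1.0):
        """Collapse (m/z, intensity) peaks into m/z bins with sqrt weighting."""
        vec = defaultdict(float)
        for mz, intensity in peaks:
            vec[int(mz / bin_width)] += math.sqrt(intensity)
        return vec

    def dot_product(query_peaks, library_peaks, bin_width=1.0):
        """Cosine-style similarity in [0, 1] between two peak lists."""
        q, l = binned(query_peaks, bin_width), binned(library_peaks, bin_width)
        num = sum(q[b] * l.get(b, 0.0) for b in q)
        norm = math.sqrt(sum(v * v for v in q.values()) *
                         sum(v * v for v in l.values()))
        return num / norm if norm else 0.0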
NIST MS Search: Peptides
Peptide DLATVYVDVLK
Protein Families
Peptide DLATVYVDVLK
Hidden Markov Models for
Spectral Matching
• Capture statistical variation and consensus in
peak intensity
• Only need 10 spectra to build a model
• Capture semantics of peaks
• Extrapolate model to other peptides
• Good specificity with superior sensitivity for
peptide detection
• Assign 1000’s of additional spectra (p-value < 10⁻⁵)
Hidden Markov Model
[HMM state diagram: Delete, Insert, and Ion states; (m/z, intensity) pairs are emitted by the Ion and Insert states]
The devil in the details
• Intensity normalization
• Discretize (m/z,int) pairs
• Viterbi distance as score
• Compute p-value using “random”
spectra
Random Spectra
• Uniform sample of (m/z, int)
• Permutation (m/z) of true spectra peaks (used in the p-value sketch below)
• M/z distribution between true spectra and uniform sample (parameter)
[Histogram: # of spectra vs. Viterbi score for true, false, and random spectra]
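A sketch of the empirical p-value computation follows, assuming a score_spectrum callable that stands in for the HMMatch (Viterbi) scorer and using the peak-permutation style of random spectrum described above.

    # Empirical p-value for an observed HMM score against "random" spectra
    # generated by permuting the m/z values of the observed peaks.
    import random

    def permuted_spectrum(peaks):
        """Random spectrum: shuffle m/z values, keep the intensity values."""
        mzs = [mz for mz, _ in peaks]
        random.shuffle(mzs)
        return list(zip(mzs, [intensity for _, intensity in peaks]))

    def empirical_p_value(score_spectrum, peaks, n_random=1000):
        """Fraction of random spectra scoring at least as well as the observed one."""
        observed = score_spectrum(peaks)
        random_scores = [score_spectrum(permuted_spectrum(peaks))
                         for _ in range(n_random)]
        hits = sum(score >= observed for score in random_scores)
        return (hits + 1) / (n_random + 1)   # add-one keeps the p-value nonzero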
HMM Peptide Identification Results – DLATV
[Histograms: # of spectra vs. Viterbi distance and vs. -log(p-value) for the DLAT model, broken out by True_test(0.0001), True_test(other), False_test(0.0001), and False_test(other) spectra]
Spectral Matching of
Peptide Variants
DFLAGGIAAAISK
DFLAGGVAAAISK
HMM model extrapolation
Mascot Search Results
Peptide Identification Results
• Search engines always provide an
answer
• Current search engines:
• Hard to determine “good” scores
• Significance estimates are unreliable
• Need better methods!
Common Algorithmic
Framework
• Pre-process experimental spectra
• Filter peptide candidates
• Score match between peptides and
spectra
• Rank peptides and assign
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
[Diagram: overlap of confident identifications among Mascot, OMSSA, and X!Tandem, with the percentage of spectra in each region]
Lots of published solutions!
• Treat search engines as black-boxes
• Apply supervised machine learning to
results
• Use multiple match metrics
• Combine/refine using multiple search
engines
• Agreement suggests correctness
• Use empirical significance estimates
• “Decoy” databases (FDR; sketched below)
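The decoy-database idea in the last bullet reduces to a simple ratio; the reversed or shuffled decoy construction itself is left out of this sketch.

    # Target-decoy FDR estimate: decoy hits above a score threshold approximate
    # the number of false target hits above the same threshold.
    def decoy_fdr(target_scores, decoy_scores, threshold):
        targets = sum(score >= threshold for score in target_scores)
        decoys = sum(score >= threshold for score in decoy_scores)
        return decoys / targets if targets else 0.0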
PepArML
• Peptide identification arbiter by
machine learning
• Unifies these ideas within a model-free, combining machine-learning framework
• Unsupervised training procedure
PepArML Overview
• Unify Tandem, Mascot, and OMSSA results
[Diagram: X!Tandem, Mascot, OMSSA, and other search engine results feed into PepArML, which labels each spectrum Identified or Unidentified]
Voting Heuristic Combiner
• Choose the peptide ID with the most votes (sketched below)
• Use best FDR as confidence
• Break ties (single votes) using FDR
• Strawman for comparison
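A minimal sketch of this voting heuristic, assuming each engine contributes one (peptide, estimated FDR) pair per spectrum:

    # Voting heuristic combiner: most votes wins; the best (lowest) FDR breaks
    # ties and serves as the confidence of the winning peptide.
    from collections import defaultdict

    def vote(engine_results):
        """engine_results: list of (peptide, fdr) pairs, one per search engine."""
        votes, best_fdr = defaultdict(int), {}
        for peptide, fdr in engine_results:
            votes[peptide] += 1
            best_fdr[peptide] = min(fdr, best_fdr.get(peptide, 1.0))
        winner = min(votes, key=lambda p: (-votes[p], best_fdr[p]))
        return winner, best_fdr[winner]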
Dataset construction
[Workflow: spectra are searched with X!Tandem, Mascot, OMSSA, and other tools; each spectrum/peptide comparison yields features (matched ions, peak intensity, mass delta, # of missed cleavages, peptide length, X!Tandem score, Mascot score, OMSSA score) that feed into machine learning]
Dataset construction
• Build feature vectors (sketched below)
[Table: spectrum/peptide pairs (S1, P1), (S1, P2), (S2, P1), ..., (Sn, Pm), each with X!Tandem, Mascot, and OMSSA features and a true (T) / false (F) label]
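One row of such a training set might look like the sketch below; the field names are assumptions for illustration, not the PepArML feature set verbatim.

    # Illustrative feature vector for one (spectrum, peptide) pair, labeled with
    # ground truth from the synthetic protein mixtures.
    def feature_vector(spectrum, peptide, tandem_hit, mascot_hit, omssa_hit, truth):
        return {
            "matched_ions": tandem_hit.get("matched_ions"),
            "mass_delta": spectrum["precursor_mass"] - peptide["mass"],
            "missed_cleavages": peptide["missed_cleavages"],
            "peptide_length": len(peptide["sequence"]),
            "tandem_score": tandem_hit.get("score"),
            "mascot_score": mascot_hit.get("score"),
            "omssa_score": omssa_hit.get("score"),
            "label": truth,   # True/False from the synthetic mixture
        }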
Dataset construction
• Synthetic protein mixtures provide ground truth
• C8
• 8 standard proteins (Calibrant Biosystems)
• 4594 MS/MS spectra (LTQ)
• 618 (11.2%) true positives
• S17
• 17 standard proteins (Sashimi Repository)
• 1389 MS/MS spectra (Q-TOF)
• 354 (25.4%) true positives
• AURUM
• 364 standard proteins (AURUM 1.0)
• 7508 MS/MS spectra (MALDI-TOF-TOF)
• 3775 (50.3%) true positives
Machine learning improves
single search engines (S17)
Multiple search engines are better than single search engines (S17)
Feature Evaluation
Application to Real Data
• How well do these models generalize?
• Different instruments
• Spectral characteristics change scores
• Search parameters
• Different parameters change score values
• Supervised learning requires
• (Synthetic) experimental data from every instrument
• Search results from available search engines
• Training/models for all parameters × search engine sets × instruments
Model Generalization
Rescuing Machine Learning
• Train a new machine-learning model for
every dataset!
• Generalization not required
• No predetermined search engines, parameters,
instruments, features
• Perhaps we can “guess” the true proteins
• Most proteins not in doubt
• Machine learning can tolerate imperfect labels
Unsupervised Learning
• Heuristic selection of “true” proteins
• Train classifier, predict true peptide IDs
• Update “true” proteins
• Heuristic selection of “true” proteins from
classifier predictions
• Iterate until convergence (sketched below)
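A sketch of this self-training loop follows; select_proteins, train_classifier, and the PSM fields are hypothetical stand-ins for the heuristic and learner actually used.

    # Unsupervised (self-training) loop: guess "true" proteins, label peptide
    # identifications accordingly, train, re-derive the protein set from the
    # classifier's predictions, and repeat until the set stops changing.
    def unsupervised_train(psms, select_proteins, train_classifier, max_iter=20):
        true_proteins = select_proteins(psms)           # initial heuristic guess
        model = None
        for _ in range(max_iter):
            labels = [psm["protein"] in true_proteins for psm in psms]
            model = train_classifier(psms, labels)
            predicted_true = [psm for psm, ok in zip(psms, model.predict(psms)) if ok]
            updated = select_proteins(predicted_true)
            if updated == true_proteins:                # converged
                break
            true_proteins = updated
        return model, true_proteins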
Unsupervised Learning Performance
Unsupervised Learning Convergence
Conclusions
• Proteomics can inform genome annotation
• Eukaryotic and prokaryotic
• Functional vs silencing variants
• Peptides identify more than just proteins
• Untapped source of disease biomarkers
• Computational inference can make a
substantial impact in proteomics
Conclusions
• Compressed peptide sequence
databases make routine EST
searching feasible
• HMMatch spectral matching improves
identification performance for familiar
peptides
• Unsupervised, model-free, combining
PepArML framework solves peptide
identification interpretation problem
Acknowledgements
• Chau-Wen Tseng, Xue Wu
• UMCP Computer Science
• Catherine Fenselau
• UMCP Biochemistry
• Cheng Lee
• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS