Faster, More Sensitive Peptide ID by Sequence DB Compression

Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
Why Tandem Mass Spectrometry?

LC-MS/MS spectra provide evidence for the amino-acid sequence and abundance of functional proteins.

Key concepts:
  Spectrum acquisition is unbiased by knowledge
  Direct observation of amino-acid sequence
  Sensitive to small sequence variations
  Spectrum acquisition is biased by abundance
2
Sample Preparation for MS/MS
Enzymatic Digest and Fractionation
3
Single Stage MS
MS
4
Tandem Mass Spectrometry (MS/MS)
Precursor selection
5
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID)
MS/MS
6
Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion                              ion   MW
88     b1    S            GFLEEDELK    y9    1080
145    b2    SG           FLEEDELK     y8    1022
292    b3    SGF          LEEDELK      y7    875
405    b4    SGFL         EEDELK       y6    762
534    b5    SGFLE        EDELK        y5    633
663    b6    SGFLEE       DELK         y4    504
778    b7    SGFLEED      ELK          y3    389
907    b8    SGFLEEDE     LK           y2    260
1020   b9    SGFLEEDEL    K            y1    147
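The b- and y-ion values in the table can be reproduced, to within about 1 Da, from residue masses. The following is a minimal sketch, assuming monoisotopic residue masses and singly charged fragments; the function name `fragment_ions` and the mass table (restricted to the residues in this peptide) are illustrative and not part of the slides.

```python
# Monoisotopic residue masses (Da); only the residues of the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056   # added to y ions (C-terminal fragment keeps the OH and an extra H)
PROTON = 1.00728   # charge carrier for a singly charged ion


def fragment_ions(peptide):
    """Return singly charged b and y fragment m/z values keyed by ion name."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        # b_i: the first i residues; y_i: the last i residues plus water.
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return ions


ions = fragment_ions("SGFLEEDELK")
for i in range(1, 10):
    print(f"b{i} {ions[f'b{i}']:9.3f}    y{10 - i} {ions[f'y{10 - i}']:9.3f}")
```

The slide lists nominal (integer) masses, so small differences from these monoisotopic values are expected.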
7
Unannotated Splice Isoform

Human Jurkat leukemia cell-line
  Lipid-raft extraction protocol, targeting T cells
  von Haller, et al. MCP 2003.
LIME1 gene:
  LCK interacting transmembrane adaptor 1
LCK gene:
  Leukocyte-specific protein tyrosine kinase
  Proto-oncogene
  Chromosomal aberration involving LCK in leukemias.
Multiple significant peptide identifications
8
Unannotated Splice Isoform
9
Unannotated Splice Isoform
10
Translation start-site correction

Halobacterium sp. NRC-1
  Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  Goo, et al. MCP 2003.
GdhA1 gene:
  Glutamate dehydrogenase A1
Multiple significant peptide identifications
Observed start is consistent with Glimmer 3.0 prediction(s)
11
Halobacterium sp. NRC-1
ORF: GdhA1
K-score E-value vs PepArML @ 10% FDR
Many peptides inconsistent with annotated translation start site of NP_279651
[Figure: peptide identifications plotted along an axis running from 0 to 440 in steps of 40]
12
Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long
13
Peptide Sequence Databases

Organism     Size (Entries)   Size (AA)
Human        74,976           248Mb
Mouse        55,887           171Mb
Rat          42,372           76Mb
Zebra-fish   40,490           94Mb

All amino-acid 30-mers, no redundancy
  From ESTs, Proteins, mRNAs
30-40 fold size and search time reduction
Formatted as a FASTA sequence database
  One entry per gene/cluster.
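To illustrate why keeping each 30-mer only once shrinks the search space, the sketch below counts distinct k-mers across overlapping input sequences. It is a toy illustration under stated assumptions, not the compression scheme used to build the databases above; the function name `distinct_kmers` and the example sequences are hypothetical, and k is set to 4 only to keep the example readable.

```python
def distinct_kmers(sequences, k=30):
    """Collect every length-k substring that occurs in any input sequence, once."""
    seen = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            seen.add(seq[start:start + k])
    return seen


# Two toy sequences that overlap heavily, as redundant ESTs or isoforms would.
toy = ["ABCDEFG", "CDEFGHI"]
k = 4
total = sum(len(s) - k + 1 for s in toy)   # k-mers counted with redundancy
unique = len(distinct_kmers(toy, k))       # k-mers counted once
print(f"{total} k-mers in the input, {unique} distinct")
```

The more the source sequences overlap, the larger the gap between the two counts, which is the redundancy a non-redundant peptide sequence database removes.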
14
Combine search engine results

No single score is comprehensive
Search engines disagree
Many spectra lack confident peptide assignment
[Figure: Venn diagram for SEQUEST, Mascot, and X! Tandem; region labels 28%, 14%, 14%, 38%, 1%, 3%, 2%]
Searle et al. JPR 7(1), 2008
15
Combining search engine results – harder than it looks!

Consensus boosts confidence, but...
  How to assess statistical significance?
  Gain specificity, but lose sensitivity!
  Incorrect identifications are correlated too!
How to handle weak identifications?
  Consensus vs disagreement vs abstention
  Threshold at some significance?
We apply "unsupervised" machine-learning....
  Lots of related work unified in a single framework.
16
Search Engine Info. Gain
17
PepArML Workflow
[Diagram: Spectra; Tandem, ..., Mascot, ..., OMSSA; Extract Peptides & Features]

Select high-quality IDs
  Guess true proteins from search results
Label spectra & train
Calibrate confidence
  Guess true proteins from ML results
Iterate!
Estimate FDR using (external) decoy

[Flowchart: Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Recalibrate Confidence as FDR (D1); Select "True" Proteins; Stable? (No / Yes); Output Peptide Spectrum Assignments]
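The final workflow step, estimating FDR from a decoy search, can be sketched as below. This is a minimal illustration of the common target-decoy estimate (decoy hits divided by target hits above a score threshold), assuming higher scores are better; it is not the PepArML code, and the names `decoy_fdr` and `psms` are hypothetical.

```python
def decoy_fdr(psms, threshold):
    """Estimate FDR at a score threshold from (score, is_decoy) pairs.

    The estimate is (# decoy PSMs passing) / (# target PSMs passing).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Toy scores: True marks a hit against the decoy database.
psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
for threshold in (9, 7, 5):
    print(threshold, round(decoy_fdr(psms, threshold), 3))
```

Sweeping the threshold in this way is what produces a false-discovery-rate curve.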
18
False-Discovery-Rate Curves
19
PepArML Meta-Search Engine
Single, simple search request
Secure communication
Heterogeneous compute resources
  Edwards Lab: Scheduler & 80+ CPUs
  NSF TeraGrid: 1000+ CPUs
  Amazon AWS
[Diagram engine lists: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch]
Scales easily to 250+ simultaneous searches
20
PeptideMapper Web Service
I’m Feeling Lucky
21
PeptideMapper Web Service
I’m Feeling Lucky
22
PeptideMapper Web Service

Suffix-tree index on peptide sequence database
  Fast peptide to gene/cluster mapping
  "Compression" makes this feasible
Peptide alignment with cluster evidence
  Amino-acid or nucleotide; exact & near-exact
Genomic-loci mapping via
  UCSC "known-gene" transcripts, and
  Predetermined, embedded genomic coordinates
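The idea of indexing a sequence database so that a peptide maps quickly to its gene/cluster entries can be sketched with a sorted list of suffixes and binary search. Note that this swaps in a naive suffix array for the suffix tree named on the slide, handles exact matches only, and is far less memory-efficient than a real index; the entry names and function names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_index(entries):
    """Build a sorted list of (suffix, entry name, offset) over all entries."""
    suffixes = []
    for name, seq in entries.items():
        for offset in range(len(seq)):
            suffixes.append((seq[offset:], name, offset))
    suffixes.sort()
    return suffixes


def find_peptide(index, peptide):
    """Return sorted (entry name, offset) pairs where the peptide occurs exactly."""
    # (peptide,) sorts before any (peptide + anything, name, offset) tuple,
    # so bisect_left lands on the first suffix that could start with the peptide.
    pos = bisect_left(index, (peptide,))
    hits = []
    while pos < len(index) and index[pos][0].startswith(peptide):
        hits.append((index[pos][1], index[pos][2]))
        pos += 1
    return sorted(hits)


# Hypothetical two-entry database, one sequence per gene/cluster.
entries = {"geneA": "MSGFLEEDELK", "geneB": "AAELKGG"}
index = build_suffix_index(entries)
print(find_peptide(index, "ELK"))
```

All suffixes that begin with the query peptide are adjacent in sorted order, so one binary search plus a short scan finds every occurrence.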
23
Systems Biology
molecular biology ↕ phenotype
molecular biology ↕ biology
Structured
High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics
Knowledge Databases
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation
24
Systems Biology
molecular biology ↕ phenotype
molecular biology ↕ biology
Structured
High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics
Knowledge Databases
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation
Functional Annotation Enrichment
Mathematical Models
25
Why not in proteomics?

Double counting and false positives…
  …due to traditional protein inference
Proteomics cannot see all proteins…
  …proteins are not equally likely to be drawn
Good relative abundance is hard…
  …extra chemistries, workflows, and software
  …missing values are particularly problematic
27
In proteomics…

Double counting and false positives…
  Use generalized protein parsimony
Proteomics cannot see all proteins…
  Use identified proteins as background
Good relative abundance is hard…
  Model differential spectral counts directly
28
Traditional Protein Parsimony

Select the smallest set of proteins that explain all identified peptides.
Sensible principle, implies
  Eliminate equivalent/subset proteins
Equivalent proteins are problematic:
  Which one to choose?
Unique-protein peptides force the inclusion of proteins into solution
  True for most tools, even probability based ones
  Bad consequences for FDR filtered ids
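The elimination of equivalent and subset proteins can be sketched as a set comparison over each protein's identified peptides. This is a minimal illustration, not the implementation of any particular tool; the function name and the toy protein/peptide labels are hypothetical. Equivalent proteins (identical peptide sets) are kept together as one group, which sidesteps the "which one to choose?" question rather than answering it.

```python
def eliminate_dominated(protein_peptides):
    """Group equivalent proteins and drop groups whose peptides are a
    proper subset of another group's peptides.

    protein_peptides: dict mapping protein name -> set of peptide strings.
    Returns a dict mapping a tuple of equivalent protein names -> peptide set.
    """
    groups = {}
    for protein, peptides in protein_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)

    kept = {}
    for peptides, proteins in groups.items():
        # '<' on frozensets tests for a proper subset.
        if any(peptides < other for other in groups):
            continue
        kept[tuple(sorted(proteins))] = set(peptides)
    return kept


toy = {
    "P1": {"a", "b", "c"},
    "P2": {"a", "b"},        # contained in P1, so it is dropped
    "P3": {"a", "b", "c"},   # equivalent to P1, so it is grouped with it
    "P4": {"d"},
}
print(eliminate_dominated(toy))
```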
29
Peptide-Spectrum Matches

Sigma49 – 32,691 LTQ MS/MS spectra of
49 human protein standards; IPI Human

Yeast – 162,420 LTQ MS/MS spectra from a
yeast cell lysate; SGD.

X!Tandem E-value (no refinement), 1% FDR
Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.
30
Many proteins are easy

Eliminate equivalent / dominated proteins
  Sigma49: 277 → 60 proteins
  Yeast: 1226 → 1085 proteins
Many components have a single protein:
  Sigma49: 52 (3 multi-protein)
  Yeast: 994 (43 multi-protein)
Single peptides force protein inclusion
  Sigma49: 16 single-peptide proteins
  Yeast: 476 single-peptide proteins
31
Must eliminate redundancy
[Figure: protein-by-peptide incidence matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112 over 37 distinct peptides]
Contained proteins should not be selected
32
Must eliminate redundancy
[Figure: protein-by-peptide incidence matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112, annotated with the values 1.0, 1.0, 0.8, 0.7, 0.0, 1.0 and the label "Single AA Difference"]
Contained proteins should not be selected
  Even if they have some probability mass
  Number of sibling peptides matters less if they are shared.
33
Must ignore some PSMs
[Figure: protein-by-peptide incidence matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112, annotated with the values 1.0, 0.0, 0.0, 0.0, 0.0, 1.0 and the label "Single AA Difference"]
A single additional peptide should not force protein into solution
34
Example from Yeast
[Figure: protein-by-peptide incidence matrix for YLR432W, YHR216W, YAR073W, and YML056C, annotated with the values 1.0, 0.6, 0.0, 1.0]
"Inosine monophosphate dehydrogenase"
  4 gene family
Contained proteins should not be selected
Single peptide evidence for YML056C
35
Must ignore some PSMs

Improving peptide identification sensitivity makes things worse!
False PSMs don't cluster
[Figure labels: PSMs; PSMs; 2x; Proteins; 10%]
36
Must ignore some PSMs

Improving peptide identification sensitivity makes things worse!
False PSMs don't cluster
[Figure labels: PSMs; PSMs; Select Proteins to Explain True PSM%; 90%; 90%]
37
Must ignore some PSMs
[Figure: protein-by-peptide incidence matrices for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, IPI00903112 and for YLR432W, YHR216W, YAR073W, YML056C]
How do we choose?
  Maximize # peptides?
  Minimize FDR (naïve model)?
  Maximize # PSMs?
38
Generalized Protein Parsimony

Weight peptides by number of PSMs
Constrain unique peptides per protein
Maximize explained peptides (PSMs)
Match PSM filtering FDR to % uncovered PSMs
Readily solved by branch-and-bound
  Permits complex protein/peptide constraints
  Reduces to traditional protein parsimony
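One possible reading of the points above, on a toy instance: choose a protein set in which every selected protein has at least `min_unique` peptides not shared with any other selected protein, and among such sets maximize the PSM-weighted peptides explained, preferring fewer proteins on ties. The sketch below uses exhaustive enumeration in place of branch-and-bound, so it only suits very small inputs, and the formulation itself is an assumption rather than the author's exact model. All names and data are hypothetical.

```python
from itertools import combinations


def generalized_parsimony(protein_peptides, psm_counts, min_unique=2):
    """Return (selected proteins, explained PSM count) for a toy instance.

    protein_peptides: dict protein -> set of peptides
    psm_counts: dict peptide -> number of PSMs supporting it
    """
    proteins = sorted(protein_peptides)
    best = (0, 0, ())  # (explained PSMs, -number of proteins, protein tuple)
    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            feasible = True
            for p in subset:
                # Peptides of p that no other selected protein also explains.
                others = set().union(*(protein_peptides[q] for q in subset if q != p))
                if len(protein_peptides[p] - others) < min_unique:
                    feasible = False
                    break
            if not feasible:
                continue
            covered = set().union(*(protein_peptides[p] for p in subset))
            explained = sum(psm_counts[pep] for pep in covered)
            best = max(best, (explained, -len(subset), subset))
    return best[2], best[0]


toy_proteins = {"A": {"p1", "p2", "p3"}, "B": {"p3", "p4"}, "C": {"p5"}}
toy_psms = {"p1": 5, "p2": 3, "p3": 4, "p4": 1, "p5": 1}

chosen, explained = generalized_parsimony(toy_proteins, toy_psms, min_unique=2)
uncovered = 1 - explained / sum(toy_psms.values())
print(chosen, explained, f"{uncovered:.1%} of PSMs left uncovered")
```

With `min_unique=1` every peptide in this toy instance ends up explained, which mirrors the remark that the generalized problem reduces to traditional parsimony; with `min_unique=2` the single-peptide protein C and the mostly shared protein B are left out, and the resulting uncovered PSM fraction is the quantity to compare against the PSM filtering FDR.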
39
Match uncovered PSMs to FDR
40
Plasma membrane enrichment

Pellicle enrichment of plasma membrane
  Choksawangkarn et al. JPR 2013 (Fenselau Lab)
Six replicate LC-MS/MS analyses each
  Cell-lysate (44,861 MS/MS)
  Fe3O4-Al2O3 pellicle (21,871 MS/MS)
625 3-unique proteins to match 10% FDR:
  Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
  89 proteins with significantly (< 10^-5) increased counts
41
Semi-quantitative LC-MS/MS
Precursor selection + collision induced dissociation (CID)
MS/MS
42
Semi-quantitative LC-MS/MS
43
Chen and Yates. Molecular Oncology, 2007
Plasma membrane enrichment

Na/K+ ATPase subunit alpha-1 (P05023):


Transferrin receptor protein 1 (P02786):


Lysate: 17; Pellicle: 63; p-value: 2.0 x 10^-11
DAVID Bioinformatics analysis (89/625):



Lysate: 1; Pellicle: 90; p-value: 5.2 x 10^-33
Plasma membrane (GO:0005886): 29 (5.2 x 10^-5)
Transmembrane (SwissProtKW): 24 (1.3 x 10^-6)
Transmembrane (SwissProtKW):

Lysate: 524; Pellicle: 1335; p-value: 2.6 x 10^-158
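A differential spectral count can be tested with a one-sided hypergeometric tail: given the total PSMs per condition, how likely is it that at least the observed number of a protein's PSMs fall in the pellicle by chance? The sketch below is one plausible form of such a test, assuming the totals of 18,976 lysate and 13,723 pellicle PSMs quoted earlier in the talk; it is not confirmed as the exact model behind the slide. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_tail(total, group, drawn, observed):
    """P(X >= observed) for X ~ Hypergeometric.

    total:    all PSMs across both conditions
    group:    PSMs from the condition of interest (here, pellicle)
    drawn:    PSMs assigned to the protein across both conditions
    observed: PSMs assigned to the protein in the condition of interest
    """
    upper = min(drawn, group)
    numerator = sum(
        comb(group, k) * comb(total - group, drawn - k)
        for k in range(observed, upper + 1)
    )
    # Exact integer arithmetic; only the final ratio is converted to float.
    return numerator / comb(total, drawn)


lysate_total, pellicle_total = 18976, 13723
lysate_count, pellicle_count = 1, 90   # the "Lysate: 1; Pellicle: 90" counts

p = hypergeom_tail(
    total=lysate_total + pellicle_total,
    group=pellicle_total,
    drawn=lysate_count + pellicle_count,
    observed=pellicle_count,
)
print(f"p = {p:.2e}")
```

Under these assumptions the tail probability for the 1 versus 90 counts comes out at roughly 5 x 10^-33, consistent with the p-value listed alongside those counts.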
44
Distribution of p-values (Yeast)
45
A protein's PSMs rise and fall
together!
46
A protein's PSMs rise and fall
together?
47
Anomalies indicate
proteoforms
48
HER2/Neu Mouse Model of Breast Cancer

Paulovich, et al. JPR, 2007
Study of normal and tumor mammary tissue by LC-MS/MS
  1.4 million MS/MS spectra
Peptide-spectrum assignments
  Normal samples (Nn): 161,286 (49.7%)
  Tumor samples (Nt): 163,068 (50.3%)
4270 proteins identified in total
  2-unique generalized protein parsimony
49
Nascent polypeptide-associated complex subunit alpha
7.3 x 10^-8
50
Pyruvate kinase isozymes M1/M2
2.5 x 10^-5
51
Summary
Improve the scope and sensitivity of peptide identification for genome annotation, using
  Exhaustive peptide sequence databases
  Machine-learning for combining
  Meta-search tools to maximize consensus
  Grid-computing for thorough search
52
Summary

Functional annotation enrichment for proteomics too:
  Careful counting (generalized parsimony)
  Differential abundance by spectral counts
Use (multivariate-)hypergeometric model for
  Differential abundance by spectral counts
  Proteoform detection
53
