Data-Size and False Positives - Edwards Lab
Algorithms and Computation: Bottom-Up Data Analysis Workflows
Nathan Edwards
Georgetown University Medical Center
Changing landscape
Experimental landscape
Computational landscape
Controlling for false proteins/genes
Improving peptide identification sensitivity
Data-size, cloud, resources, reliability
Data-size and false positive identifications
Spectra, sensitivity, resolution, samples
Machine-learning, multiple search engines
Filtered PSMs as a primary data-type
Changing Experimental Landscape
Instruments are faster…
Sensitivity improvements…
More fractionation (automation), deeper precursor sampling, ion optics
Resolution continues to get better…
More spectra, better precursor sampling
Accurate precursors (fragments) make a big difference
Analytical samples per study…
Fractionation, chromatography, automation improvements
Clinical Proteomic Tumor Analysis Consortium (NCI)
Comprehensive study of genomically characterized (TCGA) cancer biospecimens by mass-spectrometry-based proteomics workflows
~ 100 clinical tumor samples per study
Colorectal, breast, ovarian cancer
CPTAC Data Portal provides:
Raw & mzML spectra; TSV and mzIdentML PSMs; protein reports; experimental meta-data
CPTAC Data Portal
…from Edwards et al., Journal of Proteome Research, 2015
CPTAC/TCGA Colorectal Cancer (Proteome)
Vanderbilt PCC (Liebler)
95 TCGA samples, 15 fractions / sample
Label-free spectral count / precursor XIC quant.
Orbitrap Velos; high-accuracy precursors
1425 spectra files: ~600 GB raw, ~129 GB (mzML.gz)
Spectra: ~ 18M; ~ 13M MS/MS
~ 4.6M PSMs at 1% MSGF+ q-value (filtering sketched below)
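For orientation, here is a minimal sketch of how a q-value cutoff like this is typically computed from target-decoy competition; qvalue_filter and its inputs are illustrative assumptions, not the actual MSGF+ implementation.

    # Minimal target-decoy q-value sketch (illustrative; not the MSGF+ code path).
    def qvalue_filter(psms, threshold=0.01):
        # psms: list of (score, is_decoy) pairs; higher score = better match.
        psms = sorted(psms, key=lambda p: p[0], reverse=True)
        fdrs, decoys, targets = [], 0, 0
        for score, is_decoy in psms:
            decoys += is_decoy
            targets += not is_decoy
            fdrs.append(decoys / max(targets, 1))  # estimated FDR at this cutoff
        # q-value: smallest FDR achievable at a cutoff at least this permissive
        for i in range(len(fdrs) - 2, -1, -1):
            fdrs[i] = min(fdrs[i], fdrs[i + 1])
        return [psm for psm, q in zip(psms, fdrs) if q <= threshold and not psm[1]]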
Changing Computational Landscape
A single computer operating on a single spectral data-file is no longer feasible
MS/MS search is the computational bottleneck
Private computing clusters are quickly obsolete:
Need $$ to upgrade every 3-4 years
Personnel costs for cluster administration and management
Cloud computing gets faster and cheaper over time…
…but requires rethinking the computing model
PepArML Meta-Search Engine
Simple, unified peptide identification search parameterization and execution: Mascot, MSGF+, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch
Cluster, grid, and cloud scheduler:
Reliable batch spectra conversion and upload,
automated distribution of spectra and sequence,
job-failure tolerant with result-file validation
Machine-learning-based result combining (sketched below):
Model-free – heterogeneous features
Adapts to the characteristics of each dataset
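A rough sketch of the dispatch-and-combine pattern the slide describes; run_engine, validate, and combine_psms are hypothetical placeholders, not PepArML's actual interfaces.

    # Hypothetical meta-search loop (engine list from the slide; the rest is assumed).
    ENGINES = ["Mascot", "MSGF+", "X!Tandem", "K-Score", "S-Score", "OMSSA", "MyriMatch"]

    def meta_search(spectra, sequences, run_engine, validate, combine_psms, retries=3):
        results = {}
        for engine in ENGINES:
            for _ in range(retries):                             # job-failure tolerance
                result = run_engine(engine, spectra, sequences)  # cluster/grid/cloud job
                if result is not None and validate(result):      # result-file validation
                    results[engine] = result
                    break
        return combine_psms(results)  # machine-learning-based result combining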
PepArML Meta-Search Engine
[Diagram: a single, simple search request goes to the Edwards Lab scheduler (48+ CPUs), which distributes work over secure communication to heterogeneous compute resources: Georgetown & Maryland HPC and Amazon Web Services.]
Run all of the search engines!
Search Engine Running Time
Which (combination of) search engine(s) should I use?
Fault Tolerant Computing
Spot instances can be preempted by users willing to pay more (see the sketch below)
Spot prices are cheaper (7¢/hour vs 46¢/hour)
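At those prices, spot capacity is roughly 6-7x cheaper per hour, provided the workflow survives preemption. A minimal requeue-on-failure sketch, assuming idempotent work units; the queue and validation details are illustrative, not a specific AWS API.

    import queue

    def run_on_spot(work_queue, results, execute, validate):
        # Pull work units until the queue drains; a preempted or invalid job
        # simply goes back on the queue to be retried on another instance.
        while True:
            try:
                unit = work_queue.get_nowait()
            except queue.Empty:
                return
            try:
                result = execute(unit)       # may be killed by preemption
                if validate(result):         # result-file validation
                    results.append(result)
                else:
                    work_queue.put(unit)     # bad result file: requeue
            except Exception:
                work_queue.put(unit)         # preempted or crashed: requeue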
Identifications per $$
How long will a specific job take?
Wall-clock time can be significantly reduced:
How much memory / data-transfer is needed?
What is a good decomposition size?
Which cloud instance should we use?
…but management overhead costs too.
Cost of total compute may even increase.
Failed analyses cost too!
Data-Scale and False Positives
Big datasets have more false positive proteins and genes!
CPTAC Colorectal Cancer (CDAP)
4.6M MSGF+ 1% FDR PSMs + 2 peptides/gene
~ 10,000 genes identified…
…but ~ 40% gene FDR
Simple decoy protein model
Decoy peptides hit decoy proteins uniformly.
Each decoy peptide represents an independent trial.
Binomial distribution, parameterized by the size of the protein database and the number of decoy peptides (formalized below)
Big datasets have more decoy peptides!
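A minimal formalization of this model, assuming k decoy peptides assigned uniformly and independently over a database of N proteins (symbols introduced here for illustration): the number of decoy peptides landing on any one protein is Binomial(k, 1/N), so the expected number of proteins hit by at least one decoy peptide is

    E[\text{decoy proteins}] \;=\; N\left(1 - \left(1 - \tfrac{1}{N}\right)^{k}\right) \;\approx\; N\left(1 - e^{-k/N}\right)

which grows with k: more decoy peptides means more decoy proteins (and genes).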
Example (worked through below)
Large: 10,000 proteins, 100,000 peptides
Small: 1,000 proteins, 10,000 peptides
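A worked computation of expected decoy-protein counts for these two databases under the binomial model above; the assumption that 1% of the listed peptides are decoys is purely illustrative and not from the slides.

    def expected_decoy_proteins(n_proteins, n_decoy_peptides):
        # Each decoy peptide independently hits one of n_proteins uniformly.
        return n_proteins * (1 - (1 - 1 / n_proteins) ** n_decoy_peptides)

    # Illustrative assumption: 1% of the identified peptides are decoys.
    print(expected_decoy_proteins(10_000, 1_000))  # Large: ~952 proteins hit by a decoy
    print(expected_decoy_proteins(1_000, 100))     # Small:  ~95 proteins hit by a decoy

Even at the same decoy rate, the large dataset accumulates roughly ten times as many decoy proteins, which is why protein- and gene-level FDR degrades as datasets grow.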
Data-Size and False Positives
CPTAC Colorectal Cancer
Control of gene FDR requires even more stringent filtering of PSMs.
If we require strong evidence in all 95 samples:
1% FDR PSMs, but ~ 25% peptide FDR
~ 25,000 decoy peptides on ~ 20,000 genes
No decoy genes, but fewer than 1000 genes identified.
Bad scenario:
PDHA1 and PDHA2 in CPTAC Breast Cancer – shared and unique peptides
PDHA2 is testis-specific! With most evidence shared between the paralogs, even one false "unique" peptide is enough to report the testis-specific PDHA2 in breast tumor samples.
Improved Sensitivity
Machine-learning models
Combining multiple search engines
Agreement indicates good identifications
Both approaches are successful at boosting identifications, particularly when adaptable to each dataset.
Use additional metrics for good identifications
Watch for the use of decoys in training the model.
Both have scaling issues and lack transparency
…and may add noise to comparisons
PepArML Performance
[Figure: combined-search performance on the Standard Protein Mix Database (18 standard proteins, Mix 1), shown for LCQ, QSTAR, and LTQ-FT instruments.]
Search Engine Info. Gain
Precursor & Digest Info. Gain
Filtered PSMs as Primary Data
For large enough spectral datasets, we might choose best-effort peptide identification
Need a linear-time spectra → PSM algorithm
Filtered PSMs become primary data
Spectral counts become more quantitative
Do we work less hard to identify all spectra?
Output as genome alignments, BAM files?
How should PSMs be represented to maximize their utility? (one hypothetical record layout is sketched below)
What about decoy peptide identifications?
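One hypothetical answer to the representation question, sketched as a minimal self-describing record that serializes to TSV; every field name here is an assumption for illustration, not a proposed standard.

    from dataclasses import dataclass

    @dataclass
    class PSM:
        # Hypothetical minimal record; field names are illustrative only.
        spectrum_id: str   # stable pointer back to the raw/mzML scan
        peptide: str       # identified sequence, with modifications
        charge: int
        score: float       # engine score or combined probability
        qvalue: float      # filtered PSMs carry their own confidence
        is_decoy: bool     # keep decoy PSMs for downstream FDR re-estimation

        def to_tsv(self):
            return "\t".join(str(v) for v in vars(self).values())

Keeping decoy PSMs in the primary data, rather than discarding them at filtering time, is one way to let downstream protein- and gene-level analyses re-estimate FDR at their own granularity.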
[Figure: example identification – Nascent polypeptide-associated complex subunit alpha, 7.3 × 10^-8]
[Figure: example identification – Pyruvate kinase isozymes M1/M2, 2.5 × 10^-5]
Questions?