Data-Size and False Positives - Edwards Lab

Algorithms and Computation: Bottom-Up Data Analysis Workflows
Nathan Edwards
Georgetown University Medical Center

Changing landscape

- Experimental landscape
  - Spectra, sensitivity, resolution, samples
- Computational landscape
  - Data-size, cloud, resources, reliability
- Controlling for false proteins/genes
  - Data-size and false positive identifications
- Improving peptide identification sensitivity
  - Machine-learning, multiple search engines
- Filtered PSMs as a primary data-type

Changing Experimental Landscape

- Instruments are faster…
  - More spectra, better precursor sampling
- Sensitivity improvements…
  - More fractionation (automation), deeper precursor sampling, ion optics
- Resolution continues to get better…
  - Accurate precursors (and fragments) make a big difference
- Analytical samples per study…
  - Fractionation, chromatography, automation improvements

Clinical Proteomic Tumor Analysis Consortium (NCI)

- Comprehensive study of genomically characterized (TCGA) cancer biospecimens by mass-spectrometry-based proteomics workflows
- ~100 clinical tumor samples per study
  - Colorectal, breast, ovarian cancer
- CPTAC Data Portal provides
  - Raw & mzML spectra; TSV and mzIdentML PSMs; protein reports; experimental meta-data

CPTAC Data Portal

[Figure …from Edwards et al., Journal of Proteome Research, 2015]

CPTAC/TCGA Colorectal Cancer (Proteome)

- Vanderbilt PCC (Liebler)
- 95 TCGA samples, 15 fractions / sample
- Label-free spectral count / precursor XIC quantitation
- Orbitrap Velos; high-accuracy precursors
- 1425 spectra files: ~600 GB raw, ~129 GB (mzML.gz)
- Spectra: ~18M total; ~13M MS/MS
- ~4.6M PSMs at 1% MSGF+ q-value

Changing Computational Landscape

- A single computer operating on a single spectral data-file is no longer feasible
  - MS/MS search is the computational bottleneck
- Private computing clusters quickly become obsolete
  - Need $$ to upgrade every 3-4 years
  - Personnel costs for cluster administration and management
- Cloud computing gets faster and cheaper over time…
  - …but requires rethinking the computing model

PepArML Meta-Search Engine

- Simple, unified peptide identification search parameterization and execution:
  - Mascot, MSGF+, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch
- Cluster, grid, and cloud scheduler:
  - Reliable batch spectra conversion and upload
  - Automated distribution of spectra and sequence databases
  - Job-failure tolerant, with result-file validation
- Machine-learning-based result combining (see the sketch below):
  - Model-free – heterogeneous features
  - Adapts to the characteristics of each dataset

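To make the combining idea concrete, here is a minimal sketch of model-free, target-vs-decoy training over heterogeneous per-PSM features. The feature layout and the random-forest choice are illustrative assumptions, not PepArML's actual implementation.

```python
# Minimal sketch of machine-learning-based result combining.
# Assumptions: one feature column per engine score plus generic PSM
# properties; a random forest stands in for whatever model-free learner
# is used. Not PepArML's actual code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def combine_results(features: np.ndarray, is_decoy: np.ndarray) -> np.ndarray:
    """features: (n_psms, n_features); is_decoy: boolean per PSM.
    Returns a combined score per PSM (probability of being target-like)."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    # Targets are noisy positives; decoys are known negatives.
    # Cross-validated prediction keeps each PSM out of its own training
    # fold, so decoy-trained scores remain usable for FDR estimation.
    scores = cross_val_predict(model, features, (~is_decoy).astype(int),
                               cv=3, method="predict_proba")[:, 1]
    return scores
```
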
PepArML Meta-Search Engine

[Architecture diagram: a single, simple search request goes to the Edwards Lab scheduler (48+ CPUs); spectra and work are distributed over secure communication to heterogeneous compute resources, including Georgetown & Maryland HPC and Amazon Web Services]

Run all of the search engines!

Search Engine Running Time

Which (combination of) search engine(s) should I use?

Fault Tolerant Computing

- Spot instances can be preempted for those willing to pay more
- Spot prices are cheaper (7¢/hour vs 46¢/hour; cost comparison below)

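A back-of-the-envelope comparison under an assumed retry model: if a preempted hour of work must simply be redone, expected spot cost grows by a geometric retry factor. Only the two prices come from the slide; the 20% hourly preemption rate is invented for illustration.

```python
# Expected compute cost on spot vs on-demand instances, assuming a
# preempted hour of work is redone from scratch (geometric retry model).
# Prices are from the slide; the preemption rate is an assumption.
def expected_spot_cost(work_hours, rate, p_preempt_per_hour):
    retry_factor = 1.0 / (1.0 - p_preempt_per_hour)  # expected tries per hour
    return work_hours * retry_factor * rate

on_demand = 10 * 0.46                      # 10 h at 46 cents/hour
spot = expected_spot_cost(10, 0.07, 0.20)  # 7 cents/hour, 20% preemption
print(f"on-demand: ${on_demand:.2f}  spot: ${spot:.2f}")
```

Even with substantial re-work, spot pricing wins at these rates; the engineering cost is the fault tolerance needed to make re-runs automatic.
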
Identifications per $$

- How long will a specific job take?
  - How much memory / data-transfer is needed?
  - What is a good decomposition size?
  - Which cloud-instance type to use?
- Wall-clock time can be significantly reduced…
  - …but management overhead costs too.
  - Total compute cost may even increase (see the toy model below).
  - Failed analyses cost too!

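The toy cost model referenced above: splitting the search into more parallel chunks shortens wall-clock time, but fixed per-chunk overhead (startup, data transfer, scheduling) inflates total billed compute. All constants are invented for illustration.

```python
# Toy model: wall-clock time vs total cost as a function of how finely
# a search is decomposed. All constants are illustrative assumptions.
SEARCH_CPU_HOURS = 100.0   # total search work
OVERHEAD_HOURS = 0.25      # per-chunk startup + data-transfer overhead
RATE = 0.07                # $/instance-hour

for n_chunks in (1, 10, 100, 1000):
    hours_per_chunk = SEARCH_CPU_HOURS / n_chunks + OVERHEAD_HOURS
    wall_clock = hours_per_chunk                    # chunks run in parallel
    total_cost = n_chunks * hours_per_chunk * RATE  # billed across chunks
    print(f"{n_chunks:4d} chunks: {wall_clock:7.2f} h wall-clock, "
          f"${total_cost:6.2f} total")
```
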
Data-Scale and False Positives

- Big datasets have more false-positive proteins and genes!
- CPTAC Colorectal Cancer (CDAP)
  - 4.6M MSGF+ 1% FDR PSMs + 2 peptides/gene
  - ~10,000 genes identified…
  - …but ~40% gene FDR (estimation sketch below)

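The ~40% figure is a decoy-based estimate. Here is a minimal sketch of how gene-level FDR can be estimated: apply the same two-peptides-per-gene rule to target and decoy genes, then take the ratio of survivors. This is the generic target-decoy recipe, not CDAP's exact procedure.

```python
# Sketch of decoy-based gene-level FDR: filter target and decoy genes
# with the same 2-peptide rule, then estimate FDR as decoys/targets.
# Generic target-decoy recipe, not CDAP's exact procedure.
from collections import defaultdict

def gene_level_fdr(psms, min_peptides=2):
    """psms: iterable of (gene, peptide_sequence, is_decoy) tuples for
    PSMs that already pass the 1% PSM-level threshold."""
    peptides_per_gene = defaultdict(set)
    for gene, peptide, is_decoy in psms:
        peptides_per_gene[(gene, is_decoy)].add(peptide)
    passing = {decoy: sum(1 for (g, d), peps in peptides_per_gene.items()
                          if d == decoy and len(peps) >= min_peptides)
               for decoy in (False, True)}
    return passing[True] / max(passing[False], 1)
```
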
Simple decoy protein model

- Decoy peptides hit decoy proteins uniformly.
- Each decoy peptide represents an independent trial.
- Binomial distribution in terms of:
  - size of the protein database
  - number of decoy peptides
- Big datasets have more decoy peptides! (sketch below)

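In this model, each of n decoy peptides independently hits one of N decoy proteins with probability 1/N, so per-protein hit counts are Binomial(n, 1/N). A sketch of the expected number of decoy proteins that survive a k-peptide rule:

```python
# Binomial decoy-protein model from the slide: hits per decoy protein
# are Binomial(n_decoy_peptides, 1/n_proteins).
from scipy.stats import binom

def expected_passing_decoy_proteins(n_proteins, n_decoy_peptides,
                                    min_peptides=2):
    """Expected decoy proteins with >= min_peptides decoy-peptide hits."""
    p_pass = binom.sf(min_peptides - 1, n_decoy_peptides, 1.0 / n_proteins)
    return n_proteins * p_pass
```
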
Example

- Large: 10,000 proteins, 100,000 peptides
- Small: 1,000 proteins, 10,000 peptides

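Reading the slide's two scenarios as database size plus number of observed decoy peptides (an assumption; the slide does not label the numbers), the sketch above can be applied directly:

```python
# The slide's scenarios, under the assumed reading above.
print(expected_passing_decoy_proteins(10_000, 100_000))  # large
print(expected_passing_decoy_proteins(1_000, 10_000))    # small
```

Under this reading the per-protein hit rate is identical in both cases (ten decoy peptides per protein on average), but the large scenario yields ten times as many passing decoy proteins in absolute terms, which is the scaling problem the previous slide warns about.
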
Data-Size and False Positives

- CPTAC Colorectal Cancer
  - 1% FDR PSMs, but ~25% peptide FDR
  - ~25,000 decoy peptides on ~20,000 genes
  - Control of gene FDR requires even more stringent filtering of PSMs.
- If we require strong evidence in all 95 samples:
  - No decoy genes, but fewer than 1000 genes identified.
- Bad scenario (sketch below):
  - PDHA1 and PDHA2 in CPTAC Breast Cancer – shared and unique peptides
  - PDHA2 is testis-specific!

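The PDHA1/PDHA2 case is a shared-peptide trap: peptides common to both paralogs can pull in a gene with no unique evidence. A minimal sketch of flagging such genes; the peptide-to-gene map is a hypothetical placeholder, not the actual CPTAC peptides.

```python
# Sketch: flag genes supported only by peptides shared with other genes
# (e.g., PDHA2 pulled in by peptides it shares with PDHA1).
# The peptide-to-gene assignments below are hypothetical placeholders.
from collections import defaultdict

peptide_to_genes = {
    "PEPTIDESHARED": {"PDHA1", "PDHA2"},  # maps to both paralogs
    "PEPTIDEUNIQUE": {"PDHA1"},           # unique to PDHA1
}

unique_evidence = defaultdict(int)
for genes in peptide_to_genes.values():
    if len(genes) == 1:
        unique_evidence[next(iter(genes))] += 1

for gene in sorted(set().union(*peptide_to_genes.values())):
    if unique_evidence[gene] == 0:
        print(f"{gene}: shared-peptide evidence only -- suspect")
```
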
Improved Sensitivity

- Machine-learning models
  - Use additional metrics to recognize good identifications
  - Watch for the use of decoys in training the model.
- Combining multiple search engines
  - Agreement indicates good identifications
- Both approaches are successful at boosting IDs, particularly when adaptable to each dataset.
- Both have scaling issues and lack transparency
  - …may add noise to comparisons

PepArML Performance

[Figure: performance on the Standard Protein Mix Database (18 standard proteins, Mix1) across LCQ, QSTAR, and LTQ-FT instruments]

Search Engine Info. Gain

[Figure: information gain contributed by each search engine]

Precursor & Digest Info. Gain

[Figure: information gain from precursor and digest features]

Filtered PSMs as Primary Data

- For large enough spectral datasets, we might choose best-effort peptide identification
  - Filtered PSMs become primary data
  - Spectral counts become more quantitative
  - We work less hard to identify all spectra?
- Need a linear-time spectra → PSM algorithm
- How should PSMs be represented to maximize their utility?
  - Output as genome alignments, BAM files? (sketch below)
  - What about decoy peptide identifications?

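The genome-alignment idea has been explored in proBAM-style formats; here is a rough sketch of emitting one PSM as a SAM record. The coordinates, custom tags, and flag values are simplified placeholders, not a validated proBAM writer.

```python
# Rough sketch: one PSM as a SAM-format alignment line, in the spirit of
# BAM output for peptide identifications. Coordinates and custom tags
# (XP = peptide, XQ = q-value) are placeholders, not a real spec.
def psm_to_sam(psm_id, chrom, pos, peptide, q_value):
    cigar = f"{3 * len(peptide)}M"   # 3 nt per residue; ignores splicing
    fields = [
        psm_id, "0", chrom, str(pos), "255", cigar,
        "*", "0", "0", "*", "*",     # no mate; nucleotide SEQ/QUAL omitted
        f"XP:Z:{peptide}",
        f"XQ:f:{q_value}",
    ]
    return "\t".join(fields)

print(psm_to_sam("scan=12345", "chr1", 100000, "ELVISLIVESK", 0.0042))
```
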
Nascent polypeptide-associated complex subunit alpha

[Figure; value shown: 7.3 × 10⁻⁸]

Pyruvate kinase isozymes M1/M2

[Figure; value shown: 2.5 × 10⁻⁵]

Questions?