New Developments for Open-Source Shotgun Proteomics Analysis with the Trans-Proteomic Pipeline
Joshua Tasman1, Luis Mendoza1, David Shteynberg1, James Eddes2, Ning Zhang1, Nichole King1, Chee-Hong Wong3, Brian Pratt4, Patrick Pedrioli2, Henry Lam1, Eric Deutsch1, Jimmy Eng5, Xiao-jun Li6, Alexey Nesvizhskii7, Andrew Keller8, and Ruedi Aebersold2
1Institute for Systems Biology, Seattle, WA; 2Institute for Molecular Systems Biology (ETH), Zurich, Switzerland; 3Bioinformatics Institute, Singapore; 4Insilicos LLC, Seattle, WA; 5University of Washington, Seattle, WA; 6Homestead Clinical, Seattle, WA; 7Department of Pathology, University of Michigan, Ann Arbor, MI; 8Rosetta Biosoftware, Seattle, WA
End-to-End MS/MS Proteomics Analysis with the TPP
Overview
•We have developed a complete, end-to-end data analysis pipeline
that provides an automated, reliable, consistent, and objective
analysis of high-throughput quantitative LC-MS/MS data from
multiple data sources using multiple search engines.
•The Trans-Proteomic Pipeline (TPP) is a complete, mature suite
of software tools for MS data representation, MS data visualization,
peptide identification and validation, protein identification,
quantification and annotation, data storage and mining, and
biological inference.
MS/MS data: Conversion from proprietary (vendor) to open formats
•Choice of common open formats: mzXML (SPC/ISB) or mzML (HUPO PSI,
SPC/ISB, and others; see flagship poster 001)
•Converters for Thermo Xcalibur (.raw), Waters MassLynx (.raw directory),
ABI/MDS Analyst (.wiff), Agilent MassHunter (.d directory), and others
[Figure: TPP data flow diagram. Box labels: raw MS/MS data file, mzXML document, mzML document, pepXML document, protXML document, downstream analysis with other TPP-compatible SPC tools]
Data storage and mining with PeptideAtlas and SBEAMS
(Systems Biology Experiment Analysis Management System):
• Data products of the TPP analysis
pipeline are imported into the
database
• Data exploration, annotation, and
correlation with other experiments
can all be managed
• Interface allows flexible analysis
of the data, including analysis across
multiple experiments
•The TPP has been adopted throughout the international proteomics
community and is in use at many prominent academic and corporate labs.
•We present an overview of the TPP and describe newly available
functionality. All software tools are freely available under an open-source software license at tools.proteomecenter.org.
Peptide ID Validation with PeptideProphet
[Figure: PeptideProphet model results]
The Trans-Proteomic Pipeline (TPP) is an open-source software package with
well-established community acceptance. The TPP provides a completely free,
open-source proteomics analysis solution, spanning: conversion of raw MS/MS
data to open formats and standards; support for searching MS/MS spectra with
various search engines, including the bundled X!Tandem engine
(www.thegpm.org) as well as Sequest, Mascot, Phenyx, OMSSA, and others;
conversion of search engine results to a uniform open format; statistical
validation of peptide identifications with PeptideProphet; statistically validated
protein identification with ProteinProphet; quantitative proteomics (SILAC,
ICAT, iTRAQ, etc.) with XPRESS, ASAPRatio, and Libra; and tools for
visualization of and interaction with results. Here we present recent updates to
the software tools to improve analysis functionality and user experience.
Methods
The TPP is constantly improved with new functionality. Highlights of major recent
developments include:
•Build system improvements and native Windows distribution: Insilicos had previously
released their own version of the TPP (the "IPP"). To combine efforts more
efficiently, Insilicos has integrated their customizations into the main TPP distribution. The
TPP build system now supports a native Windows distribution, which brings
significant performance improvements as well as easier installation;
•Implementation of vendor raw-to-mzML data converters and full support for parsing mzML
throughout the TPP;
•PeptideProphet, the TPP module for peptide ID validation, has been updated with
an additional model that compares observed retention time against calculated
peptide hydrophobicity. A high-mass-accuracy model also improves
discrimination of IDs in data from newer instruments. Decoy database entries can now be
used in distribution modeling, and a semi-parametric distribution model allows
better discrimination of true- and false-positive results;
•Inclusion of X!Tandem (from the GPM project) for a complete, end-to-end MS/MS
searching and validation solution;
•Upcoming multi-experiment data integration with iProphet (see Poster TPU 669);
•Spectral library searching with SpectraST.
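One way to picture how decoy entries can inform distribution modeling: hits to decoy proteins are known to be incorrect, so their scores give a direct sample of the incorrect-score distribution. The sketch below is only an illustration under stated assumptions, not the PeptideProphet implementation. It assumes decoy proteins carry a `DECOY_` name prefix and that a mean and standard deviation summarize the scores adequately; the function name `incorrect_score_model` is hypothetical.

```python
from statistics import mean, stdev

def incorrect_score_model(hits, decoy_prefix="DECOY_"):
    """Estimate mean and standard deviation of scores for known-incorrect hits.

    hits: iterable of (protein_name, discriminant_score) pairs.
    decoy_prefix: assumed naming convention for decoy database entries.
    """
    decoy_scores = [score for name, score in hits if name.startswith(decoy_prefix)]
    if len(decoy_scores) < 2:
        raise ValueError("need at least two decoy hits to estimate a spread")
    return mean(decoy_scores), stdev(decoy_scores)

# Made-up example data: three decoy hits and two target hits.
hits = [
    ("DECOY_P1", -2.0),
    ("P2", 4.5),
    ("DECOY_P3", -1.0),
    ("DECOY_P4", 0.0),
    ("P5", 3.8),
]
mu, sd = incorrect_score_model(hits)
print(mu, sd)
```

Parameters estimated this way could seed or constrain the "incorrect" component of a score mixture model instead of relying on the model to discover it unaided.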
[Plot: number of spectra versus discriminant score, with separate curves labeled "incorrect" and "correct"]
Spectral search engine output: Conversion to open formats
•Supports most common commercial and open-source data formats:
Sequest, Mascot, X!Tandem, SpectraST, Phenyx, and others
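Converting every search engine's output to one open format means downstream tools need only a single parser. The sketch below shows the idea on a hand-written, heavily simplified pepXML-like fragment. Real pepXML files carry an XML namespace and many more attributes, and the helper name `top_hits` is hypothetical, not a TPP function.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment; not output from any TPP tool.
SAMPLE = """<msms_pipeline_analysis>
  <msms_run_summary>
    <spectrum_query spectrum="run1.00001.00001.2" assumed_charge="2">
      <search_result>
        <search_hit hit_rank="1" peptide="PEPTIDEK" protein="P1">
          <analysis_result analysis="peptideprophet">
            <peptideprophet_result probability="0.98"/>
          </analysis_result>
        </search_hit>
        <search_hit hit_rank="2" peptide="PEPTLDEK" protein="P9"/>
      </search_result>
    </spectrum_query>
    <spectrum_query spectrum="run1.00002.00002.3" assumed_charge="3">
      <search_result>
        <search_hit hit_rank="1" peptide="SAMPLER" protein="P2"/>
      </search_result>
    </spectrum_query>
  </msms_run_summary>
</msms_pipeline_analysis>"""

def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def top_hits(xml_text):
    """Return (spectrum, peptide, protein, probability) for each rank-1 hit.

    probability is None when no PeptideProphet result is attached to the hit.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for query in root.iter():
        if local_name(query.tag) != "spectrum_query":
            continue
        for hit in query.iter():
            if local_name(hit.tag) != "search_hit" or hit.get("hit_rank") != "1":
                continue
            probability = None
            for result in hit.iter():
                if local_name(result.tag) == "peptideprophet_result":
                    probability = float(result.get("probability"))
            rows.append((query.get("spectrum"), hit.get("peptide"),
                         hit.get("protein"), probability))
    return rows

for row in top_hits(SAMPLE):
    print(row)
```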
• The majority of peptide assignments
by search engines are incorrect
• Manual validation is time-consuming, subjective, and
impossible to compare
• Applies statistical principles to
automate peptide validation
• Validates peptide assignments by
Sequest, Mascot, X!Tandem,
Phenyx, SpectraST, and others
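The statistical validation described above can be pictured as a two-component mixture: each discriminant score is drawn either from a "correct" or an "incorrect" distribution, and Bayes' rule gives the probability that an assignment is correct. The sketch below is a minimal illustration, assuming both components are Gaussian with fixed, made-up parameters. In practice the distributions are learned from the data and need not be Gaussian; `prob_correct` is a hypothetical name.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def prob_correct(score, prior_correct, correct=(3.0, 1.5), incorrect=(-1.5, 1.0)):
    """Posterior probability that an assignment with this score is correct.

    prior_correct: assumed fraction of correct assignments in the dataset.
    correct, incorrect: (mean, std dev) of each component; illustrative
    values only, not parameters taken from PeptideProphet.
    """
    pos = prior_correct * normal_pdf(score, *correct)
    neg = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)

# With a low prior (most assignments incorrect), only high scores earn
# a probability close to 1.
for s in (-2.0, 0.0, 2.0, 5.0):
    print(s, round(prob_correct(s, prior_correct=0.3), 3))
```

The prior matters: lowering `prior_correct` lowers the posterior for every score, which is how a dataset dominated by incorrect assignments is handled without a fixed score cutoff.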
Protein Identification
ProteinProphet takes as input a list of peptides and
probabilities and infers the proteins in the sample:
• Groups peptides according to
their corresponding protein
• Adjusts individual peptide
probabilities to account for
new protein grouping
information
• Finds simplest list of proteins
sufficient to explain all
observed peptides (Occam’s
razor approach)
• Computes accurate protein
probabilities
• Integrates with protein-level
quantitation results
• Allows meaningful comparison between results
of different experiments
[Plot: sensitivity and error rate versus minimum probability threshold]
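To illustrate how peptide probabilities can be combined into a protein probability, the sketch below groups peptides by protein and computes the chance that at least one peptide is correct, 1 - prod(1 - p_i). This is a simplification under an independence assumption: it omits the adjustment of peptide probabilities and the handling of peptides shared between proteins that the list above describes. The function name `protein_probabilities` is hypothetical.

```python
from collections import defaultdict
from math import prod

def protein_probabilities(peptide_hits):
    """Combine peptide probabilities into one probability per protein.

    peptide_hits: iterable of (protein, peptide, probability) triples.
    Repeated observations of the same peptide count once, using the
    highest probability seen.
    """
    best = defaultdict(dict)
    for protein, peptide, p in peptide_hits:
        best[protein][peptide] = max(p, best[protein].get(peptide, 0.0))
    # Probability that at least one distinct peptide is correct.
    return {
        protein: 1.0 - prod(1.0 - p for p in peptides.values())
        for protein, peptides in best.items()
    }

# Made-up example data.
hits = [
    ("A", "PEP1", 0.9),
    ("A", "PEP2", 0.5),
    ("A", "PEP1", 0.6),   # repeat observation of PEP1, lower probability
    ("B", "PEP3", 0.2),
]
print(protein_probabilities(hits))
```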
Additional visualization, statistical analysis,
and exploration tools enabling investigation
of biological meaning and significance with
Gaggle-compatible tools such as
Cytoscape (network visualization), the
stats package R, and the PIPE (Protein
Inference and Property Explorer)
• Robust: learns distributions of search scores
and peptide properties among correct and
incorrect results
• Accurate: probabilities are true measures of
confidence
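If the probabilities are accurate, they can be summed to predict how many accepted identifications are correct, which gives an estimated error rate and sensitivity for any minimum probability threshold. The sketch below is a minimal illustration of that arithmetic; the function name and the example values are hypothetical.

```python
def error_rate_and_sensitivity(probabilities, min_probability):
    """Estimate error rate and sensitivity at a probability threshold.

    error rate  = expected fraction of accepted results that are incorrect
    sensitivity = expected fraction of all correct results that are accepted
    """
    accepted = [p for p in probabilities if p >= min_probability]
    total_correct = sum(probabilities)
    if not accepted or total_correct == 0:
        return 0.0, 0.0
    expected_correct = sum(accepted)
    error_rate = 1.0 - expected_correct / len(accepted)
    sensitivity = expected_correct / total_correct
    return error_rate, sensitivity

# Made-up probabilities for four identifications.
probs = [1.0, 1.0, 0.5, 0.0]
for threshold in (0.5, 0.9):
    print(threshold, error_rate_and_sensitivity(probs, threshold))
```

Raising the threshold trades sensitivity for a lower error rate, which is the relationship a sensitivity and error-rate plot summarizes.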
Introduction
High-throughput LC-MS/MS is capable of simultaneously identifying and
quantifying thousands of proteins in a complex sample. The consistent and
objective analysis of the large amounts of data obtained is challenging and
time-consuming. Over the past 5 years, we have developed and refined a data
analysis pipeline that facilitates and standardizes such analysis.
PeptideProphet performs post-search processing to
compute probabilities that peptide assignments from
MS/MS spectra are correct.
[Figure: diagram labels: spectral search engine results file, pepXML document, protXML document; plot axis label: percentage]
Quantitation
ASAPRatio and XPRESS calculate relative
abundance of proteins labeled with heavy and light
(2 channel) isotope tags:
• Evaluate peptide ratio from
multiple charge states (ASAPRatio)
• Apply statistical methods to
evaluate protein ratios and
standard deviations
• Quantify ICAT, SILAC, and
many other samples
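As a rough illustration of turning peptide-level light/heavy measurements into a protein ratio with a spread, the sketch below averages peptide ratios in log space. It assumes peak areas have already been extracted and is not the ASAPRatio or XPRESS algorithm; `protein_log_ratio` is a hypothetical name.

```python
import math
from statistics import mean, stdev

def protein_log_ratio(peptide_pairs):
    """Return (mean, std dev) of log2(light/heavy) over a protein's peptides.

    peptide_pairs: list of (light_area, heavy_area) tuples.
    Pairs with a non-positive area are skipped because the log ratio is
    undefined for them.
    """
    logs = [
        math.log2(light / heavy)
        for light, heavy in peptide_pairs
        if light > 0 and heavy > 0
    ]
    if not logs:
        raise ValueError("no usable light/heavy pairs")
    spread = stdev(logs) if len(logs) > 1 else 0.0
    return mean(logs), spread

# Made-up peak areas; the third pair is skipped (heavy area is zero).
pairs = [(200.0, 100.0), (400.0, 100.0), (100.0, 0.0)]
log_ratio, spread = protein_log_ratio(pairs)
print("light/heavy ratio:", 2 ** log_ratio, "log2 std dev:", spread)
```

Averaging in log space treats a two-fold increase and a two-fold decrease symmetrically, which a plain average of ratios does not.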
Libra performs quantification
on MS/MS spectra that have
multi-reagent labeled (4 or 8
channel) peptides, such as
iTRAQ labeled samples.
•Compute protein ratios
automatically and accurately
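For multi-channel labels, one simple way to compare channels is to normalize each spectrum's reporter-ion intensities to sum to one and then average the normalized profiles over all spectra assigned to a protein. The sketch below illustrates that idea with a 4-channel example. It is an assumption-laden illustration, not Libra's actual procedure, and both function names are hypothetical.

```python
def normalize_channels(intensities):
    """Scale one spectrum's reporter-ion intensities so they sum to 1."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("spectrum has no reporter-ion signal")
    return [x / total for x in intensities]

def protein_channel_profile(spectra):
    """Average the normalized channel profiles of a protein's spectra.

    spectra: list of per-spectrum intensity lists, one value per channel.
    """
    if not spectra:
        raise ValueError("no spectra supplied")
    normalized = [normalize_channels(s) for s in spectra]
    n = len(normalized)
    return [sum(column) / n for column in zip(*normalized)]

# Made-up 4-channel reporter intensities for two spectra of one protein.
spectra = [
    [100.0, 100.0, 200.0, 0.0],
    [10.0, 30.0, 40.0, 20.0],
]
print(protein_channel_profile(spectra))
```

Normalizing per spectrum keeps one very intense spectrum from dominating the protein-level profile.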
Results and Conclusions
[Figure label: Spectra information (mzXML/mzML/mzData document)]
Improvements from the Insilicos TPP version (IPP) have
been merged into the TPP, and the build system has been
improved to allow native Windows deployment. Significant
speed improvements have already been seen from
moving away from a distribution based on a Unix emulation
layer (Cygwin). Discrimination of true- versus false-positive
peptide IDs has been improved through the addition of the
decoy, retention time, and high-mass-accuracy
PeptideProphet models, as well as through the use of a
semi-parametric model for describing peptide population
distributions. The open-source search engine X!Tandem
is now bundled with the TPP, allowing us to provide a
complete and free solution for proteomics analysis. Work
has begun on integrating the OMSSA open-source engine
as well.
This work is performed under the Seattle Proteome Center, supported by NHLBI contract No. N01-HV-28179. We
would also like to thank all Aebersold Lab and external developers who have contributed to this project.