New developments for open-source shotgun proteomics analysis

New Developments for Open-Source Shotgun
Proteomics Analysis with the Trans-Proteomic Pipeline
Luis Mendoza1, Natalie Tasman1, David Shteynberg1, James Eddes2, Ning Zhang1, Nichole King1, Chee-Hong Wong3, Brian Pratt4, Patrick Pedrioli2, Henry Lam1, Eric Deutsch1, Jimmy Eng5, Xiao-jun Li6, Alexey Nesvizhskii7, Andrew Keller8, and Ruedi Aebersold2
1 Institute for Systems Biology, Seattle, WA, United States; 2 Institute for Molecular Systems Biology (ETH), Zurich, Switzerland; 3 Bioinformatics Institute, Singapore; 4 Insilicos LLC, Seattle, WA;
5 University of Washington, Seattle, WA; 6 Homestead Clinical, Seattle, WA, US; 7 Department of Pathology, University of Michigan, Ann Arbor, MI, US; 8 Rosetta Biosoftware, Seattle, WA
Overview
End-to-End MS/MS Proteomics Analysis with the TPP
• We have developed a complete, end-to-end data analysis
pipeline that provides an automated, reliable, consistent, and
objective analysis of high-throughput quantitative LC-MS/MS data
from multiple data sources using multiple search engines.
• The Trans-Proteomic Pipeline (TPP) is a complete and mature
suite of software tools for MS data representation, MS data
visualization, peptide identification and validation, protein
identification, quantification, and annotation, data storage and
mining, and biological inference.
MS/MS data: Conversion from proprietary
(vendor) to open formats
raw MS/MS data file
• Choice of common open formats: mzXML (SPC/ISB) or
mzML (HUPO PSI, SPC/ISB, and others)
• Converters for Thermo Xcalibur (.raw), Waters MassLynx
(.raw directory), ABI/MDS Analyst (.wiff), Agilent MassHunter
(.d directory) and others
mzML document
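As a rough illustration of what an open format makes possible for downstream tools, the sketch below counts spectra per MS level in a stripped-down mzML fragment using only the Python standard library. The fragment is a hypothetical toy document (not schema-valid), and `count_ms_levels` is an illustrative helper, not part of the TPP. It relies on the mzML XML namespace and the PSI-MS controlled-vocabulary term MS:1000511 ("ms level").

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}
MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS term "ms level"

# Hypothetical toy fragment: real mzML files carry many more required elements.
EXAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="toy_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""


def count_ms_levels(mzml_text):
    """Return {ms_level: number_of_spectra} for an mzML document given as text."""
    root = ET.fromstring(mzml_text)
    counts = Counter()
    for spectrum in root.iterfind(".//mz:spectrum", MZML_NS):
        # Only direct cvParam children of <spectrum> describe the spectrum itself.
        for param in spectrum.iterfind("mz:cvParam", MZML_NS):
            if param.get("accession") == MS_LEVEL_ACCESSION:
                counts[int(param.get("value"))] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_ms_levels(EXAMPLE))
```

Because the format is plain XML with controlled-vocabulary terms, the same few lines work regardless of which vendor instrument produced the original raw file.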
The TPP is continually improved with new functionality.
Highlights of major recent (4.0+) developments include:
• Build system improvements and native Windows
distribution: Insilicos had previously released its own version
of the TPP (the "IPP"). To combine efforts more
efficiently, Insilicos has integrated its customizations into the
main TPP distribution. The TPP build system has been upgraded
to allow a native Windows distribution, which brings significant
performance improvements as well as easier installation;
• Implementation of vendor raw MS/MS-to-mzML data converters
and full support for parsing mzML input throughout the TPP;
• PeptideProphet, the TPP module for peptide ID validation,
has been updated with additional modeling capabilities that
compare observed retention time against calculated
peptide hydrophobicity. Additionally, a high-mass-accuracy
model improves discrimination of IDs with data from newer
instruments. Decoy database entries can now be used
in distribution modeling. A semi-parametric
distribution model allows better discrimination of true- and
false-positive results;
• Inclusion of X!Tandem (from the GPM project) for a complete,
end-to-end MS/MS searching and validation solution;
• Upcoming multi-experiment data integration with iProphet;
• Spectral library searching with SpectraST, as well as library
download and management utilities;
• UI upgrades, including integration of the Libra conditions editor;
• New tool to generate decoy (reverse) databases.
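The decoy (reverse) database idea in the last item can be sketched in a few lines. This is a minimal illustration, not the TPP tool itself: the function names and the `DECOY_` header prefix are assumptions chosen for the example. Each target protein sequence is reversed and appended to the database under a prefixed header, so that search hits to decoys can later be recognized.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def add_reversed_decoys(fasta_text, prefix="DECOY_"):
    """Return FASTA text holding every target followed by its reversed decoy.

    The prefix is a hypothetical convention used only in this sketch.
    """
    targets = list(parse_fasta(fasta_text))
    decoys = [(prefix + header, seq[::-1]) for header, seq in targets]
    return "".join(f">{header}\n{seq}\n" for header, seq in targets + decoys)


if __name__ == "__main__":
    toy = ">P1 toy protein\nMKTAYIAK\nQR\n>P2\nPEPTIDE\n"
    print(add_reversed_decoys(toy), end="")
```

Reversing keeps the amino acid composition and length of each protein unchanged, which is why reversed sequences are a common choice for decoys.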
Spectra Information (mzXML/mzML/mzData document)
• Evaluate peptide ratio from multiple
charge states (ASAP)
• Apply statistical methods to evaluate
protein ratios and standard deviations
• Quantify ICAT, SILAC, and many other
samples
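To make the step from peptide ratios to a protein ratio with a standard deviation concrete, here is a simplified sketch. It is not the ASAPRatio algorithm: it assumes that heavy/light peptide ratios are averaged in log space (a geometric mean) and that the spread is converted back with first-order error propagation. The function name `protein_ratio` is hypothetical.

```python
import math
import statistics


def protein_ratio(peptide_ratios):
    """Combine peptide heavy/light ratios into (protein_ratio, approx_std_dev).

    Assumption for this sketch: ratios are averaged in log space, and the
    standard deviation of the logs is scaled by the ratio to approximate
    the spread on the linear scale.
    """
    logs = [math.log(r) for r in peptide_ratios if r > 0]
    if not logs:
        raise ValueError("need at least one positive peptide ratio")
    mean_log = sum(logs) / len(logs)
    sd_log = statistics.stdev(logs) if len(logs) > 1 else 0.0
    ratio = math.exp(mean_log)
    return ratio, ratio * sd_log


if __name__ == "__main__":
    ratio, sd = protein_ratio([1.8, 2.1, 2.4])
    print(f"protein ratio {ratio:.2f} +/- {sd:.2f}")
```

Working in log space treats a two-fold increase and a two-fold decrease symmetrically, which a plain arithmetic mean of ratios does not.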
[Plot axis: discriminant score]
Quantitation
ASAPRatio and XPRESS calculate
relative abundances of proteins labeled
with heavy and light (2 channel) isotope
tags:
• The majority of peptide assignments by
search engines are incorrect
• Manual validation is time-consuming, subjective, and
impossible to compare
• Applies statistical principles to
automate peptide validation
• Validates peptide assignments by
Sequest, Mascot, X!Tandem,
Phenyx, SpectraST, OMSSA, and
others
• Robust: learns distributions of search scores
and peptide properties among correct and
incorrect results
• Accurate: probabilities are true measures of
confidence
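The idea of learning score distributions for correct and incorrect results, and turning them into probabilities, can be illustrated with Bayes' rule on a two-component mixture. The sketch below is a deliberate simplification, not PeptideProphet's model: both components are assumed to be Gaussians with equal width, and their parameters and the prior are fixed, hypothetical values rather than learned from the data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_correct(score, prior_correct,
                      correct=(3.0, 1.5), incorrect=(-1.5, 1.5)):
    """Probability that an assignment with this discriminant score is correct.

    `correct` and `incorrect` are (mean, sd) pairs; the defaults are
    hypothetical values chosen only for illustration.
    """
    pos = prior_correct * normal_pdf(score, *correct)
    neg = (1.0 - prior_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)


if __name__ == "__main__":
    for s in (-3.0, 0.75, 6.0):
        print(s, round(posterior_correct(s, prior_correct=0.3), 4))
```

In practice the mixture parameters and the prior would be estimated from each dataset, which is what allows the probabilities to adapt to different instruments and search engines.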
Protein Identification
ProteinProphet takes as input a list of peptides and
probabilities and infers the proteins in the sample:
Here we present recent updates to the software tools to improve
analysis functionality and user experience.
Methods
[Plot legend: incorrect, correct]
Libra performs quantification on
MS/MS spectra of peptides labeled with multi-channel reagents,
such as 4-plex and 8-plex iTRAQ samples.
• Compute protein ratios automatically
and accurately
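A sketch of the reporter-ion step behind this kind of quantification: for each MS/MS spectrum, the intensity near each reporter m/z is summed and expressed as a fraction of the total reporter signal. This is not Libra's implementation; the reporter m/z values are approximate 4-plex iTRAQ masses, and the tolerance and function name are assumptions for the example.

```python
# Approximate 4-plex iTRAQ reporter ion m/z values (assumed for this sketch).
ITRAQ4_REPORTERS = (114.1, 115.1, 116.1, 117.1)


def reporter_fractions(peaks, reporters=ITRAQ4_REPORTERS, tol=0.2):
    """Return each channel's share of the total reporter-ion intensity.

    `peaks` is a list of (mz, intensity) pairs from one MS/MS spectrum.
    `tol` is a hypothetical m/z tolerance around each reporter mass.
    """
    totals = [
        sum(intensity for mz, intensity in peaks if abs(mz - target) <= tol)
        for target in reporters
    ]
    grand_total = sum(totals)
    if grand_total == 0:
        return [0.0] * len(reporters)
    return [t / grand_total for t in totals]


if __name__ == "__main__":
    spectrum = [(114.11, 100.0), (115.11, 200.0), (116.11, 300.0),
                (117.12, 400.0), (500.30, 999.0)]
    print(reporter_fractions(spectrum))
```

Peaks outside the reporter windows, such as the fragment at m/z 500.30 in the usage example, are ignored.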
pepXML document
protXML document
• Groups peptides according to
their corresponding protein
• Adjusts individual peptide
probabilities to account for
new protein grouping
information
• Finds simplest list of proteins
sufficient to explain all
observed peptides (Occam’s
razor approach)
• Computes accurate protein
probabilities
• Integrates with protein-level
quantitation results
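Two of the ideas in this list can be sketched compactly. The first function combines peptide probabilities into a protein probability as the chance that at least one peptide assignment is correct. The second picks a short protein list that explains all observed peptides using a greedy set-cover heuristic. Both are simplified stand-ins, not ProteinProphet's implementation: the adjustment of peptide probabilities and the handling of shared peptides are omitted, and the greedy heuristic is a swapped-in approximation of the Occam's razor step.

```python
def protein_probability(peptide_probs):
    """Probability that at least one of the protein's peptides is correct."""
    p_none_correct = 1.0
    for p in peptide_probs:
        p_none_correct *= (1.0 - p)
    return 1.0 - p_none_correct


def minimal_protein_list(protein_to_peptides):
    """Greedily choose proteins until every observed peptide is explained.

    `protein_to_peptides` maps a protein name to a set of peptide sequences.
    Ties are broken alphabetically so the result is deterministic.
    """
    uncovered = set()
    for peptides in protein_to_peptides.values():
        uncovered |= peptides
    chosen = []
    while uncovered:
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


if __name__ == "__main__":
    print(protein_probability([0.9, 0.5]))
    mapping = {"A": {"p1", "p2", "p3"}, "B": {"p2"}, "C": {"p3", "p4"}}
    print(minimal_protein_list(mapping))
```

In the usage example, protein B is dropped because its only peptide is already explained by protein A.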
[Plot axis: percentage]
The TPP is an open-source software package with well-established
community acceptance. The TPP provides a completely free, open-source proteomics analysis solution, spanning:
• Conversion of raw MS/MS data to open formats and standards
• Support for searching MS/MS spectra with various search
engines, including the bundled X!Tandem engine
(www.thegpm.org) as well as Sequest, Mascot, Phenyx, OMSSA,
and others
• Conversion of search engine results to a uniform open format
• Statistical validation of peptide identifications with
PeptideProphet
• Statistically validated protein identification with ProteinProphet
• Quantitative proteomics (SILAC, ICAT, iTRAQ, etc.) with
XPRESS, ASAPRatio, and Libra
• Tools for visualization of and interaction with results.
[Plot label: model results]
High throughput LC-MS/MS is capable of simultaneously identifying
and quantifying thousands of proteins in a complex sample. The
consistent and objective analysis of the large amounts of obtained
data is challenging and time-consuming. Over the past 5 years, we
have developed and refined a data analysis pipeline that facilitates
and standardizes such analysis.
Supports the most common commercial
and open-source data formats:
• Sequest, Mascot, X!Tandem,
SpectraST, Phenyx, and others
Peptide ID Validation with PeptideProphet
PeptideProphet performs post-search processing to
compute probabilities that peptide assignments from
MS/MS spectra are correct.
Introduction
Spectral search engine output:
Conversion to open formats
pepXML document
• We present an overview of the TPP and describe newly available
functionality. All software tools are freely available under an open-source software license at: http://tools.proteomecenter.org
Spectral search engine results file
[Plot axis: number of spectra]
• The TPP has been adopted throughout the international
proteomics community and is in use at many prominent academic and
corporate labs.
mzXML document
[Figure: sensitivity and error rate versus minimum probability threshold]
• Allows meaningful comparison between results
of different experiments
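When probabilities are accurate measures of confidence, the sensitivity and error rate at any minimum probability threshold can be estimated from the probabilities alone, which is what makes results from different experiments comparable. The sketch below shows one way to compute both quantities, assuming the probabilities are well calibrated. The function name is hypothetical.

```python
def error_and_sensitivity(probabilities, threshold):
    """Estimate (error_rate, sensitivity) for results with p >= threshold.

    Assumes each probability is the true chance that the assignment is
    correct, so the sum of probabilities is the expected number of
    correct assignments.
    """
    accepted = [p for p in probabilities if p >= threshold]
    total_expected_correct = sum(probabilities)
    if not accepted or total_expected_correct == 0:
        return 0.0, 0.0
    expected_correct = sum(accepted)
    error_rate = 1.0 - expected_correct / len(accepted)
    sensitivity = expected_correct / total_expected_correct
    return error_rate, sensitivity


if __name__ == "__main__":
    probs = [0.99, 0.9, 0.6, 0.1]
    for t in (0.0, 0.5, 0.95):
        err, sens = error_and_sensitivity(probs, t)
        print(f"threshold {t:.2f}: error rate {err:.3f}, sensitivity {sens:.3f}")
```

Raising the threshold lowers the estimated error rate at the cost of sensitivity, so two experiments can be compared at the same error rate rather than at the same raw search score.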
Downstream analysis with other TPP-compatible SPC tools
Data storage and mining with PeptideAtlas and SBEAMS (Systems Biology
Experiment Analysis Management System):
• Data products of the TPP analysis pipeline are imported into the database
• Data exploration, annotation, and correlation with other experiments can all be
managed
• Interface allows flexible analysis of the data, including analysis across multiple experiments
Additional visualization, statistical analysis, and exploration tools enable investigation
of biological meaning and significance with Gaggle-compatible tools such as
Cytoscape (network visualization), the R statistics package, and the PIPE (Protein
Inference and Property Explorer)
Results and Conclusions
Improvements from the Insilicos TPP version (IPP) have been merged into the TPP, and the
build system has been updated to allow native Windows deployment. Significant speed
improvements have already been seen from moving away from a distribution based on a
Unix emulation layer (Cygwin). Discrimination of true- versus false-positive peptide IDs has been
improved through the addition of the decoy, retention time, and high-mass-accuracy
PeptideProphet models, as well as through the use of a semi-parametric distribution for
describing peptide population distributions. The open-source search engine X!Tandem is
now bundled with the TPP, allowing us to provide a complete and free solution for
proteomics analysis, as well as tools for the emerging area of spectral library searching.
Work has begun on integrating the OMSSA open-source engine as well.
This project has been funded by a grant to the Seattle Proteome Center from the National Heart,
Lung, and Blood Institute, National Institutes of Health, under contract No. N01-HV-28179.