Advanced CPAS ASMS 2007

Advanced CPAS
Adam Rauch
LabKey Software
[email protected]
Agenda
• Demo of recent & advanced features
• Pipeline architecture & configuration
• Production installations
What Is CPAS?
A proteomics analysis system
that handles all data processing & management
for high-throughput labs and core facilities
Demo
“Mini-Pipeline”
• Included & configured in standard install
• CPAS invokes executables (tandem, tpp)
directly on web server
• Simple approach works fine for low-throughput evaluation installs
FHCRC Installation
[Architecture diagram; labeled components:]
• CPAS
• Pipeline
• Web Server (2 Proc, 2GB): Tomcat
• Pipeline Mgr
• Mass Spec PC
• Database Server (4 Proc, 4GB): MS SQL Server
• NetApp
• File Server (Sun Hierarchical Storage)
• mzXML Conversion Server
• Cluster
• 20+ TB
• Tape Robot
Production Pipeline
• Multi-server, clustered, high-throughput pipeline demands a
more sophisticated approach
• CPAS interface for configuring, submitting jobs is identical, but
pipeline control & communication is handled differently
• Each project typically configured with separate “pipeline root”
• User initiates search by selecting raw file and specifying
search parameters (protocol)
• CPAS writes settings file to raw-file directory
• Background process (cron job) running on pipeline server
sees new job and kicks off pipeline processing
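The hand-off in the last two bullets (CPAS writes a settings file, a background job notices it) can be sketched as a directory scan. This is only an illustration, not the FHCRC scripts, which are written in Perl. The settings file name `search.settings` and the `.started` marker file are hypothetical names chosen for the example.

```python
from pathlib import Path

SETTINGS_NAME = "search.settings"  # hypothetical name, not necessarily what CPAS writes
MARKER_NAME = ".started"           # hypothetical marker so each job is picked up once


def find_new_jobs(pipeline_root):
    """Return directories under pipeline_root with a settings file not yet picked up."""
    root = Path(pipeline_root)
    new_jobs = []
    for settings in sorted(root.rglob(SETTINGS_NAME)):
        job_dir = settings.parent
        if not (job_dir / MARKER_NAME).exists():
            new_jobs.append(job_dir)
    return new_jobs


def claim_job(job_dir):
    """Mark a job directory as started so the next scan skips it."""
    (Path(job_dir) / MARKER_NAME).touch()
```

A scheduled job would call `find_new_jobs` on each pipeline root, call `claim_job` on every directory returned, and then start processing.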
CPAS Pipeline
Automated pipeline moves MS2 data from instrument, through
MS/MS search and post-processing, and into CPAS
[Pipeline diagram: Sample Input → instrument (LTQ FT, MALDI, LCQ) → Raw File → Convert Server → mzXML File → MS/MS Search Cluster (X! Tandem, SEQUEST, MASCOT; XPRESS, Peptide/Protein Prophet) → mzXML, pepXML, protXML Files → CPAS. Also labeled: PC #40]
Production Pipeline Workflow
• Cron job state machine manages workflow
– Initiates RAW → mzXML conversion
• Conversion server (ConversionQueue)
• Vendor-specific DLLs require Windows server
– Submits MS/MS search to cluster scheduler
– Submits post-processing jobs to cluster scheduler
– Handles fractionation scenarios (individual, multi)
– When processing is complete, instructs CPAS to load run
• Job status is reported via log files, which CPAS reads
to update web UI
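The workflow above can be read as a small state machine. The sketch below assumes a strictly linear order (convert, search, post-process, load) taken from the bullets. It leaves out fractionation handling and failure recovery, and the names `Step` and `advance` are illustrative, not part of CPAS.

```python
from enum import Enum


class Step(Enum):
    CONVERT = "RAW to mzXML conversion"
    SEARCH = "MS/MS search"
    POSTPROCESS = "post-processing"
    LOAD = "load run into CPAS"
    DONE = "complete"


# Linear ordering assumed from the workflow bullets; a real pipeline
# also has to deal with fractions and failed steps.
NEXT_STEP = {
    Step.CONVERT: Step.SEARCH,
    Step.SEARCH: Step.POSTPROCESS,
    Step.POSTPROCESS: Step.LOAD,
    Step.LOAD: Step.DONE,
}


def advance(current, step_succeeded):
    """Return the step to run next; stay on the current step until it succeeds."""
    if current is Step.DONE or not step_succeeded:
        return current
    return NEXT_STEP[current]
```

Each scheduled pass would check whether the current step has finished and call `advance` once per job.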
Search Engine Configuration
• SEQUEST cluster uses “SequestQueue”
– Custom Tomcat/Java web application
– Installed on head node of cluster
– Pipeline communicates with SequestQueue over HTTP
• Pipeline drives Mascot cluster directly via HTTP
• Pipeline drives X! Tandem via cluster scheduler
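One way to picture the three submission paths is a lookup table from search engine to transport. The sketch below only restates the bullets as data. The dictionary keys and the `EngineRoute` type are hypothetical, and no real endpoint or scheduler command is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineRoute:
    engine: str
    transport: str  # how the pipeline reaches the engine
    target: str     # what the pipeline talks to


ROUTES = {
    "sequest": EngineRoute("SEQUEST", "http", "SequestQueue web application on the cluster head node"),
    "mascot": EngineRoute("Mascot", "http", "Mascot cluster"),
    "xtandem": EngineRoute("X! Tandem", "scheduler", "cluster scheduler"),
}


def route_for(engine_key):
    """Look up how a search for the given engine is submitted."""
    try:
        return ROUTES[engine_key.lower()]
    except KeyError:
        raise ValueError(f"search engine not configured: {engine_key!r}") from None
```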
Configuring A Production Pipeline
• Install, customize Perl scripts that manage the workflow
– Scripts used at Fred Hutchinson are available as an example
• Configure conversion server
– Converters & vendor-specific DLLs
• Install TPP, MS/MS search engine(s) on cluster
• Enable your search engine(s) within CPAS
• Install CPAS FTP server (optional)
– Useful to allow external collaborators to submit jobs to pipeline
• Configure pipeline email notifications (optional)
– Email notifications for completion and/or failures
Demo
Production Installation
Web & Database Servers
• Server operating system(s)
– CPAS runs on all popular operating system platforms
– Solaris, Linux, Windows, OS X installations
– Windows has somewhat easier install & upgrade process
• Graphical installer
• Pre-compiled binaries
– Select OS that you & your IT staff are most comfortable with
• Database server
– PostgreSQL: runs on all popular hardware/OS platforms, free
– Microsoft SQL Server: Windows only, commercial, well tested
• Server hardware
– Invest in database server: powerful server, ample storage,
reliability
– Web server much less demanding
IT Infrastructure
• Shared file system (NFS)
– CPAS and pipeline need access to a
common NFS
– Archive RAW, mzXML, pepXML, etc. files
• Need plan for backing up NFS and
database
Select Administrators
• Database administrator
• Server administrators
• CPAS site administrators
• CPAS project administrators
Production Installation
Customization & Settings
• Many settings for customizing CPAS to your needs
– Fully documented on www.labkey.org
– Review all settings carefully on a regular basis
• CPAS settings are handled in several places
– Most configuration is done via the “Admin Console”
– <tomcat>/conf/server.xml
– <tomcat>/conf/Catalina/localhost/labkey.xml
Database
• JDBC parameters specified in labkey.xml
– Driver class (PostgreSQL vs. SQL Server)
– URL includes server name, port, database name
– User name & password
• Protect your data
– CPAS database user needs read/write/delete/update perms
– Use a strong password!
– Provide no access to database server outside firewall
• PGTest and jtdstest tools can help test config
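As a rough illustration of how the JDBC pieces fit together, the sketch below assembles a driver class and URL from server name, port, and database name. It assumes the standard PostgreSQL JDBC driver and, for SQL Server, the jTDS driver (suggested by the jtdstest tool mentioned above). The dictionary keys `driverClassName` and `url` follow common Tomcat data-source attribute names and may not match labkey.xml exactly; the host and database names in the usage line are made up.

```python
DRIVERS = {
    # database kind -> (JDBC driver class, URL prefix)
    "postgresql": ("org.postgresql.Driver", "jdbc:postgresql://"),
    "sqlserver": ("net.sourceforge.jtds.jdbc.Driver", "jdbc:jtds:sqlserver://"),
}


def jdbc_settings(kind, server, port, database):
    """Build the driver class and JDBC URL for one database connection."""
    driver, prefix = DRIVERS[kind]
    return {"driverClassName": driver, "url": f"{prefix}{server}:{port}/{database}"}


# Hypothetical host and database names, for illustration only.
example = jdbc_settings("postgresql", "dbhost.example.org", 5432, "cpas")
```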
Networking
• Basic Networking
– Specify port in server.xml
– Open firewall port(s)
– Procure server name and update DNS
• SMTP settings
– Server, port, credentials specified in labkey.xml
– System email address specified in site settings
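After setting the port and opening the firewall, a quick reachability check from a client machine can confirm the basics. The helper below is a generic TCP check, not a CPAS tool; the function name is illustrative.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The same check works for the SMTP server and port before entering them in labkey.xml.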
Security
• Designed to keep sensitive, unpublished scientific data secure
• Authentication: dual scheme approach
– Can delegate to institution’s LDAP system
– External users: invitation only
• Users choose their own passwords
• Hash of password is stored in database and used for authentication
• Authorization: Users must be granted explicit permissions
– All data stored in folder hierarchy managed by the database
– Users are added to groups
– Groups are granted permission to folder or hierarchy
– Authorized only if user belongs to group with required permissions
• Folders can be made “public” (no authentication required)
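The authorization rule can be sketched as a lookup over groups and folders. This is a simplified model, not CPAS code. It assumes that a grant on a folder also covers its subfolders (one reading of "folder or hierarchy") and that "public" means readable without any group membership. All names are illustrative.

```python
def parent_folders(path):
    """Yield path and each ancestor, e.g. /a/b -> /a/b, /a, /."""
    parts = [p for p in path.split("/") if p]
    for i in range(len(parts), -1, -1):
        yield "/" + "/".join(parts[:i])


def is_authorized(user, folder, permission, memberships, grants,
                  public_folders=frozenset()):
    """memberships: user -> set of groups; grants: (group, folder) -> set of permissions."""
    if permission == "read" and folder in public_folders:
        return True  # assumed meaning of "public": readable by anyone
    groups = memberships.get(user, set())
    for ancestor in parent_folders(folder):
        for group in groups:
            if permission in grants.get((group, ancestor), set()):
                return True
    return False
```

A user with no group memberships is denied everywhere except public folders, which matches the "explicit permissions" bullet.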
Security Settings
• SSL
– We strongly recommend requiring SSL connections
– Enable SSL port in server.xml
– Use “Require SSL connections” option & port setting
• LDAP & SASL
– Configure CPAS to authenticate users against your
organization’s LDAP server(s)
– Specify server name, domain, principal template, SASL
• Email templates
– Customize new user registration, password change, etc.
emails
Other Settings
• Network drive
– Allows CPAS running as Windows service to
attach NFS as a drive
• Site-wide option to enable caBIG™
• Mascot & SEQUEST connection settings
• Site description, color theme, font size, logo
Future Directions
• Web services-based pipeline
• Faster, easier loading of protein annotations
• Multi-engine comparisons
• Improved generalized query support
• Phase 2 of caBIG support
LabKey Software, Inc.
• Private consulting company created by FHCRC and team of
software professionals
– Formed to support, document, and extend the CPAS project to
other functions and labs
– Independent company to directly address other institutions’ needs
and secure outside funding
• Partnership:
– Clients provide scientific leadership
– LabKey focuses on software development
• LabKey is available to customize, install, and support your
pipeline, CPAS, and other LabKey applications
– Business model ensures you get help & support when you need it
Next Steps
• Visit our booth
• Join our informal receptions here
– 6:30 – 9:30PM Tonight & Tomorrow
• Talk to LabKey about your plans
Resources
• http://www.labkey.org – CPAS Distribution & Support Site
– Ask questions, contribute feedback
– Peruse all the CPAS documentation & tutorials
– Download the latest version (LabKey 2.1)
• Graphical installer for Windows installation
• Well documented “manual” installation for Linux/Mac
• http://www.labkey.com
– LabKey Software Inc. company web site
• CPAS Paper
– Rauch A, Bellew M, Eng J, et al. Computational Proteomics
Analysis System (CPAS): An Extensible, Open-source Analytic
System for Evaluating and Publishing Proteomic Data and High
throughput Biological Experiments. J Proteome Res
2006;5(1):112-121.
Acknowledgements
• Fred Hutchinson Cancer Research Center
• National Cancer Institute
• Canary Foundation
• Gates Foundation
• Institute for Systems Biology
• Ron Beavis & The GPM
• Numerous developer contributors
Questions?
Advanced Analysis Features
• Filter groups of runs and compare peptides, proteins,
ProteinProphet, quantitation, etc
• Analyze groups of runs based on sample properties
• Search all experiments for a specific protein or gene name
• Link results to protein annotations
– Load protein knowledgebases: TrEMBL, Swiss-Prot
– Gene Ontology: produce GO charts analyzing molecular function,
cellular location, metabolic process
– Custom protein annotation lists
• Flexible, custom query capability
– Join results to protein, experiment, sample tables
– Display exactly the data you care about
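The idea of joining results to protein and run tables can be illustrated with a toy in-memory database. The schema, table names, and rows below are invented for the example and do not reflect the actual CPAS schema or query interface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical toy schema and made-up rows, for illustration only.
    CREATE TABLE runs (run_id INTEGER PRIMARY KEY, sample TEXT);
    CREATE TABLE proteins (protein_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE peptides (run_id INTEGER, protein_id INTEGER, sequence TEXT, score REAL);
    INSERT INTO runs VALUES (1, 'sample A'), (2, 'sample B');
    INSERT INTO proteins VALUES (1, 'PROT_X'), (2, 'PROT_Y');
    INSERT INTO peptides VALUES (1, 1, 'AAAK', 0.95), (2, 1, 'CCCR', 0.60), (2, 2, 'DDDK', 0.99);
""")


def peptides_for_protein(conn, protein_name, min_score):
    """Join peptides to runs and proteins, keeping one protein above a score cutoff."""
    rows = conn.execute(
        """
        SELECT r.sample, p.sequence, p.score
        FROM peptides AS p
        JOIN runs AS r ON r.run_id = p.run_id
        JOIN proteins AS pr ON pr.protein_id = p.protein_id
        WHERE pr.name = ? AND p.score >= ?
        ORDER BY p.score DESC
        """,
        (protein_name, min_score),
    )
    return rows.fetchall()
```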