CPAS and LabKey Tech Overview ASMS 2006

CPAS and LabKey
Technical Overview
Adam Rauch
LabKey Software
CPAS
Computational Proteomics Analysis System
• Free, open-source, web-based system for processing, storing, and
analyzing results of MS/MS experiments
• Handles high-throughput processing and analysis of results
• Automates & controls data pipeline from instrument through analysis
• Provides universal access to results and supports collaboration
• Keeps data private & secure
• Allows mining across runs, experiments, and samples
• Easy to install, administer, and use
• Supports popular operating systems & database servers
• Uses public file formats for import, export, and exchange
• Is distributed freely under open-source license (Apache 2.0)
Brief History
• 2003 – 2004
– Dr. Martin McIntosh and the Computational Proteomics Laboratory receive grant from NCI
– Initial system developed for proteomics research
• 2005
– CPAS 1.0 product, source code, and publication released
– Core annotation system (based on FuGE) suitable for generic biological portal
– LabKey Software formed to support core platform and extend it beyond proteomics
• Now
– FHCRC CPAS: 12,900 MS/MS runs containing 125 million peptide ids and spectra
– Over 100 other institutions have downloaded the system
– FHCRC: Continues to drive and develop proteomics features
– LabKey: Platform and proteomics dev, other modules (flow cytometry, immunologic assays)
– Broad development community is beginning: Singapore, UW, Cedars-Sinai
• Very soon: establishing labkey.org
– Foundation run by independent board
– Will manage open-source project and hold all contributed intellectual property
LabKey Software LLC
• Private consulting company created by FHCRC and team of
software professionals
– Formed to support, document, and extend the CPAS project to
other functions and labs
– Independent company to directly address other institutions’ needs
and secure outside funding
• Partnership:
– FHCRC and other clients provide scientific leadership
– LabKey focuses on software development
• LabKey is available to customize, install, and support your
pipeline, CPAS, and other LabKey applications
– Business model ensures you get help & support when you need it
What Does “Apache 2.0 Open Source” Mean?
• The product is free
• All source code is available for your review
– Not a “black box”
– Publish with confidence
• You can modify and extend the product
– Fix bugs
– Add features & modules
• You can contribute your changes back (or not)
– We strongly encourage contributing back
– Let others benefit just as you have / fame / fewer headaches
• You can re-distribute source or product (modified or not)
• A broad development community will emerge
CPAS System Components
[Diagram: Mass Spec PC; Java Web Application on Web Server (Tomcat); Database Server (PostgreSQL or SQL Server); Shared Disk; mzXML Conversion; MS/MS Search (X! Tandem, SEQUEST, Mascot, etc.); Pipeline Processing (queuing, control, TPP)]
System Components
• Java web application
– Runs on Apache Tomcat web server
– Compatible with Windows, Linux, Solaris, Mac, and others
– Incorporates open-source libraries
• Relational database server
– PostgreSQL: open-source, all common operating systems
– Microsoft SQL Server: commercial product, Windows only
– Abstraction layer allows other database servers in future
• Network file storage: data archive
• Analysis pipeline: conversion, search, processing
• Open file formats: mzXML, pepXML, protXML, XAR
CPAS Architecture (2004)
[Diagram. Modules: Mouse, Sample, MS2, MS1, Portal / Wiki, Site Admin. Shared services: Base Services (Security, Database, Web Views, Query); Data Storage (Relational Database + File System)]
LabKey in 2006: Beyond Proteomics
[Diagram. Boxes shown: Study, Transcript, Flow Cytometry, Mouse, Sample, Experiment, Protein Services, Data Pipeline, MS2, MS1, Portal / Wiki, Site Admin. Underlying layers: Experiment Services (Shared Ontologies, XAR); Base Services (Security, Database, Web Views, Query); Data Storage (Relational Database + File System). The legend distinguishes modules, shared services, future services / modules, and CPAS.]
Experimental Annotations
• Standards-based annotation of experiments
• Data/experiment exchange format
• See tutorial on http://cpas.fhcrc.org
Experimental Annotations: Goals
• Dumping gigabytes of MS/MS results into a database is not enough
• Must have a framework for describing and querying experimental data in scientifically interesting ways:
– “Show me all runs performed on Chodosh mouse model plasma samples”
– “Across multiple mouse models, show me all differentially regulated proteins grouped by cancer-type”
– “Show me experiments that used the glyco-capture method where protein X was found”
• Needs to separate structure:
– inputs, protocol steps, outputs, relationships
• …from vocabulary:
– properties/types specified by scientist or standardized ontologies
• Requires flexibility
– Database schema, file formats, and tools must support constantly changing protocols, terms, properties, and ontologies
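The split between structure and vocabulary can be illustrated with a small Java sketch. It is not the CPAS schema or the XAR format: the class names (AnnotationSketch, Step, Run), the property key "SampleType", and the query helper runsWith are all hypothetical. The idea is that the structure (steps, inputs, outputs) is fixed in code, while the vocabulary lives in an open-ended property map that can change without touching the structure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model: run structure is fixed; vocabulary lives in free-form properties. */
class AnnotationSketch {

    /** Structure: one protocol step with its inputs and outputs. */
    static class Step {
        final String protocol;
        final List<String> inputs = new ArrayList<>();
        final List<String> outputs = new ArrayList<>();
        // Vocabulary: property name -> value; names may come from a scientist or an ontology.
        final Map<String, String> properties = new LinkedHashMap<>();

        Step(String protocol) {
            this.protocol = protocol;
        }
    }

    /** Structure: a run is an ordered list of steps. */
    static class Run {
        final String name;
        final List<Step> steps = new ArrayList<>();

        Run(String name) {
            this.name = name;
        }

        /** True if any step of this run carries the given property value. */
        boolean hasProperty(String key, String value) {
            for (Step s : steps) {
                if (value.equals(s.properties.get(key))) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Returns the names of all runs that have a step annotated with key = value. */
    static List<String> runsWith(List<Run> runs, String key, String value) {
        List<String> names = new ArrayList<>();
        for (Run r : runs) {
            if (r.hasProperty(key, value)) {
                names.add(r.name);
            }
        }
        return names;
    }
}
```

A query such as “show me all runs performed on plasma samples” then becomes runsWith(runs, "SampleType", "plasma"), and adding a new term only adds a map entry.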
EXperiment ARchive (XAR) Files
[Diagram: a compressed .xar file holds a Xar.xml manifest (Protocol Definition, Starting Inputs, Experiment Runs, Paths to data files) together with subfolders of input data (mzXML, TSV, …) and data results (pep.xml, prot.xml, …). Labels around the archive: LabKey Export, Genologics, LabKey Import.]
Example: Protocol
Starting Data
Starting Material
Run Start
Label Heavy
Label Light
Pool Samples
Fractionate Ion Exch
Fractionate Rev Phase
Do LC/MS-MS
Mark Run Output
Example: Experiment Run
[Diagram: Sample A and Sample B each go through the Labeling Protocol (Label Heavy, Label Light) to give two labeled samples; the Pooling Protocol combines them into a pooled sample; the Ion Exchange Fractionation Protocol splits the pooled sample into fractions; the Rev. Phase Fractionation Protocol splits those into further fractions; the LC/MS-MS Protocol produces mzXML files. Legend: Protocol Application, Material, Data.]
Opportunities
1. Integrate CPAS with cluster pipeline
2. Extend the pipeline
3. Develop or extend other tools
4. Extend LabKey/CPAS web application
1. Integrate CPAS with Cluster Pipeline
• CPAS web interface handles configuring, starting, and
monitoring jobs, but pipeline servers need configuration
• Requires Perl, cron, queuing & cluster scheduling experience
• Scripts used at Fred Hutchinson are available as an example
• Common tasks:
– Configure pipeline to work with your search engine
– Set up and configure conversion servers (RAW → mzXML)
– Set up other analysis steps such as TPP (ISB) or IPP (Insilicos)
– Integrate instruments into pipeline
– Manage storage of RAW, mzXML, and searched results files
– Manage FASTA file repository
– Create and manage XAR files that describe the protocols you use
FHCRC Installation
[Diagram: CPAS / Pipeline Web Server (2 Proc, 2GB; Tomcat, Pipeline Mgr); Database Server (4 Proc, 4GB; MS SQL Server); Mass Spec PC; mzXML Conversion Server; Cluster; File Server (Sun Hierarchical Storage) with 2TB RAID and 20+ TB Tape Robot]
2. Extend The Pipeline
• Develop a novel MS/MS scoring algorithm and
integrate with X! Tandem
• Develop and integrate a new quantitation tool
• Develop an mzXML conversion method for a new
instrument
• Create a pepXML converter for an unsupported
MS/MS search tool (e.g., Spectrum Mill)
3. Develop or Extend Other Tools
• Develop tools that use data stored in LabKey database
– Perform analysis, test new algorithms, publication, presentation
– Could query database directly or use exported data (TSV, Excel)
• Support XAR file format
– Export/import XAR format (e.g., LIMS)
– Develop tools for easier development of XAR files
– We are working on APIs to read/write XARs
4. Extend LabKey/CPAS
• Fix bugs
• Add new platform features, for example:
– New authentication mechanism
– Oracle support
• Add new features to an existing module, for example:
– Improve loading performance of fully annotated sequence DBs
– Add support for loading Spectrum Mill MS/MS results
– Improve comparison of results from different search engines
– Add analysis handler for ASAPRatio
• Create a new module
– New type of data
– New user interface
– New schema
LabKey Development Process
• Source code available via public Subversion server
– Public access is read-only
– Contributions can be made via email or via SVN if granted write access
• Core software written in Java
– We use IntelliJ IDE; Eclipse, text editor, etc. will work as well
– Apache Ant build scripts
• Contribution requirements
– Must work on PostgreSQL and SQL Server
– Must pass Development Regression Test (DRT)
– Must be prepared to fix any build breaks immediately
– Should add tests to the DRT suite
• “Cruise Control” server
– Automated system rebuilds the product & runs DRT after every check in
– If any build or DRT failure occurs the developer alias is alerted
Developer Resources
• Tools & documentation on http://cpas.fhcrc.org
– Issue tracker, mailing list, support board
– Detailed instructions for enlisting, building, and troubleshooting
– Demo module documentation
– cpas architecture.ppt
• Key development aids built into LabKey platform
– SQL script manager
• Automatically runs each module’s incremental SQL scripts at upgrade time
• Strict versioning rules and careful script development ensure reliable upgrades
– Memory leak checker
• Tracks and reports object references that shouldn’t be held anymore
– Exception reporting and recording
• Production systems report unexpected exceptions back to cpas.fhcrc.org for tracking & follow up
• Developer systems collect their own exceptions
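How an incremental script manager might pick the scripts to run at upgrade time can be sketched in Java. This is an assumption-laden illustration, not the LabKey implementation: the class ScriptManagerSketch, the Script type, the integer version numbers, and the script names are all hypothetical. Each script declares the version it upgrades from and to, and the planner chains scripts from the installed version to the target version.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner for incremental upgrade scripts; not the LabKey API. */
class ScriptManagerSketch {

    /** An incremental script that upgrades a module's schema from one version to another. */
    static class Script {
        final String name;
        final int from;
        final int to;

        Script(String name, int from, int to) {
            this.name = name;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns the script names to run, in order, to move from installed to target. */
    static List<String> plan(List<Script> scripts, int installed, int target) {
        List<String> names = new ArrayList<>();
        int current = installed;
        while (current < target) {
            Script next = null;
            for (Script s : scripts) {
                // A script applies if it starts at the current version and does not overshoot.
                boolean applies = s.from == current && s.to > current && s.to <= target;
                // Among applicable scripts, prefer the one that advances furthest.
                if (applies && (next == null || s.to > next.to)) {
                    next = s;
                }
            }
            if (next == null) {
                // A gap in the chain is the failure that strict versioning rules prevent.
                throw new IllegalStateException("No script upgrades from version " + current);
            }
            names.add(next.name);
            current = next.to;
        }
        return names;
    }
}
```

With scripts 100→110, 110→120, and 120→130, a module installed at version 110 would run only the last two; a module already at the target version runs nothing.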
Module
• JAR file that encapsulates all the handling for a particular
class of data
• Examples include: MS2, Flow, Experiment, Study, Wiki
• Only requirement is a class that implements the CpasModule
interface, but typically a module provides:
– Page flow controllers, views, and web parts
– Data manager(s) (provides internal and external API)
– SQL scripts
– All classes that define the module’s functionality
• Other options: implement…
– startup() method
– ContainerListener (delete module data on container delete)
– Searchable (search module content), etc.
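The shape of a module with one required interface and optional hooks can be sketched as follows. The interfaces here are stand-ins written for illustration: SketchModule and SketchContainerListener are hypothetical and do not reproduce the real CpasModule or ContainerListener signatures, which are not shown in these slides. DemoModule and the row format are likewise invented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a module with a required interface and an optional hook. */
class ModuleSketch {

    /** Stand-in for the required module interface (not the real CpasModule). */
    interface SketchModule {
        String getName();
        void startup();
    }

    /** Stand-in for an optional hook, called when a container (folder) is deleted. */
    interface SketchContainerListener {
        void containerDeleted(String containerPath);
    }

    /** Toy module: keeps rows tagged "containerPath:value" and cleans up on delete. */
    static class DemoModule implements SketchModule, SketchContainerListener {
        final List<String> rows = new ArrayList<>();
        boolean started = false;

        public String getName() {
            return "Demo";
        }

        public void startup() {
            started = true;
        }

        public void containerDeleted(String containerPath) {
            // Remove only the rows that belong to the deleted container.
            rows.removeIf(row -> row.startsWith(containerPath + ":"));
        }
    }

    /** Starts every module and returns the names, as a host application might at boot. */
    static List<String> startAll(List<SketchModule> modules) {
        List<String> names = new ArrayList<>();
        for (SketchModule m : modules) {
            m.startup();
            names.add(m.getName());
        }
        return names;
    }

    /** Notifies only the modules that opted in to the container-delete hook. */
    static void deleteContainer(List<SketchModule> modules, String containerPath) {
        for (SketchModule m : modules) {
            if (m instanceof SketchContainerListener) {
                ((SketchContainerListener) m).containerDeleted(containerPath);
            }
        }
    }
}
```

The host only depends on the required interface; optional behavior is discovered with an instanceof check, so a minimal module needs nothing beyond a name and a startup method.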
Table Layer
• Easy-to-use wrapper around JDBC, provides database abstraction
• Handles connections, transactions, cleanup, and basic logging
• Most simple queries are one line of code
• Uses database metadata to automatically produce SQL queries
• Updates standard columns (e.g., modified, modified by, created)
• Some methods and return values:
Table.execute(schema, sql, params[])                     row count
Table.executeSingleton(schema, sql, params[], class)     single primitive
Table.select(ti, columns, filter, sort)                  result set
Table.select(ti, columns, filter, sort, class)           array of objects
Table.selectObject(ti, pk, class)                        single object
Table.insert(user, ti, object/map)
Table.update(user, ti, object/map, rowId, rowVersion)
Table.delete(user, ti, rowId, rowVersion)
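The idea of producing SQL from a table description, a column list, a filter, and a sort can be sketched in a few lines of Java. This is not the Table layer's code: SelectSketch, its Query holder, and the table and column names in the usage note are hypothetical, and the sketch assumes identifiers come from trusted metadata while values are always bound as parameters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical SQL builder illustrating metadata-driven query generation. */
class SelectSketch {

    /** Generated SQL plus the values to bind to its ? placeholders, in order. */
    static class Query {
        final String sql;
        final List<Object> params;

        Query(String sql, List<Object> params) {
            this.sql = sql;
            this.params = params;
        }
    }

    /**
     * Builds a parameterized SELECT. Table, column, and sort names are assumed to come
     * from trusted metadata; filter values are never concatenated into the SQL text.
     */
    static Query select(String table, List<String> columns, Map<String, Object> filter, String sort) {
        StringBuilder sql = new StringBuilder("SELECT ");
        sql.append(String.join(", ", columns)).append(" FROM ").append(table);

        List<Object> params = new ArrayList<>();
        String separator = " WHERE ";
        for (Map.Entry<String, Object> clause : filter.entrySet()) {
            sql.append(separator).append(clause.getKey()).append(" = ?");
            params.add(clause.getValue());
            separator = " AND ";
        }
        if (sort != null) {
            sql.append(" ORDER BY ").append(sort);
        }
        return new Query(sql.toString(), params);
    }
}
```

A caller would pass the resulting sql and params to a JDBC PreparedStatement. For example, selecting Peptide and Score from a hypothetical ms2.Peptides table, filtered on Run and sorted by Score, yields a single statement with one ? placeholder.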
Security
• Designed to keep sensitive, unpublished scientific data secure
• Admin can choose to require SSL for all access
• Authentication: dual scheme approach
– Can delegate to institution’s LDAP system
– External users: invitation only
• Users choose their own passwords
• Hash of password is stored in database and used for authentication
• Authorization: Users must be granted explicit permissions
– All data stored in folder hierarchy managed by the database
– Users are added to groups
– Groups are granted permission to folder or hierarchy
– Authorized only if user belongs to group with required permissions
• Folders can be made “public” (no authentication required)
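The group-based authorization rule above can be sketched in Java. This is an illustration under stated assumptions, not the LabKey security code: the class AuthSketch, the permission names, and the folder paths are hypothetical, and the sketch assumes that a grant on a folder also covers every folder beneath it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical folder/group permission check; inheritance down the hierarchy is assumed. */
class AuthSketch {
    // Group name -> user names in that group.
    private final Map<String, Set<String>> groups = new HashMap<>();
    // Folder path -> (group name -> permissions granted on that folder).
    private final Map<String, Map<String, Set<String>>> grants = new HashMap<>();

    void addUser(String group, String user) {
        groups.computeIfAbsent(group, g -> new HashSet<>()).add(user);
    }

    void grant(String folder, String group, String permission) {
        grants.computeIfAbsent(folder, f -> new HashMap<>())
              .computeIfAbsent(group, g -> new HashSet<>())
              .add(permission);
    }

    /** Authorized only if the user belongs to a group granted the permission here or above. */
    boolean isAuthorized(String user, String folder, String permission) {
        // Walk from the folder up to the root so a grant on a parent covers its hierarchy.
        for (String path = folder; path != null; path = parent(path)) {
            Map<String, Set<String>> byGroup = grants.get(path);
            if (byGroup == null) {
                continue;
            }
            for (Map.Entry<String, Set<String>> entry : byGroup.entrySet()) {
                Set<String> members = groups.get(entry.getKey());
                if (members != null && members.contains(user) && entry.getValue().contains(permission)) {
                    return true;
                }
            }
        }
        return false; // No explicit grant found: deny by default.
    }

    /** Parent of "/lab/ms2" is "/lab", parent of "/lab" is "/", and "/" has no parent. */
    static String parent(String path) {
        if (path.equals("/")) {
            return null;
        }
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }
}
```

Access is denied unless a grant is found, which mirrors the requirement that users be granted explicit permissions.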
Resources
• LabKey / CPAS distribution and support: http://cpas.fhcrc.org and http://www.labkey.org (soon)
• Xar Tutorial: See 1.4 topic “Xar Tutorial” on http://cpas.fhcrc.org
• Demo Module Documentation: See 1.4 topic “Getting Started with the demo module” on http://cpas.fhcrc.org
• FHCRC CPL: http://proteomics.fhcrc.org
• LabKey Software LLC: http://www.labkey.com
Acknowledgements
• Genologics!
• National Cancer Institute
• Canary Foundation
• ISB: TPP, mzXML, pepXML, protXML
• Ron Beavis & The GPM: X! Tandem
• Insilicos: native Windows version of TPP
• Many other open-source developers
Questions?