Biopackages.net
Operating System Packages for
Bioinformatics
Allen Day
2005.05.17
What is a package?
 Software, config files, documentation,
and/or data encapsulated in a single
file
 Metadata describing:
 Version, license, package “category”
 Dependencies
 What the package provides
 GMOD target audience
 Small MODs
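The metadata fields listed on this slide can be pictured as a small record. The following is a minimal sketch in Python, not the RPM format itself; the class name, field names, and every value shown are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageMetadata:
    """Descriptive fields a package carries alongside its payload."""
    name: str
    version: str
    license: str
    category: str
    requires: tuple = ()  # names this package depends on
    provides: tuple = ()  # capabilities this package offers to others


# Hypothetical values, for illustration only (not taken from a real package).
bioperl = PackageMetadata(
    name="perl-bioperl",
    version="0.0.0",
    license="(placeholder)",
    category="(placeholder)",
    requires=("perl",),
    provides=("perl-bioperl",),
)


def satisfies(pkg: PackageMetadata, requirement: str) -> bool:
    """A requirement is met by the package name or by anything it provides."""
    return requirement == pkg.name or requirement in pkg.provides
```

A package manager can answer "what satisfies this dependency?" from records like this without opening the payload.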
Package Dependency Graph
chado-Hsa
postgresql-AffxSeq
genome-Hsa-annotation-gene
genome-Hsa-annotation-affymetrix
chado
perl-bioperl
perl-go-perl
postgresql-server
genome-Hsa-nib
obo-core
ucsc-blat
Dependencies
 Build Dependency
 Installation Dependency
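The two kinds of dependency can be sketched as two separate mappings: one consulted when installing a package, one only when building it (in RPM spec files these correspond to the `Requires` and `BuildRequires` tags). The package names below come from the dependency graph slide, but the arrows of that diagram are not recoverable from the transcript, so every edge here is an assumption made for illustration.

```python
# Hypothetical edges: the node names are from the slide, the arrows are assumed.
INSTALL_DEPS = {
    "chado-Hsa": ["chado", "genome-Hsa-annotation-gene"],
    "chado": ["postgresql-server", "perl-bioperl", "perl-go-perl"],
    "genome-Hsa-annotation-gene": ["genome-Hsa-nib"],
    "genome-Hsa-nib": [],
    "postgresql-server": [],
    "perl-bioperl": [],
    "perl-go-perl": [],
}

# Hypothetical: needed only while building the package, not to install it.
BUILD_DEPS = {"genome-Hsa-nib": ["ucsc-blat"]}


def install_order(package, deps=INSTALL_DEPS):
    """Return the package and its dependencies, dependencies first."""
    order = []
    visiting = set()

    def visit(name):
        if name in order:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for dep in deps.get(name, []):
            visit(dep)  # recurse through the graph
        visiting.discard(name)
        order.append(name)

    visit(package)
    return order


print(install_order("chado-Hsa"))
```

Because build dependencies live in their own mapping, a machine that only installs prebuilt packages never needs them.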
What is a Package Manager?
 Tools to manage installation, upgrade, and
uninstallation of packages
 Verify package integrity (checksums)
 Maintain system integrity
 Transactional
 Allow rollbacks
 Dependency checking
 Dependency graph recursion
 Allow software customization (patches)
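Checksum verification and transactional rollback can be illustrated together. This is a toy model under stated assumptions, not how RPM or yum is implemented: the "system" is an in-memory dictionary, the `Transaction` class is hypothetical, and SHA-256 is chosen only for illustration because the slide does not say which checksum is used.

```python
import hashlib


def sha256_of(payload: bytes) -> str:
    """Hex digest used to check that a payload was not corrupted or altered."""
    return hashlib.sha256(payload).hexdigest()


class Transaction:
    """Apply a batch of installs to an in-memory system; undo all on failure."""

    def __init__(self, installed: dict):
        self.installed = installed

    def install_all(self, packages):
        """packages: iterable of (name, payload, expected_sha256) tuples."""
        snapshot = dict(self.installed)  # state to roll back to
        try:
            for name, payload, expected in packages:
                if sha256_of(payload) != expected:
                    raise ValueError(f"checksum mismatch for {name}")
                self.installed[name] = expected
        except ValueError:
            # Roll back: restore exactly the state from before the batch.
            self.installed.clear()
            self.installed.update(snapshot)
            raise
```

If any package in the batch fails verification, none of the batch remains installed, which is the property the "Transactional" bullet refers to.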
Why bioinformatics packages?
 Consistency of installation process
 Bioinformatics package installation procedures vary
wildly and commonly lack documentation
 Automatic dependency installation
 Perl modules especially bad – bioperl has 60+
modules in its dependency tree
 Integrity/Auditing of system state
 Know that an installed package works, which version
is installed, and how to replicate the system setup
 Tighter integration with operating system
 Daemons, config & log file locations, etc.
What’s available?
 RPM packages only right now
 Primary focus on Fedora Core 2
 Some RPMs also available for
 Fedora Core 3
 Red Hat Linux 9
 Cygwin
What’s available?
 Three primary foci
 Applications
 Libraries
 Data sets
 Applications
 Gbrowse
 Textpresso
 BLAT daemon
 NCBI Toolkit (BLAST, etc)
 HMMer
What’s available?
 Libraries
 Bioperl
 R & Bioconductor
 Squid
 EMBOSS
What’s available?
 Data sets
 Genome & protein sequence
 Sequence features
 Ontologies
 All installed using a common directory
structure
What’s available?
 UCSC tools (utilities, BLAT system
service, CGI scripts)
 Bioperl
 R / Bioconductor
 GMOD apps (Gbrowse, Textpresso, …)
 Data packages
 Genome sequence (fa, nib, blastdb)
 Genome features (Affy probeset
alignments, mRNA, etc)
GMOD Components Available
das2-Hsa
gmod-web-Hsa
apollo-Hsa
cmap-Hsa
chado-Hsa
chado
gbrowse
textpresso
genome-Hsa-nib
turnkey
ucsc-blat
‘Hsa’ can be replaced with the code for your organism
Currently built for ‘Cel’, ‘Hsa’, ‘Sce’
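The naming convention can be sketched as templates with the organism code as a placeholder. The patterns and the three organism codes are taken from this slide; the function name is hypothetical, and whether every component is actually built for every organism is an assumption.

```python
# Package name patterns from the slide, with the organism code as a placeholder.
TEMPLATES = [
    "das2-{org}",
    "gmod-web-{org}",
    "apollo-{org}",
    "cmap-{org}",
    "chado-{org}",
    "genome-{org}-nib",
]

BUILT_ORGANISMS = ("Cel", "Hsa", "Sce")  # codes listed on the slide


def packages_for(org: str) -> list:
    """Expand every template for one organism code."""
    if org not in BUILT_ORGANISMS:
        raise ValueError(f"no packages are listed as built for {org!r}")
    return [template.format(org=org) for template in TEMPLATES]


print(packages_for("Sce"))
```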
More details…
chado-Hsa
genome-Hsa-annotation-gene
genome-Hsa-annotation-affymetrix
postgresql-AffxSeq
chado
perl-go-perl
perl-bioperl
postgresql-server
…
…
…
genome-Hsa-nib
ucsc-blat
…
…
Gene Expression Components
DAS/2 for Genotyping, GeneChip
Quant/Norm Pipeline
chado-GEC
chado-Hsa
R
Bioconductor
Resources
 http://www.biopackages.net
 ~1000 RPMs for Fedora Core 2, 3
 Available via yum
 See site for a configuration example.
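For orientation, a yum repository is declared in an INI-style stanza with a section name plus `name` and `baseurl` keys. The sketch below renders such a stanza in Python; the section name, description, and URL are placeholders, not the project's real values, which are given in the configuration example on the site.

```python
import configparser
import io


def render_repo_stanza(repo_id: str, name: str, baseurl: str) -> str:
    """Render one INI-style repository stanza as text."""
    parser = configparser.ConfigParser()
    parser[repo_id] = {"name": name, "baseurl": baseurl, "enabled": "1"}
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


# Placeholder values only: the real repository URL is on the project site.
print(render_repo_stanza(
    "biopackages",
    "Biopackages (placeholder)",
    "http://example.invalid/placeholder/",
))
```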
TODO
 Support more architectures
 Build for Cygwin & OS X. RPM has been
ported to both
 Automate package build process
 Build farm of multiple architectures,
controllable via scheduler (GridEngine)
 Automate (if possible) inclusion of
new software / data releases
TODO
 Build community interest and
involvement
 Keep adding more packages!
 Keep existing packages current!
Acknowledgements
 Patrick Alger
 Jared Fox
 Brian O’Connor
 Todd Harris
 Lincoln Stein
 Stanley Nelson