Transcript gmod-oct04

GMODTools,
Argos & cetera
A Replicable Genome infOrmation System
of Common Components
GMOD Meeting, Oct. 2004
Don Gilbert
Genome DB building blocks
• GMOD Tools for public data releases
• Argos framework for genome databases
• LuceGene fast document/object search
• Genome Directory System for genome data mining
• Unified Gene Pages (XML, web page)
GMOD Tools: Bulkfiles
cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools
Genome Data Tools
• Support common data update and public release tasks.
• GmodTools to load and extract reagent sequences (EST, cDNA, GSS) to/from Chado databases.
• GMOD Bulkfiles creates bulk genome sequence and feature files for public distribution from a Chado database.
• Citrina is a workflow tool to automate external databank updates, such as GenBank and Gene Ontologies.
12 New genomes to go
• Need to publish numerous new genomes
• Bulk files are standard public access:
• Sequence (fasta, …), features (gff, …), searches (Blast, …)
• 11 new Drosophila genomes; Daphnia genome; many more
• Chado database; XORT & other GMOD Tools to export data
• http://flybase.net/species
Bulkfiles
• Build release files from Chado DB
• Standardized files, headers
• DNA - fasta, raw
• Features - GFF3, gnomap
• Blast indices
• Lucene file indices
• Config files (blast, gbrowse, …)
Bulkfiles - BLAST indices
Bulkfiles - Map features
Bulkfiles OUTPUTS
• DNA files (full chromosomes) in raw and fasta formats
• GFF (v3) and FFF (used in FlyBase) feature files
• Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, …) from feature files
• NCBI BLAST indices & configs
• Gbrowse config files with feature sets matching db
• Others added as needed (more easily than before)
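The idea of per-feature fasta output with standardized headers can be illustrated with a minimal Python sketch. The header layout used here (`>ID key=value; key=value`), the function name `fasta_record`, and the sample identifiers are assumptions for illustration only; the slides do not specify the exact header format that Bulkfiles writes.

```python
def fasta_record(seq_id, sequence, width=60, **fields):
    """Format one fasta record with a uniform header line.

    The ">ID key=value; key=value" layout is an assumed convention,
    not the documented Bulkfiles header format.
    """
    header = ">" + seq_id
    if fields:
        header += " " + "; ".join(f"{key}={value}" for key, value in fields.items())
    # Wrap the sequence to a fixed line width, as fasta files usually do.
    body = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    return header + "\n" + body + "\n"


# Hypothetical feature: the ID, name, and db_xref values are made up.
record = fasta_record(
    "FBgn0000001",
    "ACGT" * 20,
    name="example-gene",
    db_xref="FlyBase:FBan0000001",
)
print(record, end="")
```

Keeping the header fields in one fixed order across all feature sets is what lets downstream tools (BLAST result pages, search indices) parse every file the same way.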
Bulkfiles Logic
• Organism/database logic (mostly) in configuration files
• Dump all chado db features using simple sql to common intermediate table files
• Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, …; GFF-like)
• Easier checking of SQL to get all features desired
• Fast (30 - 60 min for full fly genome)
• Postprocess table files to create public use formats
• Tested with FOUR different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
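The dump-then-postprocess logic above can be sketched in a few lines of Python. This is only an illustration, not the Bulkfiles implementation: the tab-separated column layout of the intermediate table, the function names, and the sample row are all assumptions.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical intermediate table, one feature per line, tab-separated:
# chromosome, type, start, end, strand, id, name, db_xrefs (comma-separated)
SAMPLE_TABLE = "2L\tgene\t1000\t2000\t+\tFBgn0000001\texample-gene\tFlyBase:FBan0000001\n"


def gff3_escape(value):
    # GFF3 attribute values must percent-encode characters such as ; = & and ,
    return quote(value, safe=" :._-+/")


def table_to_gff3(table_text, source="FlyBase"):
    """Turn simple feature rows into GFF version 3 lines."""
    lines = ["##gff-version 3"]
    for row in csv.reader(io.StringIO(table_text), delimiter="\t"):
        chrom, ftype, start, end, strand, fid, name, xrefs = row
        attrs = [f"ID={gff3_escape(fid)}", f"Name={gff3_escape(name)}"]
        if xrefs:
            attrs.append("Dbxref=" + ",".join(gff3_escape(x) for x in xrefs.split(",")))
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes
        lines.append("\t".join([chrom, source, ftype, start, end, ".", strand, ".", ";".join(attrs)]))
    return "\n".join(lines) + "\n"


print(table_to_gff3(SAMPLE_TABLE), end="")
```

Because the intermediate rows carry only type, location, name/id, and a few attributes, each additional output format needs just one more small converter of this kind rather than a new database query.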
Bulkfiles stages
• Postprocess table files in stages
  • Recode feature “oddities” to public view needs
  • Better debugging of steps in the process
  • Engineering time and configuration here
  • Stages are loosely coupled; go back, tweak configurations, re-run partially as needed.
• Convert common feature table + dna to several output formats in one step.
• Combine features from several dbs and other sources like cytology here.
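A minimal Python sketch of loosely coupled stages follows. It assumes each stage communicates only through files in a working directory, so a later stage can be re-run without repeating the database dump. The stage names, file names, and the `gene_internal` type being recoded are hypothetical and chosen only to make the example runnable.

```python
import os
import tempfile


def dump_features(workdir):
    # Stand-in for the SQL dump stage: writes the intermediate table file.
    with open(os.path.join(workdir, "features.tsv"), "w") as out:
        out.write("2L\tgene_internal\t1000\t2000\n")


def recode_features(workdir):
    # Recode a (hypothetical) internal feature type to its public name.
    public_names = {"gene_internal": "gene"}
    with open(os.path.join(workdir, "features.tsv")) as src, \
            open(os.path.join(workdir, "recoded.tsv"), "w") as out:
        for line in src:
            chrom, ftype, start, end = line.rstrip("\n").split("\t")
            out.write("\t".join([chrom, public_names.get(ftype, ftype), start, end]) + "\n")


def write_outputs(workdir):
    # Convert the recoded table to one public format (a bare GFF3 file).
    with open(os.path.join(workdir, "recoded.tsv")) as src, \
            open(os.path.join(workdir, "features.gff"), "w") as out:
        out.write("##gff-version 3\n")
        for line in src:
            chrom, ftype, start, end = line.rstrip("\n").split("\t")
            out.write("\t".join([chrom, ".", ftype, start, end, ".", ".", ".", "."]) + "\n")


STAGES = [("dump", dump_features), ("recode", recode_features), ("output", write_outputs)]


def run_stages(stages, workdir, start_at=None):
    """Run stages in order; with start_at, reuse files left by earlier stages."""
    names = [name for name, _ in stages]
    first = names.index(start_at) if start_at else 0
    ran = []
    for name, func in stages[first:]:
        func(workdir)
        ran.append(name)
    return ran


with tempfile.TemporaryDirectory() as workdir:
    print(run_stages(STAGES, workdir))                     # full run
    print(run_stages(STAGES, workdir, start_at="recode"))  # partial re-run after a tweak
```

The partial re-run is the point of the design: after changing a recoding rule, only the cheap postprocessing stages repeat, while the slower database dump is left alone.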
Bulkfiles config example
<opt
name="fbbulk-r3" relid="3"
ROOT="${GMOD_ROOT}/"
TMP="${GMOD_ROOT}/tmp"
datadir="genomes/Drosophila_melanogaster"
>
<title>FlyBase Chado DB r3.2</title>
<about>
Configuration for feature and sequence
bulk files from FlyBase chado data release 3.2.1
</about>
<org>dmel</org>
<species>Drosophila melanogaster</species>
<doc name="README">
D. melanogaster euchromatin genome data from
FlyBase Release 3.2.1. See
http://flybase.net/annot/dmel_r3.2.1.txt
</doc>
<include>fbreleases</include>
<db
driver="Pg"
name="dmel_chado"
host="localhost"
port="7302"
user="" password=""
/>
<idpattern>(FBgn|FBti)\d+</idpattern>
<include>filesets</include>
<include>featuresets</include>
</opt>
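To show how a configuration like the one above could be consumed, here is a minimal Python sketch that reads a trimmed copy of it. It is not the Bulkfiles loader: the function `load_config`, the environment-style expansion of `${GMOD_ROOT}`, and the path `/bio/gmod` are assumptions, and `<include>` handling is omitted.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the example configuration (raw string keeps the \d intact).
CONFIG = r"""<opt name="fbbulk-r3" relid="3"
     ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp"
     datadir="genomes/Drosophila_melanogaster">
  <org>dmel</org>
  <species>Drosophila melanogaster</species>
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302"
      user="" password=""/>
  <idpattern>(FBgn|FBti)\d+</idpattern>
</opt>"""


def load_config(xml_text, env):
    """Parse the config and expand ${VAR} references from the env mapping."""
    root = ET.fromstring(xml_text)

    def expand(value):
        # Unknown variables are left untouched rather than silently dropped.
        return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), m.group(0)), value)

    cfg = {key: expand(value) for key, value in root.attrib.items()}
    cfg["org"] = root.findtext("org")
    cfg["species"] = root.findtext("species")
    cfg["db"] = dict(root.find("db").attrib)
    cfg["idpattern"] = re.compile(root.findtext("idpattern"))
    return cfg


cfg = load_config(CONFIG, {"GMOD_ROOT": "/bio/gmod"})  # hypothetical install path
print(cfg["ROOT"], cfg["db"]["name"], cfg["db"]["port"])
print(bool(cfg["idpattern"].fullmatch("FBgn0000001")))
```

Keeping the organism name, database connection, and ID pattern in the configuration is what allows the same code to serve several Chado databases.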
ARGOS
http://www.gmod.org/argos
ARGOS Genome DBs
Package     Description                                           Disk use  Reference server
Argos-root  Argos root server                                     50 Mb     http://flybase.net/argos
euGenes     Multi-eukaryote summary database                      10 Gb     http://eugenes.org/
FlyBase     Drosophila genome database                            10 Gb     http://flybase.net/
wFleaBase   Daphnia genome database                               100 Mb    http://wfleabase.org/
DroSpeGe    12 Drosophila species genomes database (in progress)  --        http://flybase.net/species/
ARGOS Focus
• Automate genome database install & update
• Eliminate { fetch, compile, install, configure, … } cycle
• Developers test, compile, config once; others copy/run
• Start new project quickly - copy existing project & edit to suit
• Clone servers easily (local cluster; global mirrors; company/lab; laptop)
• Compatible with most GMOD projects
• Secure collaborative genome db features
• Goal: easy for biologists to use with minimal informatics expertise
ARGOS Components
Section          Components
FlyBase (e.g.)   Data, database indices, documents, web tools specific to genome service
Java             Chado database tools, genome sequence reports, LuceGene search, Ant build system, database interfaces, XML tools, Tomcat web server, Axis web services
Perl             BioPerl, GBrowse, Chado database tools, Cmap comparative maps, database interfaces, Web tools, XML tools
Servers          BLAST (NCBI), Apache web server, PostgreSQL, and BerkeleyDB databases
Systems          Compiled portions for supported operating systems
Install & Root   Common configurations, web server, installation scripts and instructions
ARGOS INSTALL
Edit wFleaBase
LuceGene (‘Lucy Jean’)
for Genome Information Search and Retrieval
LuceGene
Document/Object Search and Retrieval in Genome Databases
• High-volume data search and retrieval system for genomics and bioinformatics databases
• Standard search features: booleans, phrase, near, relevance
• Search performance exceeds that of relational databases, and search features extend theirs
• Suited to a range of genome data: genes, literature, sequences, XML annotations, Medline abstracts, HTML, PDF and text documents.
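The boolean and phrase search features listed above rest on an inverted index, which the toy Python sketch below illustrates. It is not the LuceGene or Lucene API: the class `TinyIndex`, its methods, and the two sample documents are hypothetical, and relevance ranking and "near" queries are left out.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> {doc_id: [word positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def search_and(self, *terms):
        # Boolean AND: documents that contain every term.
        sets = [set(self.postings.get(term.lower(), {})) for term in terms]
        return set.intersection(*sets) if sets else set()

    def search_phrase(self, phrase):
        # Phrase query: the terms must occur at consecutive positions.
        terms = re.findall(r"\w+", phrase.lower())
        hits = set()
        for doc_id in self.search_and(*terms):
            starts = self.postings[terms[0]][doc_id]
            if any(all(start + i in self.postings[term][doc_id]
                       for i, term in enumerate(terms))
                   for start in starts):
                hits.add(doc_id)
        return hits


index = TinyIndex()
index.add("doc1", "alcohol dehydrogenase gene in Drosophila")
index.add("doc2", "dehydrogenase activity of alcohol metabolism gene")

print(sorted(index.search_and("alcohol", "dehydrogenase")))
print(sorted(index.search_phrase("alcohol dehydrogenase")))
```

Storing word positions alongside document IDs is what makes phrase (and, by extension, proximity) queries cheap, a kind of lookup that is awkward to express as plain SQL over a relational schema.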
Example LuceGene libraries
• FlyBase database
  • Annotation GAME XML, Medline XML (gamexml, medxml)
  • Genes, Annotation, References (fbgn, fban, fbrf)
  • Web, literature PDF Documents (docs)
  • Unified Gene Page XML (ugpxml)
  • Sequences, Genome Features (seqs)
• euGenes database
  • Gene summaries, Sequences, Genome Features
  • Unified Gene Page XML
  • Web Documents
• wFleaBase database
  • Sequences, Medline XML, Web documents
Thanks to these folks
• Josh Goodman (gmod)
• Paul Poole (gmod/iubio)
• Hardik Sheth (flybase)
• Nihar Sheth (flybase)
• Vasanth Singan (gmod)
• Victor Strelets (flybase)
And to many developers whose work we learn from and borrow from