GenomesToGrids

Genomes to Grids
Thoughts on Building Data Grids for Biology
Biologists have discovered many millions of genes and
genome features, now part of the bio-data "library" distributed
on computers around the world. This talk discusses grid computing
methods for finding and using interesting genome knowledge in this
mountain of data: their promise, and the practical concerns in
building usable bioinformatics grids.
Don Gilbert, [email protected]
Bio Databanks, EBI, Sept. 2002

Databank    Contents              Entries
EMBL        DNA sequences         18,800,000
SWALL       Protein sequences        900,000
InterPro+   Protein motifs         1,000,000
HGBASE      SNP database           1,500,000
            Metabolic pathways       250,000
MEDLINE     Literature            11,350,000
Total                             33,800,000

Many data objects; data sets are updated frequently (daily)
--> Keeping data current is a problem
Constellation of Bio-Data (SRS - Lion Bioscience)
Many databanks, variously structured, widely distributed,
loosely federated - finding “best data” is a problem
Genome database & info system components
Any genome database relies on, and feeds into, many other databases
BioGrid Schematic
• Grid-aware client software
• Data and software directories
• Grid of processing computers
Moving Bio-Data on the Grid
1. @virtualdata = biodirectory("find protein coding
   sequences for Drosophila and Anopheles
   species");
2. @realdata = biodirectory("get locators for
   @virtualdata split n ways");
3. for $i (1 .. $n) { copydata($realdata[$i], $gridcpu[$i]);
   runapp($gridcpu[$i]) }
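
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the BioGrid implementation: `find_locators`, `split_n_ways`, `scatter`, and the `copy_data` / `run_app` callbacks are hypothetical stand-ins for the directory query, the grid file transfer, and the remote job launch, and the locator strings are made up for the example.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory "directory": query text -> data locators.
# A real bio-directory would be a network service; these locators are invented.
DIRECTORY: Dict[str, List[str]] = {
    "protein coding sequences: Drosophila, Anopheles": [
        f"example-db:cds/{i:04d}" for i in range(10)
    ],
}


def find_locators(query: str) -> List[str]:
    """Steps 1-2 (lookup): resolve a query to a list of data locators."""
    return list(DIRECTORY.get(query, []))


def split_n_ways(items: List[str], n: int) -> List[List[str]]:
    """Step 2 (split): deal items round-robin into n near-equal chunks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i::n] for i in range(n)]


def scatter(
    query: str,
    cpus: List[str],
    copy_data: Callable[[List[str], str], None],
    run_app: Callable[[str], object],
) -> List[object]:
    """Step 3: copy one chunk to each grid CPU, then run the application there."""
    chunks = split_n_ways(find_locators(query), len(cpus))
    results = []
    for cpu, chunk in zip(cpus, chunks):
        copy_data(chunk, cpu)
        results.append(run_app(cpu))
    return results


if __name__ == "__main__":
    staged: Dict[str, List[str]] = {}

    def copy_data(chunk: List[str], cpu: str) -> None:
        staged[cpu] = chunk  # stand-in for a grid file transfer

    def run_app(cpu: str) -> str:
        return f"{cpu}: processed {len(staged[cpu])} objects"

    query = "protein coding sequences: Drosophila, Anopheles"
    for line in scatter(query, ["cpu0", "cpu1", "cpu2"], copy_data, run_app):
        print(line)
```

The loop here runs the chunks one after another for clarity; on a real grid the copy and run calls would be dispatched to the nodes in parallel.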
Directories of Bio-Data
• Directories are a necessary step for usable grids of bio-data
– "broad and shallow" directories federate the "narrow and deep"
databases
• Bio-Data Access Tools
SRS (Sequence Retrieval System); Entrez; AceDB; genome
relational databases (Ensembl, FlyBase, WormBase); IBM
DiscoveryLink; BioDAS; BioMoby
• Directory services for data access tools
– Layer onto access tools for common query/retrieval of important data
– LDAP: mature, efficient for high volumes, queries over distributed
directories; works well with bio-access tools
– Web Services: XML messages over the Web; wide industry support, but
standards are still in progress
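
To make the "broad and shallow" idea concrete, here is a minimal Python sketch of a directory that holds only a few searchable fields per object plus a locator into the "narrow and deep" source database, where the full record stays. The `DirectoryEntry` fields, the `BioDirectory` class, the object IDs, and the locator format are all assumptions for illustration; none of them come from SRS, LDAP, or any other tool named above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class DirectoryEntry:
    """A shallow description of one bio-object (hypothetical schema)."""
    object_id: str
    organism: str
    kind: str      # e.g. "protein" or "dna"
    source: str    # which deep database holds the full record
    locator: str   # how to fetch the record from that source


class BioDirectory:
    """Broad, shallow index over objects stored in many deep databases."""

    def __init__(self) -> None:
        self._entries: Dict[str, DirectoryEntry] = {}

    def register(self, entry: DirectoryEntry) -> None:
        self._entries[entry.object_id] = entry

    def search(self, organism: Optional[str] = None,
               kind: Optional[str] = None) -> List[DirectoryEntry]:
        """Return entries matching every criterion that was given."""
        hits = [
            e for e in self._entries.values()
            if (organism is None or e.organism == organism)
            and (kind is None or e.kind == kind)
        ]
        return sorted(hits, key=lambda e: e.object_id)

    @staticmethod
    def locators_by_source(entries: Iterable[DirectoryEntry]) -> Dict[str, List[str]]:
        """Group locators so each source database is asked only once."""
        grouped: Dict[str, List[str]] = {}
        for e in entries:
            grouped.setdefault(e.source, []).append(e.locator)
        return grouped


if __name__ == "__main__":
    directory = BioDirectory()
    # Invented sample entries, for illustration only.
    directory.register(DirectoryEntry("EX0001", "Drosophila", "protein", "db-a", "db-a:EX0001"))
    directory.register(DirectoryEntry("EX0002", "Anopheles", "protein", "db-b", "db-b:EX0002"))
    directory.register(DirectoryEntry("EX0003", "Drosophila", "dna", "db-a", "db-a:EX0003"))

    hits = directory.search(kind="protein")
    print(BioDirectory.locators_by_source(hits))
```

The directory answers "where is it?" quickly and across sources; retrieving the full record is left to each source database's own access tool.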
Bio-data Directory Needs
• Build on existing technology for finding distributed objects
• Efficient for millions of objects, by the gigabyte and terabyte
• Queries distributed across directories of collaborating
services
• Support existing and new bioinformatics data access
(relational dbs, object and XML dbs, SRS, Entrez, AceDB)
• Simple client program methods for computable use of
directories
• Flexible, common schema for describing objects
• Replicate directories and objects among bioinformatics
centers
• Peer-to-peer directories for collaborative projects
• Strong authentication and security for data access
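
Two of the needs listed above, queries distributed across collaborating directories and replication among centers, can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: each directory is modeled as a plain function from an organism name to a list of records, the record fields (`id`, `organism`) are a hypothetical common schema, and `make_directory` and `federated_search` are names invented here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Record = Dict[str, str]
DirectoryService = Callable[[str], List[Record]]


def make_directory(records: List[Record]) -> DirectoryService:
    """Build a stand-in directory service over an in-memory record list."""
    def search(organism: str) -> List[Record]:
        return [r for r in records if r["organism"] == organism]
    return search


def federated_search(services: List[DirectoryService], organism: str) -> List[Record]:
    """Send one query to every directory in parallel and merge the answers.

    Objects replicated in several directories are reported once, keyed by "id";
    the copy from the earliest-listed directory is kept.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(services))) as pool:
        result_lists = list(pool.map(lambda service: service(organism), services))

    merged: Dict[str, Record] = {}
    for results in result_lists:
        for record in results:
            merged.setdefault(record["id"], record)
    return [merged[key] for key in sorted(merged)]


if __name__ == "__main__":
    # Invented records; "EX0002" is replicated at both centers.
    center_a = make_directory([
        {"id": "EX0001", "organism": "Drosophila"},
        {"id": "EX0002", "organism": "Drosophila"},
    ])
    center_b = make_directory([
        {"id": "EX0002", "organism": "Drosophila"},
        {"id": "EX0003", "organism": "Anopheles"},
    ])
    for record in federated_search([center_a, center_b], "Drosophila"):
        print(record["id"])
```

A shared record schema is what makes the merge step trivial here; without common identifiers and field names, de-duplicating replicated objects would need per-source mapping rules.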
Directory technology: LDAP, Web Services and/or?
• LDAP
– Object-centric, optimized for efficient read operations.
– Hierarchical, network-able, distributed and replicated in nature
– Has many features needed for bio-data access
• Web/XML
– SOAP+: SOAP for directory requests, WSDL to interface the directory
repository, UDDI to locate the service (some assembly still required…)
– UDDI is a potential match to LDAP as a directory technology
– DSML: layer on top of LDAP for Web/XML interoperability
• Peer-to-peer (JXTA)? Grid SQL? XML-query systems?
– Possible future directory technology
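
As an illustration of what an LDAP-style query for bio-objects might look like, the Python sketch below builds search filter strings in the standard LDAP filter syntax (RFC 4515), including the required escaping of special characters. The attribute names `organism` and `moleculeType` are hypothetical and are not taken from any published bio-directory schema; the helper names are also invented here. The resulting string would be handed to whatever LDAP client library is in use.

```python
def escape_filter_value(value: str) -> str:
    """Escape characters that are special inside an LDAP filter value (RFC 4515)."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def equals(attribute: str, value: str) -> str:
    """Equality match: (attribute=value)."""
    return f"({attribute}={escape_filter_value(value)})"


def all_of(*filters: str) -> str:
    """Logical AND of sub-filters: (&(...)(...))."""
    return "(&" + "".join(filters) + ")"


def any_of(*filters: str) -> str:
    """Logical OR of sub-filters: (|(...)(...))."""
    return "(|" + "".join(filters) + ")"


if __name__ == "__main__":
    # Hypothetical attributes: protein objects from either of two genera.
    query = all_of(
        any_of(equals("organism", "Drosophila"), equals("organism", "Anopheles")),
        equals("moleculeType", "protein"),
    )
    print(query)
    # prints (&(|(organism=Drosophila)(organism=Anopheles))(moleculeType=protein))
```

Escaping matters in practice because biological names and descriptions often contain parentheses and asterisks, which would otherwise change the meaning of the filter.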
BioDirectory Tests
• SRS bioinformatics data retrieval system, for efficient
retrieval of millions of bio-objects
• OpenLDAP for high performance and JavaLDAP for an
easy-to-configure directory transport.
• GLUE and Jakarta/Tomcat for Web Services tests.
• DSML (Directory Services Markup Language) for XML/LDAP
conversion.
• Test queries: 20,000 to 1.2 million biosequence objects from
GenBank, SwissProt and related dbs.
IUBio SRS Server + LDAP, Web Services
--> Bio-object directory search/retrieval
Using Bio Directories
• Simple client software
• Automated use
• People use
• Discovery
• Search by many criteria
• Retrieve bulk subsets
BioGrid Runner
A Globus CoG kit application for bioinformatics
http://iubio.bio.indiana.edu/biogrid/runner/
Wrap up
• Future of Bio-data on Grids
– Globus Toolkit useful for bio-grid data & compute
intensive tools (BLAST, HMMer, Meme, others)
– High volume, complex, changing, distributed data
– Add methods to find & move data among grid,
directories of objects
– LDAP works well; Web-XML is usable, being defined
• Bio community needs and uses
– Common data descriptions, schema, ontologies
– Simple, practical, flexible grid methods; use existing
dbs
See http://iubio.bio.indiana.edu/biogrid/