Transcript BioHaystack

IBM Watson Research
BioHaystack: Gateway to the Biological Semantic Web
Dennis Quan
[email protected]
© 2004 IBM Corporation
Problems in bioinformatics
• A myriad of public databases each hold specific facets of information about biological objects of interest (e.g., proteins and genes)
• Databases have their own access protocols, data formats, naming conventions, and means of describing relationships between objects in different databases
• Different software is required to view information from different databases
– User must be keenly aware of which tool or site to use
– Relevant information comes in fragments
– Exploration process is discontinuous
A common naming convention: LSID URNs
• Life Sciences Identifiers (LSIDs) are URNs for biological objects that are backed by RDF metadata:
– E.g., urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240
• LSID and LSID protocol (SOAP-based) specifications are sponsored by I3C and undergoing standardization by OMG
• Most of the publicly available bioinformatics databases are available via LSID today
– PDB LSID authority is online; "proxy" LSID authorities for databases such as the NIH databases and SwissProt are hosted by I3C
• LSID clients and servers are easy to set up
– IBM Internet Technology group provides Open Source LSID client and server software for a variety of languages and platforms
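To make the identifier format concrete, here is a minimal sketch that splits an LSID URN into its parts, assuming the general layout urn:lsid:authority:namespace:object with an optional trailing revision. The `Lsid` type and `parse_lsid` function are hypothetical names introduced for illustration; they are not part of the IBM LSID software, and the sketch does not speak the LSID resolution protocol.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    """Parts of an LSID URN (hypothetical helper type, for illustration only)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> Lsid:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = urn.split(":")
    # Expect the 'urn' and 'lsid' prefixes plus three or four further fields.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID URN: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty field: {urn!r}")
    # The revision field is optional; treat a missing or empty one as None.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(authority, namespace, object_id, revision)


# The example identifier from the slide above.
example = parse_lsid("urn:lsid:ncbi.nlm.nih.gov.lsid.i3c.org:genbank:nm_001240")
print(example.authority, example.namespace, example.object_id)
```

In this example the authority is the I3C-hosted proxy for NCBI, the namespace is genbank, and the object is the record identifier.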
RDF/XML: on demand data integration
[Figure: RDF statements about the same human hemoglobin LSID, drawn from three sources, are added together into a unified view]
• GenBank: human hemoglobin LSID "derives from" sequence atagccgta cctgcgagt ctagaagct
• Gene Ontology: human hemoglobin LSID "is a" oxygen transport protein
• PDB: human hemoglobin LSID "has 3D structure"
• Unified view: the human hemoglobin LSID carries all three statements at once
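The merge shown in the figure can be sketched with plain sets of (subject, predicate, object) triples. This is a simplified illustration rather than how Haystack or any RDF library stores data: the hemoglobin identifier, the predicate names, and the PDB object value are placeholders invented for this example, and the sequence fragments from the slide are joined into one string.

```python
# Hypothetical identifier and predicate names, invented for this illustration.
HEMOGLOBIN = "urn:lsid:example.org:proteins:human-hemoglobin"

# Each source contributes its own statements about the same subject.
genbank = {(HEMOGLOBIN, "derivesFrom", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {(HEMOGLOBIN, "isA", "oxygen transport protein")}
pdb = {(HEMOGLOBIN, "has3DStructure", "structure-placeholder")}


def merge_graphs(*graphs):
    """Union several triple sets; shared identifiers line up automatically."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect every predicate and its sorted objects for one subject."""
    view = {}
    for s, p, o in graph:
        if s == subject:
            view.setdefault(p, []).append(o)
    return {p: sorted(objects) for p, objects in view.items()}


unified = merge_graphs(genbank, gene_ontology, pdb)
view = describe(unified, HEMOGLOBIN)
for predicate in sorted(view):
    print(predicate, view[predicate])
```

Because all three sources name the protein with the same identifier, no join logic is needed: the union of the statements is the unified view.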
Haystack: letting users interact with their data
• Haystack is a tool for creating, exploring, and organizing information:
– Personal information: e-mails, contacts, documents, etc.
– Bioinformatics: proteins, publications, genes, etc.
• Research project originating from MIT CSAIL
• Uses RDF as an underlying data model
• Built on Java and Eclipse, IBM’s Open Source rich client platform
http://haystack.lcs.mit.edu/
Browsing highly interconnected information
• A single screen presents multiple facets of a single object originating from separate databases
• Users navigate the information space as they would in a Web browser: hyperlinking, drag and drop, etc.
Personalization
• People keep track of their information by personalizing their workspaces:
– Grouping paperwork into folders
– Highlighting important text in documents
– Attaching sticky notes as reminders
– Jotting down lists of related items
• Haystack has pervasive support for annotation and allows users to group related objects together arbitrarily for their own purposes
BioHaystack
• BioHaystack: application of Haystack technologies to bioinformatics problems
– Integrated environment for working with biological data
– Intended for end users, i.e., non-programmers
– Builds on LSID, RDF, and Haystack
• Integration offers the promise of lowering barriers to access to different backend systems (e.g., LSID servers, Grids, Web Services, relational databases, annotation servers)
• Just as the Web browser acts as a client for Web content, BioHaystack can act as a client for biological Semantic Web content and services
Real world collaboration: myGrid
• UK-funded joint project with the University of Manchester and other UK research institutions
• RDF-based platform for supporting e-Science experiments
• Real use cases; developed in collaboration with bioinformaticians
• myGrid creates LSIDs and RDF metadata in the process of enacting experiments for scientists
• BioHaystack is used as a browser for this metadata
myGrid Architecture
[Figure: myGrid architecture diagram. Labels recoverable from the slide: Bioinformaticians; Scientists; Registry (query & register); Discovery View (annotation/description; query & retrieve); annotation providers (Pedro annotation tool, others); interface description; vocabulary; Ontology Store; services (WSDL, Soaplab providers); Taverna workflow builder; FreeFluo workflow enactor (execution); mIR (store data/knowledge); Haystack provenance browser; data descriptions]
Courtesy of Professor Carole Goble, University of Manchester
BioHaystack + myGrid
Courtesy of Professor Carole Goble, University of Manchester
Thank you for your attention
• Dennis Quan, [email protected] (IBM Watson Research)
• Haystack project home page (download available May 24)
– http://haystack.lcs.mit.edu/
• IBM LSID home page
– http://www.ibm.com/developerworks/oss/lsid/
• myGrid home page
– http://www.mygrid.org.uk/
• See also our session on constructing Haystack applications:
– Developer’s Day, Saturday, 4:30pm