String - BioMoby

Harnessing the Power
Of communities:
MOBY & Beyond
Mark Wilkinson
PI Bioinformatics
iCAPTURE Centre
for Cardiovascular
and Pulmonary Research
Assistant Professor
Dept. of Medical Genetics
UBC, Vancouver
A brief history of BioMoby
• Model Organism Bring Your own Database
Interface Conference, Sept, 2001 (MOBY-DIC)
• May 21, 2002 – Genome Canada Platform Award
• May 25, 2002 – API Version 0.1 deployed,
including the messaging layer that still exists
today
• July 18, 2002 – first Moby Client released (now
gbrowse_moby, part of gbrowse from GMOD)
• June 9, 2003 – API Version 0.5 deployed
• Currently, the API is at version 0.86; version 1.0
API in preparation for release end of November
What does BioMoby do?
The MOBY-S Plan
• Create an ontology of bioinformatics data-types
• Define a serialization of this ontology (data syntax)
• Create an open API over this ontology
• Define Web Service inputs and outputs vis-à-vis the Ontology
• Register Services in an ontology-aware Registry
• Machines can find an appropriate service
• Machines can execute that service unattended
• Ontology is community-extensible
Overview of MOBY-S Transactions
[Figure: MOBY hosts & services (Gene names, Sequence, Alignment, Expression, Phylogeny, Protein, Primers, Alleles, …) and MOBY Central]
MOBY-S in detail
• MOBY-S Data typing system: Semantic Type
• MOBY-S Data typing system: Syntactic Type
Moby Namespaces (from GO)
• Any identifiable piece of data is an “entity”
• Identifiers for these entities fall under “Namespaces”
– NCBI has gi numbers (gi Namespace)
– GO Terms have accession numbers (GO Namespace)
• Namespaces indicate data’s semantic type.
– GO:0003476 → a Gene Ontology Term
– gi|163483 → a GenBank record
• Namespace + ID precisely specifies a data “entity”
• This differs from an LSID in that our identifiers ARE NOT
OPAQUE – they are semantically rich
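A minimal Python sketch of the idea above, under stated assumptions: the `Entity` class, the `SEMANTIC_TYPES` table and the `describe` function are hypothetical names invented for illustration and are not part of the MOBY API. It shows that a namespace plus an id is enough to name an entity, and that the namespace alone already says what kind of thing it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A data entity named by a namespace and an identifier within it."""
    namespace: str
    id: str


# Hypothetical lookup table: namespace -> semantic type (wording from the slide).
SEMANTIC_TYPES = {
    "GO": "a Gene Ontology Term",
    "gi": "a GenBank record",
}


def describe(entity: Entity) -> str:
    """Report what kind of thing an entity is, judged from its namespace alone."""
    return SEMANTIC_TYPES.get(entity.namespace, "unknown semantic type")


term = Entity("GO", "0003476")
record = Entity("gi", "163483")
print(describe(term))
print(describe(record))
```

Two `Entity` values compare equal exactly when both namespace and id match, which is the sense in which the pair "precisely specifies" an entity.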
The MOBY-S Object Ontology
• Syntactic types are defined by a GO-like
ontology
– Data Class name at each node
– Edges define the relationships between Classes
– GO used as a model because of its familiarity
in the community
[Figure: two nodes joined by an edge]
• Edges define one of three relationships
– IS A
• Inheritance relationship
• All properties of the parent are present in the child
– HAS A
• Container relationship of ‘exactly 1’
– HAS
• Container relationship with ‘1 or more’
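One way to picture the three edge types is as a small in-memory data structure. The sketch below is an assumption-laden illustration, not the MOBY implementation: `Node`, `lineage` and `members` are hypothetical names, and the example classes are the ones used elsewhere in these slides (VirtualSequence, GenericSequence, DNASequence).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One Data Class in the ontology (hypothetical representation)."""
    name: str
    isa: Optional[str] = None                            # IS A: parent class
    hasa: Dict[str, str] = field(default_factory=dict)   # HAS A: member -> class, exactly 1
    has: Dict[str, str] = field(default_factory=dict)    # HAS: member -> class, 1 or more


ONTOLOGY = {
    "Object": Node("Object"),
    "Integer": Node("Integer", isa="Object"),
    "String": Node("String", isa="Object"),
    "VirtualSequence": Node("VirtualSequence", isa="Object",
                            hasa={"length": "Integer"}),
    "GenericSequence": Node("GenericSequence", isa="VirtualSequence",
                            hasa={"SequenceString": "String"}),
    "DNASequence": Node("DNASequence", isa="GenericSequence"),
}


def lineage(ontology: Dict[str, Node], name: str) -> List[str]:
    """Follow IS A edges from a class up to the root."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = ontology[current].isa
    return chain


def members(ontology: Dict[str, Node], name: str) -> Dict[str, Tuple[str, str]]:
    """Collect HAS A / HAS members, including those inherited through IS A."""
    merged: Dict[str, Tuple[str, str]] = {}
    for cls in reversed(lineage(ontology, name)):   # root first, child last
        for member, target in ontology[cls].hasa.items():
            merged[member] = (target, "exactly 1")
        for member, target in ontology[cls].has.items():
            merged[member] = (target, "1 or more")
    return merged


print(lineage(ONTOLOGY, "DNASequence"))
print(members(ONTOLOGY, "DNASequence"))
```

Because IS A carries all properties of the parent into the child, `members` walks the lineage from the root downward, so DNASequence ends up with both `length` and `SequenceString` without declaring either itself.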
The Simplest Moby Data-Type
<Object namespace="NCBI_gi" id="111076"/>
The combination of a namespace and an
identifier within that namespace
uniquely identifies a data entity, not its
location(s), nor its representation
A Primitive Data-type
DateTime ISA Object
Float ISA Object
Integer ISA Object
String ISA Object
<Integer namespace="" id="">38</Integer>
A Derived Data-Type
<VirtualSequence namespace="NCBI_gi" id="111076">
<Integer namespace="" id="" articleName="length">38</Integer>
</VirtualSequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
A Derived Data-Type
<GenericSequence namespace="NCBI_gi" id="111076">
<Integer namespace="" id="" articleName="length">38</Integer>
<String namespace="" id="" articleName="SequenceString">
ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
</String>
</GenericSequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
GenericSequence ISA VirtualSequence
GenericSequence HASA String
A Derived Data-Type
<DNASequence namespace="NCBI_gi" id="111076">
<Integer namespace="" id="" articleName="length">38</Integer>
<String namespace="" id="" articleName="SequenceString">
ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
</String>
</DNASequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
GenericSequence ISA VirtualSequence
GenericSequence HASA String
DNASequence ISA GenericSequence
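The derived data-types above can be produced programmatically. The following is a sketch only, using Python's standard `xml.etree.ElementTree`; the helper names `moby_element` and `dna_sequence` are hypothetical, and the output is a bare object, not a complete MOBY message with its envelope.

```python
import xml.etree.ElementTree as ET


def moby_element(class_name, namespace="", identifier="", article_name=None, text=None):
    """Build one element carrying the namespace/id attributes shown on the slides."""
    elem = ET.Element(class_name, {"namespace": namespace, "id": identifier})
    if article_name is not None:
        elem.set("articleName", article_name)
    if text is not None:
        elem.text = text
    return elem


def dna_sequence(namespace, identifier, residues):
    """DNASequence inherits 'length' (Integer) and 'SequenceString' (String)."""
    seq = moby_element("DNASequence", namespace, identifier)
    seq.append(moby_element("Integer", article_name="length", text=str(len(residues))))
    seq.append(moby_element("String", article_name="SequenceString", text=residues))
    return seq


xml_text = ET.tostring(dna_sequence("NCBI_gi", "111076", "ATGATGATAG"), encoding="unicode")
print(xml_text)
```

Here the length is computed from the residues rather than typed by hand, which avoids the two values drifting apart.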
Legacy file formats
• Containing “String” allows us to define ontological classes that represent
legacy data types (e.g. the 20 existing sequence formats!)
<NCBI_Blast_Report namespace="NCBI_gi" id="115325">
<String namespace="" id="" articleName="content">
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126
(504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching...done
Sequences producing significant alignments:                          Score (bits)  E Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...  1009          0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...  58            4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                 53            1e-05
</String>
</NCBI_Blast_Report>
Binaries – pictures, movies
• We base64 encode binaries, and then define a hierarchy of data classes that
contain String
• base64_encoded_jpeg ISA text/base64 ISA text/plain HASA String
<base64_encoded_jpeg namespace="TAIR_image" id="3343532">
<String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</String>
</base64_encoded_jpeg>
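A rough sketch of the encoding step, assuming nothing beyond Python's standard `base64` and `xml.etree.ElementTree` modules. `wrap_binary` and `unwrap_binary` are hypothetical helper names, and the payload is a stand-in byte string, not a real image.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(class_name, namespace, identifier, payload):
    """Place base64-encoded bytes in a String member named 'content'."""
    obj = ET.Element(class_name, {"namespace": namespace, "id": identifier})
    content = ET.SubElement(
        obj, "String", {"namespace": "", "id": "", "articleName": "content"}
    )
    content.text = base64.b64encode(payload).decode("ascii")
    return obj


def unwrap_binary(obj):
    """Recover the original bytes from the 'content' String."""
    return base64.b64decode(obj.find("String").text)


fake_jpeg = b"\xff\xd8\xff\xe0 stand-in bytes, not a real image"
obj = wrap_binary("base64_encoded_jpeg", "TAIR_image", "3343532", fake_jpeg)
print(ET.tostring(obj, encoding="unicode"))
```

Since the encoded payload is plain ASCII text, the object survives a trip through any XML serializer and decodes back to the same bytes.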
Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
<2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
<Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
<Integer namespace="" id="" articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace="" id="" articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
<CrossReference>
<Object namespace="TAIR_Allele" id="ufo-1"/>
</CrossReference>
<2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
<CrossReference>
<Object namespace="TAIR_Tissue" id="122"/>
</CrossReference>
<Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
<Integer namespace="" id="" articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace="" id="" articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace="" id="" articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</String>
</annotated_jpeg>
How to think about MOBY Objects
and Namespaces
[Figure: one record in the “gi” Namespace (a GenBank record), seen from Data perspective X as Object X and from Data perspective Y as Object Y]
Why define Objects in an ontology?
Bioinformatics service providers are not all experienced
programmers
The Moby Object Ontology provides an environment
within which “naïve” service providers can create new
complex data-types WITHOUT generating new
flatfile formats, and without having to
understand XML Schema
Minimize future heterogeneity between new data-types
to improve interoperability without requiring endless
schema-to-schema mapping efforts.
The Object Ontology Defines
an XML Schema
• Object Ontology terms have “meaningful” names,
but this is for human intuition only
– DNA Sequence, Annotated_GIF
• Object Ontology does not define the biological
meaning; however, it does define how every
XML tag should be interpreted, and is therefore
superior to pure XML/XML-Schema solutions
• It does define the representation
– SYNTAX
The Object Ontology Defines
an XML Schema
• The position of an ontology node precisely defines
the syntax by which that node will be represented
• End-users can define new data-types without
having to write XML Schema!
– This was an important aim of the project
• A machine can “understand” the structure of any
incoming message by querying its ontological type
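As an illustration of a machine "understanding" an incoming message from its ontological type, here is a hedged sketch. The `PARENT` table is a hypothetical in-memory stand-in for querying the ontology, and `is_a` and `read_sequence` are invented names; a real client would ask the registry instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an ontology query: class name -> parent class.
PARENT = {
    "Object": None,
    "Integer": "Object",
    "String": "Object",
    "VirtualSequence": "Object",
    "GenericSequence": "VirtualSequence",
    "DNASequence": "GenericSequence",
}


def is_a(class_name, ancestor):
    """True if class_name equals ancestor or inherits from it via ISA edges."""
    while class_name is not None:
        if class_name == ancestor:
            return True
        class_name = PARENT.get(class_name)
    return False


def read_sequence(message):
    """Accept any kind of GenericSequence and pull out its inherited members."""
    root = ET.fromstring(message)
    if not is_a(root.tag, "GenericSequence"):
        raise ValueError(f"{root.tag} is not a kind of GenericSequence")
    parts = {child.get("articleName"): (child.text or "").strip() for child in root}
    return root.get("namespace"), root.get("id"), parts["SequenceString"]


incoming = (
    '<DNASequence namespace="NCBI_gi" id="111076">'
    '<Integer namespace="" id="" articleName="length">10</Integer>'
    '<String namespace="" id="" articleName="SequenceString"> ATGATGATAG </String>'
    "</DNASequence>"
)
print(read_sequence(incoming))
```

The reader never names DNASequence explicitly: knowing that the tag ISA GenericSequence is enough to know that a `SequenceString` member must be present.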
A portion of the MOBY-S
Object Ontology
…community-built!
Pipeline discovery “on the fly”
• No explicit coordination between providers
• Run-time discovery of appropriate Services
• Automated execution of services
• This is happening without semantics
– Syntax only… well… almost… :-)
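Run-time discovery can be pictured as a lookup over registered input types. This sketch assumes one plausible matching rule (a service is a candidate when its declared input is the data's class or any ISA ancestor of it); the service names and the `SERVICES` and `PARENT` tables are invented for illustration and do not describe real registrations or the MOBY Central API.

```python
# Hypothetical ontology fragment: class name -> parent class.
PARENT = {
    "Object": None,
    "Integer": "Object",
    "String": "Object",
    "VirtualSequence": "Object",
    "GenericSequence": "VirtualSequence",
    "DNASequence": "GenericSequence",
}

# Hypothetical registrations; the providers never coordinated with each other.
SERVICES = [
    {"name": "getLength", "input": "VirtualSequence", "output": "Integer"},
    {"name": "reverseComplement", "input": "DNASequence", "output": "DNASequence"},
    {"name": "describeEntity", "input": "Object", "output": "String"},
]


def ancestors(class_name):
    """The class itself followed by every class it inherits from."""
    chain = []
    while class_name is not None:
        chain.append(class_name)
        class_name = PARENT[class_name]
    return chain


def find_services(data_type):
    """Names of services whose declared input the given data type satisfies."""
    acceptable = set(ancestors(data_type))
    return [s["name"] for s in SERVICES if s["input"] in acceptable]


print(find_services("DNASequence"))
print(find_services("GenericSequence"))
print(find_services("Integer"))
```

Matching here is purely on the syntactic type; nothing in the lookup knows what any service actually does, which is the limitation the later slides return to.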
Conclusions from the Behaviour
of this Simple Browser
• Service discovery is a semantic problem
• However interoperability is not
• Data integration is still a problem – both
syntactic and semantic - and we’ve just
made that problem worse!
– SYNTAX IS NOT THE PROBLEM!!!!
• Some “political” details about BioMoby as
we are coming to the end of the current
Genome Canada funding period and are
trying to get renewal… hint, hint, if there
are any GC external reviewers in the
audience! :-)
Moby: Breadth
• Namespaces (semantic datatypes): 236
• Objects (data syntaxes): 161
• Service Types (analytical categories): 18
• Service Instances: 401 (+ ~200 Soaplab)
– Hundreds more in “boutique” Moby registries serving
specialized communities worldwide
– All continents except Antarctica host Moby services
Moby: Impact
• Mailing list count 175 members (84
on developers mailing list)
• Google Scholar
– ‘BioMOBY’ 147
– Citations of 2002 BioMOBY paper 72
Moby: Developer Activity
• MOBY-DIC Chapter 7 meeting
– Vancouver, May 6-8, 2005
• 23 Developers attending
– Asia
– USA
– Canada
– Germany
– Spain
– France
• Mapped-out the route to the final 1.0
version of the API
Moby Registry Activity
[Chart: Hits on Moby Central API by MONTH, Dec-03 to Aug-05; y-axis 0 to 400000; annotation: “PlaNet implements own MOBY Central”]
Moby: Exemplar Users
• PlaNet consortium (7+ sites, 100-130 services)
• EBI – SOAPLAB – myGrid
• Generation Challenge Programme of the CGIAR
(18+ sites)
• Genoma España uses MOBY for much of the
bioinformatics service provision in the GE
Bioinformatics Platform
Moby: Clients
• Gbrowse_moby (M Wilkinson)
• Browser-style client
• Ahab & Ishmael (B Good, M Wilkinson)
• “BLAST” & Semantic Web style clients
• PlaNet Locus_View (H Schoof, R Ernst)
• Aggregator-style client
• Blue-Jay (P Gordon) and Rat Genome
Database prototype (S Twigger)
• Menu-style clients
• MOBY Graphs (M Senger)
• Auto-workflow discovery tool
• Taverna (T Oinn, M Senger, E Kawas), and
MOWserv (INB, Spain)
• Workflow builder/publisher/execution client
• Enhanced support for MOBY currently being built
• Eclipse plugins… etc…
Taverna Workbench
Tom Oinn and Martin Senger
myGrid Project
MOWServ
Web interface to the
Spanish Instituto Nacional de Bioinformatica
MOBY Central installation
INB Collaboration
MOBY Enhancements
• The INB has made several additions to
the MOBY API
– Detailed error reporting
– Asynchronous service invocation
• These will become part of the official
API in the coming year.
Future plans for Moby
• “Decentralization” and enrichment of the
registry through distributed RDF-based
service instance annotations + LSID
resolution
– Complete!
• Mirroring of registries
• RDF-based messaging
– BioMoby pre-dates commodity Semantic Web tools
like RDF/OWL by a couple of years…
Future plans for Moby
• Mirroring of Services
• Enhanced registry usage metadata capture
• Ontological markup of Object Ontology Terms
• Better support for Web Service tooling if possible
– Unfortunately, W3C XML Schema is unable to describe MOBY
messages…
• Collaboration with the GBIF/DIGiR community –
biodiversity information served through MOBY
A weakness of MOBY
Automated service discovery is
fatally flawed due to
insufficiently rich semantics…
The problem with Moby
Chickens go in;
Pies come out!
The problem with Moby
What sort o’
pies?
The problem with Moby
Apple!
The MOBY-S Service Ontology
• A simple ISA hierarchy… too simple!
• Primitive types include:
– Analysis
– Parsing
– Registration
– Retrieval
– Resolution
– Conversion
– Rendering
A slice of the Service Ontology
[Figure: nodes Service, Analysis, Alignment, Blast, NCBI_Blast, WU_Blast, Parsing, Parse_NCBI_Blast; callouts “Parse this” and “that”]
“The Exploding Bicycle” - A. Rector, U Manchester
MOBY in the future
• Tighter collaboration with myGrid
– We now have identical RDF data-models for
our registry metadata
• We inherit the excellent myGrid Service
Ontology, while retaining the power of the
MOBY Object ontology!
BioMoby Conclusions
• The bioinformatics community is facing mission-critical data management problems
• The solution must be simple.
• The community will adopt solutions that work even
if they have to change their behaviour to do so
• The community can be trusted to build
useful, simple ontologies on its own
The Semantic Web for Plant
Genomics
How do Web Services help us with
the Semantic Web problem?
The Semantic Web: RDF Triples
Subject (URI): http://biomoby.org
Predicate (URI): dc:author
Object (URI): http://icapture.ubc.ca/Wilkinson
Basically, just entity-relationship diagrams
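A toy illustration of the triple idea, assuming the example on this slide reads as subject http://biomoby.org, predicate dc:author, object http://icapture.ubc.ca/Wilkinson. The `Triple` type and `objects_of` function are hypothetical; this is not an RDF library, just tuples in a list.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One edge of an entity-relationship graph: subject -predicate-> object."""
    subject: str
    predicate: str
    object: str


GRAPH = [
    Triple("http://biomoby.org", "dc:author", "http://icapture.ubc.ca/Wilkinson"),
]


def objects_of(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Everything the subject is linked to by the given predicate."""
    return [t.object for t in graph if t.subject == subject and t.predicate == predicate]


print(objects_of(GRAPH, "http://biomoby.org", "dc:author"))
```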
The Internet
Credit to P. Lord, myGrid
The World Wide Web
Credit to P. Lord, myGrid
The Semantic Web (low stack)
[Figure: resources linked by typed relationships: sameAs, TranscriptOf, ISA, activates, componentOf, hasProduct, address, clonedBy]
Credit to P. Lord, myGrid
How do WS relate to the SW?
• Bioinformatics information is mainly in Databases
– Therefore not available as named documents (URIs)…
• Work on Semantic Web Services has focused
primarily on semantic annotation of Web Service
functionality (e.g. Moby & myGrid)
– i.e. the problem of Service Discovery
• Can Web Services be used to build the Semantic
Web?
(credit to Phillip Lord, myGrid, for this phraseology)
Web Services… no documents to
point to!
[Figure: relationship labels: sameAs, TranscriptOf, ISA, activates, componentOf, hasProduct, address, clonedBy]
The Semantic Web
[Figure: resources linked by typed relationships: sameAs, TranscriptOf, ISA, activates, componentOf, hasProduct, address, clonedBy]
Credit to P. Lord, myGrid
How do we make Web Services
look like the Semantic Web?
• Moby can help!
• Two novel Moby clients - Ahab and
Ishmael – are starting to create
Semantic Webby outputs…
The Ahab BioMoby Client
Ahab
Ahab RDF
But BioMoby can run unattended!
• Because of syntactic agreement among service
providers, and
• Because the machine can automatically
disassemble complex objects, and
• Because discovery and execution of services that
act on those objects can be fully automated
• BioMoby can build a massive Entity/Relationship
model completely unattended
Okay, so get rid of the GUI…
1. Tell Ahab engine to choose all discovered
services for a piece of data
2. Execute every service
3. Take each output, and go to (1)
4. Go home for an early weekend…
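The loop above can be sketched as a breadth-first crawl. Everything named here is an assumption made for illustration: `crawl`, `discover` and `execute` are hypothetical, the toy data has no biological meaning, and the real Ishmael client is not shown in these slides. Data items are (namespace, id) pairs, so an item that comes back a second time is recognized and not expanded again.

```python
from collections import deque


def crawl(seed, discover, execute, max_triples=1000):
    """Discover services for each item, run them all, and queue every new output."""
    triples = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(triples) < max_triples:
        item = queue.popleft()
        for service in discover(item):              # step 1: choose all services
            for output in execute(service, item):   # step 2: execute every service
                triples.append((item, service, output))
                if output not in seen:              # step 3: new outputs go back to (1)
                    seen.add(output)
                    queue.append(output)
    return triples


# Toy stand-ins for the registry and the services; A and B point at each other.
TOY_LINKS = {
    ("toy", "A"): {"toyForward": [("toy", "B")]},
    ("toy", "B"): {"toyBack": [("toy", "A")]},
}


def discover(item):
    return sorted(TOY_LINKS.get(item, {}))


def execute(service, item):
    return TOY_LINKS[item][service]


for triple in crawl(("toy", "A"), discover, execute):
    print(triple)
```

The `seen` set is what stops the A to B to A loop from running forever, and `max_triples` is an extra guard, since an unattended crawl over a real registry could otherwise grow without bound.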
This is Ishmael - a prototype BioMoby client
The Output from Ishmael
[Figure: graph of data linked by typed relationships: sameAs, TranscriptOf, ISA, activates, componentOf, hasProduct, address, clonedBy]
mySWeb
• The output of Ishmael is “My Semantic Web”
– Personalized Semantic Web-like RDF graph
– Centered around your data of interest
– Cachable/explorable by e.g. Haystack
– Because each node is a Moby-like URI with a
namespace & id, it auto-detects “re-discovery” of
data elements (“loops” in the dataset)
Acknowledgements
O|B|F
• BioMOBY: A Bioinformatics Platform for
Genome Canada
• Ahab, Ishmael, iCAPTURer: Genome BC
Better Biomarkers in Transplantation
• CardioSHARE: Canadian Institutes for Health
Research (CIHR)
• Taverna: myGrid
• Ben Good: CIHR Bioinformatics
Training Programme
Participants and Supporters
Edward Kawas – Lead Developer, BioMOBY project, UBC, Canada
Benjamin Good – CIHR Bioinformatics Training Program, UBC, Canada
Clarence Kwan – Genome Prairie Co-op student, UBC, Canada
Bruce McManus – Co-director, iCAPTURE Centre, UBC, Canada
Carole Goble, Phillip Lord – myGrid project, U Manchester, UK
Martin Senger – myGrid/Taverna, EBI, UK
Bill Crosby & Matthew Links – U Windsor, Canada
Heiko Schoof, Rebecca Ernst – MIPS, Germany
Simon Twigger – Rat Genome Database, USA
Yan Wong – Pasteur Institute, France
Frank Gibbons – Harvard, USA
David Gonzales Pisano – Centro Nacional Biotechnologia, Spain
Damian Gessler & Gary Schiltz – NCGR, USA
Lincoln Stein – Cold Spring Harbor Labs, USA
Midori Harris - Gene Ontology Consortium, UK
Richard Bruskiewich – CGIAR/IRRI, Philippines
Mark Regan – ACPFG/UQueensland, Australia