Interoperability With BioMoby 1.0
It’s Better Than
Sharing Your Toothbrush!
Photo taken by http://flickr.com/people/mfsarwar/
A brief history of BioMoby
• Model Organism Bring Your own Database
Interface Conference, Sept, 2001 (MOBY-DIC)
• May 21, 2002 – Genome Canada Platform Award
• May 25, 2002 – API Version 0.1 deployed, including
object ontology serialization into XML
• July 18, 2002 – First Moby Client (Gbrowse Moby)
• June 9, 2003 – API Version 0.5 deployed
• 2006 – Genome Canada Platform Award
• 2007 - Version 1.0 API submitted for publication
MOBY-DIC Chapter VII
7th Model Organism Bring Your-own
Database Interface Conference
Vancouver, BC, June 2007.
The Core Ahab’s
Wendy
Richard
Martin
Mylah
Eddie
Andreas
Paul
Ivan
Mark’s Screen…
The BioMoby Plan
• Create an ontology of bioinformatics data-types
• Define a serialization of this ontology (data syntax)
• Create an open API over this ontology
• Define Web Service inputs and outputs vis-à-vis the ontology
• Register Services in an ontology-aware Registry
• Machines can find an appropriate service
• Machines can execute that service unattended
• Ontology is community-extensible
Overview of BioMoby
Transactions
[Figure: MOBY hosts & services (Alignment, Sequence, Gene names, Align, Express., Phylogeny, Protein, Primers, Alleles, …) shown alongside MOBY Central.]
Overview of BioMoby
Transactions
Discovery of services that consume things LIKE sequences!
[Figure: a Sequence is presented to MOBY Central, which lists Align, Phylogeny and Primers services. Callouts: “What is a sequence?”, answered from the Object ontology as “A sequence is a ___ that has these features ___”.]
This is SCUFL – Simple Conceptual
Unified Flow Language
It is a complete record of everything
you just did, and it can be saved for
use in the Taverna workflow
application that we will look at later…
Pipeline discovery
“on the fly”
• No explicit coordination between
providers
• Dynamic discovery of ~appropriate
Services
• Automated execution of services
Some BioMoby
statistics
Moby: Breadth
• Namespaces (data types): 418
• Objects (data syntaxes): >561
• Service Types (analytical categories): 112
• Providers: ~50 active
• Service Instances: ~1200 currently “alive”
– In main Moby Central server in Canada
– Others in “boutique” Moby registries serving
specialized communities worldwide
Moby: Clients
• Gbrowse_moby (M Wilkinson)
• PlaNet Locus_View (H Schoof, R Ernst)
• Blue-Jay (P Gordon)
• Taverna (T Oinn, M Senger, E Kawas)
• MOWserv (INB, Spain)
• Remora (S Carrere, J Gouzy, INRA)
• MOBYLE (B Néron, P Tufféry, C Letondal, Pasteur Inst.)
• SeaHawk (P Gordon)
BioMoby in detail
• MOBY Data typing system: Semantic Type
• MOBY Data typing system: Syntactic Type
• Moby Registry Queries
Moby Namespaces
• A “Namespace” is a category of identifiers
– NCBI has gi numbers (gi Namespace)
– GO Terms have accession numbers (GO
Namespace)
• Namespaces indicate data’s semantic type.
– GO:0003476 → a Gene Ontology Term
– gi|163483 → a GenBank record
• Though we are using the word “Namespace”
correctly, it causes confusion!
– “Namespace” in XML is tightly associated with an
XML document and/or its syntax
– In Moby, we are ONLY talking about data entities
NOT THEIR SYNTAX
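A minimal sketch of the idea on this slide: an identifier such as GO:0003476 or gi|163483 is split into a (Namespace, id) pair that says what the entity is and nothing about its syntax. The function name split_identifier and the prefix table are assumptions made for illustration; only the NCBI_gi Namespace name appears elsewhere in this deck.

```python
import re

# Hypothetical prefix-to-Namespace table (illustration only).
PREFIX_TO_NAMESPACE = {"gi": "NCBI_gi", "GO": "GO"}


def split_identifier(text):
    """Split 'GO:0003476' or 'gi|163483' into a (namespace, id) pair."""
    match = re.fullmatch(r"(\w+)[:|](\w+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {text!r}")
    prefix, ident = match.groups()
    # Unknown prefixes are passed through unchanged as the namespace.
    return PREFIX_TO_NAMESPACE.get(prefix, prefix), ident


if __name__ == "__main__":
    print(split_identifier("GO:0003476"))
    print(split_identifier("gi|163483"))
```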
BioMoby in detail
• MOBY Data typing system: Semantic Type
• MOBY Data typing system: Syntactic Type
• Moby Registry Queries
The MOBY Object
Ontology
• Syntactic types are defined by a GO-like ontology
– Class name at each node
– Edges define the relationships between Classes
– GO used as a model because of its familiarity in the
community
• Edges define one of three relationships
– ISA
• Inheritance relationship
• All properties of the parent are present in the child
– HASA
• Container relationship of ‘exactly 1’
– HAS
• Container relationship with ‘1 or more’
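The three edge types can be sketched as a small data structure. This is an illustrative model only, not the BioMoby API: the class MobyClass and its methods are hypothetical, and the example classes are borrowed from the sequence slides in this deck.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MobyClass:
    """Hypothetical node of an Object Ontology (illustration only)."""
    name: str
    isa: Optional["MobyClass"] = None          # ISA: inheritance from one parent
    hasa: dict = field(default_factory=dict)   # HASA: articleName -> class, exactly 1
    has: dict = field(default_factory=dict)    # HAS: articleName -> class, 1 or more

    def lineage(self):
        """Class names from this node up to the root, following ISA edges."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.isa
        return chain

    def members(self):
        """Contained members, including those inherited through ISA."""
        inherited = self.isa.members() if self.isa else {}
        own = {k: (v.name, "exactly 1") for k, v in self.hasa.items()}
        own.update({k: (v.name, "1 or more") for k, v in self.has.items()})
        return {**inherited, **own}


obj = MobyClass("Object")
integer = MobyClass("Integer", isa=obj)
string = MobyClass("String", isa=obj)
virtual = MobyClass("VirtualSequence", isa=obj, hasa={"length": integer})
generic = MobyClass("GenericSequence", isa=virtual, hasa={"SequenceString": string})
dna = MobyClass("DNASequence", isa=generic)

if __name__ == "__main__":
    print(dna.lineage())
    print(dna.members())
```

Because ISA means all properties of the parent are present in the child, members() walks up the ISA chain before adding the class's own HASA and HAS entries.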
The Simplest Moby DataType
<Object namespace='NCBI_gi' id='111076'/>
The combination of a namespace and an
identifier within that namespace
uniquely identifies a data entity, not its
location(s) or its representation
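A short sketch of that identity rule using Python's standard ElementTree module. The helper names moby_object and same_entity are hypothetical; the point is that two elements of different classes still denote the same entity when namespace and id agree.

```python
import xml.etree.ElementTree as ET


def moby_object(namespace, identifier, class_name="Object"):
    """Build a content-free element carrying only a namespace/id pair."""
    return ET.Element(class_name, {"namespace": namespace, "id": identifier})


def same_entity(a, b):
    """Same entity when namespace and id both match; the class (tag) is ignored."""
    return (a.get("namespace"), a.get("id")) == (b.get("namespace"), b.get("id"))


if __name__ == "__main__":
    bare = moby_object("NCBI_gi", "111076")
    as_sequence = moby_object("NCBI_gi", "111076", class_name="DNASequence")
    print(ET.tostring(bare, encoding="unicode"))
    print(same_entity(bare, as_sequence))
```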
Moby Primitives
DateTime ISA Object
Float ISA Object
Integer ISA Object
String ISA Object
<Integer namespace='' id=''>38</Integer>
A Derived Data-Type
<VirtualSequence namespace='NCBI_gi' id='111076'>
  <Integer namespace='' id='' articleName="length">38</Integer>
</VirtualSequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
articleName describes the semantic relationship between
the Integer and the VirtualSequence
A Derived Data-Type
<GenericSequence namespace='NCBI_gi' id='111076'>
  <Integer namespace='' id='' articleName="length">38</Integer>
  <String namespace='' id='' articleName="SequenceString">
    ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
  </String>
</GenericSequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
GenericSequence ISA VirtualSequence
GenericSequence HASA String
A Derived Data-Type
<DNASequence namespace='NCBI_gi' id='111076'>
  <Integer namespace='' id='' articleName="length">38</Integer>
  <String namespace='' id='' articleName="SequenceString">
    ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
  </String>
</DNASequence>
Integer ISA Object
String ISA Object
VirtualSequence ISA Object
VirtualSequence HASA Integer
GenericSequence ISA VirtualSequence
GenericSequence HASA String
DNASequence ISA GenericSequence
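The derived types above can be serialized with a few lines of standard-library Python. This is a sketch under assumptions: the helper names primitive and dna_sequence are hypothetical, and the length is derived from the string passed in rather than typed by hand.

```python
import xml.etree.ElementTree as ET


def primitive(tag, article_name, value):
    """A primitive child; articleName names its role inside the parent."""
    child = ET.Element(tag, {"namespace": "", "id": "", "articleName": article_name})
    child.text = str(value)
    return child


def dna_sequence(namespace, identifier, residues):
    """DNASequence: 'length' comes from VirtualSequence, 'SequenceString' from GenericSequence."""
    root = ET.Element("DNASequence", {"namespace": namespace, "id": identifier})
    root.append(primitive("Integer", "length", len(residues)))
    root.append(primitive("String", "SequenceString", residues))
    return root


if __name__ == "__main__":
    seq = dna_sequence("NCBI_gi", "111076", "ATGATGATAGATAGAGGG")
    print(ET.tostring(seq, encoding="unicode"))
```

A DNASequence needs no members of its own here; everything it contains is inherited through its ISA chain.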
Legacy file formats
• Containing “String” allows ontological classes to represent legacy data types
<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
<String namespace='' id='' articleName='content'>
TBLASTN 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|1401126
(504 letters)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters
Searching...done
Sequences producing significant alignments:                           Score  E
                                                                     (bits)  Value
gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...    1009  0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...      58  4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                     53  1e-05
</String>
</NCBI_Blast_Report>
Binaries – pictures,
movies
• text-base64 is a Class that contains String
• Binaries are base64-encoded and passed in classes that inherit from text-base64
• base64_encoded_jpeg ISA text/base64 ISA text/plain HASA String
<base64_encoded_jpeg namespace='TAIR_image' id='3343532'>
<String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</String>
</base64_encoded_jpeg>
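A sketch of the wrapping step, assuming the layout shown above (a String child with articleName 'content'). The helper names wrap_binary and unwrap_binary are hypothetical, and the payload bytes are a stand-in rather than a real image.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(class_name, namespace, identifier, payload):
    """Base64-encode raw bytes and place them in a 'content' String child."""
    root = ET.Element(class_name, {"namespace": namespace, "id": identifier})
    content = ET.SubElement(
        root, "String", {"namespace": "", "id": "", "articleName": "content"}
    )
    content.text = base64.b64encode(payload).decode("ascii")
    return root


def unwrap_binary(element):
    """Recover the original bytes from the 'content' String child."""
    return base64.b64decode(element.find("String").text)


if __name__ == "__main__":
    fake_jpeg = b"\xff\xd8\xff\xe0 stand-in bytes, not a real image"
    wrapped = wrap_binary("base64_encoded_jpeg", "TAIR_image", "3343532", fake_jpeg)
    print(ET.tostring(wrapped, encoding="unicode"))
    print(unwrap_binary(wrapped) == fake_jpeg)
```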
Extending legacy
datatypes
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
<2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
<Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
<Integer namespace='' id='' articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace='' id='' articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
<2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
<Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
<Integer namespace='' id='' articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace='' id='' articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
</String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
<CrossReference>
<Object namespace="TAIR_Allele" id="ufo-1"/>
</CrossReference>
<2D_Coordinate_set namespace='' id='' articleName="pixelCoordinates">
<CrossReference>
<Object namespace='TAIR_Tissue' id='122'/>
</CrossReference>
<Integer namespace='' id='' articleName="x_coordinate">3554</Integer>
<Integer namespace='' id='' articleName="y_coordinate">663</Integer>
</2D_Coordinate_set>
<String namespace='' id='' articleName="Description">
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName="content">
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1U
</String>
</annotated_jpeg>
Cross reference types
• Simple
– A MOBY Object
<Object namespace='foo' id='12345'/>
• Rich
– Takes the form:
<Xref namespace='' id='' authURI='' serviceName='' evidenceCode='' xrefType=''>
... Textual Description ...
</Xref>
– …Incidentally, this avoids the problem of
reification that is experienced in RDF
XML Schema?
The Object Ontology allows new data-types
WITHOUT new flatfile formats, and
without having to understand e.g. XML Schema
Minimize future heterogeneity
Improve interoperability without requiring schema-to-schema mapping
XML Schema?
• Object Ontology terms have semantically
rich names, but this is primarily for human
intuition
– DNA Sequence
– Annotated_GIF
• Object Ontology does not define the
meaning of an object to the machine
– No machine-readable semantics
• It does define the representation
– SYNTAX
A portion of the MOBY-S
Object Ontology
…community-built!
BioMoby in detail
• MOBY Data typing system: Semantic Type
• MOBY Data typing system: Syntactic Type
• Moby Registry Queries
A Moby Central Query
• Give me:
– Services that consume THIS data-type in
THIS syntax…
– …do SOMETHING LIKE THIS to it…
– …and provide me THAT data-type in
response
Example
• Find me services that
– consume FASTA sequence data,
– do a BLAST with it,
– and provide me lists of GenBank GI numbers in
return.
• A query can use any or all of the above
criteria
– Also limit by service provider and service
description keyword
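The query shape above can be sketched against a toy in-memory registry. Everything here is hypothetical: the Service records, provider names and data-type names are invented, the service-type hierarchy is an assumption loosely based on the Service Ontology slide later in this deck, and exact matching on data types is a simplification. It is not the Moby Central API.

```python
from dataclasses import dataclass

# Assumed child -> parent ISA table for service types (illustration only).
SERVICE_ISA = {
    "NCBI_Blast": "Blast",
    "WU_Blast": "Blast",
    "Blast": "Alignment",
    "Alignment": "Analysis",
    "Analysis": "Service",
    "Parsing": "Service",
}


@dataclass(frozen=True)
class Service:
    name: str
    provider: str
    input_type: str
    service_type: str
    output_type: str


def isa(term, ancestor, table):
    """True if term equals ancestor or reaches it by following ISA edges."""
    while term is not None:
        if term == ancestor:
            return True
        term = table.get(term)
    return False


def find_services(registry, input_type=None, service_type=None,
                  output_type=None, provider=None):
    """Filter by any or all criteria; a criterion left as None is ignored."""
    hits = []
    for svc in registry:
        if input_type and svc.input_type != input_type:
            continue
        if service_type and not isa(svc.service_type, service_type, SERVICE_ISA):
            continue
        if output_type and svc.output_type != output_type:
            continue
        if provider and svc.provider != provider:
            continue
        hits.append(svc)
    return hits


# Hypothetical registry contents.
REGISTRY = [
    Service("blastFasta", "example.org", "FASTA", "NCBI_Blast", "NCBI_gi_list"),
    Service("wuBlastFasta", "example.net", "FASTA", "WU_Blast", "NCBI_gi_list"),
    Service("parseBlast", "example.org", "NCBI_Blast_Report", "Parsing", "NCBI_gi_list"),
]

if __name__ == "__main__":
    for svc in find_services(REGISTRY, input_type="FASTA", service_type="Blast",
                             output_type="NCBI_gi_list"):
        print(svc.name, "from", svc.provider)
```

Asking for service type "Blast" also finds the NCBI_Blast and WU_Blast services, because the match follows ISA edges upward.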
Remember!!
Moby Registry Query
INPUT TYPE
|
|
TRANSFORMATION TYPE
|
|
OUTPUT TYPE
A weakness of MOBY
Service discovery is horribly
flawed due to insufficiently
rich semantics…
The problem with Moby
Chickens go in;
Pies come out!
The problem with Moby
What sort o’
pies?
The problem with Moby
Apple!
The MOBY-S Service
Ontology
• A simple ISA hierarchy…
– too simple!
• Primitive types include:
– Analysis
– Parsing
– Registration
– Retrieval
– Resolution
– Conversion
– Rendering
A slice of the Service
Ontology
Analysis ISA Service
Alignment ISA Analysis
Blast ISA Alignment
NCBI_Blast ISA Blast
WU_Blast ISA Blast
Parsing ISA Service
Parse_NCBI_Blast ISA Parsing
Parse_WU_Blast ISA Parsing
“The Exploding Bicycle”
- A. Rector, U Manchester
Summary so far
• BioMoby uses ontologies to describe both
data types and data syntaxes
– This is where the interoperability comes from
– These are used to match consumers with
providers during service discovery
• BioMoby uses a simple ontology to describe
bioinformatics operations
– This ontology is only marginally useful
Seahawk
• Highlight data in
your browser and
drag/drop it into
Moby
• What could be
easier than that?!
Paul MK Gordon and Christoph W Sensen
BMC Bioinformatics 2007, 8:208
Seahawk: A New Moby Client
for Biologists
Drag ‘n’ drop, highlight existing data for use with MOBY Services
Paul Gordon & Christoph Sensen
BMC Bioinformatics, in press
Seahawk looks like a browser
How do I load data?
• Use the “open” button:
– Text file (e.g. FASTA sequences)
– HTML page (e.g. NCBI Entrez Web page)
– RTF document (e.g. conference abstract)
– MOBY XML document
• Drag ‘n’ Drop
– Web links and desktop files
– Highlighted text from open documents or
Web pages
Under the Hood
(Beneath the Bonnet?)
• Data has to be converted into Moby
XML format to be used by Moby
• Moby data has to be converted back
to human-readable text for
presentation to the biologist
Again: How do I load data?
How do I Find Services?
• Right-click → MOB rules are invoked
• Resulting Moby XML is used for service search
How do I run a service?
• Click it!
• If necessary, a
service’s extra
parameters
can be set
• Control+click
submits using
default params
How do I run a service?
• If required inputs
are missing, the
missing ones must
be dragged into
place.
• Unrecognized data
will be rejected
How do I collate data?
• Seahawk clipboard
lets you build
collections of
objects
• Seahawk “knows”
the type of
collection and will
suggest appropriate
Moby services
Seahawk Summary
• Seahawk integrates Moby Web Service
discovery and execution into the
biologist’s day-to-day “Web Surfing”
activity
• It uses Regular Expressions and XSLT to
move normal web or hard-drive-file
data into and out of BioMoby
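A rough sketch of the regular-expression half of that idea: patterns pick identifiers out of ordinary text and turn them into Moby Objects. The RULES table and the function text_to_moby are hypothetical and do not reproduce Seahawk's actual MOB rule syntax; the XSLT half is not shown because Python's standard library has no XSLT processor.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: (pattern, Moby Namespace). Illustration only.
RULES = [
    (re.compile(r"\bGO:(\d{7})\b"), "GO"),
    (re.compile(r"\bgi\|(\d+)"), "NCBI_gi"),
]


def text_to_moby(text):
    """Return one Moby Object element per identifier recognised in the text."""
    objects = []
    for pattern, namespace in RULES:
        for match in pattern.finditer(text):
            objects.append(
                ET.Element("Object", {"namespace": namespace, "id": match.group(1)})
            )
    return objects


if __name__ == "__main__":
    highlighted = "Top hit gi|163483 is annotated with GO:0003476."
    for obj in text_to_moby(highlighted):
        print(ET.tostring(obj, encoding="unicode"))
```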
Why doesn’t Moby
Use RDF/OWL?
Timeline of Moby/W3C Activities
[Figure: two parallel timelines, 2000 to 2006. W3C events: launch of the Semantic Web (SW) Activity Group; RDF Candidate Spec; RDF Schema Candidate Spec; RDF/OWL formal W3C Recommendations; extensive SW tool-building. BioMoby events: project established; BioMoby XML finalized; stable 0.85 API published (>400 services); stable 1.0 API published.]
Moby 2.0
Getting it right, the second time!
What BioMoby Already Does
Sequence
Data
Blast Hit
BLAST SERVER
What BioMoby Already Does
givesBlastResult
Sequence
Data
Blast Hit
Not “Biologically” Meaningful
What BioMoby Already Does
Sequence
Data
hasHomologyTo
Blast Hit
…looks a lot like…
URI hasHomologyTo URI
…which is effectively just an RDF triple
Now think
in reverse…
(in case you forgot…)
Moby Registry Query
INPUT TYPE
|
|
TRANSFORMATION TYPE
|
|
OUTPUT TYPE
Moby 2.0
What does
Sequence
Data
Have homology to?
hasHomologyTo
Maps to
Send data
Blast Hit
BLAST SERVICE
Query
FIND SERVICES THAT
Consume Sequence Data
|
|
Provide hasHomologyTo Property
|
|
Attached to other Sequence Data
SPARQL
• A Semantic Web query language
• Queries “look like” graphs
Find “X”
with predicate “Y”
attached to “Z”
Moby 2.0 extends the
SPARQL query language
• SPARQL queries contain concepts and the
relationships between them (subject,
predicate, object)
• We simply map RDF predicates onto Moby
services capable of generating that
relationship
• Registry query: “What Moby service
consumes [subject] and generates the
[predicate] relationship type?”
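The mapping described above can be sketched as a lookup from (subject type, predicate) to a service, called only when the data does not already hold the answer. This is a toy illustration, not the Moby 2.0 implementation: PREDICATE_SERVICES, resolve_pattern, answer and the run_service callback are all hypothetical names.

```python
# Hypothetical registry: (subject type, predicate) -> service name.
PREDICATE_SERVICES = {
    ("SequenceData", "hasHomologyTo"): "BLAST_SERVICE",
}


def resolve_pattern(subject_type, predicate, registry=PREDICATE_SERVICES):
    """Which service consumes subject_type and generates this predicate? None if unknown."""
    return registry.get((subject_type, predicate))


def answer(triples, subject, subject_type, predicate, run_service):
    """Return objects of (subject, predicate, ?); call a service if none are stored."""
    known = [o for s, p, o in triples if s == subject and p == predicate]
    if known:
        return known
    service = resolve_pattern(subject_type, predicate)
    if service is None:
        return []
    generated = list(run_service(service, subject))
    # Record the newly generated triples so that later queries can reuse them.
    triples.extend((subject, predicate, o) for o in generated)
    return generated


if __name__ == "__main__":
    def fake_run_service(service, subject):
        # Stand-in for a real Web Service call.
        return ["seqB"]

    store = []
    print(answer(store, "seqA", "SequenceData", "hasHomologyTo", fake_run_service))
    print(store)
```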
But wait,
there’s more!
Exploit knowledge in OWL
ontologies to enhance query
Predicate
Subject
Subject
Predicate
Evaluate Query Expression
Look up and execute Moby service
Consumes STK or proteins and
Looks-up inhibitor molecules
Look up and execute Moby service
Consumes proteins and generates
Functional annotation info
Exploit knowledge in OWL
ontologies to enhance query
This SPARQL query could be posed on
a database of RAW, UNANNOTATED
Protein sequences, and be answered
by Moby 2.0 (a.k.a. CardioSHARE)
Credits
• Genome Canada/Genome Alberta
• myGrid – Carole Goble in particular
• Spanish National Institute for Bioinformatics
(INB) through Fundación Genoma España
• Generation Challenge Programme (GCP) of
the Consultative Group for International
Agricultural Research (CGIAR)
• Heart and Stroke Foundation of BC and
Yukon (CardioSHARE)
• Microsoft Research (CardioSHARE)
