Why ontologies for services?

myGrid: Using Workflow and Ontologies to coordinate Bioinformatics Web Services
Carole Goble
http://www.mygrid.org.uk
Wellcome Trust/eScience Programme UK BioGrid Meeting
1-3rd October, 2002, Hinxton Genome Campus
Roadmap
• myGrid’s objectives
• Seeking services
• Taste of myGrid 0.
• Observations
• myGrid:
personalised
extensible environments for
data-intensive
in silico experiments in biology
• EPSRC eScience pilot project
• official start 01/10/01
• actual 01/01/02
• end 30/03/05
• 16 RAs, 9 studentships (start 09/03)
myGrid partners
Courtesy of Mark Wilkinson (BioMOBY)
Information Weaving
• Large amounts of different
kinds of data & many
applications.
• Highly heterogeneous.
– Different types, algorithms,
forms, implementations,
communities, service providers
• High autonomy.
• Highly complex and interrelated, & volatile.
It’s not all numbers
Courtesy of Mark Wilkinson (BioMOBY)
Circadian Rhythms
1. Has anyone else studied the effect of neurotransmitters on the circadian rhythms in Drosophila?
2. I’ve got a cluster of proteins from my experiment. How do their functions interrelate? And what are the proteins with a particular function?
3. Is a structure known for my protein? What other proteins have a similar structure?
4. Publish my results by adding to some annotation in a database.
Workflow
• Know how.
• Associate base resources with derived data.
• Keep, describe, find, compare, protect, share.
• Repeat/reuse/re-enact
• Specialise/Customise/Personalise
• Evolution – notification, knowledge
• Quality & best practice
– It would be good if the workflows were good.
– = good experimental practice.
myGrid 0.0 WSFL
Personalisation
• Dynamic creation of personal data sets.
• Personal views over repositories.
• Personalisation of workflows.
• Personal notification
• Annotation of datasets and workflows.
• Personalisation of service descriptions – what I think the service does.
Provenance
• Who, what, where, why, when, (w)how?
• The traceability of knowledge as it evolves and as it is derived.
• Identity – the Life Sciences ID
• The Lab Book. Methods in papers.
• Immutable Metadata
• Migration – travels with its data but may not be stored with it.
• Aggregates as data aggregates
• Private vs Shared provenance records.
• Ownership => success -> being sued?
• Credit.
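As an illustration of the who/what/where/why/when/how questions and the "immutable metadata" point above, a provenance record could be sketched as a frozen Python dataclass. This is only a minimal sketch under stated assumptions, not the myGrid implementation: the class name, field names and example values are all hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import List, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    """Hypothetical record answering who/what/where/why/when/how."""
    who: str    # person or service that produced the data
    what: str   # identifier of the data item (an LSID-like string here)
    where: str  # site or service where it was produced
    why: str    # purpose of this experiment step
    when: str   # timestamp, kept as an ISO 8601 string for simplicity
    how: str    # workflow or method used
    derived_from: Optional["ProvenanceRecord"] = None  # parent record, if any

    def lineage(self) -> List[str]:
        """Identifiers of this item and of everything it was derived from."""
        record: Optional["ProvenanceRecord"] = self
        chain: List[str] = []
        while record is not None:
            chain.append(record.what)
            record = record.derived_from
        return chain


raw = ProvenanceRecord("alice", "example:seq-1", "lab", "baseline",
                       "2002-10-01T09:00:00", "sequencing")
hit = ProvenanceRecord("alice", "example:blast-report-1", "remote service",
                       "find homologues", "2002-10-01T10:00:00",
                       "sequence alignment workflow", derived_from=raw)

print(hit.lineage())

try:
    hit.who = "someone else"  # any attempt to edit the record is rejected
except FrozenInstanceError:
    print("provenance records cannot be edited in place")
```

Linking each record to its parent rather than copying it is one way to let provenance "travel with" derived data without being stored alongside it.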
myGrid 0.0
WHAT’S WRONG WITH THIS PICTURE?
Event Notification
• Has PDB changed since I last ran this?
• Has the record I derived my record from changed?
• Has the workflow I adapted my workflow from changed?
• Did the provenance record change?
• Has a service I am using right now gone? Has an equivalent one sprung up?
• Event notification service myGrid 0.1
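The questions above all amount to subscribing to change events on some resource. A minimal in-process publish/subscribe sketch in Python is given here for illustration; it is not the myGrid 0.1 notification service, and the class, method and topic names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, str], None]


class NotificationService:
    """Toy in-process publish/subscribe broker keyed by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register callback to be called for every message on topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver message to each subscriber of topic; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


inbox: List[str] = []
service = NotificationService()
service.subscribe("database/PDB", lambda topic, msg: inbox.append(f"{topic}: {msg}"))

notified = service.publish("database/PDB", "new release available")
ignored = service.publish("workflow/example", "definition changed")  # nobody subscribed
print(inbox)
```

A distributed version would also need persistence and delivery guarantees, which this sketch deliberately leaves out.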
myGrid in silico experimentation
• Resource Interoperation.
– Workflow Coordination & Database Integration -> MALCOLM
• Provenance & Change Propagation.
– Improving quality of experiments & data.
• Personalisation & Collaborative working.
– Scientific discovery is personal & global.
– Security, ownership -> valuable assets
• Service based architecture (formerly known as agents)
– Publication, discovery, interoperation, composition,
decommissioning of myGrid services
• Metadata.
– Describing stuff, using ontologies, Semantic Web.
Who is myGrid for?
[Diagram of myGrid users. Labels: biologists; IS specialists; tool builders; systems administrators; infrequent; problem specific; bioinformaticians; service provider; bioinformatics tool builders]
A marketitecture diagram
Portals
Provenance
Personalisation
Metadata
BioMedical Services Library:
DAS, Talisman, workflow sets
Upper level knowledge-based Grid Common
Services:
Semantic integration, knowledge based querying,
workflow composition, visualisation, provenance mgt,
semantic service discovery
Middle level Grid Common Services:
Database access, distributed query processing,
service discovery, workflow enactment, event
notification
Low level Grid Common Services (OGSI)
Co-scheduling, data shipping, authentication, job
execution, resource monitoring …
Security
Knowledge
(ontologies)
Applications
Data mining, PRINTS
annotation workbench
[Architecture diagram. Component labels: Client Framework; Semantic Aspect; Metadata Aspect; Coordination Services; User Agent; Custom Application; Portal; Semantic Data Integration; Provenance; Validation & Assessment; Versioning; QoS; Presentation Services; Semantic Workflow Design; Availability; Information Extraction; Workflow Enactment; Management Tools; Semantic Discovery; Preferences; Distributed Query; Collaboration Support; Third-party Metadata; Ontology Service; Syntactic Discovery; Event Notification; Networked Services; Personal Database; Repository Access; Device Access; Job Execution; ‘White Pages’ & ‘Yellow Pages’ Discovery; Security: Authentication & Authorization; Distributed Resources; Database resources: data and tools]
Current programme
• Use case scenarios.
• Rolling programme of prototyping.
• April myGrid 0.0, October myGrid 0.1 …
• Identifying the most important services.
• Agreeing consistent interfaces.
• Integrating with other Grid services.
• Implementing core services.
• Describing services.
• Connecting with other efforts.
Service based architecture
• Each bio resource is a
service
– Database, archive, analysis,
tool, person, instrument, a
workflow …
• Each myGrid architectural
component is a service
– Workflow enactment engine,
event notification, registry,
scheduler…
• Web services
• Grid services (OGSA)
Service based architecture
• Find them. Publication, registration, discovery, matchmaking, deregistration.
• Run them. Execution, monitoring, exception handling.
• Organise them. Interoperation, composition, substitution.
Service Discovery
• Find appropriate type of services
– sequence alignment
• Find appropriate instances of that service
– BLAST (an algorithm for sequence alignment), as
delivered by NCBI
• Assist in forming an appropriate assembly of
discovered services.
• Find, select and execute instances of services
while the workflow is being enacted.
Knowledge in the head of expert bioinformatician
1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives.
2. Once the user has entered a partial description they submit it for matching. The results are displayed below.
3. The user adds the operation to the growing workflow.
4. The workflow specification is complete and ready to match against those in the workflow repository.
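Steps 1 and 2 above describe matching a partial, property-based description against registered services and showing ranked results. One simple way this could work is sketched below in Python. The registry contents, property names and scoring rule are assumptions made for illustration; the slides do not specify how the matcher and ranker score candidates.

```python
from typing import Dict, List, Tuple

Description = Dict[str, str]

# Hypothetical registry entries; the property names echo the slides' vocabulary.
REGISTRY: Dict[str, Description] = {
    "blastn": {"performs_task": "aligning", "has_feature": "pairwise",
               "requires_input": "nucleotide_sequence"},
    "blastp": {"performs_task": "aligning", "has_feature": "pairwise",
               "requires_input": "protein_sequence"},
    "clustalw": {"performs_task": "aligning", "has_feature": "multiple",
                 "requires_input": "protein_sequence"},
    "pdb_lookup": {"performs_task": "retrieving",
                   "requires_input": "protein_identifier"},
}


def match_and_rank(partial: Description,
                   registry: Dict[str, Description]) -> List[Tuple[str, int]]:
    """Score each service by how many requested properties it satisfies, best first."""
    scored = []
    for name, description in registry.items():
        score = sum(1 for prop, value in partial.items()
                    if description.get(prop) == value)
        if score > 0:  # drop services that satisfy nothing the user asked for
            scored.append((name, score))
    # Sort by descending score, then by name so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


request = {"performs_task": "aligning", "requires_input": "protein_sequence"}
print(match_and_rank(request, REGISTRY))
```

Exact string comparison is the weakest part of this sketch: an ontology-backed matcher would also accept a service whose property value is a subclass of the requested one.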
[Architecture diagram, myGrid 0.0. Component labels: Portal; Repository Client; Personal Repository; Workflow Client; Workflow Repository; Workflow enactment; Bioinformatics services; Client framework; Ontology Client (Meta Data); Ontology Server (Meta Data); DAML+OIL Reasoner (FaCT); Matcher and Ranker; REGISTRY: Service Type Directory, Service instance directory]
Metadata & Ontologies
• Metadata – computationally
accessible data about the
services
• Ontologies – the shared and
common understanding of a
domain
– A vocabulary of terms
– Definition of what those terms
mean.
– A shared understanding for
people and machines
– Usually organised into a
taxonomy.
Why ontologies for services?
1. A shared vocabulary for describing a service
2. Service classifications
– “BLAST” finds tblastx, tblastn, psi-blast, and marks_super_blast.
– “Alignment” finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch
3. Guiding service composition
– Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
Not the only way to find a service.
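To illustrate point 2, a query for a class name can return everything classified beneath it. The Python sketch below uses a small hand-written taxonomy; the hierarchy is hypothetical and only mirrors the examples on this slide, whereas in myGrid the classification would be computed by a reasoner rather than written by hand.

```python
from typing import Dict, List

# child -> parent; a hand-written, hypothetical fragment of a service taxonomy.
PARENT: Dict[str, str] = {
    "Alignment": "Service",
    "ClustalW": "Alignment",
    "BLAST": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
}


def is_a(term: str, ancestor: str) -> bool:
    """True if term equals ancestor or lies beneath it in the taxonomy."""
    while term != ancestor:
        if term not in PARENT:
            return False  # reached the root without meeting ancestor
        term = PARENT[term]
    return True


def find(query: str) -> List[str]:
    """Return every term classified strictly beneath query, sorted by name."""
    return sorted(t for t in PARENT if t != query and is_a(t, query))


print(find("BLAST"))
print(find("Alignment"))
```

Because the lookup is transitive, searching for "Alignment" in this sketch also returns the BLAST variants, not just the direct children.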
Four tiered service descriptions
Domain “semantic”
1. Class of service:
– a protein sequence alignment, a protein sequence database.
2. Specific example of an abstract service:
– BLASTn is a tool for computing sequence homology that uses the BLAST algorithm over nucleotides.
Business “operational”
3. Instance service description of a specific service:
– BLASTn service by the NCBI is 80% reliable.
4. Invoked instance service description:
– BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.
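The four tiers can be read as a chain of records, each pointing to the tier above it. The Python sketch below models that chain; the class names, field names and example values are hypothetical, chosen only to echo the examples on the slide.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ServiceClass:        # tier 1: class of service
    name: str


@dataclass(frozen=True)
class AbstractService:     # tier 2: specific example of an abstract service
    name: str
    service_class: ServiceClass
    algorithm: str


@dataclass(frozen=True)
class ServiceInstance:     # tier 3: a specific service offered by a provider
    abstract: AbstractService
    provider: str
    reliability: float     # e.g. 0.8 for "80% reliable"


@dataclass(frozen=True)
class InvokedInstance:     # tier 4: one actual enactment of an instance
    instance: ServiceInstance
    date: str
    parameters: Dict[str, str]


def describe(run: InvokedInstance) -> str:
    """Walk from the invoked instance back up to its class of service."""
    inst = run.instance
    return (f"{inst.abstract.name} ({inst.abstract.service_class.name}) "
            f"from {inst.provider} on {run.date}")


alignment = ServiceClass("sequence alignment")
blastn = AbstractService("BLASTn", alignment, algorithm="BLAST")
ncbi_blastn = ServiceInstance(blastn, provider="NCBI", reliability=0.8)
run = InvokedInstance(ncbi_blastn, date="2002-10-01",
                      parameters={"database": "example_db"})

print(describe(run))
```

Tiers 1 and 2 carry the domain ("semantic") description, while tiers 3 and 4 carry the operational facts, which is why reliability and parameters appear only in the lower two records.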
W3C: DAML+OIL/OWL
• From the Semantic Web community
• DAML+OIL / OWL designed to describe ontologies
• Information about classes, properties, and individuals as a sequence of axioms and facts & inclusion references to other ontologies, each of which can have an ID which is a URI reference.
• OWL ontologies are web documents referenced by a URI
• Ontologies also reference XML Schema datatypes.
• Automated reasoning for inferring classification lattice and
checking concepts are consistent
• OWL Web Ontology Language 1.0 Reference
• W3C Working Draft 29 July 2002
• http://www.w3.org/TR/owl-ref/
W3C -> Lots of Tools!
http://oiled.man.ac.uk/
class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
      has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
      has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
Reasoning in DAML+OIL
• Consistency – check if knowledge is meaningful
• Subsumption – structure knowledge, compute classification
• Equivalence – check if two classes denote the same set of instances
• Instantiation – check if individual i is an instance of class C
• Retrieval – retrieve the set of individuals that instantiate C
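Four of these reasoning services (subsumption, equivalence, instantiation and retrieval) can be illustrated with a deliberately simplified model in which a class is a flat set of (property, filler) restrictions. The Python sketch below is a toy under that assumption. It is not how FaCT works (FaCT is a tableaux-based description logic reasoner that handles nested expressions), and all class and individual names other than the two taken from the OilEd example are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Restriction = Tuple[str, str]  # (property, filler), flattened for illustration
Definition = FrozenSet[Restriction]

CLASSES: Dict[str, Definition] = {
    "aligning_service": frozenset({("performs_task", "aligning")}),
    "pairwise_sequence_alignment_service": frozenset({
        ("performs_task", "aligning"), ("has_feature", "pairwise"),
        ("requires_input", "nucleic_acid_sequence"),
    }),
    "BLAST-n_service_operation": frozenset({
        ("performs_task", "aligning"), ("has_feature", "pairwise"),
        ("requires_input", "nucleic_acid_sequence"),
    }),
}

INDIVIDUALS: Dict[str, Definition] = {
    "example_blastn_endpoint": frozenset({
        ("performs_task", "aligning"), ("has_feature", "pairwise"),
        ("requires_input", "nucleic_acid_sequence"),
        ("provided_by", "example_provider"),
    }),
    "example_lookup_endpoint": frozenset({("performs_task", "retrieving")}),
}


def subsumes(general: str, specific: str) -> bool:
    """general subsumes specific if specific meets every restriction of general."""
    return CLASSES[general] <= CLASSES[specific]


def equivalent(a: str, b: str) -> bool:
    """Two classes are equivalent when each subsumes the other."""
    return subsumes(a, b) and subsumes(b, a)


def is_instance(individual: str, cls: str) -> bool:
    """An individual instantiates cls if its facts satisfy all of cls's restrictions."""
    return CLASSES[cls] <= INDIVIDUALS[individual]


def retrieve(cls: str) -> List[str]:
    """All individuals that instantiate cls, sorted by name."""
    return sorted(i for i in INDIVIDUALS if is_instance(i, cls))


print(equivalent("pairwise_sequence_alignment_service", "BLAST-n_service_operation"))
print(retrieve("aligning_service"))
```

In this toy model the two classes from the OilEd example come out as equivalent because their definitions are identical, which is the kind of inference a reasoner can make without anyone asserting it.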
Suite
[Diagram of the ontology suite. Ontologies: Upper level ontology; Task ontology; Informatics ontology; Web service ontology; Molecular biology ontology; Bioinformatics ontology; Publishing ontology; Organisation ontology. Legend: "Specialises": all concepts are subclassed from those in the more general ontology. "Contributes concepts to form definitions."]
Uses of ontology
• Labelling data items in databases.
– Semantic typing for controlling inputs and
outputs of workflows
– Use by distributed query processing.
• Workflow, database classification.
• Linking & browsing XML-based components
– COHSE
• Soft build of portals.
• Link with the Life Science Identifier (I3C)
• BioMOBY Central service classification
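The "semantic typing for controlling inputs and outputs of workflows" use listed above can be illustrated with a small check that each step's output type is acceptable as the next step's input. This Python sketch is an assumption-laden illustration: the type hierarchy, step names and error format are hypothetical and are not taken from myGrid.

```python
from typing import Dict, List, NamedTuple, Optional

# type -> supertype; a hypothetical fragment of a data-type ontology.
SUPERTYPE: Dict[str, Optional[str]] = {
    "data": None,
    "sequence": "data",
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "alignment_report": "data",
}


class Step(NamedTuple):
    name: str
    consumes: str  # semantic type of the step's input
    produces: str  # semantic type of the step's output


def is_subtype(actual: str, expected: str) -> bool:
    """True if actual is expected or one of its descendants."""
    current: Optional[str] = actual
    while current is not None:
        if current == expected:
            return True
        current = SUPERTYPE.get(current)
    return False


def type_errors(workflow: List[Step]) -> List[str]:
    """Report every adjacent pair whose output does not fit the next input."""
    errors = []
    for upstream, downstream in zip(workflow, workflow[1:]):
        if not is_subtype(upstream.produces, downstream.consumes):
            errors.append(f"{upstream.name} -> {downstream.name}: "
                          f"{upstream.produces} is not a {downstream.consumes}")
    return errors


good = [Step("fetch_sequence", "data", "nucleotide_sequence"),
        Step("align", "sequence", "alignment_report")]
bad = [Step("translate", "nucleotide_sequence", "protein_sequence"),
       Step("nucleotide_align", "nucleotide_sequence", "alignment_report")]

print(type_errors(good))
print(type_errors(bad))
```

A purely syntactic check would accept both workflows, since every value here is just a string of sequence data; the semantic types are what expose the mismatch in the second one.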
(some) Registry Issues
• Find services based on name, signature,
types, a word (not just using the ontology).
• Registry management – weeding,
authorisation, decommissioning.
• Publishing of services. Keeping their
descriptions up to date and faithful.
• Alternative descriptions of services.
• Staged descriptions.
• Maintenance and evolution of the ontology
• Multiple registries – personal, local,
enterprise
We are not alone
Open Source
Open Bio Foundation BioJava, BioPerl …
(DeFacto) Standards
OMG LSR, I3C, MGED, Gene Ontology
Other Projects
Astrogrid, Geodise,
CLEF, Comb-e-chem,
BIRN, OGSA-DAI
Semantic Web
RDF, RDFS, DAML+OIL
Bioinformatics integration platforms
DAS, OpenBSA, ISYS, OpenMMS, Kleisli, Ensembl, AppLab,
SRS, BioNavigator, DiscoveryLink, K1
TAMBIS. BioMOBY …
Web Services
XML, SOAP, WSDL, UDDI
Distributed Computing Environments
CORBA, RMI, JavaOne
GRID
Globus/SRB/Condor/Sun Grid Engine
What about other efforts?
• Integration
– DAS: Distributed Annotation System
– ISYS: Integration of Desktop Tool
– DiscoveryLink: wrapper and distributed query
environment
– GO: Gene Ontology etc…
• Service discovery and common typing
– BioMOBY: Integration of online biological
databases and analysis services
• Tackling parts of the problem.
• myGrid is a framework for a platform.
Top 10 thoughts
1. Application driven by use cases
2. Open Source
3. Data object types, APIs, protocols, ontologies have a longer life span than s/w
4. Components are useful – don’t have to buy into
the whole shooting match.
5. Don’t reinvent the wheel
6. Get others to build services / applications
7. Lower barriers of entry
8. Keep it simple.
9. It’s distributed and global
10. One solution won’t work
The myGrid team
• Carole Goble
• Norman Paton
• Brian Warboys
• Stephen Pettifer
• Alvaro Fernandes
• Luc Moreau
• Dave De Roure
• Chris Greenhalgh
• Tom Rodden
• John Brooke
• Paul Watson
• Alan Robinson
• Rob Gaizauskas
• Ian Horrocks
• Robert Stevens
• Neil Wipat
• Matthew Addis
• Nick Sharman
• Rich Cawley
• Simon Harper
• Karon Mee
• Simon Miles
• (Vijay Dailani)
• Xiaojian Liu
• Mark Greenwood
• Darren Marvin
• Justin Ferris
• Nedim Alpdemir
• Milena Radenkovic
• Kevin Glover
• (Angus Roberts)
• Tony Storey
• Bernard Horan
• Paul Smart
• Robert Haynes
• Phil Lord
• Neil Davis
• Peter Li
• Luca Toldo
• Tom Oinn
• Robin McEntire
• Martin Senger
• Anne Westcott
• Chris Wroe