Metadata in myGrid: Finding Services for in silico Science
Dr Katy Wolstencroft
myGrid
University of Manchester
…or how to use metadata and semantics to add value in a ‘standards free’ environment
Outline
• Introduction to Taverna, myGrid and myExperiment
• Bioinformatics – use of Web services and other services
• Semantic Service Discovery in myGrid
• myGrid ontology
• Our experiences
• BioCatalogue – bioinformatics service registry
Taverna Workflow Workbench
• Design and execution of workflows
• Access to local and remote resources and analysis tools
• Automation of data flow
• Iteration over large data sets
• Part of the myGrid project
[Figure: myGrid architecture. Labels: Client Applications; myGrid; Workflow Warehouse; Service / Component Catalogue; Service Ontology; Taverna Workbench GUI; myExperiment Web Interface; Provenance Ontology; Provenance Warehouse; Feta; Information Services; Service Management; Taverna Workflow Enactor; LogBook; Provenance Management; 3rd Party Resources (Web Services, Grid Services); Default; Results; Custom; Datasets; Resources]
Lots of Resources
NAR 2008 – over 1000 databases
Where From?
• Over 3500 services available
• Major Service Providers
– European Bioinformatics Institute
– DNA DataBank of Japan
– NCBI – USA
• ‘Boutique’ Services
– Individual research labs producing public data sets
– Specialist tools for niche experiments
What types of services?
• HTML
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows
Variable or non-existent documentation or help
Taverna in an ‘open’ world
Advantages
• Connection to lots of resources
• Flexible system
• Can adapt to new technologies
Disadvantages
• Services are developed for other purposes
• We can’t control how they work
• We have to deal with the heterogeneity
Taverna Use
• Users worldwide
• Over 48500 downloads
• Bioinformatics – largest group of users
• Other users from
– astronomy
– chemoinformatics
– health informatics
– systems biology
– social sciences
High throughput experiments
• Sleeping Sickness in African Cattle
• Caused by infection by parasite (Trypanosoma brucei)
• Some cattle breeds more resistant than others
• Differences between resistant and susceptible cattle?
• Can we breed cattle resistant to infection?
Microarray
QTL analysis
Paul Fisher, Andy Brass, Steve Kemp
Fisher et al. (2007). Nucleic Acids Res. 35(16):5625-33
http://www.genomics.liv.ac.uk/tryps/trypsindex.html
Bioinformatics Workflows
• Workflows allow high throughput experiments and automation
• Workflows are encapsulations of experiments
• Workflows developed for one experiment can be reused for others
• Easier to share, reuse and repurpose
The METHODS section of a scientific publication
Workflow Reuse
• Downloaded 836 times
• Viewed 799 times
• Jo Pennock, lab biologist with no bioinformatics experience – Mouse whipworm infection
• Identified no candidate genes in 2 years with manual analysis
• Identified candidate genes in several hours using Paul’s workflow
In Silico Science Life Cycle
Workflows are combinations of
different services
Locations and descriptions of services
required at the design phase
Reusing workflows – need to
understand what they do
Finding Services
When using services, scientists need to:
• Find them – in distributed locations, produced by
different host institutions
• Interpret them – what do the services do - what
experiments can they perform using them?
• Know how to invoke them – what data and initial
parameters do they need to supply?
We could Google for them…
• If a service is called by the name you expect, you’ll find it
– Search for ‘clustalw’ and ‘web service’
• What if it’s not?
– The clustalw program from emboss is called ‘emma’
– What if it’s the only web service version of clustalw?
– Does it stop you designing your workflow?
Metadata from a WSDL
<wsdl:message name="getGlimmersResponse">
  <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
</wsdl:message>
<wsdl:message name="aboutServiceRequest"/>
<wsdl:message name="getGlimmersRequest">
  <wsdl:part name="in0" type="xsd:string"/>
  <wsdl:part name="in1" type="xsd:string"/>
  <wsdl:part name="in2" type="xsd:string"/>
  <wsdl:part name="in3" type="xsd:string"/>
  <wsdl:part name="in4" type="xsd:string"/>
  <wsdl:part name="in5" type="xsd:string"/>
  <wsdl:part name="in6" type="xsd:string"/>
  <wsdl:part name="in7" type="xsd:int"/>
  <wsdl:part name="in8" type="xsd:string"/>
Slide callouts:
• Name of the service
• Uninformative names for parameters
• What kind of string?
Pathport Web service from the Virginia Bioinformatics Institute
http://pathport.vbi.vt.edu/services/wsdls/beta/glimmer.wsd
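To see how little a WSDL alone tells a scientist, the message parts can be listed programmatically. The following is a minimal sketch, assuming a shortened copy of the fragment wrapped in a `wsdl:definitions` element with the standard WSDL 1.1 namespace; the wrapper, the closing tag and the helper name `list_parts` are assumptions for illustration, not part of the Pathport service.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Shortened fragment, wrapped so that it parses on its own (assumption).
WSDL_FRAGMENT = """
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:message name="getGlimmersRequest">
    <wsdl:part name="in0" type="xsd:string"/>
    <wsdl:part name="in1" type="xsd:string"/>
    <wsdl:part name="in7" type="xsd:int"/>
  </wsdl:message>
</wsdl:definitions>
"""


def list_parts(wsdl_text, message_name):
    """Return (part name, declared type) pairs for one WSDL message."""
    root = ET.fromstring(wsdl_text)
    for message in root.findall(f"{{{WSDL_NS}}}message"):
        if message.get("name") == message_name:
            return [(part.get("name"), part.get("type"))
                    for part in message.findall(f"{{{WSDL_NS}}}part")]
    return []


parts = list_parts(WSDL_FRAGMENT, "getGlimmersRequest")
for name, xsd_type in parts:
    # Names such as "in0" and types such as "xsd:string" say nothing
    # about what biological data the service expects.
    print(name, xsd_type)
```

Everything recoverable here is syntactic: a parameter position and an XML Schema type. Whether a string is a sequence, a database name or a file format has to come from annotation.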
Semantics and Web Services
• SAWSDL – Semantic Annotations for WSDL working group
• Virtually no uptake by bioinformatics service providers
• Doesn’t address non-WSDL services
Adding Semantics – Annotating Services
Find services by their function instead of their name
• The services might be distributed, but a registry of
service descriptions can be central and queried
• We need to annotate services with semantics
In myGrid, we use the Feta Semantic Discovery tool
and a semantic annotation tool – and expert curation
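The difference between finding a service by name and finding it by function can be sketched as below. This is an illustration only, not the Feta API: the `ServiceDescription` class, the registry contents, the example.org locations and both search functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDescription:
    name: str
    location: str                            # hypothetical endpoint
    terms: set = field(default_factory=set)  # annotation terms


# A central registry of descriptions; the services themselves stay distributed.
REGISTRY = [
    ServiceDescription("emma", "http://example.org/emboss/emma",
                       {"aligning", "clustalw"}),
    ServiceDescription("blastp", "http://example.org/blastp",
                       {"similarity_search"}),
]


def find_by_name(registry, keyword):
    """Keyword search over service names only."""
    return [s.name for s in registry if keyword.lower() in s.name.lower()]


def find_by_function(registry, term):
    """Search over the annotation terms attached to each service."""
    return [s.name for s in registry if term in s.terms]


print(find_by_name(REGISTRY, "clustalw"))      # the name search misses 'emma'
print(find_by_function(REGISTRY, "clustalw"))  # the annotation search finds it
```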
myGrid Ontology
Logically separated into two parts:
• Service ontology
Physical and operational features of (web) services
• Domain ontology
Annotation vocabulary for core bioinformatics data, data
types and their relationships
Service Ontology
• Models services from the point of view of the scientist
– Where is it?
– How many inputs/outputs?
– Who hosts it?
• Invocation details are hidden by the Taverna workbench
• Differs from related initiatives in this respect
Domain Ontology
• Informatics: captures the key concepts of data, data structures,
databases and metadata.
• Bioinformatics: The domain-specific data sources (e.g. the model
organism sequencing databases), and domain-specific algorithms
for searching and analyzing data (e.g. the sequence alignment
algorithm, clustalw).
• Molecular biology: Concepts include examples such as, protein
sequence, and nucleic acid sequence.
• Formats: A hierarchy describing bioinformatics file formats. For
example, fasta format for sequence data, or phylip format for
phylogenetic data
• Tasks: A hierarchy describing the generic tasks a service operation
can perform. Examples include retrieving, displaying, and aligning.
myGrid Ontology
[Figure: structure of the myGrid ontology. Component ontologies: Web Service, Task, Informatics, Molecular Biology and Bioinformatics, linked by ‘Specialises’ and ‘Contributes to’ relations. Example data classes: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example service classes: Similarity Search Service, BLAST service, BLASTp service, InterProScan service]
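The value of a class hierarchy for discovery is that a query for a general concept also reaches its more specific subclasses. A minimal sketch follows, assuming one plausible arrangement of the sequence classes named above; the parent/child links in `IS_A` are an assumption, since the figure itself is not recoverable, and the helper names are hypothetical.

```python
# Child -> parent links. This particular arrangement is assumed.
IS_A = {
    "biological_sequence": "sequence",
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
}


def ancestors(term):
    """Return the term followed by its superclasses, most specific first."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def is_a(term, candidate_parent):
    """True if term equals candidate_parent or is one of its subclasses."""
    return candidate_parent in ancestors(term)


print(ancestors("DNA_sequence"))
print(is_a("protein_sequence", "biological_sequence"))
```

Under this assumption, a service annotated as accepting biological_sequence would be returned for a scientist holding a DNA_sequence, while a service restricted to nucleotide_sequence would not be returned for a protein_sequence.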
Example Service Annotation
• Example: BLAST from the DDBJ
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
  • biological sequence (and format)
  • database name (and format)
  • blast program (and format)
– Outputs: Blast Report
• Minimum Information model
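The BLAST annotation above can be written down as a small record. This is a sketch of one possible representation, not the myGrid data model: the class names `Parameter` and `ServiceAnnotation` and their field names are hypothetical, and formats are left empty because the slide does not state them.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    concept: str           # what the data is
    data_format: str = ""  # left empty when the format is not stated


@dataclass
class ServiceAnnotation:
    service: str
    provider: str
    performs_task: str
    uses_method: str
    uses_resources: list
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


blast_ddbj = ServiceAnnotation(
    service="BLAST",
    provider="DDBJ",
    performs_task="Alignment",
    uses_method="Similarity Search Algorithm",
    uses_resources=["DNA/Protein sequence databases"],
    inputs=[
        Parameter("biological sequence"),
        Parameter("database name"),
        Parameter("blast program"),
    ],
    outputs=[Parameter("Blast Report")],
)

print([p.concept for p in blast_ddbj.inputs])
```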
Minimum Models in Biology
• MIBBI – Minimum Information about Biomedical and Biological Investigations
– MIAME – Microarray experiments
– MIAPE – Proteomics
– MIRIAM – Biochemical models (SBML models)
– Etc
– MIOAWS – Minimum Information About the Operation of the Web Service
myGrid Ontology
First version of the ontology ~ 2002
Originally developed in DAML+OIL
Now developed in OWL and a version exported to RDFS
Number of classes in the ontology ~750
Domain and service ontology used by myGrid users and
developers of myGrid related plugins
Service ontology also used by BioMoby
W3C compliant with respect to ontology modelling
How do we use the ontology?
Two methods of service description
1. Decision Making - reasoning
Single description – whole service model
Ontology used to build a single, complete service description and
annotations are classified
Enables automated composition of workflows
2. Decision Support - querying
Composite matches to ontology terms
Multiple terms are used to query the annotations
Originally – Decision Making
• Difficult and time consuming to produce the detailed
service descriptions
• Assumption that people would want automated workflow
composition
[Figure: example workflow from an input sequence to predicted genes, built from three Web services: RepeatMasker (only 1 exists), Gene Prediction (many different algorithms – effective with different organisms etc) and Blast (works over underlying databases)]
Resource Compatibility Difference?
• Scientist’s choice – can they be sure the experiments are equivalent?
Example: Nucleotide sequence databases
• GenBank - USA
• EMBL - Europe
• DDBJ - Japan
Nightly updates – mirrored data BUT the sequence
annotation could be different
myGrid – Decision Support
– Reducing the list of known services from thousands to several
– Scientist makes the final decision about which of a
selection of services to use
– Services are ‘tagged’ with terms from the ontology –
very simple!
– No requirement for OWL-DL reasoning
– Generating service annotations is much easier
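Decision support by tagging needs nothing more than set comparison: a service is shortlisted when its tags include every term in the query. The sketch below assumes a hypothetical set of tagged services; the service names, tags and the `shortlist` function are illustrative and not taken from the myGrid registry.

```python
def shortlist(tagged_services, required_terms):
    """Keep the services whose tags include every query term."""
    required = set(required_terms)
    return sorted(name for name, tags in tagged_services.items()
                  if required <= tags)


# Hypothetical services tagged with ontology terms.
TAGGED = {
    "ddbj_blast": {"aligning", "nucleotide_sequence", "DDBJ"},
    "ebi_blastp": {"aligning", "protein_sequence", "EBI"},
    "interproscan": {"protein_sequence", "EBI"},
}

# Each extra term narrows the list; the scientist chooses from what remains.
print(shortlist(TAGGED, ["protein_sequence"]))
print(shortlist(TAGGED, ["aligning", "protein_sequence"]))
```

No reasoner is involved; the composite query is a subset test over the tags of each service.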
So why do we need OWL?
Building workflows is a two-stage process
1. Assembly – identifying services that perform the
scientific functions needed for the experiment
2. Gluing – identifying how (or more usually, if) these services are compatible
If they are incompatible – we need services that convert
data formats and act as connectors – we call these
services Shims
Cases for using the OWL version
• Automatic shim integration
– Shims don’t do anything scientific, so choosing one over another
makes no difference
• Detecting mismatches
– A scientist has built a workflow and the output of processor 1 is
incompatible with processor 2
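Both cases come down to comparing what one processor produces with what the next one accepts. The sketch below substitutes plain string equality on format names for OWL reasoning, so it only illustrates the control flow; the `SHIMS` table, the shim name and `check_link` are hypothetical.

```python
# (produced format, required format) -> name of a converting shim.
# The single entry is a made-up example.
SHIMS = {("genbank", "fasta"): "genbank_to_fasta"}


def check_link(output_format, input_format):
    """Classify the link between two workflow processors."""
    if output_format == input_format:
        return ("compatible", None)
    shim = SHIMS.get((output_format, input_format))
    if shim is not None:
        # A shim does nothing scientific, so it can be inserted automatically.
        return ("needs_shim", shim)
    # No known converter: report the mismatch to the scientist.
    return ("mismatch", None)


print(check_link("fasta", "fasta"))
print(check_link("genbank", "fasta"))
print(check_link("phylip", "fasta"))
```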
Limitations of the Current Model
• Feta discovery tool is only accessible from the Taverna
Workbench
• Only pertinent to Taverna users – other people need to
find and use web services
• Focuses on finding services, but not workflows. For
reuse, we need to do both
• Closed annotation system - myGrid curator provides
service descriptions – only 700 so far!
BioCatalogue:
Public Bioinformatics Service Registry
• Collaboration between University of Manchester and EBI
• Expanding from a service for Taverna users to a service
for anyone using bio web services
• Combine service and workflow discovery
• Accelerating the process of gathering service
descriptions/annotations by engaging the scientific
community
• Combines the myGrid initiative with BioMoby etc
Combining Service and Workflow Discovery
myExperiment – social networking – Web 2.0
• Workflows tagged
• No formal model
• No control
• Services – semantically described, ontology terms
• Access each through the same interface
• Exchanging metadata objects
‘Shopping’ for Services and Workflows
Screen shot of bio Service shopping site
Getting the Minimum
Community annotation
• Must be easy and quick
• Must allow partial descriptions
• Multiple annotations of the same service
• What is the minimum information to enable
– service discovery
– service invocation
• Tagging terms to formal models – OWL, SKOS
intermediate?
Grading Services
• Bronze – enough to locate the service. Example of
service invocation
• Silver
• Gold
• Platinum – full description. All properties annotated –
including dependencies between them – reliability
metrics etc
Annotation Provenance
• Who said what about what?
• Harvesting community annotation
• Verifying and augmenting by a curator
• ‘Trust’ Models
• Annotation versions
– In a workflow context
– As stand alone services
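"Who said what about what" maps naturally onto a record with an annotator, a statement and a subject. The sketch below adds one possible trust rule, preferring curator-verified statements and falling back to community ones; that rule, the `AnnotationRecord` class and the sample records are assumptions for illustration, not a description of the BioCatalogue design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationRecord:
    annotator: str         # who
    statement: str         # said what
    subject: str           # about what (a service or a workflow)
    version: int = 1
    curated: bool = False  # True once a curator has verified it


def trusted_view(records, subject):
    """Prefer curator-verified statements; fall back to community ones."""
    about = [r for r in records if r.subject == subject]
    verified = [r for r in about if r.curated]
    return verified or about


RECORDS = [
    AnnotationRecord("community_user", "performs task: aligning", "emma"),
    AnnotationRecord("curator", "uses method: clustalw", "emma",
                     version=2, curated=True),
    AnnotationRecord("community_user", "performs task: aligning", "blast"),
]

print([r.statement for r in trusted_view(RECORDS, "emma")])
```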
Annotation Process
Open Issues
• ‘Open’ world means we cannot impose metadata
standards
• Lots of heterogeneity
• Ontology modelling – stable standards to build upon
• Web services – shifting standards – need flexibility for
future-proofing
• Other services as well as web services
• Combining and exchanging metadata objects behind
interfaces
• Can we adopt something from the digital library community? e.g. OAI and ORE (Open Archives Initiative Object Reuse and Exchange)
myGrid acknowledgements
Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer
• OMII-UK: Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop
• Research: Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.
• Current contributors: Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.
• User Advocates and their bosses: Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson
• Past Contributors: Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.
• Industrial: Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.
• Funding: EPSRC, Wellcome Trust.
http://www.mygrid.org.uk
http://www.myexperiment.org