Recording and Reasoning Over
Data Provenance in
Web and Grid Services
Martin Szomszor and Luc Moreau
University of Southampton
Contents
A definition of provenance
Example 1: Aerospace engineering
Example 2: Organ transplant management
Example 3: Bioinformatics grid
Provenance architecture
Provenance service
Conclusion
The Grid and Virtual
Organisations
The Grid problem is defined as coordinated resource
sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01].
Effort is required to allow users to place their trust in
the data produced by such virtual organisations.
Understanding how a given service is likely to modify
data flowing into it, and how this data has been
generated is crucial.
Provenance and Virtual
Organisations
Given a set of services in an open grid environment
that decide to form a virtual organisation with the aim
of producing a given result,
how can we determine the process that generated the
result, especially after the virtual organisation has
been disbanded?
The lack of information about the origin of results
makes it hard for users to trust such open environments.
Provenance and Workflows
Workflow enactment has become popular in the
Web Services and Grid communities.
Workflow enactment can be seen as a scripted
form of virtual organisation.
The problem is similar: how can we determine
the origin of enactment results?
Provenance: Definition
Provenance is an annotation able to explain how a
particular result has been derived.
In a service-oriented architecture, provenance
identifies what data is passed between services, what
services are available, and what results are generated
for particular sets of input values, etc.
Using provenance, a user can trace the “process” that
led to the aggregation of services producing a
particular output.
Provenance in Aerospace
Engineering
Aerospace engineering requires scientific simulations,
data pre- and post-processing, and visualisation,
composed into complex workflows.
Provenance in Aerospace
Engineering
Provenance is crucial in this context: many customers
who use the end results of simulations require a
historical record of the outputs from each sub-system.
For instance, aircraft provenance data must be kept
for up to 99 years when the aircraft is sold to some countries.
Currently, however, little direct support is available for
this.
Provenance in Organ Transplant
Management
Medical information systems, and in particular decision
support systems for organ and tissue transplant, rely on
a wide range of data sources, patient data, and knowledge
added by doctors, surgeons and other individuals using
the systems.
Provenance in Organ Transplant
Management
Such a domain is heavily regulated.
European, national, regional and site-specific rules
govern how decisions are made.
The application of these rules must be ensured and
auditable, and the rules may change over time.
Patient recovery is highly dependent on
organ allocation choice,
extraction and insertion methods,
and the care/recovery regime.
Provenance in Organ Transplant
Management
Track back previous decisions in any one centre to
identify whether the best match was made, who was
involved in the decision, and what the context was.
Maximise the efficiency of matching and the
recovery rate of patients.
Provenance in a Bioinformatics
Grid (myGrid)
myGrid aims to build a personalised problem-solving
environment, in which:
the scientist can construct in silico experiments,
find and adapt those of others,
store results in data repositories,
have their own view on public repositories,
be better informed as to the provenance and
the currency of the tools and data directly
relevant to their experimental space.
Provenance in a Bioinformatics
Grid (myGrid)
Two major forms of provenance [Greenwood03]:
The derivation path records the process by which
results are generated from input data.
Derivation data answers questions about what initial
data was used for a result, and how the transformation
from initial data to result was achieved.
The FDA requires drug companies to keep a record of
the provenance of drug discovery for as long as
the drug is in use (sometimes up to 50 years).
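The two derivation questions above (what initial data was used, and how the result was obtained from it) can be illustrated with a small sketch. This is not the myGrid implementation; the `DerivationGraph` class, its method names and the example service names (`blast`, `annotate`) are hypothetical and chosen only for illustration.

```python
class DerivationGraph:
    """Hypothetical store of derivation steps: which service produced
    each data item, and from which inputs."""

    def __init__(self):
        self._steps = {}  # output id -> (service name, tuple of input ids)

    def record(self, output, service, inputs):
        self._steps[output] = (service, tuple(inputs))

    def initial_data(self, result):
        """Data items that `result` depends on and that have no recorded
        derivation themselves, i.e. the initial inputs."""
        seen, stack, roots = set(), [result], set()
        while stack:
            item = stack.pop()
            if item in seen:
                continue
            seen.add(item)
            if item in self._steps:
                stack.extend(self._steps[item][1])
            else:
                roots.add(item)
        return roots

    def derivation_path(self, result):
        """Steps that led to `result`, ordered so that every step appears
        after the steps that produced its inputs."""
        ordered, visited = [], set()

        def visit(item):
            if item in visited or item not in self._steps:
                return
            visited.add(item)
            service, inputs = self._steps[item]
            for source in inputs:
                visit(source)
            ordered.append((service, inputs, item))

        visit(result)
        return ordered


# Example: a two-step in silico experiment (names are illustrative only).
graph = DerivationGraph()
graph.record("hits", "blast", ["sequence"])
graph.record("report", "annotate", ["hits", "db_annotations"])

initial = graph.initial_data("report")
path = graph.derivation_path("report")
```

Here `initial` holds the two items with no recorded derivation, and `path` lists the `blast` step before the `annotate` step that consumes its output.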
Provenance in a Bioinformatics
Grid (myGrid)
Two major forms of provenance [Greenwood03]:
Annotations are attached to objects, or collections
of objects.
Annotation data provides more contextual
information that might be of interest: who
performed an experiment and when, whether they
supplied any comments on the specific methods and
materials used, when an object was created and last
updated, who owns it, and its format.
Useful for providing a personalised environment.
Other Provenance Requirements
and Uses
Standard lineage representation, automated
lineage recording, unobtrusive information
collecting [Frew and Bose 02]
To give reliability and quality, justification and
audit, re-usability, reproducibility and
repeatability, change and evolution, ownership,
security, credit and copyright [Goble02]
What is the problem?
Provenance recording should be part of the
infrastructure, so that users can elect to enable
it when they execute their complex tasks over
the Grid or in Web Services environments.
Currently, the Web Services protocol stack and
the Open Grid Services Architecture do not
provide any support for recording provenance.
Our Contributions
A service-oriented architecture for provenance support
in Grid and Web Services environments, based on the
idea of a provenance service;
A client-side API for recording provenance data for
Web Service invocation;
A data model for storing provenance data;
A server-side interface for querying provenance data;
Two components making use of provenance:
provenance browsing and provenance validation.
Overall Architecture
Overall Architecture
Provenance gathering is a collaborative process that
involves multiple entities, including the workflow
enactment engine, the enactment engine's client, the
service directory, and the invoked services.
Provenance data will be submitted to one or more
“provenance repositories” acting as storage for
provenance data.
At the user's request, analysis, navigation and
reasoning over provenance data can be undertaken.
Overall Architecture
Storage could be achieved by a provenance
service.
A library, optionally hosted in the provenance
service, would perform the analysis, navigation
or reasoning.
A client-side library would submit provenance
data to the provenance service.
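The division of labour between the three components can be sketched as follows. This is a minimal in-memory illustration, not the interface of the provenance service presented here: the class names, the `store`/`query` methods, the record fields and the example services (`mesher`, `solver`) are all assumptions.

```python
import time
import uuid


class ProvenanceService:
    """In-memory stand-in for a provenance repository (hypothetical interface)."""

    def __init__(self):
        self._records = []

    def store(self, record):
        self._records.append(dict(record))

    def query(self, **criteria):
        """Return the records whose fields match every given criterion."""
        return [r for r in self._records
                if all(r.get(k) == v for k, v in criteria.items())]


class ProvenanceClient:
    """Client-side library: invokes a service and submits a record of the call."""

    def __init__(self, service, workflow_id):
        self._service = service
        self._workflow_id = workflow_id

    def invoke(self, service_name, func, *args):
        result = func(*args)
        self._service.store({
            "id": str(uuid.uuid4()),
            "workflow": self._workflow_id,
            "service": service_name,
            "inputs": args,
            "output": result,
            "timestamp": time.time(),
        })
        return result


def services_used(service, workflow_id):
    """Analysis-library function: services invoked in a workflow, in submission order."""
    return [r["service"] for r in service.query(workflow=workflow_id)]


# Example: two invocations recorded under one workflow identifier.
repo = ProvenanceService()
client = ProvenanceClient(repo, "wf-1")
mesh = client.invoke("mesher", lambda n: list(range(n)), 3)
total = client.invoke("solver", sum, mesh)
used = services_used(repo, "wf-1")
```

Keeping the analysis function outside the repository class mirrors the point that the analysis library is only optionally hosted in the provenance service.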
System Overview
Sequence Diagram
Identifies the interactions between the provenance
service, the client-side library and the enactment engine.
Creation of a session
Must support the most complex workflows, including
conditional branching, iteration, recursion and
parallel execution.
Support asynchronous submission of provenance data
so that provenance submission does not delay
workflow execution.
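One way to keep submission from delaying the workflow is to queue records and let a background thread deliver them. The sketch below assumes this design; `ProvenanceSession`, `submit` and `close` are hypothetical names, and the `store` callable stands in for the actual call to a provenance service.

```python
import queue
import threading


class ProvenanceSession:
    """Hypothetical session that submits provenance records asynchronously."""

    def __init__(self, store):
        self._store = store          # callable that persists one record; may be slow
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, record):
        """Enqueue the record and return immediately."""
        self._queue.put(record)

    def _drain(self):
        # Deliver records in submission order until the end-of-session marker.
        while True:
            record = self._queue.get()
            if record is None:
                break
            self._store(record)

    def close(self):
        """End the session once every queued record has been stored."""
        self._queue.put(None)
        self._worker.join()


# Example: the workflow submits three records and only waits at session end.
stored = []
session = ProvenanceSession(stored.append)
for step in range(3):
    session.submit({"step": step})
session.close()
```

A single worker thread preserves submission order, which keeps the recorded sequence consistent with the order in which the enactment engine issued its calls.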
Sequence Diagram
Provenance Data Model
Must support recording of all information
necessary to replay execution.
Must support all complex forms of workflows
(recursion, iteration, parallel execution).
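As a rough illustration of these two requirements, the sketch below uses a tree of invocation records: nesting covers sub-workflows and recursion, while optional fields mark the loop iteration or parallel branch a call belonged to. It is not the data model shown in the slides; every field name, the `replay` helper and the `square` service are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Invocation:
    """Hypothetical record of one service invocation."""
    service: str
    inputs: tuple
    output: Any = None
    iteration: Optional[int] = None        # position within a loop, if any
    parallel_branch: Optional[str] = None  # label of a parallel branch, if any
    children: List["Invocation"] = field(default_factory=list)  # nested calls


def replay(invocation, registry):
    """Re-run every recorded invocation whose service is in `registry` and
    return the records whose stored output differs from the new output."""
    mismatches = []
    func = registry.get(invocation.service)
    if func is not None and func(*invocation.inputs) != invocation.output:
        mismatches.append(invocation)
    for child in invocation.children:
        mismatches.extend(replay(child, registry))
    return mismatches


# Example: a workflow with a two-iteration loop; the second record is wrong.
registry = {"square": lambda x: x * x}
root = Invocation("workflow", (), children=[
    Invocation("square", (2,), 4, iteration=0),
    Invocation("square", (3,), 10, iteration=1),
])
bad_iterations = [m.iteration for m in replay(root, registry)]
```

Because inputs and outputs are stored with each record, replay doubles as a simple validation check: only the iteration whose recorded output no longer matches is reported.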
Provenance Data Model
Discussion
In order for provenance data to be useful, we expect such a
protocol to support some “classical” properties of distributed
algorithms.
Using mutual authentication, an invoked service can ensure
that it submits data to a specific provenance server and, vice versa,
a provenance server can ensure that it receives data from
a given service.
With non-repudiation, we can retain evidence of the fact that a
service has committed to executing a particular invocation and
has produced a given result.
We anticipate that cryptographic techniques will be useful to
ensure such properties.
The purpose of the PASOA project is to investigate
provenance in Grid architectures
Funded by EPSRC under the “Fundamental Computer
Science for e-Science” call
In collaboration with Cardiff
www.pasoa.org
Conclusion
Provenance is a largely unexplored domain
Strategic for bringing trust to open environments
Our provenance service is the first attempt to
incorporate provenance in the infrastructure of Web
and Grid services
Need to further investigate the algorithmic
foundations of provenance, which will lead to scalable
and secure industrial solutions.
Acknowledgements
Syd Chapman, IBM
Omer Rana, Cardiff
Andreas Schreiber and Rolf Hempel, DLR
Laszlo Varga, SZTAKI
Ulises Cortes and Steven Willmott, UPC
Mark Greenwood, Carole Goble, Manchester