Formalising a protocol for recording provenance in Grids
Recording and Using Provenance in a Protein Compressibility Experiment
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong,
Klaus-Peter Zauner and Luc Moreau
University of Southampton
High Performance Distributed Computing 05
July 27, 2005
Outline
Biology
The Workflow
Use Cases
Provenance
Implementation
Evaluation
Conclusion
Biology
How do protein sequences (chains of amino acids) fold into a 3D structure?
Which part of DNA translates into one protein sequence?
The structure of protein sequences may help to answer these questions.
Structure can be quantified by textual compressibility.
Which amino acid groupings maximize compressibility?
The Workflow
Get Sequences
Make a Sample
Recode Sample
Compress and Measure
Shuffle the sample
Compress and Measure each permutation
Collate all measures
Produce the average compressibility
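The workflow steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the amino acid grouping in `GROUPS`, the use of `zlib` as the compressor, and the definition of compressibility as the compressed size of the sample divided by the mean compressed size of its shuffled permutations are all hypothetical choices, not taken from the slides.

```python
import random
import zlib

# Hypothetical grouping of amino acids; the groupings used in the real
# experiment are not given in the slides.
GROUPS = {"hydrophobic": "AVLIMFWC", "polar": "STNQYGP", "charged": "DEKRH"}


def recode(sample, groups):
    """Recode Sample: replace each amino acid by a symbol for its group."""
    table = {}
    for symbol, letters in zip("abcdefghijklmnopqrstuvwxyz", groups.values()):
        for letter in letters:
            table[letter] = symbol
    # Letters outside every group are mapped to the placeholder "x".
    return "".join(table.get(ch, "x") for ch in sample)


def compressed_size(text):
    """Compress and Measure: size in bytes after zlib compression (assumed compressor)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def shuffled_sizes(text, permutations, seed=0):
    """Shuffle the sample, then compress and measure each permutation."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(permutations):
        chars = list(text)
        rng.shuffle(chars)
        sizes.append(compressed_size("".join(chars)))
    return sizes


def compressibility(sample, groups, permutations=20, seed=0):
    """Collate all measures and produce an average-based compressibility value."""
    recoded = recode(sample, groups)
    baseline = sum(shuffled_sizes(recoded, permutations, seed)) / permutations
    return compressed_size(recoded) / baseline
```

Under this assumed definition, a value below 1 means the recoded sample compresses better than its shuffled permutations, which suggests structure in the ordering of the sequence.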
Use Case (1)
A bioinformatician, A, downloads sequence data of microbial proteins from the database RefSeq.
A runs the compressibility experiment.
A later performs the same experiment on the same sequence data, again downloaded from RefSeq.
A compares the two experiment results and notices a difference.
A determines whether the difference was caused by the algorithms changing.
Use Case (2)
A bioinformatician performs an experiment on a FASTA sequence encoding a protein.
A reviewer later determines whether or not the sequence was in fact processed by a service that meaningfully processes protein sequences only.
Provenance
The use cases relate to process.
Provenance definition: the provenance of a result is the process that led to that result.
This is a conceptual definition.
Documentation of Process
Conceive a computer-based representation of provenance.
We represent the provenance of some data by documenting the process that led to the data:
documentation can be complete or partial;
it can be accurate or inaccurate;
it can present conflicting or consensual views of the actors involved;
it can provide operational details of execution or it can be abstract.
Heterogeneity
This is a heterogeneous application.
Heterogeneity is common in Grid-based applications.
It has shell scripts, Java programs, and web services.
LCG Atlas: Athena and VDT coexist.
Support for plugging in different execution environments.
Provenance “Lifecycle”
Application
Results
Record Documentation of Process
Query to retrieve the provenance of a result
Provenance Store
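The lifecycle above (record documentation of process, then query for the provenance of a result) can be illustrated with a small in-memory sketch. The names `StepRecord`, `ProvenanceStore`, `record`, and `provenance` are hypothetical and are not the PReServ API; the sketch also assumes each data item is produced by exactly one recorded step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    actor: str      # service, script, or command that ran the step
    inputs: tuple   # identifiers of the data items consumed
    output: str     # identifier of the data item produced


class ProvenanceStore:
    """Hypothetical in-memory store keyed by the data item each step produced."""

    def __init__(self):
        self._by_output = {}

    def record(self, actor, inputs, output):
        """Record documentation of one step of the process."""
        self._by_output[output] = StepRecord(actor, tuple(inputs), output)

    def provenance(self, result):
        """Return the recorded steps that led to `result`, earliest first."""
        ordered, seen = [], set()

        def visit(item):
            rec = self._by_output.get(item)
            if rec is None or item in seen:
                return
            seen.add(item)
            for parent in rec.inputs:
                visit(parent)
            ordered.append(rec)

        visit(result)
        return ordered
```

An application would call `record` as each step completes, and a later query such as `store.provenance("size")` would walk back from the result through every documented step that contributed to it.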
Use Case 1: Do services differ between experiments?
Provenance Store
Retrieve documentation of experiments
[Diagram: documentation of Service A from two experiments, shown side by side]
Highlight differences in services between experiments
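The comparison step can be sketched as a plain dictionary diff. This assumes, hypothetically, that the documentation retrieved for each experiment has already been reduced to a mapping from service name to some description of that service (for example a version string or script checksum); the function name `diff_services` is also an assumption.

```python
def diff_services(run_a, run_b):
    """Return {service: (description_a, description_b)} for every service
    whose documented description differs between the two experiments.
    A service missing from one run is reported with None on that side."""
    differences = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        if a != b:
            differences[name] = (a, b)
    return differences
```

An empty result would indicate that, as far as the documentation shows, the same services were used in both experiments, so the difference in results would have to come from elsewhere, such as the input data.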
Implementation
Implemented as a VDT workflow
Scheduled by Condor
Each service, script, and command records process documentation into a provenance store.
Uses PReServ: a web services implementation of a provenance store.
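The talk's conclusion reports figures for asynchronous recording. One way a service, script, or command might record documentation without blocking the workflow is a background worker fed by a queue. The sketch below is an assumption about how that could look, not a description of PReServ: `AsyncRecorder` and the `send` callable (standing in for the call to the provenance store) are hypothetical.

```python
import queue
import threading


class AsyncRecorder:
    """Hypothetical recorder: record() returns at once, a worker thread sends."""

    def __init__(self, send):
        self._send = send               # callable that delivers one document
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, document):
        """Queue a document for delivery; does not wait for the store."""
        self._pending.put(document)

    def _drain(self):
        # Deliver queued documents one at a time, in the order recorded.
        while True:
            document = self._pending.get()
            try:
                self._send(document)
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued document has been handed to send()."""
        self._pending.join()
```

The workflow step pays only the cost of a queue insertion; calling `flush` at the end of the run ensures nothing is left unsent before the process exits.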
PReServ Implementation Diagram
[Diagram: components and the calls between them]
WS Client: PS Client Side Library, Axis Handler
Web Service: Axis Handler, PS Client Side Library
Query Actor WS: PS Client Side Library
Provenance Service: Backend Store Interface
Backend Stores: Database Store, In-Memory Store
Connections: WS Calls, Java Calls
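The backend store interface in the diagram suggests that the provenance service is written against an abstraction with interchangeable stores. A minimal sketch of that idea follows. PReServ itself is a Java web service, so everything here is an assumption made for illustration: the class and method names are hypothetical, and `sqlite3` stands in for the database store only because it ships with Python.

```python
import sqlite3
from abc import ABC, abstractmethod


class BackendStore(ABC):
    """Hypothetical interface that every backend store implements."""

    @abstractmethod
    def put(self, doc_id, document):
        ...

    @abstractmethod
    def get(self, doc_id):
        ...


class InMemoryStore(BackendStore):
    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        self._docs[doc_id] = document

    def get(self, doc_id):
        return self._docs.get(doc_id)


class SqliteStore(BackendStore):
    """Database-backed store; sqlite3 is an assumed stand-in backend."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (doc_id TEXT PRIMARY KEY, document TEXT)"
        )

    def put(self, doc_id, document):
        with self._conn:  # commits on success
            self._conn.execute(
                "INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, document)
            )

    def get(self, doc_id):
        row = self._conn.execute(
            "SELECT document FROM docs WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None


class ProvenanceService:
    """Depends only on the interface, so stores can be swapped freely."""

    def __init__(self, backend):
        self._backend = backend

    def record(self, doc_id, document):
        self._backend.put(doc_id, document)

    def query(self, doc_id):
        return self._backend.get(doc_id)
```

Because `ProvenanceService` sees only `put` and `get`, switching between the in-memory and database stores is a one-line change at construction time.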
Evaluation Deployment
Runs on VMware for deployment consistency and ease of development
Workflow is executed on one machine
PReServ runs on another machine
Recording Performance
Query Performance
Conclusion
Both recording and query times are linear.
10% overhead for asynchronous recording.
Our provenance concept and system are grounded in a number of use cases.
The experiment is ready to be moved to a cluster or a grid (Southampton Cluster; a Grid).
This will allow us to test scalability.
Contact Info
Paul Groth
[email protected]
www.pasoa.org
- use case descriptions
- papers
- PReServ software
Configuration
Red Hat Linux 9.1 on VMware on Windows XP
Pentium 4, 2.8 GHz, 1.5 GB RAM
PReServ on another machine
Database backend: Berkeley JDB
100 Mb local Ethernet