Formalising a protocol for recording provenance in Grids


Recording and Using Provenance in a Protein Compressibility Experiment
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner and Luc Moreau
University of Southampton
High Performance Distributed Computing 05
July 27, 2005
Outline
Biology
The Workflow
Use Cases
Provenance
Implementation
Evaluation
Conclusion

Biology
How do protein sequences (chains of amino acids) fold into a 3D structure?
Which part of DNA translates into one protein sequence?
The structure of protein sequences may help to answer these questions.
Structure can be quantified by textual compressibility.
Which amino acid groupings maximize compressibility?
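To make "textual compressibility" concrete, here is a minimal sketch that measures it as the ratio of compressed size to original size. The slides do not name a compressor, so zlib is used purely as a stand-in, and the function name and toy sequences are assumptions made for illustration.

```python
import random
import zlib


def compressibility(sequence: str) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    raw = sequence.encode("ascii")
    if not raw:
        raise ValueError("cannot measure an empty sequence")
    return len(zlib.compress(raw, 9)) / len(raw)


# Toy inputs for illustration only; these are not real protein sequences.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)
structured = "MKV" * 200  # highly regular text
unstructured = "".join(rng.choice(AMINO_ACIDS) for _ in range(600))  # no regularity
```

A regular sequence such as `structured` yields a much smaller ratio than `unstructured`, which is the sense in which compressibility can act as a proxy for structure.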
The Workflow
Get Sequences
Make a Sample
Recode Sample
Compress and Measure
Shuffle the sample
Compress and Measure each permutation
Collate all measures
Produce the average compressibility
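The workflow steps above can be sketched end to end as follows. This is an illustrative sketch under several assumptions: zlib stands in for the unnamed compressor, the grouping table is invented, and the way the sample is drawn from the sequences is a guess. The slide does not say how the sample measure and the permutation measures are finally combined, so the sketch simply reports both.

```python
import random
import zlib

# Hypothetical grouping: maps amino acid letters to group symbols.
GROUPING = {"A": "h", "V": "h", "L": "h", "I": "h",
            "K": "p", "R": "p", "D": "n", "E": "n"}


def recode(sample, grouping):
    """Replace each amino acid by its group symbol (unmapped letters are kept)."""
    return "".join(grouping.get(c, c) for c in sample)


def compressed_size(text):
    """Size in bytes of the zlib-compressed text (stand-in compressor)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def run_workflow(sequences, sample_size, permutations, grouping, seed=0):
    rng = random.Random(seed)
    pool = "".join(sequences)                      # Get Sequences
    if sample_size > len(pool):
        raise ValueError("sample_size exceeds the available sequence data")
    start = rng.randrange(0, len(pool) - sample_size + 1)
    sample = pool[start:start + sample_size]       # Make a Sample
    recoded = recode(sample, grouping)             # Recode Sample
    sample_measure = compressed_size(recoded)      # Compress and Measure
    measures = []
    for _ in range(permutations):
        chars = list(recoded)
        rng.shuffle(chars)                         # Shuffle the sample
        measures.append(compressed_size("".join(chars)))  # Measure each permutation
    average = sum(measures) / len(measures)        # Collate and average
    return {"sample_measure": sample_measure,
            "permutation_measures": measures,
            "average_permutation_measure": average}
```

Calling `run_workflow(["MKVLAAGIKRDE" * 50], 200, 5, GROUPING)` on this toy input gives a sample measure well below the average over shuffled permutations, because shuffling destroys the regularity of the sample.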
Use Case (1)
A bioinformatician, A, downloads sequence data of microbial proteins from the database RefSeq.
A runs the compressibility experiment.
A later performs the same experiment on the same sequence data, again downloaded from RefSeq.
A compares the two experiment results and notices a difference.
A determines whether the difference was caused by the algorithms changing.
Use Case (2)
A bioinformatician performs an experiment on a FASTA sequence encoding a protein.
A reviewer later determines whether or not the sequence was in fact processed by a service that meaningfully processes protein sequences only.

Provenance
The use cases relate to process
Provenance definition:
- The provenance of a result is the process that led to that result.
This is a conceptual definition.
Documentation of Process
Conceive a computer-based representation of provenance
We represent the provenance of some data by documenting the process that led to the data:
- documentation can be complete or partial;
- it can be accurate or inaccurate;
- it can present conflicting or consensual views of the actors involved;
- it can provide operational details of execution or it can be abstract.
Heterogeneity
This is a heterogeneous application
- Has shell scripts, Java programs, web services
Heterogeneity is common in Grid-based apps
- LCG Atlas: Athena & VDT coexist
Support for plugging in different execution environments
Provenance “Lifecycle”
[Diagram: the Application produces Results and records documentation of process in the Provenance Store; a query to the Provenance Store retrieves the provenance of a result.]
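The record-then-query lifecycle can be illustrated with a small in-memory sketch. All class, field and method names here are hypothetical and are not taken from PReServ; the sketch also assumes that each data item is produced by exactly one recorded step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRecord:
    actor: str     # service, script or command that performed the step
    inputs: tuple  # identifiers of the data items consumed
    output: str    # identifier of the data item produced


class ProvenanceStore:
    """Toy store: applications record documentation, users query it later."""

    def __init__(self):
        self._records = []

    def record(self, rec):
        self._records.append(rec)

    def provenance_of(self, result_id):
        """Return the records of the process that led to result_id, earliest first."""
        by_output = {r.output: r for r in self._records}
        ordered, seen = [], set()

        def visit(data_id):
            rec = by_output.get(data_id)
            if rec is None or data_id in seen:
                return
            seen.add(data_id)
            for input_id in rec.inputs:
                visit(input_id)  # document upstream steps first
            ordered.append(rec)

        visit(result_id)
        return ordered
```

Querying the store for a result walks back through the recorded inputs, so only the steps that actually contributed to that result are returned.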
Use Case 1: Do services differ between experiments?
Retrieve documentation of experiments from the Provenance Store
[Diagram: the documentation retrieved for Service A in each of the two experiments, shown side by side]
Highlight differences in services between experiments
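The comparison step can be sketched as a simple difference over the documentation retrieved for each experiment. The representation below (a mapping from service name to a description string) and the example values are assumptions for illustration; the slides do not specify what is recorded about each service.

```python
def service_differences(run_a, run_b):
    """Compare two runs, each a dict mapping service name -> recorded description.

    Returns a dict of services whose description differs, with the pair of
    descriptions (None where a service is absent from one run).
    """
    differences = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        if a != b:
            differences[name] = (a, b)
    return differences


# Hypothetical documentation retrieved for two experiment runs.
first_run = {"recode": "grouping-v1", "compress": "compressor-v1"}
second_run = {"recode": "grouping-v1", "compress": "compressor-v2", "collate": "v1"}
```

Here `service_differences(first_run, second_run)` flags `compress` (changed) and `collate` (present only in the second run), while `recode` is left out because it is identical in both.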
Implementation
Implemented as a VDT workflow
Scheduled by Condor
Each service, script and command records process documentation into a provenance store.
Uses PReServ: a web services implementation of a provenance store
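One way to picture "each step records process documentation" is a wrapper that documents every invocation of a step. The sketch below is an illustration only: the decorator, the record layout and the in-memory list standing in for the provenance store are all hypothetical, and the real system records to PReServ over web services rather than to a local list.

```python
import functools

STORE = []  # stand-in for a remote provenance store


def records_provenance(step_name):
    """Wrap a workflow step so that each call is documented in STORE."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            STORE.append({"step": step_name,
                          "inputs": args,
                          "keyword_inputs": kwargs,
                          "output": result})
            return result
        return wrapper
    return decorator


@records_provenance("collate")
def collate(measures):
    """Average a list of compression measures."""
    return sum(measures) / len(measures)
```

The step's own logic stays unchanged; the documentation is produced as a side effect of calling it.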

PReServ Implementation Diagram
[Diagram components: WS Client (Axis Handler, PS Client Side Library); Web Service (Axis Handler, PS Client Side Library); Query Actor WS (PS Client Side Library); Provenance Service; Backend Store Interface; Backend Stores (Database Store, In-Memory Store, …). Legend: WS Calls, Java Calls.]
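The idea of a service sitting on top of interchangeable backend stores can be sketched as follows. PReServ itself is a Java web service; this Python sketch only illustrates the layering suggested by the diagram labels. Every class and method name is hypothetical, and SQLite is swapped in as a convenient stand-in for the database store.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import List


class BackendStore(ABC):
    """Hypothetical backend store interface: append and fetch documents by key."""

    @abstractmethod
    def put(self, key: str, document: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> List[str]: ...


class InMemoryStore(BackendStore):
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs.setdefault(key, []).append(document)

    def get(self, key):
        return list(self._docs.get(key, []))


class DatabaseStore(BackendStore):
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS docs "
            "(id INTEGER PRIMARY KEY, key TEXT, document TEXT)")

    def put(self, key, document):
        with self._conn:  # commits on success
            self._conn.execute(
                "INSERT INTO docs (key, document) VALUES (?, ?)", (key, document))

    def get(self, key):
        rows = self._conn.execute(
            "SELECT document FROM docs WHERE key = ? ORDER BY id", (key,))
        return [row[0] for row in rows]


class ProvenanceService:
    """Depends only on the interface, so backends can be exchanged freely."""

    def __init__(self, backend: BackendStore):
        self._backend = backend

    def record(self, key, document):
        self._backend.put(key, document)

    def query(self, key):
        return self._backend.get(key)
```

Because `ProvenanceService` never refers to a concrete store, the same recording and query code runs unchanged against either backend.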
Evaluation Deployment
Runs on VMware
- deployment consistency
- ease of development
Workflow is executed on one machine
PReServ runs on another machine

Recording Performance
Query Performance
Conclusion
Both recording and query times are linear
10% overhead for asynchronous recording
Our provenance concept and system are grounded in a number of use cases
The experiment is ready to be moved to a cluster or a grid
- Southampton Cluster
- A Grid
- Will allow us to test scalability
Contact Info
Paul Groth
[email protected]
www.pasoa.org
- use case descriptions
- papers
- PReServ software
Configuration
Red Hat Linux 9.1 on VMware on Windows XP
Pentium 4 2.8 GHz, 1.5 GB RAM
PReServ on another machine
Database backend: Berkeley JDB
100 Mb local Ethernet