Scientific Workflows - San Diego Supercomputer Center


Towards Scientific Workflows Based on
Dataflow Process Networks
(or from Ptolemy to Kepler)
Bertram Ludäscher
San Diego Supercomputer Center
[email protected]
LBL, 11/4/2003
Acknowledgements
• NSF, NIH, DOE
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/
Outline
• Scientific Workflows
• Business Workflows
• [Problem Solving Environments (SCIRun)]
• Dataflow Process Networks (Ptolemy-II)
• Scientific Workflows (Kepler)
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
GARP Invasive Species Pipeline
[Figure: GARP invasive species pipeline. EcoGrid queries against registered EcoGrid databases retrieve species presence & absence points (a) and environmental layers (b) for the native range and for the invasion area. Layer Integration produces integrated layers (c); Sample Data splits the presence/absence points into training and test samples (d); the GARP Data Calculation step derives a GARP rule set (e); Map Generation produces native-range and invasion-area prediction maps (f); Validation yields model quality parameters (g); prediction maps selected by the user (h) are described via Generate Metadata and archived back to the EcoGrid.]
Source: NSF SEEK (Deana Pennington, UNM)
Scientific Workflow Aspects
• Data orientation
– Data volume
– Data complexity
– Data integration
• Computational complexity
• Grid-aspects
– Distributed computation
– Distributed data
• Analysis and tool integration
• User-interactions/WF steering
• Data and workflow provenance
Business Workflows
• Business Workflows
– show their office automation ancestry
– documents and “work-tasks” are passed
– no data streaming, no data-intensive pipelines
– lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, XPDL, …
– but often no clear execution semantics for constructs as simple as this:
Source: Expressiveness and Suitability of Languages for Control Flow
Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
A ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
More on Scientific WF vs Business WF
• Business WF
– Tasks, documents, etc. undergo modifications (e.g., flight reservation from
reserved to ticketed), but modified WF objects still identifiable throughout
– Complex control flow, task-oriented
– Transactions w/o rollback (ticket: reserved → purchased)
– …
• Scientific WF
– data-in and data-out of an analysis step are not the same object!
– dataflow, data-oriented (cf. AVS/Express, Khoros, …)
– re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
– data integration & semantic typing as part of SWF framework
– …
Scientific Workflows: Some Findings
• More dataflow than (business) workflow
– but some branching, looping, merging, …
– not: documents/objects undergoing modifications
– instead often: dataset-out = analysis(dataset-in)
• Need for “programming extensions”
– Iterations over lists (foreach); filtering; functional composition; generic &
higher-order operations (zip, map(f), …) (see the sketch after this list)
• Need for abstraction and nested workflows
• Need for data transformations (compute/transform alternations)
• Need for rich user interaction & workflow steering:
– pause / revise / resume
– select & branch; e.g., web browser capability at specific steps as part of a
coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
⇒ data provenance (“virtual data” concept)
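As a concrete illustration of these programming extensions, here is a minimal Haskell sketch; the dataset names and the analysis, qualityOk, and annotate functions are hypothetical placeholders (not part of any existing system), showing how foreach-style iteration, filtering, functional composition, and zip express a typical analysis step over a list of input datasets:

-- Hypothetical dataset and result types (placeholders).
type Dataset = String
type Result  = Double

-- Hypothetical per-dataset analysis steps (placeholders).
analysis :: Dataset -> Result
analysis = fromIntegral . length          -- stand-in computation

qualityOk :: Result -> Bool
qualityOk r = r > 3.0                     -- stand-in filter predicate

annotate :: (Dataset, Result) -> String
annotate (d, r) = d ++ ": " ++ show r     -- stand-in pretty-printer

-- foreach = map, pairing = zip, and functional composition chain the steps:
runAll :: [Dataset] -> [String]
runAll ds = map annotate (zip ds (map analysis ds))

-- filtering composed with a mapped analysis step:
goodOnly :: [Dataset] -> [Result]
goodOnly = filter qualityOk . map analysis

main :: IO ()
main = mapM_ putStrLn (runAll ["geneA", "geneBC", "geneDEFG"])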
Problem Solving Environments
• SCIRun: a dynamic dataflow system (in the Ptolemy sense)
⇒ separate presentation
SWF vs Distributed Computing
• Distributed Computing (e.g. a la Condor(-G))
– Batch oriented
– Transparent distributed computing (“remote Unix/Java”;
standard/Java universes in Condor)
– HPC resource allocation & scheduling
• SWF
– Often highly interactive for decision making/steering of the WF
and visualization (data analysis)
– Transparent data access (Grid) and integration (database
mediation & semantic extensions)
– Desktop metaphor; often (but not always!) light-weight web
service invocation
Dataflow Process
Networks and Ptolemy-II
see!
read!
try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Dataflow Process Networks: Why Ptolemy-II?
• PtII Objective:
– “The focus is on assembly of concurrent components. The key
underlying principle in the project is the use of well-defined
models of computation that govern the interaction between
components. A major problem area being addressed is the use of
heterogeneous mixtures of models of computation.”
• Data & Process oriented:
– Dataflow process networks
• Natural Data Streaming Support
• Pragmatics
– mature, actively maintained, open source system
– leverage “sister projects” activities (e.g. SEEK)
Ptolemy-II Type System
Scientific Workflows = Dataflow Process Networks + ?
Kepler = Ptolemy-II + X
• X = …
– Grid extensions:
• Actors as web/grid services
• 3rd-party data transfer, high-throughput data streaming
• Data and service repositories, discovery
– Extended type system (structural & semantic extensions)
– Programming extensions (declarative/FP)
– Rich user interactions/workflow steering
– Rich data transformations (compute/transform alternations)
– Data provenance
• (semi-)automatic meta-data creation
– …
• … – (minus) upcoming Ptolemy-II extensions (PtII, SEEK, …)!
– The slower we are, the less we have to do ourselves ;-)
X includes: The customer is always right …
• Intuitive …
– component composition
– data binding
– execution monitoring
• Reusability of …
– Generic components (actors)
– Derived data products
• Application specific packaging and “branding”
• Transparent “gridification”
Some specific tasks for Kepler
Legend: $ DONE (or almost ;-), % ONGOING, * NEW
• User interaction, workflow steering
– $ Pause/revise/resume
– % BrowserUI actor (browser as a 0-learning display and selection tool)
• Distributed execution
– % Dynamically port-specializing WSDL actor
– * Dynamically specializing Grid service actor
• Port & actor type extensions (SEEK leverage)
– * Structural types (XML Schema)
– * Semantic types (OWL) incl. unit types w/ automatic conversion (see the sketch after this list)
• Programming extensions
– % Data transformation actors (XSLT, XQuery, Python, Perl,…)
– * map, zip, zipWith, …, loop, switch “patterns”
• Specialized Data Sources
– $ EML (SEEK),
– % MS Access (GEON), *JDBC,
– *XML, *NetCDF, …
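To make the unit-typing item above concrete, here is a minimal Haskell sketch; the Unit and Quantity types, the conversion factors, and the connect function are illustrative assumptions invented here, not Kepler or SEEK code. It shows unit-tagged port values being converted automatically where a producer's unit differs from the consumer's expected unit:

-- Illustrative unit tags and a unit-carrying value (not actual Kepler/SEEK types).
data Unit = Meters | Kilometers | Celsius
  deriving (Eq, Show)

data Quantity = Quantity { magnitude :: Double, unit :: Unit }
  deriving Show

-- Automatic conversion between compatible units; incompatible units fail.
convert :: Unit -> Quantity -> Maybe Quantity
convert to (Quantity x from)
  | from == to                         = Just (Quantity x to)
  | (from, to) == (Kilometers, Meters) = Just (Quantity (x * 1000) Meters)
  | (from, to) == (Meters, Kilometers) = Just (Quantity (x / 1000) Kilometers)
  | otherwise                          = Nothing   -- e.g. length vs. temperature

-- A "connection" adapts the producer's value to the consumer's expected unit.
connect :: Unit -> Quantity -> Either String Quantity
connect expected q =
  maybe (Left ("incompatible units: " ++ show (unit q) ++ " -> " ++ show expected))
        Right
        (convert expected q)

main :: IO ()
main = do
  print (connect Meters (Quantity 2.5 Kilometers))  -- succeeds: 2500.0 Meters
  print (connect Meters (Quantity 20.0 Celsius))    -- fails: incompatible units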
Some specific tasks for Kepler
(all NEW)
• Design & develop transparent, Grid-enabled PNs:
– Communication protocol details
– Grid-actor extensions and/or
– Grid-Process Network director (G-PN)
– Host/Source-location becomes actor parameter
• add “active-inline” parameter display for grid-actors (@exec-loc), channels
(@transport-protocol), source-actors (@{src-loc|catalog-loc})
• Activity Monitoring
– Add “activity status” display (green, yellow, red) to replace PtII animation
(needed for concurrently executing PN!)
• Register & Deploy mechanism
– Actor/Data/Workflow repository (=composite actors)
– Shows up as (config’able) actor library
– OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ)
• http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf
• MOML extensions
– Also separate language?
Example: Grid-enabling
(again: SEEK leverage opportunity)
Dataflow Process Networks
[Figure: two actors connected through a FIFO channel via typed i/o ports (advanced push/pull communication)]
• Synchronous Dataflow Network (SDF)
– Statically schedulable single-threaded dataflow
• Can execute multi-threaded, but the firing sequence is known in advance
– Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
– Multi-threaded, dynamically scheduled dataflow
– More expressive than SDF (dynamic token rates prevent static scheduling)
– Natural streaming model (sketched below)
• Other Execution Models (“Domains”)
– Implemented through different “Directors”
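As an informal illustration of the streaming model (this is not Ptolemy-II code; the actor names are made up), a process network can be sketched in Haskell as functions over lazy streams. The filter actor below has a data-dependent output rate, which is exactly what rules out the static scheduling that SDF enjoys:

-- Actors as stream transformers over (conceptually unbounded) token streams.
type Stream a = [a]

-- Source actor: an unbounded stream of tokens.
sourceActor :: Stream Int
sourceActor = [1 ..]

-- PN-style actor: data-dependent output rate (0 or 1 tokens per input token),
-- so a firing schedule cannot be fixed statically.
filterActor :: Stream Int -> Stream Int
filterActor = filter even

-- SDF-like actor: consumes one token, produces one token (static 1:1 rate).
scaleActor :: Stream Int -> Stream Int
scaleActor = map (* 10)

-- Sink actor: demand-driven consumption of a finite prefix of the stream.
sinkActor :: Int -> Stream Int -> IO ()
sinkActor n = mapM_ print . take n

main :: IO ()
main = sinkActor 5 (scaleActor (filterActor sourceActor))  -- 20, 40, 60, 80, 100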
Transparently Grid-Enabling PtII: Handles
[Figure: actors A and B run in PtII space; grid services GA and GB run in Grid space.]
The logical token transfer from A to B (step 3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion:
1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)
Example: &X = “GA.17”, *X = <some_huge_file>
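The following Haskell fragment is only a conceptual sketch of this handle protocol (the Token type, the grid store, and all function names are invented for illustration; they are not PtII or Globus APIs): the producer ships a small handle token, and the large payload is pulled through the grid layer only when the consumer dereferences it.

import qualified Data.Map as Map

-- A token is either carried by value or referenced by a handle such as "GA.17".
type Handle  = String
type Payload = String

data Token = ByValue Payload | ByHandle Handle
  deriving Show

-- Stand-in for grid-space storage managed by GA/GB (illustrative only).
type GridStore = Map.Map Handle Payload

-- Steps 1-2: producer registers the large payload and obtains a handle.
getHandle :: Handle -> Payload -> GridStore -> (Token, GridStore)
getHandle h payload store = (ByHandle h, Map.insert h payload store)

-- Steps 4-7: consumer dereferences the handle; the bulk transfer happens here.
execHandle :: GridStore -> Token -> Maybe Payload
execHandle _     (ByValue p)  = Just p
execHandle store (ByHandle h) = Map.lookup h store

main :: IO ()
main = do
  let store0        = Map.empty
      (tok, store1) = getHandle "GA.17" "<some_huge_file>" store0
  putStrLn ("Step 3, A -> B sends only the handle: " ++ show tok)
  putStrLn ("Steps 4-7, B dereferences it: " ++ show (execHandle store1 tok))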
Transparently Grid-Enabling PtII
• Different phases
– Register designed WF (could include external validation service)
– Find suitable grid service hosts for actors
– Pre-stage execution
– Execute
– Archive execution log
• Implementation choices:
– Grid-actors (no change of director necessary)
– and/or Grid-(PN)-director (also need to change actors!?)
– Add grid service host id as actor parameter: A@GA
– Similar for data: myDB@GA
Programming Extensions
(some lessons from SciDAC/SSDBM demo)
Promoter Identification Workflow in (control-flow) Ptolemy-II
[Figure: hand-crafted (SSDBM’03) solution, with callouts: hand-crafted Web-service actor; “designed to fit”; complex backward control-flow forces sequential execution; no data transformations available]
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty-print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
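For readers who want to run this specification, the following stub definitions are a minimal, assumption-laden sketch: the placeholder types and dummy function bodies are invented here purely so the pipeline above type-checks; the real steps call out to GenBank, BLAST, and TRANSFAC.

-- Placeholder types so the pipeline above type-checks (illustrative only).
newtype GeneId      = Gid String
type GeneSeq        = String
type PromoterId     = String
type PromoterSeq    = String
type PromoterRegion = String
type TFBS           = String

-- Dummy stand-ins for the remote services (no real GenBank/BLAST/TRANSFAC calls).
genBankG (Gid g)  = "gene-seq-" ++ g
genBankP p        = "promoter-seq-" ++ p
blast s           = [s ++ "-p1", s ++ "-p2"]
promoterRegion s  = s ++ "-region"
transfac r        = [r ++ "-site"]
gpr2str (p, r)    = p ++ "\t" ++ r ++ "\n"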
Simplified Process Network PIW
• Back to a purely functional dataflow process network
(= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
⇒ no control-flow spaghetti
⇒ data-intensive apps
⇒ free concurrent execution
⇒ free type checking
⇒ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId] (see the sketch below)
[Figure callouts: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable sub-workflow piw(GeneId)]
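The lifting step mentioned above can be written down directly. Assuming the stubbed piw below stands for the single-gene sub-workflow (a placeholder invented here, not the real implementation), the full workflow is just map(piw) applied to a list of gene ids:

-- Placeholder single-gene sub-workflow piw(GeneId) (illustrative stub).
newtype GeneId = Gid String
type Report    = String

piw :: GeneId -> Report
piw (Gid g) = "report for gene " ++ g

-- PIW := map(piw) over [GeneId]: the same sub-workflow lifted to a list of genes.
runPIW :: [GeneId] -> [Report]
runPIW = map piw

main :: IO ()
main = mapM_ putStrLn (runPIW [Gid "7", Gid "42"])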
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
⇒ optimization via functional rewriting possible,
e.g. map(f o g) = map(f) o map(g) (checked in the snippet below)
[Figure callout: use map(f o g) instead of map(f) o map(g); combination of map and zip]
• Details: technical report & PIW specification in Haskell,
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
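As a small self-contained check of this rewrite rule, the following Haskell snippet (the functions f and g are arbitrary examples chosen here for illustration) shows that the fused form computes the same result while traversing the list only once:

-- Example functions chosen for illustration only.
f :: Int -> Int
f = (+ 1)

g :: Int -> Int
g = (* 2)

-- Unfused pipeline: two list traversals with an intermediate list.
unfused :: [Int] -> [Int]
unfused = map f . map g

-- Fused pipeline after the rewrite map f . map g = map (f . g): one traversal.
fused :: [Int] -> [Int]
fused = map (f . g)

main :: IO ()
main = do
  let xs = [1 .. 10]
  print (unfused xs)              -- [3,5,7,9,11,13,15,17,19,21]
  print (fused xs == unfused xs)  -- True: the rewrite preserves the result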
Optimizing II: Streams & Pipelines
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
• Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g ⇒ mapS (f • g)
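A minimal Haskell sketch of the stream view (mapS is defined here over lazy lists purely for illustration; Reekie's thesis develops a richer stream algebra): the same fusion law applies pointwise to unbounded streams processed in pipeline fashion.

-- Streams modelled as lazy (conceptually unbounded) lists, for illustration only.
type Stream a = [a]

-- mapS applies a function to every token flowing through a stream.
mapS :: (a -> b) -> Stream a -> Stream b
mapS = map

-- An unbounded input stream and two per-token stages.
samples :: Stream Double
samples = [0.0, 0.1 ..]

scale, clamp :: Double -> Double
scale = (* 2)
clamp = min 1.0

-- Pipelined form and fused form; by the law mapS f • mapS g = mapS (f • g)
-- both denote the same stream.
pipelined, fusedS :: Stream Double
pipelined = mapS clamp (mapS scale samples)
fusedS    = mapS (clamp . scale) samples

main :: IO ()
main = do
  print (take 8 pipelined)
  print (take 8 pipelined == take 8 fusedS)  -- True on any finite prefix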
Data Transformation Actors:
Our Approach (proposal)
• Manual
– XQuery, XSLT, Perl, Python, … transformation actor
(development)
• (Semi-)automatic
– Semantic-type guided transformation generation (research)
• Also: Web Service Composition is …
– … a hot topic
– … a reincarnation of many “old” ideas
– (e.g., AI-style planning born-again; functional composition; query
composition; … )
– … a separate topic
Contrast to Existing Dataflow Systems
Here: Commercial
Workflow and distributed
computation grid created
with Kensington Discovery
Edition from InforSense.
F I N: Words to/from the Wise
FYI: Flow-based programming has been re-discovered/re-invented several times by
different communities. Here is an “IBM practitioner’s view”:
– Flow-based Programming, http://www.jpaulmorrison.com/fbp/
…
In "Flow-Based Programming" (FBP), applications are defined as networks of "black box" processes, which
exchange data across predefined connections. These black box processes can be reconnected endlessly to form
different applications without having to be changed internally. It is thus naturally component-oriented. To describe this
capability, the distinguished IBM engineer, Nate Edwards, coined the term "configurable modularity", which he calls
the basis of all true engineered systems.
When using FBP, the application developer works with flows of data, being processed asynchronously, rather than the
conventional single hierarchy of sequential, procedural code. It is thus a good fit with multiprocessor computers, and
also with modern embedded software. In many ways, an FBP application resembles more closely a real-life factory,
where items travel from station to station, undergoing various transformations. Think of a soft drink bottling factory,
where bottles are filled at one station, capped at the next and labelled at yet another one. FBP is therefore highly
visual: it is quite hard to work with an FBP application without having the picture laid out on one's desk, or up on a
screen! For an example, see Sample DrawFlow Diagram.
Strangely though, in spite of being at the leading edge of application development, it is also simple enough that trainee
programmers can pick it up, and it is a much better match with the primitives of data processing than the conventional
primitives of procedural languages. The key, of course (and perhaps the reason why it hasn't caught on more widely), is
that it involves a significant paradigm shift that changes the way you look at programming, and once you have made
this transition, you find you can never go back!
FBP seems to dovetail neatly with a concept that I call "smart data". There is a section on this in stuff about the author.
A new web page on this topic has just been uploaded - see "Smart Data" and Business Data Types - and we will be
publishing more as it develops.
…