Towards Scientific Workflows Based on Dataflow Process Networks

Download Report

Transcript Towards Scientific Workflows Based on Dataflow Process Networks

Towards Scientific Workflows Based on
Dataflow Process Networks
(or from Ptolemy to Kepler)
Bertram Ludäscher
San Diego Supercomputer Center
[email protected]
SEEK meeting, UCSB, 10/22-26/2003
A Note on the Style of the following Slides
Due to lack of time, most of the following slides will be “by reference”
only ;-)
– …Each speaker was given four minutes to present his paper, as there were so
many scheduled -- 198 from 64 different countries. To help expedite the
proceedings, all reports had to be distributed and studied beforehand, while the
lecturer would speak only in numerals, calling attention in this fashion to the
salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation
immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and
therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only
22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that
matter; Hazelton countered this objection with the crushing retort that, either
way, 22. I turned to the number key in his paper and discovered that 22 meant the
end of the world… [The Futurological Congress, Stanislaw Lem, translated from
the Polish by Michael Kandel, Futura 1977]
SEEK meeting, UCSB, 10/22-26/2003
• NSF, NIH, DOE
Acknowledgements
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/
SEEK meeting, UCSB, 10/22-26/2003
Example: Promoter Identification Workflow (PIW)
(simplified)
From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)
SEEK meeting, UCSB, 10/22-26/2003
Conceptual Workflow
(Promoter Identification Workflow PIW)
Compute clusters
(min. distance)
For each
promoter
Select gene-set
(cluster-level)
For each gene
Retrieve matching
cDNA
Retrieve genomic
Sequence
Extract promoter
Region(begin, end)
SEEK meeting, UCSB, 10/22-26/2003
Retrieve
Transcription factors
Compute
Subsequence labels
Arrange
Transcription factors
With all
Promoter Models
Align promoters
Create consensus
sequence
Compute Joint
Promoter Model
SEEK meeting, UCSB, 10/22-26/2003
Details of the Functional MRI (Magnetic Resonance Imaging)
Analysis Workflow (Jeffrey Grethe)
1.
2.
3.
4.
Collect data (K-Space images in Fourier space) from MR scanner while subject performs a specific task
Reconstruct K-Space data to image data (this requires scanner parameters for the reconstruction)
Now have anatomical and functional data
Pre-process the functional data
1. Correct for difference in slice acquisition (each slice in a volume is collected at a slightly different time). Try to
correct for these differences so that all slices seem to be acquired at same time
2. Not correct for subject motion (head movement in scanner) by realigning all functional images
5. Register the functional images with the anatomical image  all images are now in the same space (all
aligned with one another)
6. Move all subjects into template space through non-linear spatial normalization. There exist atlas
templates (made from many subjects) that one can normalize to so that all subjects are in the same space,
allowing for direct comparison across subjects.
7. DATA VERIFICATION - check if all these procedures worked. If not, go back and try again (possibly
tweaking some parameters for the routines or by re-doing some of it by hand).
8. Move onto statistics. First we do single subject statistics: in addition to the images, information about
the experimental paradigm is required. These can be overlayed onto an anatomical to create visual
displays of brain activation during a particular task.
9. Can also combine statistical data from multiple subjects and do a group/population analysis and display
these results.
 Interactive nature of these workflows is critical (data verification) can these steps be automated or semi-automated?
 need metadata from collection equipment and experimental design !
SEEK meeting, UCSB, 10/22-26/2003
GARP Invasive Species Pipeline
Test sample (d)
Registered
Ecogrid
Database
EcoGrid
Query
Species
presence &
absence points
(native range)
(a)
Registered
Ecogrid
Database
+A1
+A2
+A3
Sample
Data
Training
sample
(d)
Data
Calculation
GARP
rule set
(e)
Map
Generation
Native
range
prediction
map (f)
Model quality
parameter (g)
Integrated
layers
(native range) (c)
Environmental
layers (native
range) (b)
Invasion
area prediction
map (f)
Map
Generation
Layer
Integration
Registered
Ecogrid
Database
Environmental
layers (invasion
area) (b)
Layer
Integration
User
Model quality
parameter (g)
Integrated layers
(invasion area) (c)
EcoGrid
Query
Registered
Ecogrid
Database
Validation
Archive
To Ecogrid
Species presence
&absence points
(invasion area) (a)
SEEK meeting, UCSB, 10/22-26/2003
Validation
From: NSF SEEK (Deana Pennington et al)
Selected
prediction
maps (h)
Generate
Metadata
Scientific Workflows: Some Findings
• More dataflow than workflow
– but some branching looping, merging, …
– not: documents/objects undergoing modifications
– instead: dataset-out = analysis(dataset-in)
• Need for “collection/functional-style programming” (FP)
– Iterations over lists (foreach); filtering; functional composition; generic &
higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (compute/transform alternations)
• Need for rich user interaction / steering:
– pause & resume
– select & branch; e.g., web browser capability at specific steps as part of a
coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products
 data provenance (“virtual data”; cf. several ITR and e-Science projects)
SEEK meeting, UCSB, 10/22-26/2003
(Analytical) Pipelines …. (Scientific) Workflows
• Spectrum of languages & formalisms:
– Pipelines (a la Unix)
– Dataflow languages:
• Synchronous dataflow networks (SDF)
• Kahn’s process networks (PN)
– “Web page-flow”:
• Active XML, WebML, …
• Hesitating-weak-alternating-tree-automata-ML
• …
– (Business) Workflows:
• WfMC’s XPDL, WSFL, BPELWS, …
SEEK meeting, UCSB, 10/22-26/2003
Business Workflows
• Business Workflows
–
–
–
–
–
show their office automation ancestry
documents and “work-tasks” are passed
no data streaming, data-intensive pipelines
lots of standards to choose from: WfMC, BMPL, BPEL4WS,.. XPDL,…
but often no clear semantics for constructs as simple as this:
Source: Expressiveness and Suitability of Languages for Control Flow
Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
SEEK meeting, UCSB, 10/22-26/2003
The ZOO of Workflow Standards and Systems
Source: W.M.P. van der Aalst et al.
http://tmitwww.tm.tue.nl/research/patterns/
SEEK meeting, UCSB, 10/22-26/2003
More on Scientific WF vs Business WF
• Business WF
– Tasks, documents, etc. undergo modifications (e.g., flight reservation from
reserved to ticketed), but modified WF objects still identifiable throughout
– Complex control flow, task-oriented
– Transactions w/o rollback (ticket: reserved  purchased)
– …
• SWF
– data-in and data-out of an analysis step are not the same object!
– dataflow, data-oriented (cf. AVS/Express, Khoros, …)
– re-run automatically (a la distrib. comp., e.g. Condor) or userdriven/interactively (based on failure type)
– data integration & semantic mediation as part of SWF framework!
– …
SEEK meeting, UCSB, 10/22-26/2003
SWF vs Distributed Computing
• Distributed Computing (e.g. a la Condor-(G) )
– Batch oriented
– Transparent distributed computing (“remote Unix/Java”;
standard/Java universes in Condor)
– HPC resource allocation & scheduling
• SWF
– Often highly interactive for decision making/steering of the WF
and visualization (data analysis)
– Transparent data access (Grid) and integration (database
mediation & semantic extensions)
– Desktop metaphor (“microworkflow”!?); often (but not always!)
light-weight web service invocation
SEEK meeting, UCSB, 10/22-26/2003
Ptolemy-II
• Recommendations following:
– must read
– must see (now: snippets following; watch for new ways to
compress slides ;-)
– must try
• Bottom line:
– a sophisticated system to do “simple” things (dataflows) as
well as highly complex things (hybrid models)
(compare to your favorite standard/approach/system)
SEEK meeting, UCSB, 10/22-26/2003
Dataflow Process
Networks and Ptolemy-II
see!
read!
try!
SEEK meeting, UCSB, 10/22-26/2003
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
SEEK meeting, UCSB, 10/22-26/2003
SEEK meeting, UCSB, 10/22-26/2003
SEEK meeting, UCSB, 10/22-26/2003
In our (SEEK) terminology:
Think of it as “Workflow
Execution Model++”
SEEK meeting, UCSB, 10/22-26/2003
SEEK meeting, UCSB, 10/22-26/2003
Our SEEK/
SciDAC/Kepler
extensions here!
SEEK meeting, UCSB, 10/22-26/2003
Some Glimpses on the PT-II Execution
Models (“Domains”)
SEEK meeting, UCSB, 10/22-26/2003
Kahn Process Networks (PN)
• Concurrent processes communication through one-way FIFO channels
with unbounded capacity
• A functional process F maps a set of input sequences into a set of
output sequences (sounds like XSM!)
• increasing chain of sets of sequences  outputs may not increase!
• Consider increasing chains (wrt. prefix ordering “<“) of streams
• PN is continuous if lub(Xs) exists for all increasing chains Xs and
– F(lub(Xs)) < lub(F(Xs))
• Continuous implies montonic:
– if Xs < Ys then F(Xs)<F(Ys)
SEEK meeting, UCSB, 10/22-26/2003
Process Networks (cont’d)
• PN in essence: simultaneous relations between sequences
• Network of functional processes can be described by a
mapping
X = F(X,I)
– X denotes all the sequences in the network (inputs I+outputs)
• X that forms a solution is a fixed point
• Continuity implies exactly one “minimal” fixed point
– minimal in the sense of pre-fix ordering for any inputs I
– execution of the network: given I = ^ and find the minimal fixed
point (works because of the monotonic property)
SEEK meeting, UCSB, 10/22-26/2003
Synchronous
Data Flow
Networks (SDF)
• Special case of PN
• Ptolemy-II SDF overview
– SDF supports efficient execution of Dataflow graphs that lack control
structures
– with control structures  Process Networks(PN)
– requires that the rates on the ports of all actors be known before hand
– do not change during execution
– in systems with feedback, delays, which are represented by initial tokens on
relations must be explicitly noted  SDF uses this rate and delay information to
determine
the execution sequence of the actors before execution begins.
SEEK meeting, UCSB,
10/22-26/2003
Extended Kahn-MacQueen Process Networks
• A process is considered active from its creation until its termination
• An active process can block when trying to read from a channel (readblocked), when trying to write to a channel (write-blocked) or when
waiting for a queued topology change request to be processed
(mutation-blocked)
• A deadlock is when all the active processes are blocked
– real deadlock: all the processes are blocked on a read
– artificial deadlock: all processes are blocked, at least one process is blocked on
a write  increase the capacity of receiver with the smallest capacity amongst
all the receivers on which a process is blocked on a write. This breaks the
deadlock.
– If the increase results in a capacity that exceeds the value of
maximumQueueCapacity, then instead of breaking the deadlock, an exception is
thrown. This can be used to detect erroneous models that require unbounded
queues.
SEEK meeting, UCSB, 10/22-26/2003
Towards
SciMod/SDMSWE/Kepler/…
(my vote is for ‘Kepler’…)
SEEK meeting, UCSB, 10/22-26/2003
Scientific Workflows = Dataflow Process Networks + X
• Kepler = current Ptolemy-II plus X, where X = …
–
–
–
–
–
Extended type system (structural & semantic extensions)
Collection programming extensions (declarative/FP) and
Rich user interactions/workflow steering
Rich data transformations (compute/transform alternations)
(Eco-)Grid extensions:
• Actors as web/grid services
• 3rd party data transfer, high-throughput data streaming
• Data and service repositories, discovery
– Data provenance
• (semi-)automatic meta-data creation
– What else???
•
… minus upcoming Ptolemy-II extensions!
– The slower we are, the less we have to do ourselves ;-)
SEEK meeting, UCSB, 10/22-26/2003
Extended Type System (here: OWL Semantic Types)
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
 DerivedObservation & itemMeasured.MortalityRate
& hasContext.appliesTo.LifeStageProperty
Substructure association:
XML raw-data =(X)Query=> object model =link => OWL ontology
SEEK meeting, UCSB, 10/22-26/2003
Actor Repositories (here: a commercial tool)
See why we
said userdefinable (or
auto-generated)
actor libraries?
SEEK meeting, UCSB, 10/22-26/2003
Collection Programming
(some lessons from SciDAC/SSDBM demo)
SEEK meeting, UCSB, 10/22-26/2003
Promoter
Identification
Workflow
in control
Ptolemy-II
hand-crafted
(SSDBM’03)
solution; also:
forces
designed to fit
designed to fit
sequential execution!
hand-crafted
Web-service actor
No data transformations
available
SEEK meeting, UCSB, 10/22-26/2003
Complex backward
control-flow
Promoter Identification Workflow in FP
genBankG :: GeneId -> GeneSeq
genBankP :: PromoterId -> PromoterSeq
blast
:: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac :: PromoterRegion -> [TFBS]
gpr2str :: (PromoterId, PromoterRegion) -> String
d0
d1
d2
d3
d4
d5
d6
d7
d8
d9
=
=
=
=
=
=
=
=
=
=
Gid "7"
-- start with some gene-id
genBankG d0
-- get its gene sequence from GenBank
blast d1
-- BLAST to get a list of potential promoters
map genBankP d2
-- get list of promoter sequences
map promoterRegion d3 -- compute list of promoter regions and ...
map transfac d4
-- ... get transcription factor binding sites
zip d2 d4
-- create list of pairs promoter-id/region
map gpr2str d6
-- pretty print into a list of strings
concat d7
-- concat into a single "file"
putStr d8
-- output that file
SEEK meeting, UCSB, 10/22-26/2003
Simplified Process Network PIW
• Back to purely functional
dataflow process network
map(f)-style
iterators
(= a data streaming model!)
Powerful type
checking
• Re-introducing map(f) to
Ptolemy-II (was there in PT Classic)
Generic, declarative
“programming”
constructs
Generic data
transformation actors





no control-flow spaghetti
data-intensive apps
free concurrent execution
free type checking
automatic support to go from
piw(GeneId) to PIW :=map(piw)
over [GeneId]
Forward-only, abstractable subworkflow piw(GeneId)
SEEK meeting, UCSB, 10/22-26/2003
Optimization by Declarative Rewriting I
• PIW as a declarative,
referentially transparent
functional process
map(f
o
 optimization via functional
rewriting possible
g)
instead of
map(f) o map(g)
e.g. map(f o g) = map(f) o map(g)
• Details:
Combination of
map and zip
– Technical report &PIW specification
in Haskell
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
SEEK meeting, UCSB, 10/22-26/2003
Optimization by Declarative Rewriting II
• Rewritings require that data transformation semantics is known
• e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional
Programming, Hideki John Reekie, University of Technology, Sydney
SEEK meeting, UCSB, 10/22-26/2003
Data Transformation Actors:
Chaining together web services is easy …
… (NOT)
SEEK meeting, UCSB, 10/22-26/2003
SEEK meeting, UCSB, 10/22-26/2003
MAP: Data Massaging a la Blue-Titan/Perl
SEEK meeting, UCSB, 10/22-26/2003
Data Transformation Actors:
Our Approach (proposal)
• Manual
– XQuery, XSLT, Perl, Python, … transformation actor
(development)
• (Semi-)automatic
– Semantic-type guided transformation generation (research)
• Also: Web Service Composition is …
– … a hot topic
– … a reincarnation of many “old” ideas
– (e.g., AI-style planning born-again; functional composition; query composition;
…)
– … a separate topic
SEEK meeting, UCSB, 10/22-26/2003
User Interaction
• Brower Actor demo … (Ilkay)
SEEK meeting, UCSB, 10/22-26/2003
FIN
(addtl. material follows)
FYI: Flow-based programming has been re-discovered/re-invented several times:
– Flow-based Programming, http://www.jpaulmorrison.com/fbp/index.shtm
SEEK meeting, UCSB, 10/22-26/2003