Kepler: Towards a Grid-Enabled
System for Scientific Workflows
Ilkay Altintas, Chad Berkley, Efrat Jaeger,
Matthew Jones, Bertram Ludäscher*, Steve Mock
*[email protected]
San Diego Supercomputer Center (SDSC)
University of California, San Diego (UCSD)
Outline
• Motivation: Scientific Workflows (SEEK, SDM, GEON, ..)
• Current Features of the Kepler Scientific Workflows System
• Extending Kepler:
– Grid-Enabling Kepler:
• 3rd party transfer
– WF planning & optimization
• Shipping and Handling Algebra (SHA)
• Web Service Composition as Declarative Query Plans
– Semantic Types for Scientific Workflows
• Conclusions
B. Ludäscher et al. – Grid-Enabling Kepler
Kepler Team, Projects, Sponsors
• Ilkay Altintas (SDM)
• Chad Berkley (SEEK)
• Shawn Bowers (SEEK)
• Jeffrey Grethe (BIRN)
• Christopher H. Brooks (Ptolemy II)
• Zhengang Cheng (SDM)
• Efrat Jaeger (GEON)
• Matt Jones (SEEK)
• Edward A. Lee (Ptolemy II)
• Kai Lin (GEON)
• Bertram Ludäscher (BIRN, GEON, SDM, SEEK)
• Steve Mock (NMI)
• Steve Neuendorffer (Ptolemy II)
• Jing Tao (SEEK)
• Mladen Vouk (SDM)
• Yang Zhao (Ptolemy II)
• …
Ptolemy II
Example: SEEK – Science Environment for
Ecological Knowledge (large NSF ITR)
• Analysis & Modeling System
– Design and execution of
ecological models and
analysis
– End user focus
– application-/upperware
• Semantic Mediation System
– Data integration of hard-to-relate sources and
processes
– Semantic Types and
Ontologies
– upper middleware
• EcoGrid
– Access to ecology data
and tools
– middle-/underware
Architecture Overview
(cf. Cyberinfrastructure)
Ecology: GARP Analysis Pipeline for
Invasive Species Prediction
[Figure: GARP analysis pipeline. EcoGrid Query steps against Registered EcoGrid Databases supply (a) species presence & absence points and (b) environmental layers, each for the native range and for the invasion area. Layer Integration produces (c) integrated layers; Sample Data produces (d) training and test samples; Data Calculation produces (e) the GARP rule set; Map Generation produces (f) native range and invasion area prediction maps; Validation produces (g) model quality parameters; the User selects (h) prediction maps, which pass through Generate Metadata and Archive To EcoGrid. Inputs labeled +A1, +A2, +A3 also appear in the diagram.]
Source: NSF SEEK (Deana Pennington et al., UNM)
Genomics Example: Promoter
Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Scientific “Workflows”: Some Findings
• More dataflow than (business control-/) workflow
– DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, …
• Need for “programming extension”
– Iterations over lists (foreach); filtering; functional composition;
generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for data transformations (WS1 → DT → WS2)
• Need for rich user interaction & workflow steering:
– pause / revise / resume
– select & branch; e.g., web browser capability at specific steps as
part of a coordinated SWF
• Need for high-throughput transfers (“grid-enabling”, “streaming”)
• Need for persistence of intermediate products and provenance
In Flux: Workflow “Standards”
Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
Commercial & Open Source
Scientific “Workflow” (well, Dataflow) Systems
Kensington Discovery
Edition from InforSense
Triana
Taverna
SCIRun: Problem Solving Environments for Large-Scale
Scientific Computing
• SCIRun: PSE for interactive construction, debugging, and
steering of large-scale scientific computations
• New collaboration under Kepler/SDM
• Component model, based on generalized dataflow
programming
Steve Parker (cs.utah.edu)
Our Starting Point:
Ptolemy II & Dataflow
Process Networks
see!
read!
try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Why Ptolemy II?
• Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key
underlying principle in the project is the use of well-defined models of
computation that govern the interaction between components. A major
problem area being addressed is the use of heterogeneous mixtures of
models of computation.”
• Data & Process oriented: Dataflow process networks
• Natural Data Streaming Support
• User-Orientation
– “application-ware” (not middle-/under-ware)
– Workflow design & exec console (Vergil GUI)
• PRAGMATICS
– mature, actively maintained, well-documented (500+pp)
– open source system
– developed across multiple projects (NSF/ITRs SEEK and GEON, DOE
SciDAC SDM, …)
– hoping to leverage e-sister projects (e.g. Taverna, …)
Dataflow Process Networks: Putting
Computation Models (“Orchestration”) first!
[Figure: two actors with typed i/o ports, connected by a FIFO channel]
• Synchronous Dataflow Network (SDF)
– Statically schedulable single-threaded dataflow advanced push/pull
• Can execute multi-threaded, but the firing-sequence is known in advance
– Maximally well-behaved, but also limited expressiveness
• Process Network (PN)
– Multi-threaded dynamically scheduled dataflow
– More expressive than SDF (dynamic token rate prevents static
scheduling)
– Natural streaming model
• Other Execution Models (“Domains”)
– Implemented through different “Directors”
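To illustrate why SDF can be scheduled statically: when token rates are fixed, the number of firings per iteration follows from a balance equation. The following is a minimal sketch for a single edge only. The names `Edge`, `repetitions`, and `schedule` are hypothetical, and this is not Ptolemy II's scheduler.

```haskell
-- Illustrative only: the balance equation for a single SDF edge.
-- The producer emits `produced` tokens per firing and the consumer reads
-- `consumed` tokens per firing; both rates are assumed fixed and positive.
data Edge = Edge { produced :: Int, consumed :: Int }

-- Smallest positive firing counts (producer, consumer) such that
--   producerFirings * produced == consumerFirings * consumed,
-- i.e. the FIFO returns to its initial fill level after one iteration.
repetitions :: Edge -> (Int, Int)
repetitions (Edge p c) = (c `div` g, p `div` g)
  where
    g = gcd p c

-- One valid static schedule: all producer firings, then all consumer firings.
schedule :: Edge -> [String]
schedule e = replicate rp "producer" ++ replicate rc "consumer"
  where
    (rp, rc) = repetitions e
```

In GHCi, `repetitions (Edge 2 3)` evaluates to `(3,2)`: three producer firings supply the six tokens that two consumer firings read. In a PN model the rates may change at run time, so no such table can be computed in advance.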
Actor-/Dataflow Orientation vs. Object-/Control-flow Orientation
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Marrying or Divorcing Control- & Dataflow
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Overview: Scientific Workflows in Kepler
• Modeling and Workflow Design
• Web services = individual components (“actors”)
• “Minute-Made” Application Integration:
– Plugging-in and harvesting web service components is easy, fast
• Rich SWF modeling semantics (“directors”):
– Different and precise dataflow models of computation
– Clear and composable component interaction semantics
→ Web service composition and application integration tool
• Coming soon:
– Shrink-wrapped, pre-packaged “Kepler-to-Go”
– Structural and semantic typing (better design support)
– Grid-enabled web services (for big data, big computations,…)
– Different deployment models (web service, web site, applet, …)
The KEPLER GUI: Vergil
(Steve Neuendorffer, Ptolemy II)
Drag and drop utilities, director
and actor libraries.
Running a Genomics WF (Ilkay Altintas, SDM)
Support for Multiple Workflow Granularities
[Figure: workflow components at different granularities, labeled Boulders, Sand, Powder, and Plumbing. Abstraction: Sand to Rocks]
Directors and Combining Different
Component Interaction Semantics
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Application Examples: Mineral Classification
with Kepler … (Efrat Jaeger, GEON)
… inside the Classifier
Standard BrowserUI: Client-Side SVG
SWF Reengineering (Ashraf, Efrat, Kai, GEON)
DataMapper Sub-Workflow
Result launched via BrowserUI actor
(coupling with ESRI’s ArcIMS)
Distributed Workflows in KEPLER
• Web and Grid Service plug-ins
– WSDL (now) and Grid services (stay tuned …)
– ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard
– SSH, SCP, SDSC SRB, OGS?-???… coming
• WS Harvester
– Import query-defined WS operations as Kepler actors
• XSLT and XQuery Data Transformers
– to link not “designed-to-fit” web services
• WS-deployment interface (planned)
Generic Web Service Actor (Ilkay Altintas)
Given a WSDL and the name of an operation of a web service, the actor
dynamically customizes itself to implement and execute that method.
Configure: select service operation

Set Parameters and Commit
Specialized WS Actor (after instantiation)
Web Service Harvester (Ilkay Altintas, SDM)
• Imports the web services in a repository
into the actor library.
• Has the capability to search for web
services based on a keyword.
Composing 3rd-Party WSs (NMI, Steve Mock)
[Figure: output of previous web service → user interaction & transformations → input of next web service]
A Special Generic Ingestion Actor for EML Data
(SEEK, Chad Berkley)
• Ingests any data format described by EML metadata
• Converts raw data to Ptolemy format
• Data can then be operated on with other actors
Wrapping Legacy Applications
Promoter Identification Workflow (PIW)
Source: Matt Coleman (LLNL)
Execution
Semantics
Promoter
Identification
Workflow
in Ptolemy-II
[SSDBM’03]
[Figure annotations on the Ptolemy II workflow:
– designed to fit
– hand-crafted control solution; also: forces sequential execution!
– designed to fit
– hand-crafted Web-service actor
– No data transformations available
– Complex backward control-flow]
Promoter Identification Workflow in FP
genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
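The signatures above leave the services abstract. As a sketch of how the pipeline could be exercised, the version below supplies stub types and stub services. Every stub body, the constructor names other than `Gid`, and the functions `piw` and `piwAll` are assumptions made for illustration; the real GenBank, BLAST, and Transfac services return quite different data.

```haskell
-- Stub types: hypothetical stand-ins so that the pipeline type-checks and runs.
newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String
newtype TFBS           = TFBS String

-- Stub services: each only decorates its input string.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-gene-" ++ g)

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq _) = [Pid "p1", Pid "p2"]  -- pretend BLAST found two hits

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region-of-" ++ s)

-- Defined for completeness; as in the slide, its result (d5) is not printed.
transfac :: PromoterRegion -> [TFBS]
transfac (PromoterRegion r) = [TFBS ("tfbs-in-" ++ r)]

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ "\t" ++ r ++ "\n"

-- The forward-only subworkflow for one gene, as a pure function (d0 to d8).
piw :: GeneId -> String
piw d0 = concat (map gpr2str (zip d2 d4))
  where
    d1 = genBankG d0
    d2 = blast d1
    d3 = map genBankP d2
    d4 = map promoterRegion d3

-- Lifting the subworkflow over a list of gene ids.
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

With these stubs, `putStr (piw (Gid "7"))` prints two tab-separated lines, one per stubbed promoter hit.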
Cleaned up Process Network PIW
• Back to purely functional dataflow process network
(= also a data streaming model!)
• Re-introducing map(f) to Ptolemy II (was there in PT Classic)
→ no control-flow spaghetti
→ data-intensive apps
→ free concurrent execution
→ free type checking
→ automatic support to go from piw(GeneId) to
PIW := map(piw) over [GeneId]
[Figure annotations: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable subworkflow piw(GeneId)]
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
→ optimization via functional rewriting possible,
e.g. map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell:
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
[Figure annotations: map(f o g) instead of map(f) o map(g); combination of map and zip]
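The rewrite rule map(f o g) = map(f) o map(g) can be checked on a small case. In the sketch below, `f` and `g` are hypothetical per-item steps standing in for workflow actors; the names `twoPasses` and `onePass` are likewise made up for illustration.

```haskell
-- Two hypothetical per-item steps standing in for workflow actors.
g :: Int -> Int
g x = x + 1

f :: Int -> Int
f x = x * 2

-- Unfused form: two map stages with an intermediate list between them.
twoPasses :: [Int] -> [Int]
twoPasses = map f . map g

-- Fused form: a single map stage applying the composed step.
onePass :: [Int] -> [Int]
onePass = map (f . g)
```

Both `twoPasses [1, 2, 3]` and `onePass [1, 2, 3]` evaluate to `[4,6,8]`. Because `f` and `g` are pure, a planner is free to pick either form: the fused one avoids the intermediate collection, while the unfused one exposes two stages that could be pipelined.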
Optimizing II: Streams & Pipelines
• Clean functional semantics facilitates algebraic workflow (program)
transformations (Bird-Meertens); e.g. mapS f • mapS g ⇒ mapS (f • g)
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
Middle/Underware Access: Querying Databases
• Database connection actor:
– Opening a database connection and passing it to all actors
accessing this database.
• Database query actor:
– A generic actor that queries a database and provides its
result.
• DBConnection type and DBConnectionToken:
– A new IOPort type and a token to distinguish a database
connection from any general type.
Database Connection Actor
• OpenDBConnection actor:
– Input: database connection information
– Output: DBConnectionToken (reference to a DB connection
instance, via a DBConnection output port)
Database Query Actor
• Database Query actor:
– Input: SQL query string and a DB connection token
– Parameters:
• output type: XML, Record, or String
• tuple-at-a-time vs set-at-a-time
– Process:
• execute query
• produce results according to parameters
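The idea of a dedicated connection token can be sketched with an algebraic data type. This is a heavily simplified illustration, not Kepler's actual classes: the "database" is an in-memory association list, the "query" is just a table name rather than SQL, only the Record and String output types are modeled, and all names apart from `DBConnectionToken` are hypothetical.

```haskell
-- A row is a list of (column name, value) pairs.
type Row = [(String, String)]

-- Hypothetical in-memory stand-in for a database: table name -> rows.
newtype DBConnection = DBConnection [(String, [Row])]
  deriving (Eq, Show)

-- Tokens passed between actors. A connection has its own constructor, so it
-- cannot be confused with ordinary data tokens.
data Token
  = DBConnectionToken DBConnection
  | RecordToken Row
  | StringToken String
  deriving (Eq, Show)

-- Output type parameter of the query actor (XML omitted in this sketch).
data OutputType = AsRecord | AsString

-- Connection actor: connection information in, connection token out.
openDBConnection :: [(String, [Row])] -> Token
openDBConnection tables = DBConnectionToken (DBConnection tables)

-- Query actor, tuple-at-a-time: one output token per result row.
-- Assumption: the query string is only a table name here, not SQL.
queryActor :: OutputType -> Token -> String -> [Token]
queryActor out (DBConnectionToken (DBConnection tables)) table =
  map render (maybe [] id (lookup table tables))
  where
    render row = case out of
      AsRecord -> RecordToken row
      AsString -> StringToken (show row)
queryActor _ _ _ = []  -- any non-connection token yields no results
```

A set-at-a-time variant would wrap the whole result list in a single token instead of emitting one token per row.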
Querying Example
An (oversimplified) Model of the Grid
• Hosts: {h1, h2, h3, …}
• Data@Hosts: d1@{hi}, d2@{hj}, …
• Functions@Hosts: f1@{hi}, f2@{hj}, …
• Given: data/workflow: X → f → Y → g → Z
• … as a functional plan: […; Y := f(X); Z := g(Y); …]
• … as a logic plan: […; f(X,Y) ∧ g(Y,Z); …]
• Find Host Assignment: di → hi , fj → hj for all di , fj
… s.t. […; d3@h3 := f@h2(d1@h1), …] is a valid plan
Shipping and Handling Algebra (SHA)
Logical view: plan Y@C = F@A of X@B
Physical view: SHA Plans
1. [ X@B to A, Y@A := F@A(X@A), Y@A to C ]
2. [ F@A => B, Y@B := F@B(X@B), Y@B to C ]
3. [ X@B to C, F@A => C, Y@C := F@C(X@C) ]
[Figure: the logical view and the three physical plans (1)-(3), each drawn over the nodes f@A, x@b, y@c]
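The three physical plans for Y@C = F@A of X@B can be written down as data and compared. The sketch below does so under an assumed cost model in which only the volume shipped between hosts counts. That cost model, the `Sizes` record, and all function names are hypothetical and not part of SHA as presented.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Host = String

-- One step of a physical plan.
data Step
  = ShipData Host Host    -- "X@B to A": move the input data
  | ShipFunc Host Host    -- "F@A => B": move the function itself
  | ShipResult Host Host  -- "Y@A to C": move the result
  | Apply Host            -- "Y@A := F@A(X@A)": run F where F and X are co-located
  deriving (Eq, Show)

-- The three plans from the slide, for F at host a, X at host b, Y wanted at host c.
plans :: Host -> Host -> Host -> [[Step]]
plans a b c =
  [ [ShipData b a, Apply a, ShipResult a c]
  , [ShipFunc a b, Apply b, ShipResult b c]
  , [ShipData b c, ShipFunc a c, Apply c]
  ]

-- Assumed sizes of the three things that can be shipped.
data Sizes = Sizes { inputSize, funcSize, resultSize :: Int }

-- Assumed cost model: shipping costs the size of what is shipped; applying is free.
stepCost :: Sizes -> Step -> Int
stepCost s (ShipData _ _)   = inputSize s
stepCost s (ShipFunc _ _)   = funcSize s
stepCost s (ShipResult _ _) = resultSize s
stepCost _ (Apply _)        = 0

planCost :: Sizes -> [Step] -> Int
planCost s = sum . map (stepCost s)

-- Pick the cheapest of a non-empty list of plans.
cheapest :: Sizes -> [[Step]] -> [Step]
cheapest s = minimumBy (comparing (planCost s))
```

For a large input, a small function, and a small result, e.g. `Sizes 1000 10 5`, the plan costs are 1005, 15, and 1010, so plan 2 (ship the function to the data) wins under this model.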
Grid-Enabling PTII: Handles
Logical token transfer (3) requires get_handle (1, 2);
then exec_handle (4, 5, 6, 7) for completion.
1. A → GA: get_handle
2. GA → A: return &X
3. A → B: send &X
4. B → GB: request &X
5. GB → GA: request &X
6. GA → GB: send *X
7. GB → B: send done(&X)
[Figure: actors A and B in Kepler space; grid nodes GA and GB in Grid space; arrows numbered 1-7]
Example:
&X = “GA.17”
*X = <some_huge_file>
Candidate Formalisms:
• GridFTP
• SSH, SCP
• SDSC SRB
• OGS?-??? … WSRF?
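The seven-message exchange can be replayed as a plain message trace to show which hop actually carries the bulk data. This is only a sketch of the protocol as listed on the slide. The types `Handle`, `Payload`, and `Msg` and the function names are hypothetical and do not correspond to any GridFTP or SRB API.

```haskell
-- A handle &X is a small reference, e.g. "GA.17"; *X is the data it points to.
newtype Handle = Handle String
  deriving (Eq, Show)

data Payload
  = GetHandle        -- ask the grid node for a handle
  | Return Handle    -- hand back &X
  | Send Handle      -- pass &X along in Kepler space
  | Request Handle   -- ask for the data behind &X
  | SendData Handle  -- ship *X itself (the only bulk message)
  | Done Handle      -- confirm that *X has arrived
  deriving (Eq, Show)

data Msg = Msg { sender :: String, receiver :: String, payload :: Payload }
  deriving (Eq, Show)

-- The seven steps: A and B live in Kepler space, GA and GB in Grid space.
transferTrace :: Handle -> [Msg]
transferTrace h =
  [ Msg "A"  "GA" GetHandle
  , Msg "GA" "A"  (Return h)
  , Msg "A"  "B"  (Send h)
  , Msg "B"  "GB" (Request h)
  , Msg "GB" "GA" (Request h)
  , Msg "GA" "GB" (SendData h)
  , Msg "GB" "B"  (Done h)
  ]

-- The (sender, receiver) pairs of all messages that carry bulk data.
bulkHops :: [Msg] -> [(String, String)]
bulkHops msgs = [ (sender m, receiver m) | m <- msgs, isBulk (payload m) ]
  where
    isBulk (SendData _) = True
    isBulk _            = False
```

`bulkHops (transferTrace (Handle "GA.17"))` evaluates to `[("GA","GB")]`: the large file moves only between the grid nodes, while the actors A and B exchange nothing but the handle.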
Extensions: Semantic Type
• Take concepts and relationships from an ontology to “semantically
type” the data-in/out ports
• Application: e.g., design support:
– smart/semi-automatic wiring, generation of “massaging actors”
[Figure: actor m1 (normalize) with ports p3 and p4. p3 takes Abundance Count Measurements for Life Stages; p4 returns Mortality Rate Derived Measurements for Life Stages]
Semantic Types
• The semantic type signature
– Type expressions over the (OWL) ontology
[Figure: actor m1 (normalize) with ports p3 and p4]
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
->
DerivedObservation & itemMeasured.MortalityRate &
hasContext.appliesTo.LifeStageProperty
Extended Type System (here: OWL Semantic Types)
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
hasContext.appliesTo.LifeStageProperty
-> DerivedObservation & itemMeasured.MortalityRate
& hasContext.appliesTo.LifeStageProperty
Substructure association:
XML raw-data =(X)Query=> object model =link => OWL ontology
Semantic Types for Scientific Workflows
Deriving Data Transformations from
Semantic Service Registration
[Bowers-Ludaescher,
DILS’04]
Structural and Semantic Mappings
[Bowers-Ludaescher,
DILS’04]
Workflow Planning as Planning Queries with
Limited Access Patterns
• User query Q: answer(ISBN, Author, Title) ←
book(ISBN, Author, Title),
catalog(ISBN, Author),
not library(ISBN).
• Limited (web service) Access Patterns (API)
– Src1.books: in: ISBN; out: Author, Title
– Src1.books: in: Author; out: ISBN, Title
– Src2.catalog: in: {}; out: ISBN, Author
– Src3.library: in: {}; out: ISBN
• Q is not executable, but feasible (equivalent to executable Q’:
catalog ; book ; not library)
→ ICDE (poster), EDBT, PODS (papers), [Nash-Ludaescher, 2004]
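The step from Q to Q’ can be sketched as a greedy reordering: repeatedly call the first literal whose input variables are already bound. This checks orderability only, a simpler condition than feasibility in the full sense of the cited papers. The sketch also fixes one access pattern per literal (book with in: ISBN), and all type and function names are hypothetical.

```haskell
type Var = String

-- A body literal together with the one access pattern chosen for it.
data Literal = Literal
  { name    :: String
  , negated :: Bool
  , inputs  :: [Var]  -- must be bound before the call
  , outputs :: [Var]  -- returned by the call
  } deriving (Eq, Show)

-- A positive literal is callable once its inputs are bound. A negated literal
-- only filters, so all of its variables must already be bound.
callable :: [Var] -> Literal -> Bool
callable bound l
  | negated l = all (`elem` bound) (inputs l ++ outputs l)
  | otherwise = all (`elem` bound) (inputs l)

-- Variables newly bound by calling a literal.
binds :: Literal -> [Var]
binds l = if negated l then [] else outputs l

-- Is the body executable left to right as written?
executable :: [Var] -> [Literal] -> Bool
executable _ [] = True
executable bound (l : ls) = callable bound l && executable (bound ++ binds l) ls

-- Greedy reordering: pick the first callable literal, bind its outputs, repeat.
reorder :: [Var] -> [Literal] -> Maybe [Literal]
reorder _ [] = Just []
reorder bound ls =
  case break (callable bound) ls of
    (_, [])             -> Nothing
    (before, l : after) -> (l :) <$> reorder (bound ++ binds l) (before ++ after)

-- The body of Q from the slide.
query :: [Literal]
query =
  [ Literal "book"    False ["ISBN"] ["Author", "Title"]
  , Literal "catalog" False []       ["ISBN", "Author"]
  , Literal "library" True  []       ["ISBN"]
  ]
```

Here `executable [] query` is `False` because book needs an ISBN first, while `fmap (map name) (reorder [] query)` yields `Just ["catalog","book","library"]`, matching Q’.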
Conclusions
• Summary
– Kepler Scientific Workflow System
– Open source, cross-project collaboration
(SEEK, GEON, SDM,…)
– Actor & Dataflow-oriented Modeling, Design,
Execution (Ptolemy II heritage)
– Prototyping, static analysis, web services, data
transformations
• Next Steps
– First official release (“Kepler-to-Go”) April/May ’04
• e-Science meeting NeSC, Edinburgh
– Grid-enabling
• 3rd party transfer, planning, optimization, …
– Semantic Typing [DILS’04]
– Provenance, Fault tolerance, …
– Link-Up w/ e.g. Taverna, Pegasus, …
– Become a member or co-developer (You!)