Under the Hood of a Workflow Manager Matthew Shields

Under the Hood of a Workflow Manager
Triana
Matthew Shields,
BiodiversityWorld GRID workshop, NeSC, 30 June - 1 July
Outline
What is Workflow management?
Why should I care?
Current State of the Art
Workflow Languages
Other Projects
Triana, Architecture & Services
Extending Triana for BDWorld
Conclusion
Matthew Shields, Cardiff University
What is Workflow Management?
Concept comes from business world
Many years of research and practice
Process capture and reuse
Repeatability, provenance, audit trails & accountability
Domain expert knowledge capture
Analysis and optimization
What Can a Workflow Manager do for Me?
Scientific workflow has a different focus from business workflow
Large-scale data collection
Querying
Analysis
Visualization
Similar goals
Component & workflow reuse
Knowledge capture
Additional goals
Simplified application/experiment design
Environment/Complexity abstraction
State of the Art
Schedule workflow tasks (Grid/distributed environment)
Monitor/Control execution
Active visualization and computational steering
User interaction
Pause and restart
Data provenance
Component and sub-workflow reuse
Analysis and optimization
Workflow Languages
No current agreed standard
Most projects use DAG or Petri-Net
Data vs control flow
Dependency vs scripting language
Many XML schemas
Business workflow standards - BPEL
Not a good enough fit for scientific workflow
GGF WFM-RG
Attempting to solicit agreement on standards
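As a rough illustration of the dependency (DAG) style mentioned above, here is a minimal Python sketch. It is not the format of any project named in this talk: the task names and the `run_workflow` helper are hypothetical, and only the standard library `graphlib` module (Python 3.9 or later) is assumed. Each task runs once the tasks it depends on have produced their output, so data flows along the graph edges.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, depends_on):
    """Run every task after all of its upstream tasks have finished.

    tasks:      dict mapping task name -> function taking a list of inputs
    depends_on: dict mapping task name -> list of upstream task names
    """
    results = {}
    # static_order() yields upstream tasks first and raises
    # graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(depends_on).static_order():
        inputs = [results[parent] for parent in depends_on.get(name, [])]
        results[name] = tasks[name](inputs)
    return results


# Hypothetical three-stage pipeline: collect -> analyse -> visualise.
tasks = {
    "collect": lambda inputs: [3, 1, 2],
    "analyse": lambda inputs: sorted(inputs[0]),
    "visualise": lambda inputs: ", ".join(str(x) for x in inputs[0]),
}
depends_on = {"collect": [], "analyse": ["collect"], "visualise": ["analyse"]}

results = run_workflow(tasks, depends_on)
```

A scripting (control flow) language would instead state the order of execution explicitly and leave data movement to dedicated tasks.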
Workflow Management Projects
myGrid/Taverna - Southampton & others
XML/DAG based workflow language
Initially a Web Service (WS) choreography tool - now incorporates local tools/components
Grid integration with databases via OGSA Distributed Query Processor
myGrid Project main users - Bioinformatics
Kepler - SDSC
Based on Ptolemy - modeling, simulation & design of real-time & concurrent systems
Concurrent dataflow
Actors (components), Directors (workflow engines)
Local, Web Service & Grid Service actors
Ecology, biology, chemistry, oceanography, and the geosciences
Workflow Management Projects 2
Karajan/Commodity Grid (CoG) Kit, Argonne & Berkeley
Scripting workflow language for Grid tasks
Integration with Globus Toolkit GT3 & GT4
Pure control flow
Data flow performed by data tasks - GridFTP
And many more…
See
http://www.gridworkflow.org/snips/gridworkflow/
http://www.extreme.indiana.edu/swf-survey/
Triana
Cardiff University, PPARC funded
Java-based scientific workflow tool or Problem Solving Environment (PSE)
Originally designed for Signal Processing
Now domain independent
Bioinformatics - obviously!
Signal Processing - gravitational wave detection & radio astronomy
Design optimisation
Data mining
Medical imaging
Distributed Audio Processing
Triana Components
Local Java components
Service-oriented Components
Web services as components (WSRF coming soon)
Web service workflow
Peer-to-peer services as components
Distributed service workflow
Grid-oriented Components
Grid file and job primitives as components
Complex Grid workflow
Legacy code components via GridMonSteer
Mix and Match composition
Workflow
Inherently data flow based
control flow through “messages”
XML/DCG workflow format
Internally workflow language independent
Migration to standards based language
Simple Parent/Child relationship between tasks
Context based implied actions
Local file -> local file = file copy
Local file -> remote file = file transfer
Import/Export other workflow formats
Pegasus/EGEE read/write DAGMan format
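The context-based implied actions listed above can be sketched as a small lookup on where the two ends of a connection live. This is only an illustration in Python, not Triana's own code: the `FileRef` type, the use of "localhost" to mark a local file, and the `implied_action` function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    host: str  # "localhost" marks a local file in this sketch
    path: str

    @property
    def is_local(self):
        return self.host == "localhost"


def implied_action(source, target):
    """Pick the action implied by connecting a source file to a target file."""
    if source.is_local and target.is_local:
        return "file copy"
    if source.is_local and not target.is_local:
        return "file transfer"
    # Only the two cases above are listed on the slide; other
    # combinations are deliberately left undecided here.
    return "unspecified"


local_in = FileRef("localhost", "/data/in.dat")
local_out = FileRef("localhost", "/data/out.dat")
remote_out = FileRef("grid.example.org", "/scratch/out.dat")

copy_action = implied_action(local_in, local_out)
transfer_action = implied_action(local_in, remote_out)
```

The point of the design is that the user only draws the connection; the tool infers the operation from the context of the two endpoints.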
Triana Architecture
A graphical Grid computing environment or portal
Service-based computing (GAP Interface): deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services
Grid computing (GAT Interface): job submission, file services
Middleware shown in the diagram: Condor, Unicore, Globus RLS, SSH, GridFTP, PBS, SGE, GRMS, .NET, GridLab, LDR, WSRF, other
Triana in a Service-Oriented (SO) World
Service Discovery
Dynamic?
Decentralized?
Communication
Message Format
SOAP?
Transport Protocol
TCP?
UDP?
[Figure: a BabelFish translation service (babelfish.altavista.com, en_fr) receives "hello" over the network and returns "bonjour". Candidate middleware: P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes), Web Services (UDDI, SOAP) and Grid services.]
GAP Interface
A simple service-based API for:
Service Deployment
Service Discovery
Pipe-Based Communication
Static application interface with multiple middleware bindings
P2PS
JXTA
Web services
[Figure: the GAP Interface sits above three bindings: P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes) and Web Services (UDDI, SOAP).]
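The idea of a static application interface with interchangeable middleware bindings can be sketched as follows. This is a hedged Python illustration, not the GAP Interface itself (which belongs to the Java-based Triana): the class and method names are hypothetical, and the in-memory binding merely stands in for a real P2PS, JXTA or Web services binding. The en_fr example reuses the BabelFish translation shown earlier in the talk.

```python
from abc import ABC, abstractmethod


class ServiceBinding(ABC):
    """Hypothetical middleware binding: one subclass per middleware."""

    @abstractmethod
    def deploy(self, name, handler): ...

    @abstractmethod
    def discover(self, name): ...

    @abstractmethod
    def send(self, name, message): ...


class InMemoryBinding(ServiceBinding):
    """Stand-in binding that keeps services in a dict, not on a network."""

    def __init__(self):
        self._services = {}

    def deploy(self, name, handler):
        self._services[name] = handler

    def discover(self, name):
        return name in self._services

    def send(self, name, message):
        return self._services[name](message)


class ServiceAPI:
    """Application-facing interface; unchanged whichever binding is used."""

    def __init__(self, binding):
        self._binding = binding

    def deploy(self, name, handler):
        self._binding.deploy(name, handler)

    def call(self, name, message):
        if not self._binding.discover(name):
            raise LookupError(f"service {name!r} not found")
        return self._binding.send(name, message)


api = ServiceAPI(InMemoryBinding())
api.deploy("en_fr", lambda word: {"hello": "bonjour"}.get(word, word))
reply = api.call("en_fr", "hello")
```

Swapping the middleware then means passing a different `ServiceBinding` subclass to `ServiceAPI`; the application code that calls `deploy` and `call` does not change.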
WSPeer
High Level Interface to Web Services
Discovery
Invocation
Deployment
Hosting
Abstract from usual Web Service Discovery and Communication Mechanisms (i.e. UDDI and HTTP)
P2PS Web Service Discovery?
Uses Apache AXIS as SOAP Engine
Extends Capabilities of Apache AXIS
Stubless Invocation (including complex types)
Non Standard Transports (i.e. P2PS)
WSPeer
[Figure: an application calls deploy, publish, locate and invoke on WSPeer. The HTTP/UDDI binding launches an HTTP server and uses UDDI; the P2PS binding offers the same deploy, publish, locate and invoke operations over P2PS.]
Extending Triana for BDWorld
BDWorld proxy components talk to Web Services
Workflow Design Assistant (WfDA)
Selection and composition of BDWorld workflows from available services
Uses Meta Data Repository (MDR) & Meta Data Agent (MDA)
MDR contains mapping from proxies to resources
WfDA captures domain knowledge in constraints
Constraints used to limit the possible components at each stage of composition
Simplifies valid workflow creation
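To make the idea of constraint-limited composition concrete, here is a minimal Python sketch. It does not reproduce the WfDA, MDR or MDA: the component names, the data type labels and the single "output type must match input type" constraint are all hypothetical, chosen only to show how constraints can narrow the choice of next component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    consumes: str  # data type accepted as input
    produces: str  # data type emitted as output


def candidates(previous, available):
    """Return components whose input type matches the previous output type."""
    return [c for c in available if c.consumes == previous.produces]


# Hypothetical catalogue of services available for composition.
available = [
    Component("NameLookup", "species_name", "taxon_id"),
    Component("OccurrenceQuery", "taxon_id", "occurrence_points"),
    Component("ClimateModel", "occurrence_points", "climate_envelope"),
    Component("MapViewer", "occurrence_points", "map_image"),
]

after_lookup = [c.name for c in candidates(available[0], available)]
after_query = [c.name for c in candidates(available[1], available)]
```

At each stage the user is offered only the components that satisfy the constraints, so an invalid connection cannot be made in the first place.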
Conclusion
A workflow manager should:
Simplify scientific experimentation
Enable reuse at multiple levels
Component
Sub-workflow/Compound components
Collaboration
Abstract component and environment complexities
Think of all components as a service that performs a known task
Implied/Context based operations - file copy/move
Put the scientist back in control of the science, not the computing