Transcript Kepler

 Scientific workflow management system based on Ptolemy II
 Allows scientists to visually design and execute scientific workflows
 Actor-oriented model with directors acting as the main workflow engine
 Enables different models of computation


Scientific workflow: models the flow of data from one step to
another in a series of computations to
achieve some scientific goal

Ptolemy II: software system for modeling, simulation, and
design of concurrent, real-time, embedded
systems, developed at UC Berkeley

Objective:
“The focus is on assembly of concurrent
components. The key underlying principle in the
project is the use of well-defined models of
computation that govern the interaction between
components. A major problem area being
addressed is the use of heterogeneous mixtures
of models of computation.”
 Directors
 Actors
 Ports
 Relations
[Figure: actors with ports and attributes; links join the ports to a relation, which forms the connection between actors]
 Directors control execution of a workflow (scheduling, dispatching threads, etc.)
 Actors are executable components of a workflow

 Directors govern execution of Actors
Actor-/Dataflow orientation vs. Object-/Control-flow orientation
 Every Kepler workflow needs a director
 Execute networks of components under multiple execution models
› Synchronous vs. parallel vs. dataflow vs. time-based vs. event-based vs. all combined
 The computation model dictates the semantics of component interaction
 Make use of separation of concerns
› e.g., component execution, workflow execution, and provenance tracking
 Managers act like a “common execution environment”
› governing different concerns related to execution of the network and services

CT – continuous time modeling
DE – discrete event systems
FSM – finite state machines
PN – process networks
SDF – synchronous dataflow
DDF – dynamic dataflow
SR – synchronous/reactive systems
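
To illustrate how the computation model changes the way the same component is driven, here is a minimal sketch. The two functions are toy stand-ins written for this example; they are not Ptolemy II's SDF or DE directors, and the names `double`, `sdf_like_director`, and `de_like_director` are assumptions.

```python
import heapq


def double(x):
    """Hypothetical actor: doubles its input token."""
    return 2 * x


def sdf_like_director(actor, tokens):
    """Dataflow-style: fire once per token, in the order tokens arrive."""
    return [actor(token) for token in tokens]


def de_like_director(actor, events):
    """Discrete-event-style: events are (time, value) pairs, fired in time order."""
    queue = list(events)
    heapq.heapify(queue)  # earliest timestamp is popped first
    results = []
    while queue:
        time, value = heapq.heappop(queue)
        results.append((time, actor(value)))
    return results


dataflow_out = sdf_like_director(double, [3, 1, 2])
event_out = de_like_director(double, [(2.0, 3), (0.5, 1), (1.0, 2)])
```

The actor is unchanged in both runs; only the director decides when it fires and in what order its inputs are seen.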
 Reusable components that execute a
variety of functions
 Communicate with other actors in a
workflow through ports
 Composite actor – aggregation of actors
 Composite actor may have a local
director
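
A composite actor can be pictured as a group of actors that looks like a single actor from the outside. The sketch below assumes a very simple local director that fires the inner actors in sequence; `Scale`, `Offset`, `CompositeActor`, and `sequential_director` are hypothetical names, not Kepler classes.

```python
class Scale:
    """Hypothetical actor: multiplies a token by a factor."""
    def __init__(self, factor):
        self.factor = factor

    def fire(self, token):
        return token * self.factor


class Offset:
    """Hypothetical actor: adds a constant to a token."""
    def __init__(self, amount):
        self.amount = amount

    def fire(self, token):
        return token + self.amount


def sequential_director(actors, token):
    """Assumed local director: pass the token through each inner actor in turn."""
    for actor in actors:
        token = actor.fire(token)
    return token


class CompositeActor:
    """Aggregates inner actors and exposes the same fire() interface."""
    def __init__(self, actors, local_director=sequential_director):
        self.actors = list(actors)
        self.local_director = local_director

    def fire(self, token):
        return self.local_director(self.actors, token)


inner = CompositeActor([Scale(2), Offset(3)])
outer = CompositeActor([inner, Scale(10)])  # composites nest like any actor
```

Because a composite exposes the same interface as an atomic actor, it can be nested inside another composite, which is what makes hierarchical workflows possible.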

 Top-level workflows can be a conceptual
representation of a science process
 Drilling down reveals increasing levels of
detail
 Composing models using hierarchy
promotes development of re-usable
components


Each actor implements several methods
› initialize() – initializes state variables
› prefire() – indicates if actor wants to fire
› fire() – main point of execution
 Read inputs, produce outputs, read
parameter values
› postfire() – update persistent state, see if
execution complete
› wrapup() – clean up at the end of execution

Each director calls these methods
according to its model
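
The method sequence above can be sketched as follows. This is a toy illustration, not the Ptolemy II API: `Ramp` is a hypothetical source actor and `run` is an assumed, simplified director loop for a single actor.

```python
class Ramp:
    """Hypothetical source actor: emits init, init + step, ... for `count` firings."""

    def __init__(self, init=0, step=1, count=3):
        self.init, self.step, self.count = init, step, count

    def initialize(self):
        # Initialize state variables before the run starts.
        self.value = self.init
        self.fired = 0
        self.outputs = []
        self.wrapped_up = False

    def prefire(self):
        # Indicate whether the actor wants to fire.
        return self.fired < self.count

    def fire(self):
        # Main point of execution: produce an output from the current state.
        self.outputs.append(self.value)

    def postfire(self):
        # Update persistent state; False signals that execution is complete.
        self.value += self.step
        self.fired += 1
        return self.fired < self.count

    def wrapup(self):
        # Clean up at the end of execution.
        self.wrapped_up = True


def run(actor, max_iterations=100):
    """Assumed director loop: calls the lifecycle methods in order."""
    actor.initialize()
    try:
        for _ in range(max_iterations):
            if not actor.prefire():
                break
            actor.fire()
            if not actor.postfire():
                break
    finally:
        actor.wrapup()
    return actor.outputs


result = run(Ramp(init=0, step=2, count=3))
```

A different director would keep the same five methods but change when and how often it calls them.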

Copy actor – copy files from one resource to another
during execution
› Stage actor – local to remote host
› Fetch actor – remote to local host

Job execution actor – submit and run a remote job
Monitoring actor – notify user of failures
Service discovery actor – import web services from a
service repository or web site
RExpression actors
MatlabExpression actors
Web services actors – given the WSDL and the name of an
operation of a web service, the actor dynamically customizes itself to
implement and execute that method
Database connection and query actors

Ports used to produce and consume
data and communicate with other
actors in workflow
› Input port – data consumed by actor
› Output port – data produced by actor
› Input/output port – data both produced and
consumed

Relations direct the same input or output to more than
one port

Example: direct output to
1. display actor to show intermediate results,
and
2. operational actor for further processing
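
The branching in the example above can be sketched with three small classes. All names here (`InputPort`, `OutputPort`, `Relation`, and the port labels) are hypothetical and only mimic the idea; they are not Kepler's port or relation classes.

```python
class InputPort:
    """Consumes tokens; here it simply records what arrives."""
    def __init__(self, name):
        self.name = name
        self.received = []

    def put(self, token):
        self.received.append(token)


class Relation:
    """Branches one data flow to every input port linked to it."""
    def __init__(self):
        self.destinations = []

    def link(self, port):
        self.destinations.append(port)

    def broadcast(self, token):
        for port in self.destinations:
            port.put(token)


class OutputPort:
    """Produces tokens and hands them to its relation, if one is attached."""
    def __init__(self, name):
        self.name = name
        self.relation = None

    def send(self, token):
        if self.relation is not None:
            self.relation.broadcast(token)


# One output feeds both a display actor and a processing actor.
source_out = OutputPort("source.output")
display_in = InputPort("display.input")
process_in = InputPort("process.input")

relation = Relation()
source_out.relation = relation
relation.link(display_in)
relation.link(process_in)

for token in [1, 2, 3]:
    source_out.send(token)
```

Both destinations see the same tokens in the same order, so intermediate results can be displayed without disturbing the downstream computation.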

Execution Options:
› inside GUI
› at command-line
› distributed computing

 Kepler components can be shared by
exporting a workflow or component into a
Kepler Archive (KAR) file (an extension of the JAR file
format)
 The Component Repository is a centralized
system for sharing Kepler workflows
 Users can search the repository for
components from within Vergil
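
Since a KAR file extends the JAR format, and JAR files are ZIP archives, generic ZIP tooling should be able to list what an archive contains. The sketch below builds a small JAR-style archive in memory and lists it. The entry name `workflow.xml` is a placeholder, and the real KAR manifest entries are not reproduced here.

```python
import io
import zipfile


def build_archive(entries):
    """Build a JAR-style ZIP archive in memory from {name: text} entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # JAR files conventionally carry a manifest at this path.
        archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        for name, text in entries.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def list_archive(data):
    """Return the entry names stored in a ZIP-based archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


# "workflow.xml" is a hypothetical entry name used only for illustration.
names = list_archive(build_archive({"workflow.xml": "<entity/>"}))
```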


Kepler provides direct access to
scientific data archived in many of
the commonly used data archives.
› Ex. access to data stored in Knowledge
Network for Biocomplexity (KNB) Metacat
server and described using Ecological
Metadata Language.

Additional supported data sources
› DiGIR protocol, OPeNDAP protocol, GridFTP,
JDBC, SRB, and others.

Kepler ships by default with:
› Globus actors
› GridFTP actors

No BES implementation*
 Job submission to OpenPBS, gLite
 Kepler actors capable of using UNICORE by
Euforia (Poznań SC)
 TeraGrid gateways exist that use Kepler


Actor Data Polymorphism:
› Add numbers (int, float, double, complex)
› Add strings (concatenation)
› Add complex types (arrays, records,
matrices)
› Add user-defined types
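
Data polymorphism means one Add actor handles all of these token types. A rough sketch of the idea follows. It assumes element-wise addition for arrays, and the names `Vec2`, `add_tokens`, and `add_actor` are invented for this example; Kepler's own type system is not modeled.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Vec2:
    """Hypothetical user-defined token type that defines its own addition."""
    x: float
    y: float

    def __add__(self, other):
        return Vec2(self.x + other.x, self.y + other.y)


def add_tokens(a, b):
    """Add two tokens; arrays (lists) are added element by element (assumption)."""
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            raise ValueError("arrays must have the same length")
        return [add_tokens(x, y) for x, y in zip(a, b)]
    # Numbers, strings, and user-defined types rely on their own '+' behavior.
    return a + b


def add_actor(*inputs):
    """One actor, many types: fold all inputs with the same add operation."""
    return reduce(add_tokens, inputs)


numbers = add_actor(1, 2.5, 3)
complex_sum = add_actor(1, 2 + 1j)
text = add_actor("ab", "cd")
arrays = add_actor([1, 2], [10, 20])
custom = add_actor(Vec2(1, 2), Vec2(3, 4))
```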









Distributed execution of workflow parts (peer to peer)
Efficient data transfer
Provenance tracking of data and processes
Tracking workflow evolution
Streaming data analysis
Easy-to-deploy batch interfaces
Intuitive workflow design
Customizable semantic typing
Interoperability with other workflow and analytical
environments (at exec level)

Ecology
› SEEK: Ecological Niche Modeling and climate change
› REAP: Modeling parasite invasions in grasslands using sensor networks
› NEON: Ecological sensor networks; COMET: Environmental science

Geosciences
› GEON: LiDAR data processing, Geological data integration
› NEESit: Earthquake engineering

Phylogenetics
› ATOL: Processing Phylodata; CiPRES: Phylogenetic tools

Conservation biology
› SanParks: Thresholds of Potential Concerns

Library science
› DIGARCH: Digital preservation; UK Text Mining Center: Cheshire feature and archival

Chemistry
› Resurgence: Computational chemistry; DART/ARCHER: X-Ray crystallography

Oceanography
› REAP: SST data processing; LOOKING/OOI CI: ocean observing CI
› ROADNet: real-time data modeling and analysis

Molecular biology
› SDM: Gene promoter identification and ScalaBLAST
› ChIP-chip: Genome-scale research; CAMERA: Metagenomics

Physics
› SDM: astrophysics TSI-1 and TSI-2; CPES: Plasma fusion simulation; ITER-EU: ITM fusion workflows