Scientific Workflows - San Diego Supercomputer Center

Scientific Workflows:
Some Examples and Technical Issues
Bertram Ludäscher
Data & Knowledge Systems (DAKS)
San Diego Supercomputer Center
University of California, San Diego
GGF9 10/07/2003, Chicago
Outline
• Scientific Workflow (SWF) Examples
– Genomics: Promoter Identification (DOE SciDAC/SDM)
– Neuroscience: Functional MRI (NIH BIRN)
– Ecology: Invasive Species, Climate Change (NSF SEEK)
• SWFs & Analysis Pipelines …
– vs Business WFs
– vs Traditional Distributed Computing
• Some Technical Issues
Acknowledgements
• NSF, NIH, DOE
• GEOsciences Network (NSF)
– www.geongrid.org
• Biomedical Informatics Research Network (NIH)
– www.nbirn.net
• Science Environment for Ecological Knowledge (NSF)
– seek.ecoinformatics.org
• Scientific Data Management Center (DOE)
– sdm.lbl.gov/sdmcenter/
Scientific Workflow Examples
1. Promoter Identification (Genomics)
2. fMRI (Neurosciences)
3. Invasive Species (Ecology)
4. Bonus Material: Semantic Data Integration (Geology)
Example: Promoter Identification Workflow (PIW)
(simplified)
From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)
Conceptual Workflow
(Promoter Identification Workflow PIW)
[Figure: workflow diagram; box labels:]
• Compute clusters (min. distance)
• For each promoter
• Select gene-set (cluster-level)
• For each gene
• Retrieve matching cDNA
• Retrieve genomic sequence
• Extract promoter region (begin, end)
• Retrieve transcription factors
• Compute subsequence labels
• Arrange transcription factors
• With all promoter models
• Align promoters
• Create consensus sequence
• Compute joint promoter model
SWF: Promoter Identification
SWF: Promoter Identification
• More dataflow than workflow
– but some branching, looping, merging, …
– not: documents/objects undergoing modifications
– instead: dataset-out = analysis(dataset-in)
• Need for “collection programming” (functional programming style)
– Iterations over lists (foreach)
– Filtering
– Functional composition
– Generic & higher-order operations (zip, map(f), …)
• Need for abstraction and nested workflows
• Need for rich user interaction / steering:
– pause & resume
– select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
• Need for persistence of intermediate products
→ data provenance (virtual data concept, e.g. GriPhyN)
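The “collection programming” operations named above (foreach, filtering, functional composition, zip, map) can be illustrated with a minimal Python sketch. The functions `extract_region` and `gc_fraction` and all sequence data are hypothetical stand-ins, not part of the actual PIW.

```python
from functools import reduce

def compose(*funcs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

# Hypothetical analysis steps (assumptions for illustration only).
def extract_region(seq, begin, end):
    return seq[begin:end]

def gc_fraction(region):
    return sum(base in "GC" for base in region) / len(region)

genes = ["ATGCGC", "TTTTAA", "GGGCCC"]   # made-up sequences
bounds = [(0, 4), (0, 4), (2, 6)]        # made-up (begin, end) pairs

# zip + iteration over lists: pair each sequence with its bounds.
regions = [extract_region(s, b, e) for s, (b, e) in zip(genes, bounds)]

# Filtering: keep regions with at least 50% G/C content.
gc_rich = list(filter(lambda r: gc_fraction(r) >= 0.5, regions))

# Functional composition + map(f): reverse each region, then lower-case it.
label = compose(str.lower, lambda r: r[::-1])
labels = list(map(label, gc_rich))
```

Each step consumes a collection and produces a new one (dataset-out = analysis(dataset-in)); nothing is modified in place.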
From: BIRN-CC, UCSD (Jeffrey Grethe)
Details of the Functional MRI (Magnetic Resonance
Imaging) Analysis Workflow (Jeffrey Grethe)
1. Collect data (K-space images in Fourier space) from the MR scanner while the subject performs a specific task.
2. Reconstruct the K-space data to image data (this requires scanner parameters for the reconstruction).
3. Now have anatomical and functional data.
4. Pre-process the functional data:
  1. Correct for differences in slice acquisition time (each slice in a volume is collected at a slightly different time), so that all slices appear to be acquired at the same time.
  2. Now correct for subject motion (head movement in the scanner) by realigning all functional images.
5. Register the functional images with the anatomical image → all images are now in the same space (all aligned with one another).
6. Move all subjects into template space through non-linear spatial normalization. There exist atlas templates (made from many subjects) that one can normalize to so that all subjects are in the same space, allowing for direct comparison across subjects.
7. DATA VERIFICATION: check whether all these procedures worked. If not, go back and try again (possibly tweaking some parameters for the routines or re-doing some of it by hand).
8. Move on to statistics. First we do single-subject statistics: in addition to the images, information about the experimental paradigm is required. The results can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task.
9. Can also combine statistical data from multiple subjects, do a group/population analysis, and display these results.
→ Interactive nature of these workflows is critical (data verification). Can these steps be automated or semi-automated?
→ Need metadata from collection equipment and experimental design!
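The data verification step (run, check, go back and tweak parameters, otherwise hand over to a person) can be sketched as a semi-automated retry loop. This is only an illustration under assumed names: `realign`, `motion_ok`, the `smoothing` parameter, and the 1.0 mm threshold are toy stand-ins, not real fMRI routines or recommended values.

```python
def run_with_verification(step, verify, params, tweaks, data):
    """Run `step`; if `verify` rejects the result, retry with each tweak in turn."""
    attempts = [params] + [{**params, **t} for t in tweaks]
    for p in attempts:
        result = step(data, **p)
        if verify(result):
            return result, p
    # No parameter set passed verification: a person has to step in.
    raise RuntimeError("verification failed; manual intervention needed")

# Toy stand-in for a realignment routine (assumption, not a real algorithm):
# pretend the residual motion shrinks as the smoothing parameter grows.
def realign(volumes, smoothing):
    return {"volumes": volumes, "residual_motion_mm": 2.0 / smoothing}

# Toy verification rule with a hypothetical threshold.
def motion_ok(result):
    return result["residual_motion_mm"] < 1.0

result, used = run_with_verification(
    realign, motion_ok,
    params={"smoothing": 1},
    tweaks=[{"smoothing": 2}, {"smoothing": 4}],
    data=["vol1", "vol2"],
)
```

The loop automates only the "tweak and re-run" part; the final exception marks the point where interactive steering remains necessary.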
SEEK: Vision & Overview
• Large collaborative NSF/ITR project: UNM, UCSB, SDSC/UCSD, UKansas, ...
• Fundamental improvements for researchers: global access to ecologically relevant data; rapidly locate and utilize distributed computation; capture, reproduce, extend analysis process
[Figure: SEEK architecture diagram; callouts and labels:]
• EcoGrid provides unified access to Distributed Data Stores, Parameter Ontologies, & Stored Analyses, and runtime capabilities via the Execution Environment
• SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities
• Semantic Mediation System & Analysis and Modeling System use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration
• AM: Analysis and Modeling System: Analytical Pipeline (AP); Execution Environment (SAS, MATLAB, FORTRAN, etc.); Library of Analysis Steps, Pipelines & Results; example of “AP0”: invasive species over time
• SMS: Semantic Mediation System: Semantic Mediation Engine; Logic Rules; Parameters w/ Semantics; Parameter Ontologies; Data Binding; Query Processing
• EcoGrid: raw data sets wrapped for integration w/ EML, etc. (Dar, MC, EML, SRB, KNB Species, TaxOn, ...)
• Other diagram labels: ASx, ASy, ASz, ASr, TS1, TS2, ECO2, ECO2-CL, Wrp, C
GARP Invasive Species Pipeline
[Figure: pipeline diagram; labels, with letters (a) to (h) keyed to the description on the following slide:]
• Data sources: Registered EcoGrid Databases, accessed via EcoGrid Query
• (a) Species presence & absence points (native range; invasion area)
• (b) Environmental layers (native range; invasion area)
• (c) Integrated layers (native range; invasion area), produced by Layer Integration
• (d) Sample Data (+A1, +A2, +A3), split into training sample and test sample
• (e) GARP rule set, produced by Data Calculation
• (f) Native range prediction map and invasion area prediction map, produced by Map Generation
• (g) Model quality parameter, produced by Validation and passed to the User
• (h) Selected prediction maps, followed by Generate Metadata and Archive to EcoGrid
From: NSF SEEK (Deana Pennington et al)
Details: GARP Invasive Species Pipeline
Modeling the distribution of a species using appropriate ecological niche modeling algorithms (e.g., GARP, the Genetic Algorithm for Rule-set Production) in an analytical pipeline environment.
a) The EcoGrid is queried for data specifying the presence or absence of a particular species in a given area.
b) Multiple environmental layers relevant to the species' distribution are selected with a second EcoGrid query.
c) Environmental layers, representing the current range of the species (native range) and the range of interest to the invasion study, are spatially integrated into layer stacks.
d) Samples are selected from the presence/absence data, and the corresponding values from the native range environmental layer stack are retrieved. The sample is divided into a training set and a testing set.
e) The GARP algorithm is run on the training set. The GARP rule set is then applied to both areas, creating predictive maps under current conditions (native range) and after invasion (invaded range).
f) Predictive maps are sent to the scientist’s workstation for further analysis.
g) A comparison is made between the ground-truth occurrence data that were set aside as test data and the corresponding locations on the native range predictive map. Error measures providing an indication of model quality are sent to the scientist’s workstation.
h) After multiple runs, the user may select maps for metadata generation and archiving back to the EcoGrid.
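Steps d), e) and g) can be sketched in a few lines of Python. This is a rough illustration only: the threshold rule in `fit_rule` is a trivial stand-in and not the GARP algorithm, the deterministic even/odd split is an assumption, and the sample points are made up (elevations only).

```python
def split_sample(points):
    """Step d): deterministic split; even-indexed points train, odd-indexed points test."""
    return points[0::2], points[1::2]

def fit_rule(training):
    """Step e) stand-in (NOT GARP): one elevation threshold between the classes."""
    present = [p["elev"] for p in training if p["present"]]
    absent = [p["elev"] for p in training if not p["present"]]
    threshold = (min(present) + max(absent)) / 2
    return lambda p: p["elev"] >= threshold

def error_rate(rule, test):
    """Step g): fraction of held-out points the rule predicts wrongly."""
    wrong = sum(rule(p) != p["present"] for p in test)
    return wrong / len(test)

# Hypothetical presence/absence sample with one environmental value each.
points = [
    {"elev": 2200, "present": True},
    {"elev": 2320, "present": True},
    {"elev": 1535, "present": False},
    {"elev": 1600, "present": False},
    {"elev": 2100, "present": True},
    {"elev": 1400, "present": False},
]

training, test = split_sample(points)
rule = fit_rule(training)
model_quality = error_rate(rule, test)
```

The held-out test set plays the role of the ground-truth data in g); `model_quality` corresponds to the error measure sent back to the scientist.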
From: NSF SEEK (Deana Pennington et al)
Data Integration: Spatial Integration Aspects
[Figure: species occurrence samples from an Excel file and an Access file are combined with environmental layers]
• Occurrence samples: Sample 1, lat, long, presence; Sample 2, lat, long, presence; Sample 3, lat, long, absence
• Environmental layers: vegetation cover type; elevation (m); mean annual temperature (C)
• Integrated data:
– P, juniper, 2200m, 16C
– P, pinyon, 2320m, 14C
– A, creosote, 1535m, 22C
Integration of heterogeneous data formats. Semantically integrated species occurrence data is combined with spatially integrated environmental data to produce sample data consisting of species occurrence (P = present, A = absent), vegetation type, elevation (m), and mean annual temperature (C).
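A minimal sketch of this spatial integration, assuming the environmental layers are regular grids that can be looked up by cell. The coordinates, grid origin, and cell size are hypothetical; only the integrated values (juniper/2200m/16C, etc.) come from the slide.

```python
def cell_of(lat, lon, origin=(35.0, -107.0), cell_size=0.5):
    """Map a coordinate to a (row, col) grid cell; origin and cell size are assumptions."""
    row = int((lat - origin[0]) / cell_size)
    col = int((lon - origin[1]) / cell_size)
    return row, col

# Occurrence samples, e.g. merged from the Excel and Access files
# (coordinates are made up for illustration).
samples = [
    {"id": 1, "lat": 35.2, "lon": -106.8, "occurrence": "P"},
    {"id": 2, "lat": 35.7, "lon": -106.8, "occurrence": "P"},
    {"id": 3, "lat": 35.2, "lon": -106.2, "occurrence": "A"},
]

# Environmental layers keyed by grid cell.
vegetation = {(0, 0): "juniper", (1, 0): "pinyon", (0, 1): "creosote"}
elevation_m = {(0, 0): 2200, (1, 0): 2320, (0, 1): 1535}
mean_temp_c = {(0, 0): 16, (1, 0): 14, (0, 1): 22}

def integrate(sample):
    """Attach the value of every layer at the sample's location."""
    cell = cell_of(sample["lat"], sample["lon"])
    return (sample["occurrence"], vegetation[cell], elevation_m[cell], mean_temp_c[cell])

integrated = [integrate(s) for s in samples]
```

Real layers would differ in resolution and projection, which is where the layer integration step does its actual work; the lookup here assumes that has already happened.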
From: NSF SEEK (Deana Pennington et al)
SEEK Components
• EcoGrid
– Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data
– “Semantically” mediated and metadata driven
– Centralized search & management portal(s)
• Analysis and Modeling System (AMS)
– Capture, reproduce, and extend analysis process
• Declarative means for documenting analysis
• “Pipeline” system for linking generic analysis steps
• Strong version control for analysis steps
– Easy-to-use interface between data and analysis
• Semantic Mediation System (SMS)
– “smart” data discovery, “type-correct” pipeline construction & data binding:
– determine whether/how to link analytic steps
– determine how data sets can be combined
– determine whether/how data sets are appropriate inputs for analysis steps
AMS Overview
• Objective
– Create a semi-automated system for analyzing data and
executing models that provides documentation, archiving,
and versioning of the analyses, models, and their outputs
(visual programming language?)
• Scope
– Any type of analysis or model in ecology and biodiversity
science
– Massively streamline the analysis and modeling process
– Archiving, rerunning analyses in SAS, Matlab, R, SysStat,
C(++),…
– …
SEEK Analytical Pipeline
• A “workflow” is one or more analytical processes chained together
into an analytical pipeline
• In the SEEK model, data ingestion/cleaning is metadata driven
(specifically with EML)
From: NSF SEEK (Chad Berkley, Matt Jones)
Automation of data integration using workflows
• Workflows can automate the integration process if data is
described with adequate structured metadata
→ WF layer one level above data integration/mediation layer
Simple Data Integration (homogeneous data):
Metadata (EML) may be good enough!
• Integration of homogeneous or mostly homogeneous data via EML metadata
is relatively straightforward
Complex Data Integration (simple example!)
• Integration of heterogeneous data requires much more advanced metadata and processing
– Attributes must be semantically typed
– Collection protocols must be known
– Units and measurement scale must be known
– Measurement mechanics must be known (e.g., that Density = Count/Area)
– This is an advanced research topic within the SEEK project (SMS)
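A small sketch of why units and measurement mechanics must be known: one data set reports a count and a plot area, another reports a density per hectare, and both can be brought to a common density only if Density = Count/Area and the unit conversions are available. The record layout and the sites are hypothetical.

```python
# Area units expressed in square metres (1 ha = 10,000 m2).
AREA_IN_M2 = {"m2": 1.0, "ha": 10_000.0}

def density_per_m2(record):
    """Return individuals per m2, deriving density when only count and area are given."""
    if "density" in record:
        value, per_unit = record["density"]      # e.g. (2500.0, "ha") = 2500 per hectare
        return value / AREA_IN_M2[per_unit]
    count = record["count"]
    area_value, area_unit = record["area"]       # e.g. (200.0, "m2")
    return count / (area_value * AREA_IN_M2[area_unit])   # Density = Count / Area

# Two hypothetical data sets describing the same quantity differently.
site_a = {"count": 50, "area": (200.0, "m2")}
site_b = {"density": (2500.0, "ha")}

densities = [density_per_m2(site_a), density_per_m2(site_b)]
```

Without the semantic annotation that `count` and `area` together determine a density, a mediator has no basis for generating this transformation automatically.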
Semantic Typing
• Label data with semantic types
• Label inputs and outputs of analytical components with semantic types
[Figure labels: Data; Ontology; Workflow Components]
• Use Semantic Mediation System (SMS) to generate transformation steps
– Beware of analytical constraints
• Use SMS to discover relevant components
• Ontology = specification of a conceptualization (a knowledge map)
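One way to picture semantic typing is a subclass check against an ontology before two components are linked. The sketch below is a simplification under stated assumptions: the tiny ontology, the class names, and the component descriptions are all hypothetical, and a real SMS would reason over a much richer ontology language (e.g. OWL).

```python
# Hypothetical ontology fragment: child class -> parent class.
SUBCLASS_OF = {
    "PresenceAbsencePoint": "SpeciesOccurrence",
    "SpeciesOccurrence": "Observation",
    "ElevationLayer": "EnvironmentalLayer",
}

def is_a(semantic_type, ancestor):
    """True if `semantic_type` equals `ancestor` or is one of its descendants."""
    current = semantic_type
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)
    return False

def can_connect(producer, consumer):
    """A link is 'type-correct' if the produced type is accepted by the consumer."""
    return is_a(producer["output"], consumer["input"])

# Hypothetical workflow components labeled with semantic types.
ecogrid_query = {"name": "EcoGridQuery", "output": "PresenceAbsencePoint"}
sampler = {"name": "SampleData", "input": "SpeciesOccurrence"}
layer_stack = {"name": "LayerIntegration", "input": "EnvironmentalLayer"}

ok = can_connect(ecogrid_query, sampler)
not_ok = can_connect(ecogrid_query, layer_stack)
```

The same lookup can drive component discovery: search the library for every component whose input type is an ancestor of the type of the data at hand.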
SWF: Ecology Examples
• Similar requirements as before:
– Rich user interaction
– Analysis pipelines running on an EcoGrid
– Collection programming probably needed
– Abstraction & nested workflows
– Persistent intermediate steps (cf. e.g. Virtual Data concept)
• Additionally:
– Very heterogeneous data
→ need for semantic typing of data and analysis steps
→ semantic mediation support …
• … for pipeline design
• … for data integration at design time and at runtime
Bonus Material: Semantic Data Integration
“Geology Workbench”
Kai Lin (GEON, SDSC)
[Figure: two map views of Nevada for the query “Show formations where AGE = ‘Paleozoic’”: one without the age ontology and one with the age ontology (domain knowledge)]
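The difference between the two query results can be sketched as query expansion over an age ontology: without it, only records literally labeled "Paleozoic" match; with it, records labeled with any Paleozoic period match as well. The formation records are made up, and the ontology fragment lists only the standard periods of the Paleozoic era.

```python
# Fragment of a geologic age ontology: era -> periods.
AGE_ONTOLOGY = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
}

def expand(term):
    """Return the term together with all of its descendants in the ontology."""
    terms = [term]
    for child in AGE_ONTOLOGY.get(term, []):
        terms.extend(expand(child))
    return terms

# Hypothetical formation records, as they might come from a shapefile.
formations = [
    {"name": "F1", "age": "Devonian"},
    {"name": "F2", "age": "Paleozoic"},
    {"name": "F3", "age": "Jurassic"},
]

# Without the ontology: plain string match on the AGE column.
without_ontology = [f["name"] for f in formations if f["age"] == "Paleozoic"]

# With the ontology: match the term or any of its sub-terms.
accepted = set(expand("Paleozoic"))
with_ontology = [f["name"] for f in formations if f["age"] in accepted]
```

The query text stays the same in both cases; only the mediator's domain knowledge changes what counts as a match.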
Navigable, Amalgamated
Rock Type Ontology
Geology Workbench : Initial State
click on Ontologies
click on Datasets
click on Applications
An Ontology-based Mediator
Geology Workbench: Uploading Ontologies
Click on Ontology Submission
Choose an OWL file to upload
Click to check its detail
Name Space
Can be used to import this
ontology into others
Geology Workbench: Data (to Ontology!) Registration
Step 1: Choose Classes
Click on Submission
Data set name
Select a shapefile
Choose an ontology class
Geology Workbench: Data Registration
Step 2: Choose Columns for Selected Classes
It contains information about
geologic age
AREA
PERIMETER
AZ_1000
AZ_1000_ID
GEO
PERIOD
ABBREV
DESCR
D_SYMBOL
P_SYMBOL
Geology Workbench: Data Registration
Step 3: Resolve Mismatches
Two terms do not match any ontology terms
Manually map algonkian into the ontology
Geology Workbench: Ontology-enabled Map Integrator
All areas with the
age Paleozoic
Click on the name
Choose interesting
Classes
Geology Workbench: Change Ontology
Run it
New query interface
Switch from Canadian
Rock Classification to
British Rock
Classification
Submit a mapping
Ontology mapping
between British Rock
Classification and Canadian
Rock Classification
Scientific Workflows (SWF) vs.
Business Workflows
and some
Technical Issues
Business Workflows
• Business Workflows
– show their office automation ancestry
– documents and “work-tasks” are passed
– no data streaming, data-intensive pipelines
– lots of standards to choose from: WfMC, BPML, BPEL4WS, …, XPDL, …
– but often no clear semantics for constructs as simple as this:
[Figure: workflow control-flow construct]
Source: Expressiveness and Suitability of Languages for Control Flow
Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
What is a Scientific Workflow?
• A Misnomer …
• … well, at least for a number of examples…
• Scientific Workflows ≠ Business Workflows
– Business Workflows: “control-flow-rich”
– Scientific Workflows: “data-flow-rich”
– … much more to say …
More on Scientific WF vs Business WF
• Business WF
– Tasks, documents, etc. undergo modifications (e.g., flight
reservation from reserved to ticketed), but modified WF objects
still identifiable throughout
– Complex control flow, task-oriented
– Transactions w/o rollback (ticket: reserved → purchased)
– …
• SWF
– data-in and data-out of an analysis step are not the same object!
– dataflow, data-oriented (cf. AVS/Express, Khoros, …)
– re-run automatically (à la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
– data integration & semantic mediation as part of SWF framework!
– …
SWF vs Distributed Computing
• Distributed Computing (e.g. à la Condor(-G))
– Batch oriented
– Transparent distributed computing (“remote Unix/Java”;
standard/Java universes in Condor)
– HPC resource allocation & scheduling
• SWF
– Often highly interactive for decision making/steering of the WF
and visualization (data analysis)
– Transparent data access (Grid) and integration (database
mediation & semantic extensions)
– Desktop metaphor (“microworkflow”!?); often (but not always!)
light-weight web service invocation
Some Technical Issues (SWFs)
• Design Environment
– Intuitive “visual programming” interface (ideally w/o the “programming” part!!)
– “Smart Typing” extensions (for data→task and task→task bindings)
• Structural typing (e.g. XML Schema)
• Semantic typing (e.g. OWL)
• Specialized semantic types (SI unit system, measurement scales, …)
• “Resource typing”: token consumption/production, execution preconditions
– Declarative programming extensions
• Functional collection programming (e.g., Haskell-like; cf. also BioKleisli/CPL)
• Also: consider what standards bring to the table (BPEL4WS)
• Alternation of analysis and data transformation steps
• Sophisticated dataflow execution models and hybrids thereof:
– Ptolemy-II leads the way: Process Networks (PN), Synchronous Dataflow Networks
(SDF), Continuous Time modeling (CT), Discrete Event modeling (DE)
• Grid-enabling process networks and data integration:
– Borrow from distributed computing technologies and tools (e.g. Globus,
Condor) and distributed data access (e.g., SRB) and integration (mediators)
– Virtualize and Grid-enable everything! analysis@LOC, data@LOC, …
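As an illustration of token consumption/production in a Synchronous Dataflow (SDF) model: if a source actor produces p tokens per firing and its consumer takes c tokens per firing, a periodic schedule needs firing counts r_src and r_dst with r_src · p = r_dst · c. The helper below is a minimal sketch for a single edge only; the function name and the rates are assumptions, and it does not use the Ptolemy-II API.

```python
from math import gcd

def repetitions(produced, consumed):
    """Smallest firing counts (r_src, r_dst) with r_src * produced == r_dst * consumed."""
    g = gcd(produced, consumed)
    return consumed // g, produced // g

# Hypothetical rates: the source emits 2 tokens per firing,
# the consumer needs 3 tokens per firing.
r_src, r_dst = repetitions(produced=2, consumed=3)
tokens_produced = r_src * 2
tokens_consumed = r_dst * 3
```

Because the rates are declared up front, such a schedule can be computed before execution, which is one reason "resource typing" of actors is useful.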
Where do we go?
• from Ptolemy-II to Kepler
• example of what extensions are needed
From Ptolemy-II to … Kepler
• Ptolemy-II: Extensible Open Source Tool (EECS UC Berkeley)
• Various combinable, clearly defined execution models (“domains”)
– PN, SDF, DE, CT
• Kepler = PT-II extensions for Scientific Workflows
• Adopted by SEEK, SciDAC/SDM, and hopefully others! (open source!)
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Promoter Identification Workflow in Ptolemy-II (SSDBM’03)
[Figure: Ptolemy-II workflow screenshot; callouts:]
• hand-crafted control solution; also: forces sequential execution!
• designed to fit (two places)
• hand-crafted Web-service actor
• No data transformations available
• Complex backward control-flow
Simplified Process Network PIW
• Back to purely functional dataflow process network (= a data streaming model!)
• Re-introducing map(f) to Ptolemy-II (was there in PT Classic)
– no control-flow spaghetti
– data-intensive apps
– free concurrent execution
– free type checking
– automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure: process network diagram; callouts:]
• map(f)-style iterators
• Powerful type checking
• Generic, declarative “programming” constructs
• Generic data transformation actors
• Forward-only, abstractable subworkflow piw(GeneId)
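The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be sketched with a Python generator, which also shows why this is a data streaming model: each result is produced as soon as its input token arrives. The body of `piw` is a placeholder assumption; the real subworkflow performs the retrieval and analysis steps.

```python
def piw(gene_id):
    """Hypothetical single-gene subworkflow; here it only tags the id."""
    return f"promoter-model({gene_id})"

def stream_map(f, tokens):
    """map(f) as a streaming actor: emit f(token) for each incoming token."""
    for token in tokens:
        yield f(token)

def PIW(gene_ids):
    """Lift the per-gene subworkflow to a whole stream of gene ids."""
    return stream_map(piw, gene_ids)

models = PIW(iter(["g1", "g2", "g3"]))
first = next(models)      # available before the rest of the stream is consumed
rest = list(models)
```

Since `piw` has no side effects, the individual applications are independent of each other and could be run concurrently without changing the result.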
Optimization by Declarative Rewriting I
• PIW as a declarative, referentially transparent functional process
→ optimization via functional rewriting possible,
e.g. map(f o g) instead of map(f) o map(g), since map(f o g) = map(f) o map(g)
• Combination of map and zip
• Details:
– Technical report & PIW specification in Haskell
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
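The rewriting rule map(f o g) = map(f) o map(g) can be checked on a toy example. The sketch is in Python rather than Haskell, and `f`, `g`, and the data are arbitrary; the equality relies on both functions being free of side effects.

```python
def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))

def fmap(f):
    """Lift f to lists: fmap(f)(xs) applies f to every element."""
    return lambda xs: [f(x) for x in xs]

f = lambda x: x + 1
g = lambda x: x * 2

two_pass = compose(fmap(f), fmap(g))   # map(f) o map(g): builds an intermediate list
one_pass = fmap(compose(f, g))         # map(f o g): single traversal

data = [1, 2, 3]
```

Both pipelines give the same output, but the fused form avoids materializing the intermediate collection, which matters for data-intensive workflows.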
Optimization by Declarative Rewriting II
• Rewritings require that data transformation semantics is known
• e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional
Programming, Hideki John Reekie, University of Technology, Sydney
Summary: Scientific Workflows Everywhere
• Shown bits of scientific workflows in:
– SciDAC/SDM, SEEK, BIRN, GEON, …
• Many others are there:
– GriPhyN et al (virtual data concept): Chimera, Pegasus, DAGMan, Condor-G, …, GridANT, …
– E-Science: e.g., myGrid: XScufl, Taverna, DiscoveryNet
– Pragma, iLTER, …
– Commercial efforts: DiscoveryNet (InforSense), SciTegic, IBM, Oracle, …
• One size fits all?
– Most likely not (Business WFs =/= Scientific WFs)
– Some competition is healthy and reinventing a round wheel is OK
– But some coordination & collaboration can save …
• reinventing the squared wheel
• “leveraging” someone else’s wheel in a squared way …
– Even within SWF, quite different requirements:
• exploratory and ad-hoc vs. well-designed and high throughput
• interactive desktop (w/ lightweight web services/Grid) vs. distributed, batched
Combine Everything:
Die eierlegende Wollmilchsau (German idiom: the “egg-laying wool-milk-sow”, i.e. an all-in-one solution):
• Database Federation/Mediation
– query rewriting under GAV/LAV
– w/ binding pattern constraints
– distributed query processing
• Semantic Mediation
– semantic integrity constraints, reasoning w/ plans, automated
deduction
– deductive database/logic programming technology, AI “stuff”...
– Semantic Web technology (OWL, …)
• Scientific Workflow Management
– more procedural than database mediation (often the scientist is
the query planner)
– deployment using grid services!
FIN