Scientific Data & Workflow Engineering
Preliminary Notes from the Cyberinfrastructure Trenches
Bertram Ludäscher
Associate Professor
Dept. of Computer Science & Genome Center
University of California, Davis
Fellow
San Diego Supercomputer Center
University of California, San Diego
Outline
• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary
Nov. 15th 2004
Scientific Data & WF Engineering, B.Ludäscher
Science Environment for Ecological Knowledge (SEEK) Overview
• Domain Science Driver
– Ecology (LTER), biodiversity, …
• Analysis & Modeling System
– Design & execution of ecological models & analysis (“scientific workflows”)
– {application,upper}-ware
→ Kepler system
• Semantic Mediation System
– Data Integration of hard-to-relate sources and processes
– Semantic Types and Ontologies
– upper middleware
→ Sparrow Toolkit
• EcoGrid
– Access to ecology data and tools
– {middle,under}-ware
→ unified API to SRB/MCAT, MetaCat, DiGIR, … datasets
sample CS problem [DILS’04]
Common CI Infrastructure Pieces
• Other CI-projects (e.g. GEON, … ) have similar
service-oriented architectures:
– Seamless and uniform data access (“Data-Grid”)
• data & metadata registry
– distributed and high performance computing
platform (“Compute-Grid”)
• service registry
– Federated, integrated, mediated databases
• often use of semantic extensions (e.g. ontologies)
– User-friendly workbench / problem-solving environment
→ scientific workflows
• add to this sensors, observing systems …
… Example: Realtime Environment for
Analytical Processing (REAP vision)
The Great Unified System
• Many engineering and CS challenges!
… we’ll see some …
• Our focus:
– Scientific data integration
• How to associate, mediate, integrate complex scientific data?
– Scientific workflows
• How to devise larger scientific workflows for process
automation from individual components (e.g. web services)?
• Disclaimer:
… often scratching the surface; see references &
research literature for details …
Outline
• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary
An Online Shopper’s Information Integration Problem
El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”
“One-World” Mediation
NOTE: non-trivial data engineering challenges!
[Figure: information integration via a mediator (virtual DB), as opposed to a data warehouse; site labels: addall.com, amazon.com, barnes&noble.com, half.com, A1books.com]
A Home Buyer’s Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
“Multiple-Worlds” Mediation
[Figure: information integration over the sources Realtor, Crime Stats, School Rankings, Demographics]
A Neuroscientist’s Information Integration Problem
Biomedical Informatics Research Network, http://nbirn.net
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
“Complex Multiple-Worlds” Mediation
Inter-source links:
• unclear for the non-scientists
• hard for the scientist
[Figure: information integration over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]
Interoperability & Integration Challenges
→ reconciling S5 heterogeneities
→ “gluing” together resources
→ bridging information and knowledge gaps computationally
• System aspects: “Grid” Middleware
– distributed data & computing, SOA
– web services, WSDL/SOAP, WSRF, OGSA, …
– sources = functions, files, data sets …
• Syntax & Structure: (XML-Based) Data Mediators
– wrapping, restructuring
– (XML) queries and views
– sources = (XML) databases
• Semantics: Model-Based/Semantic Mediators
– conceptual models and declarative views
– Knowledge Representation: ontologies, description logics (RDF(S), OWL ...)
– sources = knowledge bases (DB+CMs+ICs)
• Synthesis: Scientific Workflow Design & Execution
– Composition of declarative and procedural components into larger workflows
– (re)sources = services, processes, actors, …
Information Integration Challenges:
S4 Heterogeneities
• System aspects
– platforms, devices, data & service distribution, APIs, protocols, …
→ Grid middleware technologies
+ e.g. single sign-on, platform independence, transparent use of remote
resources, …
• Syntax & Structure
– heterogeneous data formats (one for each tool ...)
– heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
– heterogeneous schemas (one for each DB ...)
→ Database mediation technologies
+ XML-based data exchange, integrated views, transparent query rewriting, …
• Semantics
– descriptive metadata, different terminologies, “hidden” semantics (context), implicit assumptions, …
→ Knowledge representation & semantic mediation technologies
+ “smart” data discovery & integration
+ e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!
Information Integration Challenges:
S5 Heterogeneities
• Synthesis of applications, analysis tools, data &
query components, … into “scientific workflows”
– How to make use of these wonderful things & put them together to solve a scientist’s problem?
→ Scientific Problem Solving Environments (PSEs)
→ Portals, Workbench (“scientist’s view”)
+ ontology-enhanced data registration, discovery,
manipulation
+ creation and registration of new data products from existing ones, …
→ Scientific Workflow System (“engineer’s view”)
+ for designing, re-engineering, deploying analysis pipelines
and scientific workflows; a tool to make new tools …
+ e.g., creation of new datasets from existing ones, dataset
registration, …
Information Integration from a
Database Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (databases, web sites, ...)
and user questions Q1,..., Qn that can –in principle– be
answered using the information in the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective: source = “database”
→ Si has a schema (relational, XML, OO, ...)
→ Si can be queried
→ define virtual (or materialized) integrated (or global) view G over local sources S1, ..., Sk using database query languages (SQL, XQuery, ...)
→ questions become queries Qi against G(S1, ..., Sk)
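A minimal Python sketch of the database perspective above, under assumed toy data: two hypothetical sources S1 and S2 with different schemas, a virtual global view G defined over them, and a user question posed as a query against G only. All schemas, field names and rows are illustrative and do not come from the talk.

```python
from typing import Iterator

# Hypothetical sources with different local schemas (illustrative data only).
S1 = [{"title": "Tractatus", "price": 12.0, "shipping": 4.0}]
S2 = [{"book": "Tractatus", "total_cost": 15.5},
      {"book": "Investigations", "total_cost": 20.0}]

def G() -> Iterator[dict]:
    """Virtual global view G(S1, S2): computed on demand, never stored."""
    for row in S1:
        yield {"title": row["title"],
               "cost": row["price"] + row["shipping"],
               "source": "S1"}
    for row in S2:
        yield {"title": row["book"], "cost": row["total_cost"], "source": "S2"}

def cheapest(title: str) -> dict:
    """A user question Q, expressed as a query against G only."""
    return min((r for r in G() if r["title"] == title), key=lambda r: r["cost"])

best = cheapest("Tractatus")
```

The query never mentions S1 or S2 directly; replacing the generator with a stored table would give the materialized (warehouse) variant.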
Standard (XML-Based) Mediator Architecture
1. USER/Client sends Query Q ( G (S1,..., Sk) ) against the Integrated Global (XML) View G
2. MEDIATOR: query rewriting, using the Integrated View Definition G(..) ← S1(..)…Sk(..)
3. Subqueries Q1, Q2, Q3 are sent to the wrappers
4. Wrappers return {answers(Q1)}, {answers(Q2)}, {answers(Q3)}
5. MEDIATOR: post processing
6. USER/Client receives {answers(Q)}
Each source S1, S2, …, Sk is exported through a Wrapper as an (XML) View; web services as wrapper APIs
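The six steps of the mediator architecture can be sketched roughly as follows. This is an illustration under assumed names, not an actual mediator: the wrappers, their local attribute names and the data are hypothetical, and the "rewriting" is reduced to sending the same selection to every source.

```python
from typing import Callable

def make_wrapper(rows: list[dict], rename: dict[str, str]) -> Callable[[str], list[dict]]:
    """Build a hypothetical wrapper that exports one source in the global vocabulary."""
    def wrapper(title: str) -> list[dict]:
        # Translate local attribute names to the global view's names,
        # then answer the subquery "all offers for this title".
        exported = [{rename[k]: v for k, v in row.items()} for row in rows]
        return [r for r in exported if r["title"] == title]
    return wrapper

wrappers = {
    "S1": make_wrapper([{"name": "Tractatus", "usd": 16.0}],
                       {"name": "title", "usd": "cost"}),
    "S2": make_wrapper([{"book": "Tractatus", "total": 15.5}],
                       {"book": "title", "total": "cost"}),
}

def mediator(title: str) -> list[dict]:
    # Steps 2-3: turn Q into one subquery per source and dispatch them.
    partial = {name: w(title) for name, w in wrappers.items()}
    # Steps 4-5: collect the partial answers and post-process (tag, sort).
    merged = [dict(r, source=name) for name, rows in partial.items() for r in rows]
    return sorted(merged, key=lambda r: r["cost"])

# Steps 1 and 6: the client sees only the mediator.
answers = mediator("Tractatus")
```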
Query Planning in Data Integration
• Given:
– Declarative user query Q: answer(…) ← … G ...
– … & { G ← … S … } global-as-view (GAV)
– … & { S ← … G … } local-as-view (LAV)
– … & { ic(…) ← … S … G … } integrity constraints (ICs)
• Find:
– equivalent (or minimal containing, maximal contained) query plan Q’: answer(…) ← … S …
→ query rewriting (logical/calculus, algebraic, physical levels)
• Results:
– A variety of results/algorithms; depending on classes of
queries, views, and ICs: P, NP, … , undecidable
– hot research area in core CS (database community)
Scientific Data Integration using
Semantic Extensions
Example: Geologic Map Integration
• Given:
– Geologic maps from different state geological
surveys (shapefiles w/ different data schemas)
– Different ontologies:
• Geologic age ontology (e.g. USGS)
• Rock classification ontologies:
– Multiple hierarchies (chemical, fabric, texture, genesis)
from Geological Survey of Canada (GSC)
– Single hierarchy from British Geological Survey (BGS)
• Problem:
– Support uniform queries across all maps
– … using different ontologies
– Support registration w/ ontology A, querying w/
ontology B
Schema Integration (“registering” local schemas to the global schema)
[Figure: local source schemas registered to the integration schema.
Sources: Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana E., Montana West, Idaho.
Local attributes such as ABBREV, NAME, TYPE, FMATN and FORMATION are registered to Formation; PERIOD, TIME_UNIT and AGE to Age; LITHOLOGY to Composition, Fabric and Texture.
Example values: FORMATION = Livingston formation, AGE = Tertiary-Cretaceous, LITHOLOGY = andesitic sandstone]
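A small sketch of what “registering” local schemas to the global schema could look like in code. The attribute names are taken from the slide, but the assignment of attributes to particular sources is an assumption (hence the neutral keys `source_a` to `source_c`), and mapping LITHOLOGY to Composition alone is a simplification.

```python
# Assumed registration of local attribute names to the integration schema.
# Attribute names come from the slide; which source uses which is a guess.
REGISTRATION = {
    "source_a": {"ABBREV": "Formation", "PERIOD": "Age"},
    "source_b": {"FMATN": "Formation", "TIME_UNIT": "Age"},
    "source_c": {"FORMATION": "Formation", "AGE": "Age", "LITHOLOGY": "Composition"},
}

def to_global(source: str, record: dict) -> dict:
    """Rename a local record's attributes to the integration schema."""
    mapping = REGISTRATION[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

local = {"FORMATION": "Livingston formation",
         "AGE": "Tertiary-Cretaceous",
         "LITHOLOGY": "andesitic sandstone"}
unified = to_global("source_c", local)
```

Once every source is registered this way, a single query over Formation and Age can be answered uniformly across all maps.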
Multihierarchical Rock Classification “Ontology”
(Taxonomies) for “Thematic Queries” (GSC)
Genesis
Fabric
Composition
Texture
Ontology-Enabled Application Example: Geologic Map Integration
[Figure: Nevada map results for the query “Show formations where AGE = ‘Paleozoic’”, once without the age ontology and once with the age ontology (domain knowledge); +/- a few hundred million years]
Querying by Geologic Age …
Querying by Geologic Age: Results
Querying by Chemical Composition
… (GSC)
Semantic Mediation (via “semantic registration” of schemas and ontology articulations)
• Schema elements and/or data values are associated with concept expressions from the target ontology
→ conceptual queries “through” the ontology
• Articulation ontology
→ source registration to A, querying through B
• Semantic mediation: query rewriting w/ ontologies
[Figure: Database1 with semantic registration to Ontology A, Database2 with semantic registration to Ontology B, ontology articulations between A and B, and concept-based (“semantic”) queries]
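Query rewriting with an ontology can be illustrated with the geologic age example. The sketch below assumes a toy fragment of an age hierarchy and four hypothetical formation records; it is not the GEON implementation. A query for a concept is expanded to all concepts below it before the data is filtered.

```python
# Toy fragment of a geologic age ontology (subdivision-of relation).
# The period names are standard; the structure here is a simplification.
SUBDIVISIONS = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian", "Devonian",
                  "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}

def expand(concept: str) -> set[str]:
    """Return the concept plus everything below it in the ontology."""
    result = {concept}
    for child in SUBDIVISIONS.get(concept, []):
        result |= expand(child)
    return result

# Hypothetical map records registered with source-specific age labels.
formations = [
    {"name": "F1", "age": "Paleozoic"},
    {"name": "F2", "age": "Devonian"},
    {"name": "F3", "age": "Pennsylvanian"},
    {"name": "F4", "age": "Jurassic"},
]

def query_by_age(age: str, use_ontology: bool) -> list[str]:
    wanted = expand(age) if use_ontology else {age}
    return [f["name"] for f in formations if f["age"] in wanted]

without = query_by_age("Paleozoic", use_ontology=False)
with_onto = query_by_age("Paleozoic", use_ontology=True)
```

Without the ontology only the literal match is found; with it, formations labelled with any Paleozoic subdivision are returned as well.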
Different views on State
Geological Maps
Sedimentary Rocks: BGS Ontology
Sedimentary Rocks: GSC Ontology
Implementation in OWL: Not only “for the
machine” …
Source Contextualization
through Ontology Refinement
In addition to registering
(“hanging off”) data relative to
existing concepts, a source
may also refine the mediator’s
domain map...
→ sources can register new concepts at the mediator ...
Outline
• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary
What is a Scientific Workflow (SWF)?
• Goals:
– automate a scientist’s repetitive data management and
analysis tasks
– typical phases:
• data access, scheduling, generation, transformation,
aggregation, analysis, visualization
→ design, test, share, deploy, execute, reuse, … SWFs
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
Source: NIH BIRN (Jeffrey Grethe, UCSD)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
[Figure: GARP workflow diagram.
Steps: EcoGrid Query (against registered EcoGrid databases), Layer Integration, Sample Data, Data Calculation, Map Generation, Validation, Generate Metadata, Archive to EcoGrid.
Data products: (a) species presence & absence points (native range; invasion area), (b) environmental layers (native range; invasion area), (c) integrated layers (native range; invasion area), (d) training sample and test sample, (e) GARP rule set, (f) native range prediction map and invasion area prediction map, (g) model quality parameter, (h) selected prediction maps.
Other labels: User, +A1, +A2, +A3]
Source: NSF SEEK (Deana Pennington et al., UNM)
Commercial & Open Source
Scientific “Workflow” (well, Dataflow) Systems
Kensington Discovery
Edition from InforSense
Triana
Taverna
SCIRun: Problem Solving Environments for
Large-Scale Scientific Computing
• SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
• Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)
Ptolemy II
see!
read!
try!
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Why Ptolemy II (and thus KEPLER)?
• Ptolemy II Objective:
– “The focus is on assembly of concurrent components. The key underlying
principle in the project is the use of well-defined models of computation that
govern the interaction between components. A major problem area being
addressed is the use of heterogeneous mixtures of models of computation.”
• Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
• User-Orientation
– Workflow design & exec console (Vergil GUI)
– “Application/Glue-Ware”
• excellent modeling and design support
• run-time support, monitoring, …
• not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
• but middle-/underware is conveniently accessible through actors!
• PRAGMATICS
– Ptolemy II is mature, continuously extended & improved, well-documented
(500+pp)
– open source system
– Ptolemy II folks actively participate in KEPLER
KEPLER/CSP:
Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)
Ilkay Altintas SDM, Resurgence
Kim Baldridge Resurgence, NMI
Chad Berkley SEEK
Shawn Bowers SEEK
Terence Critchlow SDM
Tobin Fricke ROADNet
Jeffrey Grethe BIRN
Christopher H. Brooks Ptolemy II
Zhengang Cheng SDM
Dan Higgins SEEK
www.kepler-project.org
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs EOL
Edward A. Lee Ptolemy II
Kai Lin GEON
Bertram Ludaescher SEEK, GEON, SDM, ROADNet, BIRN
Mark Miller EOL
Steve Mock NMI
Steve Neuendorffer Ptolemy II
Jing Tao SEEK
Mladen Vouk SDM
Xiaowen Xin SDM
Yang Zhao Ptolemy II
Bing Zhu SEEK
•••
KEPLER: An Open Collaboration
• Initiated by members from NSF SEEK and DOE SDM/SPA; now
several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, …)
• Open Source (BSD-style license)
• Intensive Communications:
– Web-archived mailing lists
– IRC (!)
• Co-development:
– via shared CVS repository
– joining as a new co-developer (currently):
• get a CVS account (read-only)
• local development + contribution via existing KEPLER member
• be voted “in” as a member/co-developer
• Software & social engineering
– How to better accommodate new groups/communities?
– How to better accommodate different usage/contribution models (core
dev … special purpose extender … user)?
Ptolemy II/KEPLER GUI (Vergil)
“Directors” define the
component interaction
& execution semantics
Large, polymorphic component
(“Actors”) and Directors
libraries (drag & drop)
KEPLER/Ptolemy II GUI refined
Ontology based actor
(service) and dataset search
Result Display
Web Services → Actors
(WS Harvester)
→ “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Some Recent Actor Additions
An “early” example:
Promoter Identification
SSDBM, AD 2003
• Scientist models application as a “workflow” of connected components (“actors”)
• If all components exist, the workflow can be automated/executed
• Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
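The idea of connected actors with pipelined execution can be imitated loosely with Python generators. This is only an analogy: it is not how Ptolemy II's PN director is implemented, and the actor names and the fake `seq(...)` results are made up for the illustration. The log shows that each item flows through the whole pipeline before the next one is produced.

```python
from typing import Iterable, Iterator

log: list[str] = []

def source(ids: Iterable[str]) -> Iterator[str]:
    """Actor that emits one token per gene id."""
    for gene_id in ids:
        log.append(f"emit {gene_id}")
        yield gene_id

def fetch(stream: Iterable[str]) -> Iterator[str]:
    """Stand-in for a sequence lookup actor; no real service is called."""
    for gene_id in stream:
        log.append(f"fetch {gene_id}")
        yield f"seq({gene_id})"

def sink(stream: Iterable[str]) -> list[str]:
    """Actor that collects the final tokens."""
    return [item for item in stream]

# "Wiring" the actors together corresponds to connecting ports in the GUI.
results = sink(fetch(source(["g1", "g2"])))
```

Swapping the execution model (for example, running each actor to completion before the next starts) would change the log but not the results, which is roughly the role a director plays.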
Reengineering a Geoscientist’s
Mineral Classification Workflow
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum
mechanics) – Fall/Winter’04
ORB
Rapid Web Service-based Prototyping
(Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)
Source: Ilkay Altintas, SDM, NLADR
ROADNet: Vernon, Orcutt et al
Web services: Tony Fountain et al
in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
in KEPLER (interactive session)
Source: Dan Higgins, Kepler/SEEK
Blurring Design (ToDo) and Execution
Scientific Workflow Challenges
• Typical Features
– data-intensive and/or compute-intensive
– plumbing-intensive (consecutive web services won’t fit)
– dataflow-oriented
– distributed (remote data, remote processing)
– user-interaction “in the middle”, …
– … vs. (C-z; bg; fg)-ing (“detach” and reconnect)
– advanced programming constructs (map(f), zip, takewhile, …)
– logging, provenance, “registering back” (intermediate) products…
[Figure: original Promoter Identification Workflow [Altintas-et-al-PIW-SSDBM’03], annotated with its problems: components designed to fit; hand-crafted control solution, which also forces sequential execution; hand-crafted Web-service actor; no data transformations available; complex backward control-flow]
A Scientific Workflow Problem: Solved (Computer Scientist’s view)
• Solution based on declarative, functional dataflow process network (= also a data streaming model!)
• Higher-order constructs: map(f)
→ no control-flow spaghetti
→ data-intensive apps
→ free concurrent execution
→ free type checking
→ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
[Figure annotations: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable sub-workflow piw(GeneId)]
Promoter Identification Workflow Redesigned
map(GenbankWS)
Input: {“NM_001924”, “NM020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
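The lifting of a single-item actor to a list-processing actor, map(GenbankWS), might look as follows in Python. The lookup is a stub rather than a call to any real GenBank service, the function names are hypothetical, and the pairing of each input id with an output sequence is assumed from the order shown on the slide. The sequences are kept elided exactly as on the slide.

```python
# Stub standing in for the GenBank web-service actor. The two entries are the
# input/output values shown on the slide, with the sequences left elided.
_STUB = {
    "NM_001924": "CAGT…AATATGAC",
    "NM020375": "GGGGA…CAAAGA",
}

def genbank_ws(gene_id: str) -> str:
    """Hypothetical single-item service: accession id -> sequence."""
    return _STUB[gene_id]

def map_actor(f):
    """Lift a one-item actor f to an actor over a whole list: map(f)."""
    return lambda items: [f(x) for x in items]

sequences = map_actor(genbank_ws)(["NM_001924", "NM020375"])
```

No loop or backward control-flow appears in the workflow itself; the iteration lives entirely inside the generic `map_actor`.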
A Research Problem:
Optimization by Rewriting
• Example: PIW as a declarative, referentially transparent functional process
→ optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g), so map(f o g) can be run instead of map(f) o map(g)
• Technical report & PIW specification in Haskell
[Figure annotation: combination of map and zip]
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
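The rewriting rule map(f o g) = map(f) o map(g) can be checked on a small example. The functions `f` and `g` below are arbitrary illustrative choices; the equality relies on both being free of side effects, which is what referential transparency provides.

```python
def compose(f, g):
    """compose(f, g)(x) == f(g(x))"""
    return lambda x: f(g(x))

def map_(f):
    """Curried map over lists: map_(f)(xs) applies f to every element."""
    return lambda xs: [f(x) for x in xs]

f = lambda s: s.upper()   # illustrative pure function
g = lambda s: s.strip()   # illustrative pure function

data = ["  acgt ", "ttga"]
unfused = compose(map_(f), map_(g))(data)   # two passes, one intermediate list
fused = map_(compose(f, g))(data)           # one pass, no intermediate list
```

Both forms give the same result, so an optimizer is free to pick the fused one and avoid materializing the intermediate list.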
A KR+DI+Scientific Workflow Problem
• Services can be semantically compatible, but
structurally incompatible
[Figure: the output port Ps of the Source Service and the input port Pt of the Target Service each carry a semantic type, drawn from ontologies (OWL), and a structural type. Semantic Type Ps and Semantic Type Pt are compatible (⊑), but Structural Type Ps and Structural Type Pt are incompatible (⋠), so the desired connection from Ps to Pt cannot be made directly]
Source: [Bowers-Ludaescher, DILS’04]
Ontology-Informed Data
Transformation (“Structure-Shim”)
[Figure: Semantic Type Ps and Semantic Type Pt are compatible (⊑) in the ontologies (OWL). Registration mappings (input and output) relate Structural Type Ps and Structural Type Pt to these semantic types and yield a correspondence, from which a Transformation (Ps) is generated that enables the desired connection from port Ps of the Source Service to port Pt of the Target Service]
Source: [Bowers-Ludaescher, DILS’04]
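A much simplified sketch of the structure-shim idea, not the algorithm of [Bowers-Ludaescher, DILS’04]: structural types are flat records, registration mappings assign each field a concept, and a transformation is generated by matching fields whose concepts are compatible. The ontology, concept names and field names are all invented for the illustration.

```python
from typing import Optional

# Toy ontology: child concept -> parent concept (is-a). Names are illustrative.
IS_A = {"SpeciesAbundance": "Measurement", "Measurement": "Thing"}

def subsumed_by(c: Optional[str], d: str) -> bool:
    """True if concept c equals d or is a descendant of d."""
    while c is not None:
        if c == d:
            return True
        c = IS_A.get(c)
    return False

# Registration mappings: structural field -> semantic concept.
source_reg = {"abund": "SpeciesAbundance", "site_id": "Site"}   # output port Ps
target_reg = {"value": "Measurement", "location": "Site"}       # input port Pt

def generate_shim(src: dict[str, str], tgt: dict[str, str]):
    """Derive a field correspondence from the registrations and return a
    function that restructures a source record into the target structure."""
    corr = {}
    for t_field, t_concept in tgt.items():
        matches = [s for s, s_concept in src.items()
                   if subsumed_by(s_concept, t_concept)]
        if len(matches) != 1:
            raise ValueError(f"no unique source field for {t_field!r}")
        corr[t_field] = matches[0]
    return lambda record: {t: record[s] for t, s in corr.items()}

shim = generate_shim(source_reg, target_reg)
out = shim({"abund": 42, "site_id": "A7"})
```

The two services never agree on field names; the shared ontology is what lets the correspondence be found automatically.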
Outline
• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary
Link-Up & Crystallization Points
• Shared (Domain) Science Vision, Goals
– NVO, SCEC, Human Genome Project, …
• Technology Waves
– XML, web services, WSRF, Semantic Web (OWL), Portlets, …
• Standards for data exchange, metadata, data access
protocols, …
– GML, EML, netCDF, HDF, …, ADN, …, DODS/OpenDAP, …
– Organizations: W3C, GGF, …
• Community ontologies
– GO (Gene Ontology), ecoinformatics, seismology, geochemistry, …
– … from Saulus to Paulus …
• Shared Community Tools and Tool Co-Development
– SRB, Globus, …, Kepler, …
Shared Science Vision, Goals: SCEC/CME
Southern California Earthquake Center / Community Modeling Environment
Project
Simulation of Seismic Wave
Propagation of a Magnitude 7.7
Earthquake on San Andreas
Fault
– PIs: Thomas Jordan, Bernard
Minster, Reagan Moore, Carl
Kesselman
– Simulation
• 240 Processors for 5 days
• 47 Terabytes of data generated
– SDSC SAC project optimized
code on DataStar parallel
computer (both MPI I/O
management and checkpointing)
– Future simulation: increasing the resolution by a factor of 2 implies 1 PB of simulation results and 1000 processors for 20 days
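One way to sanity-check the projected figures is a back-of-the-envelope calculation. It assumes, and this is not stated on the slide, that doubling the resolution doubles the grid in three spatial dimensions and also doubles the number of time steps, for a factor of 2^4 = 16 in both output volume and compute.

```python
factor = 2 ** 4                      # assumed: 3 space dimensions + time

current_tb = 47                      # terabytes generated (from the slide)
current_cpu_days = 240 * 5           # processors x days (from the slide)

projected_tb = current_tb * factor              # 752 TB, roughly 0.75 PB
projected_cpu_days = current_cpu_days * factor  # 19200 processor-days

slide_cpu_days = 1000 * 20           # the slide's "1000 processors for 20 days"
```

Under that assumption the estimate lands near the slide's round numbers: about three quarters of a petabyte against the quoted 1 PB, and 19,200 against 20,000 processor-days.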
Source: Reagan Moore, SDSC
Example: NVO Community Processes
- created standard data encoding format (FITS image
format)
- made accessible common digital holdings (sky survey images)
- defined Uniform Content Descriptors (common metadata
attributes)
- created standard services (standard access mechanisms to
catalogs
and surveys)
- created digital library (manage derived data products)
- created portals (for combining services interactively)
- created processing pipelines (for automated processing)
- created preservation environment
• Broader impact: found a new star!
Source: Reagan Moore, SDSC
Semantic Mediation “Waterfall” …
[Figure: iterative development “waterfall” with the stages Ontologies; Semantic Data, Service Annotation; Resource Discovery; Resource Integration; Workflow Analysis; Workflow Planning]
Source: Shawn Bowers, SEEK AHM’04
GEON Dataset Generation & Registration
(a co-development in KEPLER)
% Makefile
$> ant run
[Figure: workflow screenshot annotated with the contributors of its parts: Matt, Chad, Dan et al. (SEEK); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy); includes SQL database access (JDBC)]
KEPLER as a Melting Pot …
• A grass-roots project
– Needed a coalition of the (really!) willing
• Inter-project links
– SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM,
Ptolemy II, NIH BIRN (coming …), UK eScience myGrid, …
• Intra-project links
– e.g. in SEEK: AMS ↔ SMS ↔ EcoGrid
• Inter-technology links
– Globus, SRB, JDBC, web services, soaplab services, command
line tools, R, GRASS, XSLT, …
• Interdisciplinary links
– CS, IT, domain sciences, … (recently: usability engineer)
Outline
• Introduction: CI Sample Architectures
• Scientific Data Integration
• Scientific Workflow Management
• Links & Crystallization Points
• Lessons learnt & Summary
Some Lessons Learnt
• Eat your own dog-food (or at least try…)
– start using your own (CI) tools early
• Collaboration tools
– CVS repositories (+cvsview, webcvs)
– Mailing lists (e.g. mailman → googlified)
– Bugzilla (detailed tracking of tech. issues & bugs)
– Wiki (community authored web resource, e.g. high-level tech. issues)
• Where is the XYZ repository/registry?
– EcoGrid (SEEK) registry, GEON registry, KEPLER actor & datasets
repository, …
– UDDI what?
• CI Melting Pots: SDSC, …
– NCEAS, LTER, NLADR (w/ NCSA), KU Specify, …
Q & A
Further Reading
Related Publications
• Semantic Data Registration and Integration
– On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
– A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
– Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
– The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
– Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
• Query Planning and Rewriting
– Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04), Paris, France, June 2004.
– Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher, 9th Intl. Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 2004, LNCS 2992.
– Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04), Boston, IEEE Computer Society, April 2004.
Related Publications
• Scientific Workflows
– Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
– Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.
– An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004, Leipzig, Germany, LNCS 2994.
– A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.