NSF Workshop on Cyberinfrastructure for
Environmental Observatories
Introduction to CI Topics
Chaitan Baru, SDSC/NLADR
Bertram Ludaescher, UC Davis/SDSC
Michael Welge, NCSA/NLADR
NSF Workshop on CI for Environmental Observatories, Dec 6-7, 2004
SAN DIEGO SUPERCOMPUTER CENTER
Outline
• A nexus of CI projects
• CI project “principles”
• CI technical focus areas/topics
• CI organizational issues
CI Projects
• Biomedical
  • BIRN-CC (Ellisman PI, Papadopoulos, Gupta, Baru, …)
  • National Biomedical Computational Resource, NBCR (Arzberger PI, Ellisman, Papadopoulos, Gupta, Baru, …) ...
• Geosciences
  • GEON (Baru PI, Ludaescher, Papadopoulos, Helly, …)
  • SCEC (Jordan PI, Moore, …)
  • LEAD (Droegemeier PI, Wilhelmson, Welge, …)
  • Chronos (Cervato PI, Baru…)
  • CUAHSI-HIS (Maidment PI, Helly, Zaslavsky, …)
  • LOOKING (Smarr/Orcutt PI, Welge, Fountain, …) ...
• Bio/Eco/Environmental
  • SEEK (Michener PI, Ludaescher, Jones, Rajasekar, …)
  • LTER (Michener PI, SDSC partner (Arzberger, Baru, Fountain, Rajasekar)…)
  • NEON (Hayden/Michener Lead PI’s, Krishtalka, Baru, Welge…)
  • ROADNet (Orcutt PI, Vernon, Rajasekar, Ludaescher, Fountain, …)
  • NSF/BDI Lake Metabolism (Arzberger/Kratz PI’s, Fountain, …) ...
• Engineering
  • Monitoring Health of Civil Infrastructure (El Gamal PI, Fountain, …)
  • CLEANER (Minsker, Welge, Zaslavsky, Fountain, Pancake, …)
• CISE
  • OptIPuter (Smarr PI, Ellisman, Orcutt, Papadopoulos, Welge, …)
  • NMI, GRIDs Center
  • Data Intensive Grid Benchmarking (Baru PI, Snavely, Casanova)
• MPS
  • NVO, GriPhyN, …
CI Project Principles
• Use IT state-of-the-art, and develop advanced IT where needed, to support the “day-to-day” conduct of science (e-science)
  • (not just “hero” computations)
  • Based on a Web/Grid services-based distributed environment
  • Online databases with advanced search engines
  • Robust tools and applications, etc.
• The “two-tier” approach
  • Use best practices, including commercial tools,
  • while developing advanced technology in open source, and doing CS research
• An equal partnership
  • IT works in close conjunction with science, to create CI, i.e., the best practices, data sharing frameworks, useful and usable capabilities and tools
  • Create the “science IT infrastructure”
• Leverage from other intersecting projects
  • Much commonality in the technologies, regardless of science disciplines
  • Constantly work towards eliminating (or, at least, minimizing) the “NIH” syndrome
  • And, importantly, try not to reinvent what industry already knows how to do…
Important Focus Areas / Topics
• Security
  • Authentication, access control, controls for data publication…
• Grid middleware
  • WSRF implementations, architecting “core” services (e.g. for metadata management, versioning, …)
• Data integration and ontologies
  • Data interoperability, schema and semantic integration
• Workflow systems
  • “system-level” and science workflows (ingestion and analysis)
• Sensor network and sensor data management
  • Extensible, scalable, autonomic software; intelligent sensor management
• Data mining
  • Online analysis, large-scale data, novel algorithms, advanced triggering and notification
• Visualization
  • Large-scale, multi-model (data viz, GIS, info viz)
Example: GEON “Software Stack”
[Diagram: layered software stack, top to bottom]
• Portal (myGEON) and other service “consumers”
• Service interfaces
• Registration Services; GEONsearch; Data Integration Services; GIS Mapping Services; GEONworkbench; Computational and Modeling Services
• Core Grid Services: Data, Metadata, Indexing, Logging, Other Systems Services
• “Physical” Grid: RedHat Linux, ROCKS, OGSI, Internet, I2, OptIPuter (planned)
Antelope WSRF Extensions
Courtesy: Tony Fountain, SDSC and LOOKING project
[Diagram. Labeled components:]
• Services Repository (name, definition, others)
• Proxy Repository (certs, username, password, others)
• Portal: request params, proxy cert; SOAP request (SOAP header, SOAP body) over SOAP/HTTP
• WSRF: Lookup Service, Authentication & Authorization, Service Invoker
• Antelope Web Services: ORB Manager, Database operator, ORB commander, Data Analyzer, ORB Monitor, Event Coordinator, Services Subscriber, Other Services
• WSResource objects (several)
• Antelope: Executive Module, Field Interface Module, Ring Buffer, Databases, field digitizers
• ORB Operations: Orb Import, Orb Export, Processing, Archiving
CI Organizational Issues
• How to foster development of common infrastructure (based upon science needs/input), across multiple science domains
  • Not just at hardware level (e.g. supercomputers, high-speed networks) or OS and system services level
  • But, at the database, data integration, data mining levels
• How to deal with the continuum of activities from basic CS research to production IT systems
• NLADR – created with above issues in mind
  • Prototype for a CI organization
NLADR: National Lab for Advanced Data Research
• Joint activity between SDSC and NCSA, started October 1, 2004
• Formed based on NSF’s requirement that SDSC and NCSA collaborate on CI activities
• Collaborative R&D activity focused on advanced data technologies
  • Guided by real applications from science communities
  • …to assemble expertise and a “knowledge base” of data technologies
  • And, also develop a broad data architecture framework
  • …within which to develop, integrate, test, and benchmark data-related technologies
  • …in the context of national-scale physical infrastructure
NLADR Services Architecture
[Diagram: layered architecture, top to bottom]
• Applications: NSF – LEAD, GEON, LTERGrid, CLEANER, LOOKING; NIH/NCRR – BIRN; NASA – Space & Earth Sciences; Strategic Industrial Partners – …
• nladrSearch; DataWorkbench
• NLADR Query, Analysis, and Visualization Services: Data Registration and Indexing; Database Federation & Integration; Workflow Authoring, Execution; Data Analysis and Mining; Data and Information Visualization; Collaboration; Benchmarking
• NLADR Data Management Services: management and archiving of large simulation outputs, streaming data, databases, data collections
• Grid and Web Middleware (Globus/WSRF/WebServices/J2EE)
• Node Operating Systems (Linux, …)
• Internet2, LambdaGrids
• SDSC/NCSA testbed, OptIPuter
Some Core IT Areas
• Data integration and ontologies
• Data interoperability, schema and semantic integration
• Scientific Workflows
Schema Integration (“registering” local schemas to a global schema)
Sources: Arizona, Colorado, Utah, Nevada, Wyoming, New Mexico, Montana E., Montana West, Idaho
[Diagram: each source's local schema is registered to the integration schema]
• Local column names shown in the source schemas: ABBREV, NAME, TYPE, FMATN, PERIOD, TIME_UNIT, Formation, Age, Composition, Fabric, Texture, …
• Integration schema: FORMATION, AGE, LITHOLOGY
• Example record: FORMATION = Livingston formation; AGE = Tertiary-Cretaceous; LITHOLOGY = andesitic sandstone
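The registration idea on this slide can be sketched as a per-source table that maps local column names onto the global schema. This is only a minimal sketch: the source labels (`source_a`, ...), the pairing of a local column with a particular global field, and the function names are assumptions for illustration; the slide does not show which state uses which column.

```python
# Global (integration) schema fields named on the slide.
GLOBAL_FIELDS = ("FORMATION", "AGE", "LITHOLOGY")

# Hypothetical registrations: local column name -> global field.
REGISTRY = {
    "source_a": {"ABBREV": "FORMATION", "PERIOD": "AGE"},
    "source_b": {"NAME": "FORMATION", "TIME_UNIT": "AGE"},
    "source_c": {"FMATN": "FORMATION", "PERIOD": "AGE"},
}


def to_global(source, record):
    """Rename a local record's columns to the global schema.

    Unregistered local columns are dropped; global fields that the
    source does not provide are set to None.
    """
    mapping = REGISTRY[source]
    out = {field: None for field in GLOBAL_FIELDS}
    for local_name, value in record.items():
        global_name = mapping.get(local_name)
        if global_name is not None:
            out[global_name] = value
    return out


def query_all(records_by_source, field, value):
    """Run one global-schema query over every registered source."""
    hits = []
    for source, records in records_by_source.items():
        for record in records:
            unified = to_global(source, record)
            if unified[field] == value:
                hits.append((source, unified))
    return hits


data = {
    "source_a": [{"ABBREV": "Livingston formation", "PERIOD": "Tertiary-Cretaceous"}],
    "source_c": [{"FMATN": "Livingston formation", "PERIOD": "Tertiary-Cretaceous"}],
}
matches = query_all(data, "FORMATION", "Livingston formation")
```

Once every source is registered, a single query phrased against FORMATION, AGE, or LITHOLOGY reaches all of them without the user knowing the local column names.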
Multihierarchical Rock Classification “Ontology”
(Taxonomies) for “Thematic Queries” (GSC)
Genesis
Fabric
Composition
Texture
Ontology-Enabled Application Example: Geologic Map Integration
[Map figures: Nevada]
• Query without age ontology: Show formations where AGE = ‘Paleozoic’
• Query with age ontology (domain knowledge): Show formations where AGE = ‘Paleozoic’
• +/- a few hundred million years
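The difference between the two queries can be sketched as query expansion over an age hierarchy. This is a minimal sketch, not the mediator used in the project: the formation records are made up for illustration, and only a small fragment of the geologic time scale is encoded.

```python
# Fragment of the geologic time scale: parent age -> narrower ages.
AGE_CHILDREN = {
    "Paleozoic": ["Cambrian", "Ordovician", "Silurian",
                  "Devonian", "Carboniferous", "Permian"],
    "Carboniferous": ["Mississippian", "Pennsylvanian"],
}

# Hypothetical map units: (formation id, age label used by the source).
FORMATIONS = [
    ("F1", "Paleozoic"),
    ("F2", "Devonian"),
    ("F3", "Pennsylvanian"),
    ("F4", "Cretaceous"),
]


def expand(term):
    """Return the term together with all of its descendants in the hierarchy."""
    terms = {term}
    for child in AGE_CHILDREN.get(term, []):
        terms |= expand(child)
    return terms


def select(age, use_ontology):
    """Show formations where AGE = age, with or without the age ontology."""
    wanted = expand(age) if use_ontology else {age}
    return [name for name, label in FORMATIONS if label in wanted]


without_ontology = select("Paleozoic", use_ontology=False)
with_ontology = select("Paleozoic", use_ontology=True)
```

Without the ontology only units labeled literally 'Paleozoic' match; with it, units labeled by any period or sub-period inside the Paleozoic are returned as well.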
Different views on State Geological Maps
Sedimentary Rocks: BGS Ontology
Sedimentary Rocks: GSC Ontology
Example: Domain Knowledge to “glue” SYNAPSE & NCMIR Data
Purkinje cells and Pyramidal cells have dendrites
that have higher-order branches that contain spines.
Dendritic spines are ion (calcium) regulating components.
Spines have ion binding proteins. Neurotransmission
involves ionic activity (release). Ion-binding proteins
control ion activity (propagation) in a cell. Ion-regulating
components of cells affect ionic activity (release).
• domain expert knowledge
• formalized as domain map/ontology
• Made usable for the system using Description Logic
“Semantic Source Browsing”:
Domain Maps/Ontologies (left) & conceptually linked data (right)
A Semantic Mediation Result View
Source Contextualization through Ontology Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...
• sources can register new concepts at the mediator ...
• increase your data usability
What is a Scientific Workflow (SWF)?
• Aims:
  • automate a scientist’s repetitive data management and analysis tasks
• typical phases:
  • data access, scheduling, generation, transformation, aggregation, analysis, mining, visualization
• design, test, share, deploy, execute, reuse, … SWFs
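A few of the phases listed above (data access, transformation, aggregation) can be sketched as a chain of reusable steps. This is a minimal sketch in plain Python and does not use Kepler or any workflow engine's API; the step names and the sample readings are assumptions for illustration.

```python
from functools import reduce
from statistics import mean


def access():
    """Data access: stand-in for reading raw values from a source."""
    return [" 12.5", "13.1 ", "bad", "11.9"]


def transform(rows):
    """Transformation: parse numbers and skip rows that fail to parse."""
    values = []
    for row in rows:
        try:
            values.append(float(row))
        except ValueError:
            continue
    return values


def aggregate(values):
    """Aggregation: reduce the cleaned values to summary statistics."""
    return {"n": len(values), "mean": mean(values)}


def run_workflow(source, steps):
    """Execute the steps in order, feeding each step's output to the next."""
    return reduce(lambda data, step: step(data), steps, source())


result = run_workflow(access, [transform, aggregate])
```

Because each step is an ordinary function with one input and one output, the same steps can be tested, shared, and reused in other workflows.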
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
KEPLER/CSP: Contributors, Sponsors, Projects
(or loosely coupled Communicating Sequential Persons ;-)
Ilkay Altintas SDM, Resurgence
Kim Baldridge Resurgence, NMI
Chad Berkley SEEK
Shawn Bowers SEEK
Terence Critchlow SDM
Tobin Fricke ROADNet
Jeffrey Grethe BIRN
Christopher H. Brooks Ptolemy II
Zhengang Cheng SDM
Dan Higgins SEEK
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs EOL
Edward A. Lee Ptolemy II
Kai Lin GEON
Bertram Ludaescher SDM, SEEK, GEON, BIRN, ROADNet
Mark Miller EOL
Steve Mock NMI
Steve Neuendorffer Ptolemy II
Jing Tao SEEK
Mladen Vouk SDM
Xiaowen Xin SDM
Yang Zhao Ptolemy II
Bing Zhu SEEK
•••
Ptolemy II
www.kepler-project.org
Scientific Workflows as a Melting Pot:
Example: The Kepler SWF System
• A grass-roots project
  • collaboration at the level of developers
• Intra-project links
  • e.g. in SEEK: AMS, SMS, EcoGrid
• Inter-project links
  • SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming we hope …), UK eScience myGrid, …
• Inter-technology links
  • Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, …
• Interdisciplinary links
  • CS, IT, domain sciences, …
Promoter Identification Workflow in KEPLER
Web Services → Actors (WS Harvester)
[Screenshot with steps numbered 1 to 4]
• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components
Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)
Some Recent Actor Additions
in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
Blurring Design (ToDo) and Execution
Towards Real-time Analysis Pipelines:
Combining Simulations, Models, and Observations
A Briefing On Data Mining to the NSF Planning Meeting
Discussion Group on Cyberinfrastructure For
Environmental Observatories
December 6 & 7, Arlington, VA
Michael Welge
University Of Illinois/NCSA
Modern Discovery and Problem Solving
• Team-oriented and collaborative
• Information-based, decision focused
• Requires large-scale data fusion and analysis
• All data is not under user’s control
• Geographically distributed experts
• Geographically distributed data and applications
• Multiple stakeholders – multiple objectives
Enabling Scientist
Scientists, Engineers, Decision Makers,
Policy Makers, Media and Citizens
Engaging in discovery, analysis, discussion, deliberation,
decisions, policy formulation and communication
Collaboration Framework facilitates Idea and Knowledge Sharing,
eLearning and Multi-Objective Decision Support Processes
Analysis Framework facilitates Data and Model Discovery,
Exploration, and Analysis; via the Collaboration Framework
Data Management Framework builds logical maps of distributed,
heterogeneous information resources (data, models, tools, etc.)
and facilitates their use via the Analysis and Collaboration Frameworks
Physical Infrastructure
Data Streams – large number of applications
• Sensor networks
• Massive Simulation data sets (stored but random access is too expensive)
• Monitoring & surveillance: video streams
• Network monitoring and traffic engineering
• Text based systems
• RFID tags
• Web logs and Web page click streams
• Credit card transaction flows
• Telecommunication calling records
• Engineering & industrial processes: power supply & manufacturing
Support For Large Data Driven Problems
• Streaming Data
  • Continuous, unbounded, rapid, time-varying
  • Huge volumes of continuous data, possibly infinite
  • Unpredictable arrival
  • Fast changing and requires real-time response
  • Random access is expensive so an application can only have one look at the data
  • May require methods to detect rare events
• Large Static Data
  • Databases involving many terabytes can exceed reasonable processing capacity
  • Thousands of files create problems of management and version control
  • Thousands of fields create problems with model building
  • May require auxiliary models to support data quality issues
  • May require methods to detect rare events
  • Distributed data store necessary for some application domains
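The "one look at the data" constraint and the need to detect rare events can be sketched with a single-pass detector that keeps only running statistics. This is a minimal sketch using Welford's online mean/variance update and a z-score threshold; the threshold, the warm-up length, and the sample stream are assumptions for illustration, not methods named in the talk.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def rare_events(stream, threshold=3.0, warmup=10):
    """Yield values far from the running mean; each value is seen once."""
    stats = RunningStats()
    for x in stream:
        if (stats.n >= warmup and stats.std > 0
                and abs(x - stats.mean) > threshold * stats.std):
            yield x
        # Flagged values are still folded into the statistics afterwards.
        stats.update(x)


stream = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [50.0] + [10.0]
flagged = list(rare_events(stream))
```

The detector never stores or revisits past values, so it works on unbounded streams; a production system would likely also need to handle drift, which this sketch ignores.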
Managing and Mining Data Streams
Event Federation I
[Diagram components: Data Sources (1, 2, …, N); Parse and Compose; Type Info; Event Collector; Persistence; Buffering; Event Interface; EQL; Stream Clients]
• Connect with data sources.
• Parse source data to form (composite) events according to type definitions.
• Collect and stage events for retrieval.
Event Federation 2
EventWorks
• Monitors are event expression recognition agents.
  • Recognize Event
  • Evaluate Conditions
  • Act
• EQL (Event Query Language) implements a compositional semantics for event expressions.
  • Composite events are “first order” events.
  • Monitors can monitor monitors.
• Clock events are part of the language implementation.
  • Easy to write queries with temporal constraints.
[Diagram components: Streams; Event Router; Monitor 1 … Monitor N (EQL); New Events]
Monitors are generated by users or programmatically.
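The recognize / evaluate / act cycle, and the idea that monitors can monitor monitors, can be sketched as follows. This is a minimal sketch in plain Python and is not EQL, whose syntax the slides do not show; the class names, event fields, and thresholds are assumptions for illustration.

```python
from collections import deque


class Monitor:
    """Recognize an event type, evaluate a condition, then act."""

    def __init__(self, recognizes, condition, action):
        self.recognizes = recognizes  # event type this monitor listens for
        self.condition = condition    # predicate over the event
        self.action = action          # returns a new event (or None)

    def handle(self, event):
        if event["type"] == self.recognizes and self.condition(event):
            return self.action(event)
        return None


class EventRouter:
    """Deliver every event to every monitor; new events are routed too."""

    def __init__(self, monitors):
        self.monitors = monitors

    def run(self, events):
        queue = deque(events)
        log = []
        while queue:
            event = queue.popleft()
            log.append(event)
            for monitor in self.monitors:
                new_event = monitor.handle(event)
                if new_event is not None:
                    queue.append(new_event)  # composite events re-enter the router
        return log


# Monitor 1 turns raw readings into "hot" events.
hot = Monitor("temperature", lambda e: e["value"] > 30,
              lambda e: {"type": "hot", "value": e["value"]})
# Monitor 2 watches the events produced by monitor 1.
alarm = Monitor("hot", lambda e: e["value"] > 35,
                lambda e: {"type": "alarm", "value": e["value"]})

readings = [{"type": "temperature", "value": v} for v in (25, 32, 40)]
log = EventRouter([hot, alarm]).run(readings)
```

Because derived events are routed exactly like source events, the second monitor needs no special knowledge of the first; clock events and temporal constraints are left out of this sketch.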
D2K : A Framework For Building Data-Driven Apps –
Persistent Stream Data Analytics Foundation
Designed for Building and Maintaining Complex Persistent and Stream
Data-Driven Applications
http://alg.ncsa.uiuc.edu
D2K/T2K/I2K: Data, Text, and Image Analysis
http://alg.ncsa.uiuc.edu
LOOKING: Stream Data Analytics/Information Visualization scientific “dashboard”
• Online Stream Query Engine: Uses novel methods to do real-time stream data analysis.
• Online Stream Classification: Adaptable to the changes and evolution of data streams.
• Online Frequent Pattern Mining: Discovers association and correlation rules in data stream environment.
• Online Clustering of Data Streams: Detects outliers and finds evolution of clusters in data streams.
CI Issues Architecture – NEON/CLEANER
Real-time Visualization of RFID people location
sensors: Supercomputing IntelliBadge™
Atmospheric Science: Analytic Feature
Extraction Scientific Visualization Techniques
LOOKING: Scientist Analytical/Spatial-temporal
Visualization Techniques
LOOKING/Optiputer/Planetary Collaboratory
U of W
1024 Processor Altix
3 TB Shared Memory
>300 TeraBytes Disk
8 X 8 Processor 4 Pipe, 16 GB Memory each, Prisms
coupled with InfiniBand for On-demand, Interactive
NLADR Tier 1 Architecture
… Data-Driven Science
• Collaboration
  • Information Gathering (experiments, simulation, observation – calendar of upcoming activities)
• Data Management
  • Generation and Publishing of Data (experiments, simulation, or observation)
  • Persistent Data Stores (Distributed Data Management)
  • Stream Data Management (Event Management)
• Detection
  • Mining of new types of data, such as large static data stores (>>1TB), streams, networks, ...
  • Behavior Characterization (atypical, surprising, normal)
• Discovery
  • Hypothesis Generation
• Collaboration
  • Focusing results for testing and validation