
Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov (contains an extensive publication list)
Scientific Data Management Center
Participating Institutions

Center PI: Arie Shoshani (LBNL)

DOE laboratory co-PIs:
• Bill Gropp, Rob Ross (ANL)
• Arie Shoshani, Doron Rotem (LBNL)
• Terence Critchlow, Chandrika Kamath (LLNL)
• Nagiza Samatova, Andy White (ORNL)

University co-PIs:
• Mladen Vouk (North Carolina State)
• Alok Choudhary (Northwestern)
• Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
• Calton Pu (Georgia Tech)
Phases of Scientific Exploration

Data Generation
• From large-scale simulations or experiments
• Fast data growth with computational power
• Examples:
  - HENP: 100 teraops and 10 petabytes by 2006
  - Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km);
    T42 is about 1 TB per 100-year run => a factor of ~10-20 growth
• Problems:
  - Can't dump the data to storage fast enough – waste of compute resources
  - Can't move terabytes of data over the WAN robustly – waste of scientists' time
  - Can't steer the simulation – waste of time and resources
  - Need to reorganize and transform data – large data-intensive tasks slow progress
Phases of Scientific Exploration

Data Analysis
• Analysis of large data volumes
• Can't fit all data in memory
• Problems:
  - Finding the relevant data – need efficient indexing
  - Cluster analysis – need linear scaling
  - Feature selection – need efficient high-dimensional analysis
  - Data heterogeneity – need to combine data from diverse sources
  - Streamlining analysis steps – the output of one step needs to match the input of the next
Example Data Flow in TSI

[Diagram: Terascale Supernova Initiative data flow. A highly parallel compute resource produces output of ~500x500 files, which is aggregated to ~500 files (<2 to 10+ GB each) and moved over the Logistical Network (L-Bone) to a data depot, local mass storage (14+ TB), and an archive, where data is aggregated to one file (1+ TB each). A local 44-processor data cluster, where data sits on local nodes for weeks, feeds viz software, a viz wall, and viz clients.]

Courtesy: John Blondin
Goal: Reduce the Data Management Overhead
• Efficiency
  Example: parallel I/O, indexing, matching storage structures to the application
• Effectiveness
  Example: access data by attributes rather than files, facilitate massive data movement
• New algorithms
  Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
• Enabling ad hoc exploration of data
  Example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running
Approach

Use an integrated framework (the SDM Framework) that:
• Provides a scientific workflow capability
• Supports data mining and analysis tools
• Accelerates storage of and access to data

Simplify data management tasks for the scientist:
• Hide details of the underlying parallel and indexing technology
• Permit assembly of modules using a simple graphical workflow description tool

[Diagram: the SDM Framework places three layers between the Scientific Application and Scientific Understanding: the Scientific Process Automation layer, the Data Mining & Analysis layer, and the Storage Efficient Access layer.]
Technology Details by Layer

Scientific Process Automation (SPA) layer:
  Workflow Management Tools; Web Wrapping Tools

Data Mining & Analysis (DMA) layer:
  ASPECT integration framework; data analysis tools (PCA, ICA); efficient indexing (Bitmap Index); efficient parallel visualization (pVTK)

Storage Efficient Access (SEA) layer:
  Storage Resource Manager (to HPSS); Parallel NetCDF software layer; ROMIO MPI-IO system; Parallel Virtual File System

Underlying: Hardware, OS, and MSS (HPSS)
Accomplishments:
Storage Efficient Access (SEA)

Parallel Virtual File System
• Shared-memory communication
• Enhancements and deployment

Developed Parallel netCDF (see the sketch below)
• Enables high-performance parallel I/O to netCDF datasets
• Achieves up to a 10-fold performance improvement over HDF5
  (FLASH I/O benchmark performance, 8x8x8 block sizes)

Enhanced ROMIO
• Provides MPI-IO access to PVFS
• Advanced parallel file system interfaces for more efficient access

Developed PVFS2
• Adds Myrinet GM and InfiniBand support
• Improved fault tolerance
• Asynchronous I/O
• Offered by Dell and HP for clusters

Deployed an HPSS Storage Resource Manager (SRM) with PVFS
• Automatic access of HPSS files to PVFS through the MPI-IO library
• SRM is a middleware component

[Diagram: before, processes P0-P3 wrote through serial netCDF to the parallel file system; after, they write through Parallel netCDF to the parallel file system.]
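To illustrate the access pattern the Parallel netCDF bullet describes (many processes writing one shared netCDF file collectively), here is a minimal Python sketch using the mpi4py and netCDF4 bindings rather than the center's PnetCDF C API; it assumes a netCDF4 build with parallel I/O support, and the file name, variable, and block size are illustrative.

```python
# Minimal sketch: each MPI rank writes its own slab of one shared netCDF file.
# Assumes netCDF4-python built with parallel I/O (HDF5 or PnetCDF underneath).
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

nx_local = 8                                   # illustrative block size per rank
ds = Dataset("output.nc", "w", parallel=True, comm=comm, info=MPI.Info())
ds.createDimension("x", nx_local * nprocs)
var = ds.createVariable("temperature", "f8", ("x",))

start = rank * nx_local                        # each rank writes a disjoint slab
var[start:start + nx_local] = np.full(nx_local, float(rank))
ds.close()
```

Run, for example, with `mpiexec -n 4 python write_nc.py`; all ranks contribute to a single file, so no post-run merging step is needed.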
Robust Multi-file Replication

Problem: move thousands of files robustly
• Takes many hours
• Need error recovery
• Mass storage system failures
• Network failures
Approach: use Storage Resource Managers (SRMs)

Problem: transfers are too slow
Approach:
• Use parallel streams
• Use concurrent transfers
• Use large FTP windows
• Pre-stage files from MSS
(A minimal concurrency-with-retries sketch follows below.)

[Diagram: a DataMover, running anywhere, issues SRM-COPY for thousands of files between NCAR and LBNL. One SRM gets the list of files and performs reads, staging files from the MSS to its disk cache; the other SRM performs writes to its disk cache and archives files. Files are requested one at a time with SRM-GET, and the network transfer uses GridFTP GET in pull mode.]
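The robustness and speed points above come down to transferring many files concurrently, with per-file error recovery so one failure does not sink the whole job. The toy Python sketch below shows only that pattern; transfer_one() is a hypothetical stand-in (a local copy) for the real SRM/GridFTP transfer, not the DataMover's actual interface.

```python
# Toy sketch of the DataMover pattern: many files, concurrent transfers,
# per-file retries with backoff, and a list of failures for re-queuing.
# transfer_one() stands in for the real SRM/GridFTP call (here: a local copy).
import shutil, time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def transfer_one(src: Path, dst_dir: Path, retries: int = 3) -> Path:
    for attempt in range(1, retries + 1):
        try:
            dst_dir.mkdir(parents=True, exist_ok=True)
            return Path(shutil.copy2(src, dst_dir / src.name))
        except OSError:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)           # back off before retrying

def replicate(files, dst_dir, concurrency=4):
    failed = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(transfer_one, Path(f), Path(dst_dir)): f for f in files}
        for fut in as_completed(futures):
            try:
                fut.result()
            except OSError:
                failed.append(futures[fut])    # caller can re-queue these
    return failed
```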
Accomplishments:
Data Mining and Analysis (DMA)

Developed Parallel-VTK
• Efficient 2D/3D parallel scientific visualization for NetCDF and HDF files
• Built on top of PnetCDF
[Chart: PVTK serial vs. parallel writer (80 MB) – time in seconds versus number of processors]

Developed a "region tracking" tool
• For exploring 2D/3D scientific databases
• Uses bitmap technology to identify regions based on multi-attribute conditions
[Figure: combustion region tracking]

Implemented an Independent Component Analysis (ICA) module
• Used for accurate signal separation
• Used for discovering key parameters that correlate with observed data
[Figure: El Niño signal (red) and its estimation (blue) closely match]

Developed highly effective data reduction (see the PCA sketch below)
• Achieves a 15-fold reduction with a high level of accuracy
• Uses parallel Principal Component Analysis (PCA) technology

Developed ASPECT
• A framework that supports a rich set of pluggable data analysis tools, including all the tools above
• A rich suite of statistical tools based on the R package
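The data-reduction bullet above credits parallel PCA with roughly a 15-fold reduction. As a single-node illustration of the underlying idea only (keep the leading principal components, store the factors, and reconstruct on demand), here is a NumPy sketch on synthetic low-rank data; the array sizes and retained-component count are arbitrary.

```python
# Single-node sketch of PCA-style reduction: keep k leading components,
# store the factors instead of the raw array, reconstruct on demand.
import numpy as np

rng = np.random.default_rng(0)
latent   = rng.standard_normal((1000, 10))           # synthetic low-rank structure
loadings = rng.standard_normal((10, 300))
data = latent @ loadings + 0.05 * rng.standard_normal((1000, 300))

mean = data.mean(axis=0)
U, S, Vt = np.linalg.svd(data - mean, full_matrices=False)

k = 20                                               # retained components (arbitrary)
scores, components = U[:, :k] * S[:k], Vt[:k]        # store these plus the mean
reconstructed = scores @ components + mean

ratio = data.size / (scores.size + components.size + mean.size)
rel_err = np.linalg.norm(data - reconstructed) / np.linalg.norm(data)
print(f"storage reduction ~{ratio:.1f}x, relative error {rel_err:.3f}")
```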
ASPECT Analysis Environment

Pipeline: Data Select -> Data Access -> Correlate -> Render -> Display

Example request: Sample (temp, pressure) From astro-data Where (step=101) (entropy>1000);

Data Mining & Analysis layer: Select Data (using a bitmap condition) -> Take Sample -> Run R analysis (R Analysis Tool) -> Run pVTK filter (pVTK Tool) -> Visualize scatter plot in Qt.

Storage Efficient Access layer: Get variables (var-names, ranges) via Bitmap Index Selection; Read Data / Write Data (buffer-name) through Parallel NetCDF and PVFS.

Underlying: Hardware, OS, and MSS (HPSS)

(A boolean-mask sketch of the multi-attribute selection step follows below.)
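The selection step above evaluates a multi-attribute condition (step=101 and entropy>1000) before sampling (temp, pressure). A real bitmap index, such as the center's compressed bitmaps, answers this from precomputed per-value bitmaps; the sketch below only mimics the logic with NumPy boolean masks on synthetic arrays, so the data and thresholds are made up.

```python
# Mimics the "Where (step=101) (entropy>1000)" selection with boolean masks;
# a compressed bitmap index would precompute the per-condition bitmaps.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                                        # synthetic cell count
step     = rng.integers(100, 103, n)
entropy  = rng.uniform(0.0, 2000.0, n)
temp     = rng.uniform(1e6, 1e9, n)
pressure = rng.uniform(1e3, 1e7, n)

mask = (step == 101) & (entropy > 1000.0)          # AND of per-condition bitmaps
hits = np.flatnonzero(mask)

sample_idx = rng.choice(hits, size=min(1000, hits.size), replace=False)
sample = np.column_stack((temp[sample_idx], pressure[sample_idx]))
print(f"{hits.size} matching cells, sampled {sample.shape[0]}")
```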
Accomplishments:
Scientific Process Automation (SPA)

Unique requirements of scientific workflows
• Moving large volumes between modules
  - Tightly-coupled, efficient data movement
• Specification of granularity-based iteration
  - e.g., in spatio-temporal simulations a time step is a "granule"
• Support for data transformation
  - Complex data types (including file formats, e.g. netCDF, HDF)
• Dynamic steering of the workflow by the user
  - Dynamic user examination of results

Developed a working scientific workflow system
• Automatic microarray analysis
• Uses web-wrapping tools developed by the center
• Uses the Kepler workflow engine
  - Kepler is an adaptation of the UC Berkeley tool, Ptolemy

[Screenshots: workflow steps defined graphically; workflow results presented to the user]

(A toy per-granule workflow sketch follows below.)
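To make the "granule" idea above concrete: a workflow applies a fixed chain of steps once per granule (for example, per time step), with each step's output feeding the next step's input. The toy Python sketch below shows only that wiring; the step functions are invented placeholders, not Kepler actors.

```python
# Toy linear workflow: each "granule" (a time step) flows through
# fetch -> transform -> analyze. The step functions are placeholders.
from typing import Callable, Iterable

def fetch(t: int) -> dict:                 # stand-in for a web-wrapped data source
    return {"t": t, "raw": list(range(t, t + 4))}

def transform(rec: dict) -> dict:          # stand-in for format conversion
    top = max(rec["raw"])
    rec["norm"] = [x / top for x in rec["raw"]]
    return rec

def analyze(rec: dict) -> dict:            # stand-in for an analysis step
    rec["mean"] = sum(rec["norm"]) / len(rec["norm"])
    return rec

def run_workflow(steps: Iterable[Callable], granules: Iterable[int]):
    for t in granules:                     # one pass of the whole chain per granule
        rec: object = t
        for step in steps:
            rec = step(rec)
        yield rec

for result in run_workflow([fetch, transform, analyze], range(1, 4)):
    print(result["t"], round(result["mean"], 3))
```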
GUI for setting up and running workflows
Re-applying Technology

SDM technology, developed for one application, can be effectively targeted at many other applications:

Technology                  | Initial Application | New Applications
Parallel NetCDF             | Astrophysics        | Climate
Parallel VTK                | Astrophysics        | Climate
Compressed bitmaps          | HENP                | Combustion, Astrophysics
Storage Resource Managers   | HENP                | Astrophysics
Feature Selection           | Climate             | Fusion
Scientific Workflow         | Biology             | Astrophysics (planned)
Broad Impact of the SDM Center…

Astrophysics:
High-speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for the Terascale Supernova Initiative (TSI) and FLASH simulations
Tony Mezzacappa – ORNL, John Blondin – NCSU, Mike Zingale – U of Chicago, Mike Papka – ANL
[Figure: ASCI FLASH – parallel NetCDF]

Climate:
High-speed storage technology, parallel NetCDF, and ICA technology used for climate modeling projects
Ben Santer – LLNL, John Drake – ORNL, John Michalakes – NCAR
[Figure: dimensionality reduction]

Combustion:
Compressed bitmap indexing used for fast generation of flame regions and tracking their progress over time (see the region-labeling sketch below)
Wendy Koegler, Jacqueline Chen – Sandia Lab
[Figure: region growing]
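The combustion entry above pairs a multi-attribute condition with region growing to delineate flame regions that can then be tracked across time steps. Here is a minimal sketch of that idea, assuming SciPy's connected-component labeling on a synthetic 2D field; the field names and thresholds are made up and the real system uses compressed bitmap indexes rather than dense masks.

```python
# Threshold a synthetic 2D field on a two-attribute condition, then label the
# connected regions (scipy.ndimage); per-step labels could be matched across
# time steps to track regions. Thresholds and field names are illustrative.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
temperature = ndimage.gaussian_filter(rng.standard_normal((256, 256)), sigma=8)
fuel        = ndimage.gaussian_filter(rng.standard_normal((256, 256)), sigma=8)

flame_mask = (temperature > 0.05) & (fuel > 0.0)   # multi-attribute condition

labels, nregions = ndimage.label(flame_mask)       # connected-component regions
sizes = ndimage.sum(flame_mask, labels, index=range(1, nregions + 1))
print(f"{nregions} regions; largest has {int(sizes.max())} cells")
```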
Broad Impact (cont.)

Biology:
Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data
Matt Coleman – LLNL
[Figure: building a scientific workflow]

High Energy Physics:
Compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS
Doug Olson – LBNL, Eric Hjort – LBNL, Jerome Lauret – BNL
[Figure: dynamic monitoring of HPSS file transfers]

Fusion:
A combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak
Keith Burrell – General Atomics
[Figure: identifying key parameters for the DIII-D tokamak]
Goals for Years 4-5

Fully develop the integrated SDM framework
• Implement the three-layer framework on the SDM center facility
• Provide a way to select only the components needed
• Develop self-guiding web pages on the use of SDM components
• Use existing successful examples as guides

Generalize components for reuse
• Develop general interfaces between components in the layers
  - Support loosely-coupled WSDL interfaces
  - Support tightly-coupled components for efficient dataflow

Integrate operation of components in the framework
• Hide details from the user – automate parallel access and indexing
• Develop a reusable library of components that can be selected for use in the workflow system