The HDF Group

NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data
Mike Folk
The HDF NARA Project
PDES, Inc. Offsite Meeting
September 24-29, 2006
Acknowledgement
This report is based upon work supported by the
National Archives and Records Administration (NARA)
through the grant NARA NSF 0202 GPG. Any opinions,
findings, and conclusions or recommendations expressed in
this material are those of the author and do not necessarily
reflect the views of the NARA.
PDES, Inc. Offsite Sept 2006
Participants
Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
Mark Conrad and Bob Chadduck – NARA
David Price – EuroSTEP
Keith Hunten – Lockheed-Martin
Steve Cooper and Denny Moore – Electric Boat
Others
1. What is HDF5?
HDF5 is
• A file format for managing any kind of data
• A software system to manage data in the format
• Suited especially to large volume or complex data
• Suited for every size and type of system
• Open file format, open software
Definitions
• “HDF” – Hierarchical Data Format
• Originated in 1988
• NCSA at University of Illinois at Urbana-Champaign
• “HDF5”
• Successor to HDF, introduced in 1998
An HDF5 file is a container into which you can put your data objects.
lat | lon | temp
----|-----|-----
12 | 23 | 3.1
15 | 24 | 4.2
17 | 21 | 3.6
HDF5 data model
• HDF5 file – container for data objects
• Primary Objects
• Groups
• Datasets
• Additional ways to organize data
• Attributes for metadata
• Sharable objects
• Storage and access properties
HDF “groups” for organizing objects in files
[Figure: a file with a root group “/” and a group “/foo”, holding objects such as a 3-D array, a 2-D array, a table (lat | lon | temp), a palette, and two raster images]
HDF5 “dataset” for holding the data
Metadata:
• Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Attributes: time = 32.4, pressure = 987, temp = 56
• Storage info: chunked, compressed
Data: the array elements that the metadata describes
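The dataset described above can be mimicked in a few lines of plain Python. This is a minimal sketch, not the HDF5 API: the dictionary layout and the names `dataset`, `raw_size`, and `element_offset` are illustrative assumptions. It shows how the dataspace (rank 3, dimensions 4 x 5 x 7) and the datatype (IEEE 32-bit float) together fix the size and layout of the raw data, while the attributes travel along as small named values.

```python
import struct
from math import prod

# Metadata, using the values shown on the slide.
dataset = {
    "dataspace": {"rank": 3, "dims": (4, 5, 7)},
    "datatype": "<f",  # struct code for a little-endian IEEE 32-bit float
    "attributes": {"time": 32.4, "pressure": 987, "temp": 56},
}

n_elements = prod(dataset["dataspace"]["dims"])      # 4 * 5 * 7 = 140
element_size = struct.calcsize(dataset["datatype"])  # 4 bytes

# Data: a flat buffer of zeros standing in for the real array elements.
raw = struct.pack(f"<{n_elements}f", *([0.0] * n_elements))
raw_size = len(raw)  # 140 elements * 4 bytes


def element_offset(i, j, k, dims=dataset["dataspace"]["dims"]):
    """Byte offset of element (i, j, k) in row-major (C) order."""
    _, d2, d3 = dims
    return ((i * d2 + j) * d3 + k) * element_size
```

The point of the sketch is that the metadata alone is enough to locate any element in the raw bytes, which is why a dataset keeps the two together.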
Datatypes (array elements)
• Datatype – how to interpret a data element
• Two classes: atomic and compound
Datatypes
• HDF5 atomic types
• normal integer & float
• user-definable (e.g. 13-bit integer)
• fixed length and variable length multiples (e.g. strings)
• references to objects/dataset regions
• enumeration - names mapped to integers
• array
• HDF5 compound types
• Records with fields – comparable to C structs
• Members can be atomic or compound types
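A compound type can be pictured with Python's standard `struct` module, which packs records much as a C struct would. The sketch below is an analogy rather than HDF5 code: the record layout (two 32-bit integers and one 32-bit float) and the names `RECORD` and `read_temp` are assumptions, and the sample rows reuse the lat/lon/temp table from the earlier slide.

```python
import struct

# Hypothetical record: (lat: int32, lon: int32, temp: float32), little-endian.
RECORD = struct.Struct("<iif")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]

# Pack every record back to back, like an array whose elements are compound.
buffer = b"".join(RECORD.pack(*row) for row in rows)


def read_temp(buf, index):
    """Read only the "temp" field of one record, skipping the other fields."""
    offset = index * RECORD.size + 8  # 8 = the two 4-byte integers before temp
    return struct.unpack_from("<f", buf, offset)[0]


# Decode whole records again to confirm the layout.
decoded = [RECORD.unpack_from(buffer, i * RECORD.size) for i in range(len(rows))]
```

Because every record has the same fixed size, a single field of a single record can be read without touching the rest, which is one reason record-style types suit large tables.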
“Groups”
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes
[Figure: a group tree rooted at “/” with members named tom, dick, harry, a, b, and c]
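The directory analogy can be made concrete with a toy tree built from nested dictionaries. This is only a sketch of the idea, not the HDF5 library: which objects sit under which group is assumed here (the figure does not survive in the transcript), and `lookup` and `walk` are hypothetical helper names.

```python
# A toy group tree: groups are dicts, anything else is a leaf object.
# The placement of a, b, and c under tom and dick is an assumption.
root = {
    "tom": {"a": "dataset a", "b": "dataset b"},
    "dick": {"c": "dataset c"},
    "harry": {},
}


def lookup(group, path):
    """Resolve an absolute path such as "/tom/a", one component at a time."""
    node = group
    for name in path.strip("/").split("/"):
        if name:  # an empty name means the path was just "/"
            node = node[name]
    return node


def walk(group, prefix=""):
    """Yield the absolute path of every object below a group."""
    for name, child in group.items():
        path = f"{prefix}/{name}"
        yield path
        if isinstance(child, dict):
            yield from walk(child, path)
```

For example, `lookup(root, "/tom/a")` follows the path from the root group the same way a shell resolves a UNIX path, and `list(walk(root))` enumerates every object in the file.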
Special Storage Options
• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: arrays can be extended in any direction
• split file: metadata in one file, raw data in another (for dataset “Fred”: File A holds the metadata for Fred, File B holds the data for Fred)
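The interplay of chunking, compression, and extendability can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not HDF5 itself: the 8 x 8 array, the 4 x 4 chunk size, the `chunks` dictionary, and the helper names are all invented for the example, and `zlib` stands in for a compression filter.

```python
import struct
import zlib

ROWS, COLS = 8, 8  # full array shape (assumed for the example)
CHUNK = 4          # chunks are CHUNK x CHUNK tiles


def value(r, c):
    """Synthetic array contents: element (r, c) holds r * COLS + c."""
    return float(r * COLS + c)


# Write: store each tile as its own zlib-compressed block, keyed by tile index.
chunks = {}
for cr in range(ROWS // CHUNK):
    for cc in range(COLS // CHUNK):
        tile = [value(cr * CHUNK + r, cc * CHUNK + c)
                for r in range(CHUNK) for c in range(CHUNK)]
        chunks[(cr, cc)] = zlib.compress(struct.pack(f"<{CHUNK * CHUNK}d", *tile))


def read_element(r, c):
    """Decompress only the one chunk that holds element (r, c)."""
    packed = zlib.decompress(chunks[(r // CHUNK, c // CHUNK)])
    tile = struct.unpack(f"<{CHUNK * CHUNK}d", packed)
    return tile[(r % CHUNK) * CHUNK + (c % CHUNK)]


def append_rows():
    """Extend the array by one row of tiles without rewriting existing chunks."""
    new_cr = max(cr for cr, _ in chunks) + 1
    zeros = struct.pack(f"<{CHUNK * CHUNK}d", *([0.0] * (CHUNK * CHUNK)))
    for cc in range(COLS // CHUNK):
        chunks[(new_cr, cc)] = zlib.compress(zeros)
```

Reading one element touches a single chunk rather than the whole array, and growing the array only adds chunks, which is the intuition behind "better subsetting access time" and "extendable" above.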
Mesh Example, in HDFView
HDF5 Software
Tools & Applications
HDF I/O Library
HDF File
Features of library
• Ability to create and access complex data structures
• Fast, flexible I/O
• Data transformation and filtering during I/O
• Flexible API for power users
• Compatibility with common data models
• Able to represent all common data structures
• Supports key language models – C, Fortran, Java, etc.
Other info
• Library and tools run almost anywhere
• Other software from THG
• Java viewer
• Command-line utilities
• Other software
• Commercial (IDL, Matlab, Labview, etc.)
• Community (EOS, ASCI, etc.)
• Integration with other software (SRB, databases, etc.)
Making HDF useful for your application
• There are many ways to organize and access data in
HDF5
• How do we apply these capabilities to a particular
domain, such as product data?
• We have to decide how we will organize and access
our data in a way that best addresses our needs.
• And create data models, APIs and tools as appropriate
to support our applications.
• Or adapt existing data models, APIs and tools as
appropriate to support our applications.
Sample uses of HDF
1. NASA Earth Observing System (EOS)
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR
• Aura: TES, HIRDLS, MLS, OMI
2. Advanced Simulation & Computing (ASC)
Question: How do we maintain a nuclear stockpile in the absence of testing?
Answer: Very large simulations
ASC Data requirements
• Large datasets (> a terabyte)
• Fast I/O on massive parallel systems
• Complex data and extensive metadata
• Availability on leading edge systems
3. Bioinformatics
caacaagccaaaactcgtacaa
cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
acaatcgttgacattgcgacct
aatacagcccagcaagcagaat
Managing genomic data
DNA sequencing workflows are complex
• Diverse formats
• Highly redundant data
• Multiple levels of information
• Complex associations
• Repeated file processing
• Non-scalable storage
• Lack of persistence
HDF5 as binary exchange format for bioinformatics
4. Flight test data
Boeing flight test
HDF role in the Software Stack
Layers, from top to bottom:
• Apps: simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
• Common application-specific data models, each with an app-specific API or GUI: BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS (user communities: LANL; LLNL, SNL; Grids; COTS; NASA)
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): Stdio, Split Files, MPI I/O, Custom, Stream
• HDF5 format
• Storage: file (Stdio); split metadata and raw data files (Split Files); file on parallel file system (MPI I/O); user-defined device (Custom); across the network or to/from another application or library (Stream)
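The role of the virtual file layer, where the same library code runs over interchangeable I/O drivers, can be sketched as follows. This is a conceptual illustration only: the classes `SingleFileDriver` and `SplitDriver`, the `write(kind, payload)` method, and `save_dataset` are hypothetical names, and in-memory byte streams stand in for real files.

```python
import io


class SingleFileDriver:
    """Everything goes into one byte stream (a stand-in for a plain file)."""

    def __init__(self):
        self.stream = io.BytesIO()

    def write(self, kind, payload):
        self.stream.write(payload)

    def sizes(self):
        return {"file": len(self.stream.getvalue())}


class SplitDriver:
    """Metadata and raw data go to two separate streams."""

    def __init__(self):
        self.streams = {"metadata": io.BytesIO(), "raw": io.BytesIO()}

    def write(self, kind, payload):
        self.streams[kind].write(payload)

    def sizes(self):
        return {kind: len(s.getvalue()) for kind, s in self.streams.items()}


def save_dataset(driver, name, values):
    """Library-level code: identical no matter which driver is plugged in."""
    driver.write("metadata", f"{name}:{len(values)}\n".encode("ascii"))
    driver.write("raw", bytes(values))
```

Calling `save_dataset` with either driver produces the same logical content; only the placement of the bytes differs, which mirrors how one format can sit on a single file, split files, or another storage target.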
2. Why is there interest in HDF5 for product data?
(Courtesy of David Price, EuroSTEP)
Needs
• STEP and related models exist using EXPRESS
• ASCII, XML STEP formats defined, software
developed
• But ASCII/XML don’t adapt well for highly
voluminous, complex data
• Finite element analysis
• Computational fluid dynamics
• Heterogeneous product data
EuroSTEP project
• VIVACE: “Value Improvement through a Virtual
Aeronautical Collaborative Enterprise”
• Deliverable: EXPRESS-driven Large Volume Binary
Data Representation
Survey of State of the Art
• Candidates
• ASN.1 : Abstract Syntax Notation 1
• HDF5 : Hierarchical Data Format
• XML/Binary
• CGNS : CFD General Notation System
• SDAI implementation by LKSoft
• Found HDF5 most suitable for very large scientific
datasets and complex relationships
Goal: Create open-source toolkit mapping EXPRESS to HDF5
The software stack, with the STEP additions, from top to bottom:
• Apps: simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models. Added: Product model Applications
• Common application-specific data models. Added: STEP data models
• Appl-specific APIs: STEP-HDF5 (added), BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS (user communities: LANL; LLNL, SNL; Grids; COTS; NASA)
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): Stdio, Split Files, MPI I/O, Custom, Stream
• HDF5 format
• Storage: file (Stdio); split metadata and raw data files (Split Files); file on parallel file system (MPI I/O); user-defined device (Custom); across the network or to/from another application or library (Stream)
NARA-sponsored work
NCSA-THG NARA Research
• Investigate the viability of scientific data formats,
such as HDF5, for long-term preservation of
engineering data in the federal archives
Heterogeneous data aggregation, with HDF5
• Goal:
Using NARA’s TWR collection, investigate the
possibilities and limitations of using HDF5 as a
container for archiving heterogeneous
collections of records, with special attention to
STEP data.
Activities
• Use files, datatypes, structures in NARA TWR
collection – STEP files, photos, schematics, etc.
• Map these to HDF5 objects and structures,
exploiting features of HDF5
• Assess benefits and costs in terms of storage
efficiency and accessibility
• Investigate use of HDF5 as container for collection
Relationship with EuroSTEP, Electric Boat, et al.
• Working together to develop mappings from
EXPRESS to HDF5
• Sharing data for testing
• Periodic meetings to share information and
coordinate research
• Some involvement with standardization
Investigating I/O efficiency and size
• Explore different datatypes and storage options for B-spline surface models (later: finite element models)
• Two types of data – B-splines themselves and Cartesian points
• Variables
• Different HDF5 datatypes
• Dataset compression
• Use of extra indexes in HDF5 for fast access
Some results
• Small files
• HDF5 not appreciably better than STEP, sometimes worse
• Large files
• Compression always made HDF5 files smaller
• Even without compression, HDF5 storage better
• Indexing approach also tended to save space
• Lessons
• HDF5 can provide very efficient storage for Cartesian points
• Choice of data types and data storage is important
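The kind of size comparison behind these results can be illustrated with a small, self-contained experiment. This is a sketch under loud assumptions and does not reproduce the project's measurements: the points are synthetic, the STEP-style text line is a simplified stand-in for a real Part 21 file, packed 64-bit floats stand in for an HDF5 dataset, and `zlib` stands in for HDF5 dataset compression.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so the run is repeatable
# Synthetic points on a coarse grid; real model data will behave differently.
points = [(random.randrange(100) * 0.5,
           random.randrange(100) * 0.5,
           random.randrange(100) * 0.5) for _ in range(10_000)]

# Text: one simplified STEP-style instance line per point.
text = "".join(
    f"#{i + 1}=CARTESIAN_POINT('',({x!r},{y!r},{z!r}));\n"
    for i, (x, y, z) in enumerate(points)
).encode("ascii")

# Binary: three 64-bit floats per point, packed back to back (24 bytes each).
binary = b"".join(struct.pack("<3d", *p) for p in points)

# Binary plus a general-purpose compressor.
compressed = zlib.compress(binary)

sizes = {"text": len(text), "binary": len(binary), "compressed": len(compressed)}

# The binary form is lossless: every coordinate round-trips exactly.
restored = [struct.unpack_from("<3d", binary, 24 * i) for i in range(len(points))]
```

On this synthetic input the packed form costs a fixed 24 bytes per point, which is less than the text line for the same point. How much compression helps depends heavily on the data and on the datatype chosen, which is consistent with the lesson that the choice of data types and storage matters.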
HDF5 as container
HDFView Demo
Thank you
HDF Information
• HDF Information Center
• http://hdfgroup.org/
• HDF Help email address
• [email protected]/
• HDF users mailing list
• [email protected]/