Presentation - HDF-EOS Tools and Information Center

Substituting HDF5 tools with Python/H5py scripts
Daniel Kahn
Science Systems and Applications Inc.
HDF and HDF-EOS Workshop XIV, 28 Sep. 2010
1 of 14
What are HDF5 tools?
HDF5 tools are command line programs distributed with the HDF5 library.
They allow users to manipulate HDF5 files.
h5dump: Dump HDF5 data as ASCII text.
h5import: Convert non-HDF5 data to HDF5.
h5diff: Show differences between HDF5 files.
h5copy: Copy objects between HDF5 files.
h5repack: Copy entire file while changing storage properties of
HDF5 objects.
h5edit: (proposed) add attributes to HDF5 objects.
HDF5 tools have a long history as the first (and for a long time the only) way to
manipulate HDF5 files conveniently, i.e. without writing a C or Java
program and without buying expensive commercial software such as IDL or
Matlab.
2 of 14
The tools can be characterized as having three parts:
Text Processing: Evaluate command arguments, process input text
files, match group names.
Tree Walking: Search the HDF5 file hierarchy for objects by name.
Object Level Operations: Operate on the objects (copy, diff, repack,
etc.).
The tools are simple to use and convenient as they are
distributed with the HDF5 library.
3 of 14
Disadvantages of HDF5 tools:
The command line arguments limit tool capability.
It becomes difficult to add new features with a command line syntax that
is both readable and compatible with the legacy syntax.
Development time for designing and implementing new features is long
(weeks to months).
Use cases must be evaluated, a solution proposed in an RFC, the
proposal implemented, and the new code distributed in the next release.
4 of 14
Here's an example from HDF documentation:
h5copy -v -i "test1.h5" -o "test1.out.h5" -s "/array" -d "/array"
But suppose we had multiple datasets named arrayNNN, where each
N is a digit 0–9. We'd like to write something like:
h5copy -v -i "test1.h5" -o "test1.out.h5" -s "/array\d{3}"
so that \d{3} would match all such objects.
Extending the tool syntax to meet this use case, and then again
for the next use case, would be a never-ending game of catch-up.
A more flexible substitute is desirable...
5 of 14
...Python?
6 of 14
What is Python?
Python is a programming language.
It features dynamic binding of variables, like Perl, shell
scripts, IDL, or Matlab, but unlike C or Fortran.
Unlike Perl, it supports native floating point numbers.
It has scientific array support in the style of IDL or Matlab (numpy
module). Array operations can be programmed using normal
arithmetic operators.
It has access to the HDF5 library (Andrew Collette's
h5py module).
Python is currently the only programming language in widespread
use to have all these features. They are essential to the success of the
language for easy HDF5 file manipulation.
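As a brief illustration of the array arithmetic mentioned above, here is a minimal sketch using only numpy. The variable names and the unit conversion are made up for the example.

```python
import numpy

# A 2 x 3 array of temperatures in degrees Celsius (illustrative values).
celsius = numpy.array([[0.0, 10.0, 20.0],
                       [30.0, 40.0, 100.0]])

# Ordinary arithmetic operators act element by element on the whole array.
fahrenheit = celsius * 9.0 / 5.0 + 32.0

# The array object itself records its extent and element type.
print(fahrenheit.shape, fahrenheit.dtype)
print(fahrenheit)
```

No explicit loops over indices are needed, which is the style IDL and Matlab users will recognize.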
7 of 14
Real-world experience: Learning Python and h5py is quick.
In the summer of 2010 SSAI hired a summer intern.
Equipped with some Perl programming experience the
intern was able to come up to speed on Python, HDF5,
h5py, and numpy within one to two weeks and, over the
summer, develop a specialized file/dataset merging tool
and a dataset conversion tool.
Python and h5py are the best way to introduce HDF5 because
they allow the user to concentrate on the H in HDF5, rather
than the C API syntax.
8 of 14
Python is well suited to HDF5
Python is well suited to HDF5 because numpy array objects
carry dimensionality, extent, and element data type
information, just as HDF5 datasets do. The object-oriented nature
of Python allows these objects to be manipulated at a high level.
C, by contrast, lacks a scientific array object and the ability to
define object methods.
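To make the parallel between array objects and HDF5 datasets concrete, the following sketch writes an array to a dataset and compares the metadata each one carries. It assumes h5py is installed and uses an in-memory file (the "core" driver with no backing store) so nothing touches the disk; the file name metadata_demo.h5 is a placeholder.

```python
import h5py
import numpy

data = numpy.arange(1, 25, dtype='int32').reshape(4, 6)

# In-memory file: the core driver with backing_store=False writes nothing to disk.
with h5py.File("metadata_demo.h5", "w", driver="core", backing_store=False) as fid:
    dset = fid.create_dataset("TestDataset", data=data)
    # The dataset reports the same rank, extent, and element type as the array.
    same_rank = len(dset.shape) == data.ndim
    same_shape = dset.shape == data.shape
    same_dtype = dset.dtype == data.dtype
    print(data.ndim, data.shape, data.dtype)
    print(len(dset.shape), dset.shape, dset.dtype)
```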
9 of 14
Example: Creating and Writing a Dataset to a New File
Python:
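import h5py
import numpy
# 4 x 6 array of 32-bit integers holding the values 1 to 24
TestData = numpy.arange(1, 25, dtype='int32').reshape(4, 6)
# Create the file and write the dataset; the file is closed on leaving the block
with h5py.File("WrittenByH5PY.h5", "w") as fid:
    fid['/TestDataset'] = TestData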
Compare to C version:
#include "hdf5.h"

int main(void) {
    hid_t file_id, dataspace_id, dataset_id; /* identifiers */
    herr_t status;
    hsize_t dims[2];
    const int FirstIndex = 4, SecondIndex = 6;
    int i, j, dset_data[4][6];

    for (i = 0; i < 4; i++) /* Initialize the dataset. */
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    dims[0] = FirstIndex;
    dims[1] = SecondIndex;
    file_id = H5Fcreate("WrittenByC.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); /* Create a new file. */
    dataspace_id = H5Screate_simple(2, dims, NULL); /* Create the dataspace. */
    dataset_id = H5Dcreate(file_id, "/TestDataset", H5T_STD_I32LE, dataspace_id,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT); /* Create the dataset. */
    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);
    status = H5Dclose(dataset_id);   /* Close the dataset. */
    status = H5Sclose(dataspace_id); /* Close the dataspace. */
    status = H5Fclose(file_id);      /* Close the file. */
    return 0;
}
10 of 14
And here's the output:
h5dump WrittenByH5PY.h5
HDF5 "WrittenByH5PY.h5" {
GROUP "/" {
   DATASET "TestDataset" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 1, 2, 3, 4, 5, 6,
      (1,0): 7, 8, 9, 10, 11, 12,
      (2,0): 13, 14, 15, 16, 17, 18,
      (3,0): 19, 20, 21, 22, 23, 24
      }
   }
}
}
11 of 14
Python and the Three Pillars of HDF5 Tools
Python is well suited to Text Processing.
Python has a wide range of string manipulation functions, an easy-to-use regular expression module, and list and dictionary (hash table)
objects. No segmentation faults!
Python is well suited to Tree Walking. Recursive functions
and loops over lists are easy to write.
Object Level Operations...Not so much.
Object Level Operations (e.g. copy, diff) are challenging to write
efficiently and should be provided as part of the API by the HDF
Group, for example h5o_copy. API functions are available to the
Python programmer via h5py.
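The three parts can be combined in a few lines. The following is a minimal sketch, not a replacement for h5copy. It takes up the earlier use case of datasets named arrayNNN, using Python's re module for the text processing, h5py's visititems for the tree walk, and h5py's Group.copy for the object level copy. The helper name copy_matching, the dataset names, and the in-memory demo files are inventions for this illustration.

```python
import re

import h5py
import numpy


def copy_matching(src, dst, pattern):
    """Copy every dataset in src whose full path matches pattern into dst."""
    regex = re.compile(pattern)
    matches = []

    def collect(name, obj):
        # visititems passes paths relative to the start group, without a leading "/".
        path = "/" + name
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(path):
            matches.append(path)
        # Returning None tells visititems to keep walking.

    src.visititems(collect)           # tree walking
    for path in matches:
        src.copy(path, dst, name=path)  # object level operation
    return matches


# Demo on in-memory files (core driver, nothing written to disk).
with h5py.File("src_demo.h5", "w", driver="core", backing_store=False) as src, \
     h5py.File("dst_demo.h5", "w", driver="core", backing_store=False) as dst:
    for label in ("array001", "array002", "array12", "other"):
        src[label] = numpy.arange(4)
    copied = copy_matching(src, dst, r"/array\d{3}")
    result = sorted(dst.keys())
    print(result)
```

Only the datasets whose names end in exactly three digits are copied. The expensive part, the copy itself, is still done by the HDF5 library, while the matching logic stays in a few lines of Python that are easy to change for the next use case.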
12 of 14
Why use Python to substitute HDF5 tools?
Python is available now.
Some HDF5 tools are still under development as new use
cases are presented. For example, users have requested a
tool to add attributes to HDF5 files. Such a capability
already exists with h5py:
python -c "import h5py ; fid = h5py.File('FileForAttributeAddition.h5','r+') ;
fid['/TestDataset'].attrs['CmdLine1'] = 'NewValue' ; fid.close()"
It's a little ugly, but it is available today.
Python is a full programming language. It can accomplish tasks
which HDF5 tools cannot.
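The same attribute addition reads more comfortably as a short script. This is a sketch under assumptions: the helper name add_attribute is hypothetical, and an in-memory file stands in for FileForAttributeAddition.h5 so the example runs without an existing file on disk.

```python
import h5py
import numpy


def add_attribute(fid, object_path, name, value):
    """Attach an attribute to the object at object_path in an open, writable file."""
    fid[object_path].attrs[name] = value


# In-memory stand-in for an existing file (core driver, nothing written to disk).
with h5py.File("attr_demo.h5", "w", driver="core", backing_store=False) as fid:
    fid["/TestDataset"] = numpy.arange(1, 25, dtype="int32").reshape(4, 6)
    add_attribute(fid, "/TestDataset", "CmdLine1", "NewValue")
    stored = fid["/TestDataset"].attrs["CmdLine1"]
    print(stored)
```

For a real file, one would open it with mode 'r+' as in the one-liner above and call the helper on the dataset of interest.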
Further Resources:
http://groups.google.com/group/h5py
http://h5py.alfven.org/
13 of 14
Recommendations:
Users should consider Python and h5py to accomplish their HDF5 file
manipulation projects.
The HDF Group should concentrate on providing efficient API
functions for object level tasks: object copy, dataset difference,
etc.
The HDF Group should avoid complex enhancements to tools where
Python/h5py could be used instead.
An easily searched contributed application repository on the HDF Group
website with user ratings would be very helpful.
14 of 14