XML for Science - Digital Science Center


XML for Scientific
Computing
Several case studies for XML
data in scientific computing
Overview

We will present case studies of the following
systems
• XSIL: Extensible Scientific Interchange Language
• XDMF: Extensible Data Model and Format
• Discipline Specific XML: ChemicalML
• Gateway Application Descriptors (plus Castor)

XML by itself is just markup, like HTML without a
browser. Each of the above uses a related set of
software to manipulate the XML data.
We present several examples of XML to give you
an overview.
We conclude with some remarks about standards
for science applications.
Overview of Case Studies

XSIL and XDMF are examples of
representing (meta)data for scientific
computing.
• Concentrate on data structures, data I/O.
• Meaning of data not described.

ChemicalML marks up domain specific
data.
• Meaningfully describes data content.


Gateway application data describes
science codes themselves.
All possess a data object model.
• Object oriented data descriptions guide the
markup tag definitions.
XSIL
XML tags for generic scientific
data markup, with related
Java software.
XSIL

Developed in support of several projects led by
CACR.
• Example: LIGO, Digital Sky
• Roy Williams, CalTech.




See http://www.cacr.caltech.edu/SDA/xsil/ for
more information and free software.
XSIL developed for astronomical and gravitational
wave communities.
But provides general purpose tags.
Also comes with software for building Java
applications that manipulate, display XSIL
documents.
XSIL Tags

XSIL defines a small number of tags
• XSIL: base container for the object model.
• Comment
• Param: an arbitrary name/value pair
• Time: describes time, plus format
• Table: data in columns and rows
• Array: table data with specific size
• URL:
• Streams: for handling data

We’ll now go over some of these in detail.
The XSIL Tag I


XSIL documents map to a document
object model with associated
handling code.
The root tag for XSIL is <XSIL>:
<XSIL Name="Example" Type="Examples.MyExample">
…
</XSIL>

Type points to the Java code that should
process this file.
• It’s some file called MyExample.java in the package
Examples.
The XSIL Tag II

XSIL tags can be nested if different parts
of the XSIL document need to be handled
by different codes.
<XSIL Name="Example" Type="Examples.MyExample">
…
<XSIL Name="Subsection" Type="Examples.Subsection">
…
</XSIL>
</XSIL>

XSIL tags thus are the base container in a
generic object hierarchy.
• MyExample object “has a” Subsection object
More On Object Containers


Consider an Electromagnetics example:
• A target is represented as a grid for finite
difference integration of Maxwell’s eqns.
• The base input file contains one or more
materials.
• Each material has specific EM properties.
If translated to XSIL, could look like this:
<XSIL Name="EMRoot" Type="CEA.Root">
<!-- Some general parameters -->
<XSIL Name="EMMaterial" Type="CEA.Material">
<!-- Some info describing the material. -->
</XSIL>
</XSIL>
Parameters



Each XSIL tag can contain one or
more parameters.
Params are arbitrary name/value
pairs.
Params optionally have units.
<XSIL …>
<Param Name="Color">Red</Param>
<Param Name="Weight" Unit="kg">3.14</Param>
</XSIL>
Tables


Params associate one value per
name
Tables support multiple values
• A Table row can have any number of
values.


Each table contains column
definitions followed by an arbitrary
number of entries.
Tables get data from streams
(discussed later).
Example Table
<XSIL…>
…
<Table>
<Column Name="Color" Type="string"/>
<Column Name="Weight" Type="float" Unit="kg"/>
<Column Name="Length" Type="float" Unit="meter"/>
<Stream Type="Local" Delimiter=",">
"Red",100.2,0.2
"Green",21.7,1.2
</Stream>
</Table>
</XSIL>
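To make the Table example concrete, here is a minimal sketch of how the text of a Local stream could be split into typed rows. It is not the XSIL library's own code: the class LocalStreamSketch and the method parseRows are hypothetical names, and the sketch assumes the delimiter never appears inside a quoted string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the XSIL distribution.
class LocalStreamSketch {

    // Split the text of a Local stream into rows of trimmed, unquoted fields.
    // Assumes the delimiter does not occur inside quoted strings.
    static List<String[]> parseRows(String streamText, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        for (String line : streamText.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) continue;  // skip blank lines around the data
            String[] fields = line.split(Pattern.quote(delimiter));
            for (int i = 0; i < fields.length; i++) {
                // Drop surrounding whitespace and one leading/trailing double quote.
                fields[i] = fields[i].trim().replaceAll("^\"|\"$", "");
            }
            rows.add(fields);
        }
        return rows;
    }

    public static void main(String[] args) {
        String stream = "\n\"Red\",100.2,0.2\n\"Green\",21.7,1.2\n";
        for (String[] row : parseRows(stream, ",")) {
            String color = row[0];                      // Column "Color", Type string
            float weightKg = Float.parseFloat(row[1]);  // Column "Weight", Unit kg
            float lengthM = Float.parseFloat(row[2]);   // Column "Length", Unit meter
            System.out.println(color + " " + weightKg + " kg " + lengthM + " m");
        }
    }
}
```

The column types and units come from the Column tags, so real handling code would read those first and convert each field accordingly.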
XSIL Arrays



XSIL arrays are similar to Fortran and C
arrays.
For mixed type data, use Tables.
If all data is the same (integers, floats),
use Arrays.
<Array Type="int">
<Dim Name="x-dim">2</Dim>
<Dim Name="y-dim">2</Dim>
<Stream Type="Local" Delimiter=",">
137,42
8,13
</Stream>
</Array>
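As a rough sketch of what handling code must do with such an Array, the snippet below reads the delimited integers in document order and checks that their count matches the product of the Dim sizes. The class XsilArraySketch and method readFlat are hypothetical names, and the sketch deliberately returns a flat array because the slides do not say which dimension varies fastest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the XSIL distribution.
class XsilArraySketch {

    // Read delimited integers in document order and verify the element count
    // against the sizes given by the Dim tags.
    static int[] readFlat(String streamText, String delimiter, int... dims) {
        int expected = 1;
        for (int d : dims) expected *= d;

        List<Integer> values = new ArrayList<>();
        for (String line : streamText.split("\n")) {
            for (String token : line.split(Pattern.quote(delimiter))) {
                token = token.trim();
                if (!token.isEmpty()) values.add(Integer.parseInt(token));
            }
        }
        if (values.size() != expected) {
            throw new IllegalArgumentException(
                "Dims require " + expected + " values, stream has " + values.size());
        }
        int[] flat = new int[expected];
        for (int i = 0; i < expected; i++) flat[i] = values.get(i);
        return flat;
    }

    public static void main(String[] args) {
        int[] data = readFlat("\n137,42\n8,13\n", ",", 2, 2);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```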
XSIL Streams


XSIL Streams can be used to load data
Data sources can be
• In the file itself (as shown in previous examples).
• From files on disk
• From URLs (http://, ftp://, and file:// supported)

Loading data from disk
<Stream Type="Remote" Encoding="Littleendian">
/home/user1/data/datafile.dat
</Stream>

Loading data from URLs
<Stream Type="Remote">
http://my.server.edu/XSILdata/datafile.dat
</Stream>
Ex: Use XSIL to describe input data
<XSIL Name="InputData" Type="Examples.InDataHandler">
<XSIL Name="Target 1" Type="Examples.Target">
<Param Name="Target">Scud</Param>
<Param Name="dx">0.1</Param>
<Array>
<Dim Name="X-Dimension">100</Dim>
<Dim Name="Y-Dimension">100</Dim>
<Stream Type="Remote">
/home/mpierce/data/mydata.dat
</Stream>
</Array>
</XSIL>
<XSIL Name="Target 2" Type="Examples.Target">
<!-- Another target -->
</XSIL>
</XSIL>
Table and Array Types

Table and Array data types (sizes in bits):
• boolean (1)
• byte (8)
• short (16)
• int (32)
• long (64)
• float (32)
• double (64)
• floatComplex (64)
• doubleComplex (128)
• string (arbitrary length)
Using XSIL


The previous example just marks up data.
XSIL also comes with Java bindings that
• Read the file and parse it.
• Extract parameter values, units, etc.
• Read in and manipulate tables, arrays

Central ideas:
• Each XSIL tag corresponds to a Java class
• XSIL’s Type points to your custom driver code
that uses the XSIL classes.
XSIL Coding Example

Consider the following small XSIL
example
<XSIL Type="Examples.MyExample">
<Param Name="x0">12.0</Param>
<Param Name="dx">0.1</Param>
</XSIL>
XSIL Java Code Example
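package extensions.Examples;

// Assumes XSIL, Param, etc. all live in the org.escience.XSIL package.
import org.escience.XSIL.*;

public class MyExample {
    String x0, dx;
    XSIL root;

    // The XSIL root object parses the file named by xsilFileName.
    public MyExample(String xsilFileName) {
        root = new XSIL(xsilFileName);
    }

    // Walk the root's children and pick out the two Params we need.
    public void construct() {
        for (int i = 0; i < root.getChildCount(); i++) {
            XSIL x = root.getChild(i);
            if (x instanceof Param) {
                Param p = (Param) x;
                if (p.getName().equals("x0")) x0 = p.getText();
                if (p.getName().equals("dx")) dx = p.getText();
            }
        }
    }
}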
Code Notes


All classes (Param, Table, etc.)
extend the XSIL class.
Pass the path of the XSIL file to the root
XSIL object through its constructor.
• XSIL handles all parsing


XSIL class defines getChildCount(),
getChild() methods.
Param class defines getName() and
getText() methods.
XSIL Summary


Defines a small set of general
purpose tags for scientific data.
Data itself is not directly marked up.
• Read in through streams

XSIL software maps Java classes to
XSIL tags.
• Convenient for working with XSIL docs.
• DOM classes are much more
cumbersome to use.
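For comparison, here is a sketch of the same Param extraction written directly against the JDK's DOM classes (javax.xml.parsers and org.w3c.dom), with no XSIL library involved. The class DomParamSketch and method readParams are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; uses only the JDK's DOM API.
class DomParamSketch {

    // Collect Param name/value pairs from an XSIL document held in a string.
    static Map<String, String> readParams(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> params = new LinkedHashMap<>();
            // Note: this finds Params at every nesting level, not just direct children.
            NodeList nodes = doc.getElementsByTagName("Param");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element param = (Element) nodes.item(i);
                params.put(param.getAttribute("Name"), param.getTextContent().trim());
            }
            return params;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XSIL document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<XSIL Type=\"Examples.MyExample\">"
                + "<Param Name=\"x0\">12.0</Param>"
                + "<Param Name=\"dx\">0.1</Param>"
                + "</XSIL>";
        System.out.println(readParams(xml));
    }
}
```

With plain DOM, every tag is a generic Element, so attribute names, casts, and text trimming are the caller's job. The XSIL classes hide that behind typed objects such as Param.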
XDMF
A data model geared toward
finite element codes, with
associated software in C++,
Java, and TCL
ICE XDMF

ICE (Interdisciplinary Computing
Environment) is a comprehensive project
at ARL MSRC that attempts to provide a
common software platform for DoD
scientific codes.
• Jerry Clarke, lead developer

XDMF (Extensible Data Model and Format)
provides a common data format for
several different codes
• Primary focus: finite element codes for fluid
dynamics and structural mechanics.
• XDMF and related software provides the
backbone for loosely coupling applications and
visualization.
XDMF Design



XDMF divides data into “light” and
“heavy” types.
Light data, or metadata, is formatted
in XML and will be described in more
depth.
Heavy data is in HDF5 and not
presented here.
XDMF Basic Concepts



XDMF basic tags are <DataStructure> and
<DataTransform>
<DataStructure> defines the actual data.
<DataTransform> defines the area of
interest (AOI) in the data.
• AOI defined by coordinates, a function, or a
hyperslab.

<DataTransform> contains one or more
<DataStructure> elements
• The transform defines how the data structure
will be filtered.
Simple Data Structure
The example below is for 655 XYZ values
in the indicated HDF5 file.
<DataStructure Name="Some XYZ Data"
Type="Float"
Dimensions="655 3">
MyData.h5:/MyXYZdata
</DataStructure>
Simple character data can also be included
directly in the XML document.

Data Structure for Mesh
Connections and Pressures
<DataStructure
Name="Connections"
Type="Int"
Precision="8"
Dimensions="100 8" >
MyData.h5:/MyConns
</DataStructure>

<DataStructure
Name="Pressure"
Type="Float"
Precision="8"
Dimensions="100">
MyData.h5:/MyPressure
</DataStructure>
Data Structure Attribute Summary
<DataStructure
  Name="Any name"                         Some meaningful name to the owner
  Rank="NumberOfDimensions"               Redundant information
  Dimensions="Kdim Jdim Idim"             The slowest varying dimension is listed first
  Type="Char | Float | Int | Compound"    Default is Float
  Precision="BytesPerElement"             Default is 4
  Format="XML | HDF"                      Default is XML
>
XDMF Array Types

XDMF array entries can have these
types:
• Integer
• Float
• Char

All are 4 bytes by default, can be
increased to 8 bytes.
DataTransform

DataTransform defines a way for the
raw data to be filtered
• Gives a certain Area of Interest in data
set.

Possible transforms:
• Coordinate: Select a particular area
• Function: Define simple algorithm for
selecting area
• Hyperslab: Define start, stride, and
count for each dimension of an array.
Hyperslab Transform Example


The following markup instructs the processing
code to apply a hyperslab transform to a 4-D
array.
The first data structure defines the hyperslab:
• 0 0 0 0 are the starting points for each dim
• 2 2 2 1 are the strides for each dim
• 25 50 75 3 are the counts for each dim

The second data structure gives the raw data, a
100x200x300x3 array in the noted HDF5 file.
The transform will produce a 25x50x75x3 region
that includes every other plane (along the first three
dimensions) of the original data, within the original
data region [0,0,0,0] to [50,100,150,2].
Hyperslab Transform Example
<DataTransform
    Dimensions="25 50 75 3"
    Type="HyperSlab">
  <DataStructure
      Dimensions="3 4"
      Format="XML">
    0 0 0 0
    2 2 2 1
    25 50 75 3
  </DataStructure>
  <DataStructure
      Name="Points"
      Dimensions="100 200 300 3"
      Format="HDF">
    MyData.h5:/XYZ
  </DataStructure>
</DataTransform>
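The start/stride/count arithmetic behind this example can be sketched in a few lines. This is an illustration of the usual hyperslab convention (index = start + i * stride, for i from 0 to count - 1), not XDMF's actual implementation; the class HyperslabSketch and its methods are hypothetical names.

```java
// Hypothetical helper for illustration; not part of the XDMF API.
class HyperslabSketch {

    // Indices selected along one dimension: start, start + stride, ...,
    // for a total of `count` entries.
    static int[] selectedIndices(int start, int stride, int count) {
        int[] idx = new int[count];
        for (int i = 0; i < count; i++) {
            idx[i] = start + i * stride;
        }
        return idx;
    }

    // True if the last selected index in every dimension lies inside the source array.
    static boolean fits(int[] start, int[] stride, int[] count, int[] sourceDims) {
        for (int d = 0; d < sourceDims.length; d++) {
            int last = start[d] + (count[d] - 1) * stride[d];
            if (start[d] < 0 || count[d] < 1 || last >= sourceDims[d]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] start = {0, 0, 0, 0};
        int[] stride = {2, 2, 2, 1};
        int[] count = {25, 50, 75, 3};
        int[] dims = {100, 200, 300, 3};

        System.out.println("fits: " + fits(start, stride, count, dims));
        for (int d = 0; d < dims.length; d++) {
            int[] idx = selectedIndices(start[d], stride[d], count[d]);
            System.out.println("dim " + d + ": first=" + idx[0]
                    + " last=" + idx[idx.length - 1] + " count=" + idx.length);
        }
    }
}
```

The count row of the 3x4 hyperslab definition matches the Dimensions attribute of the enclosing DataTransform, which is why the result is a 25x50x75x3 array.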
Data Organization




DataStructures and DataTransform
constitute XDMF’s data
representation.
XDMF Domain tags are used as
arbitrary containers.
Domains contain grids, grids contain
topologies, geometries and
attributes, as well as data structures.
Attributes include scalars, vectors,
tensors
An XDMF Example
<Domain Name="Example #1">
  <Grid Name="My Hex Grid with Pressure">
    <Topology Type="Hexahedron"
        Dimensions="100"
        Order="7 6 5 4 3 2 1 0">
      <DataStructure
          Name="Connections"
          Type="Int"
          Precision="8"
          Dimensions="100 8" >
        MyData.h5:/MyConns
      </DataStructure>
    </Topology>
    <Geometry Type="XYZ">
      <DataStructure Name="XYZ Data"
          Type="Float"
          Dimensions="655 3">
        MyData.h5:/MyXYZdata
      </DataStructure>
    </Geometry>
    <Attribute Type="Scalar"
        Center="Cell">
      <DataStructure
          Name="Pressure"
          Type="Float"
          Precision="8"
          Dimensions="100">
        MyData.h5:/MyPressure
      </DataStructure>
    </Attribute>
  </Grid>
</Domain>
Review of Example

Recall XDMF is primarily for structured and
unstructured finite element grids.
• Input data includes grid connectivity info, grid
geometry, and pressure values



The Domain contains a Grid
The Grid is defined by Topology,
Geometry, and Attributes.
Topology, Attributes, and Geometry
contain data sources and structure info.
XDMF API


Like XSIL, XDMF treats the XML markup
as a set of instructions to be processed by
actual programs.
XDMF defines an API of document
processing engines.
• Core is in C++
• ICE also provides Java and TCL APIs through
wrappers around core.

See
http://www.arl.hpc.mil/ice/Examples/Code
Integration/DemoIceRt.cxx for code
example.
XDMF Summary


Provides a few general purpose tags
Again, data is not directly marked
up.
• Stored in HDF5


XDMF handled programmatically with
APIs in C++, Java, Tcl.
More information:
• http://www.arl.hpc.mil/ice/
Comparison of XSIL and XDMF

XSIL
• Larger tag set
• Java API
• Can read data that
is in document, on
disk, from URL
• Questionable
performance and
memory efficiency
for very large data
sets.
• Free and open
source

XDMF
• Uses HDF5 for large
data sets.
• C++, Java, TCL
APIs.
• Defines both data
structures and
transform
instructions.
• Supports arrays,
but not mixed data
types (such as XSIL
Tables).
• Integrated with ICE
Chemical Markup
Language
A domain specific XML
markup language.
CML Introduction




XSIL and XDMF use XML to describe code
input files and give simple processing
instructions.
Tags describe data structure, not content.
We now examine a domain specific
example, the Chemical Markup Language.
Other domain markup languages:
• Mathematics Markup Language (MathML)
• Geography Markup Language (GML)
XML for Chemistry

Goal: provide a common chemical data
format that is an open, universal
standard.
• Data representation is platform independent
• Support structured searches of data banks.
• Provide a common format for software
(particularly visualization).
• Support multidisciplinary data formats
(biology, math) through XML namespaces.
• Provide a data object hierarchy suitable for
object oriented programming.
CML Structure

Chemistry lends itself to object
container structure
• Atoms have protons, neutrons,
electrons
• Molecules have atoms
• Complex molecules and compounds are
composed of molecules, molecular
pieces (benzene rings, for example)

CML defines these as data objects
with property fields
A Simple Example: Glycine
<molecule convention="MDLMol" id="glycine" title="GLYCINE">
  <date day="22" month="11" year="1995">
  </date>
  <atomArray>
    <atom id="a1">
      <string builtin="elementType">C</string>
      <float builtin="x2">0.6424</float>
      <float builtin="y2">0.4781</float>
    </atom>
    ….
  </atomArray>
  <bondArray>
    <bond id="b1">
      <string builtin="atomRef">a1</string>
      <string builtin="atomRef">a2</string>
      <string builtin="order">1</string>
    </bond>
    ….
  </bondArray>
</molecule>
CML Example Software
Previous Slide

Browser tool, Jumbo-3.0
• User can display dozens of CML’d
molecules.
• Molecules can be rotated in display.
• Display is rendered in SVG (Adobe
plugin).
• Molecule displayed is cholesterol. They
also have glycine in database, but not
as exciting to look at.
Gateway Application
Descriptors
Describing scientific
applications themselves with
XML and mapping to Java
with Castor.
Gateway Application Descriptors



Gateway is a computational web
portal for securely submitting and
monitoring jobs, transferring files,
and archiving information.
Gateway describes scientific
applications and host computers with
XML metadata.
This is used to provide general
purpose tools that can be used to
build portals for specific applications.
Application Descriptors



Gateway describes scientific applications
and host machines in XML.
This is used to generate HTML forms
needed to collect information needed to
create batch queuing scripts and job
submission.
The general object container scheme is
• Portals contain applications
• Applications contain hosts
• Each also has a set of descriptive parameters.
Example: ANSYS on Grids
<Application>
  <ApplicationName>ANSYS</ApplicationName>
  <Version>5.0</Version>
  <Parameter Name="IOStyle">
    <Value>StandardIO</Value>
  </Parameter>
  <Parameter Name="NumberOfInFiles">
    <Value>1</Value>
  </Parameter>
  <Host>
    <HostName>grids.ucs.indiana.edu</HostName>
    <HostIP>156.56.103.5</HostIP>
    <RemoteCopy>rcp</RemoteCopy>
    <RemoteExec>rsh</RemoteExec>
    <WorkDir>/tmp</WorkDir>
    <QueueType>CSH</QueueType>
    <QsubPath>/usr/bin/csh</QsubPath>
    <ExecPath>echo</ExecPath>
  </Host>
</Application>
Java Data Object Bindings



As with other examples, the
descriptor does not do anything by
itself.
Must provide language bindings to
make it useful in programs.
We used Castor
(http://castor.exolab.org) to
generate classes for us.
Castor for Data Object Creation





Direct mapping between Application tag and Java
object, for example.
Each object has necessary getter and setter
methods for manipulating data.
After making classes from XML schema (once),
load in XML file to program to create particular
data object instances (unmarshalled)
When program is done, modified data objects can
be marshalled back into XML file format.
We still have to write the Java code for specific
uses, utility classes….
Other markup languages
and some comparison
Various shortcomings of
programming and markup
languages
XML Schema

XML Schema defines many built-in
types
• binary, boolean, byte, decimal, double,
float, int, long, short, string
• And many more

Does not define standards for
• Arrays
• Complex (real+imaginary) numbers
SOAP

Known as XML Remote Procedure Call
protocol.
• RPC is only one part of SOAP



Also defines encoding rules for data
exchange.
SOAP inherits all XML Schema Built-in
Types (see previous slide).
Defines additional compound types
• Struct: arbitrary collection of types (say,
strings and floats) similar to XSIL table entry.
• Array: can contain primitive and compound
types

An array can be built out of arrays.
HDF5 and XML

Types include
• Integers

2-64 bit, signed or unsigned, big or little
endian
• Floats (32, 64 bit, BE or LE)
• Strings
• Arrays


Arbitrary compound types
See
http://hdf.ncsa.uiuc.edu/HDF5/XML/
Compatibility and Missing Features

No standard XML definitions for
arrays and “compound types” like
XSIL tables.
• We have several defs: SOAP, XSIL,
XDMF, XML-HDF5

Lack of built-in support for complex
(real + imaginary) types
• XML, XML-HDF5, XDMF can easily
define complex but not in standard way.
• Java does not have built-in complex
type, either
More Missing Features

Varying support for integers, floats
with different sizes.
• C/C++ does not guarantee consistent
bit size.

Binary data must specify Big
Endian/Little Endian encoding for
cross platform compatibility.
• XML-HDF5, XSIL, XDMF all do this
• XML does not

XSIL does not distinguish signed and
unsigned integer types.
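A small sketch of why binary data must declare its byte order: the same four bytes decode to different integers depending on the order assumed. It uses the standard java.nio.ByteBuffer and java.nio.ByteOrder classes; the class ByteOrderSketch and method readInt are hypothetical names for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration.
class ByteOrderSketch {

    // Decode the first four bytes of `raw` as a 32-bit integer in the given byte order.
    static int readInt(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x01, 0x00, 0x00, 0x00};
        // Least significant byte first: the value is 1.
        System.out.println("little endian: " + readInt(raw, ByteOrder.LITTLE_ENDIAN));
        // Most significant byte first: the value is 0x01000000.
        System.out.println("big endian:    " + readInt(raw, ByteOrder.BIG_ENDIAN));
    }
}
```

A format that records the encoding alongside the data, as the XSIL Stream tag does with Encoding="Littleendian", lets the reader pick the right order regardless of the platform that wrote the file.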