
Data Products and Product Management
Bill Howe
June 3, 2003
"...a trend within astronomy over the past decade or
so, which has seen the field change from being very
individualistic (individual astronomer proposes
observations, goes to telescope, brings data back on
tape, reduces data, writes paper, puts tape in drawer
and forgets about it) to being a more collective
enterprise where much telescope time is devoted to
systematic sky surveys and much science is
performed using data from the archives they
produce."
-- Bob Mann, U of Edinburgh
Traditional Scientific Data Processing
[Diagram: the scientist browses datasets (including others' datasets) in a personally managed filesystem, analyzes them, and produces data products. Personally managed, convenient, efficient.]
Modern Scientific Data Processing
[Diagram: the same pipeline under pressure: more data, more analyses, more users. Datasets (one's own and others') still sit in a filesystem and feed analysis into data products, but browsing is now questionable ("browse?").]
A New Environment
Axes of Growth
• Number of Users
• Number of Datasets
• Size of Datasets
• Number of Analysis Routines
• Complexity of Analysis Routines
Problems
• Too many files to browse
• Multiple users performing the same analyses
• Too many routines to invoke manually
• Datasets too large to fit into memory
Solutions
• Number of Users → better sharing of data and data products, both intra-group and inter-group
• Number of Datasets → better organization (metadata); query instead of browse
• Size of Datasets → better hardware; better algorithms
• Number of Analyses → identify equivalences; reuse common operations
• Complexity of Analyses → simpler applications; better understanding of data products
The first three are macro-data issues; the last two are micro-data issues.
Roadmap
Micro-Data Issues
• Expressing Data Products
• Executing Data Product Recipes
Macro-Data Issues
• Techniques for Managing Data and Processes
• Provenance
CORIE
[Figures: CORIE data products; the horizontal grid and the vertical grid.]
Some Data Products
• Timeseries (1D)
• Isoline (2D)
• Transect (2D)
• Isosurface (3D)
• Volume Render (3D)
• Calculation (?D)
• Animations (+1D)
• Ensembles (+1D)
Expressing Data Products
Specification
• Salt: "Show the salinity at depth 3m"
• Max: "Show the maximum salinity over depth"
• Vort: "Show the vorticity at depth 3m"
Implementation
• How should we translate these descriptions into something executable?
Expressing Data Products (2)
Criteria
1. Simple specs → simple implementations
2. Small spec ∆ → small implementation ∆
3. Environment ∆ → minimum implementation ∆
Existing Technology
• General-purpose programming languages
  • Example: CORIE (Perl and C)
• Visualization libraries/applications
  • Examples: VTK, Data Explorer (DX), XMVIS5d
Simple Specification → Simple Implementation
In CORIE:
• Salinity: XMVIS5d does the job, with a little help
• Max salinity: custom C program (read, traverse grid, write)
• Vorticity: custom C program (read, find neighbors, traverse grid, write)
Simple Specification → Simple Implementation
In VTK/DX:
• Salinity: a horizontal slice is simple
• Max salinity: not so simple, since the 3D grid is unstructured
• Vorticity: simple
Small Spec ∆ → Small Implementation ∆
In CORIE:
• Vertical slice instead of horizontal slice
  • Custom code: likely a drastic change. Why?
  • XMVIS5d: just feed in the vertical 'region'
• Zoom in on the estuary, then horizontal slice
  • Custom code: not insignificant changes
  • XMVIS5d: give region coords in the .par file
Small Spec ∆ → Small Implementation ∆
In VTK/DX:
• Vertical slice instead of horizontal slice
  • VTK/DX: equivalent to the horizontal case. Why?
• Zoom in on the estuary, then horizontal slice
  • Just filter out the unwanted portion
Environment ∆ → Minimum Implementation ∆
In CORIE:
• ∆ file layout
  • Custom code: usually drastic
  • XMVIS5d: just an extra conversion task?
• Unstructured grid → structured grid
  • Custom code: significant changes
  • XMVIS5d: can convert, but missing an opportunity
• Grid undergoes significant refinement
  • Does this matter?
Environment ∆ → Minimum Implementation ∆
In VTK/DX:
• ∆ file layout
  • DX provides a file import language
  • VTK would need a new 'reader' module to be written
• Unstructured grid → structured grid
  • DX: changes in the import module only
  • VTK: structured grids require a different set of objects (the algorithms are different, so the objects are different)
• Grid is significantly refined
  • Does this matter?
Expressing Data Products
Programs are frequently too tightly coupled to data characteristics:
• file format
• data size (larger or smaller than available memory)
• data location (file, memory)
• data type (float, int; scalar, vector)
• grid type (structured, unstructured)
• grid construction (repeated horizontal grid)
Executing Data Product Recipes
Efficient algorithms are tuned to the environment:
• in-memory vs. out-of-memory
• parallel vs. sequential
But we want to specify one operation that works in multiple environments, so each specified operation must correspond to multiple algorithms. How can we separate our specifications from the algorithms that implement them?
• Hold off on this for now…
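Before setting that question aside, a toy sketch (mine, not the talk's) of the idea: one logical operation, "maximum over a collection," backed by two physical algorithms chosen to suit the environment.

    # One logical operation, two physical algorithms (illustrative sketch).
    def maximum(values, fits_in_memory):
        if fits_in_memory:
            return max(list(values))     # in-memory: materialize, then scan
        running = float("-inf")          # out-of-memory: stream one value at a time
        for v in values:
            running = max(running, v)
        return running

    print(maximum(iter([3, 1, 4, 1, 5]), fits_in_memory=False))   # -> 5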
Execution: Pipelining
Two reasons to pipeline:
• Reduce the memory required
• Return partial results earlier
VTK and DX materialize intermediate results, which requires a lot of memory. The CORIE forecast engine pipelines timesteps, mostly to return partial results early; there are (were?) occasional memory problems.
Is pipelining always a good idea?
• The code is more complex
• If you have enough memory, pipelining will be slower
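A minimal sketch (not CORIE's code) of the contrast: the materialized version holds every intermediate result, while the pipelined version keeps one timestep in memory and yields each partial result as soon as it is ready.

    # Materialized vs. pipelined execution of a two-step recipe over timesteps.
    def read_timesteps(n):
        for t in range(n):
            yield [t + 0.1 * i for i in range(3)]    # stand-in for one timestep

    def materialized(n):
        steps = list(read_timesteps(n))              # all intermediates in memory
        scaled = [[v * 0.5 for v in s] for s in steps]
        return [max(s) for s in scaled]

    def pipelined(n):
        for s in read_timesteps(n):                  # one timestep at a time
            yield max(v * 0.5 for v in s)            # partial result available early

    print(materialized(4))
    print(list(pipelined(4)))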
Execution: Parallelism
Data parallel
• Split up the data and compute a piece of the data product on each node
• Pros? Cons?
Task parallel
• If a data product consists of independent steps, perform each on separate processors
• Pros? Cons?
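A data-parallel sketch (illustrative only): partition the data, compute a partial result per worker, then combine. A task-parallel version would instead run independent recipe steps concurrently.

    # Data parallelism: each worker computes a piece of the data product.
    from multiprocessing import Pool

    def chunk_max(chunk):
        return max(chunk)                      # per-node work on one partition

    if __name__ == "__main__":
        data = list(range(1_000_000))          # stand-in for a large dataset
        chunks = [data[i::4] for i in range(4)]
        with Pool(4) as pool:
            partials = pool.map(chunk_max, chunks)
        print(max(partials))                   # combine partial results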
Micro-Data Summary
• At some level, we should capture the logical data product.
• A logical 'recipe' should be immune to changes at the physical level.
• The logical recipe plus data characteristics together are precise enough to execute in the computer.
Logical vs. Physical
[Diagram: a data model splits into a logical level (data, operators) and a physical level (representations, algorithms).]

Logical vs. Physical
• Expression lives at the logical layer: correctness, simplicity
• Execution lives at the physical layer: file formats, memory, algorithms, etc.
Digression: Business Data Processing, ca. 1960
Record-oriented data:

Employees(Name, HiredOn, Salary, Dept)
  Sue, 6/21/2000, $46k, Engineering
  Jim, 1/11/2001, $39k, Sales
  Yin, 12/1/2000, $42k, (Sales, Engineering)

Dept(Name, City)
  Sales, New York
  Engineering, Denver

We ask queries:
• Who works in Sales?
• Where is Jim's Dept?
How to connect two records?

'50s–'60s: Hierarchical Data Model
[Diagram: departments as parents of employees — Sales over Jim and Yin; Eng over Sue and Yin.]

Who works in Sales?

    FIND ANY Dept USING 'Sales'
    FIND FIRST Employee WITHIN Dept
    DOWHILE DB-STATUS <> 0
      GET Employee
      FIND NEXT Employee WITHIN Dept
    ENDDO
    ...
'50s–'60s: Hierarchical Data Model
[Diagram: the hierarchy rearranged — employees now sit above their departments: Jim and Yin over Sales; Sue and Yin over Eng.]

Will the same query work now?

    FIND ANY Dept USING 'Sales'
    FIND FIRST Employee WITHIN Dept
    DOWHILE DB-STATUS <> 0
      GET Employee
      FIND NEXT Employee WITHIN Dept
    ENDDO
    ...
Data Dependence
What changed?
• The representation, but not the information
Observation: representation-dependent queries break.
Codd* investigated this problem of data dependence.

*E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," CACM 13(6), 1970.
Data Dependence: Solution
Define a logical data model
• Tables are relations between their columns
• Even the Dept-Emp connection is modeled as a relation
Extract a few logical operators
• Select records that match criteria
• Project away unused columns
• Join two tables based on common values
DB management systems provide physical implementations of the logical operators
• Users are insulated from representational and algorithmic complexity, free to focus on asking the right query
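In relational-algebra notation (a sketch; the column names DeptName and EmpName are assumed for the two columns of Dept-Emp below), "Who works in Sales?" becomes a select-then-project expression that says nothing about storage:

    \pi_{\mathit{EmpName}}\bigl(\sigma_{\mathit{DeptName} = \text{'Sales'}}(\textit{Dept-Emp})\bigr)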
1970: Relational Model

Employees(Name, HiredOn, Salary)
  Sue, 6/21/2000, $46k
  Jim, 1/11/2001, $39k
  Yin, 12/1/2000, $42k

Dept(Name, City)
  Sales, New York
  Engineering, Denver

Dept-Emp(Name, Name)
  Engineering, Sue
  Engineering, Yin
  Sales, Yin
  Sales, Jim

• Explicit connection between Emps and Depts
• Data model is provably correct
• Relational algebra is provably complete
• System designers' task is reduced to finding efficient implementations (not to say this is trivial!)
So What?
For scientific data, can we find:
• a logical data model?
• logical operators?
…rich enough to express all relevant data products,
…precise enough to guide efficient implementations?
Scientific Data Analysis
[Diagram: retrieve data from a filesystem, database, or tertiary storage; read it into a grid-and-data representation; analyze it into data product(s).]
Scientific Data Manipulation
[Diagram: a data product = preparation + representation + manipulation, each today handled with Fortran, C, Perl, or a library.]
Scientific Data Manipulation: Patterns
[Diagram: the same pipeline, with the same patterns recurring at every stage: iteration, aggregation, and filtering in preparation, in representation, and in manipulation.]
Data Model
• Datasets are defined over grids
• Grids are sets of cells
• Cells: nodes, edges, faces, ...
• A GridField associates data values to grid elements

GridField = (G, k, g), written G_k^g, where
• G is a grid
• k is an integer
• g : G_k → a (a function from the k-cells of G to attribute values)

[Example: G_2^area — an area value bound to each 2-cell of G.]
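A minimal rendering of the definition (hypothetical Python, not the actual gridfield implementation): a grid holds its cells per dimension, and a GridField binds one value to each k-cell. The later sketches build on these classes.

    # GridField = (G, k, g): values g bound to the k-cells of grid G.
    from dataclasses import dataclass

    @dataclass
    class Grid:
        cells: dict        # k -> list of cell ids, e.g. {0: [...], 2: [...]}

    @dataclass
    class GridField:
        grid: Grid         # G
        k: int             # the rank of the cells carrying data
        g: dict            # cell id -> attribute value

    # G_2^area: an area bound to each 2-cell (face) of G
    G = Grid(cells={2: ["f1", "f2"]})
    area = GridField(G, 2, {"f1": 4.0, "f2": 3.5})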
Operators

  Pattern                              Operator
  associating grids with data          bind
  combining grids topologically        union, intersection, cross product
  reducing a grid using data values    restrict
  transforming grids                   aggregate
Restrict
[Example: a small grid carrying the values 18, 14, 18, 21, 15; after restrict(<19), only the cells carrying 14 and 15 remain.]
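A sketch of restrict over the classes above (hypothetical; the real operator also prunes cells of other ranks whose support disappears, which this toy version ignores, so it keeps every cell satisfying the predicate):

    def restrict(gf, pred):
        # Keep only the k-cells whose data value satisfies the predicate.
        kept = {c: v for c, v in gf.g.items() if pred(v)}
        return GridField(Grid({gf.k: list(kept)}), gf.k, kept)

    nodes = GridField(Grid({0: list("abcde")}), 0,
                      dict(zip("abcde", [18, 14, 18, 21, 15])))
    print(restrict(nodes, lambda v: v < 19).g)   # values below 19 survive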
Merge
[Example: one gridfield carries x1, x2, x3; another carries y1, y2, y3, y4 over a larger grid. Merge pairs the values on the cells the two grids share, yielding (x1,y1), (x2,y2), (x3,y3).]
Aggregate
[Example: input G_0^temp with values 12.6°C, 13.1°C, 13.2°C, 12.8°C, 12.5°C, 12.1°C. The assignment chunk(3) maps each cell of the target grid T to three input cells, giving {12.6°C, 13.1°C, 13.2°C} and {12.8°C, 12.5°C, 12.1°C}; the aggregation function average then yields an output gridfield of roughly 12.95°C and 12.45°C over T.]
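Aggregate is the same three-part story in code (a hypothetical sketch over the classes above): a target grid, an assignment from target cells to source cells, and an aggregation function.

    def aggregate(source, target_cells, assign, agg):
        # For each target cell, gather its assigned source values and aggregate.
        g = {t: agg([source.g[c] for c in assign(t)]) for t in target_cells}
        return GridField(Grid({0: list(target_cells)}), 0, g)

    temps = GridField(Grid({0: list(range(6))}), 0,
                      dict(enumerate([12.6, 13.1, 13.2, 12.8, 12.5, 12.1])))
    chunk3 = lambda t: [3 * t, 3 * t + 1, 3 * t + 2]    # the chunk(3) assignment
    avg = lambda xs: sum(xs) / len(xs)
    print(aggregate(temps, [0, 1], chunk3, avg).g)      # one average per chunk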
Example: Max Salinity
Given (H×V)_2^salt, salinity over the product of the horizontal grid H and the vertical grid V:
1. target grid: H
2. assignment: cross V (e ↦ e × V)
3. aggregate: max
yielding H_2^maxsalt = agg(H_2^cross(V), max) applied to (H×V)_2^salt.
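With the sketch operators above, max-over-depth is just a choice of target, assignment, and aggregation (hypothetical data; column() plays the role of the cross-product assignment):

    # Salinity over the product grid H x V, keyed by (horizontal cell, level).
    H_cells, V_levels = ["h1", "h2"], [0, 1, 2]
    salt_values = {(h, v): (20.0 + v if h == "h1" else 25.0 - v)
                   for h in H_cells for v in V_levels}
    salt = GridField(Grid({2: list(salt_values)}), 2, salt_values)

    column = lambda h: [(h, v) for v in V_levels]   # assignment: e -> e x V
    maxsalt = aggregate(salt, H_cells, column, max)
    print(maxsalt.g)                                # {'h1': 22.0, 'h2': 25.0}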
Example: Salinity Gradient
Given G_2^salt:
1. target grid: G
2. assignment: neighbors
3. aggregate: gradient
yielding G_2^saltgrad = agg(G_2^neighbors, gradient).
Example: WetGrid
[Recipe diagram: bind xy and bathymetry to the horizontal grid H and z to the vertical grid V; merge the results; then gfWetGrid = restrict(z > bathym) over the merged product.]
Example: Plume
[Recipe diagram: starting from gfWetGrid, bind z (from V), elev, and salt (from H); merge; then apply restrict(x > 300), restrict(z < elev), and restrict(salt < 26) to obtain gfPlume.]
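Chaining the sketch restrict gives a feel for the recipe (hypothetical: each cell here carries a record of attributes, whereas the real recipe merges separate gridfields for position, elevation, and salinity):

    # Successive restricts narrow the wet grid down to the plume.
    cells = {i: {"x": 100 * i, "z": -i, "elev": 0.5, "salt": 20 + i}
             for i in range(8)}
    gf = GridField(Grid({0: list(cells)}), 0, cells)

    wet = restrict(gf, lambda r: r["z"] > -6)            # stand-in for z > bathym
    plume = restrict(
        restrict(restrict(wet, lambda r: r["x"] > 300),  # x > 300
                 lambda r: r["z"] < r["elev"]),          # z < elev
        lambda r: r["salt"] < 26)                        # salt < 26
    print(sorted(plume.g))                               # the surviving cells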
Logical Analysis
[Slide: cost comparison of equivalent recipe plans over gridfields S^s and X^x. When the grids coincide (X = S), a combined restrict r(S, X) costing O(n²) can be rewritten as separate restricts r(S), r(X) followed by a merge m, each step costing O(1).]
Macro-Data Issues:
Database Extensions for Scientific Data Management
Scientific Data
• Big → efficiency is paramount
• Complex processing → formats must match existing tools
• Few updates → concurrency is not a major concern
• Extensive metadata → provenance/lineage/pedigree
Science is hitting a wall
FTP and GREP are not adequate
• You can GREP 1 MB in 1 sec, and FTP 1 MB in 1 sec
• You can GREP 1 GB in 1 min, and FTP 1 GB in 1 min
• You can GREP 1 TB in 2 days, and FTP 1 TB in 2 days (1K$)
• You can GREP 1 PB in 3 years, and FTP 1 PB in 3 years (1M$)
Oh!, and 1 PB is ~10,000 disks
At some point you need
• indices to limit search
• parallel data search and analysis
This is where databases can help

Slide courtesy of Alex Szalay and Jim Gray, from "Public Access to Large Astronomical Datasets," presented at the Data Provenance Workshop 2002.
With No Database
[Diagram: browse the filesystem, read files into arrays, produce data products by hand.]
Drawbacks: limited sharing, limited automation, data dependence, no backup, no recovery, no transactions, no concurrency, limited security, no query, ...
First Attempt: Relational Databases
[Diagram: the DBMS stores relations; query results are turned into arrays and rendered into data products by special tools.]
Drawbacks: impedance mismatch, performance.
Second Attempt: DB-Managed Files
[Diagram: the DBMS manages files (IBM DataLinks, Oracle IFS); queries locate them, and special tools read and analyze them into arrays and data products.]
Drawbacks: redundant and repetitive computation, data dependence.
Third Attempt: Analysis-Aware DB
[Diagram: the database tracks the analysis of files into data products (ESSW, Chimera); query results point into that record.]
Drawback: still data-dependent special tools.
Data Product Pipeline
[Diagram: process modeling for forecast runs. A recipe version R1 binds meshes (M1, M2), parameters (P1, P2; e.g. 10/3, C=.2), and forcings (F1, F2, F3; e.g. atm, tide) to executions (E1, E2) on particular cpu/memory resources.]
Fourth Attempt: Array Types for DBs
[Diagram: the DBMS gains an array type and an array query language (AML, AQL, the Monoid Calculus); special tools still render query results into data products.]
Open questions: data dependence? expressiveness?
Fifth Attempt: Specialized Data Models for DBs
[Diagram: the DBMS adopts a specialized scientific data model (Active Data Repository, Aurora, GridFields...); query results are rendered into data products by special tools.]
Catch: the query language doesn't exist!
Macro-Data Summary
• Science is outgrowing its infrastructure; databases can help
• Competing solutions, no clear winner
  • Extensions to existing database technology
  • Specialized scientific computing platforms
• Limited industrial interest (science is not a big source of $$$)
A Final Topic:
Data Provenance
Data Provenance
"...a record of the origin and history of a piece of data."
-- Dave Pearson, Oracle UK

"...a history of steps and procedures associated with the processing of associated data."
-- Bob Mann, University of Edinburgh

"...metadata which uniquely defines data and provides a traceable path to its origin."
-- Carmen Pancerella et al., Sandia National Laboratories

"...determining the validity of data by gaining access to a complete audit trail describing how the data was produced from [base] datasets..."
-- Ian Foster, U of Chicago
Data Provenance
Used for:
• Discovery (querying)
• Validation
• Reproducibility
Related issues:
• Annotations
• Federated databases
• Publishing
Data Provenance: Research Thrusts
• Domain-specific standards
  • Astronomy
  • High-energy physics
  • Bioinformatics
  • Environmental observation and forecasting?
• Representation
  • XML
  • BLOBs
  • Explicit schema support
• Database extensions
  • Tracking provenance through queries implicitly
Summary
Micro-Data Issues
• Logical level: convenient expression, genericity, algebraic optimization
• Physical level: efficient execution
Macro-Data Issues
• Database features for scientific data
• Metadata
• Provenance

[Backup figures: timeseries, isoline, transect, ensemble, volume rendering, isosurface; DX examples of salt, vorticity, max salt, and filtering.]