Early Experience Prototyping a Science Data Server for Ameriflux

Ameriflux Scientific Data Server
Technology Overview
Catharine van Ingen, MSFT ([email protected])
11 December 2006
Outline
Overview
Database schema
Data cube structure
Data staging pipeline
Work in progress and future plans
Overview
Ameriflux Collaboration Overview
149 sites across the Americas
Each site reports a minimum of 22 common measurements.
Communal science - each principal investigator acts independently to prepare and publish data.
Data published to and archived at Oak Ridge.
Total data reported to date on the order of 160M half-hourly measurements (includes 3 processing levels).
http://public.ornl.gov/ameriflux/
Ameriflux Scientific Data Server - Goals
Act as a local repository for data and metadata assembled by a
small group of scientists from a wide variety of sources
Simplify provenance by providing a common “safe deposit box”
for assembled data
Interact simply with existing and emerging Internet portals for data
and metadata download, and, over time, upload
Simplify data assembly by adding automation
Simplify name space confusion by adding explicit decode
translation
Support basic analyses across the entire dataset for both data
cleaning and science
Simplify mundane data handling tasks
Simplify quality checking and data selection by enabling data
browsing
Ameriflux Scientific Data Server - Non-Goals
Replace the large Internet data source sites
The technology developed may be applicable, but the focus is
on the group collaboration scale and usability
Very large datasets require different operational practices
Perform complex modeling and statistical analyses
There are a lot of existing tools with established trust based on
long track records
Only part of a full LIMS (laboratory information management
system)
Develop a new standard schema or controlled vocabulary
Other work on these is progressing independently
Due to the heterogeneity of the data, more than one such
standard seems likely to be relevant
Ameriflux Scientific Data Server - Workflows
Staging: adding data or metadata
New downloaded or field measurements
added
New derived measurements added
Editing: changing data or metadata
Existing older measurements re-calibrated
or re-derived
Data cleaning or other algorithm changes
Gap filling
Sharing: making the latest acquired data
available rapidly
Even before all the checks have been
made
Browsing new data before more detailed
analyses
Private Analysis: Supporting individual
researchers (MyDB)
Stable location for personal calibrations,
derivations, and other data transformations
Import/Export to analysis tools and models
Curating: data versioning and provenance
Simple parent:child versioning to track
collections of data used for specific uses
Scientific Data Server - Logical Overview
[Diagram: large data archives and local measurements feed the Scientific Data Server, which holds staging databases and cubes, private data analysis databases and cubes, older dataset archive database(s), last known good dataset database(s) and cubes, and the latest dataset database and cube. Data access and analysis tools - analysis packages (Excel, Matlab, SPlus, SAS, ArcGIS), simple web data plots and tables, BigPlot data browsing, flat file data import/export, and computational models - connect to these databases and cubes.]
Databases: SQL Server 2005
All data, descriptive ancillary data and
metadata held in relational databases
Access via SQL query
Metadata is important too but handled
differently than data
While separate databases are shown, the
datasets actually reside in a single database
today
Mapping is transparent to the scientist
due to dataset grouping and versioning
Separate databases used for
performance
Unified databases used for simplicity
New ancillary data and data are staged with a
temporary database
Minimal quality checks applied
All name and unit conversions
Data may be exported to flat file, copied to a
private MyDb database, directly accessed
programmatically via ODBC connection, or ?
Database may be replicated for availability
and load balancing
[Diagram: staging database, latest dataset database, last known good dataset(s) database, older dataset(s) archive database, and MyDb analysis database.]
Data cubes: SQL Server Analysis Services
A data cube is a database specifically for data mining (OLAP)
Simple aggregations (sum, min, max, count) can be pre-computed for speed
Additional calculations (median) can be computed dynamically
Both operate along dimensions such as time, site, or datumtype
Constructed from a relational database via a Visual Studio 2005 project
A specialized query language (MDX) is used
Client tool integration is evolving
Excel PivotTables allow simple data viewing
More powerful charting with Tableau or ProClarity (commercial mining tools)
Management is similar to, but different from, SQL Server
Most tools are shared, but actual usage differs due to functional and organizational differences
References:
Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer) by Sivakumar Harinath and Stephen Quinn
MDX Solutions: With Microsoft SQL Server Analysis Services by George Spofford
[Diagram: staging data cube, last known good dataset(s) cubes, MyDb analysis data cubes, and latest dataset cube.]
Simple plots: SQL Server Reporting Services
Server-side rendering for simple tables
and charts
Initially developed for static
rendering of database reports
such as quarterly sales
Report layout via Visual Studio
2005 project
Report description language is
XML and stored in a relational
database
Accessed via ASP.NET executing
in IIS
Actual data source can be either
relational database or data cube
Reports can be downloaded in various
formats (eg Excel, jpg).
Not intended for large quantities of data
(>50000 points) or highly customized
graphics
References: SQL Books Online (online
product documentation)
[Sample report: Download Site Ancillary Data (with Ameriflux Fair Use Policy notice). Datumtype: Leaf area index (LAI); Sites: CA - Tonzi Ranch, CA - Vaira Ranch. The report table lists Ancillary Datumtype, Site, Value, and Source; values range from 0.6 to 2 with sources including Global Biogeochemical Cycles, Vol. 18, GB4002, 2004; Xu and Baldocchi, 2003; and http://www.cnr.berkeley.edu/biometlab/BiometWeb/FieldStudies.html.]
Data Import/Export: SQL Server Integration Services
Simple GUI-based tool generation for
data migration, import, export
Good for exploration and/or rapid
script development
Includes several format converters
Eliminates the need for much of the
“glue” code
We’ve had mixed luck with it for more complex operations such as custom selection for export to flat file and/or ingesting all data from the ORNL website.
References:
Extending SSIS 2005 with Script by
Donald Farmer or Professional SQL
Server 2005
Integration Services (Programmer to
Programmer) by Brian Knight et al
bwc.berkeley.edu Today
[Deployment diagram: analysis tools (Excel, Matlab, SPlus, SAS, ArcGIS) connect via ODBC, and an IIS website hosts the ReportServer; databases on the server include ReportServer, TempDB, SeptemberORNL, NovemberORNL, EverythingORNL, Harvard Forest, Vaira, and Willow; data arrive from the ORNL archive and flat files.]
Database Schema
Schema Overview
[Full schema diagram: core tables (site, investigator, site_investigator, datumtype, exdatumtype, offset, repeat, quality, times, timeline, data, ancdata, dataset, siteset, dataset_siteset, sitesetchecksums), staging and decode tables (siteStage, sitedecode, siteidStage, datumtypeStage, datumidStage, columndecode, investigatordecode), and views (dataView, dataPruneView, sitePruneView, siteView, sitesetview, datumtypeView, exdatumtypeView, offsetView). Subsets of the schema are considered in turn below.]
L2 Data Decode
Discovered column headings are represented as:
[Datumtype] [repeat][_offset][_offset][extended datumtype][units]
Datumtype: the measurement primary datumtype.
Example: TA, PREC, or LE.
Repeat: an optional number indicating that multiple
measurements were taken at the same site and offset.
Example: TA2.
[_offset][_offset]: major and minor part of the z offset.
Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at
10.7m).
Extended datumtype: any remaining column text.
Example: “fir”, “E”, “sfc”.
Units: measurement units (should only be one per datumtype)
Example: w/m2, or deg C.
L3/L4 European Data Decode
European discovered column headings can be represented as:
[qualityprefix][Datumtype][count][extended datumtype][qualitypostfix]
QualityPrefix: “qf_” quality flags computed at level 3 processing
Datumtype: measurement primary datumtype, although other text can
be used (such as G for soil heat flux)
Count: an optional number indicating a site-specific offset depth
Replaces the offset from the previous column decode
We’ve asked that this be changed; likely an artifact of the original
ORNL column format
Extended datumtype: a datumtype-specific combination of additional
modifying text OR derivation algorithm identifier
QualityPostfix: “qc” quality flags computed at level 4 processing
Key issue: merging the quality data in columns with the correct data value
column. Not all quality columns are 1:1 with data columns
Data Table
Data values are stored in a “tall” normalized or atomized table (sketched below)
Only valid data values are retained
Each row contains exactly one value; all other columns are foreign keys to other tables
Data are stored as single precision floating point values
Horizontally partitioned tables are created for BigPlot
Various views are created for cube building and report generation
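A minimal T-SQL sketch of the tall table, assuming the column names shown in the schema diagram; exact types and constraints in the prototype may differ:

-- Tall data table: one value per row, everything else a foreign key.
-- Column names follow the schema diagram; types are assumptions.
CREATE TABLE data (
    sitesetid  int      NOT NULL,   -- versioning collection (siteset)
    siteid     int      NOT NULL,   -- measurement site
    datumid    int      NOT NULL,   -- primary datumtype (e.g. TA, GPP)
    exdatumid  int      NOT NULL,   -- extended datumtype
    offsetid   int      NOT NULL,   -- z offset
    repeatid   int      NOT NULL,   -- repeated sensor at same site/offset
    qualityid  int      NOT NULL,   -- quality flags
    [time]     datetime NOT NULL,   -- site-local half-hour timestamp
    value      real     NOT NULL    -- the single-precision measurement
);
-- Clustered unique index to speed the common queries (see the learnings slide).
CREATE UNIQUE CLUSTERED INDEX ix_data
    ON data (sitesetid, siteid, datumid, exdatumid, offsetid, repeatid, [time]);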
[Diagram: the data table with foreign keys to the site, datumtype, exdatumtype, offset, repeat, quality, times, and siteset tables.]
Times, repeat, offset
Times table algorithmically generated on half-hour intervals (a generation sketch follows below)
All times are site local
Considering adding UTC via a view with site-specific conversions or an additional column
Repeat table monotonically increases
Offset table generated by discovery of staged data
Views of offset and times used for cube and report menu generation
Multiple time representations support hierarchies
Offsets ordered to simplify usage
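A sketch of how the times table could be populated algorithmically on half-hour intervals (one year shown); this is illustrative only, not the production script:

-- Generate half-hourly timestamps for one year and load them into times.
DECLARE @start datetime, @stop datetime;
SET @start = '2002-01-01';
SET @stop  = '2003-01-01';
WITH halfhours AS (
    SELECT @start AS t
    UNION ALL
    SELECT DATEADD(minute, 30, t)
    FROM halfhours
    WHERE t < DATEADD(minute, -30, @stop)
)
INSERT INTO times ([time])
SELECT t FROM halfhours
OPTION (MAXRECURSION 0);   -- more than the default 100 recursions needed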
[Diagram: times, repeat, and offset tables with the timeline (yy, mm, dd, hh, mi, doy, and combined hierarchies) and offsetView views.]
Datumtype and Exdatumtype
Datumtype holds primary (not “other”) data datumtypes and all ancillary datumtypes (eg LAI)
Columns for cube building and per-datumtype handling
intoCube: determines whether the datumtype flows to the cube
dailyCalc: indicates whether the daily value should be an average or cumulative
repeatCalc: indicates whether repeats are averaged, ignored, or summed when building the default cube
siteCalc: determines how numeric site ancillary data should flow into the cube. Eg LAI is max
Ordered views used for cube building and report generation (see the view sketch below)
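A sketch of what such a cube-building view might look like; the actual datumtypeView definition in the prototype may differ, so a hypothetical name is used here:

-- View exposing only the datumtypes that flow into the cube; the per-datumtype
-- handling columns (dailyCalc, repeatCalc, siteCalc) ride along for the cube build.
CREATE VIEW datumtypeViewSketch AS
SELECT datumid, shortname, units, name, offsetunits,
       ancdata, intoCube, dailyCalc, repeatCalc, siteCalc
FROM datumtype
WHERE intoCube = 1;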
[Diagram: datumtype and exdatumtype tables.]
Quality
Holds datumtype-specific European quality indicators
Algorithmically generated given the quality definitions
Intended to be extensible
qualityID indicates which of the remaining columns apply
Additional columns can be added as new quality metrics emerge
Flows into the cube as a quality dimension
Will support selection and plotting by quality
Still being debugged
[Diagram: quality table (qualityid plus the European quality flag columns).]
Site Ancillary Data
Ancillary data holds site-specific properties such as biome, climate, or canopy height
May be text or numeric
A separate table is necessary as multiple reported measurements exist due to different sources or changes over time
The source column tracks provenance
To flow into the cube, an algorithm is needed to select one and only one value (see the sketch below)
Leaf area index uses “max”
Over time, could use start/stop times to generate values and treat as data
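A minimal sketch of such a one-value-per-site selection, assuming the ancdata and datumtype tables shown in the schema diagrams and a shortname of 'LAI':

-- Pick exactly one leaf area index value per site (the maximum) so the
-- ancillary datum can flow into the cube as a site attribute.
SELECT a.siteid, MAX(a.value) AS lai
FROM ancdata a
JOIN datumtype d ON d.datumid = a.datumid
WHERE d.shortname = 'LAI'
GROUP BY a.siteid;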
[Diagram: ancdata table with foreign keys to the site and datumtype tables.]
Site, Investigator
Site and investigator initially created by a scrape of the ORNL site
Now maintained by manual query
A join table (site_investigator) connects the two and tracks the relevant start/stop time
Investigator could be replaced with a GUID identifier into an alternate (membership) database
Deploying CasJobs or another workflow or authentication mechanism will necessitate this
[Diagram: site, site_investigator (with startTime, endTime, comment), and investigator tables.]
Datasets, Sitesets
Datasets are used to expose data versions such as NEE_or_MDS or the September download
Sitesets are used to build datasets and support incremental data staging or other changes to existing data
The original single-level versioning scheme was replaced by this two-level scheme due to better mix/match ability with different data sources across datumtypes
[Diagram: dataset, dataset_siteset, siteset, and sitesetchecksums tables with their timestamp and provenance columns.]
Decoding Tables
The conversion from column header to decoded data is table driven
Sites and column headers are discovered, then decoded (a sketch follows below).
Allows for multiple different representations of the same site/datumtype.
Checksums are used to suppress the load if a newly scraped data file has not changed
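An illustrative sketch only, assuming the staging and decode tables shown in the schema diagrams; the real decode also resolves offsets, repeats, units, and quality flags:

-- Resolve discovered column headers (datumtypeStage) to known datumtype ids.
-- Headers with no match fall out of the join and need a new decode entry.
SELECT s.columntext, s.units, d.datumid
FROM datumtypeStage s
LEFT JOIN datumtype d ON d.shortname = s.shortname;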
[Diagram: staging and decode tables - siteStage, sitedecode, siteidStage, datumtypeStage, datumidStage, investigatordecode, and columndecode.]
Versioning: Sitesets and Datasets
A siteset is a collection of similar data from a single site
May include actual observations, measurements derived from observations by a known algorithm, or measurements derived from observations by a private (investigator-specific) algorithm
Derived from a single source – actual measurements, download from a national data set, simulation, or other
• The Ameriflux site is only one such source
Sitesets are primarily internal bookkeeping
A dataset is a collection of sitesets
No restriction on sitesets in the collection, although some combinations won’t make scientific sense
Datasets are the external scientist-facing “version”
Reduces older data churn, maps to (likely) scientist data usage,
avoids even more complex queries and views
Server and Researcher Sets
Sitesets and datasets may be maintained by the server or maintained
by a researcher
Server Sets are maintained on behalf of all researchers in the
collaboration
Accessible to all researchers
Set change policy decisions determined by the server
(administrator) on behalf of the collaboration
Researcher Sets are maintained by a specific researcher or a small
group of researchers in the collaboration
Accessible only to owning researcher and/or collaborators
selected by the owner
Set change decisions determined by the owner
Fixed, AppendOnly, and Dynamic Sets
Fixed sets are used to define “known good” sets for long term
data curation
Contains data of a known quality as of a specific point in
time
Analogous to a software release
Once fixed, always fixed
A fixed dataset may only contain fixed sitesets
AppendOnly sets are used to define “latest good” sets for
common accesses
Contains data of a known quality although additional data
may be added at any time
May not be necessary, but offers a little protection against
inadvertent staging side effects
An appendonly dataset may contain both fixed and append
only sitesets
Dynamic sets are used to define “latest” sets for rapid data
access, active analysis or cleaning
Contains data of a relatively well known quality.
No restriction on changes
A dynamic dataset may contain any siteset
[Diagram: which operations (add new data, modify, remove) are allowed for dynamic, append-only, and fixed sets.]
Dataset and Siteset Timestamps
Significant changes to a dataset or siteset are tracked by timestamps
CreateTime: dataset or siteset creation
LastAppendTime: last addition of new siteset to a dataset or new
data to a siteset (includes new derived measurements, new
measurement types and recent data additions)
LastModifyTime: last deletion or substitution of a siteset member
of a dataset or change to older previously existing data values of a
siteset
FixTime: dataset or siteset converted to fixed
DeleteTime: all dataset or siteset data deleted
Notes:
Intended to be used as “something has changed, query for more
details” indicators only (a query sketch follows below)
DeleteTime retained for some period to cover the “oops” cases and
help scientists determine what happened in inadvertent or
forgotten scenarios
All timestamps are server local time
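A minimal sketch of the intended "something has changed, query for more details" usage; @lastChecked is a hypothetical parameter:

-- Which datasets have gained or changed data since I last looked?
DECLARE @lastChecked datetime;
SET @lastChecked = '2006-11-01';
SELECT datasetid, name, lastAppendTime, lastModifyTime
FROM dataset
WHERE lastAppendTime > @lastChecked
   OR lastModifyTime > @lastChecked;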
Dataset and Siteset provenance
creatorid tracks the investigator that created the dataset or siteset
includes any software automation tracked via an investigator table
entry
prepares for investigator-private sets in MyDB/MyCube
name is (relatively) short text string for simple identification
Commonly used to select dataset when building a cube or other
user interaction
description is longer text string for reference
Siteset howmade is a controlled vocabulary text string to track data
origins
Text used to simplify data viewing.
Initial strings:
• ORNLsite: non-gap filled data downloaded from ORNL website
• ANN: gap-filled data using ANN method
• gap: gap-filled data using unknown or unspecified method
Siteset path is the URL (if available) that was used when downloading
data
NULL used when URL unavailable
Treated as a text string only, no sanity checking or format
enforcement
Siteset and Dataset Scenarios
Sitesets and datasets are just mechanisms; we also need
policies to determine the actual behavior
Usage cases of interest:
New staged measurements
Existing older measurements re-calibrated or re-derived
Data cleaning or other algorithm changes
New derived measurements added
Gap filling
Making the latest acquired data available rapidly
Supporting individual researchers (MyDB)
Data version provenance: parents and children
Example: data version provenance
When making a new version, it’s good to track the original version
Full provenance is possible, but very easily leads to poor usability
Proposal: track siteset parents
A siteset has exactly one parent
A siteset may have any number of children due to gap filling, other
algorithm changes applied, or new data reported
When a new siteset is made by combining two different sitesets,
the combination can be determined by annotation only
When a siteset is deleted, the youngest surviving ancestor
becomes the parent to any surviving children
Relationships between datasets are more complex and should be
determined by the constituent siteset relations (a query sketch follows below)
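A sketch of walking a siteset's ancestry under this proposal, using the parentSitesetid column from the siteset table; @child is a hypothetical starting siteset id:

-- Walk from a siteset up through its parents to the root of its lineage.
DECLARE @child int;
SET @child = 42;
WITH ancestry AS (
    SELECT sitesetid, parentSitesetid
    FROM siteset
    WHERE sitesetid = @child
    UNION ALL
    SELECT s.sitesetid, s.parentSitesetid
    FROM siteset s
    JOIN ancestry a ON s.sitesetid = a.parentSitesetid
)
SELECT sitesetid, parentSitesetid FROM ancestry;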
Database schema learnings
The tall table has both advantages and disadvantages
Best for building a cube and most extensible as new
datumtypes or exdatumtypes et al are encountered
Queries are fairly complex to write and join multiple tables
to specify siteset, site, datumtype, exdatumtype, offset, and
repeat values (see the sketch below)
Views can simplify the query, but not the performance
Materialized views and pivot tables are too restricted for
practical application here
Build infrastructure for machine schemas and use the right
one for the right job
100+M rows in any table requires care
Clustered unique index to improve performance of common
queries
Updates affecting many rows can fill the log; allocate 2-5x
storage
Full backups are far simpler operationally, and sneakernetting is often a
good alternative
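A sketch of the kind of multi-table join such a query requires, assuming the tables in the schema diagrams; the literal names and ids below are placeholders:

-- Half-hourly GPP for one site, one siteset version, surface offset, first repeat.
SELECT t.[time], d.value
FROM data d
JOIN siteset     ss ON ss.sitesetid = d.sitesetid
JOIN site        s  ON s.siteid     = d.siteid
JOIN datumtype   dt ON dt.datumid   = d.datumid
JOIN exdatumtype ex ON ex.exdatumid = d.exdatumid
JOIN [offset]    o  ON o.offsetid   = d.offsetid
JOIN times       t  ON t.[time]     = d.[time]
WHERE ss.name      = 'ORNL November download'   -- siteset version (placeholder)
  AND s.shortname  = 'Tonzi'                    -- site (placeholder)
  AND dt.shortname = 'GPP'                      -- datumtype
  AND o.value      = 0                          -- offset of 0 cm
  AND d.repeatid   = 1;                         -- first repeat (placeholder)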
Data cube structure
Sample Cube Construction
Visual Studio UI for construction, deployment, and browsing
Datacube structure
Structure follows tall table organization
Each dimension corresponds to a foreign key
Time has multiple hierarchies
Dataset replaces siteset as it is the scientist-facing concept
Accumulating computed measures as we discover them
Median, standard deviation, variance can be expensive
HasDailyData, HasYearlyData, DailyValue, Average cheap
and handy
Development uses smaller subset cubes
Harvard Forest for L2 data and SeptemberORNL
development
VairaWillow for L2/L3/L4 data and NovemberORNL
development
Spreadsheets for Q/A checking
Minimalist MDX Query
SELECT
NON EMPTY { [Measures].[Average]} ON COLUMNS,
NON EMPTY {
[Timeline].[Year To Day].[Yydoy].&[2002-001]:
[Timeline].[Year To Day].[Yydoy].&[2002-365] }
ON ROWS
FROM [NovemberORNL]
WHERE( [Datumtype].[Datumtype].&[42],
[Site].[Site].&[29],
[Exdatumtype].[Exdatumtype].&[1],
[Dataset].[Dataset].&[6],
[Offset].[Offset].&[0 (cm)])
Retrieves daily average GPP values for Tonzi Ranch for 2002
Report Generation Cube Query
SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS,
NON EMPTY { ( [Site].[Site].[Site].ALLMEMBERS *
              [Datumtype].[Datumtype].[Datumtype].ALLMEMBERS *
              [Timeline].[Year To Month].[Yymmdd].ALLMEMBERS ) }
    DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS
FROM (
  SELECT ( { [Offset].[Offset].&[0 (cm)] } ) ON COLUMNS FROM (
    SELECT ( { [Exdatumtype].[Exdatumtype].&[1] } ) ON COLUMNS FROM (
      SELECT ( STRTOSET(@TimelineYear, CONSTRAINED) ) ON COLUMNS FROM (
        SELECT ( STRTOSET(@SiteSite, CONSTRAINED) ) ON COLUMNS FROM (
          SELECT ( STRTOSET(@DatumtypeDatumtype, CONSTRAINED) ) ON COLUMNS
          FROM [SeptemberORNL] )))))
WHERE ( IIF( STRTOSET(@TimelineYear, CONSTRAINED).Count = 1,
             STRTOSET(@TimelineYear, CONSTRAINED),
             [Timeline].[Year].currentmember ),
        [Exdatumtype].[Exdatumtype].&[1],
        [Offset].[Offset].&[0 (cm)] )
[Sample report: Site Daily Average Chart with parameters Datumtypes = GPP, Sites = CA - Tonzi Ranch, Years = 2004.]
Query used for report generation;
STRTOSET in the above retrieves the report parameters
Still retrieves daily average GPP values for Tonzi Ranch
Excel Cube Query
SELECT NON EMPTY
HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownLevel({[Datumtype].[Da
tumtype].[All]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39],
[Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36],
[Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31],
[Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29],
[Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26],
[Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23],
[Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21],
[Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18],
[Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60],
[Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7],
[Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2],
[Datumtype].[Datumtype].&[1]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40],
[Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37],
[Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32],
[Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55],
[Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57],
[Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24],
[Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54],
[Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19],
[Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13],
[Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9],
[Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5],
[Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]})) DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON COLUMNS , NON EMPTY
HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownMember({DrillDownLev
el({[Timeline].[Year To Day].[All]})}, {[Timeline].[Year To Day].[Year].&[2004]}))}, {[Timeline].[Year To
Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003],
[Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To
Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998],
[Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To
Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993],
[Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To
Day].[Year].&[1990]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005],
[Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To
Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999],
[Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To
Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994],
[Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To
Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]})) DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON ROWS FROM [NovemberORNL] WHERE ([Measures].[Average],
[Dataset].[Dataset].&[6], [Site].[Site].&[29], [Offset].[Offset].&[0 (cm)], [Exdatumtype].[Exdatumtype].&[1])
Query as generated by an Excel PivotTable
Still retrieves daily average GPP values for Tonzi Ranch
Datacube learnings
It’s fairly simple to build a cube
We’ve built lots
The GUI tool is straightforward and the tutorials help
It’s not so simple to build a good cube
Incorrectly specified hierarchies can give incorrect answers,
catastrophically poor performance, or both
All dimensions must be ordered for good usability
Build times can run to hours and/or time out
Aggregations are designed in and take extra configuration
Once aggregations are built, still need to watch the performance trace to
ensure that they’re used
Data cubes are best for daily or larger time periods. Use the database for
half-hour (leaf) measurements.
It’s easy to write a very slow MDX query
NON EMPTY is very important
It’s really easy to write an MDX query that is hard to understand
Knowing more about which queries are of interest to the scientists will
help us tune the existing cube and/or build targeted cubes
Data Staging Pipeline
Data Staging Pipeline
Scheduled download from website -> Convert to CSV canonical form -> Load CSV files into staging database -> Stage data decode and conversion -> Basic data checks -> Incremental data copy to active database
Data can be downloaded from internet sites regularly
Sometimes the only way to detect changed data is to compare with the
data already archived
The download is relatively cheap, the subsequent staging is expensive
New or changed data discovered during staging
Simple checksum before load
Chunk checksum after decode
Comparison query if requested
Decode stage critical to handle the uncontrolled vocabularies
Measurement type, location offset, quality indicators, units, derivation
methods often encoded in column headers
Different sites use different units
Incremental copy moves staged data to one or more sitesets
Automated via siteset:site:source mapping
Updates to Basic Data Checks
The applied checks will be constantly increasing
Waiting for descriptions of the current ORNL checks
Investigations pending (a range-check sketch follows below):
Per-site, per-datumtype limits (eg summer is hotter in Florida than in
Alaska)
Systematic time errors (eg GMT rather than local time)
Time-dependent checks for datumtypes: max, min, zero at night or
during the day
Unit conversions or other simple conversions (eg relative
humidity)
Add a new table to log the results of all checks
Enables standardized reporting longer term
Can be copied to the active database as part of the provenance metadata
for user access
Schema TBD
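A minimal sketch of such a per-site, per-datumtype range check; the limits table is hypothetical (the logging schema is still TBD, per the slide):

-- Flag staged values outside site- and datumtype-specific limits.
-- The limits table (siteid, datumid, minValue, maxValue) is hypothetical.
SELECT d.siteid, d.datumid, d.[time], d.value
FROM data d
JOIN limits l ON l.siteid = d.siteid AND l.datumid = d.datumid
WHERE d.value < l.minValue OR d.value > l.maxValue;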
Copying Only Changed Data
Proposed approach: throw processing at the problem
File checksum can be used for older, retired sites or slowly reporting sites
Only addresses the “no new or changed data” case
Recommend as a filter to suppress load rather than suppress copy.
Time (yearly?) chunked count and checksum can be used for older, unlikely-to-change
data (a sketch follows below)
Reduces the total data that must be explicitly compared
Recommended given the likely query time for a full comparison
Exhaustive timestep-by-timestep comparison between data in the archive database
and the staging database used to determine the amount of change
Limit the amount of returned data to bound the query time
Comparison only makes sense with “like” data (same processing and gap fill
method)
Desired behavior for each site:
If the target siteset is fixed, do nothing
If the target siteset is append only and the data are not changed, append
only the new data
If the target siteset is flexible, append new data and replace any changed
data
Otherwise, create a new siteset and copy all data
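A sketch of the yearly chunked count and checksum, assuming the data and timeline tables from the schema diagrams; results could be compared against the stored sitesetchecksums rows:

-- Per-year row count and checksum for one staged siteset; differences from the
-- stored sitesetchecksums rows indicate which years need a detailed compare.
SELECT tl.yy AS [year],
       COUNT(*) AS rowcnt,
       CHECKSUM_AGG(BINARY_CHECKSUM(d.datumid, d.[time], d.value)) AS chk
FROM data d
JOIN timeline tl ON tl.[time] = d.[time]
WHERE d.sitesetid = 42                 -- placeholder siteset id
GROUP BY tl.yy
ORDER BY tl.yy;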
L3/L4 European Quality and Gap-Filled Data
Additional column headings to denote quality flags, derived
quantity algorithm and gap fill algorithm
Example: Rg_f, Rg_fqc, qf_NEE_st, NEE_st_fNDS,
NEE_st_fMDSqc
• _f indicates fill
• qc indicates quality
• st, or, MDS, and ANN indicate algorithms
Approach: load, decode, convert, delete
Loading unchanged: quality flags treated as data values.
Decoding upgraded: additional column headers decoded
and a flag/data indicator added
Convert: builds the quality flag row and connects it with the
corresponding measurement
Delete: removes quality flag rows from the data table prior
to the data copy from the staging database
Work in Progress and Future Plans
Work in progress: improving current capabilities
Debug and deploy quality flags on L3/L4 data
Implement incremental data staging to enable speedy and
simple data editing by an actual scientist (rather than a
programmer)
Implement expanded metadata handling to enable scientists to
add site characteristics and sort sites on those expanded
definitions
Investigating integration with ArcGIS to enable spatial data analyses
Investigating MatLab, R2, and SPlus integration
Direct SQL queries are available from MatLab and R2
Given the query complexity, should we / can we build a
wizard?
Get BigPlot working again and invest in usability
Last summer’s effort was really research into feasibility
Future Plans: transitioning from prototype
Add data download
Current tabular report intended as a stopgap/technology
demonstration only
Likely starting point: adapting the Euroflux collaboration's
download-to-ftp zip file approach given its good usability and
compression
An alternative would be to write a CSV file converter via SSIS
CasJobs and MyDb to support individual scientist analysis
including data editing for correction, recalibration, etc
Implement expanded metadata handling to enable scientists to
add site characteristics and sort sites on those expanded
definitions
Automate MyCube construction to pair with MyDb
Internal Microsoft experience with real-life data suggests
this is possible and may even be fairly straightforward for
cubes with common dimensions.
Review current deployment for true DMZ deployment
Longer Term Futures
Handling imagery and other remote sensing information
Curating images is different from curating time series data
Using both together enables new science and new insights
Graphical selection and display of data
Support for user specified calculations within the database
We’ve done LE_Pot, but there are clearly others
Support for direct connections from analysis and statistical
packages to the cube as well as the database
Leverage stored procedures (SQL) and nice wizards
(MDX/SQL) to simplify query generation
Linkage with models
Additional (emerging) data standards such as NetCDF
Handling “just in time” data delivery and model result
curation
Need for workflow and single sign on credentials
Data mining subscription services
Handling of a broader array of data types