Early Experience Prototyping a Science Data Server for Ameriflux
Ameriflux Scientific Data Server
Technology Overview
Catharine van Ingen, MSFT
([email protected])
11 December 2006
Outline
 Overview
 Database schema
 Data cube structure
 Data staging pipeline
 Work in progress and future plans
Overview
Ameriflux Collaboration Overview
 149 sites across the Americas
 Each site reports a minimum of 22 common measurements.
 Communal science – each principal investigator acts independently to prepare and publish data.
 Data published to and archived at Oak Ridge.
 Total data reported to date on the order of 160M half-hourly measurements (includes 3 processing levels).
 http://public.ornl.gov/ameriflux/
Ameriflux Scientific Data Server - Goals
 Act as a local repository for data and metadata assembled by a
small group of scientists from a wide variety of sources
 Simplify provenance by providing a common “safe deposit box”
for assembled data
 Interact simply with existing and emerging Internet portals for data
and metadata download, and, over time, upload
 Simplify data assembly by adding automation
 Reduce name space confusion by adding explicit decode translation
 Support basic analyses across the entire dataset for both data
cleaning and science
 Simplify mundane data handling tasks
 Simplify quality checking and data selection by enabling data
browsing
Ameriflux Scientific Data Server - Non-Goals
 Replace the large Internet data source sites
 The technology developed may be applicable, but the focus is on the group collaboration scale and usability
 Very large datasets require different operational practices
 Perform complex modeling and statistical analyses
 There are a lot of existing tools with established trust based on long track records
 Be a full LIMS (laboratory information management system); the server covers only part of one
 Develop a new standard schema or controlled vocabulary
 Other work on these is progressing independently
 Due to the heterogeneity of the data, more than one such standard seems likely to be relevant
Ameriflux Scientific Data Server - Workflows
 Staging: adding data or metadata
 New downloaded or field measurements added
 New derived measurements added
 Editing: changing data or metadata
 Existing older measurements re-calibrated or re-derived
 Data cleaning or other algorithm changes
 Gap filling
 Sharing: making the latest acquired data available rapidly
 Even before all the checks have been made
 Browsing new data before more detailed analyses
 Private Analysis: supporting individual researchers (MyDB)
 Stable location for personal calibrations, derivations, and other data transformations
 Import/Export to analysis tools and models
 Curating: data versioning and provenance
 Simple parent:child versioning to track collections of data used for specific uses
Scientific Data Server - Logical Overview
[Architecture diagram: local measurements and large data archives feed a staging database and cubes; data flow to the Latest Dataset database and cube, Last Known Good Dataset(s) database and cubes, and Older Dataset(s) archive database, with private (MyDb) analysis databases and cubes alongside. Data access and analysis is via flat file import/export, simple web data plots and tables, BigPlot data browsing, computational models, and analysis tools such as Excel, Matlab, SPlus, SAS, and ArcGIS.]
Databases: SQL Server 2005
 All data, descriptive ancillary data, and metadata held in relational databases
 Access via SQL query (a query sketch follows this slide)
 Metadata is important too but handled differently than data
 While separate databases are shown, the datasets actually reside in a single database today
 Mapping is transparent to the scientist due to dataset grouping and versioning
 Separate databases used for performance
 Unified databases used for simplicity
 New ancillary data and data are staged with a temporary database
 Minimal quality checks applied
 All name and unit conversions
 Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically via ODBC connection, or ?
 Database may be replicated for availability and load balancing
[Diagram: Staging, Older Dataset(s) Archive, Last Known Good Dataset(s), Latest Dataset, and MyDb Analysis databases]
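As an illustration of "access via SQL query", here is a minimal sketch of a query against the tall data table described in the schema section. Table and column names follow the schema diagrams; the site short name and exact spellings are assumptions, not the deployed DDL.

-- Illustrative only: half-hourly air temperature (TA) for one site in 2002.
SELECT d.[time], d.value, s.shortname AS site, dt.shortname AS datumtype, dt.units
FROM data d
JOIN site s       ON s.siteid   = d.siteid
JOIN datumtype dt ON dt.datumid = d.datumid
WHERE s.shortname = 'Tonzi'                       -- assumed site short name
  AND dt.shortname = 'TA'                         -- air temperature datumtype
  AND d.[time] >= '2002-01-01' AND d.[time] < '2003-01-01'
ORDER BY d.[time];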
Data cubes: SQL Server Analysis Services
 A data cube is a database specifically for data mining (OLAP)
 Simple aggregations (sum, min, max, count) can be pre-computed for speed
 Additional calculations (median) can be computed dynamically
 Both operate along dimensions such as time, site, or datumtype
 Constructed from a relational database via a Visual Studio 2005 project
 A specialized query language (MDX) is used
 Client tool integration is evolving
 Excel PivotTables allow simple data viewing
 More powerful charting with Tableau or ProClarity (commercial mining tools)
 Management is similar to, but different from, SQL Server
 Most tools are shared, but actual usage differs due to functional and organizational differences
 References:
 Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer) by Sivakumar Harinath and Stephen Quinn
 MDX Solutions: With Microsoft SQL Server Analysis Services by George Spofford
[Diagram: Staging Data Cube, Last Known Good Dataset(s) Cubes, MyDb Analysis Data Cubes, Latest Dataset Cube]
Simple plots: SQL Server Reporting Services
 Server-side rendering for simple tables and charts
 Initially developed for static rendering of database reports such as quarterly sales
 Report layout via Visual Studio 2005 project
 Report description language is XML and stored in a relational database
 Accessed via ASP.NET executing in IIS
 Actual data source can be either relational database or data cube
 Reports can be downloaded in various formats (e.g. Excel, jpg).
 Not intended for large quantities of data (>50,000 points) or highly customized graphics
 References: SQL Books Online (online product documentation)
[Example report screenshot: a "Download Site Ancillary Data" table for the Leaf area index (LAI) datumtype at CA - Tonzi Ranch and CA - Vaira Ranch, with columns Ancillary Datumtype, Site, Value, and Source (sources include Global Biogeochemical Cycles, Vol. 18, GB4002, 2004; Xu and Baldocchi, 2003; and http://www.cnr.berkeley.edu/biometlab/BiometWeb/FieldStudies.html), plus a link to the Ameriflux Fair Use Policy.]
Data Import/Export: SQL Server Integration Services
 Simple GUI-based tool generation for data migration, import, export
 Good for exploration and/or rapid script development
 Includes several format converters
 Eliminates the need for much of the “glue” code
 We’ve had mixed luck with it for more complex operations such as custom selection for export to flat file and/or ingesting all data from the ORNL web site.
 References:
 Extending SSIS 2005 with Script by Donald Farmer
 Professional SQL Server 2005 Integration Services (Programmer to Programmer) by Brian Knight et al
bwc.berkeley.edu Today
[Deployment diagram: the ORNL archive and flat files feed databases on bwc.berkeley.edu (TempDB, SeptemberORNL, NovemberORNL, EverythingORNL, Harvard Forest, Vaira, Willow); a ReportServer database and IIS website serve reports, and ODBC connections serve analysis tools such as Excel, Matlab, SPlus, SAS, and ArcGIS.]
Database Schema
Schema Overview
[Full schema diagram: data, site, investigator, site_investigator, dataset, siteset, dataset_siteset, sitesetchecksums, datumtype, exdatumtype, offset, repeat, quality, times/timeline, ancdata, and the staging/decode tables (siteStage, sitedecode, siteidStage, datumtypeStage, datumidStage, investigatordecode, columndecode), plus views such as dataView, dataPruneView, sitePruneView, siteView, sitesetview, offsetView, datumtypeView, and exdatumtypeView.]
Subsets of this schema will be considered in turn on the following slides.
L2 Data Decode
Discovered column headings are represented as:
[Datumtype][repeat][_offset][_offset][extended datumtype][units]
 Datumtype: the measurement primary datumtype.
 Example: TA, PREC, or LE.
 Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
 Example: TA2.
 [_offset][_offset]: major and minor part of the z offset.
 Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
 Extended datumtype: any remaining column text.
 Example: “fir”, “E”, “sfc”.
 Units: measurement units (should only be one per datumtype)
 Example: w/m2, or deg C.
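To make the decode concrete, here is a purely illustrative T-SQL sketch (not the actual decode code) that splits a simple three-part heading such as TA_10_7 into datumtype and offset parts; the real pipeline handles repeats, extended datumtypes, units, and variable part counts via the table-driven decode described later.

-- Illustrative only: split a heading of the form <datumtype>_<major>_<minor>.
-- PARSENAME numbers parts from the right, so this assumes exactly three parts.
DECLARE @heading varchar(64);
DECLARE @dotted  varchar(64);
SET @heading = 'TA_10_7';
SET @dotted  = REPLACE(@heading, '_', '.');
SELECT
    PARSENAME(@dotted, 3) AS datumtype,    -- 'TA'
    PARSENAME(@dotted, 2) AS offsetMajor,  -- '10'
    PARSENAME(@dotted, 1) AS offsetMinor;  -- '7'  (i.e., 10.7 m)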
L3/L4 European Data Decode
European discovered column headings can be represented as:
[qualityprefix][Datumtype][count][extended datumtype][qualitypostfix]
 QualityPrefix: “qf_” quality flags computed at level 3 processing
 Datumtype: measurement primary datumtype, although other text can be used (such as G for soil heat flux)
 Count: an optional number indicating a site-specific offset depth
 Replaces offset from the previous column decode
 We’ve asked that this be changed; it is likely an artifact of the original ORNL column format
 Extended datumtype: a datumtype-specific combination of additional modifying text OR derivation algorithm identifier
 QualityPostfix: “qc” quality flags computed at level 4 processing
Key issue: merging the quality data in columns with the correct data value column. Not all quality columns are 1:1 with data columns.
Data Table
 Data values are stored in a “tall” normalized or atomized table (sketched after this slide)
 Only valid data values are retained
 Each row contains exactly one value; all other columns are foreign keys to other tables
 Data are stored as single precision floating point values
 Horizontally partitioned tables are created for BigPlot
 Various views created for cube building and report generation
[Schema detail: the data table with foreign keys to siteset, site, datumtype, exdatumtype, offset, repeat, quality, and times]
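As a sketch of the tall-table idea, the data table might be declared roughly as below. Column names follow the schema diagram; the actual deployed DDL (types, nullability, partitioning) may differ.

-- Minimal sketch of the tall data table; every non-value column is a key.
CREATE TABLE data (
    sitesetid int      NOT NULL,  -- versioning collection (FK to siteset)
    siteid    int      NOT NULL,  -- FK to site
    datumid   int      NOT NULL,  -- FK to datumtype
    exdatumid int      NOT NULL,  -- FK to exdatumtype
    offsetid  int      NOT NULL,  -- FK to offset (z offset)
    repeatid  int      NOT NULL,  -- FK to repeat
    qualityid int      NOT NULL,  -- FK to quality
    [time]    datetime NOT NULL,  -- FK to times (half-hourly, site local)
    value     real     NOT NULL   -- single precision measurement value
);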
Times, repeat, offset
 Times table algorithmically generated on half-hour intervals (a generation sketch follows this slide)
 All times are site local
 Considering adding UTC via view with site-specific conversions or additional column
 Repeat table monotonically increases
 Offset table generated by discovery of staged data
 Views of offset and times used for cube and report menu generation
 Multiple time representations to support hierarchies
 Offsets ordered to simplify usage
[Schema detail: times (PK time), timeline view (yy, mm, dd, hh, mi, doy, yymm, yymmdd, yymmddhh, yymmddhhmi, yydoy, yydoyhh, yydoyhhmi), repeat (PK repeat), offset (PK offsetid, value, units), and offsetView (units, value, offsetid, totalcm, display)]
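A minimal sketch of the algorithmic generation, assuming a times table with a single time column; the production scheme also feeds the timeline view hierarchies.

-- Illustrative sketch: populate the times table with half-hour steps for 2002.
WITH halfhours AS (
    SELECT CAST('2002-01-01' AS datetime) AS t
    UNION ALL
    SELECT DATEADD(minute, 30, t) FROM halfhours
    WHERE t < '2002-12-31 23:30'
)
INSERT INTO times ([time])
SELECT t FROM halfhours
OPTION (MAXRECURSION 0);   -- a full year needs more than the default 100 recursion levels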
Datumtype and Exdatumtype
 Datumtype holds primary (not “other”) data datumtypes and all ancillary datumtypes (e.g. LAI)
 Columns for cube building and per-datumtype handling
 intoCube: determines whether datumtype flows to cube
 dailyCalc: indicates whether daily value should be average or cumulative
 repeatCalc: indicates whether repeats are averaged, ignored, or summed when building default cube
 siteCalc: determines how numeric site ancillary data should flow into the cube, e.g. LAI is max
 Ordered views used for cube building and report generation
[Schema detail: datumtype (datumid, shortname, units, name, offsetunits, ancdata, intoCube, dailyCalc, repeatCalc, siteCalc) and exdatumtype (exdatumid, debris, intoCube, name)]
Quality
 Holds datumtype specific European quality indicators
 Algorithmically generated given quality definitions
 Intended to be extensible
 qualityID indicates remaining column applicability
 Additional columns can be added as new quality metrics emerge
 Flows into cube as a quality dimension
 Will support selection and plotting by quality
 Still being debugged
[Schema detail: quality (qualityid, summaryEuro, lowVariabilityEuro, spikeDetectEuro, USTcorrectEuro, gapborderEuro, PIcontrolEuro, gapEuro)]
Site Ancillary Data
 Ancillary data holds site specific properties such as biome, climate, canopy height
 May be text or numeric
 Separate table necessary as multiple reported measurements exist due to different sources or changes over time
 Source column tracks provenance
 To flow into the cube, need an algorithm to select one and only one value (a selection sketch follows this slide)
 Leaf area index uses “max”
 Over time, could use start/stop times to generate values and treat as data
[Schema detail: ancdata (siteid, datumid, value, string, valuetype, starttime, stoptime, source) with foreign keys to site and datumtype]
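One way to implement the "one and only one value" rule for the cube is sketched below, using the ancdata and datumtype tables shown above and picking the maximum numeric LAI per site; the short name 'LAI' is an assumption.

-- Illustrative: select a single LAI value per site for cube loading (max rule).
SELECT a.siteid, MAX(a.value) AS lai
FROM ancdata a
JOIN datumtype dt ON dt.datumid = a.datumid
WHERE dt.shortname = 'LAI'       -- assumed short name for leaf area index
  AND a.value IS NOT NULL
GROUP BY a.siteid;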
Site, Investigator
 Site and investigator initially created by scrape of ORNL site
 Now maintained by manual query
 The site_investigator table connects the two and tracks the relevant start/stop time
 Investigator could be removed with GUID identifier to alternate (membership) database
 Deploying CasJobs or other workflow or authentication mechanism will necessitate this
[Schema detail: site (siteid, latitude, longitude, shortname, name, state, country, intocube), site_investigator (siteid, investigatorid, startTime, endTime, comment), and investigator (investigatorId, firstname, lastname, name, phone, fax, email, institution, address1-3, description)]
Datasets, Sitesets
 Datasets used to expose data versions such as NEE_or_MDS or September download
 Sitesets used to build datasets and support incremental data staging or other changes to existing data
 Original single level versioning scheme replaced by this two level scheme due to better mix/match ability with different data sources across datumtypes
[Schema detail: dataset (datasetid, createTime, lastAppendTime, lastModifyTime, appendOnlyTime, fixTime, deleteTime, creatorid, name, description, howmade), siteset (sitesetid, siteid, the same timestamps, parentSitesetid, creatorid, name, description, howmade, path), dataset_siteset (datasetid, sitesetid, comment), and sitesetchecksums (sitesetid, year, checksum)]
Decoding Tables
 The conversion from column header to decoded data is table driven
 Sites and column headers are discovered, then decoded.
 Allows for multiple different representations of the same site/datumtype.
 Checksums used to suppress load if newly scraped data file has not changed
[Schema detail: staging and decode tables - siteStage, sitedecode, siteidStage, datumtypeStage, datumidStage, investigatordecode, and columndecode]
Versioning: Sitesets and Datasets
 A siteset is a collection of similar data from a single site
 May include actual observations, measurements derived from observations by a known algorithm, or measurements derived from observations by a private (investigator specific) algorithm
 Derived from a single source – actual measurements, download from a national data set, simulation, or other
• The Ameriflux site is only one such source
 Sitesets are primarily internal bookkeeping
 A dataset is a collection of sitesets
 No restriction on sitesets in the collection, although some combinations won’t make scientific sense
 Datasets are the external scientist-facing “version”
 Reduces older data churn, maps to (likely) scientist data usage, avoids even more complex queries and views
Server and Researcher Sets
 Sitesets and datasets may be maintained by the server or maintained
by a researcher
 Server Sets are maintained on behalf of all researchers in the
collaboration
 Accessible to all researchers
 Set change policy decisions determined by the server
(administrator) on behalf of the collaboration
 Researcher Sets are maintained by a specific researcher or small
subset group of researchers in the collaboration
 Accessible only to owning researcher and/or collaborators
selected by the owner
 Set change decisions determined by the owner
Fixed, AppendOnly, and Dynamic Sets
 Fixed sets are used to define “known good” sets for long term data curation
 Contains data of a known quality as of a specific point in time
 Analogous to a software release
 Once fixed, always fixed
 A fixed dataset may only contain fixed sitesets
 AppendOnly sets are used to define “latest good” sets for common accesses
 Contains data of a known quality although additional data may be added at any time
 May not be necessary, but offers a little protection against inadvertent staging side effects
 An appendonly dataset may contain both fixed and append only sitesets
 Dynamic sets are used to define “latest” sets for rapid data access, active analysis or cleaning
 Contains data of a relatively well known quality.
 No restriction on changes
 A dynamic dataset may contain any siteset
[Diagram: which operations (add data, add new, modify, remove) are allowed on Dynamic, AppendOnly, and Fixed sets]
Dataset and Siteset Timestamps
 Significant changes to a dataset or siteset are tracked by timestamps
 CreateTime: dataset or siteset creation
 LastAppendTime: last addition of a new siteset to a dataset or new data to a siteset (includes new derived measurements, new measurement types, and recent data additions)
 LastModifyTime: last deletion or substitution of a siteset member of a dataset or change to older previously existing data values of a siteset
 FixTime: dataset or siteset converted to fixed
 DeleteTime: all dataset or siteset data deleted
 Notes:
 Intended to be used as “something has changed, query for more details” indicators only
 DeleteTime retained for some period to help with the “oops” cases, helping scientists determine what happened in inadvertent or forgotten scenarios
 All timestamps are server local time
Dataset and Siteset provenance
 creatorid tracks the investigator that created the dataset or siteset
 Includes any software automation, tracked via an investigator table entry
 Prepares for investigator private sets in MyDB/MyCube
 name is a (relatively) short text string for simple identification
 Commonly used to select a dataset when building a cube or in other user interaction
 description is a longer text string for reference
 Siteset howmade is a controlled vocabulary text string to track data origins
 Text used to simplify data viewing.
 Initial strings:
• ORNLsite: non-gap filled data downloaded from ORNL website
• ANN: gap-filled data using ANN method
• gap: gap-filled data using unknown or unspecified method
 Siteset path is the URL (if available) that was used when downloading data
 NULL used when URL unavailable
 Treated as a text string only, no sanity checking or format enforcement
Siteset and Dataset Scenarios
 Sitesets and datasets are just mechanisms; we also need policies to determine the actual behavior
 Usage cases of interest:
 New staged measurements
 Existing older measurements re-calibrated or re-derived
 Data cleaning or other algorithm changes
 New derived measurements added
 Gap filling
 Making the latest acquired data available rapidly
 Supporting individual researchers (MyDB)
 Data version provenance: parents and children
Example: data version provenance
 When making a new version, it’s good to track the original version
 Full provenance is possible, but very easily leads to poor usability
 Proposal: track siteset parents (an ancestor-query sketch follows this slide)
 A siteset has exactly one parent
 A siteset may have any number of children due to gap filling, other algorithm changes applied, or new data reported
 When a new siteset is made by combining two different sitesets, the combination can be determined by annotation only
 When a siteset is deleted, the youngest surviving ancestor becomes the parent of any surviving children
 Relationships between datasets are more complex and should be determined by the constituent siteset relations
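For illustration, a hedged sketch of walking a siteset's ancestor chain with a recursive query, using the siteset.parentSitesetid column from the schema; the starting siteset id is hypothetical.

-- Illustrative: list the ancestors of one siteset, nearest first.
DECLARE @child int;
SET @child = 42;   -- hypothetical siteset id
WITH ancestors AS (
    SELECT s.sitesetid, s.parentSitesetid, s.name, 0 AS generation
    FROM siteset s WHERE s.sitesetid = @child
    UNION ALL
    SELECT p.sitesetid, p.parentSitesetid, p.name, a.generation + 1
    FROM siteset p
    JOIN ancestors a ON p.sitesetid = a.parentSitesetid
)
SELECT sitesetid, name, generation
FROM ancestors
WHERE sitesetid <> @child
ORDER BY generation;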
Database schema learnings
 The tall table has both advantages and disadvantages
 Best for building a cube and most extensible as new datumtypes or exdatumtypes et al are encountered
 Queries are fairly complex to write and join multiple tables to specify siteset, site, datumtype, exdatumtype, offset, and repeat values
 Views can simplify the query, but not the performance
 Materialized views and pivot tables too restricted for practical application here
 Build infrastructure for machine schemas and use the right one for the right job
 100+M rows in any table requires care
 Clustered unique index to improve performance of common queries (sketched after this slide)
 Updates affecting many rows can fill the log; allocate 2-5x storage
 Full backups are far simpler operationally and footnetting often a good alternative
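A hedged sketch of the kind of clustered unique index meant above, using the tall data table columns shown earlier; the actual key column order in the deployed database is an assumption and may differ.

-- Illustrative: cluster the tall table on the columns most queries constrain,
-- ending with time so date-range scans over one series are sequential.
CREATE UNIQUE CLUSTERED INDEX cix_data
ON data (sitesetid, siteid, datumid, exdatumid, offsetid, repeatid, [time]);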
Data cube structure
Sample Cube Construction
Visual Studio UI for construction, deployment, and browsing
Datacube structure
 Structure follows tall table organization
 Each dimension corresponds to a foreign key
 Time has multiple hierarchies
 Dataset replaces siteset as it is the scientist-facing concept
 Accumulating computed measures as we discover them
 Median, standard deviation, variance can be expensive
 HasDailyData, HasYearlyData, DailyValue, Average cheap
and handy
 Development uses smaller subset cubes
 Harvard Forest for L2 data and SeptemberORNL
development
 VairaWillow for L2/L3/L4 data and NovemberORNL
development
 Spreadsheets for Q/A checking
Minimalist MDX Query
SELECT
NON EMPTY { [Measures].[Average]} ON COLUMNS,
NON EMPTY {
[Timeline].[Year To Day].[Yydoy].&[2002-001]:
[Timeline].[Year To Day].[Yydoy].&[2002-365] }
ON ROWS
FROM [NovemberORNL]
WHERE( [Datumtype].[Datumtype].&[42],
[Site].[Site].&[29],
[Exdatumtype].[Exdatumtype].&[1],
[Dataset].[Dataset].&[6],
[Offset].[Offset].&[0 (cm)])
Retrieves daily average GPP values for Tonzi Ranch for 2002
Report Generation Cube Query
SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS,
NON EMPTY { ( [Site].[Site].[Site].ALLMEMBERS *
              [Datumtype].[Datumtype].[Datumtype].ALLMEMBERS *
              [Timeline].[Year To Month].[Yymmdd].ALLMEMBERS ) }
  DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS
FROM (
  SELECT ( { [Offset].[Offset].&[0 (cm)] } ) ON COLUMNS FROM (
    SELECT ( { [Exdatumtype].[Exdatumtype].&[1] } ) ON COLUMNS FROM (
      SELECT ( STRTOSET(@TimelineYear, CONSTRAINED) ) ON COLUMNS FROM (
        SELECT ( STRTOSET(@SiteSite, CONSTRAINED) ) ON COLUMNS FROM (
          SELECT ( STRTOSET(@DatumtypeDatumtype, CONSTRAINED) ) ON COLUMNS
          FROM [SeptemberORNL])))))
WHERE ( IIF( STRTOSET(@TimelineYear, CONSTRAINED).Count = 1,
             STRTOSET(@TimelineYear, CONSTRAINED),
             [Timeline].[Year].currentmember ),
        [Exdatumtype].[Exdatumtype].&[1],
        [Offset].[Offset].&[0 (cm)] )
Report parameters shown: Site Daily Average Chart; Datumtypes: GPP; Sites: CA - Tonzi Ranch; Years: 2004
Query used for report generation; STRTOSET in the above retrieves the report parameters.
Still retrieves daily average GPP values for Tonzi Ranch, here for the selected year (2004 in this example)
Excel Cube Query
SELECT NON EMPTY
HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownLevel({[Datumtype].[Da
tumtype].[All]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39],
[Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36],
[Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31],
[Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29],
[Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26],
[Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23],
[Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21],
[Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18],
[Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60],
[Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7],
[Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2],
[Datumtype].[Datumtype].&[1]}))}, {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40],
[Datumtype].[Datumtype].&[39], [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37],
[Datumtype].[Datumtype].&[36], [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32],
[Datumtype].[Datumtype].&[31], [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55],
[Datumtype].[Datumtype].&[29], [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57],
[Datumtype].[Datumtype].&[26], [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24],
[Datumtype].[Datumtype].&[23], [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54],
[Datumtype].[Datumtype].&[21], [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19],
[Datumtype].[Datumtype].&[18], [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13],
[Datumtype].[Datumtype].&[60], [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9],
[Datumtype].[Datumtype].&[7], [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5],
[Datumtype].[Datumtype].&[2], [Datumtype].[Datumtype].&[1]})) DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON COLUMNS , NON EMPTY
HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(DrillDownMember({DrillDownLev
el({[Timeline].[Year To Day].[All]})}, {[Timeline].[Year To Day].[Year].&[2004]}))}, {[Timeline].[Year To
Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005], [Timeline].[Year To Day].[Year].&[2003],
[Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To
Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998],
[Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996], [Timeline].[Year To
Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994], [Timeline].[Year To Day].[Year].&[1993],
[Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To
Day].[Year].&[1990]}))}, {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005],
[Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002], [Timeline].[Year To
Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000], [Timeline].[Year To Day].[Year].&[1999],
[Timeline].[Year To Day].[Year].&[1998], [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To
Day].[Year].&[1996], [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994],
[Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992], [Timeline].[Year To
Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]})) DIMENSION PROPERTIES
PARENT_UNIQUE_NAME ON ROWS FROM [NovemberORNL] WHERE ([Measures].[Average],
[Dataset].[Dataset].&[6], [Site].[Site].&[29], [Offset].[Offset].&[0 (cm)], [Exdatumtype].[Exdatumtype].&[1])
Query as generated by an Excel PivotTable
Still retrieves daily average GPP values for Tonzi Ranch, here for 2004
Datacube learnings
 It’s fairly simple to build a cube
 We’ve built lots
 The GUI tool is straightforward and the tutorials help
 It’s not so simple to build a good cube
 Incorrectly specified hierarchies can give either incorrect answers or catastrophically poor performance or both
 Ordering all dimensions matters for good usability
 Build times can be hours and/or time out
 Aggregations are designed in and take extra configuration
 Once aggregations are built, still need to watch the performance trace to ensure that they’re used
 Data cubes best for daily or larger time periods. Use the database for half-hour (leaf) measurements.
 It’s easy to write a very slow MDX query
 NON EMPTY is very important
 It’s really easy to write an MDX query that is hard to understand
 Knowing more about what queries are of interest to the scientists will help us tune the existing cube and/or build targeted cubes
Data Staging Pipeline
Data Staging Pipeline
[Pipeline stages: Scheduled download from website → Convert to CSV canonical form → Load CSV files into staging database → Stage data decode and conversion → Basic data checks → Incremental data copy to active database] (a bulk-load sketch for the CSV load step follows this slide)
 Data can be downloaded from internet sites regularly
 Sometimes the only way to detect changed data is to compare with the data already archived
 The download is relatively cheap, the subsequent staging is expensive
 New or changed data discovered during staging
 Simple checksum before load
 Chunk checksum after decode
 Comparison query if requested
 Decode stage critical to handle the uncontrolled vocabularies
 Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers
 Different sites use different units
 Incremental copy moves staged data to one or more sitesets
 Automated via siteset:site:source mapping
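For the "Load CSV files into staging database" step, a minimal T-SQL sketch; the staging table name and file path are hypothetical.

-- Illustrative: load one canonical CSV file into a staging table.
BULK INSERT stagingData
FROM 'D:\staging\tonzi_2004_L2.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2        -- skip the column-heading row
);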
Updates to Basic Data Checks
 Applied checks will be constantly increasing
 Waiting for descriptions of current ORNL checks
 Investigations pending (a range-check sketch follows this slide):
 Per-site, per-datumtype limits (e.g. summer is hotter in Florida than in Alaska)
 Systematic time errors (e.g. GMT rather than local time)
 Time dependent checks for datumtypes: max, min, zero at night or day
 Unit conversions or other simple conversions (e.g. relative humidity)
 Add a new table to log results of all checks
 Enables standardized reporting longer term
 Can be copied to active database as part of provenance metadata for user access
 Schema TBD
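A hedged sketch of a per-site, per-datumtype range check against staged data; stagingData (with decoded siteid, datumid, time, value) and the datumtypeLimits table are hypothetical names, not part of the current schema.

-- Illustrative: flag staged values outside per-site, per-datumtype limits.
SELECT s.siteid, s.datumid, s.[time], s.value, l.minValue, l.maxValue
FROM stagingData s
JOIN datumtypeLimits l
  ON l.siteid = s.siteid AND l.datumid = s.datumid
WHERE s.value < l.minValue OR s.value > l.maxValue;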
Copying Only Changed Data
 Proposed approach: throw processing at the problem (a chunked-checksum sketch follows this slide)
 File checksum can be used for older, retired sites or slowly reporting sites
 Only addresses the “no new or changed data” case
 Recommend as a filter to suppress load rather than suppress copy.
 Time (yearly?) chunked count and checksum can be used for older, unlikely to change data
 Reduces the total data that must be explicitly compared
 Recommended given the likely query time for full comparison
 Exhaustive timestep-by-timestep comparison between data in the archive database and staging database used to determine the amount of change
 Limit amount of returned data to bound query time
 Compare only makes sense with “like” data (same processing and gap fill method)
 Desired behavior for each site:
 If the target siteset is fixed, do nothing
 If the target siteset is append only and the data are not changed, append only the new data
 If the target siteset is flexible, append new data and replace any changed data
 Otherwise, create a new siteset and copy all data
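A hedged sketch of the yearly chunked count-and-checksum idea against the tall data table; the siteset id is hypothetical, and the production scheme stores results in the sitesetchecksums table.

-- Illustrative: per-year row counts and checksums for one siteset,
-- compared against staging to find years that need a detailed comparison.
SELECT YEAR(d.[time]) AS yr,
       COUNT(*) AS rowsInYear,
       CHECKSUM_AGG(CHECKSUM(d.datumid, d.offsetid, d.repeatid, d.[time], d.value)) AS chk
FROM data d
WHERE d.sitesetid = 17          -- hypothetical siteset id
GROUP BY YEAR(d.[time])
ORDER BY yr;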
L3/L4 European Quality and Gap-Filled Data
 Additional column headings to denote quality flags, derived quantity algorithm, and gap fill algorithm
 Example: Rg_f, Rg_fqc, qf_NEE_st, NEE_st_fNDS, NEE_st_fMDSqc
• _f indicates fill
• qc indicates quality
• st, or, MDS, and ANN indicate algorithms
 Approach: load, decode, convert, delete
 Loading unchanged: quality flags treated as data values.
 Decoding upgraded: additional column headers decoded and flag/data indicator added
 Convert: builds quality flag row and connects it with the corresponding measurement
 Delete: removes quality flag rows from the data table prior to data copy from the staging database
Work in Progress and Future Plans
Work in progress: improving current capabilities
 Debug and deploy quality flags on L3/L4 data
 Implement incremental data staging to enable speedy and simple data editing by an actual scientist (rather than a programmer)
 Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
 Investigating integration with ArcGIS to get spatial data analyses
 Investigating MatLab, R2, and SPlus integration
 Direct SQL queries available from MatLab and R2
 Given the query complexity, should we / can we build a wizard?
 Get BigPlot working again and invest in usability
 Last summer’s effort was really research into feasibility
Future Plans: transitioning from prototype
 Add data download
 Current tabular report intended as stopgap/technology demonstration only
 Likely starting point: adapting the Euroflux collaboration download to an ftp zip file given the good usability and compression
 Alternate would be to write a CSV file converter via SSIS
 CasJobs and MyDb to support individual scientist analysis including data editing for correction, recalibration, etc
 Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
 Automate MyCube construction to pair with MyDb
 Internal Microsoft experience with real life data suggests this is possible and may even be fairly straightforward for cubes with common dimensions.
 Review current deployment for true DMZ deployment
Longer Term Futures
 Handling imagery and other remote sensing information
 Curating images is different from curating time series data
 Using both together enables new science and new insights
 Graphical selection and display of data
 Support for user specified calculations within the database
 We’ve done LE_Pot, but there are clearly others
 Support for direct connections from analysis and statistical packages to the cube as well as the database
 Leverage stored procedures (SQL) and nice wizards (MDX/SQL) to simplify query generation
 Linkage with models
 Additional (emerging) data standards such as NetCDF
 Handling “just in time” data delivery and model result curation
 Need for workflow and single sign on credentials
 Data mining subscription services
 Handling of a broader array of data types