Early Experience Prototyping a Science Data Server for

Download Report

Transcript Early Experience Prototyping a Science Data Server for

Early Experience
Prototyping a Science
Data Server for
Environmental Data
Catharine van Ingen (et al), MSFT
Microsoft eScience Workshop, Johns Hopkins
University
13 October 2006
Outline
• Background
– Data archives and other sources
– Typical small group collaboration needs
– Berkeley Water Center and Ameriflux
collaboration
• Science Data Server
– Goals and capabilities
– Approach
• Lessons Learned
Unprecedented Data
Availability
Typical Data Flow Today
• Prior to analysis, data and
ancillary data are must be
assembled, checked, and
cleaned
– Some of this is mundane
(eg unit conversions)
– Some requires domainspecific knowledge
including instrumentation
or location knowledge
– Ancillary data is often
critical to understanding
and using the data
• After all that, data are
often misplaced, scattered,
and even lost
– Provenance is in the mind
of the beholder
– “Everybody knows” yet no
one is sure
Internet Data
Archives
Local
Measurements
Large
Models
Legacy
Sources
Improved Data Flow
• Local repository for data
and ancillary data
assembled by a small
scientific collaboration
from a wide variety of
sources
Internet Data
Archives
– A common “safe deposit
box”
– Versioned and logged to
provide basic provenance
• Simple interactions with
existing and emerging
internet portals for data
and ancillary data
download, and, over time,
upload
– Simplify data assembly by
adding automation for
tracking and data
conversions
Local
Measurements
Large
Models
Legacy
Sources
Ameriflux Collaboration
Overview
• 149 Sites across the Americas
• Each site reports a minimum of
22 common measurements.
• Communal science – each
principle investigator acts
independently to prepare and
publish data.
• Second level data published to
and archived at Oak Ridge.
• Total data reported to date on the
order of 150M half-hourly
measurements.
• http://public.ornl.gov/ameriflux/
6
Example Carbon-Climate
Investigations
• Effects of early spring
on carbon uptake
• Role of vegetation,
climate, and latitude on
carbon flux
• Effect of airplane
groundings after 9/11
on atmospheric carbon
• Linkages to remote
sensing and other data
sets just beginning
T SOIL
T AIR
Onset of
photosynthesis
Soils
Climate
Observatory
datasets
Spatially
continuous
datasets
Example Carbon-Climate Datasets
Remote Sensing
Scientific Data Server
Goals
• Act as a local repository for data and metadata
assembled by a small group of scientists from a
wide variety of sources
– Simplify provenance by providing a common “safe
deposit box” for assembled data
• Interact simply with existing and emerging
internet portals for data and metadata download,
and, over time, upload
– Simplify data assembly by adding automation
– Simplify name space confusion by adding explicit decode
translation
• Support basic analyses across the entire dataset
for both data cleaning and science
– Simplify mundane data handling tasks
– Simplify quality checking and data selection by enabling
data browsing
Scientific Data Server
Non-Goals
• Replace the large internet data source sites
– The technology developed may be applicable, but the
focus is on the group collaboration scale and usability
– Very large datasets require different operational
practices
• Perform complex modeling and statistical
analyses
– There are a lot of existing tools with established trust
based on long track records
– Only part of a full LIMS (laboratory information
management system)
• Develop a new standard schema or controlled
vocabulary
– Other work on these is progressing independently
– Due to the heterogeneity of the data, more than one
such standard seems likely to be relevant
Scientific Data Server Logical
Overview
Data
Access
and
Analysis
Tools
Analysis Tools
Excel, Matlab,
SPlus, SAS,
ArcGIS
Simple web
data plots and
tables
Staging
Databases
and
Cubes
BigPlot data
browsing
Private
Data
Analysis
Databases
and
Cubes
Older Dataset(s)
Archive
Database
Last Known
Good Dataset(s)
Database
Last Known Good
Dataset(s) Cubes
Flat file data
import/export
Latest Dataset
Database
Computational
Models
Scientific Data Server
Latest
Dataset
Cube
Databases
• All data and descriptive or ancillary
data held in relational databases
• While separate databases are shown,
the datasets may actually reside in a
single database
– Mapping is transparent
– Separate databases used for
performance
– Unified databases used for simplicity
• New data and ancillary data are
staged with a temporary database
Older Dataset(s)
Archive
Database
Staging
Database
Last Known
Good Dataset(s)
Database
– Minimal quality checks applied
– All name and unit conversions
• Data may be exported to flat file,
copied to a private MyDb database,
directly accessed programmatically, or
?
Latest Dataset
Database
MyDb
Analysis
Database
Database Schema Subset
siteset
PK
sitesetid
FK1
siteid
createTime
lastAppendTime
lastModifyTime
appendOnlyTime
fixTime
deleteTime
ingestChecksum
parentSitesetid
creatorid
name
description
howmade
path
datumtype
FK2
FK3
data
FK1
FK2
FK3
FK4
FK5
FK6
FK7
FK8
site
PK
siteid
...
name
...
sitesetid
siteid
datumid
value
time
exdatumid
repeat
offsetid
qualityid
PK
shortname
units
name
offsetunits
exdatumtype
PK
exdatumid
debris
investigator
PK
datumid
investigatorid
offset
dataset_siteset
PK
...
name
...
FK1
FK2
datasetid
sitesetid
offsetid
repeat
PK
value
units
time
dataset
PK
FK2
PK
time
repeat
quality
PK
qualityid
datasetid
howmade
createTime
lastAppendTime
lastModifyTime
appendOnlyTime
fixTime
deleteTime
creatorid
name
description
qualityflags
gapflags
• Two level
versioning maps to
actual data usage
and data sourcing
– Bound into a
dataset version
with spline filter
– Only the dataset
flows to the
datacube
• Ancillary data
handled like data
– Active over a
time range
– Numeric or text
– Flows to the data
cube as site
attribute or time
series data
Data Cubes
•
•
A data cube is a database
specifically for data mining (OLAP)
– Initially developed for commercial
needs like tracking sales of Oreos
and milk
– Simple aggregations (sum, min, or
max) can be pre-computed for
speed
– Additional calculations (median) can
be computed dynamically
– Both operate along dimensions such
as time, site, or datumtype
– Constructed from a relational
database
– A specialized query language (MDX)
is used
Client tool integrations is evolving
– Excel PivotTables allow simple data
viewing
– More powerful charting with
Tableaux or ProClarity (commercial
tools)
– Browsing with SQL Report Server
Staging
Data
Cube
Last Known Good
Dataset(s) Cubes
MyDb
Analysis
Data
Cubes
Latest
Dataset
Cube
Server Workflows
•
Staging: adding data or metadata
•
Editing: changing data or metadata
•
Sharing: making the latest acquired
data available rapidly
•
•
– New downloaded or field
Internet Data
measurements added
Archives
– New derived measurements computed
– Existing older measurements recalibrated or re-derived
– Data cleaning, gap-filling or other
algorithm changes
– Even before all checks completed
– Browsing new data before more
detailed analyses
Private Analysis: Supporting
individual researchers (MyDB)
– Stable location for personal
calibrations, derivations, etc
– Import/Export to analysis tools and
models
Curating: data versioning and
provenance
– Simple parent:child versioning to track
collections of data used for specific
uses
Local
Measurements
Large
Models
Legacy
Sources
Data Staging Pipeline
Scheduled
download
from Website
Convert to
CSV
Canonical
Form
Load CSV
files into
Staging
Database
Stage Data
Decode
Basic Data
Checks
Incremental
Data Copy to
Active
Database
• Data can be downloaded from internet sites regularly
– Sometimes the only way to detect changed data is to compare with
the data already archived
– The download is relatively cheap, the subsequent staging is expensive
• New or changed data discovered during staging
– Simple checksum before load
– Chunk checksum after decode
– Comparison query if requested
• Decode stage critical to handle the uncontrolled vocabularies
– Measurement type, location offset, quality indicators, units, derivation
methods often encoded in column headers
• Incremental copy moves staged data to one or more sitesets
– Automated via siteset:site:source mapping
0
USA -- AK Upad, Alaska
USA -- KS Walnut River Watersh
USA -- MI Univ. of Mich. Biolo
USA -- WV Canaan Valley, West
USA -- WI Willow Creek, Wiscon
USA -- WI Park Falls/WLEF, Wis
USA -- WI Lost Creek, Wisconsi
USA -- WA Wind River Crane Sit
USA -- TN Walker Branch Waters
USA -- SD Brookings, South Dak
H
Rd
TAU
USA -- SD Black Hills, South D
USA -- OR Metolius-old aged po
USA -- OR Metolius-intermediat
USA -- OR Metolius-first young
USA -- OK Southern Great Plain
GPP
PRESS
TA
WS
USA -- OK Shidler, Oklahoma
USA -- OK Ponca City, Oklahoma
USA -- OK Little Washita Water
USA -- NE Mead - rainfed maize
USA -- NE Mead - irrigated mai
USA -- NE Mead - irrigated con
FPAR
PREC
SWC
WD
USA -- NC Duke Forest-hardwood
USA -- NC Duke Forest - loblol
USA -- MT Fort Peck, Montana
USA -- MS Goodwin Creek, Missi
USA -- MO Missouri Ozark Site
FH2O
PAR
SVP
VPD
USA -- MI Sylvania Wilderness
USA -- ME Howland Forest (main
USA -- MA Little Prospect Hill
USA -- MA Harvard Forest hemlo
USA -- MA Harvard Forest EMS T
FG
Other
SCO2
UW
USA -- IN Morgan Monroe State
USA -- IL Bondville, Illinois
USA -- FL Slashpine-Rayonier,m
USA -- FL Slashpine-Mize,clear
USA -- FL Slashpine-Donaldson,
FC
O3
Sa
UST
USA -- FL Slashpine-Austin Car
USA -- FL Florida-Kennedy Spac
USA -- FL Florida-Kennedy Spac
USA -- CT Great Mountain Fores
USA -- CO Niwot Ridge Forest,
USA -- CA Vaira Ranch, Ione, C
DT
NEE
Rn
U
USA -- CA Tonzi Ranch, Califor
USA -- CA Sky Oaks, Young Stan
USA -- CA Sky Oaks, Old Stand,
USA -- CA Blodgett Forest, Cal
USA -- AZ Audubon Research Ran
CO2
Leafwetness
RH
TS
USA -- AK Happy Valley, Alaska
USA -- AK Barrow, Alaska
USA -- AK Atqasuk, Alaska
Canada -- Lethbridge
Canada -- British Col., Campbe
Canada -- BOREAS NSA - Old Bla
APAR
LE
Rgl
Tdew
Canada -- BOREAS NSA - 1998 bu
Canada -- BOREAS NSA - 1989 bu
Canada -- BOREAS NSA - 1981 bu
Canada -- BOREAS NSA - 1963 bu
6,000,000
Canada -- BOREAS NSA - 1930 bu
7,000,000
Canada - Boreas 1850
Brazil -- Tapajos (Santarem,Km
Brazil -- Tapajos (Santarem,Km
Browsing for Data Availability
Total Data Availability by Site Colored by
Type
H2O
Rg
Tbole
5,000,000
4,000,000
3,000,000
2,000,000
1,000,000
Sites report more data either because of longevity or specific research interests
La
SantaremSantaremBOREAS NSA Campbell RiverLethbridge
UCI-1850 burn
UCI-1930 burn
UCI-1964 burn
UCI-1964 burn
UCI-1981 burn
UCI-1989 burn
UCI-1998 burn
UCI-2003 burn
La Selva
La Paz
Atqasuk
Barrow
Happy Valley
Ivotuk
Upad
Audubon
Santa Rita
Walnut Gulch
Blodgett Forest
Sky Oaks
Sky Oaks-Old
Sky OaksTonzi Ranch
Vaira Ranch
CRP grazed site
CRP minimumCRP ungrazed
Niwot Ridge
Great Mountain
Kennedy Space
Kennedy Space
Mangrove
SlashpineSlashpineSlashpine-MizeSlashpineNeal Smith
Bondville
Bondville
FermiLabFermiLabMorgan Monroe
Walnut River
Harvard Forest
Harvard Forest
Little Prospect
Howland Forest
Howland Forest
Howland Forest
KBS Crops
Northern
Sylvania
Univ. of Mich.
KUOM tower
Rosemount-C7
RosemountRosemountMissouri Ozark
Goodwin Creek
Fort Peck
Duke ForestDuke ForestDuke ForestNC_Clearcut
NC_Loblolly
Mead-irrigated
Mead-irrigated
Mead-rainfed
Bartlett
Cedar Bridge
Fort Dix
Silas Little
Valles Caldera
Oak Openings
ARM Southern
Little Washita
Ponca City
Shidler
Fir site
Metolius-Eyerly
Metolius-first
MetoliusMetolius-old
MetoliusBlack Hills
Brookings
Chestnut Ridge
Walker Branch
Freeman
Freeman Ranch
Freeman
Wind River
Lost Creek
Park
Willow Creek
Canaan Valley
GLEES
Sky Oaks-Post
La
SantaremSantaremBOREAS NSA Campbell RiverLethbridge
UCI-1850 burn
UCI-1930 burn
UCI-1964 burn
UCI-1964 burn
UCI-1981 burn
UCI-1989 burn
UCI-1998 burn
UCI-2003 burn
La Selva
La Paz
Atqasuk
Barrow
Happy Valley
Ivotuk
Upad
Audubon
Santa Rita
Walnut Gulch
Blodgett Forest
Sky Oaks
Sky Oaks-Old
Sky OaksTonzi Ranch
Vaira Ranch
CRP grazed site
CRP minimumCRP ungrazed
Niwot Ridge
Great Mountain
Kennedy Space
Kennedy Space
Mangrove
SlashpineSlashpineSlashpine-MizeSlashpineNeal Smith
Bondville
Bondville
FermiLabFermiLabMorgan Monroe
Walnut River
Harvard Forest
Harvard Forest
Little Prospect
Howland Forest
Howland Forest
Howland Forest
KBS Crops
Northern
Sylvania
Univ. of Mich.
KUOM tower
Rosemount-C7
RosemountRosemountMissouri Ozark
Goodwin Creek
Fort Peck
Duke ForestDuke ForestDuke ForestNC_Clearcut
NC_Loblolly
Mead-irrigated
Mead-irrigated
Mead-rainfed
Bartlett
Cedar Bridge
Fort Dix
Silas Little
Valles Caldera
Oak Openings
ARM Southern
Little Washita
Ponca City
Shidler
Fir site
Metolius-Eyerly
Metolius-first
MetoliusMetolius-old
MetoliusBlack Hills
Brookings
Chestnut Ridge
Walker Branch
Freeman
Freeman Ranch
Freeman
Wind River
Lost Creek
Park
Willow Creek
Canaan Valley
GLEES
Sky Oaks-Post
Browsing for Data Availability
Data Availability by Site
TA Data Availability
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
1990
GPP Data Availability
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
1990
Measuring temperature is easy; deriving ecosystem production problematic
Browsing for Data Quality
•
Real field data has unit and time
scale conversion problems
– Sometimes easy to spot in
isolation
– Sometimes easier to spot when
comparing to other data
– Browsing data values can give
rapid insight into how the data
can be used before more complex
analyses are performed
Local time or GMT time?
Maximum Annual Air Temperature
Global Warming or
Reporting in Fahrenheit?
Average Air Temperature at
Two Nearby Sites
Odd Microclimate Effects or
Error in Time Reporting ?
Browsing for Data
Applicability
• Real field data has both short
term gaps and longer term
outages due to instrument
outages
What’s going on at higher
latitudes?
(It should be getting colder)
25
20
15
Deg C
– The utility of the data
depends on the nature of the
science being performed
– Browsing data counts can
give rapid insight into how the
data can be used before more
complex analyses are
performed
30
10
5
0
20
30
40
50
60
70
80
-5
-10
Average Temperature
-15
Latitude
6000
55.86306 BOREAS NSA 1981 burn site
Data often missing in the winter!
55.879002 BOREAS NSA Old Black Spruce
5000
55.90583 BOREAS NSA 1930 burn site
4000
55.911671 BOREAS NSA 1963 burn site
3000
55.916672 BOREAS NSA 1989 burn site
56.63583 BOREAS NSA 1998 burn site
2000
69.133331 AK Happy
Valley
1000
70.281471 AK Upad
Data Count
0
70.496002 AK Atqasuk
1
2
3
4
5
6
7
8
9
10
11
12
Lessons Learned To Date
(1/2)
• Ancillary data is as important as data
– Comparing sites of like vegetation, climate as
important as latitude or other physical quantity
– Only some are numeric, most are debated,
some vary with time
– Curate the two together
• Controlled vocabularies are hard
– Humans like making up names and have a
hard time remembering 100+ names
– We’ve seen over 1000 column headings from
the ORNL Ameriflux site and the names have
changed over time
– Assume a decode step in the staging pipeline
• Data analysis and data cleaning are
intertwined
– Data cleaning is always on-going
– Some measurements can be used as indicators
of quality of other measurements
– Share the simple tools and visualizations
Lessons Learned To Date
(2/2)
•
There are at least three database schema families
and two cube construction approaches
– Everyone has a favorite
– Each has advantages and disadvantages
– Automate the maintenance and use the right one for
the right job
•
Visual programming tools are great for prototyping
•
There are at least three different visualization
needs
– But debugging and maintenance can hit a wall
– It’s easy to overbuild – use when “good enough”
– Browsing and careful charting each have their place
– Viewing very large (1M datapoints) datasets takes
care in both the rendering and the query
– Integrate with client tools, target server tools for
browsing
The saga continues at
http://dsd.lbl.gov/BWC/amfluxblog/ and
http://research.microsoft.com/~vaningen/BWC/B
WC.htm
Acknowledgements
Berkeley Water Center,
University of California,
Berkeley, Lawrence
Berkeley Laboratory
Deb Agarwal
Monte Good
Susan Hubbard
Matt Rodriguez
Yoram Rubin
Microsoft
Jim Gray
Tony Hey
Dan Fay
Stuart Ozer
SQL product team
Ameriflux Collaboration
Dennis Baldocchi
Beverly Law
Gretchen Miller
Tara Stiefl
Mattias Falk
Tom Boden