Unifying Diverse Watershed Data to Enable Analysis

Download Report

Transcript Unifying Diverse Watershed Data to Enable Analysis

C van Ingen, D Agarwal, M Goode, J Gupchup,
J Hunt, R Leonardson, M Rodriguez, N Li
Berkeley Water Center
John Hopkins University
Lawrence Berkeley Laboratory
Microsoft Research
University of California, Berkeley
Introduction
 Over the past year, we’ve been
exploring how to build and user a
digital watershed in the cloud
 Our focus is enabling end-user
analysis
 Assumes data access will get better
(thanks to CUAHSI and others)
 Bottoms up approach: start with
database and build to the tool
 Just in time approach: build tools to
solve science needs
 In the cloud to free the scientist from
any operational issues associated
with the technology we use
http:/www.berkeley.edu/RussianRiver
Hydrologic Data Analysis Pipeline
Distributed
Data Sets
Data
Transformations
Models, Analysis Tools
Data Gateway
Analysis Gateway
Data
Archive
Knowledge discovery,
Hypothesis testing, Water
Synthesis
Challenge is to Connect Data, Resources, and People
Dissemination
Data Flow Pipeline
Reports, Excel Pivot
Table, MatLab, ArcGIS
CSV Files
Agency web
site, streaming
sensor data, or
other source
BWC SQL Server
Database
BWC Data Cube
Key Schema Abstractions
 Data, ancillary data, and metadata
 Analyses often require combining time series data with fixed, or nearly fixed
ancillary data such as river mile, vegetative cover, sediment grain size
 Ancillary data used as fixed property, time series, or event time window
 Metadata describing algorithms, measurement techniques, etc.
 Normalized table structure simplifies adding variables and cube building
 Versioning and folder-like collections
 Accommodate algorithm changes, temporal granularity and derived quantities
 Track derivations through processing pipeline
 Define and track analysis “working set”
 Namespace translation
 Data assembly traverses different repositories each with own (useful?) name
space
 Some repositories encode metadata in variable name space (eg USGS turbidity)
Any access layer shares the same abstractions.
Database Schema Subset
siteset
PK
sitesetid
FK1
siteid
createTime
lastAppendTime
lastModifyTime
appendOnlyTime
fixTime
deleteTime
ingestChecksum
parentSitesetid
creatorid
name
description
howmade
path
datumtype
FK2
FK3
data
FK1
FK2
FK3
FK4
FK5
FK6
FK7
FK8
site
PK
siteid
...
name
...
sitesetid
siteid
datumid
value
time
exdatumid
repeat
offsetid
qualityid
PK
shortname
units
name
offsetunits
•
exdatumid
debris
investigatorid
offset
dataset_siteset
PK
...
name
...
FK1
FK2
datasetid
sitesetid
offsetid
repeat
PK
value
units
time
dataset
PK
FK2
PK
time
repeat
quality
PK
qualityid
datasetid
howmade
createTime
lastAppendTime
lastModifyTime
appendOnlyTime
fixTime
deleteTime
creatorid
name
description
qualityflags
gapflags
Star schema for data
similar to CUAHSI
ODM
Ancillary data
shredded like data
– Active over a time
range
– Numeric or text
– Flows to the data
cube as site attribute
or time series data
exdatumtype
PK
investigator
PK
•
datumid
•
Two level versioning
maps to data sourcing
– Bound into a dataset
version with spline
filter
– Only the dataset
flows to the
datacube
 A data cube is a database specifically for
data mining (OLAP)
 Organizes data along dimensions such as
time, site, or variable type
 Easy to group, filter, and aggregate data in
a variety of ways
 Simple aggregations such as sum, min, or
max can be pre-computed for speed
 Additional calculations such as median
can be computed dynamically
 SQL Server Analysis Services (SSAS)
provides the OLAP engine
 SQL Server Business Intelligence
Development Studio is used to define and
tune
 Excel and other client tools enable simple
browsing
 Minimizes total software burden writing
queries (SQL or MDX)
BIG SULPHUR C NR CLOVERDALE CA
Site
Discharge and Turbidity variability
WARM SPRINGS C NR ASTI CA
SANTA ROSA C NR SANTA ROSA CA
SANTA ROSA C A WILLOWSIDE RD NR
SANTA ROSA C A SANTA ROSA CA
RUSSIAN R NR UKIAH CA
RUSSIAN R NR REDWOOD VALLEY CA
RUSSIAN R NR HOPLAND CA
RUSSIAN R NR HEALDSBURG CA
RUSSIAN R NR GUERNEVILLE CA
RUSSIAN R NR CLOVERDALE CA
RUSSIAN R A GEYSERVILLE CA
RUSSIAN R A DIGGER BEND NR
POTTER VALLEY PH (TR ONLY) NR
POTTER VALLEY IRRIG CN E6 NR
POTTER VALLEY IRRIG CN E5 NR
POTTER VALLEY IRRIG CN 5+6 NR
PENA C NR GEYSERVILLE CA
MATANZAS C A SANTA ROSA CA
MAACAMA C NR KELLOGG CA
LAGUNA DE SANTA ROSA C NR
LAGUNA DE SANTA ROSA A STONY PT
FRANZ C NR KELLOGG CA
FELIZ C NR HOPLAND CA
EF RUSSIAN R TRIB NR POTTER VAL
EF RUSSIAN R NR UKIAH CA
EF RUSSIAN R AND POTTER VALLEY
DUTCHER C NR ASTI CA
DRY C TRIB NR HOPLAND CA
DRY C NR YORKVILLE CA
DRY C NR MOUTH NR HEALDSBURG
DRY C NR CLOVERDALE CA
COLGAN C NR SEBASTOPOL CA
20000
BIG SULPHUR C NR MIDDLETOWN CA
25000
AUSTIN C NR CAZADERO CA
Datacube Basics
30000
BIG SULPHUR C A G RESORT NR
Dataset USGS Surface Water Data Jan 2007 Datumtype Mean Discharge Quality All
Daily Discharge Availability by Site by Year
Each bar is a count of data points color coded by
reporting per year The higher the bar, the more
reported datal
Count
15000
Year
10000
5000
0
Learnings and Observations
2000
Annual Runoff [mm]
 Simplifying data discovery speeds analysis
 Discovery is a necessary precursor step to
analysis
 What data where when? At what quality?
 Versioning is critical
 Site-variable most naturally maps to analysis
patterns
 Dataset too coarse; individual measurement
too fine
 Ancillary data must be versioned as well
 Matching the scientific notion of time to
commercial tools can problematic
 Second month of water year has 30 days in US
 MODIS week
 Granularity widely varying
 Plan on decode stage for name, location, time,
quality
 Don’t forget historic (non-digital) data
Ukaih (100 sq mi)
Hopland (362 sq mi)
Cloverdale (503 sq mi)
Healdsburg (793 sq mi)
Guerneville (1338 sq mi)
1500
1000
500
0
0
500
1000
1500
Annual Precipitation [mm]
2000