Transcript SRDSPres011127Tmp

Proposal Presentation to the Supersite Program
Supersite Relational Database Project:
‘Federated PM Data Warehouse’
a sub-project of
St. Louis Midwest Supersite Project, Jay Turner, PI
Rudolf Husar, PI
Center for Air Pollution Impact and Trend Analysis (CAPITA)
Washington University, St. Louis, MO
Nov 27, 2001
Purpose of the Project:
Design, Populate and Maintain a
Supersite Relational Database System
• Include monitoring data from Supersites and auxiliary projects
• Facilitate cross-Supersite [regional or comparative] data analyses
• Support analyses by a variety of research groups
EPA Specs of the Supersite Relational Data System (from RFP)
• Data Input:
– Data input electronically
– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
– Simple data structures, formats and convenient submission procedures
• Data Storage and Maintenance:
– Data storage in relational database(s), possibly distributed over multiple servers
– A catalog of data holdings and request logs
– Supersite data updates quarterly
• Data Access:
– User-friendly web-access by multiple authorized users
– Data query by parameter, method, location, date/time, or other metadata
– Multiple data output formats (ASCII, spreadsheet, others such as dbf or XML)
General Approach to SRDS Design
• Based on consensus, adopt a uniform relational data structure, suitable for
regional and cross-Supersite data integration and analysis.
• We propose a star schema with spatial, temporal, parameter and method
dimensions.
• The ‘original’ data are to be maintained at the respective providers or
custodians (Supersites, CIRA, CAPITA, ...).
• We propose the creation of flexible ‘adapters’ and web-submission forms for
the transfer of data subsets into the uniformly formatted ‘Federated Data
Warehouse’.
• Data users would access the data warehouse manually or through software.
• We propose data access using a modern ‘web services’ protocol, suitable for
building data viewers and processors (filtering, aggregation and fusion).
The ‘Minimal’ Star Schema
For integrative, cross-Supersite analysis and for data queries by time, location and
parameter, the database has to have time, location and parameter as dimensions:
• The minimal Site table includes SiteID, Name and Lat/Lon.
• The minimal Parameter table consists of ParameterID, Description and Unit.
• The time dimension table is usually skipped, since time is self-describing.
• The minimal Fact (Data) table consists of the Obs_Value and the three dimension codes: Obs_DateTime, Site_ID and Parameter_ID.
The above minimal (multidimensional) schema has been used in the CAPITA data exploration
software, Voyager, for the past 22 years, encoding 1000+ datasets.
Most Supersite data require a more elaborate schema to fully capture their content.
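A minimal sketch of this star schema in SQL DDL is shown below; the table and column names follow the slide, while the data types and keys are illustrative assumptions rather than the final consensus design:

```sql
-- Dimension table: monitoring sites (minimal: ID, name, coordinates)
CREATE TABLE Site (
    Site_ID      INT          NOT NULL PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    Lat          FLOAT        NOT NULL,
    Lon          FLOAT        NOT NULL
);

-- Dimension table: measured parameters (minimal: ID, description, unit)
CREATE TABLE Parameter (
    Parameter_ID INT          NOT NULL PRIMARY KEY,
    Description  VARCHAR(200) NOT NULL,
    Unit         VARCHAR(20)  NOT NULL
);

-- Fact table: one row per observation; time is self-describing,
-- so no separate time dimension table is used
CREATE TABLE Fact (
    Obs_DateTime DATETIME NOT NULL,
    Site_ID      INT      NOT NULL REFERENCES Site (Site_ID),
    Parameter_ID INT      NOT NULL REFERENCES Parameter (Parameter_ID),
    Obs_Value    FLOAT    NOT NULL,
    PRIMARY KEY (Obs_DateTime, Site_ID, Parameter_ID)
);
```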
From Heterogeneous to Homogeneous Schema
• Individual Supersite SQL databases have varied designs, usually following the ‘snowflake’ pattern
(see Database Schema Design for the Federated Data Warehouse).
• Though they have complicated schemata, these SQL servers can be queried along the spatial, temporal
and parameter dimensions. However, the query needed to retrieve the same information depends on the
schema of the particular database.
• A way to homogenize the distributed data is to access all the data through a Data Adapter, using only a
subset of the tables/fields from any particular database (shown in red).
• The proposed extracted uniform (abstract) schema is the Minimal Star Schema (possibly expanded).
The final form of the uniformly extracted data schema will be arrived at by consensus.
[Diagram: a Data Adapter extracts the used subset of tables/fields from each heterogeneous source and maps it into the uniform-schema Fact Table, i.e. extraction of homogeneous data from heterogeneous sources.]
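For illustration, a Data Adapter over a provider's SQL server can be as simple as a view that projects the provider's own snowflake tables onto the uniform star schema; the provider table and column names below are hypothetical:

```sql
-- Hypothetical provider tables (Measurements, Stations, Species)
-- mapped onto the uniform Fact layout of the warehouse
CREATE VIEW Uniform_Fact AS
SELECT m.SampleTime    AS Obs_DateTime,
       st.StationCode  AS Site_ID,
       sp.SpeciesCode  AS Parameter_ID,
       m.Concentration AS Obs_Value
FROM   Measurements m
       JOIN Stations st ON st.StationKey = m.StationKey
       JOIN Species  sp ON sp.SpeciesKey = m.SpeciesKey;
```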
Live Demo of the Data Warehouse Prototype
http://capita.wustl.edu/DSViewer/DSviewer.aspx
Currently online data are accessible from the
CIRA (IMPROVE) and CAPITA SQL servers
Uniform Data Query regardless of the native
schema: Query by parameter, location,
time, method
The hidden DataAdapter:
- accepts the uniform query
- accesses the data server
- transforms the original data into the uniform form
- delivers uniform DataSets
A rudimentary viewer displays the data in a
table for browsing.
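Such a uniform query, expressed against the minimal star schema, might look like the sketch below; the parameter, site and time window are made-up examples, not actual holdings:

```sql
-- Example uniform query: one parameter, one site, one-month time window
SELECT f.Obs_DateTime, s.Name, p.Description, p.Unit, f.Obs_Value
FROM   Fact f
       JOIN Site      s ON s.Site_ID      = f.Site_ID
       JOIN Parameter p ON p.Parameter_ID = f.Parameter_ID
WHERE  p.Description = 'PM2.5 Mass'
  AND  s.Name        = 'East St. Louis'
  AND  f.Obs_DateTime BETWEEN '2001-07-01' AND '2001-08-01';
```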
Federated Data Warehouse Architecture
• Three-tier architecture consisting of:
– Provider Tier: back-end servers containing heterogeneous data, maintained by the federation members
– Proxy Tier: retrieves designated Provider data and homogenizes it into common, uniform Datasets
– User Tier: accesses the Proxy Server and uses the uniform data for presentation, integration or processing
• The Provider servers interact only with the Proxy Server, in accordance with the Federation Contract:
– The contract sets the rules of interaction (accessible data subsets, types of queries)
– Strong server security measures are enforced, e.g. through Secure Socket Layer
• The data User interacts only with the generic Proxy Server, using a flexible Web Services interface:
– Generic data queries, applicable to all data in the Warehouse (e.g. a space, time, parameter data sub-cube)
– The data query is addressed to a Web Service provided by the Proxy Server of the Federation
– Uniformly formatted, self-describing data packages are handed to the user for presentation or further processing
[Diagram: Federated Data Warehouse. Provider Tier: member servers (SQLServer1, SQLServer2, LegacyServer) holding heterogeneous data behind a firewall and the Federation Contract. Proxy Tier: a Proxy Server with SQLDataAdapter1, SQLDataAdapter2 and a CustomDataAdapter performing data homogenization and exposing a Web Service with uniform query and data. User Tier: data access and use, i.e. data consumption for presentation, processing and integration.]
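One way a provider could enforce such a contract on its own SQL server is to expose only an agreed-upon view to the Proxy Server's login; the view name, login name and date restriction below are hypothetical:

```sql
-- Contract view: expose only the agreed data subset to the federation proxy
CREATE VIEW Contract_Fact AS
SELECT Obs_DateTime, Site_ID, Parameter_ID, Obs_Value
FROM   Fact
WHERE  Obs_DateTime >= '1999-01-01';  -- hypothetical contract restriction

-- The proxy's login may read the contract view and nothing else
GRANT SELECT ON Contract_Fact TO ProxyServerLogin;
```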
Data Re-Use and Synergy
• Data producers maintain their own workspace and resources (data, reports, comments).
• Part of the resources is shared by creating common virtual resources.
• Web-based integration of the resources can be across several dimensions:
– Spatial scale: local to global data sharing
– Data content: combination of data generated internally and externally
[Diagram: local and global users share part of their resources as Virtual Shared Resources (data, knowledge, tools, methods).]
• The main benefits of sharing are data re-use, data complementing and synergy.
• The goal of the system is to have the benefits of sharing outweigh the costs.
Data Entry to the Supersite Relational Data System:
1. Automatic translation and transfer of NARSTO-archived DES data to SQL
2. Web submission of relational tables by the data producers/custodians
3. Batch transfer of large auxiliary datasets to the SQL server
[Diagram: Supersite data flow into the NARSTO ORNL archive (DES, data ingest) and EPA EOSDIS; a DES-SQL Transformer, direct web input of tables, and Supersite & other SQL data feed the coordinated Supersite relational SQL server, which answers data queries through a DataAdapter and returns table output.]
Data Warehouse Features
• As much as possible, data should reside in their respective home environment.
‘Uprooted’ data in decoupled databases tend to decay, i.e. they cannot be easily
updated, maintained or enriched.
• Data from the providers will be transferred to the ‘federated data warehouse’
through (1) on-line DataAdapters, (2) manual web submission and (3) semi-automated transfer from the NARSTO archive.
• Retrieval of uniform data from the data warehouse facilitates integration and
comparison along the key dimensions (space, time, parameter, method).
• The open-architecture data warehouse (see Web Services) promotes the
building of further value chains: Data Viewers, Data Integration Programs,
Automatic Report Generators, etc.
Data Preparation Procedures:
• Data gathering, QA/QC and standard formatting are to be done by the individual projects
• The data exchange standards, data ingest and archives are handled by ORNL and NASA
• Data ingest is to be automated, aided by tools and procedures supplied by this project:
– NARSTO DES-SQL translator
– Web submission tools and procedures
– Metadata Catalog and I/O facilities
• Data submissions and access will be password protected, as set by the community.
• Submitted data will be retained in a temporary buffer space and, following verification,
transferred to the shared SQL database.
• Data access, submissions, etc. will be automatically recorded and summarized in
human-readable reports.
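A minimal sketch of this buffer-then-transfer step, assuming a hypothetical staging table with a verification flag:

```sql
-- Move verified observations from the temporary buffer to the shared fact table
INSERT INTO Fact (Obs_DateTime, Site_ID, Parameter_ID, Obs_Value)
SELECT Obs_DateTime, Site_ID, Parameter_ID, Obs_Value
FROM   Staging_Buffer
WHERE  Verified = 1;

-- Remove the transferred rows from the buffer
DELETE FROM Staging_Buffer
WHERE  Verified = 1;
```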
Data Catalog
• The Data Catalog will be maintained through the CAPITA website.
• Limited metadata (based on user consensus) will be recorded for each dataset.
• User feedback on individual datasets will be collected through comments/feedback pages.
• An example is the data section of the St. Louis Supersite website.
Related CAPITA Projects
• EPA Network Design Project (~$150K/yr – April 2003). Development of novel quantitative
methods of network optimization. The network performance evaluation is conducted using
the complete PM FRM data set in AIRS, which will be available for input into the SRDS.
• EPA WebVis Project (~$120K/yr – April 2003). Delivery of current visibility data to the public
through a web-based system. The surface met data are being transferred into the SQL
database (since March 2001) and will be available to the SRDS.
• NSF Collaboration Support Project (~$140K/yr – Dec 2004). Continuing development of
interactive web sites for community discussions and for web-based data sharing (directly
applicable to this project).
• NOAA ASOS Analysis Project (~$50K/yr – May 2002). Evaluation of the potential utility of the
ASOS visibility sensors (900 sites, one-minute resolution) as a PM surrogate. Data now
available for April-October 2001 can be incorporated into the Supersite Relational Data
System.
• St. Louis Supersite Project website (~$50K/yr – Dec 2003). The CAPITA group maintains
the St. Louis Supersite website and some auxiliary data. It will also be used for this project.
Federated Data Warehouse Applications
[Diagram: the Data Warehouse Tier (multidimensional SQL tables, OLAP cubes, GIS/satellite/vector data, text data and web pages served through XML Web Services, OpenGIS Services and HTTP Services) is linked through Connection, Cursor-Query, Data Access and Data View Managers to the Data View & Process Tier with layered map, time chart, scatter chart and text/table views.]
• Distributed data of multiple types (spatial, temporal, text)
• The Broker handles the views, connections, data access and cursor
• Data are rendered by linked Data Views (map, time, text)
Supersite Relational Data System: Schedule
[Schedule chart, Year 1 (2002) through Year 3 (2004): RDMS Design, Feedback, Implementation & SQL Testing, Supersite Data Entry, Auxiliary Data Entry, Other Coordinated Data Entry, and Supersite, Coordinated and Auxiliary Data Updates.]
• The first four months are devoted to the design of the relational database, the associated data
transformers and I/O; the design will be submitted to the Supersite workgroups for comment
• In six months, Supersite data preparation and entry begins
• In Year 2 and Year 3, data sets will be updated by providers as needed; the system will be accessible
to the data user community
Personnel, Management and Facilities
Personnel
• PI: R. B. Husar (10%); Kari Hoijarvi (25%). Software experience at CAPITA, Microsoft, Visala.
• 20% of the project budget ($12k/yr) goes to consultants: J. Watson, DRI; W. White and J. Turner, WU.
• Collaborators (CAPITA associates): B. Schichtel, CIRA; S. Falke, EPA; M. Bezic, Microsoft.
Management
• This project is a sub-project of the St. Louis-Midwest Supersite project, Dr. Jay Turner, PI.
• Special focus is on supporting large-scale, crosscutting and integrative analysis.
• This project will leverage the other CAPITA data sharing projects.
Resources and Facilities
• CAPITA has the ‘largest known privately held collection of air quality, meteorological and emission
data’, available in uniform Voyager format and extensively accessed from the CAPITA website.
• The computing and communication facilities include two servers, ten workstations and laptops,
connected internally and externally through high-speed networks.
• Software development tools include Visual Studio, part of the .NET distributed development
environment.