SRDSPres011113


Supersite Relational Database Project:
(Data Portal?)
a sub-project of the
St. Louis Midwest Supersite Project
Draft of the November 16, 2001 Presentation to the Supersite Program
Nov 13, 2001
• Web Service – input and output for the computer (not for the human)
• Data integration facilitated by output format suitable for merging
• Rudimentary Query and Viewing facilities
Prominent Members of ‘PM Information Federation’
Regional Planning Organizations (RPOs), Federal Land Managers (FLMs), EPA (Regulatory & Research), the Supersites, Industry, NARSTO, Academia, and other private and academic organizations.
Resource and Tools Sharing among RPOs and Others.
Scientific Rationale:
• Regional haze and its precursors have a 1000-10,000 km airshed.
• Emissions in one RPO domain influence the haze in surrounding RPO domains
• Cross-RPO knowledge sharing yields better science support for AQ management
Administrative Rationale:
• Some data and analyses are inherently national; breaking them up is extra work
• Data sharing and analysis tools can be re-used
Data Management at Supersites
• Most Supersites manage their respective data using SQL RDBMSs
• St. Louis, LA and Fresno coordinate their RDBMSs using the format of the CCAQS
• Other organizations follow a similar approach (Houston, IMPROVE)
• CAPITA has a wide range of regional auxiliary data sets on an open SQL Server
Proposed Strategy for the Supersite Relational Data System: Data Portal
• Make the individual Supersite SQL servers accessible from a Data Portal Server
• Devise a simple, uniform virtual schema that is common to all databases and allows queries by location, time, parameter and method
• Use the Data Portal to query and retrieve data for cross-Supersite integrative analysis.
Database Schema Design
• Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.
• Star Schema: consists of a central fact table surrounded by de-normalized dimensional tables (blue) describing the sites, parameters, methods, etc.
• Snowflake Schema: an extension of the star schema where each point of the star ‘explodes’ into further, fully normalized tables, expanding the description of each dimension.
• A snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data into a robust relational database.
Abstract (Minimal) Star Schema for Integrative, Cross-Supersite, Spatio-Temporal Analysis
To support integrative, cross-Supersite analysis with data queries by time, location and parameter, the database has to have time, location and parameter as dimensions.
• The minimal Site table includes SiteID, Name and Lat/Lon.
• The minimal Parameter table consists of ParameterID, Description and Unit.
• The time dimensional table is usually skipped since time is self-describing.
• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes: Obs_DateTime, Site_ID and Parameter_ID.
The above minimal (multidimensional) schema has been used in the CAPITA data exploration software, Voyager, for the past 22 years, encoding 1000+ datasets.
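As a concrete illustration, a minimal star schema along these lines could be declared as in the following T-SQL sketch; the table and column names follow the slide's terminology, but the data types and keys are illustrative assumptions rather than an actual Supersite schema:

```sql
-- Minimal star schema sketch (data types and keys are illustrative assumptions)
CREATE TABLE Site (
    Site_ID      INT PRIMARY KEY,
    Name         VARCHAR(100),
    Lat          FLOAT,
    Lon          FLOAT
);

CREATE TABLE Parameter (
    Parameter_ID INT PRIMARY KEY,
    Description  VARCHAR(200),
    Unit         VARCHAR(20)
);

-- Fact table: one observation per (time, site, parameter) combination
CREATE TABLE Data (
    Obs_DateTime DATETIME NOT NULL,
    Site_ID      INT      NOT NULL REFERENCES Site(Site_ID),
    Parameter_ID INT      NOT NULL REFERENCES Parameter(Parameter_ID),
    Obs_Value    FLOAT,
    PRIMARY KEY (Obs_DateTime, Site_ID, Parameter_ID)
);
```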
Most Supersite data require a more elaborate schema to fully capture the content
Extended Star Schema for SRDS
The Supersite program employs a variety of instrument, sampling and analysis procedures.
Hence, at least one additional dimension table is needed for Methods.
An example extended star schema encodes the IMPROVE relational database (B. Schichtel).
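A minimal way to extend the earlier sketch with such a Methods dimension is shown below; again, the names and types are illustrative assumptions, not the actual IMPROVE or SRDS schema:

```sql
-- Illustrative Methods dimension added to the minimal star schema
CREATE TABLE Method (
    Method_ID   INT PRIMARY KEY,
    Description VARCHAR(200)   -- instrument / sampling / analysis procedure
);

-- The fact table gains a Method_ID key, so the same parameter measured by
-- different methods can be stored and queried side by side.
ALTER TABLE Data ADD Method_ID INT REFERENCES Method(Method_ID);
```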
Snowflake Example: Central Calif. AQ Study, CCAQS
The CCAQS schema incorporates a rich set of parameters needed for QA/QC (e.g. sample tracking) as well as for data analysis.
The fully relational CCAQS schema permits the enforcement of integrity constraints and has been demonstrated to be useful for data entry/verification.
However, no two snowflakes are identical.
The rich snowflake schemata for one
sampling/analysis environment cannot
be easily transplanted elsewhere.
More importantly, many of the recorded parameters ‘on the fringes’ are not particularly useful for integrative, cross-Supersite, regional analyses.
Abstract Schema
• A common feature of all SQL databases for AQ data is that they can be queried along the spatial, temporal and parameter dimensions.
• However, the query needed to retrieve the same information depends on the schema of the particular database.
• A way to homogenize the distributed data is to access all the data through an abstract virtual schema.
• The abstract schema would use only a subset of the tables/fields from any particular database (red).
• The proposed abstract schema is the Minimal Star Schema, possibly expanded ….
• The final form of the abstract schema will be arrived at by consensus.
Subset used for Abstract Schema
Universal Query and DataSet using the Abstract DataAdapter
• The User selects an existing SQL database affiliated with the Supersite system.
• A Data Query is submitted to the database through a simple (virtual) Star Schema.
• The Abstract DataAdapter translates the abstract query into a specific query for the selected database (see the sketch below).
• The Abstract DataAdapter packages the returned data into a self-describing ‘universal’ DataSet package.
• A universally readable ‘DataSet’ is returned to the consuming data viewer/processor.
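As an illustration, an abstract query against the virtual Minimal Star Schema might look like the sketch below; the filter values are invented for the example, and the table/column names are those of the minimal schema sketched earlier:

```sql
-- Abstract query against the virtual star schema (illustrative filter values)
SELECT d.Obs_DateTime, s.Name, s.Lat, s.Lon, p.Description, p.Unit, d.Obs_Value
FROM   Data d
       JOIN Site      s ON s.Site_ID      = d.Site_ID
       JOIN Parameter p ON p.Parameter_ID = d.Parameter_ID
WHERE  p.Description = 'PM2.5 Mass'                          -- parameter dimension
  AND  s.Lat BETWEEN 36.0 AND 41.0                           -- spatial window
  AND  s.Lon BETWEEN -92.0 AND -86.0
  AND  d.Obs_DateTime BETWEEN '2001-07-01' AND '2001-07-31'; -- time range
```

The Abstract DataAdapter's role is to rewrite a request of this shape into whatever SQL (or other access method) the selected back-end database actually understands, and to package the result as a uniform DataSet.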
(Diagram: the query to the abstract star schema passes through the Abstract DataAdapter & Schema, which uses Web Services and custom components to reach SQLDataAdapter1/SQLServer1, SQLDataAdapter2/SQLServer2 or CustomDataAdapter/LegacyServer, and the data are returned as a universal DataSet.)
Federated Data Warehouse Features
• Data reside in their respective home environments where they can mature. ‘Uprooted’ data in separate databases are not easily updated, maintained or enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison
along the key dimensions (space, time, parameter, method)
• The open data query based on Web Services promotes the building of further value chains: Data Viewers, Data Integration Programs, Automatic Report Generators, etc.
• Data access through the Proxy server protects the data providers from security breaches and shields the data users from excessive detail.
Three-Tier Federated Data Warehouse Architecture
1. Provider Tier: Back-end servers containing heterogeneous data, maintained by the federation members
2. Proxy Tier: Retrieves designated Provider data and homogenizes it into common, uniform DataSets
3. User Tier: Accesses the Proxy Server and uses the uniform data for presentation, integration or processing
(Diagram: Federated Data Warehouse tiers – User Tier: data presentation and processing; Proxy Tier: data homogenization and transformation; Provider Tier: heterogeneous data in distributed SQL Servers.)
Federated Data Warehouse Interactions
• The Provider servers interact only with the Proxy Server in accordance with the Federation Contract
  – The contract sets the rules of interaction (accessible data subsets, types of queries)
  – Strong server security measures enforced, e.g. through Secure Socket Layer
• The data User interacts only with the generic Proxy Server using a flexible Web Services interface
  – Generic data queries, applicable to all data in the Warehouse (e.g. data sub-cube by space, time, parameter)
  – The data query is addressed to the Web Service provided by the Proxy Server
  – Uniform, self-describing data packages are passed to the user for presentation or further processing
(Diagram: Federated Data Warehouse – the User Tier consumes data for presentation, processing and integration; it reaches the Proxy Server through a Web Service with a uniform query & data interface; behind a firewall and the Federation Contract, the Proxy Tier homogenizes the heterogeneous data held by the Provider Tier's member servers: SQLDataAdapter1/SQLServer1, SQLDataAdapter2/SQLServer2 and CustomDataAdapter/LegacyServer.)
Data Query and Return through ‘Universal’ I/O
• DataQuery: universal for all data; GUI using maps and time charts
• DataAdapter: specific to the data source (here, IMPROVE @ CIRA) – in this case just a SQL statement (see the sketch below)
• DataReturn: universal format for all data; can be merged, rendered, processed
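For instance, the source-specific SQL statement issued by an IMPROVE-side DataAdapter could look something like the following; the table, column and parameter names here are purely hypothetical stand-ins, since the actual IMPROVE schema at CIRA is not shown in this presentation:

```sql
-- Hypothetical source-specific translation of the universal DataQuery
-- (table, column and parameter codes are invented for illustration)
SELECT v.ObsDate, st.SiteCode, st.Latitude, st.Longitude, v.Value
FROM   AerosolValues v
       JOIN Sites st ON st.SiteCode = v.SiteCode
WHERE  v.ParamCode = 'PM25'       -- parameter coding is source-specific
  AND  st.Latitude  BETWEEN 36.0 AND 41.0
  AND  st.Longitude BETWEEN -92.0 AND -86.0
  AND  v.ObsDate BETWEEN '2001-07-01' AND '2001-07-31';
```

The rows returned by such a statement are then repackaged into the universal DataReturn format, so the consumer never sees the source-specific names.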
Explore the prototype Federated PM/Haze Data Warehouse
Surface Bext @CAPITA
Technology of the Data Portal
• SQL Server, DTS
• .NET
Data Portal Prototype
• Queries to IMPROVE and CAPITA datasets