On-line Analytical Processing OLAP

Download Report

Transcript On-line Analytical Processing OLAP

Database Schema Design
for the
Federated Data Warehouse
Nov 27, 2001, RBH
The ‘Minimal’ Star Schema
For integrative, cross-Supersite analysis, data queries by time, location and
parameter, the database has to have time, location and parameter as dimensions
•
The minimal Site table includes SiteID, Name and
Lat/Lon.
•
The minimal Parameter table consists of
ParamterID, Description and Unit
•
The time dimensional table is usually skipped
since time is self-describing
•
The minimal Fact (Data) table consists of the
Obs_Value and the three dimensional codes for
Obs_DateTime, Site_ID and Parameter_ID
The above minimal (multidimensional) schema was used in the CAPITA data exploration
software, Voyager for the past 22 years, encoding 1000+ datasets.
Most Supersite data require a more elaborate schema to fully capture the content
Database Schema Design
•
Fact Table: A fact table (yellow) contains the main
data of interest, i.e. the pollutant concentration by
location, day, pollutant and measurement method.
•
Star Schema consists of a central fact table
surrounded by de-normalized dimensional tables
(blue) describing the sites, parameters, methods..
•
Snowflake Schema is an extension of the star
schema where each point of the star ‘explodes’ into
further fully normalized tables, expanding the
description of each dimension.
•
Snowflake schema can capture all the key data
content and relationships if full detail. It is well
suited for capturing and encoding complex
monitoring data into a robust relational database.
Extended Star Schema for SRDS
The Supersite program employs a variety of instrument/sampling/procedures
Hence, at least one additional dimension table is needed for Methods
A example extended star schema encodes the IMPROVE relational database (B. Schichtel)
Snowflake Example: Central Calif. AQ Study, CCAQS
CCAQS schema incorporates a rich set of
parameters needed for QA/QC (e.g.
sample tracking) as well as for data
analysis.
The fully relational CCAQS schema permits
the enforcing of integrity constraints and
it has been demonstrated to be useful for
data entry/verification.
However, no two snowflakes are identical.
Similarly, the rich snowflake schemata
for one sampling/analysis environment
cannot be easily transplanted elsewhere.
More importantly, many of the recorded
parameters ‘on the fringes’ are not
particularly useful for integrative, crosssupersite, regional analyses.
Hence the shared ( exposed) subset of the
entire data set may consist of a small
subset of the ‘snowflake’
From Heterogeneous to Homogeneous Schema
•
Individual Supersite SQL databases can be queried along spatial, temporal and parameter
dimensions. However, the query to retrieve the same information depends on the of the particular
database.
•
A way to homogenize the distributed data is access all the data through a Data Adapter using only a
subset of the tables/fields from any particular database (red)
•
The proposed extracted uniform (abstract) schema is the Minimal Star Schema, (possibly expanded).
The final form of the uniformly extracted data schema will be arrived at by consensus.
Subset used
Uniform Schema
Fact
Table
Data Adapter
Extraction of
homogeneous data from
heterogeneous sources
Data Warehouse Features
• As much as possible data should reside in their respective home environment.
‘Uprooted’ data in decoupled databases tend to decay ie can not be easily
updated, maintained, enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison
along the key dimensions (space, time, parameter, method)
• The open architecture data warehouse (based on Web Services) promotes the
building of further value chains: Data Viewers, Data Integration Programs,
Automatic Report Generators etc..