DvoyDatabaseIdeas

Download Report

Transcript DvoyDatabaseIdeas

Dvoy Database Ideas
• Heterogeneous to homogeneous
• Homogenization by applying uniform schema: Multidimensional data
model
• User queries are directed toward the nDim data cubes, generally for
extraction of data slices along specific dimensions.
• The slices represent data views, e.g. map view, time view
– Fore relational data, homogenization is by view-specific adding queries
– For unstructured data, metadata is added to describe the structure, so the
agent can retrieve the data for views
Multi-dimensional Data Access
k
Data Granule
Selection
j
i
i
j
i
1D Dataset
2D Dataset
3D Dataset
e.g. Time selector
e.g. Param & Time selector
e.g. Param, Time & Location
• In array notation, the granule ‘value’ is accessed as
– MyGranule = My1DArray(i)
– MyGranule = My2DArray(i,j)
– MyGranule = MynDArray(i,j,…..n)
• In order to select a data granule, a controller is assigned to each data dimension
Multi-Dimensional Data Model
1 Dimensional
i
e.g. Time dimension
Data Space
View 2
j
1 Dimensional
e.g. Location & Time
i
View 1
Data Granule
k
1 Dimensional
e.g. Location, Time &
Parameter
i
j
Data can be distributed
over 1,2, …n dimensions
Views are orthogonal slices through
multidimensional data cubes
Spatial and temporal slices through the
data are most common
Database Schema Design
• Fact Table: A fact table (yellow) contains the
main data of interest, i.e. the pollutant
concentration by location, day, pollutant and
measurement method.
• Star Schema consists of a central fact table
surrounded by de-normalized dimensional tables
(blue) describing the sites, parameters,
methods..
• Snowflake Schema is an extension of the star
schema where each point of the star ‘explodes’
into further fully normalized tables, expanding
the description of each dimension.
• Snowflake schema can capture all the key
data content and relationships if full detail. It
is well suited for capturing and encoding
complex monitoring data into a robust relational
database.
IMPROVE Relational Database Schema (Tentative)
For the IMPROVE data, an extended star schema captures the needed information
The dimensional tables (Site, Parameter, Method) are partly normalized (further
broken into relational tables)
Snowflake Example: Central Calif. AQ Study, CCAQS
CCAQS schema incorporates a rich set
of parameters needed for QA/QC
(e.g. sample tracking) as well as for
data analysis.
The fully relational CCAQS schema
permits the enforcing of integrity
constraints and it has been
demonstrated to be useful for data
entry/verification.
However, no two snowflakes are
identical. Similarly, the rich
snowflake schemata for one
sampling/analysis environment
cannot be easily transplanted
elsewhere.
More importantly, many of the recorded
parameters ‘on the fringes’ are not
particularly useful for integrative,
cross-supersite, regional analyses.
Catalog of Multidimensional Datasets
•
•
•
•
•
Designed to Publish, Find(Select), Bind(Access) and Render distributed datasets
Publishing is though an open web interface for user/broker registration of datasets
Finding a dataset is aided by a metadata on Provider and Dataset
Binding (data access) information is contained in the Dimensional Tables
Rendering parameters are also contained in the Dimensional Tables
Minimal Star Schema for Spatio-Temporal Data
In order to respond to data queries by time, location and parameter,
the database has to have time, location and parameter as dimensions
•
The minimal Site table includes SiteID, Name
and Lat/Lon.
•
The minimal Parameter table consists of
ParamterID, Description and Unit
•
The time dimensional table is usually skipped
since time is self-describing
•
The minimal Fact (Data) table consists of the
Observation value and the three dimensional
codes for DateTime, Site and Parameter
•
The above minimal multidimensional schema
was used in the CAPITA data exploration
software, Voyager.
From Heterogeneous to Homogeneous Schema
•
Individual Supersite SQL databases can be queried along spatial, temporal and parameter dimensions. However, the query to retrieve the same information
depends on the of the particular database.
•
A way to homogenize the distributed data is access all the data through a Data Adapter using only a subset of the tables/fields from any particular database (red)
•
The proposed extracted uniform (abstract) schema is the Minimal Star Schema, (possibly expanded). The final form of the uniformly extracted data schema will
be arrived at by consensus.
Subset used
Uniform Schema
Fact
Table
Data Wrapper
Extraction of
homogeneous data from
heterogeneous sources
Remote data access configuration for DATAFED
CGI or WS call
[[param_list]]
Wrapper Software
[[filter]]
Fill template
Call SQL
Dvoy
Return Data
Dataset
XML or CSV
Firewall
•
SQL query
template
Connection between DATAFED and remote server can be either through Web Services or CGI call.
SQL Server
Spatial Query Template for DVoy
SELECT
SiteTable.Longitude * (180.0/PI()) AS Lon,
SiteTable.Latatitude * (180.0/PI()) AS Lat,
SiteTable. Site_ID,
[[param_list]]
FROM
DataFactTable RIGHT OUTER(??) JOIN SiteTable
ON DataFactTable.Site_ID = SiteTable.Site_ID
WHERE [[filter]]
ORDER BY
SiteTable.Lon ASC
Default [[param_list]] – list of fields to be extracted in the table
DataFactTable. [[param_abbr]]
AS VALUE
Default [[filter]] (for spatial query, datetime_min = datatime_max):
(DataFactTable.Date_Time BETWEEN '[[datetime_min]]' AND '[[datetime_max]]')
•
•
•
Items in red brackets [[item]] are replaced by the code at run time.
SiteTable and DataFactTable names to be replaced by the actual table names
The JOIN can be an inner join for some tables
Purpose of the Supersite Relational Database System
Design, populate and maintain a database which:
– Includes monitoring data from Supersites and auxiliary projects
– Facilitates cross-Supersite [regional or comparative] data analyses
– Supports the analyses by a variety of research groups
Stated Features of Relational Data System
• Data Input:
– Data input electronically through FTP, Web browser, (CD, if necessary)
– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
– Data structures, formats and submission procedures simple for the submitters
• Data Storage and Maintenance:
– Data stored in relational database(s), possibly distributed over multiple servers
– Maintenance of data holdings catalog and and request logs
– Data updates quarterly
• Data Access:
–
–
–
–
Access method: User-friendly web-access by multiple authorized users
Data finding: Metadata catalog of datasets
Data query: by parameter, method, location, date/time, or other metadata
Data output format: ASCII, spreadsheet, other (dbf, XML)
Snowflake Example: Central Calif. AQ Study, CCAQS
The CCAQS provides a good example of a snowflake relational schema.
It incorporates a rich set of parameters needed for QA/QC as well as for data analysis
The ‘branches’ and ‘leafs’ contain tables for QA/QC and for detailed dimensional metadata
The fully relational snowflake schema permits the enforcing of integrity constraints
However, its hard to apply an elaborate snowflake schema for other sampling/analysis
conditions
Data Preparation Procedures:
• Data gathering, QA/QC and standard formatting is to be done by individual
projects
• The data exchange standards, data ingest and archives are by ORNL and
NASA
• Data ingest is to automated, aided by tools and procedures supplied by this
project
– NARSTO DES-SQL translator
– Web submission tools and procedures
– Metadata Catalog and I/O facilities
• Data submissions and access will be password protected as set by the
community.
• Submitted data will be retained in a temporary buffer space and following
verification transferred to the shared SQL database.
• The data access, submissions etc. will be automatically recorded an summarized
Database Schema Design
for the
Federated Data Warehouse
Nov 27, 2001, RBH
Data Warehouse Features
• As much as possible data should reside in their respective home
environment. ‘Uprooted’ data in decoupled databases tend to
decay ie can not be easily updated, maintained, enriched.
• Abstract (universal) query/retrieval facilitates integration and
comparison along the key dimensions (space, time, parameter,
method)
• The open architecture data warehouse (based on Web Services)
promotes the building of further value chains: Data Viewers,
Data Integration Programs, Automatic Report Generators etc..
Proposal Presentation to the Supersite Program
Nov 30, 2001
Supersite Relational Database System (SRDS)
a sub- project of
St. Louis Midwest Supersite Project, Jay Turner, PI
SQL Data Servers
Subset used
Uniform Schema
User
Fact
Data Adapter
Table
Rudolf Husar, PI
Center for Air Pollution Impact and Trend Analysis (CAPITA)
Washington University, St. Louis, MO
Purpose of the Project:
Design, Populate and Maintain a
Supersite Relational Database System
• Facilitate cross-Supersite, regional, comparative data analyses
• Support analyses by a variety of research groups
• Include monitoring data from Supersites and auxiliary projects
EPA Specs of the Supersite Relational Data System (from RFP)
• Data Input:
– Data input electronically
– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
– Simple data structures, formats and convenient submission procedures
• Data Storage and Maintenance:
– Data storage in relational database(s), possibly distributed over multiple servers
– A catalog of data holdings and request logs
– Supersite data updates quarterly
• Data Access:
– User-friendly web-access by multiple authorized users
– Data query by parameter, method, location, date/time, or other metadata
– Multiple data output formats (ASCII, spreadsheet, other (dbf, XML)
General Approach to SRDS Design
• Based on consensus, adopt a uniform relational data structure, suitable for
regional and cross-Supersite data integration and analysis.
• We propose a star schema with spatial, temporal, parameter and method
dimensions.
• The ‘original’ data are to be maintained at the respective providers or
custodians (Supersites, CIRA, CAPITA...).
• We propose the creation of flexible ‘adapters’ and web-submission forms for
the transfer of data subsets into the uniformly formatted ‘Federated Data
Warehouse’.
• Data users would access the data warehouse manually or through software.
• We propose data access using modern ‘web services’ protocol, suitable for
adding data viewers, processors (filtering, aggregation and fusion) and
other value-adding processes.
From Heterogeneous to Homogeneous Schema
•
Individual SQL databases have varied designs, usually following a more elaborate ‘snowflake’ pattern
(see Database Schema Design for the Federated Data Warehouse).
•
Though they have more complicated schemata, these Supersite SQL servers can be queried along
spatial, temporal, parameter, method dimensions. However, the query to retrieve the same information
depends on the particular database schema.
•
A way to homogenize the distributed data is by accessing all the data through a Data Wrapper which
accesses only a subset of the tables/fields from any particular database (shown red in schemata below).
•
The proposed extracted uniform (abstract) schema is the Minimal Star Schema, (possibly expanded).
•
The final form of the uniformly extracted data schema will be arrived at by Supersite consensus.
Subset used
Uniform Schema
Fact
Table
Data Adapter
Extraction of
homogeneous data from
heterogeneous sources
On-line Analytical Processing: OLAP
•
A multidimensional data model making it easy to select, navigate,
integrate and explore the data.
• An
analytical query language providing power to filter, aggregate
and merge data as well as explore complex data relationships.
• Ability
to create calculated variables from expressions based on
other variables in the database.
•
Pre-calculation of frequently queried aggregated values, i.e.
monthly averages, enables fast response time to ad hoc queries.
Fast Analysis of Shared
Multidimensional Information (FASMI)
(Nigel, P. “The OLAP Report”)
An OLAP system is characterized as:
being Fast – The system is designed to deliver relevant data to
users quickly and efficiently; suitable for ‘real-time’ analysis
facilitating Analysis – The capability to have users extract not only
“raw” data but data that they “calculate” on the fly.
being Shared – The data and its access are distributed.
being Multidimensional – The key feature. The system provides a
multidimensional view of the data.
exchanging Information – The ability to disseminate large
quantities of various forms of data and information.
Multi-Dimensional Data Cubes
•Multi-dimensional data models
use inherent relationships in data
to populate multidimensional
matrices called data cubes.
•A cube's data can be queried
using any combination of
dimensions
•Hierarchical data structures are
created by aggregating the data
along successively larger ranges
of a given dimension, e.g time
dimension can contain the
aggregates year, season, month
and day.
Conceptual Data Cube
Data Measure, DataPoints
and Data Cube
Measure
A measure (in OLAP terminology) represent
numerical values for a specific entity to be
analyzed (e.g. temperature, wind speed,
pollutant).
Measure
Dimension Z
Data
Granules
Dimension Y
A collection of measures form a special dimension ‘
Measures’ (??Can Measures be Dimensions??)
Data Granules
Dimension X
A measure has set of discrete data granules –atomic data entities that cannot be further broken down.
All data points in a measure represent the same measured parameter e.g. temperature. Hence, they share
the same units and dimensionality.
The data points of a measure are enclosed in a conceptual multidimensional data cube; each data point
occupies a volume (slice or point) in the data cube.
Data points in a measure share the same dimensions; Conversely, each data point has to have the
dimensional coordinates in the data cube of the measure that it belongs to.
DVoy Data Space
Data Space for a Measure
Measure Data Space
Data
Views
View Data Space
Data
Zmax
Xmin
Z
Ymax
Ymax
Ymin
Xmax
Zmin
Zmin
Ymin
Xmin
Xmax
Y
X
• aergqegqet
Data Granule
Zmin
Measure Data Space
View Data Space
Data Granule
DVoy: Components and Image Data Flow and
Distributed Image Data
Data Query for a Measure:
Image
Default Cube DX, DY, DZ, DT
Web Service
Measure:
Registration
Elevation
Web Service
Measure:
TOMS
Image Data Browser
HTTP, FTP,
Data
XY MAP: Z,T fixed
Web Service
Catalog
Image Delivery
Measure:
SeaWiFS
Web Service
Dvoy
Components:
•
•
•
•
•
•
Distributed Data
Image Registration
Data Catalog
Data Query
Image Delivery
Image Viewer
Distributed Voyager:
Safe Connection for Remote Data (SQL) Servers
R. Husar and K. Hoijarvi
CAPITA
October 24, 2003
DVoy Data Space
Data Set
Data Instance
Dvoy: Components and Image Data Flow and
Distributed Image Data
Data Selection:
Measure:
Image
Measure, X, Y, Z, T
Elevation
Registration
Image Data Browser
Measure:
Web Service
TOMS
XY MAP: Z,T fixed
Data Query
Web Service
Image Delivery
Data
Measure:
Catalog
SeaWiFS
Web Service
Dvoy
Components:
•
•
•
•
•
•
Distributed Data
Image Registration
Data Catalog
Data Query
Image Delivery
Image Viewer
DVoy Data Space, DataSet, Data
Measures:
Data Set:
X, Y, Z, T
Subset of the Data Space
e.g. TOMS Images, 96-01
Data Instance:
X, Y …
Subset of Data Set
single TOMS image
Data Space
Data Instance: Atomic data entity that cannot be
broken down to smaller entities. It is a record in the
database, e.g:
Data Set
- data point..Temperature (xi, yi, zi, ti)
Data
Zmin
Ymax
- image Temperature (xrange, yrange, ti )
Ymin
Xmin
Xmax
Z
Zmin
Ymax
Data Set: Collection of Data Instances
Ymin
Xmin
Xmax
Y
X
The parameter (or ‘variable’, or the
OLAP ‘measure’) is NOT a
dimension
Rather, each parameter covers a
multidimensional space.
Service Chaining in Spatio-Temporal Data Browser
Data Sources
Homogenizer
Catalog
XDim Data
Wrapper
SQL Table
Mediator
OGC-Compliant GIS Services
Spatial Portrayal
Spatial Overlay
Client Browser
OLAP
GIS Data
Vector
Data Cube
Spatial Slice
Time-Series Services
Time Portrayal
Satellite
Time Overlay
Time Slice
Images
Cursor/Controller
Maintain Data
Find/Bind Data
Portray
Overlay
Render
Data Entry to the Supersite Relational Data System:
1.
Batch transfer of large Supersite and other datasets to the SRDS SQL server
2.
Web-submission of relational tables by the data producers/custodians
3.
Automatic translation and transfer of NARSTO-archived DES data to SQL
NARSTO ORNL
DES, Data Ingest
3. DES-SQL
Transformer
EPA
EOSDIS
Supersite Data
Data
Supersite
Archive
SQL
Coordinated
Supersite
Relational
Server
2. Direct Web Data Input
Tables
Supersite & other
SQL Data
Data
Query
1. DataAdapter
Table
Output
Summary of Proposed Database Schema Design
•
The starting point for the design of Supersite Relational Database schema will be the
Minimal Star Schema for fixed-location monitoring data.
•
Extensions will be made if it clearly benefits regional analysis and cross-Supersite
comparisons
•
The possible extensions, based on user needs, may include the addition of:
• ‘Methods’ dimension table to identify the sampling/analysis method of each
observation
• Additional attributes (columns) Site and Parameter tables
•
The Supersite data are not yet ready for submission to the NARSTO archive. Thus,
there is still time to develop an agreed-upon schema for the Supersite data in SRDS.
•
The schema modifications and and the consensus-building will be conducted through
the SRDS website
The RDMS Schema: ‘Minimal’ Star Schema
For integrative, cross-Supersite analysis, the database has
to have, at the minimum, a ‘fact table’ and associated time,
location, parameter and method tables as dimensions
•
The minimal Sites table includes SiteID, Name and
Lat/Lon.
•
The minimal Parameter table consists of
ParamterID, Description and Unit
•
The time dimensional table is usually skipped since
the time code is self-describing
•
The minimal Fact (Data) table consists of the
Obs_Value and the three dimensional codes for
Obs_DateTime, Site_ID and Parameter_ID
•
Additional dimension tables may include Method and
Data Quality.
The CAPITA data exploration software, Voyager uses this minimal schema. Voyager has been in use for the past 12
years successfully encoding and browsing 1000+ datasets worldwide.
The state of California still formats and distributes their AQ data on CDs using Voyager.