SRDSPres011120


Supersite Relational Database Project:
(Data Portal?)
a sub-project of
St. Louis Midwest Supersite Project
Draft of the November 16, 2001 Presentation to the Supersite
Program
Nov 13, 2001
Purpose of the Supersite Relational Database System
Design, populate and maintain a database which:
– Includes monitoring data from Supersites and auxiliary projects
– Facilitates cross-Supersite [regional or comparative] data analyses
– Supports the analyses by a variety of research groups
Stated Features of Relational Data System
• Data Input:
– Data input electronically through FTP, Web browser, (CD, if necessary)
– Modest amount of metadata on sites, instruments, data sources/version, contacts etc.
– Data structures, formats and submission procedures simple for the submitters
• Data Storage and Maintenance:
– Data stored in relational database(s), possibly distributed over multiple servers
– Maintenance of data holdings catalog and request logs
– Data updates quarterly
• Data Access:
– Access method: User-friendly web access by multiple authorized users
– Data finding: Metadata catalog of datasets
– Data query: by parameter, method, location, date/time, or other metadata
– Data output format: ASCII, spreadsheet, other (dbf, XML)
Database Schema Design
• Fact Table: A fact table (yellow) contains the main data of interest, i.e. the pollutant concentration by location, day, pollutant and measurement method.
• Star Schema: consists of a central fact table surrounded by de-normalized dimension tables (blue) describing the sites, parameters, methods.
• Snowflake Schema: an extension of the star schema where each point of the star 'explodes' into further fully normalized tables, expanding the description of each dimension.
• The snowflake schema can capture all the key data content and relationships in full detail. It is well suited for capturing and encoding complex monitoring data in a robust relational database.
Abstract (Minimal) Star Schema for
Integrative, Cross-Supersite, Spatio-Temporal Analysis
To support integrative, cross-Supersite analysis (data queries by time, location and
parameter), the database has to have time, location and parameter as dimensions.
• The minimal Site table includes SiteID, Name and Lat/Lon.
• The minimal Parameter table consists of ParameterID, Description and Unit.
• The time dimension table is usually skipped since time is self-describing.
• The minimal Fact (Data) table consists of the Obs_Value and the three dimensional codes: Obs_DateTime, Site_ID and Parameter_ID.
The above minimal (multidimensional) schema has been used in the CAPITA data exploration
software, Voyager, for the past 22 years, encoding 1000+ datasets.
Most Supersite data require a more elaborate schema to fully capture the content
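As a concrete sketch of the minimal schema above, the tables can be expressed in SQL. This example uses Python's built-in sqlite3; the sample site, parameter and observation values are invented for illustration only:

```python
import sqlite3

# In-memory SQLite sketch of the Minimal Star Schema described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Site (
    Site_ID      INTEGER PRIMARY KEY,
    Name         TEXT,
    Lat          REAL,
    Lon          REAL
);
CREATE TABLE Parameter (
    Parameter_ID INTEGER PRIMARY KEY,
    Description  TEXT,
    Unit         TEXT
);
-- Fact table: one row per observation, keyed by the three dimensions.
CREATE TABLE Fact (
    Obs_DateTime TEXT,                 -- time dimension is self-describing
    Site_ID      INTEGER REFERENCES Site(Site_ID),
    Parameter_ID INTEGER REFERENCES Parameter(Parameter_ID),
    Obs_Value    REAL
);
""")

# Invented sample rows, for illustration only.
con.execute("INSERT INTO Site VALUES (1, 'East St. Louis', 38.61, -90.16)")
con.execute("INSERT INTO Parameter VALUES (1, 'PM2.5 mass', 'ug/m3')")
con.execute("INSERT INTO Fact VALUES ('2001-11-13T12:00', 1, 1, 23.4)")

# A typical cross-dimensional query: observed value by site name and parameter.
row = con.execute("""
    SELECT s.Name, p.Description, f.Obs_Value
    FROM Fact f
    JOIN Site s ON s.Site_ID = f.Site_ID
    JOIN Parameter p ON p.Parameter_ID = f.Parameter_ID
""").fetchone()
print(row)
```

The star layout keeps the fact table narrow (one value plus three dimension codes), which is what makes the generic query-by-site/parameter/time possible.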
Extended Star Schema for SRDS
The Supersite program employs a variety of instrument/sampling/procedures
Hence, at least one additional dimension table is needed for Methods
An example extended star schema encodes the IMPROVE relational database (B. Schichtel).
Snowflake Example: Central Calif. AQ Study, CCAQS
CCAQS schema incorporates a rich set of
parameters needed for QA/QC (e.g.
sample tracking) as well as for data
analysis.
The fully relational CCAQS schema permits
the enforcement of integrity constraints and
has been demonstrated to be useful for
data entry/verification.
However, no two snowflakes are identical.
The rich snowflake schemata for one
sampling/analysis environment cannot
be easily transplanted elsewhere.
More importantly, many of the recorded
parameters 'on the fringes' are not
particularly useful for integrative, cross-Supersite, regional analyses.
Data Portal: Features
• Data reside in their respective home environment. ‘Uprooted’ data in separate
databases are not easily updated, maintained, enriched.
• Abstract (universal) query/retrieval facilitates integration and comparison
along the key dimensions (space, time, parameter, method)
• The open architecture data portal (based on Web Services) promotes the
building of further value chains: Data Viewers, Data Integration Programs,
Automatic Report Generators etc.
From Heterogeneous to Homogeneous Schema
• Individual Supersite SQL databases can be queried along spatial, temporal and parameter dimensions. However, the query needed to retrieve the same information depends on the schema of the particular database.
• A way to homogenize the distributed data is to access all the data through a Data Adapter, using only a subset of the tables/fields from any particular database (red).
• The proposed extracted (abstract) schema is the Minimal Star Schema (possibly expanded …). The final form of the extracted data schema will be arrived at by consensus.
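The Data Adapter idea can be sketched as a small mapping layer: each provider declares how its own table and column names map onto the abstract schema, so one generic query form works everywhere. The class, table and column names below are hypothetical, not taken from any actual Supersite database:

```python
# Sketch of a Data Adapter: each provider supplies a mapping from its own
# table/column names to the abstract Minimal Star Schema fields, so one
# generic query works against heterogeneous databases. Names are illustrative.

ABSTRACT_FIELDS = ("Obs_DateTime", "Site_ID", "Parameter_ID", "Obs_Value")

class DataAdapter:
    def __init__(self, table, column_map):
        # column_map: abstract field -> provider-specific column name
        self.table = table
        self.column_map = column_map

    def build_query(self, where=""):
        # Alias every provider column to its abstract name.
        cols = ", ".join(
            f"{self.column_map[f]} AS {f}" for f in ABSTRACT_FIELDS)
        sql = f"SELECT {cols} FROM {self.table}"
        return sql + (f" WHERE {where}" if where else "")

# A hypothetical provider schema homogenized to the abstract query form:
improve = DataAdapter("AerosolData",
                      {"Obs_DateTime": "SampleDate", "Site_ID": "SiteCode",
                       "Parameter_ID": "ParamCode", "Obs_Value": "Value"})
print(improve.build_query("SampleDate >= '2001-01-01'"))
```

The consumer always sees columns named Obs_DateTime, Site_ID, Parameter_ID and Obs_Value, regardless of what the underlying database calls them.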
Figure: Data Adapter extracting homogeneous data from heterogeneous sources; only a subset of each source is used to populate the abstract schema's Fact Table.
Federated Data Warehouse Architecture
• Three-tier architecture consisting of:
– Provider Tier: back-end servers containing heterogeneous data, maintained by the federation members
– Proxy Tier: retrieves designated Provider data and homogenizes it into common, uniform datasets
– User Tier: accesses the Proxy Server and uses the uniform data for presentation, integration or processing
• The Provider servers interact only with the Proxy Server in accordance with the Federation Contract
– The contract sets the rules of interaction (accessible data subsets, types of queries)
– Strong server security measures enforced, e.g. through Secure Socket Layer
• The data User interacts only with the generic Proxy Server using a flexible Web Services interface
– Generic data queries, applicable to all data in the Warehouse (e.g. space, time, parameter data sub-cube)
– The data query is addressed to a Web Service provided by the Proxy Server of the Federation
– Uniformly formatted, self-describing data packages are handed to the user for presentation or further processing
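The Proxy Tier's fan-out of one generic (space, time, parameter) query to all federation providers might look like the following sketch. The provider names, row fields and fetch functions are invented placeholders, not real Supersite services:

```python
# Sketch of the Proxy Tier: one generic query is dispatched to every
# provider server, and the homogenized results are merged into a single
# uniform dataset handed to the user.

def proxy_query(providers, site=None, param=None, t0=None, t1=None):
    """Dispatch one generic query to every provider; merge uniform rows."""
    query = {"site": site, "param": param, "start": t0, "end": t1}
    merged = []
    for name, fetch in providers.items():
        # Each provider homogenizes its own data before returning rows.
        for row in fetch(query):
            merged.append({"provider": name, **row})
    return merged

# Hypothetical providers returning already-homogenized rows:
providers = {
    "StLouis": lambda q: [{"site": "ESL", "param": "PM25", "value": 23.4}],
    "Fresno":  lambda q: [{"site": "FSF", "param": "PM25", "value": 31.1}],
}
rows = proxy_query(providers, param="PM25")
```

The user never contacts a provider directly; the merged, uniformly keyed rows are the "self-describing data packages" of the slide above.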
Federated Data Warehouse
Figure: Provider Tier (Member Servers: SQLServer1/SQLDataAdapter1, SQLServer2/SQLDataAdapter2, LegacyServer/CustomDataAdapter holding heterogeneous data) connects through a firewall under the Federation Contract to the Proxy Tier (Proxy Server: data homogenization; Web Service with uniform query and data), which serves the User Tier (data access and use: presentation, processing, integration).
Universal Query/Response from SQL servers
• A common feature of all SQL databases for AQ data is that they can be queried along spatial, temporal and parameter dimensions.
• However, the query needed to retrieve the same information depends on the schema of the particular database.
• A way to homogenize the distributed data is to access all the data through an abstract virtual schema.
Summary of Proposed Database Schema Design
• The starting point for the design of the Supersite Relational Database schema will be the Minimal Star Schema for fixed-location monitoring data.
• Extensions will be made if they clearly benefit regional analysis and cross-Supersite comparisons.
• The possible extensions, based on user needs, may include the addition of:
– A 'Methods' dimension table to identify the sampling/analysis method of each observation
– Additional attributes (columns) in the Site and Parameter tables
• The Supersite data are not yet ready for submission to the NARSTO archive. Thus, there is still time to develop an agreed-upon schema for the Supersite data in SRDS.
• The schema modifications and the consensus-building will be conducted through the SRDS website.
Data Entry to the Supersite Relational Data System:
1. Automatic translation and transfer of NARSTO-archived DES data to SQL
2. Web-submission of relational tables by the data producers/custodians
3. Batch transfer of large auxiliary datasets to the SQL server
Figure: Supersite data archived at NARSTO ORNL (DES data ingest) and EPA EOSDIS flow through a DES-SQL Transformer into the Coordinated Supersite Relational SQL Server; tables also arrive by direct web data input, auxiliary batch data pass through a Manual-SQL Transformer, and the server answers data queries with table output.
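In outline, the DES-SQL Transformer step could parse a delimited export and load it into the Fact table. The comma-separated layout and column names below are assumptions for illustration, not the actual NARSTO DES format:

```python
import csv, io, sqlite3

# Sketch of a DES-to-SQL loading step. The simple comma-delimited layout
# and column names are placeholders, not the real NARSTO DES specification.
des_text = """Obs_DateTime,Site_ID,Parameter_ID,Obs_Value
2001-11-13T12:00,1,1,23.4
2001-11-13T13:00,1,1,25.1
"""

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Fact (Obs_DateTime TEXT, Site_ID INT, "
            "Parameter_ID INT, Obs_Value REAL)")

# Parse each record and coerce the fields to the Fact table's types.
rows = [(r["Obs_DateTime"], int(r["Site_ID"]), int(r["Parameter_ID"]),
         float(r["Obs_Value"])) for r in csv.DictReader(io.StringIO(des_text))]
con.executemany("INSERT INTO Fact VALUES (?, ?, ?, ?)", rows)
count = con.execute("SELECT COUNT(*) FROM Fact").fetchone()[0]
```

A production translator would add the validation and buffering described in the data preparation procedures below; this only shows the parse-coerce-insert skeleton.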
Data Preparation Procedures:
• Data gathering, QA/QC and standard formatting are to be done by individual projects.
• The data exchange standards, data ingest and archives are by ORNL and NASA.
• Data ingest is to be automated, aided by tools and procedures supplied by this project:
– NARSTO DES-SQL translator
– Web submission tools and procedures
– Metadata Catalog and I/O facilities
• Data submissions and access will be password protected as set by the community.
• Submitted data will be retained in a temporary buffer space and, following verification, transferred to the shared SQL database.
• Data access, submissions etc. will be automatically recorded and summarized in human-readable reports.
Data Catalog
• Data Catalog and discussion page of the CAPITA Xsystem
Related CAPITA Projects
• EPA Network Design Project (~$150K/yr – April 2003). Development of novel quantitative methods of network optimization. The network performance evaluation is conducted using the complete PM FRM data set in AIRS, which will be available for input into the SRDS.
• EPA WebVis Project (~$120K/yr – April 2003). Delivery of current visibility data to the public through a web-based system. The surface met data are being transferred into the SQL database (since March 2001) and will be available to SRDS.
• NSF Collaboration Support Project (~$140K/yr – Dec 2004). Continuing development of interactive web sites for community discussions and for web-based data sharing; directly applicable to this project.
• NOAA ASOS Analysis Project (~$50K/yr – May 2002). Evaluate the potential utility of the ASOS visibility sensors (900 sites, one-minute resolution) as a PM surrogate. Data now available for April–October 2001 can be incorporated into the Supersite Relational Data System.
• St. Louis Supersite Project website (~$50K/yr – Dec 2003). The CAPITA group maintains the St. Louis Supersite website and some auxiliary data. It will also be used for this project.
Federated Data Warehouse Architecture
Figure: Data Warehouse Tier (XDim SQL data tables, OLAP cubes, GIS data: satellite, vector, layered map; text data and web pages) served through XML Web Services, OpenGIS Services and HTTP Services to the Data View & Process Tier, where Connection, Cursor-Query, Data Access and Data View Managers drive linked views (time chart, scatter chart, text/table).
• Distributed data of multiple types (spatial, temporal, text)
• The Broker handles the views, connections, data access, cursor
• Data are rendered by linked Data Views (map, time, text)
Example Data Viewer
(to be made more Supersite relevant)
Figure: linked Map View, Time View, Variable View and WebCam View.
The views are linked so that making a change in one view, such as
selecting a different location in the map view, updates the other views.
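The view-linking behavior described above can be sketched as a simple broker that notifies every registered view when the selection changes. The class and view names here are invented, not part of the actual viewer:

```python
# Sketch of linked views: changing the selection in one view (e.g. picking
# a new site in the map view) refreshes all registered views.

class LinkedViews:
    """Broker holding the shared selection state for all views."""
    def __init__(self):
        self.views = []
        self.selection = {}

    def register(self, view):
        self.views.append(view)

    def select(self, **change):          # e.g. select(site="Fresno")
        self.selection.update(change)
        for view in self.views:
            view.refresh(self.selection)

class View:
    def __init__(self, name):
        self.name, self.shown = name, None
    def refresh(self, selection):
        self.shown = dict(selection)     # redraw with the new selection

broker = LinkedViews()
map_view, time_view = View("map"), View("time")
broker.register(map_view)
broker.register(time_view)
broker.select(site="Fresno")             # a map selection updates every view
```

This is the classic observer pattern: views stay consistent because they all redraw from the one selection the broker owns.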
Supersite Relational Data System: Schedule
Figure: schedule bar chart. Year 1 (2002): RDMS design, implementation and SQL testing, feedback. Year 2 (2003) and Year 3 (2004): Supersite data entry, auxiliary data entry, other coordinated data entry; Supersite, coordinated and auxiliary data updates.
• First four months: design of the relational database, associated data transformers, I/O; submitted to the Supersite workgroups for comment
• In six months, Supersite data preparation and entry begins
• In Year 2 and Year 3, data sets will be updated by providers as needed; system accessible to the data user community
Personnel, Management and Facilities
Personnel
• PI, R. B. Husar (10%), Kari Hoijarvi (25%). Software experience at CAPITA, Microsoft, Visala.
• 20% of project budget ($12k/yr) to consultants: J. Watson, DRI; W. White and J. Turner, WU.
• Collaborators (CAPITA associates): B. Schichtel, CIRA; S. Falke, EPA; M. Bezic, Microsoft.
Management
• This project is a sub-project of the St. Louis-Midwest Supersite project, Dr. Jay Turner, PI.
• Special focus on supporting large-scale, crosscutting, and integrative analysis.
• This project will leverage the other CAPITA data-sharing projects.
Resources and Facilities
• CAPITA has the largest known privately held collection of air quality, meteorological and emission data, available in uniform Voyager format and extensively accessed from the CAPITA website.
• The computing and communication facilities include two servers, ten workstations and laptops, connected internally and externally through high-speed networks.
• Software development tools, including Visual Studio, part of the .NET dev environment.
Miscellaneous Stuff
• The remaining pages are potentially reusable material – not yet organized.
OpenGIS Web Services
• Mission: Definition and specification of geospatial web services.
• A Web service is an application that can be published, located, and dynamically invoked across the Web.
• Applications and other Web services can discover and invoke the service.
• The sponsors of the Web services initiative include:
– Federal Geographic Data Committee
– Natural Resources Canada
– Lockheed Martin
– National Aeronautics and Space Administration
– U.S. Army Corps of Engineers Engineer Research and Development Center
– U.S. Environmental Protection Agency EMPACT Program
– U.S. Geological Survey
– US National Imagery and Mapping Agency
• Phase I – February 2002
– Common Architecture: OGC Services Model, OGC Registry Services, and Sensor Model Language.
– Web Mapping: Map Server (raster), Feature Server (vector), Coverage Server (image), Coverage Portrayal Services.
– Sensor Web: OpenGIS Sensor Collection Service for accessing data from a variety of land, water, air and other sensors.
Distributed Data Analysis & Dissemination System:
D-DADS
• Specifications:
– Uses standardized forms of data, metadata and access protocols
– Supports distributed data archives, each run by its own provider
– Provides tools for data exploration, analysis and presentation
• Features:
– Data are structured as relational tables and multidim. data cubes
– Dimensional data cubes are distributed but shared
– Analysis is supported by built-in and user functions
– Supports other data types, such as images, GIS data layers, etc.
D-DADS Architecture
The D-DADS Components
• Data Providers supply primary data to the system, through SQL or other data servers.
• Standardized Description & Format: the data cubes and other data types are populated and described using standard metadata.
• Data Access and Manipulation: tools providing a unified interface to data cubes, GIS data layers, etc. for accessing and processing (filtering, aggregating, fusing) data and integrating data into virtual data cubes.
• Users are the analysts who access the D-DADS and produce knowledge from the data.
The multidimensional data access and manipulation component of D-DADS will be implemented using OLAP.
Interoperability
One requirement for an effective distributed environmental
data system is interoperability, defined as,
“the ability to freely exchange all kinds of spatial
information about the Earth and about objects and
phenomena on, above, and below the Earth’s surface;
and to cooperatively, over networks, run software
capable of manipulating such information.” (Buehler &
McKee, 1996)
Such a system has two key elements:
• Exchange of meaningful information
• Cooperative and distributed data management
On-line Analytical Processing: OLAP
• A multidimensional data model making it easy to select, navigate, integrate and explore the data.
• An analytical query language providing power to filter, aggregate and merge data, as well as explore complex data relationships.
• Ability to create calculated variables from expressions based on other variables in the database.
• Pre-calculation of frequently queried aggregated values, e.g. monthly averages, enables fast response time to ad hoc queries.
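The pre-calculation idea can be sketched in a few lines: monthly averages are rolled up once, after which an ad hoc query is a dictionary lookup rather than a scan. The observation tuples are invented sample data:

```python
from collections import defaultdict

# Sketch of OLAP-style pre-aggregation: monthly averages are computed once,
# so repeated ad hoc queries become fast lookups. Data are invented.

obs = [  # (site, parameter, date "YYYY-MM-DD", value)
    ("ESL", "PM25", "2001-06-03", 20.0),
    ("ESL", "PM25", "2001-06-17", 30.0),
    ("ESL", "PM25", "2001-07-02", 12.0),
]

def precompute_monthly(observations):
    """Roll daily observations up to (site, parameter, month) averages."""
    sums = defaultdict(lambda: [0.0, 0])
    for site, param, date, value in observations:
        key = (site, param, date[:7])          # truncate the day: one cell per month
        sums[key][0] += value
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

cube = precompute_monthly(obs)
avg = cube[("ESL", "PM25", "2001-06")]   # constant-time lookup, no table scan
```

A real OLAP engine generalizes this to many aggregation levels and operators, but the speedup comes from the same trade: compute once at load time, answer many queries from the stored aggregate.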
User Interaction with D-DADS
Figure: a Data View (table, map, etc.) sends a query to the distributed database and receives XML data in return.
Metadata Standardization
Metadata standards for describing air quality data are
currently being actively pursued by several
organizations, including:
• The Supersite Data Management Workgroup
• NARSTO
• FGDC
Potential D-DADS Nodes
The following organizations are potential nodes in a
distributed data analysis and dissemination system:
• CAPITA
• NPS-CIRA
• EPA Supersites
- California
- Texas
- St. Louis
Summary
In the past, data analysis has been hampered by data flow
resistances. However, the tools and framework to
overcome each of these resistances now exist, including:
• World Wide Web
• XML
• OLAP
• OpenGIS
• Metadata standards
Incorporating these tools will initiate a distributed data
analysis and dissemination system.
‘Global’ and ‘Local’ AQ Analysis
• AQ data analysis needs to be performed at both global and local levels.
• 'Global' refers to regional, national, and global analysis. It establishes the larger-scale context.
• 'Local' analysis focuses on specific and detailed local features.
• Both global and local analyses are needed for full understanding.
• Global-local interaction (information flow) needs to be established for effective management.
National and Local AQ Analysis
Data Re-Use and Synergy
• Data producers maintain their own workspace and resources (data, reports, comments).
• Part of the resources are shared by creating a common virtual resource.
• Web-based integration of the resources can be across several dimensions:
– Spatial scale: local – global data sharing
– Data content: combination of data generated internally and externally
Figure: local and global users each share part of their resources, forming a pool of virtual shared resources (data, knowledge, tools, methods).
• The main benefits of sharing are data re-use, data complementing and synergy.
• The goal of the system is to have the benefits of sharing outweigh the costs.
Integration for Global-Local Activities
Global and local activities are both needed – e.g. ‘think global, act local’
‘Global’ and ‘Local’ here refers to relative, not absolute scale
Global Activity => Local Benefit
– Global data, tools => improved local productivity
– Global data analysis => spatial context; initial analysis
– Analysis guidance => standardized analysis, reporting
Local Activity => Global Benefit
– Local data, tools => improved global productivity
– Local data analysis => elucidate, expand initial analysis
– Identify relevant issues => responsive, relevant global work
Content Integration for Multiple Uses (Reports)
Data from multiple measurements are shared by their providers or custodians
Data are integrated, filtered, aggregated and fused in the process of analysis
Reports use the analysis for Status and Trends; Exposure Assessment; Compliance …
The creation of the needed reports requires data sharing and integration from multiple sources.