On-line Analytical Processing OLAP

Download Report

Transcript On-line Analytical Processing OLAP

Data Warehousing
November 2001
PORTALS
• A portal provides users with personalized, one-stop shopping for
structured and unstructured data, as well as various types of
applications, all of which may exist either inside or outside the
corporation. However, data warehousing teams need to be
especially careful when selecting portal software.
• Portal products that warrant attention will be based on open
standards of data communication like XML and provide an
extensible platform that can accommodate a range of
technologies. Portal vendors that also market ETL tools designed
to maintain the data warehouse will be especially well positioned
to provide enterprise data integration. Ascential Software is such a
vendor.
Database-DataWarehouse Differences
•
Data Acquisition Databases
•
•
Acronyms: OLTP - On Line Transaction Processing
•
Examples:
•
- Order entry system;
•
- Look up your checking account when you go to an ATM to request a withdraw
•
Features:
•
Designed for very rapid selects and inserts of simple transactions
•
Simple transaction that needs to be executed with speed.
•
DBMS designed for OLTP (Oracle) do not do the best job at data querying.
•
Several databases are designed to query and manage data
•
Stores transactional data of an enterprise
•
A database is nothing more, or less, than a technology for managing data files.
•
OLTP Transactional data focusing particular operations or department
•
Current data only, no historical data
•
For OLTP systems that means small tables (row size), probably some specific indexes related to transactional processing, and a high degree of normalization
•
A Database is normalized and contain several constraints to minimize input errors.
•
Contains only fine-grain data, no coarse grain aggregations
•
The database is the underlying infrastructure of a data warehouse (DW)
•
Data Analysis Databases
•
Features:
•
designed for massive ad- hoc queries of large data volumes
•
not to process transactions.
•
Stores historical data of an enterprise
•
Datawarehouse is a centralized storage facility
•
Used for reporting purposes; helps management making critical decisions
•
For analysis of patterns, derived after analyzing data aggregations
•
datawarehouse does not contain all records/info, only summarized info
•
data gathered from a variety of sources and retained for extend periods
•
Integrated data formatted for easy access for queries and reports- trend analysis
•
May contain all relevant detail transaction info for tracebility and drill down of summaries.
•
There is need for good, clean, transactional data in the warehouse
•
The summaries and aggregations are also in
•
Larger tables, more indexes to facilitate queries, and many tables are denormalized to varying degrees
•
Implemented using a database engine, RDBMS or OLAP tools
•
The schema is not normalized as in operational database.
•
The data are arranged in dimensions like Time, Geographic Region, Product, Customer class, promotion etc.
•
The user doesn't need to know SQL or other language to access the database.
•
A data warehouse does not normally accept user inputs and is read only.
Contains fine-grain as well as coarse grain aggregate data
•
Summaries inside the relational warehouse could be a simple star schema
•
If you use a microstrategy to provide information, you will need a snow flake schema.
•
If you use a hyperion solution, you must have this summarized area in star schema.
Data Warehousing Trends
•
•
During the 1990s, distributed computing became the predominant architecture;
most hardware and software companies focused their research and
development efforts on developing new and enhanced products that were
compatible with this new technology. Specific to data warehousing, we saw
tremendous progress relative to both the functionality and scalability
associated with products in extract/transform/load (ETL), data
repositories/databases, OLAP, data mining and other associated decisionsupport technologies.
In the past few years, we have seen the rise of the Internet. The Internet's
impact on data warehousing will be tremendous in terms of enabling more
individuals to access and gain value from existing warehouses beginning with
intranets and, more recently, making the information available to trading
partners via extranets. At the same time, the Internet provides valuable
information about customers, suppliers and competitors that was not readily
available from traditional sources.
A Retrospective look at Data Warehousing
•
•
•
In the late 1980s and early 1990s, something had to be done to address the
growing level of discomfort with legacy applications. Thus, the concept of the
data warehouse was born. The data warehouse was a database that stored
detailed, integrated, current and historical data.
The data warehouse was built from the detailed transaction data that was
generated by and aggregated in the legacy applications. One of the major tasks
of the data warehouse developer was going back into the legacy systems
environment to find, convert and reformat legacy data. The task of integrating
legacy data that was never designed for integration was daunting and dirty.
Yet, this task was one of the cornerstones of the success of data warehousing.
In the early days of data warehousing, there were no tools for integrating and
converting data. The work had to be done by hand; and it was thankless,
toilsome work. Soon, a subindustry evolved called the
integration/transformation (i/t) or the extract, transform and load (ETL)
industry. Software products were created that allowed the legacy environments
to be integrated in an automated manner. There were two types of ETL tools
code generators that could handle any conversion scenario and run-time
generators that were easy to use but allowed for only limited complexity in
integration.
Driving Forces of Data Flow
• Need a ‘force’ to move data from one-shot to reusable form
• External force – contracts
• Internal – humanitarian, benefits
Resistances
(it takes extra effort to recycle information)
• Mechanical
• Personal
• Institutional
Monitoring
The Researcher/Analyst’s Challenge
“The researcher cannot get access to the data;
if he can, he cannot read them;
if he can read them,
he does not know how good they are;
and if he finds them good he cannot merge them
with other data.”
Information Technology and the Conduct of Research: The Users View
National Academy Press, 1989
Data Flow Resistances
The data flow process is hampered by a number of resistances.
• The user does not know what data are available
• The available data are poorly described (metadata)
• There is a lack of QA/QC information
• Incompatible data can not be combined and fused
These resistances can be overcome through a distributed
system that catalogs and standardizes the data allowing easy
access for data manipulation and analysis.
NARSTO-Supersite Data System: Data Flow
NARSTO ORNL
DES, Data Ingest
DES-SQL
Transformer
EPA
EOSDIS
Supersite Data
Data
Coordinated
Archive
Supersite
Relational
Supersite
SQL
Server
Direct Web Data Input
Tables
Auxiliary
Manual-SQL Transformer
Batch Data
•
•
•
Data
Query
Data gathering, QA/QC and standard formatting is done by individual projects
The data exchange standards, data ingest and archives are by ORNL and NASA
The data catalog, relational transformers, SQL server and I/O is by this project
Table
Output
Distributed Data Analysis & Dissemination System:
D-DADS
• Specifications:
Uses standardized forms of data, metadata and access protocols
 Supports distributed data archives, each run by its own provider
 Provides tools for data exploration, analysis and presentation

• Features:




Data are structured as relational tables and multidim. data cubes
Dimensional data cubes are distributed but shared
Analysis is supported by built-in and user functions
Supports other data types, such as images, GIS data layers, etc.
D-DADS Architecture
The D-DADS Components
•
•
Data Providers supply primary data to system, through SQL or other data servers.
Standardized Description & Format populate and describe the data cubes and
other data types using a standard metadata describing data
•
Data Access and Manipulation tools for providing a unified interface to data
cubes, GIS data layers, etc. for accessing and processing (filtering, aggregating, fusing)
data and integrating data into virtual data cubes
•
Users are the analysts who access the D-DADS and produce knowledge from the data
The multidimensional data access and manipulation
component of D-DADS will be implemented using OLAP.
Interoperability
One requirement for an effective distributed environmental
data system is interoperability, defined as,
“the ability to freely exchange all kinds of spatial
information about the Earth and about objects and
phenomena on, above, and below the Earth’s surface;
and to cooperatively, over networks, run software
capable of manipulating such information.” (Buehler &
McKee, 1996)
Such a system has two key elements:
• Exchange of meaningful information
• Cooperative and distributed data management
On-line Analytical Processing: OLAP
•
A multidimensional data model making it easy to select, navigate,
integrate and explore the data.
• An
analytical query language providing power to filter, aggregate
and merge data as well as explore complex data relationships.
• Ability
to create calculated variables from expressions based on
other variables in the database.
•
Pre-calculation of frequently queried aggregated values, i.e.
monthly averages, enables fast response time to ad hoc queries.
User Interaction with D-DADS
Query
XML data
Distributed
Database
Data View
(Table, Map, etc.)
XML data
Metadata Standardization
Metadata standards for describing air quality data are
currently being actively pursued by several
organizations, including:
• The Supersite Data Management Workgroup
• NARSTO
• FGDC
Potential D-DADS Nodes
The following organizations are potential nodes in a
distributed data analysis and dissemination system:
• CAPITA
• NPS-CIRA
• EPA Supersites
- California
- Texas
- St. Louis
Summary
In the past, data analysis has been hampered by data flow
resistances. However, the tools and framework to
overcome each of these resistances now exist, including:
• World Wide Web
• XML
• OLAP
• OpenGIS
• Metadata standards
Incorporating these tools will initiate a distributed data
analysis and dissemination system.
Fast Analysis of Shared
Multidimensional Information (FASMI)
(Nigel, P. “The OLAP Report”)
An OLAP system is characterized as:
being Fast – The system is designed to deliver relevant data to
users quickly and efficiently; suitable for ‘real-time’ analysis
facilitating Analysis – The capability to have users extract not only
“raw” data but data that they “calculate” on the fly.
being Shared – The data and its access are distributed.
being Multidimensional – The key feature. The system provides a
multidimensional view of the data.
exchanging Information – The ability to disseminate large
quantities of various forms of data and information.
Multi-Dimensional Data Cubes
•Multi-dimensional data models
use inherent relationships in data
to populate multidimensional
matrices called data cubes.
•A cube's data can be queried
using any combination of
dimensions
•Hierarchical data structures are
created by aggregating the data
along successively larger ranges
of a given dimension, e.g time
dimension can contain the
aggregates year, season, month
and day.
Example Application: Visibility D-DADS
Visibility observations (extinction coefficient) are an indicator
of air quality and serve as an important data set in the public’s
understanding of air quality.
A visibility D-DADS will consist of multiple forms of
visibility data, such as visual range observations and digital
images from web cameras.
Potential visibility data providers include:
- EMPACT projects and their hourly visual range data
- The IMPROVE database
- CAPITA, a warehouse for global surface observation
data available every six hours
Possible Node in Geography Network
National Geographic and ESRI are establishing a geography
network consisting of distributed spatial databases.
Some EMPACT projects are participating as nodes in the initial
start-up phase
The visibility distributed data and analysis system could link to
and become another node in the geography network, making use
of the geography network’s spatial viewers.
Other views, such as a time view could be linked with the spatial
viewer to take advantage of the multidimensional visibility data
cubes.