Designing a Data Warehouse

Download Report

Transcript Designing a Data Warehouse

Designing a Data Warehouse
Issues in DW design
Data Warehouse
A read-only database for decision
analysis
 Subject Oriented
 Integrated
 Time variant
 Nonvolatile
consisting of time stamped operational
and external data.
Data Warehouse vs
Operational Databases





Highly tuned
Real time Data
Detailed records
Current values
Accesses small
amounts of data in a
predictable manner





Flexible access
Consistent timing
Summarized as
appropriate
Historical
Access large
amounts of data in
unexpected ways
Data Warehouse Purpose


Identify problems in time to avoid them
Locate opportunities you might
otherwise miss
Data Warehouse:
New Approach
An old idea with a new interest because
of:
Cheap Computing Power
Special Purpose Hardware
New Data Structures
Intelligent Software
Warehousing Problems
Business Issues
Data Quantity
Data Accuracy
Maintenance
Ownership
Cost
Warehousing Problems
Business Issues
Database Issues
DBMS Software
Technology
Complexity
Warehousing Problems
Business Issues
Data Issues
Analysis Issues
User Interface
Intelligent Processing
Three Approaches

Classical Enterprise Database
Contains operational data from all areas of
the organization.

Data Mart
Extracted and managerial support data
designed for departmental or EUC
applications

Data Package
Data required for a specific application
Classical Warehouse
Source
Archived data
Extraction
Batch extraction programs
Data
Atomic transaction data
Tool
VLDB technology
Analysis
IT driven software
Mart
Source
Deposit or External sources
Extraction
Batch summary
Data
Designed departmental database
Tool
OLAP, ROLAP, MDBMS
Analysis
IT driven or trained user
Package
Source
Mart
Extraction
Sample and summary
Data
Problem specific dataset
Tool
PC tools
Analysis
Trained user
Three Fundamental Processes




Data Acquisition
Data Storage
Data a
Access
Data Acquisition


Handles acquisition of data from legacy
systems and outside sources.
Data is identified, copied, formatted and
prepared for loading into the
warehouse.
Acquisition steps

Catalog the data


Clean and prepare the data.


Develop an inventory of where it is
and what it means.
Extract from legacy files and reformat
to make it usable.
Transport data from one location to
another.
Storage
The storage component holds the
data so that the many different data
mining, executive information
and decision support systems can make
use of it effectively.
The Storage Area
Managed by
 Relational databases


like those from Oracle Corp. or Informix
Software Inc.
Specialized hardware


symmetric multiprocessor (SMP)
or massively parallel processor
(MPP) machines
Storage


The majority of warehouse storage
today is being managed by relational
databases running on Unix platforms.
Oracle, Sybase Inc., IBM Corp. and
Informix control 65 percent of the
warehouse storage market. Meta Group
Inc. (1996)
Access


Different end-user PCs and workstations draw
data from the warehouse with the help of
multidimensional analysis products, neural
networks, data discovery tools or analysis
tools.
These powerful, "smart" software products
are the real driving force behind the viability
of data warehousing.
Access Tools




Intelligent Agents and Agencies
Query Facilities and Managed Query
Environments
Statistical Analysis
Data Discovery.
(decision support, artificial intelligence and
expert systems)


OLAP
Data Visualization
Hardware Budget

A typical startup warehouse project
allocates more than 60 percent of its
budget for hardware and software to
the creation of a powerful storage
component, spending just 30 percent
on data mining and user access
technologies.
Systems Analysis Budget



Budgeting for systems analysis and
development, however, follows a very
different pattern.
More than 50 percent of development dollars
are spent on building acquisition capabilities,
30 percent fund the development of user
solutions and
20 percent are dedicated to the creation of
databases in the storage component.
Design Issues
Relational and Multidimensional Models
 Denormalized and indexed relational
models more flexible
 Multidimensional models simpler to use
and more efficient
Star Schemas in a RDBMS
In most companies doing ROLAP, the DBAs
have created countless indexes and summary
tables in order to avoid I/O-intensive table
scans against large fact tables. As the indexes
and summary tables proliferate in order to
optimize performance for the known queries
and aggregations that the users perform, the
build times and disk space needed to create
them has grown enormously, often requiring
more time than is allotted and more space
than the original data!
Building a Data Warehouse
from a Normalized Database
The steps
 Develop a normalized entity-relationship
business model of the data warehouse.
 Translate this into a dimensional model. This
step reflects the information and analytical
characteristics of the data warehouse.
 Translate this into the physical model. This
reflects the changes necessary to reach the
stated performance objectives.
The Business Model



Identify the data structure, attributes
and constraints for the client’s data
warehousing environment.
Stable
Optimized for update
Flexible
Business Model
As always in life, there are some disadvantages
to 3NF:
 Performance can be truly awful. Most of the
work that is performed on denormalizing a
data model is an attempt to reach
performance objectives.
 The structure can be overwhelmingly
complex. We may wind up creating many
small relations which the user might think of
as a single relation or group of data.
Structural Dimensions


The first step is the development of the
structural dimensions. This step corresponds
very closely to what we normally do in a
relational database.
The star architecture that we will develop
here depends upon taking the central
intersection entities as the fact tables and
building the foreign key => primary key
relations as dimensions.
Simple DW pattern.
Other Dimensions



Categorical dimensions: generated
groups (additional key components)
Partitioning dimensions: subtypes
(planned vs. actual)
Informational dimensions: generate
different types of data (messy).