Designing a Data Warehouse
Download
Report
Transcript Designing a Data Warehouse
Designing a Data Warehouse
Issues in DW design
Data Warehouse
A read-only database for decision
analysis
Subject Oriented
Integrated
Time variant
Nonvolatile
consisting of time stamped operational
and external data.
Data Warehouse vs
Operational Databases
Highly tuned
Real time Data
Detailed records
Current values
Accesses small
amounts of data in a
predictable manner
Flexible access
Consistent timing
Summarized as
appropriate
Historical
Access large
amounts of data in
unexpected ways
Data Warehouse Purpose
Identify problems in time to avoid them
Locate opportunities you might
otherwise miss
Data Warehouse:
New Approach
An old idea with a new interest because
of:
Cheap Computing Power
Special Purpose Hardware
New Data Structures
Intelligent Software
Warehousing Problems
Business Issues
Data Quantity
Data Accuracy
Maintenance
Ownership
Cost
Warehousing Problems
Business Issues
Database Issues
DBMS Software
Technology
Complexity
Warehousing Problems
Business Issues
Data Issues
Analysis Issues
User Interface
Intelligent Processing
Three Approaches
Classical Enterprise Database
Contains operational data from all areas of
the organization.
Data Mart
Extracted and managerial support data
designed for departmental or EUC
applications
Data Package
Data required for a specific application
Classical Warehouse
Source
Archived data
Extraction
Batch extraction programs
Data
Atomic transaction data
Tool
VLDB technology
Analysis
IT driven software
Mart
Source
Deposit or External sources
Extraction
Batch summary
Data
Designed departmental database
Tool
OLAP, ROLAP, MDBMS
Analysis
IT driven or trained user
Package
Source
Mart
Extraction
Sample and summary
Data
Problem specific dataset
Tool
PC tools
Analysis
Trained user
Three Fundamental Processes
Data Acquisition
Data Storage
Data a
Access
Data Acquisition
Handles acquisition of data from legacy
systems and outside sources.
Data is identified, copied, formatted and
prepared for loading into the
warehouse.
Acquisition steps
Catalog the data
Clean and prepare the data.
Develop an inventory of where it is
and what it means.
Extract from legacy files and reformat
to make it usable.
Transport data from one location to
another.
Storage
The storage component holds the
data so that the many different data
mining, executive information
and decision support systems can make
use of it effectively.
The Storage Area
Managed by
Relational databases
like those from Oracle Corp. or Informix
Software Inc.
Specialized hardware
symmetric multiprocessor (SMP)
or massively parallel processor
(MPP) machines
Storage
The majority of warehouse storage
today is being managed by relational
databases running on Unix platforms.
Oracle, Sybase Inc., IBM Corp. and
Informix control 65 percent of the
warehouse storage market. Meta Group
Inc. (1996)
Access
Different end-user PCs and workstations draw
data from the warehouse with the help of
multidimensional analysis products, neural
networks, data discovery tools or analysis
tools.
These powerful, "smart" software products
are the real driving force behind the viability
of data warehousing.
Access Tools
Intelligent Agents and Agencies
Query Facilities and Managed Query
Environments
Statistical Analysis
Data Discovery.
(decision support, artificial intelligence and
expert systems)
OLAP
Data Visualization
Hardware Budget
A typical startup warehouse project
allocates more than 60 percent of its
budget for hardware and software to
the creation of a powerful storage
component, spending just 30 percent
on data mining and user access
technologies.
Systems Analysis Budget
Budgeting for systems analysis and
development, however, follows a very
different pattern.
More than 50 percent of development dollars
are spent on building acquisition capabilities,
30 percent fund the development of user
solutions and
20 percent are dedicated to the creation of
databases in the storage component.
Design Issues
Relational and Multidimensional Models
Denormalized and indexed relational
models more flexible
Multidimensional models simpler to use
and more efficient
Star Schemas in a RDBMS
In most companies doing ROLAP, the DBAs
have created countless indexes and summary
tables in order to avoid I/O-intensive table
scans against large fact tables. As the indexes
and summary tables proliferate in order to
optimize performance for the known queries
and aggregations that the users perform, the
build times and disk space needed to create
them has grown enormously, often requiring
more time than is allotted and more space
than the original data!
Building a Data Warehouse
from a Normalized Database
The steps
Develop a normalized entity-relationship
business model of the data warehouse.
Translate this into a dimensional model. This
step reflects the information and analytical
characteristics of the data warehouse.
Translate this into the physical model. This
reflects the changes necessary to reach the
stated performance objectives.
The Business Model
Identify the data structure, attributes
and constraints for the client’s data
warehousing environment.
Stable
Optimized for update
Flexible
Business Model
As always in life, there are some disadvantages
to 3NF:
Performance can be truly awful. Most of the
work that is performed on denormalizing a
data model is an attempt to reach
performance objectives.
The structure can be overwhelmingly
complex. We may wind up creating many
small relations which the user might think of
as a single relation or group of data.
Structural Dimensions
The first step is the development of the
structural dimensions. This step corresponds
very closely to what we normally do in a
relational database.
The star architecture that we will develop
here depends upon taking the central
intersection entities as the fact tables and
building the foreign key => primary key
relations as dimensions.
Simple DW pattern.
Other Dimensions
Categorical dimensions: generated
groups (additional key components)
Partitioning dimensions: subtypes
(planned vs. actual)
Informational dimensions: generate
different types of data (messy).