Data Warehouse Design

Download Report

Transcript Data Warehouse Design

Designing a Data Warehouse
Issues in DW design
Three Fundamental Processes




Data Acquisition
Data Storage
Data a
Access
Data Acquisition


Handles acquisition of data from legacy
systems and outside sources.
Data is identified, copied, formatted and
prepared for loading into the
warehouse.
Acquisition steps

Catalog the data


Clean and prepare the data.


Develop an inventory of where it is
and what it means.
Extract from legacy files and reformat
to make it usable.
Transport data from one location to
another.
Storage
The storage component holds the
data so that the many different data
mining, executive information
and decision support systems can make
use of it effectively.
The Storage Area
Managed by
 Relational databases


like those from Oracle Corp. or Informix
Software Inc.
Specialized hardware


symmetric multiprocessor (SMP)
or massively parallel processor
(MPP) machines
Storage


The majority of warehouse storage
today is being managed by relational
databases running on Unix platforms.
Oracle, Sybase Inc., IBM Corp. and
Informix control 65 percent of the
warehouse storage market. Meta Group
Inc. (1996)
Access


Different end-user PCs and workstations draw
data from the warehouse with the help of
multidimensional analysis products, neural
networks, data discovery tools or analysis
tools.
These powerful, "smart" software products
are the real driving force behind the viability
of data warehousing.
Access Tools




Intelligent Agents and Agencies
Query Facilities and Managed Query
Environments
Statistical Analysis
Data Discovery.
(decision support, artificial intelligence and
expert systems)


OLAP
Data Visualization
Hardware Budget

A typical startup warehouse project
allocates more than 60 percent of its
budget for hardware and software to
the creation of a powerful storage
component, spending just 30 percent
on data mining and user access
technologies.
Systems Analysis Budget



Budgeting for systems analysis and
development, however, follows a very
different pattern.
More than 50 percent of development dollars
are spent on building acquisition capabilities,
30 percent fund the development of user
solutions and
20 percent are dedicated to the creation of
databases in the storage component.
Design Issues
Relational and Multidimensional Models
 Denormalized and indexed relational
models more flexible
 Multidimensional models simpler to use
and more efficient
Star Schemas in a RDBMS
In most companies doing ROLAP, the DBAs
have created countless indexes and summary
tables in order to avoid I/O-intensive table
scans against large fact tables. As the indexes
and summary tables proliferate in order to
optimize performance for the known queries
and aggregations that the users perform, the
build times and disk space needed to create
them has grown enormously, often requiring
more time than is allotted and more space
than the original data!
Building a Data Warehouse
from a Normalized Database
The steps
 Develop a normalized entity-relationship
business model of the data warehouse.
 Translate this into a dimensional model. This
step reflects the information and analytical
characteristics of the data warehouse.
 Translate this into the physical model. This
reflects the changes necessary to reach the
stated performance objectives.
The Business Model



Identify the data structure, attributes
and constraints for the client’s data
warehousing environment.
Stable
Optimized for update
Flexible
Business Model
As always in life, there are some disadvantages
to 3NF:
 Performance can be truly awful. Most of the
work that is performed on denormalizing a
data model is an attempt to reach
performance objectives.
 The structure can be overwhelmingly
complex. We may wind up creating many
small relations which the user might think of
as a single relation or group of data.
Structural Dimensions


The first step is the development of the
structural dimensions. This step corresponds
very closely to what we normally do in a
relational database.
The star architecture that we will develop
here depends upon taking the central
intersection entities as the fact tables and
building the foreign key => primary key
relations as dimensions.
Simple DW pattern.
Other Dimensions



Categorical dimensions: generated
groups (additional key components)
Partitioning dimensions: subtypes
(planned vs. actual)
Informational dimensions: generate
different types of data (messy).