Data Warehouse Design
Download
Report
Transcript Data Warehouse Design
Designing a Data Warehouse
Issues in DW design
Three Fundamental Processes
Data Acquisition
Data Storage
Data a
Access
Data Acquisition
Handles acquisition of data from legacy
systems and outside sources.
Data is identified, copied, formatted and
prepared for loading into the
warehouse.
Acquisition steps
Catalog the data
Clean and prepare the data.
Develop an inventory of where it is
and what it means.
Extract from legacy files and reformat
to make it usable.
Transport data from one location to
another.
Storage
The storage component holds the
data so that the many different data
mining, executive information
and decision support systems can make
use of it effectively.
The Storage Area
Managed by
Relational databases
like those from Oracle Corp. or Informix
Software Inc.
Specialized hardware
symmetric multiprocessor (SMP)
or massively parallel processor
(MPP) machines
Storage
The majority of warehouse storage
today is being managed by relational
databases running on Unix platforms.
Oracle, Sybase Inc., IBM Corp. and
Informix control 65 percent of the
warehouse storage market. Meta Group
Inc. (1996)
Access
Different end-user PCs and workstations draw
data from the warehouse with the help of
multidimensional analysis products, neural
networks, data discovery tools or analysis
tools.
These powerful, "smart" software products
are the real driving force behind the viability
of data warehousing.
Access Tools
Intelligent Agents and Agencies
Query Facilities and Managed Query
Environments
Statistical Analysis
Data Discovery.
(decision support, artificial intelligence and
expert systems)
OLAP
Data Visualization
Hardware Budget
A typical startup warehouse project
allocates more than 60 percent of its
budget for hardware and software to
the creation of a powerful storage
component, spending just 30 percent
on data mining and user access
technologies.
Systems Analysis Budget
Budgeting for systems analysis and
development, however, follows a very
different pattern.
More than 50 percent of development dollars
are spent on building acquisition capabilities,
30 percent fund the development of user
solutions and
20 percent are dedicated to the creation of
databases in the storage component.
Design Issues
Relational and Multidimensional Models
Denormalized and indexed relational
models more flexible
Multidimensional models simpler to use
and more efficient
Star Schemas in a RDBMS
In most companies doing ROLAP, the DBAs
have created countless indexes and summary
tables in order to avoid I/O-intensive table
scans against large fact tables. As the indexes
and summary tables proliferate in order to
optimize performance for the known queries
and aggregations that the users perform, the
build times and disk space needed to create
them has grown enormously, often requiring
more time than is allotted and more space
than the original data!
Building a Data Warehouse
from a Normalized Database
The steps
Develop a normalized entity-relationship
business model of the data warehouse.
Translate this into a dimensional model. This
step reflects the information and analytical
characteristics of the data warehouse.
Translate this into the physical model. This
reflects the changes necessary to reach the
stated performance objectives.
The Business Model
Identify the data structure, attributes
and constraints for the client’s data
warehousing environment.
Stable
Optimized for update
Flexible
Business Model
As always in life, there are some disadvantages
to 3NF:
Performance can be truly awful. Most of the
work that is performed on denormalizing a
data model is an attempt to reach
performance objectives.
The structure can be overwhelmingly
complex. We may wind up creating many
small relations which the user might think of
as a single relation or group of data.
Structural Dimensions
The first step is the development of the
structural dimensions. This step corresponds
very closely to what we normally do in a
relational database.
The star architecture that we will develop
here depends upon taking the central
intersection entities as the fact tables and
building the foreign key => primary key
relations as dimensions.
Simple DW pattern.
Other Dimensions
Categorical dimensions: generated
groups (additional key components)
Partitioning dimensions: subtypes
(planned vs. actual)
Informational dimensions: generate
different types of data (messy).