Data Warehouse
Download
Report
Transcript Data Warehouse
Data Warehouse
IMS5024 – presented by Eder Tsang
Data Warehouse
A data warehouse is a system consisting of processes
and databases used to provide the “data infrastructure”
for EIS and DSS
“… a subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of management’s
decisions”
Inmon and Hackathorn (1994)
Data warehouse - subject oriented
The data warehouse is organised by “data subjects” that
are relevant to the organisation.
– Customer, claim, shipment, product
This may be contrasted with the process orientation of
many OLTP systems
Data warehouse - integrated
Data in the warehouse is structured based on a corporatewide model, spanning the functional boundaries of legacy
systems
This includes naming standards, units of measurement and
periodicity
Data warehouse - time variant
Data is the data warehouse is characterised by the timeseries nature of historical data
The data consists of a series of “snapshots” which are
time-stamped and record values at a moment in time
This supports trend analysis of the data
Data warehouse - non volatile
The data warehouse is not continuously updated (inserts,
eletes and changes) like data in an OLTP system
Data in a data warehouse is periodically up-loaded at a
scheduled time intervals (say daily)
Motivations for data warehousing
Demands on OLTP data bases for query processing
would be too great
Data warehousing is designed for efficient retrieval
Data in legacy systems is frequently inconsistent, of poor
quality and stored in different formats
Reduce costs in providing data for decision making
Motivations for data warehousing
Support for focus on complete business processes (BPR)
Support for new initiatives
– CRM, Balanced Scorecard
Industry sources quote ROI’s averaging 401% over 3
years
Remain competitive
Typical Data Warehouse Architecture
An Actual Data Warehouse
Data warehouse development
Requirements identification
Logical design, data modelling
Data extract, transform and load (ETL)
Warehouse architecture , technology and tools
Physical database design
Delivery systems
Operational policies
Designing a data warehouse –
data design
There are two main approaches to data modelling or data
warehouse design
– entity relationship modelling and normalisation
– dimensional modelling
The design of databases using a
traditional E-R approach
Entities and relationships
Normalisation 3NF
Entity relationship schema
Why do we normalise data?
Normalisation is a process for converting complex data
structures into simple, stable data structures
Normalisation protects integrity of database by avoiding
anomalies (update, delete, create)
• Normalised data models are:
• robust and stable
• have minimum redundancy
Dimensional Modeling (star schema)
Dimensional Modeling (star schema)
Components of dimensional model:
–
Fact Tables : contain measurements of business
eg. Sales, purchase order, shipment
–
Dimension Tables : store the descriptions of the dimensions
of the business
eg. Product, customer, vendor, store
Dimensional Modeling (star schema)
Each dimension table has a single primary key that
corresponds exactly to one of the components of the
multipart key in the fact table.
A fact table always expresses a many to many relationship
(the key is composed of foreign keys
The most useful facts in a fact table are numeric and
additive
( typically values are added up)
Snowflake schema
Snowflake schema –all the tables are normalised
Star schemas are preferable to snowflake – fewer joins for
information retrieval
Dimensional Modelling vs
E-R modelling
the purpose of dimensional modelling structure data for easy
and efficient analysis
E-R
modelling
creates
a
single
required
to
support
organisation’s Whereas
DM creates individual models for business/decision interest
eg. • model for sales info
• model for Inventory info
Entity relationship schema (3NF)
Corresponding Star schema
Corresponding snowflake schema
Dimensional Modelling vs
E-R modelling (Con’t)
OLTP and DW have different purpose: – operational vs
informational
Normalisation protects integrity of database by avoiding
anomalies (update, delete, create)
Data models for data warehouse do not have to be normalised
– In contrast, data in DW does not change often
– periodic additions of new data
DM vs. E-R modeling debate
(Kimball’s view)
OLTP systems are volatile – high rates of update transactions
In normalised models the goal is to reduce data redundancy
and prevent update anomalies
Data in a data warehouse does not need to be normalised
because it is periodically refreshed not updated by user
transactions