Lesson15 Data_Warehousingx

Download Report

Transcript Lesson15 Data_Warehousingx

Data Warehouses
Decision Support
Advanced Database
Dr. AlaaEddin Almabhouh
Topic & Structure of Lesson
In this lecture we will be looking at:
 Decision making
 Data Warehousing



Definition
Applications
Technologies
Slide 2 (of 63)
The Evolution of Data Warehousing






Since 1970s, organizations gained competitive advantage through systems
that automate business processes to offer more efficient services to the
customer.
This resulted in accumulation of growing amounts of data in operational
databases.
Organizations now focus on ways to use operational data to support
decision-making, as a means of gaining competitive advantage.
However, operational systems were never designed to support such
business activities.
Organizations need to turn their archives of data into a source of
knowledge, so that a single integrated / consolidated view of the
organization’s data is presented to the user.
A data warehouse was deemed the solution to meet the requirements of a
system capable of supporting decision-making, receiving data from multiple
operational data sources.
Slide 24 (of 63)
Types of Data Sources






Operational Databases
Commercial Databases
Physical Documents
Internet / WWW
“Desktop excel files”
Any electronic document ?
Slide 7 (of 63)
Types of Data

Operational Data (OLTP applications) Online
Transaction Processing
OLTP refers to a class of systems that facilitate and manage
transaction-oriented applications, typically for data entry and
retrieval transaction processing.





Data that “works”
Frequent updates and queries
Normalized for efficient search and updates
Point Queries:
Does OLTP aid decision making ?
Slide 3 (of 63)
Types of Data

Historical Data (OLAP applications) Online
Analytical Processing
OLAP is a data structure that allows fast analysis of data.
It can also be defined as the capability of analyzing data from
multiple perspectives.







Data that “tells”
Very frequent updates
Integrated data set with global relevance
Analytical queries that required huge amounts of data
Performance issues mainly in query response time (not
in updates)
Slide 3 (of 63)
Example of OLTP Queries
What is the salary of Mr.Adam?
 What is the address and phone number of the
person in charge of the Supplies department?
 How many employees working in UK branch?

Slide 3 (of 63)
Example of OLAP Queries
How is the employee attrition scene changing
over the years across the company?
 Is
there a correlation between the
geographical location of a company unit and
excellent employee appraisals?
 Is it financially viable to continue our
manufacturing unit in Malaysia?

Slide 3 (of 63)
What is a Data Warehouse?

A central database that is loaded from multiple
operational databases for the purpose of end-user
access and decision support.

Data warehouses offer data integration solutions
and improved access to timely, accurate and
consistent data.

Data warehouses also help in accessing,
aggregating and analyzing large amounts of data
from diverse sources to understand historical
performance or behavior and to predict and manage
outcomes.
Slide 24 (of 63)
Definition
Data Warehousing


Bill Inmon
"A subject-oriented, integrated, timevariant, non-volatile collection of data in
support of management's decision -making
process"
Slide 25 (of 63)
Definition (Cont ...)

Subject-Oriented: The warehouse is organized
around the major subjects of the enterprise (e.g.
customers, products, and sales) rather than the
major application areas (e.g. customer invoicing,
stock control, and product sales).

Integrated: The data warehouse integrates
corporate application-oriented data from different
source systems, which often includes data that is
inconsistent.
Slide 25 (of 63)
Definition (Cont ...)

Time-Variant: Data in the warehouse is only
accurate and valid at some point in time or over
some time interval.

Non-Volatile: Data in the warehouse is not
updated in real-time but is refreshed from
operational systems on a regular basis.
Slide 25 (of 63)
Data Warehouse vs. Operational
DB Systems
Operational Systems
Data Warehouse Systems
Used by management
Used by front-line workers
Strategic value
Tactical value
Supports strategic direction
Used for on-line analysis
Supports day-to-day
operation
Used for transaction
processing
Subject oriented
Application oriented
Stores historical data
Stores current data only
Slide 27 (of 63)
Data Warehouse Architecture
Operational
/information data
Sales
Database
Data Warehouse
Costumer
Database
•Report/ Query
tools
Metadata
Product
Database
•OLAP tools
Data extraction
Data tranforming
Data load
Data that describes d/w. (eg: •Data mining tools
info about data sources, rules,
audit trails
Major Building Blocks (Components)
of the Data Warehouse
1.
2.
3.
4.
5.
6.
Source data component (operation data store)
Data staging component
Data storage component
Information delivery component
Metadata component
Management and control component
Definition of Data Mart (or Datamart)



A database, or collection of databases, designed to
help managers make strategic decisions about their
business.
Whereas a data warehouse combines databases
across an entire enterprise, data marts are usually
smaller and focus on a particular subject or
department.
Some data marts, called dependent data marts, are
subsets of larger data warehouses.
Slide 41 (of 63)
Architecture

See other handout
Slide 43 (of 63)
Q &A
Slide 81 (of 82)
Data Warehouse
Dimensional Modeling Design
Advanced Databse
Topics





Levels of Modeling
Data Warehouse Modeling
The General Approach – The Star Schema Development
The Database Component of a Data Warehouse – Fact Table
and Dimension Table
A Case Study
Databases & Modeling
Type of
Database
Relational
Database
Constructs
ERD
Characteristics
Row/
Column
Data Warehouse
Dimensional
Modeling
Cube
Distributed
Database
Distributed
Component
Object Model
Client Object
(DCOM)
Class
Diagram
Object
Object-Oriented
Database
Object = Data + Operations(Services);
Entity = Data only
New
Trend
OLAP
DW
XML
UML
Data Warehouse Modeling
•
•
•
Also called Dimensional Modelling
Ways to derive the database component of a data
warehouse
Every dimensional model (DM) is composed of
one table with a composite primary key, called the
fact table, and a set of smaller tables called
dimension tables.
Example: Data Warehouse Modeling
Fact Table
Hotel
Hotel_No Key
Hotel Desc
Hotel name
Room_no key
Single
Double
Family
Times
Sales
Hotel_No Key
Guest Key
Time Key
YTD_Sales_dollars_by_hotel
YTD_Sales_dollar_by_Type
YTD_Sales_By_Business
YTD_Sales_by_non-business
time key
day of week
quarter
year
Guest Profile
Profile key
Profile desc
Territory
Age category
Income category
Demographics
Demographic Key
Cluster 1 Population
Cluster 2 Population
The General Approach

Create the high level enterprise ERD

Develop logical data model for subject area only

Create data warehouse data model DM

Develop physical data model
Dimensional Modelling

Modelling technique that aims to present the data in a standard,
intuitive form that allows for high-performance access.

Every dimensional model (DM) is composed of one table with
a composite primary key, called the fact table, and a set of
smaller tables called dimension tables.

Each dimension table has a simple (non-composite) primary
key that corresponds exactly to one of the components of the
composite key in the fact table.
Dimensional Modelling

All natural keys are replaced with surrogate keys (branch Id
instead of branch #). Means that every join between fact and
dimension tables is based on surrogate (intelligence) keys, not
natural keys.

Surrogate keys allows data in the warehouse to have some
independence from the data used and produced by the OLTP
systems.
Dimensional Modelling

Bulk of data in data warehouse is in fact tables, which can be
extremely large.

Important to treat fact data as read-only reference data that will
not change over time.

Most useful fact tables contain one or more numerical
measures, or ‘facts’ that occur for each record and are numeric
and additive.
Fact Table and Dimension Table

Fact Table: A Fact Table is a table in a relational
database with a multi-part key. Each element of the key is
itself a foreign key to a single dimension tale.

Dimension Tables
They are the constraints used in forming the fact table.
Inside A Dimension Table

Dimension table key. Primary key uniquely identifies each row
in the table.

Table is wide. Typically, a dimension table has many columns
or attributes.

Textual attributes. Dimension tables usually contain descriptive
textual information.
Inside A Dimension Table (Cont’d)

Drilling down, rolling up. The attributes in a dimension table
provide the ability to get to the details from high levels of
aggregation to lower levels of details.

Multiple Hierarchies. Dimension tables often provide for
multiple hierarchies, so that drilling down may be performed
along any of the multiple hierarchies.

Few number of record. A dimension table typically has fewer
number of records or rows than the fact table.
Inside The Fact Table

Concatenated Key. A row in the fact table relates to a
combination of rows from all the dimension tables.

Fully Additive Measures. The values of the attributes can be
summed up by simple additions.

Semi-additive Measures. Derived attributes such as percentages
are not additive. They are known as semi-additive measures.
Inside The Fact Table (Cont’d)

Table Deep, not Wide. Typically a fact table has fewer
attributes than a dimension table. But the number of records in a
fact table is very large in comparison.

Sparse Data. There are rows with null measures such as the date
representing a closed holiday. In this case, there is no need to
keep these rows.

Degenerate Dimensions. Examples of such attributes are
reference numbers like order numbers, invoice numbers, order
line numbers, and so on.
OLAP Cubes


A data warehouse is based on a multidimensional
data model which views data in the form of a data
cube
A data cube, such as sales, allows data to be
modeled and viewed in multiple dimensions


Dimension tables, such as item (item_name, brand, type),
or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and
keys to each of the related dimension tables
33
MOLAP Operations

Roll up (drill-up): summarize data


by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up

from higher level summary to lower level summary or detailed
data, or introducing new dimensions
34
Figure: Slicing a data cube
35
Star Schema vs. Snowflake Schema

Star Schema (or Star Joint Schema)
“A specific organization of a database in which a fact table with a
composite key is joined to a number of single-level dimension tables, each
with a single, primary key”

Snowflake Schema
A variant of the star schema where each dimension can have its
dimensions. Starflake schema is a hybrid structure that contains a mixture of
star (denormalized) and snowflake (normalized) schemas. Allows
dimensions to be present in both forms to cater for different query
requirements.
-- Kimball Ralph, Data Warehouse Toolkit ---
A STAR SCHEMA for Auto Sales
Product
Auto
Sale
Time
Payment
method
Dealer
Customer
Demographics
Facts: Actual sale price, Options price, Full price, Dealer add-on, Dealer
credit, Dealer invoice, Down payment , Proceeds, Finance vs.
Dimension Tables below
Time
Product
Payment
Method
Customer
Demographics
Dealer
Year
Model Name
Finance Type
Age
Dealer
name
Quarter
Model Year
Term (months)
Gender
City
Month
Package
styling
Interest rate
Income range
State
Date
Product
category
Agent
Marital status
Zone
Day of week
Exterior color
Household size
Day of month
Interior color
Home value
Season
Holiday flag
Own or rent
A Star Join Schema For A Food
Cooperative
Fact Table
Times
Food Item
Food Item Key
Food Item Desc
Qty
Dimension tables
Member Profile
Profile key
Profile desc
Territory
Age category
Income category
Sales
Food Item Key
Profile Key
Time Key
YTD_Sales_dollars
YTD_Sales_qty
Demographics
Demographic Key
Cluster 1 Population
time key
day of week
quarter
year
Time-series
Dimension
table
Star Schema for Property Sales
Time
Fact Table
Time Id (PK)
PropertySale
TimeId key
Propertyid key
Branchid key
Clinetid key
Promotionid key
Staffid key
Ownerid key
Day
week
Quarter
year
Branch
Branchid
(PK)
Promotion
Promotionid
(PK)
PropertyforSale
Propertyid
(PK)
Client
Clientid
(PK)
Staff
Owner
Ownerid (PK)
Staffid
(PK)
Fact and Dimension Tables for each
Business Process of Property Sales
Business Process
Fact Tables
Dimension Tables
Property Sales
PropertySale
Time, Branch Staff,
PropertyForSale, owner,
ClientBuyer, Promotion
Property Rentals
Lease
Time, Branch, Staff,
PropertyForRent, owner,
ClientRenter, Promotion
Property Viewing
PropertyViewing
Time,Branch, PropertyForSale
PropertyForRent, ClientBuyer
ClientRenter
Property Advertising
Advert
Time,Branch, PropertyForSale
PropertyForRent, Promotion,
Newspaper
Property Maintenance
PropertyMaintenan
ce
Time, Branch Staff,
PropertyForRent
Summary
In this lecture we looked at:
 Data Warehouse Design
Slide 62 (of 63)
Other Resources



http://office.microsoft.com/en-us/excel-help/data-miningadd-ins-HA010342915.aspx
http://www.microsoft.com/enus/download/details.aspx?id=7294
http://www.microsoft.com/enus/download/details.aspx?id=29061
Q &A
Slide 81 (of 82)