Slides - Zhangxi Lin`s homepage

Download Report

Transcript Slides - Zhangxi Lin`s homepage

ISQS 6339, Business Intelligence
Dimensional Modeling
Zhangxi Lin
Texas Tech University
1
1
Outline



Principles of Dimensional Modeling
Data Warehousing Methodology
Three Phases of Dimensional Modeling
2
PRINCIPLES OF
DIMENSIONAL
MODELING
3
Dimensional Model

Also called star schema (but snowflake schema is also fine)
◦ Fact table is in the middle and dimensions serving as the points
on the star.
◦ A normalized fact table plus denormalized dimension tables

Reference: database normalization
◦ Edgar F. Codd, the inventor of the relational model, introduced the concept of
normalization and what we now know as the First Normal Form (1NF) in
1970. Codd went on to define the Second Normal Form (2NF) and Third
Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the
Boyce-Codd Normal Form (BCNF) in 1974.
◦ Informally, a relational database table is often described as "normalized" if it is in
the Third Normal Form. Most 3NF tables are free of insertion, update, and
deletion anomalies.
4
4
Star Schema Model
Product Table
Store Table
Product_id
Product_disc,...
Store_id
District_id,...
Sales Fact Table
Central
fact table
Product_id
Store_id
Item_id
Day_id
Sales_amount
Sales_units, ...
Denormalized
dimensions
Time Table
Day_id
Month_id
Year_id,...
Item Table
Item_id
Item_desc,...
5
5
Snowflake Schema Model
Product Table
Product_id
Product_desc
Store Table
Store_id
Store_desc
District_id
District Table
District_id
District_desc
Sales Fact Table
Item_id
Store_id
Product_id
Week_id
Sales_amount
Sales_units
Time Table
Item Table
Dept Table
Mgr Table
Week_id
Period_id
Year_id
Item_id
Item_desc
Dept_id
Dept_id
Dept_desc
Mgr_id
Dept_id
Mgr_id
Mgr_name
6
6
Snowflake Schema Model
◦ Direct use by some tools
◦ More flexible to change
◦ Provides for speedier data loading
◦ Can become large and unmanageable
◦ Degrades query performance
◦ More complex metadata
Country
State
County
City
7
7
Facts

Definition
◦ Measure – a numeric quantity expressing some aspect of the
organization’s performance
◦ Aggregate – formed by combining values from a given dimension
or set of dimensions to create a single value.
Measurements associated with a specific business
process.
 Most facts are additive (calculative); others are semiadditive, non-additive, or descriptive (e.g. factless fact
table).
 Many facts can be derived from other facts. So, nonadditive facts can be avoided by calculating it from
additive facts.

Fact Table Characteristics
◦
◦
◦
◦
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly
Can contain base, derived,
and summarized data
Sales Fact Table
◦ Are typically additive
Product_id
◦ Are joined to dimension
Store_id
tables through foreign keys
Item_id
Day_id
that reference primary
keys in the dimension tables Sales_amount
Sales_units
...
9
9
The Three Fact Table Types

Transaction fact table
◦ The most basic and fundamental
◦ “One row per line in a transaction", e.g., every line on a receipt
◦ A transactional fact table holds data of the most detailed level
◦ have a great number of dimensions associated with it

Periodic snapshot fact table
◦ Takes a "picture of the moment“
◦ Cumulative performance over specific time intervals
◦ Dependent on the transactional table
◦ Valuable to combine data across several business processes in the value chain

Accumulating snapshot fact table
◦ Used to show the activity of a process that has a well-defined beginning and end
◦ Constantly updated over time
10
Types of facts
Week Date Trans#
1
1A1-1
1
2A1-2
1
4A1-3
2
2A2-1
2
2A2-2
3
1A3-1
4
2A4-1
4
3A4-2
4
3A4-3
4
5A4-4
Change OldBal NewBal
100
1000
1100
-50
1100
1050
200
1050
1250
-120
1250
1130
200
1130
1330
-300
1330
1030
-20
1030
1010
100
1010
1110
250
1110
1360
-220
1360
1140



Transaction fact: each row
Periodic snapshot fact:
(OldBal, NewBal) on each
transaction
Accumulating snapshot
fact: The average numbers in a
week, such as average balance,
number of transactions,
average amount of
transactions, the total amount
of trading in a given period.
Dimensions

Definition: a categorization used to spread out an aggregate
measure to reveal its constituent part

The foundation of the dimensional model to describe the objects
of the business
The nouns of the DW/BI system
◦ Business processes (facts) are the verbs of the business
Dimension tables link to all the business processes.
A dimension shared across all processes is called conformed
dimension
The analysis involving data from more than one business process
is called drill-across.




12
12
Attributes
An additional piece of information
pertaining to a dimension member that is
not the unique identifier or the
description of the member.
 Attributes can be used to more fully
describe dimension members

Dimension Table Characteristics

Dimension tables have the following characteristics:
◦ Contain textual information that represents the attributes of
the business
◦ Contain relatively static data
◦ Are joined to a fact table through
a foreign key reference
14
14
Star Dimensional Model
Characteristics
◦
◦
◦
◦
◦
◦
◦
◦
◦
The model is easy for users to understand.
Primary keys represent a dimension.
Nonforeign key columns are values.
Facts are usually highly normalized.
Dimensions are completely denormalized.
Fast response to queries is provided.
Performance is improved by reducing table joins.
End users can express complex queries.
Support is provided by many front-end tools.
15
15
The Time Dimension

Time is critical to the data warehouse. A consistent
representation of time is required for extensibility.
Sales fact
Time
dimension
Where should the element of time be stored?
16
16
Hierarchies

Meaningful, standard ways to group the data within
a dimension
◦ Variable-depth hierarchies
◦ Frequently changing hierarchies

Examples of hierarchy in a dimension
◦ Address: street, city, state, country
◦ Organization: section, division, branch, region
◦ Time: year, quarter, month, date
17
Data Cube



Data cubes are multidimensional extensions of 2-D tables, just as in
geometry a cube is a three-dimensional extension of a square. The
word cube brings to mind a 3-D object, and we can think of a 3-D
data cube as being a set of similarly structured 2-D tables stacked
on top of one another.
Data cubes aren't restricted to just three dimensions. Most OLAP
systems can build data cubes with many more dimensions allows up
to 64 dimensions.
In practice, we often construct data cubes with many dimensions,
but we tend to look at just three at a time. What makes data cubes
so valuable is that we can index the cube on one or more of its
dimensions.
18
18
Data Cube
Region
Product
Time
OLAP system


OLAP – allows users to retrieve information from data
quickly for analysis purposes
Features
◦ Multidimensional database
◦ Easily understood
◦ What is OLAP? 5’04”
◦ SQL OLAP Tutorial - Data Warehouse Schema Design 9’45”
Dimensional Modeling Process

High level dimensional model design
◦
◦
◦
◦
Choosing business model in accordance with the analytic theme
Declaring the grain
Choosing dimensions
Identifying the facts
Detailed dimensional model development
 Dimensional model review and validation

◦ IS
◦ Core users
◦ Business community

Final design iteration
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
21
DATA WAREHOUSING
METHODOLOGY
22
Data Warehouse Development
Approaches
Data warehouse development approaches

◦
Kimball Model: Data mart approach

◦
Data marts - EDW
Inmon Model: EDW approach

EDW – Data Marts
Which model is better?

◦
◦
There is no one-size-fits-all strategy to data warehousing
One alternative is the hosted warehouse
23
Comparison

Kimball Model
◦ Kimball’s model follows a bottom-up approach. The Data Warehouse (DW) is
provisioned from Datamarts (DM) as and when they are available or required.
◦ The Datamarts are sourced from OLTP systems are usually relational databases
in Third normal form (3NF).
◦ The Data Warehouse which is central to the model is a de-normalized star
schema. The OLAP cubes are built on this DW.

Inmon Model
◦ Inmon’s model follows a top-down approach. The Data Warehouse (DW) is
sourced from OLTP systems and is the central repository of data.
◦ The Data Warehouse in Inmon’s model is in Third Normal Form (3NF).
◦ The Datamarts (DM) are provisioned out of the Data Warehouse as and when
required. Datamarts in Inmon’s model are in 3NF from which the OLAP cubes
are built.
Strengths and Weaknesses

Scalable vs. structural
◦ Kimball’s model is more scalable because of the bottom-up approach and hence
you can start small and scale-up eventually. The ROI is usually faster with
Kimball’s model. Because of this approach it is difficult to created re-usable
structures/ ETL for different data marts.
◦ On the other hand Inmon’s model is more structured and easier to maintain
while it is rigid and takes more time to build. The significant advantage of Inmon’s
model is because the DW is in 3NF; it is easier to build data mining models.


Both Kimball and Inmon models agree and emphasis that DW is
the central repository of data and OLAP cubes are built of denormalized star schemas.
In conclusion, when it comes to data modeling, it is irrelevant which
camp you belong to as long as you understand why you are
adopting a specific model. Sometimes it makes sense to take a
hybrid approach.
General Data Warehouse
Development Approaches

“Big bang” approach

Incremental approach:
◦ Top-down incremental approach
◦ Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
27
“Big Bang” Approach
Analyze enterprise
requirements
Build enterprise
data warehouse
Report in subsets or
store in data marts
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
28
Incremental Approach
to Warehouse Development
Multiple iterations
 Shorter implementations
 Validation of each phase

Increment 1
Strategy
Definition
Analysis
Design
Iterative
Build
Production
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
29
Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load
processes for the initial subject area
Create and populate the initial subject area
data mart within the overall warehouse
framework
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
30
Bottom-Up Approach
Define the scope and coverage of the
data warehouse and analyze the source
systems within this scope
Define the initial increment based on the
political pressure, assumed business
benefit and data volume
Implement base technical architecture
and establish metadata, extraction, and
load processes as required by increment
Create and populate the initial subject
areas within the overall warehouse
framework
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
31
Note:There are many details about data warehouse design, which need
a lot effort to learn. Because of limited time to spend for this part, here
are only some of the details.
THREE PHASES OF
DATA WAREHOUSE
DESIGN
32
Data Warehouse Database Design
Phases
 Phase
1: Defining the business model
 Phase 2: Defining the dimensional model
 Phase 3: Defining the physical model
33
33
Phase 1: Defining the Business
Model
◦ Performing strategic analysis
◦ Define business analytic theme
◦ Creating the business model
◦ Documenting metadata
34
34
Performing Strategic Analysis
Identify crucial business processes
 Understand business processes
 Prioritize and select the business processes to
implement

High
Business
Benefit
Low
Low
Feasibility
High
35
35
Creating the Business Model

Defining business requirements:
◦ Identifying the business measures
◦ Identifying the dimensions
◦ Identifying the grain
◦ Identifying the business definitions and rules

Verifying data sources
36
36
Business Requirements Drive the
Design Process
◦ Primary input
Business
Requirements
◦ Secondary input
Existing Metadata
Production ERD Model
Research
37
37
Identifying Measures and
Dimensions
The
Measures
attribute is perceived as
constant or discrete:
◦ Product
◦ Location
◦ Time
◦ Size
◦
◦
◦
◦
attribute varies continuously:
Balance
Units Sold
Cost
Sales
The
Dimensions
38
38
Using a Business Process Matrix
Business Processes
Business
Dimensions
Sales
Returns
Inventory
Customer
Date
Product
Channel
Promotion
Sample of business process matrix
39
39
Determining Granularity
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
40
40
Identifying Business Rules
Location
Geographic proximity
0 - 1 miles
1 - 5 miles
> 5 miles
Time
Month > Quarter > Year
Product
Type
Monitor
Status
PC
Server
15 inch
17 inch
19 inch
None
New
Rebuilt
Custom
Store
Store > District > Region
41
41
Documenting Metadata
 Documenting
metadata should include:
◦ Documenting the design process
◦ Documenting the development process
◦ Providing a record of changes
◦ Recording enhancements over time
42
42
Metadata Documentation Approaches
◦ Automated
 Data modeling tools
 ETL tools
 End-user tools
◦ Manual
43
43
Phase 2: Defining the
Dimensional Model
◦ Identify fact tables:
 Translate business measures into fact tables
 Analyze source system information for additional
measures
◦ Identify dimension tables
◦ Link fact tables to the dimension tables
◦ Model the time dimension
44
44
Illustrative case – IMW Data
Transaction fact – The transaction table
 Periodic snapshot fact table – current
records in Land & Office facts
 Accumulating snapshot fact table – N/A in
this case

Steps in designing a fact table

Identify a business process for analysis (like sales).

Identify measures or facts (sales dollar), by asking questions like 'What
number of XX are relevant for the business process?', replacing the XX
with various options that make sense within the context of the business.

Identify dimensions for facts (product dimension, location dimension, time
dimension, organization dimension), by asking questions that make sense
within the context of the business, like 'Analyse by XX', where XX is
replaced with the subject to test.

List the columns that describe each dimension (region name, branch name,
business unit name).

Determine the lowest level (granularity) of summary in a fact table (e.g.
sales dollars).

An alternative approach is the four step design process described in
Kimball. – Check what it is
Using Time in the Data Warehouse
◦ Defining standards for time is critical.
◦ Aggregation based on time is complex.
47
47
Using Data Modeling Tools
◦ Tools with a GUI enable definition, modeling, and reporting.
◦ Avoid a mix of modeling techniques caused by:
 Development pressures
 Developers with lack of knowledge
 No strategy
◦ Determine a strategy.
◦ Write and publish formally.
◦ Make available electronically.
48
48
Phase 3: Defining the Physical
Model

Why
◦ Huge amount of data must be effectively processed and retrieved in
realtime.

How
◦ Translate the dimensional design to a physical model for
implementation.
◦ Define storage strategy for tables and indexes.
◦ Perform database sizing.
◦ Define initial indexing strategy.
◦ Define partitioning strategy.
◦ Update metadata document with physical information.
49
49
Storage and Performance
Considerations
 Database
sizing
 Data partitioning
 Indexing
 Star query optimization
50
50
Database Sizing - Test Load
Sampling


Analyze a representative sample of the data chosen using
proven statistical methods.
Ensure that the sample reflects:
◦
◦
◦
◦
Test loads for different periods
Day-to-day operations
Seasonal data and worst-case scenarios
Indexes and summaries
51
51
Data Partitioning
Breaking up of data into separate physical
units that can be handled independently
 Types of data partitioning
◦ Horizontal partitioning.
◦ Vertical partitioning

52
52
Indexing

Indexing is used for the following reasons:
◦ It is a huge cost saving, greatly
improving performance and
scalability.
◦ It can replace a full table scan by
a quick read of the index followed
by a read of only those disk
blocks that contain the rows
needed.
53
53
Parallelism
Sales table
Customers
table
P1
P2
P3
P1
P2
P3
Parallel Execution Servers
54
54
Using Summary Data

Designing summary tables offers the following benefits:
◦ Provides fast access to precomputed data
◦ Reduces use of I/O, CPU, and memory
55
55