Dimensional Modeling - Zhangxi Lin
Download
Report
Transcript Dimensional Modeling - Zhangxi Lin
ISQS 3358, Business Intelligence
Dimensional Modeling
Zhangxi Lin
Texas Tech University
1
1
Outline
Data Warehousing Approaches
Dimensional Modeling
Data Warehousing with Microsoft SQL Server
2005
Case: Adventure Works Cycles (AWC)
: Data Warehouse Design Phases
2
Data Warehousing
Approaches
3
Data Warehouse Development
Approaches
Data warehouse development approaches
◦
◦
Inmon Model: EDW approach
Kimball Model: Data mart approach
Which model is better?
◦
◦
There is no one-size-fits-all strategy to data
warehousing
One alternative is the hosted warehouse
4
General Data Warehouse
Development Approaches
“Big bang” approach
Incremental approach:
◦ Top-down incremental approach
◦ Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
5
“Big Bang” Approach
Analyze enterprise
requirements
Build enterprise
data warehouse
Report in subsets or
store in data marts
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
6
Incremental Approach
to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
Increment 1
Strategy
Definition
Analysis
Design
Iterative
Build
Production
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
7
Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load
processes for the initial subject area
Create and populate the initial subject area
data mart within the overall warehouse
framework
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
8
Bottom-Up Approach
Define the scope and coverage of the
data warehouse and analyze the source
systems within this scope
Define the initial increment based on the
political pressure, assumed business
benefit and data volume
Implement base technical architecture
and establish metadata, extraction, and
load processes as required by increment
Create and populate the initial subject
areas within the overall warehouse
framework
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
9
Dimensional Modeling
10
Dimensional Model
Also called star schema
◦ Fact table is in the middle and dimensions serving as the points on the star.
◦ A normalized fact table plus denormalized dimension tables
Facts
◦ Measurements associated with a specific business process.
◦ Most facts are additive (calculative); others are semi-additive, non-additive, or
descriptive (e.g. factless fact table).
◦ Many facts can be derived from other facts. So, non-additive facts can be
avoided by calculating it from additive facts.
Grain
◦ The level of detail contained in the fact table
◦ The lowest level of detail is called atomic fact table
11
11
Dimensions
The foundation of the dimensional model to describe the objects of
the business
The nouns of the DW/BI system
◦ Business processes (facts) are the verbs of the business
Dimension tables link to all the business processes.
A dimension shared across all processes is called conformed
dimension
The analysis involving data from more than one business process is
called drill-across.
12
12
Data Cube
Data cubes are multidimensional extensions of 2-D tables, just as in
geometry a cube is a three-dimensional extension of a square. The
word cube brings to mind a 3-D object, and we can think of a 3-D data
cube as being a set of similarly structured 2-D tables stacked on top of
one another.
Data cubes aren't restricted to just three dimensions. Most OLAP
systems can build data cubes with many more dimensions allows up to
64 dimensions.
In practice, we often construct data cubes with many dimensions, but
we tend to look at just three at a time. What makes data cubes so
valuable is that we can index the cube on one or more of its
dimensions.
13
13
Determining Granularity
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
14
Star Schema Model
Product Table
Store Table
Product_id
Product_disc,...
Store_id
District_id,...
Sales Fact Table
Central
fact table
Product_id
Store_id
Item_id
Day_id
Sales_amount
Sales_units, ...
Denormalized
dimensions
Time Table
Day_id
Month_id
Year_id,...
Item Table
Item_id
Item_desc,...
15
15
Snowflake Schema Model
Product Table
Product_id
Product_desc
Store Table
Store_id
Store_desc
District_id
District Table
District_id
District_desc
Sales Fact Table
Item_id
Store_id
Product_id
Week_id
Sales_amount
Sales_units
Time Table
Item Table
Dept Table
Mgr Table
Week_id
Period_id
Year_id
Item_id
Item_desc
Dept_id
Dept_id
Dept_desc
Mgr_id
Dept_id
Mgr_id
Mgr_name
16
16
Snowflake Schema Model
◦
◦
◦
◦
◦
◦
Country
Direct use by some tools
More flexible to change
Provides for speedier data loading
Can become large and unmanageable
Degrades query performance
More complex metadata
State
County
City
17
17
Dimensional Modeling Process
High level dimensional model design
◦
◦
◦
◦
Choosing business model
Declaring the grain
Choosing dimensions
Identifying the facts
Detailed dimensional model development
Dimensional model review and validation
◦ IS
◦ Core users
◦ Business community
Final design iteration
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
18
Example: Commrex Real Estate Data
Warehousing
Analytic themes
◦ How to encourage realtors to use the online ASP services
Value Chain
◦ Listors create their account
◦ Listors post their real estate properties to the web-based database services
and pay listing fees
◦ Property buyers search the website-based database and buy properties from
listors. This is the incentive for listors to use the ASP services
Business Processes
◦
◦
◦
◦
◦
Listor sign up
Listor account management
Property data posting
Property search
Property database maintenance
19
19
IMW’s Database ERD Model
Property Listing Database
Membership Database
Property ID
M:1
Listor ID
M:M
Property Type
Listor ID
Listor Name
Property Type
Type Name
Address
Company ID
Subtype 1
City
Subtype 2
Chapter
Chapter
Subtype n
Feature
Legends
Primary Key
M:M
Functions
Specializations
Company ID
Comp Name
Address
Secondary Key
Telephone #
Link to a table
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
20
IMW’s Data Warehouse Dimensional Model
Property Listing Fact
Property SubType
Dimension
Membership Dimension
Property ID
Listor ID
Listor ID
Prop SubType
Listor Name
Prop SubType
SubType Name
Address
Company ID
Property Type
City
Chapter
Property Type
Type Name
Property Type
Dimension
Chapter
Functions
Feature
Specializations
Company ID
Legends
Primary Key
Secondary Key
Comp Name
Company
Dimension
Address
Telephone #
Link to a table
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
21
Data Warehousing with
Microsoft SQL Server 2005
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
22
Unified Dimensional Model
(UDM)
A SQL Server 2005 technology
A UDM is a structure that sits over the top of a data mart and
looks exactly like an OLAP system to the end user.
Advantages
◦
◦
◦
◦
No need for a data mart.
Can be built over one or more OLTP systems.
Mixed data mart and OLTP system data
Can include data from database from other vendors and XMLformatted data
◦ Allows OLAP cubes to be built directly on top of transactional data
◦ Low latency
◦ Ease of creation and maintenance
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
23
Microsoft BI Toolset
Relational engine (RDBMS)
◦ T-SQL
◦ .NET Framework Command Language Runtime (CLR)
SQL Server Integration Services (SSIS) – ETL
◦ Data Transformation Pipeline (DTP)
◦ Data Transformation Runtime (DTR)
SQL Server Analysis Service (SSAS) – queries, ad hoc use, OLAP, data mining
◦ Multi-Dimensional eXpressions (MDX) – a scripting language for data retrieval from
dimensional database
◦ Dimension design
◦ Cube design
◦ Data mining
SQL Server Reporting Services (SSRS) – ad hoc query, report building
Microsoft Visual Studio .NET is the fundamental tool for application
development
24
Structure and Components of
Business Intelligence
MS SQL Server 2005
SSMS
SSIS
SSAS
BIDS
SSRS
SAS
EG
SAS
EM
25
Understanding the Cube Designer
Tabs
Cube Structure: Use this tab to modify the architecture of a cube.
Dimension Usage: Use this tab to define the relationships between dimensions and measure
groups, and the granularity of each dimension within each measure group.
Calculations: Use this tab to examine calculations that are defined for the cube, to define new
calculations for the whole cube or for a subcube, to reorder existing calculations, and to debug
calculations step by step by using breakpoints.
KPIs: Use this tab to create, edit, and modify the Key Performance Indicators (KPIs) in a cube.
Actions: Use this tab to create or modify drillthrough, reporting, and other actions for the
selected cube..
Partitions: Use this tab to create and manage the partitions for a cube. Partitions let you store
sections of a cube in different locations with different properties, such as aggregation definitions.
Perspectives: Use this tab to create and manage the perspectives in a cube. A perspective is a
defined subset of a cube, and is used to reduce the perceived complexity of a cube to the
business user.
Translations: Use this tab to create and manage translated names for cube objects, such as
month or product names.
Browser: Use this tab to view data in the cube.
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
26
Case: Adventure Works
Cycles (AWC)
27
Case: Adventure Works Cycles
(AWC)
A fictitious multinational manufacturer
and seller of bicycles and accessories
Based on Bothell, Washington, USA and
has regional sales offices in several
countries
http://www.msftdwtoolkit.com/
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
28
Basic Business Information
Product orders by category
Product Orders by Country/Region
Product Orders by Sales Channel
Customers by Sales Channel Snapshot
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
29
AWC Business Requirements Interview summary
Interviewee: Brian Welker, VP of Sales
Sales to resellers: $37 million last year
17 people report to him including 3 regional sales managers
Previous problem: Hard to get information out of the company’s system
Major analytic areas:
Sales planning
Growth analysis
Customer analysis
Territory analysis
Sales performance
Basic sales reporting
Price lists
Special offers
Customer satisfaction
International support
Success criteria
Easy data access, Flexible reporting and analyzing, All data in one place
What’s missing? – A lot – No indication of business value
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
30
Business Processes
Purchase Orders
Distribution Center Deliveries
Distribution Center Inventory
Store Deliveries
Store Inventory
Store Sales
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
31
Analytic Themes
See the Excel file
AW_Analytic_Themes_List.xls
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
32
AWC’s Bus Matrix
Dimensions
Customer (Reseller)
Customer (Internet)
Sales Territory
Currency
X
X
X
X
X
Orders
X
X
X
X
X
X
X
Call
Tracking
X
X
X
X
X
X
Returns
X
X
X
X
X
X
X
Facility
Employee
X
Call Reason
Product
X
Promotion
Date
Sales
Forecasting
Channel
Business
Process
X
X
X
X
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
33
Prioritization Grid
High
Customer
Profitability
Product
Profitability
Orders
Orders
Forecast
Business
Value /
Impact
Call
Tracking
Manufacturing
Costs
Low Low
Feasibility
Exchange
Rates
Returns
High
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
34
Exercise 2 – A quick walk
through an SSAS application
Learning Objectives
◦ How to design a data source view with SSAS based on an
existing data warehouse
◦ How to design and deploy a cube.
Tasks
◦ Analysis Service Tutorial Lesson 1: Defining a Data Source View within an
Analysis Services Project
◦ Analysis Service Tutorial Lesson 2: Defining and Deploying a Cube
Deliverable:
◦ A Word file with the screenshot of the star schema emailed to
[email protected]
◦ The subject of the email is: “ISQS 3358 Exercise 2”
ISQS 6339, Data Mgmt & BI, Zhangxi
Lin
35
Supplemental Slides : Data
Warehouse Design Phases
36
Data Warehouse Database
Design Phases
Phase 1: Defining the business model
Phase 2: Defining the dimensional model
Phase 3: Defining the physical model
37
37
Phase 1: Defining the Business Model
◦ Performing strategic analysis
◦ Creating the business model
◦ Documenting metadata
38
38
Performing Strategic Analysis
Identify crucial business processes
Understand business processes
Prioritize and select the business processes to
implement
High
Business
Benefit
Low
Low
Feasibility
High
39
39
Creating the Business Model
Defining business requirements:
◦ Identifying the business measures
◦ Identifying the dimensions
◦ Identifying the grain
◦ Identifying the business definitions and rules
Verifying data sources
40
40
Business Requirements Drive
the Design Process
◦ Primary input
Business
Requirements
◦ Secondary input
Existing Metadata
Production ERD Model
Research
41
41
Identifying Measures
and Dimensions
Measures
The
attribute varies
continuously:
◦
◦
◦
◦
attribute is perceived as
constant or discrete:
Balance
Units Sold
Cost
Sales
The
◦
◦
◦
◦
Product
Location
Time
Size
Dimensions
42
42
Using a Business Process Matrix
Business Processes
Business
Sales
Dimension
s
Returns
Inventor
y
Customer
Date
Product
Channel
Promotion
Sample of business process matrix
43
43
Determining Granularity
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
44
44
Identifying Business Rules
Location
Geographic proximity
0 - 1 miles
1 - 5 miles
> 5 miles
Time
Month > Quarter > Year
Product
Type
Monitor
Status
PC
Server
15 inch
17 inch
19 inch
None
New
Rebuilt
Custom
Store
Store > District > Region
45
45
Documenting Metadata
Documenting metadata should include:
◦ Documenting the design process
◦ Documenting the development process
◦ Providing a record of changes
◦ Recording enhancements over time
46
46
Metadata Documentation Approaches
◦ Automated
Data modeling tools
ETL tools
End-user tools
◦ Manual
47
47
Phase 2: Defining the Dimensional Model
◦ Identify fact tables:
Translate business measures into fact tables
Analyze source system information for
additional measures
◦ Identify dimension tables
◦ Link fact tables to the dimension tables
◦ Model the time dimension
48
48
Star Dimensional Modeling
Product Table
Product_id
Product_desc
...
Time Table
Day_id
Month_id
Period_id
Year_id
Sales Fact Table
Product_id
Store_id
Item_id
Day_id
Sales_amount
Sales_units
...
Store Table
Store_id
District_id
...
Item Table
Item_id
Item_desc
...
49
49
Fact Table Characteristics
◦
◦
◦
◦
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly
Can contain base, derived,
Sales Fact Table
and summarized data
Product_id
Store_id
◦ Are typically additive
Item_id
◦ Are joined to dimension
Day_id
Sales_amount
tables through foreign keys
Sales_units
that reference primary
...
keys in the dimension tables
50
50
Dimension Table Characteristics
Dimension tables have the following characteristics:
◦ Contain textual information that represents the attributes of the
business
◦ Contain relatively static data
◦ Are joined to a fact table through
a foreign key reference
51
51
Star Dimensional
Model Characteristics
◦
◦
◦
◦
◦
◦
◦
◦
◦
The model is easy for users to understand.
Primary keys represent a dimension.
Nonforeign key columns are values.
Facts are usually highly normalized.
Dimensions are completely denormalized.
Fast response to queries is provided.
Performance is improved by reducing table joins.
End users can express complex queries.
Support is provided by many front-end tools.
52
52
Using Time in the Data Warehouse
◦ Defining standards for time is critical.
◦ Aggregation based on time is complex.
53
53
The Time Dimension
Time is critical to the data warehouse. A consistent representation
of time is required for extensibility.
Sales fact
Time
dimension
Where should the element of time be stored?
54
54
Using Data Modeling Tools
◦ Tools with a GUI enable definition, modeling, and
reporting.
◦ Avoid a mix of modeling techniques caused by:
Development pressures
Developers with lack of knowledge
No strategy
◦ Determine a strategy.
◦ Write and publish formally.
◦ Make available electronically.
55
55
Phase 3: Defining the
Physical Model
Why
◦ Huge amount of data must be effectively processed and
retrieved in realtime.
How
◦ Translate the dimensional design to a physical model for
implementation.
◦ Define storage strategy for tables and indexes.
◦ Perform database sizing.
◦ Define initial indexing strategy.
◦ Define partitioning strategy.
◦ Update metadata document with physical information.
56
56
Storage and Performance Considerations
Database sizing
Data partitioning
Indexing
Star query optimization
57
57
Database Sizing - Test Load Sampling
Analyze a representative sample of the data chosen using
proven statistical methods.
Ensure that the sample reflects:
◦ Test loads for different periods
◦ Day-to-day operations
◦ Seasonal data and worst-case scenarios
◦ Indexes and summaries
58
58
Data Partitioning
Breaking up of data into separate physical
units that can be handled independently
Types of data partitioning
◦ Horizontal partitioning.
◦ Vertical partitioning
59
59
Indexing
Indexing is used for the following reasons:
◦ It is a huge cost saving, greatly
improving performance and
scalability.
◦ It can replace a full table scan by
a quick read of the index followed
by a read of only those disk
blocks that contain the rows
needed.
60
60
Parallelism
Sales table
Customers
table
P1
P2
P3
P1
P2
P3
Parallel Execution Servers
61
61
Using Summary Data
Designing summary tables offers the following benefits:
◦ Provides fast access to precomputed data
◦ Reduces use of I/O, CPU, and memory
62
62