Chapter 16 of Database Design, Application Development and
Download
Report
Transcript Chapter 16 of Database Design, Application Development and
Chapter 16
Data Warehouse Technology and
Management
McGraw-Hill/Irwin
Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Outline
Basic concepts and characteristics
Business architectures and applications
Data cube concepts and operators
Relational DBMS features
Maintaining a data warehouse
16-2
Comparison of Environments
Transaction processing
Uses operational databases
Short-term decisions: fulfill orders, resolve
complaints, provide staffing
Decision support processing
Uses integrated and summarized data
Medium and long-term decisions: capacity
planning, store locations, new lines of
business
16-3
Definition and Characteristics
A central repository for summarized and
integrated data from operational
databases and external data sources
Key Characteristics
Subject-oriented
Integrated
Time-variant
Nonvolatile
16-4
Data Comparison
Characteristic
Data Warehouse
Currency
Operational
Database
Current
Detail level
Individual
Orientation
Process
orientation
Few
Individual and
Summary
Subject orientation
Number of records
processed
Normalization level Mostly
normalized
Update level
Volatile
Data model
Relational
Historical
Thousands
Mostly
denormalized
Nonvolatile
(refreshed)
Multidimensional
and relational with
star schemas
16-5
Architectures and Applications
Data warehouse projects
Top-down architectures
Bottom-up architecture
Applications and data mining
16-6
Data Warehouse Projects
Large efforts with much coordination
across departments
Enterprise data model
Important artifact of data warehouse project
Structure of data model
Meta data for data transformation
Top-down vs. bottom-up business
architectures
16-7
Two Tier Architecture
Data warehouse
server
Operational
database
User departments
Transformation
process
Operational
database
Detailed and
summarized data
External
data source
EDM
Data warehouse
16-8
Three Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Data mart tier
User
departments
Extraction
process
Data mart
Operational
database
Detailed and
summarized data
External
data source
EDM
Data warehouse
Data mart
16-9
Bottom-up Architecture
Data mart tier
User
departments
Operational
database
Transformation
process
Data mart
Operational
database
External
data source
Data mart
16-10
Applications
Industry
Airline
Key Applications
Yield management,
route assessment
Telecommunications Customer retention,
network design
Insurance
Risk assessment, product
design, fraud detection
Retail
Target marketing,
supply-chain
management
16-11
Maturity Model
Guidance for investment decisions
Stages provide a framework to view an
organization’s progress
Insights: difficulty moving between stages
Infant to child stages because of investment
level
Teenager to adult because of strategic
importance of data warehouse
16-12
Data Mining
Discover significant, implicit patterns
Target promotions
Change mix and collocation of items
Requires large volumes of transaction
data
Important application for data warehouses
16-13
Data Cube Concepts and
Operators
Basics
Dimension and measure details
Operators
16-14
Data Cube Basics
Multidimensional arrangement of data
Users think about decision support data as
data cubes
Terminology
Dimension: subject label for a row or column
Member: value of dimension
Measure: quantitative data stored in cells
16-15
Data Cube Example
110
60
25
Utah
40
90
50
30
Arizona
70
55
60
35
Washington
75
85
45
45
65
45
85
60
Location
80
California
Colorado
Mono
Laser
Ink Photo
Jet
Product
Portable
12/31/2006
…..
1/2/2006
1/1/2006
Time
16-16
Dimensions and Measures
Dimensions
Hierarchies: members can have sub members
Sparsity: many cells do not have data
Measures
Derived measures
Multiple measures in cells
16-17
Time Series Data
Common data type in trend analysis
Reduce dimensionality using time series
Time series properties
Data type
Start date
Calendar
Periodicity
Conversion
16-18
Slice Operator
Focus on a subset of dimensions
Set dimension to specific value: 1/1/2006
Location
California
Utah
Arizona
Washington
Colorado
Mono Laser
80
40
70
75
65
Product
Ink Jet
Photo
110
60
90
50
55
60
85
45
45
85
Portable
25
30
35
45
60
16-19
Dice Operator
Location
Focus on a subset of member values
Replace dimension with a subset of values
Dice operation often follows a slice
operation
Utah
40
90
50
30
Mono Ink Photo Portable
Laser Jet
Product
16-20
Other Operators
Operators for hierarchical dimensions
Drill-down: add detail to a dimension
Roll-up: remove detail from a dimension
Recalculate measure values
Pivot: rearrange dimensions
16-21
Operator Summary
Operator
Purpose
Description
Slice
Focus attention on a
subset of dimensions
Replace a dimension with a single
member value or with a summary
of its measure values
Dice
Focus attention on a
subset of member
values
Replace a dimension with a subset
of members
Drill-down
Obtain more detail about Navigate from a more general level
a dimension
to a more specific level
Roll-up
Summarize details
about a dimension
Navigate from a more specific level
to a more general level
Pivot
Present data in a
different order
Rearrange the dimensions in a
data cube
16-22
Relational DBMS Support
Data modeling
Dimension representation
GROUP BY extensions
Materialized views and query rewriting
Storage structures and optimization
16-23
Relational Data Modeling
Dimension table: contains member values
Fact table: contains measure values
1-M relationships from dimension to fact
tables
Grain: most detailed measure values
stored
16-24
Star Schema Example
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
TimeDim
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-25
Constellation Schema
Supplier
SuppId
SuppName
SuppCity
SuppState
SuppZip
SuppNation
Inventory
InvNo
InvQOH
InvCost
InvReturns
SuppInv
StoreInv
ItemInv
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
TimeInv
TimeDim
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-26
Snowflake Schema Example
Item
Store
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
CustSales
TimeSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
Division
DivStore
DivId
DivName
DivManager
TimeDim
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-27
Handling M-N Relationships
Source data may have M-N relationships,
not 1-M relationships
Adjust fact or dimension tables for a fixed
number of exceptions
More complex solutions to support M-N
relationships with a variable number of
connections
16-28
Time Representation
Timestamp
Time dimension table for organization
specific calendar features
Two fact tables for international operations
Accumulating fact table for representation
of multiple events
16-29
Level of Historical Integrity
Primarily an issue for dimension updates
Type I: overwrite old values
Type II: version numbers for an unlimited
history
Type III: new columns for a limited history
16-30
Historical Integrity Example
Type II Representation
Type III Representation
Customer
Customer
CustId
VersionNo
CustName
CustPhone
CustStreet
CustCity
CustCityBegEffDate
CustCityEndEffDate
CustState
CustZip
CustNation
CustId
CustName
CustPhone
CustStreet
CustCityCurr
CustCityCurrBegEffDate
CustCityCurrEndEffDate
CustCityPrev
CustCityPrevBegEffDate
CustCityPrevEndEffDate
CustCityPast
CustCityPastBegEffDate
CustCityPastEndEffDate
CustState
CustZip
CustNation
16-31
Dimension Representation
Star schema and variations lack
dimension representation
Explicit dimension representation
important to data cube operations and
optimization
Proprietary extensions for dimension
representation
Represent levels, hierarchies, and
constraints
16-32
Oracle Dimension Representation
Levels: dimension components
Hierarchies: may have multiple hierarchies
Constraints: functional dependency
relationships
16-33
CREATE DIMENSION Example
CREATE DIMENSION StoreDim
LEVEL StoreId IS Store.StoreId
LEVEL City
IS Store.StoreCity
LEVEL State
IS Store.StoreState
LEVEL Zip
IS Store.StoreZip
LEVEL Nation IS Store.StoreNation
LEVEL DivId
IS Division.DivId
HIERARCHY CityRollup (
StoreId CHILD OF
City CHILD OF
State CHILD OF
Nation )
HIERARCHY ZipRollup (
StoreId CHILD OF
Zip CHILD OF
State CHILD OF
Nation )
HIERARCHY DivisionRollup (
StoreId CHILD OF
DivId
JOIN KEY Store.DivId REFERENCES DivId )
ATTRIBUTE DivId DETERMINES Division.DivName
ATTRIBUTE DivId DETERMINES Division.DivManager ;
16-34
GROUP BY Extensions
ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
Ranking
Ratios
Moving summary values
16-35
CUBE Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2005
GROUP BY CUBE (StoreZip, TimeMonth)
16-36
ROLLUP Example
SELECT TimeMonth, TimeYear,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear BETWEEN 2005 AND 2006
GROUP BY ROLLUP (TimeMonth,TimeYear);
16-37
GROUPING SETS Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2005
GROUP BY GROUPING SETS((StoreZip, TimeMonth),
StoreZip, TimeMonth, ());
16-38
Variations of the Grouping
Operators
Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPIING
SETS operation
16-39
Materialized Views
Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query dominant
environments
16-40
Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear,
SUM(SalesDollar) AS SUMDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear > 2003
GROUP BY StoreState, TimeYear;
16-41
Query Rewriting
Substitution process
Materialized view replaces references to
fact and dimension tables in a query
Query optimizer must evaluate whether
the substitution will improve performance
over the original query
More complex than query modification
process for traditional views
16-42
Query Rewriting Process
QueryFD
Results
QueryMV
Rewrite
SQL Engine
QueryFD: query that references fact and dimension tables
QueryMV: rewrite of QueryFD such that materialized views are
substituted for fact and dimension tables whenever justified by
expected performance improvements.
16-43
Query Rewriting Matching
Row conditions: query conditions at least as
restrictive as MV conditions
Grouping detail: query grouping columns at least
as general as MV grouping columns
Grouping dependencies: query columns must
match or be derivable by functional
dependencies
Aggregate functions: query aggregate functions
must match or be derivable from MV aggregate
functions
16-44
Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2005
GROUP BY StoreState, TimeYear;
-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState,
TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2005
AND StoreNation IN ('USA','Canada');
16-45
Storage and Optimization
Technologies
MOLAP: direct storage and manipulation
of data cubes
ROLAP: relational extensions to support
multidimensional data
HOLAP: combine MOLAP and ROLAP
storage engines
16-46
ROLAP Techniques
Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution
16-47
Maintaining a Data Warehouse
Data sources
Workflow representation
Optimizing the refresh process
16-48
Data Sources
Cooperative:
Notification using triggers
Requires source system changes
Logged
Readily available
Extraneous data in logs
Queryable
Queries using timestamps
Requires timestamps in source data
Snapshot
Periodic dumps of source data
Significant processing for difference operations
16-49
Maintenance Workflow
Notification
Update
Phase
Propagation
Auditing
Integration
Phase
Merging
Auditing
Preparation
Phase
Cleaning
Transportation
Extraction
16-50
Data Quality Problems
Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times
16-51
ETL Tools
Extraction, Transformation, and Loading
Specification based
Eliminate custom coding
Third party and DBMS based tools
16-52
Refresh Processing
Unknown
Processes
External
Data Sources
Primarily
Dimension
Changes
Load Time Lag
Valid Time Lag
ETL
Tools
Data
Warehouse
Internal
Data Sources
Accounting
Fact and
Dimension
Changes
16-53
Determining the Refresh
Frequency
Maximize net refresh benefit
Value of data timeliness
Cost of refresh
Satisfy data warehouse and source
system constraints
16-54
Refresh Constraints
Source access: restrictions on time and
frequency
Integration: restrictions that require
concurrent reconciliation
Completeness/consistency: loading in the
same refresh period
Availability: load scheduling restrictions
due to storage capacity, online availability,
and server usage
16-55
Summary
Data warehouse requirements differ from
transaction processing.
Architecture choice is important.
Multidimensional data model is intuitive
Relational representation and storage
techniques are significant.
Maintaining a data warehouse is an
important, operational problem.
16-56