Chapter 16 of Database Design, Application Development and

Download Report

Transcript Chapter 16 of Database Design, Application Development and

Chapter 16
Data Warehouse Technology and
Management
McGraw-Hill/Irwin
Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Outline





Basic concepts and characteristics
Business architectures and applications
Data cube concepts and operators
Relational DBMS features
Maintaining a data warehouse
16-2
Comparison of Environments
 Transaction processing
 Uses operational databases
 Short-term decisions: fulfill orders, resolve
complaints, provide staffing
 Decision support processing
 Uses integrated and summarized data
 Medium and long-term decisions: capacity
planning, store locations, new lines of
business
16-3
Definition and Characteristics
 A central repository for summarized and
integrated data from operational
databases and external data sources
 Key Characteristics




Subject-oriented
Integrated
Time-variant
Nonvolatile
16-4
Data Comparison
Characteristic
Data Warehouse
Currency
Operational
Database
Current
Detail level
Individual
Orientation
Process
orientation
Few
Individual and
Summary
Subject orientation
Number of records
processed
Normalization level Mostly
normalized
Update level
Volatile
Data model
Relational
Historical
Thousands
Mostly
denormalized
Nonvolatile
(refreshed)
Multidimensional
and relational with
star schemas
16-5
Architectures and Applications




Data warehouse projects
Top-down architectures
Bottom-up architecture
Applications and data mining
16-6
Data Warehouse Projects
 Large efforts with much coordination
across departments
 Enterprise data model
 Important artifact of data warehouse project
 Structure of data model
 Meta data for data transformation
 Top-down vs. bottom-up business
architectures
16-7
Two Tier Architecture
Data warehouse
server
Operational
database
User departments
Transformation
process
Operational
database
Detailed and
summarized data
External
data source
EDM
Data warehouse
16-8
Three Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Data mart tier
User
departments
Extraction
process
Data mart
Operational
database
Detailed and
summarized data
External
data source
EDM
Data warehouse
Data mart
16-9
Bottom-up Architecture
Data mart tier
User
departments
Operational
database
Transformation
process
Data mart
Operational
database
External
data source
Data mart
16-10
Applications
Industry
Airline
Key Applications
Yield management,
route assessment
Telecommunications Customer retention,
network design
Insurance
Risk assessment, product
design, fraud detection
Retail
Target marketing,
supply-chain
management
16-11
Maturity Model
 Guidance for investment decisions
 Stages provide a framework to view an
organization’s progress
 Insights: difficulty moving between stages
 Infant to child stages because of investment
level
 Teenager to adult because of strategic
importance of data warehouse
16-12
Data Mining
 Discover significant, implicit patterns
 Target promotions
 Change mix and collocation of items
 Requires large volumes of transaction
data
 Important application for data warehouses
16-13
Data Cube Concepts and
Operators
 Basics
 Dimension and measure details
 Operators
16-14
Data Cube Basics
 Multidimensional arrangement of data
 Users think about decision support data as
data cubes
 Terminology
 Dimension: subject label for a row or column
 Member: value of dimension
 Measure: quantitative data stored in cells
16-15
Data Cube Example
110
60
25
Utah
40
90
50
30
Arizona
70
55
60
35
Washington
75
85
45
45
65
45
85
60
Location
80
California
Colorado
Mono
Laser
Ink Photo
Jet
Product
Portable
12/31/2006
…..
1/2/2006
1/1/2006
Time
16-16
Dimensions and Measures
 Dimensions
 Hierarchies: members can have sub members
 Sparsity: many cells do not have data
 Measures
 Derived measures
 Multiple measures in cells
16-17
Time Series Data
 Common data type in trend analysis
 Reduce dimensionality using time series
 Time series properties





Data type
Start date
Calendar
Periodicity
Conversion
16-18
Slice Operator
 Focus on a subset of dimensions
 Set dimension to specific value: 1/1/2006
Location
California
Utah
Arizona
Washington
Colorado
Mono Laser
80
40
70
75
65
Product
Ink Jet
Photo
110
60
90
50
55
60
85
45
45
85
Portable
25
30
35
45
60
16-19
Dice Operator
Location
 Focus on a subset of member values
 Replace dimension with a subset of values
 Dice operation often follows a slice
operation
Utah
40
90
50
30
Mono Ink Photo Portable
Laser Jet
Product
16-20
Other Operators
 Operators for hierarchical dimensions
 Drill-down: add detail to a dimension
 Roll-up: remove detail from a dimension
 Recalculate measure values
 Pivot: rearrange dimensions
16-21
Operator Summary
Operator
Purpose
Description
Slice
Focus attention on a
subset of dimensions
Replace a dimension with a single
member value or with a summary
of its measure values
Dice
Focus attention on a
subset of member
values
Replace a dimension with a subset
of members
Drill-down
Obtain more detail about Navigate from a more general level
a dimension
to a more specific level
Roll-up
Summarize details
about a dimension
Navigate from a more specific level
to a more general level
Pivot
Present data in a
different order
Rearrange the dimensions in a
data cube
16-22
Relational DBMS Support





Data modeling
Dimension representation
GROUP BY extensions
Materialized views and query rewriting
Storage structures and optimization
16-23
Relational Data Modeling
 Dimension table: contains member values
 Fact table: contains measure values
 1-M relationships from dimension to fact
tables
 Grain: most detailed measure values
stored
16-24
Star Schema Example
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
TimeDim
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-25
Constellation Schema
Supplier
SuppId
SuppName
SuppCity
SuppState
SuppZip
SuppNation
Inventory
InvNo
InvQOH
InvCost
InvReturns
SuppInv
StoreInv
ItemInv
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
TimeInv
TimeDim
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-26
Snowflake Schema Example
Item
Store
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
CustSales
TimeSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
Division
DivStore
DivId
DivName
DivManager
TimeDim
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
16-27
Handling M-N Relationships
 Source data may have M-N relationships,
not 1-M relationships
 Adjust fact or dimension tables for a fixed
number of exceptions
 More complex solutions to support M-N
relationships with a variable number of
connections
16-28
Time Representation
 Timestamp
 Time dimension table for organization
specific calendar features
 Two fact tables for international operations
 Accumulating fact table for representation
of multiple events
16-29
Level of Historical Integrity
 Primarily an issue for dimension updates
 Type I: overwrite old values
 Type II: version numbers for an unlimited
history
 Type III: new columns for a limited history
16-30
Historical Integrity Example
Type II Representation
Type III Representation
Customer
Customer
CustId
VersionNo
CustName
CustPhone
CustStreet
CustCity
CustCityBegEffDate
CustCityEndEffDate
CustState
CustZip
CustNation
CustId
CustName
CustPhone
CustStreet
CustCityCurr
CustCityCurrBegEffDate
CustCityCurrEndEffDate
CustCityPrev
CustCityPrevBegEffDate
CustCityPrevEndEffDate
CustCityPast
CustCityPastBegEffDate
CustCityPastEndEffDate
CustState
CustZip
CustNation
16-31
Dimension Representation
 Star schema and variations lack
dimension representation
 Explicit dimension representation
important to data cube operations and
optimization
 Proprietary extensions for dimension
representation
 Represent levels, hierarchies, and
constraints
16-32
Oracle Dimension Representation
 Levels: dimension components
 Hierarchies: may have multiple hierarchies
 Constraints: functional dependency
relationships
16-33
CREATE DIMENSION Example
CREATE DIMENSION StoreDim
LEVEL StoreId IS Store.StoreId
LEVEL City
IS Store.StoreCity
LEVEL State
IS Store.StoreState
LEVEL Zip
IS Store.StoreZip
LEVEL Nation IS Store.StoreNation
LEVEL DivId
IS Division.DivId
HIERARCHY CityRollup (
StoreId CHILD OF
City CHILD OF
State CHILD OF
Nation )
HIERARCHY ZipRollup (
StoreId CHILD OF
Zip CHILD OF
State CHILD OF
Nation )
HIERARCHY DivisionRollup (
StoreId CHILD OF
DivId
JOIN KEY Store.DivId REFERENCES DivId )
ATTRIBUTE DivId DETERMINES Division.DivName
ATTRIBUTE DivId DETERMINES Division.DivManager ;
16-34
GROUP BY Extensions




ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
 Ranking
 Ratios
 Moving summary values
16-35
CUBE Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2005
GROUP BY CUBE (StoreZip, TimeMonth)
16-36
ROLLUP Example
SELECT TimeMonth, TimeYear,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear BETWEEN 2005 AND 2006
GROUP BY ROLLUP (TimeMonth,TimeYear);
16-37
GROUPING SETS Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2005
GROUP BY GROUPING SETS((StoreZip, TimeMonth),
StoreZip, TimeMonth, ());
16-38
Variations of the Grouping
Operators




Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPIING
SETS operation
16-39
Materialized Views





Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query dominant
environments
16-40
Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear,
SUM(SalesDollar) AS SUMDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear > 2003
GROUP BY StoreState, TimeYear;
16-41
Query Rewriting
 Substitution process
 Materialized view replaces references to
fact and dimension tables in a query
 Query optimizer must evaluate whether
the substitution will improve performance
over the original query
 More complex than query modification
process for traditional views
16-42
Query Rewriting Process
QueryFD
Results
QueryMV
Rewrite
SQL Engine
QueryFD: query that references fact and dimension tables
QueryMV: rewrite of QueryFD such that materialized views are
substituted for fact and dimension tables whenever justified by
expected performance improvements.
16-43
Query Rewriting Matching
 Row conditions: query conditions at least as
restrictive as MV conditions
 Grouping detail: query grouping columns at least
as general as MV grouping columns
 Grouping dependencies: query columns must
match or be derivable by functional
dependencies
 Aggregate functions: query aggregate functions
must match or be derivable from MV aggregate
functions
16-44
Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2005
GROUP BY StoreState, TimeYear;
-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState,
TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2005
AND StoreNation IN ('USA','Canada');
16-45
Storage and Optimization
Technologies
 MOLAP: direct storage and manipulation
of data cubes
 ROLAP: relational extensions to support
multidimensional data
 HOLAP: combine MOLAP and ROLAP
storage engines
16-46
ROLAP Techniques





Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution
16-47
Maintaining a Data Warehouse
 Data sources
 Workflow representation
 Optimizing the refresh process
16-48
Data Sources
 Cooperative:
 Notification using triggers
 Requires source system changes
 Logged
 Readily available
 Extraneous data in logs
 Queryable
 Queries using timestamps
 Requires timestamps in source data
 Snapshot
 Periodic dumps of source data
 Significant processing for difference operations
16-49
Maintenance Workflow
Notification
Update
Phase
Propagation
Auditing
Integration
Phase
Merging
Auditing
Preparation
Phase
Cleaning
Transportation
Extraction
16-50
Data Quality Problems








Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times
16-51
ETL Tools




Extraction, Transformation, and Loading
Specification based
Eliminate custom coding
Third party and DBMS based tools
16-52
Refresh Processing
Unknown
Processes
External
Data Sources
Primarily
Dimension
Changes
Load Time Lag
Valid Time Lag
ETL
Tools
Data
Warehouse
Internal
Data Sources
Accounting
Fact and
Dimension
Changes
16-53
Determining the Refresh
Frequency




Maximize net refresh benefit
Value of data timeliness
Cost of refresh
Satisfy data warehouse and source
system constraints
16-54
Refresh Constraints
 Source access: restrictions on time and
frequency
 Integration: restrictions that require
concurrent reconciliation
 Completeness/consistency: loading in the
same refresh period
 Availability: load scheduling restrictions
due to storage capacity, online availability,
and server usage
16-55
Summary
 Data warehouse requirements differ from
transaction processing.
 Architecture choice is important.
 Multidimensional data model is intuitive
 Relational representation and storage
techniques are significant.
 Maintaining a data warehouse is an
important, operational problem.
16-56