Chapter 16 - Spatial Database Group

Download Report

Transcript Chapter 16 - Spatial Database Group

Chapter 16
Data Warehouse Technology and
Management
Outline
Basic concepts and characteristics
 Business architectures and applications
 Data cube concepts and operators
 Relational DBMS features
 Populating a data warehouse

Comparison of Processing
Environments

Transaction processing
– Uses operational databases
– Short-term decisions: fulfill orders, resolve
complaints, provide staffing

Decision support processing
– Uses integrated and summarized data
– Medium and long-term decisions: capacity
planning, store locations, new lines of
business
Data Warehouse Definition and
Characteristics
A central repository for summarized and
integrated data from operational databases
and external data sources
 Key Characteristics

–
–
–
–
Subject-oriented
Integrated
Time-variant
Nonvolatile
Data Comparison
Characteristic
Data Warehouse
Currency
Operational
Database
Current
Detail level
Individual
Orientation
Process
orientation
Few
Individual and
Summary
Subject
orientation
Thousands
Relational
Mostly
denormalized
Nonvolatile
(refreshed)
Multidimensional
Number of records
processed
Normalization level Mostly
normalized
Update level
Volatile
Data model
Historical
Business Architectures and
Applications
Data warehouse projects
 Top-down architectures
 Bottom-up architecture
 Applications and data mining

Data Warehouse Projects
Large efforts with much coordination
across departments
 Enterprise data model

– Important artifact of data warehouse project
– Structure of data model
– Meta data for data transformation

Top-down vs. bottom-up business
architectures
Two Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Operational
database
Summarized
data
External
data source
EDM
Data warehouse
User departments
Three Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Data mart tier
Extraction
process
Data mart
Operational
database
Summarized
data
External
data source
EDM
Data warehouse
Data mart
User
departments
Bottom-up Architecture
Data mart tier
Operational
database
Transformation
process
Data mart
Operational
database
External
data source
Data mart
User
departments
Applications
Industry
Airline
Key Applications
Yield management,
route assessment
Telecommunications Customer retention,
network design
Insurance
Risk assessment, product
design, fraud detection
Retail
Target marketing,
supply-chain
management
Data Mining

Discover significant, implicit patterns
– Target promotions
– Change mix and collocation of items
Requires large volumes of transaction data
 Important application for data warehouses

Data Cube Concepts and
Operators
Basics
 Dimension and measure details
 Operators

Data Cube Basics
Multidimensional arrangement of data
 Users think about decision support data as
data cubes
 Terminology

– Dimension: subject label for a row or column
– Member: value of dimension
– Measure: quantitative data stored in cells
Data Cube Example
110
60
25
Utah
40
90
50
30
Arizona
70
55
60
35
Washington
75
85
45
45
65
45
85
60
Location
80
California
Colorado
Soda
Diet Orange
soda soda
Product
Lime
soda
12/31/2003
…..
1/2/2003
1/1/2003
Time
Dimension and Measure
Details

Dimensions
– Hierarchies: members can have sub members
– Sparsity: many cells do not have data

Measures
– Derived measures
– Multiple measures in cells
Time Series Data
Common data type in trend analysis
 Reduce dimensionality using time series
 Time series properties

–
–
–
–
–
Data type
Start date
Calendar
Periodicity
Conversion
Slice Operator


Focus on a subset of dimensions
Set dimension to specific value: 1/1/2003
Location
California
Utah
Arizona
Washington
Colorado
Soda
80
40
70
75
65
Product
Diet soda
Lime soda
110
60
90
50
55
60
85
45
45
85
Orange soda
25
30
35
45
60
Dice Operator
Focus on a subset of member values
 Replace dimension with a subset of values
 Dice operation often follows a slice
operation
LOCATION

Utah
40
Soda
90
50
30
Diet Orange Lime
soda soda
soda
PRODUCT
Other Operators

Operators for hierarchical dimensions
– Drill-down: add detail to a dimension
– Roll-up: remove detail from a dimension
– Recalculate measure values

Pivot: rearrange dimensions
Operator Summary
Operator
Purpose
Description
Slice
Focus attention on a
subset of dimensions
Replace a dimension with a single
member value or with a summary of
its measure values
Dice
Focus attention on a
subset of member values
Replace a dimension with a subset of
members
Drill-down Obtain more detail about
a dimension
Navigate from a more general level
to a more specific level
Roll-up
Summarize details about
a dimension
Navigate from a more specific level
to a more general level
Pivot
Present data in a different Rearrange the dimensions in a data
order
cube
Relational DBMS Support
Data modeling
 Dimension representation
 GROUP BY extensions
 Materialized views and query rewriting
 Storage structures and optimization

Relational Data Modeling
Dimension table: contains member values
 Fact table: contains measure values
 1-M relationships from dimension to fact
tables
 Grain: most detailed measure values stored

Star Schema Example
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
Constellation Schema Example
Supplier
SuppId
SuppName
SuppCity
SuppState
SuppZip
SuppNation
Inventory
InvNo
InvQOH
InvCost
InvReturns
SuppInv
StoreInv
ItemInv
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
TimeInv
Snowflake Schema Example
Item
Store
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales Sales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
Division
DivStore
DivId
DivName
DivManager
Handling M-N Relationships
Source data may have M-N relationships,
not 1-M relationships
 Adjust fact or dimension tables for a fixed
number of exceptions
 More complex solutions to support M-N
relationships with a variable number of
connections

Dimension Representation
Star schema and variations lack dimension
representation
 Explicit dimension representation
important to data cube operations and
optimization
 Proprietary extensions for dimension
representation
 Represent levels, hierarchies, and
constraints

Oracle Dimension Representation
Levels: dimension components
 Hierarchies: may have multiple hierarchies
 Constraints: functional dependency
relationships

CREATE DIMENSION Example
CREATE DIMENSION StoreDim
LEVEL StoreId IS Store.StoreId
LEVEL City
IS Store.StoreCity
LEVEL State
IS Store.StoreState
LEVEL Zip
IS Store.StoreZip
LEVEL Nation IS Store.StoreNation
LEVEL DivId
IS Division.DivId
HIERARCHY CityRollup (
StoreId CHILD OF
City CHILD OF
State CHILD OF
Nation )
HIERARCHY ZipRollup (
StoreId CHILD OF
Zip CHILD OF
State CHILD OF
Nation )
HIERARCHY DivisionRollup (
StoreId CHILD OF
DivId
JOIN KEY Store.DivId REFERENCES DivId )
ATTRIBUTE DivId DETERMINES Division.DivName
ATTRIBUTE DivId DETERMINES Division.DivManager ;
GROUP BY Extensions
ROLLUP operator
 CUBE operator
 GROUPING SETS operator
 Other extensions

– Ranking
– Ratios
– Moving summary values
CUBE Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2002
GROUP BY CUBE (StoreZip, TimeMonth)
ROLLUP Example
SELECT TimeMonth, TimeYear,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear BETWEEN 2002 AND 2003
GROUP BY ROLLUP (TimeMonth,TimeYear);
GROUPING SETS Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2002
GROUP BY GROUPING SETS((StoreZip, TimeMonth),
StoreZip, TimeMonth, ());
Variations of the Grouping
Operators
Partial cube
 Partial rollup
 Composite columns
 CUBE and ROLLUP inside a
GROUPIING SETS operation

Materialized Views
Stored view
 Periodically refreshed with source data
 Usually contain summary data
 Fast query response for summary data
 Appropriate in query dominant
environments

Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear,
SUM(SalesDollar) AS SUMDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear > 2000
GROUP BY StoreState, TimeYear;
Query Rewriting
Substitution process
 Materialized view replaces references to
fact and dimension tables in a query
 Query optimizer must evaluate whether the
substitution will improve performance
over the original query
 More complex than query modification
process for traditional views

Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2002
GROUP BY StoreState, TimeYear;
-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState,
TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2002
AND StoreNation IN ('USA','Canada');
Storage and Optimization
Technologies
MOLAP: direct storage and manipulation
of data cubes
 ROLAP: relational extensions to support
multidimensional data
 HOLAP: combine MOLAP and ROLAP
storage engines

ROLAP Techniques
Bitmap join indexes
 Star join optimization
 Query rewriting
 Summary storage advisors
 Parallel query execution

Populating a Data Warehouse
Data sources
 Workflow representation
 Optimizing the refresh process

Data Sources
Cooperative
 Logged
 Queryable
 Snapshot

Maintenance Workflow
Notification
Update
Phase
Propagation
Auditing
Integration
Phase
Merging
Auditing
Preparation
Phase
Cleaning
Transportation
Extraction
Data Quality Problems
Multiple identifiers
 Multiple field names
 Different units
 Missing values
 Orphaned values
 Multipurpose fields
 Conflicting data
 Different update times

ETL Tools
Extraction, Transformation, and Loading
 Specification based
 Eliminate custom coding
 Third party and DBMS based tools

Refresh Optimization
Unknown
Processes
External
Data Sources
Primarily
Dimension
Changes
Load Time Lag
Valid Time Lag
ETL
Tools
Internal
Data Sources
Accounting
Fact and
Dimension
Changes
Data
Warehouse
Determining the Refresh
Frequency
Maximize net refresh benefit
 Value of data timeliness
 Cost of refresh
 Satisfy data warehouse and source system
constraints

Determining the Level of Historical
Integrity
Primarily an issue for dimension updates
 Type I: overwrite old values
 Type II: version numbers for an unlimited
history
 Type III: new columns for a limited history

Summary
Data warehouse requirements differ from
transaction processing.
 Architecture choice is important.
 Multidimensional data model is intuitive
 Relational representation and storage
techniques are significant.
 Maintaining a data warehouse is an
important, operational problem.
