Chapter 16 - Spatial Database Group
Download
Report
Transcript Chapter 16 - Spatial Database Group
Chapter 16
Data Warehouse Technology and
Management
Outline
Basic concepts and characteristics
Business architectures and applications
Data cube concepts and operators
Relational DBMS features
Populating a data warehouse
Comparison of Processing
Environments
Transaction processing
– Uses operational databases
– Short-term decisions: fulfill orders, resolve
complaints, provide staffing
Decision support processing
– Uses integrated and summarized data
– Medium and long-term decisions: capacity
planning, store locations, new lines of
business
Data Warehouse Definition and
Characteristics
A central repository for summarized and
integrated data from operational databases
and external data sources
Key Characteristics
–
–
–
–
Subject-oriented
Integrated
Time-variant
Nonvolatile
Data Comparison
Characteristic
Data Warehouse
Currency
Operational
Database
Current
Detail level
Individual
Orientation
Process
orientation
Few
Individual and
Summary
Subject
orientation
Thousands
Relational
Mostly
denormalized
Nonvolatile
(refreshed)
Multidimensional
Number of records
processed
Normalization level Mostly
normalized
Update level
Volatile
Data model
Historical
Business Architectures and
Applications
Data warehouse projects
Top-down architectures
Bottom-up architecture
Applications and data mining
Data Warehouse Projects
Large efforts with much coordination
across departments
Enterprise data model
– Important artifact of data warehouse project
– Structure of data model
– Meta data for data transformation
Top-down vs. bottom-up business
architectures
Two Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Operational
database
Summarized
data
External
data source
EDM
Data warehouse
User departments
Three Tier Architecture
Data warehouse
server
Operational
database
Transformation
process
Data mart tier
Extraction
process
Data mart
Operational
database
Summarized
data
External
data source
EDM
Data warehouse
Data mart
User
departments
Bottom-up Architecture
Data mart tier
Operational
database
Transformation
process
Data mart
Operational
database
External
data source
Data mart
User
departments
Applications
Industry
Airline
Key Applications
Yield management,
route assessment
Telecommunications Customer retention,
network design
Insurance
Risk assessment, product
design, fraud detection
Retail
Target marketing,
supply-chain
management
Data Mining
Discover significant, implicit patterns
– Target promotions
– Change mix and collocation of items
Requires large volumes of transaction data
Important application for data warehouses
Data Cube Concepts and
Operators
Basics
Dimension and measure details
Operators
Data Cube Basics
Multidimensional arrangement of data
Users think about decision support data as
data cubes
Terminology
– Dimension: subject label for a row or column
– Member: value of dimension
– Measure: quantitative data stored in cells
Data Cube Example
110
60
25
Utah
40
90
50
30
Arizona
70
55
60
35
Washington
75
85
45
45
65
45
85
60
Location
80
California
Colorado
Soda
Diet Orange
soda soda
Product
Lime
soda
12/31/2003
…..
1/2/2003
1/1/2003
Time
Dimension and Measure
Details
Dimensions
– Hierarchies: members can have sub members
– Sparsity: many cells do not have data
Measures
– Derived measures
– Multiple measures in cells
Time Series Data
Common data type in trend analysis
Reduce dimensionality using time series
Time series properties
–
–
–
–
–
Data type
Start date
Calendar
Periodicity
Conversion
Slice Operator
Focus on a subset of dimensions
Set dimension to specific value: 1/1/2003
Location
California
Utah
Arizona
Washington
Colorado
Soda
80
40
70
75
65
Product
Diet soda
Lime soda
110
60
90
50
55
60
85
45
45
85
Orange soda
25
30
35
45
60
Dice Operator
Focus on a subset of member values
Replace dimension with a subset of values
Dice operation often follows a slice
operation
LOCATION
Utah
40
Soda
90
50
30
Diet Orange Lime
soda soda
soda
PRODUCT
Other Operators
Operators for hierarchical dimensions
– Drill-down: add detail to a dimension
– Roll-up: remove detail from a dimension
– Recalculate measure values
Pivot: rearrange dimensions
Operator Summary
Operator
Purpose
Description
Slice
Focus attention on a
subset of dimensions
Replace a dimension with a single
member value or with a summary of
its measure values
Dice
Focus attention on a
subset of member values
Replace a dimension with a subset of
members
Drill-down Obtain more detail about
a dimension
Navigate from a more general level
to a more specific level
Roll-up
Summarize details about
a dimension
Navigate from a more specific level
to a more general level
Pivot
Present data in a different Rearrange the dimensions in a data
order
cube
Relational DBMS Support
Data modeling
Dimension representation
GROUP BY extensions
Materialized views and query rewriting
Storage structures and optimization
Relational Data Modeling
Dimension table: contains member values
Fact table: contains measure values
1-M relationships from dimension to fact
tables
Grain: most detailed measure values stored
Star Schema Example
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
Constellation Schema Example
Supplier
SuppId
SuppName
SuppCity
SuppState
SuppZip
SuppNation
Inventory
InvNo
InvQOH
InvCost
InvReturns
SuppInv
StoreInv
ItemInv
Store
Item
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
Sales
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
DivId
DivName
DivManager
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
TimeInv
Snowflake Schema Example
Item
Store
ItemId
ItemName
ItemUnitPrice
ItemBrand
ItemCategory
StoreSales
ItemSales Sales
Customer
CustId
CustName
CustPhone
CustStreet
CustCity
CustState
CustZip
CustNation
SalesNo
SalesUnits
SalesDollar
SalesCost
Time
TimeSales
CustSales
StoreId
StoreManager
StoreStreet
StoreCity
StoreState
StoreZip
StoreNation
TimeNo
TimeDay
TimeMonth
TimeQuarter
TimeYear
TimeDayOfWeek
TimeFiscalYear
Division
DivStore
DivId
DivName
DivManager
Handling M-N Relationships
Source data may have M-N relationships,
not 1-M relationships
Adjust fact or dimension tables for a fixed
number of exceptions
More complex solutions to support M-N
relationships with a variable number of
connections
Dimension Representation
Star schema and variations lack dimension
representation
Explicit dimension representation
important to data cube operations and
optimization
Proprietary extensions for dimension
representation
Represent levels, hierarchies, and
constraints
Oracle Dimension Representation
Levels: dimension components
Hierarchies: may have multiple hierarchies
Constraints: functional dependency
relationships
CREATE DIMENSION Example
CREATE DIMENSION StoreDim
LEVEL StoreId IS Store.StoreId
LEVEL City
IS Store.StoreCity
LEVEL State
IS Store.StoreState
LEVEL Zip
IS Store.StoreZip
LEVEL Nation IS Store.StoreNation
LEVEL DivId
IS Division.DivId
HIERARCHY CityRollup (
StoreId CHILD OF
City CHILD OF
State CHILD OF
Nation )
HIERARCHY ZipRollup (
StoreId CHILD OF
Zip CHILD OF
State CHILD OF
Nation )
HIERARCHY DivisionRollup (
StoreId CHILD OF
DivId
JOIN KEY Store.DivId REFERENCES DivId )
ATTRIBUTE DivId DETERMINES Division.DivName
ATTRIBUTE DivId DETERMINES Division.DivManager ;
GROUP BY Extensions
ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
– Ranking
– Ratios
– Moving summary values
CUBE Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2002
GROUP BY CUBE (StoreZip, TimeMonth)
ROLLUP Example
SELECT TimeMonth, TimeYear,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear BETWEEN 2002 AND 2003
GROUP BY ROLLUP (TimeMonth,TimeYear);
GROUPING SETS Example
SELECT StoreZip, TimeMonth,
SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND (StoreNation = 'USA'
OR StoreNation = 'Canada')
AND TimeYear = 2002
GROUP BY GROUPING SETS((StoreZip, TimeMonth),
StoreZip, TimeMonth, ());
Variations of the Grouping
Operators
Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a
GROUPIING SETS operation
Materialized Views
Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query dominant
environments
Materialized View Example
CREATE MATERIALIZED VIEW MV1
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE AS
SELECT StoreState, TimeYear,
SUM(SalesDollar) AS SUMDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND TimeYear > 2000
GROUP BY StoreState, TimeYear;
Query Rewriting
Substitution process
Materialized view replaces references to
fact and dimension tables in a query
Query optimizer must evaluate whether the
substitution will improve performance
over the original query
More complex than query modification
process for traditional views
Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
AND Sales.TimeNo = Time.TimeNo
AND StoreNation IN ('USA','Canada')
AND TimeYear = 2002
GROUP BY StoreState, TimeYear;
-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState,
TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
AND TimeYear = 2002
AND StoreNation IN ('USA','Canada');
Storage and Optimization
Technologies
MOLAP: direct storage and manipulation
of data cubes
ROLAP: relational extensions to support
multidimensional data
HOLAP: combine MOLAP and ROLAP
storage engines
ROLAP Techniques
Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution
Populating a Data Warehouse
Data sources
Workflow representation
Optimizing the refresh process
Data Sources
Cooperative
Logged
Queryable
Snapshot
Maintenance Workflow
Notification
Update
Phase
Propagation
Auditing
Integration
Phase
Merging
Auditing
Preparation
Phase
Cleaning
Transportation
Extraction
Data Quality Problems
Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times
ETL Tools
Extraction, Transformation, and Loading
Specification based
Eliminate custom coding
Third party and DBMS based tools
Refresh Optimization
Unknown
Processes
External
Data Sources
Primarily
Dimension
Changes
Load Time Lag
Valid Time Lag
ETL
Tools
Internal
Data Sources
Accounting
Fact and
Dimension
Changes
Data
Warehouse
Determining the Refresh
Frequency
Maximize net refresh benefit
Value of data timeliness
Cost of refresh
Satisfy data warehouse and source system
constraints
Determining the Level of Historical
Integrity
Primarily an issue for dimension updates
Type I: overwrite old values
Type II: version numbers for an unlimited
history
Type III: new columns for a limited history
Summary
Data warehouse requirements differ from
transaction processing.
Architecture choice is important.
Multidimensional data model is intuitive
Relational representation and storage
techniques are significant.
Maintaining a data warehouse is an
important, operational problem.