Transcript Lecture 6

25. Data Warehouses and Decision
Support
Len Shapiro, for CS386, 11/2-3/05.
Some slides taken from Ramakrishnan and
Gherke, with permission.
4/11/2016
PSU’s CS 587-3
1
25. Whs.
Overview
Increasingly, organizations are analyzing
current and historical data to identify useful
patterns and support business strategies.
 Emphasis is on complex, interactive,
exploratory analysis of very large datasets
created by integrating data from across all
parts of an enterprise; data is fairly static.

 Contrast such On-Line Analytic Processing
(OLAP) with traditional On-line Transaction
Processing (OLTP): mostly long queries, instead
of short update Xacts.
4/11/2016
PSU’s CS 587-3
2
25. Whs.
Enterprise Applications
OLTP / Operational / Production
DSS / Warehouse / DataMart
Operate the business / Clerks
Diagnose the business / Managers
Short queries, small amts of data
opposite
Queries change data
opposite
Customer inquiry, Order Entry, etc.
Statistics, Visualization, Data
Mining, etc.
Legacy Applications, Heterogeneous Opposite
databases
Often Distributed
Often Centralized (Warehouse)
Current data
Current and Historical data
4/11/2016
PSU’s CS 587-3
3
Data Warehouse
Requirements
Operational
Warehouse
General E-R Diagrams
Locks necessary
Multidimensional data model
common
No Locks necessary
Crash recovery required
Crash recovery optional
Smaller volume of data
Huge volume of data
Need indexes designed to access
small amounts of data
Need indexes designed to access
large volumes of data
4/11/2016
PSU’s CS 587-3
4
EXTERNAL DATA
SOURCES
25. Whs.
Data Warehousing
Integrated data spanning
EXTRACT
TRANSFORM
long time periods, often
LOAD
REFRESH
augmented with summary
information.
 Several gigabytes to
DATA
Metadata
WAREHOUSE
terabytes common.
Repository
 Interactive response
SUPPORTS
times expected for
complex queries; ad-hoc
updates uncommon. DATA

4/11/2016
MINING
PSU’s CS 587-3
OLAP
5
11 1 1 25
Collection of numeric measures,
11 2
which depend on a set of dimensions. 11 3
 E.g., measure Sales, dimensions
12 1
Product (key: pid), Location (locid),
12 2
and Time (timeid).
12 3
8 10 10
Slice locid=1
13 1
30 20 50
is shown:
13 2
25 8 15
13 3
locid
1
2
3
11 1
timeid
pid
11 12 13

4/11/2016
PSU’s CS 587-3
locid
sales
timeid
pid
25.2 Multidimensional
Data Model
1 8
1 15
1 30
1 20
1 50
1 8
1 10
1 10
2 35
6
Examples of
Dimensional Data

Products(ProductID, StoreID, DateID, Sale)
 Product(ID, SKU, size, brand)
 Store(ID, Address, Sales District, Region, Manager)
 Date (Date, Week, Month, Holiday, Promotion)

Claims(ProvID, MembID, Procedure, DateID, Cost)
 Providers(ID, Practice, Address, ZIP, City, State)
 Members(ID, Contract, Name, Address)
 Procedure (ID, Name, Type)

Telecomm (CustID, SalesRepID, ServiceID,
DateID)
 SalesRep(ID, Address, Sales District, Region, Manager)
 Service(ID, Name, Category)
4/11/2016
PSU’s CS 587-3
7
25. Whs.
MOLAP vs ROLAP
Multidimensional data can be stored physically
in a (disk-resident, persistent) array; called
MOLAP systems. Alternatively, can store as a
relation; called ROLAP systems.
 The main relation, which relates dimensions to
a measure, is called the fact table. Each
dimension can have additional attributes and
an associated dimension table.

 E.g., Products(pid, pname, category, price)
 Fact tables are much larger than dimensional tables.
4/11/2016
PSU’s CS 587-3
8
25. Whs.
Dimension Hierarchies

For each dimension, some of the attributes
may be organized in a hierarchy:
PRODUCT
TIME
LOCATION
year
category
quarter
state
pname
week
city
PID
date
4/11/2016
PSU’s CS 587-3
ZIP
9
25. Whs.
25.3 OLAP Queries
Influenced by SQL and by spreadsheets.
 A common operation is to aggregate a
measure over one or more dimensions.

 Find total sales.
 Find total sales for each city, or for each state.
 Find top five products ranked by total sales.

Roll-up: Aggregating at different levels of a
dimension hierarchy.
 E.g., Given total sales by city, we can roll-up to get
sales by state.
4/11/2016
PSU’s CS 587-3
10
25. Whs.
OLAP Queries

Drill-down: The inverse of roll-up.
 E.g., Given total sales by state, can drill-down to get
total sales by city.
 E.g., Can also drill-down on different dimension to
get total sales by product for each state.

Pivoting: Aggregation on selected dimensions.
WI CA Total
 E.g., Pivoting on Location and Time
yields this cross-tabulation:
2002 63 81 144
 Slicing
and Dicing: Equality
and range selections on one
or more dimensions.
4/11/2016
PSU’s CS 587-3
2003 38 107 145
2004 75
35 110
Total 176 223 339
11
25. Whs.
Comparison with SQL Queries

The cross-tabulation obtained by pivoting can also
be computed using a collection of SQLqueries:
SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.timeid=L.timeid
GROUP BY T.year, L.state
SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year
4/11/2016
SELECT SUM(S.sales)
FROM Sales S, Location L
WHERE S.timeid=L.timeid
GROUP BY L.state
PSU’s CS 587-3
12
25. Whs.
The CUBE Operator
Generalizing the previous example, if there
are k dimensions, we have 2^k possible SQL
GROUP BY queries that can be generated
through pivoting on a subset of dimensions.
 CUBE pid, locid, timeid BY SUM Sales

 Equivalent to rolling up Sales on all eight subsets
of the set {pid, locid, timeid}; each roll-up
corresponds to an SQL query of the form:
SELECT SUM(S.sales)
Lots of work on optimizing
FROM Sales S
the CUBE operator!
GROUP BY grouping-list
4/11/2016
PSU’s CS 587-3
13
25. Whs.
Summary
Decision support is an emerging, rapidly
growing subarea of databases.
 Involves the creation of large, consolidated
data repositories called data warehouses.
 Warehouses exploited using sophisticated
analysis techniques: complex SQL queries
and OLAP “multidimensional” queries
(influenced by both SQL and spreadsheets).
 New techniques for database design,
indexing, view maintenance, and interactive
querying need to be supported (CS587).

4/11/2016
PSU’s CS 587-3
14