Transcript XML - Rose

On-Line Analytical Processing
Salman Azhar
Warehousing
Data Cubes
Data Mining
These slides use some figures, definitions, and explanations from ElmasriNavathe’s Fundamentals of Database Systems
and Molina-Ullman-Widom’s Database Systems
2/10/05
Salman Azhar: Database Systems
1
Overview

Traditional database systems


Some newer “analytic” applications


tuned to many, small, simple queries
fewer, more time-consuming, complex
queries
New architectures

2/10/05
developed to handle complex “analytic”
queries efficiently
Salman Azhar: Database Systems
2
The Data Warehouse

The most common form of data
integration:



2/10/05
Copy sources into a single DB (warehouse)
and try to keep it up-to-date
Usual method: periodic reconstruction of
the warehouse, perhaps overnight
Warehouse essential for analytic queries
Salman Azhar: Database Systems
3
OLTP

Most database operations involve OnLine Transaction Processing (OTLP).

Short, simple, frequent queries and/or
modifications


Each involving a small number of tuples.
Examples… :



2/10/05
Looking up a phone number on the web
Sales at cash registers
Selling airline tickets
Salman Azhar: Database Systems
4
OLAP

Increasing importance of On-Line
Application Processing (OLAP) queries



Few, but complex queries --- may run for
hours.
Queries do not depend on having an
absolutely up-to-date database.
Sometimes called Data Mining
2/10/05
Salman Azhar: Database Systems
5
OLAP Examples
Amazon analyzes customer purchases by its
customers to recommend with products of
likely interest
1.


Compares purchases between customers
Takes longer than customers are willing to wait
Wal-Mart looks for items with sales trends in
a region or time period
2.


2/10/05
Presents data to vendors
Used to determine ordering and inventory
Salman Azhar: Database Systems
6
Common Architecture



Databases at branches handle OLTP
Local databases copied to a central
warehouse overnight (or periodically)
Analysts use the warehouse for OLAP
OLTP
OLTP
OLAP
OLTP
Transaction Users
2/10/05
OLTP
Analysts
Salman Azhar: Database Systems
7
Data Warehouse
Data Warehouse
Data
Sources
Staging
Area
Data Marts
User
Data Access
Data Input
Data Access
2/10/05
Salman Azhar: Database Systems
8
Star Schemas
A star schema


common organization for data at a warehouse
It consists of…


Fact table :



a very large accumulation of facts such as sales
often “insert-only”
Dimension tables :


2/10/05
smaller, generally static information
about the entities involved in the facts
Salman Azhar: Database Systems
9
Star
Schema
Dimension Table
Time_Dim
TimeKey
TheDate
...
Employee_Dim
EmployeeKey
EmployeeID
...
Fact Table
Sales_Fact
Product_Dim
ProductKey
TimeKey
EmployeeKey
ProductKey
CustomerKey
ShipperKey
ProductID
...
Sales Amount
Unit Sales ...
Shipper_Dim
Customer_Dim
ShipperKey
CustomerKey
ShipperID
...
CustomerID
...
2/10/05
Salman Azhar: Database Systems
10
Example: Star Schema

Suppose we want to record in a
warehouse information about every car
sale:


dealer, car, buyer, day, time, price paid
The fact table is a relation:

2/10/05
Sale(dealer, model, buyer, day, time, price)
Salman Azhar: Database Systems
11
Example, Continued

The dimension tables include
information about the dealer, car, and
buyer “dimensions”:




Dealer(dealer, city, zip)
Car(model, manufacturer)
Buyer(buyer, city, phone)
Recall the fact table:

2/10/05
Sale(dealer, model, buyer, day, time, price)
Salman Azhar: Database Systems
12
Dimensions and Dependent
Attributes
Two classes of fact-table attributes:


Dimension attributes :



Dependent attributes :



2/10/05
the key of a dimension table
Sale(dealer, model, buyer, day, time, price)
a value determined by the dimension
attributes of the row
Sale(dealer, model, buyer, day, time, price)
E.g., price determined by the combination of
dealer, model, buyer, day, time
Salman Azhar: Database Systems
13
Example: Dependent Attribute

price is determined by

the combination of dimension attributes:

2/10/05
dealer, car, buyer, and the time (combination
of day and time attributes).
Salman Azhar: Database Systems
14
Approaches to Building
Warehouses
ROLAP = “relational OLAP”:


Tune a relational DBMS to support star
schemas
MOLAP = “multidimensional OLAP”:


2/10/05
Use a specialized DBMS with a model
such as the “data cube”
Salman Azhar: Database Systems
15
ROLAP Techniques

Bitmap indexes :

For each key value of a dimension table
(e.g., each model for relation Cars)


create a bit-vector telling which tuples of the
fact table have that value
Materialized views :

Store the answers to several useful
queries (views) in the warehouse itself

2/10/05
Stored views!
Salman Azhar: Database Systems
16
Typical OLAP Queries

Often, OLAP queries begin with a “star join”:


the natural join of the fact table with all or most of the
dimension tables
Recall the tables:
Sales(dealer, model, buyer, day, time, price)
Dealers(dealer, city, zip)

Cars(model, manufacturer)
Buyers(buyer, city, phone)
Example:
SELECT *
FROM Sales, Dealers,
WHERE Sales.dealer =
Sales.model =
Sales.buyer =
2/10/05
Cars, Buyers
Dealers.dealer AND
Cars.model AND
Buyers.buyer;
Salman Azhar: Database Systems
17
Typical OLAP Queries --- 2
The typical OLAP query will:

1.
2.
3.
4.
2/10/05
Start with a star join
Select for interesting tuples, based on
dimension data
Group by one or more dimensions
Aggregate certain attributes of the result
Salman Azhar: Database Systems
18
Example: OLAP Query
For each dealer in Indianapolis


find the total sales of each car manufactured by
BMW
Filter:


city = “Indianapolis”
Grouping:


by dealer and car
Aggregation:


2/10/05
Sum of price
manf = “BMW”
GROUP EXERCISE:
Write the SQL Query
Note: Do not turn over to the
next page before attempting
this exercise yourself!
Salman Azhar: Database Systems
19
Example: In SQL
SELECT dealer, model, SUM(price)
FROM Sales NATURAL JOIN Dealers
NATURAL JOIN Cars
WHERE Dealer.city = ’Indianapolis’
AND Car.manf = ’BMW’
GROUP BY dealer, model;
2/10/05
Salman Azhar: Database Systems
20
Using Materialized Views


A direct execution of this query from
Sales and the dimension tables could
take too long
If we create a materialized view that
contains enough information,

2/10/05
we may be able to answer our query much
faster
Salman Azhar: Database Systems
21
Example: Materialized View
Which views could help with our query?
Key issues:


1.
2.
3.
4.
It must
It must
It must
Cars
It must
2/10/05
join Sales, Dealers, and Cars, at least
group by at least dealer and car
not select out Indianapolis Dealers or BMW
not project out city or manf
Salman Azhar: Database Systems
22
Example --- Continued

Here is a materialized view that could help:
CREATE VIEW vSales(dealer, city,
car, manf, sales) AS
SELECT dealer, city, model, manf,
SUM(price) sales
FROM Sales NATURAL JOIN Dealers
NATURAL JOIN Cars
GROUP BY dealer, city,
model, manf;
Since dealer -> city and model -> manf, there is no real grouping.
We need city and manf in the SELECT.
2/10/05
Salman Azhar: Database Systems
23
Example --- Concluded

Here’s our query using the
materialized view vSales:
SELECT dealer, car, sales
FROM vSales
WHERE city = ’Indianapolis’
AND manf = ’BMW’;
2/10/05
Salman Azhar: Database Systems
24
MOLAP and Data Cubes

Keys of dimension tables are the
dimensions of a hypercube


Example: for the Sales data, the four
dimensions are Dealers, Cars, Buyers, and
time
Dependent attributes (e.g., price)
appear at the points of the cube
2/10/05
Salman Azhar: Database Systems
25
Defining a Cube
Atlanta
Chicago
Denver
Grapes
Cherries
Melons
Apples
Detroit
Q1
2/10/05
Q2
Q3
Time Dimension
Q4
Salman Azhar: Database Systems
26
Querying a Cube
Sales
Fact
Atlanta
Chicago
Denver
Grapes
Cherries
Melons
Apples
Dallas
Q1
2/10/05
Q2
Q3
Time Dimension
Q4
Salman Azhar: Database Systems
27
Defining a Cube Slice
Atlanta
Chicago
Denver
Grapes
Cherries
Melons
Apples
Detroit
Q1
2/10/05
Q2
Q3
Time Dimension
Q4
Salman Azhar: Database Systems
28
Working with Dimensions and
Hierarchies

Dimensions Allow You to
Slice
Dice



Hierarchies Allow You to


Drill Down
Drill Up
2/10/05
Salman Azhar: Database Systems
29
Marginals


The data cube also includes
aggregation (typically SUM) along the
margins of the cube
The marginals include

2/10/05
aggregations over one dimension, two
dimensions,…
Salman Azhar: Database Systems
30
Example: Marginals

Our 4-dimensional Sales cube includes


the sum of price over each dealer, each
car, each buyer, and each time unit
(perhaps days)
It would also have the sum of price
over

2/10/05
all dealer-model pairs, all dealer-buyer-day
triples,…
Salman Azhar: Database Systems
31
Structure of the Cube

Think of each dimension as having an
additional value *


A point with one or more *’s in its coordinates
aggregates over the dimensions with the *’s.
Example:

Sales(“Auto Nation”, “Mini Cooper”, *, *) holds the
sum over all Buyers and all time of the Mini
Coopers bought at AutoNation
2/10/05
Salman Azhar: Database Systems
32
Drill-Down

Drill-down = “de-aggregate”


= break an aggregate into its constituents
Example:

having determined that Auto Nation sells
very few BMW Cars,

2/10/05
break down his sales by particular car
Salman Azhar: Database Systems
33
Roll-Up

Roll-up


= aggregate along one or more
dimensions.
Example:

given a table of how many Mini Coopers
each buyer buys at each dealer,

2/10/05
roll it up into a table giving total number of Mini
Coopers bought by each buyer
Salman Azhar: Database Systems
34
Materialized Data-Cube Views


Data cubes invite materialized views
that are aggregations in one or more
dimensions
Dimensions may not be completely
aggregated

2/10/05
an option is to group by an attribute of the
dimension table
Salman Azhar: Database Systems
35
Example
A materialized view for our Sales data
cube might:

1.
2.
3.
4.
2/10/05
Aggregate by buyer completely
Not aggregate at all by car
Aggregate by time according to the week
Aggregate according to the city of the
dealer
Salman Azhar: Database Systems
36
Data Mining
Data mining is a popular term for

queries that summarize big data sets
in useful ways
Examples:

1.
2.
2/10/05
Clustering all Web pages by topic
Finding characteristics of fraudulent
credit-card use
Salman Azhar: Database Systems
37
Market-Basket Data

An important form of mining from
relational data involves market baskets


sets of “items” that are purchased together
as a customer leaves a store
Summary of basket data is frequent
itemsets

2/10/05
sets of items that often appear together in
baskets
Salman Azhar: Database Systems
38
Example: Market Baskets
If people often buy bread and butter
together, the store can:

1.
2.
2/10/05
Put bread and butter near each other and
put potato chips between the two
Run a sale on bread and raise the price of
butter
Salman Azhar: Database Systems
39
Finding Frequent Pairs

The simplest case is when we only want
to find “frequent pairs” of items.


Assume data is in a relation
Baskets(basket, item)
The support thresholds is the minimum
number of baskets in which a pair
appears before we are interested
2/10/05
Salman Azhar: Database Systems
40
Frequent Pairs in SQL
Look for two
Basket tuples
with the same
basket and
different items.
First item must
precede second,
so we don’t
count the same
pair twice.
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;
Create a group for
Throw away pairs of items
that do not appear at least
s times.
2/10/05
each pair of items
that appears in at
least one basket.
Salman Azhar: Database Systems
41
Summary

OLAP vs. OLTP


Two different worlds
Warehousing




2/10/05
Data Cubes
Data Mining
Materialized views
Storing aggregate data
Salman Azhar: Database Systems
42