Transcript Document

Data Analysis
Overview
• Traditional database systems are tuned to many, small,
simple queries.
• Some applications use fewer, more time-consuming,
analytic queries.
OLTP
• Most database operations involve On-Line
Transaction Processing (OTLP).
– Short, simple, frequent queries and/or modifications,
each involving a small number of tuples.
• Examples:
– Answering queries from a Web interface,
– sales at cash registers,
– selling airline tickets.
OLAP
• On-Line Analytical Processing (OLAP, or “analytic”)
queries are, typically:
– Few, but complex queries --- may run for hours.
– Queries don't depend on having an absolutely up-to-date
database.
• Example
– Partition car sales by years, and look at only the last two
years (2007 and 2008). Aggregate price.
Common Architecture
• Databases at store branches handle OLTP.
• Local store databases copied to a central data warehouse
overnight.
• Analysts use the data warehouse for OLAP.
Star Schemas
•
A star schema is a common organization for data at a
warehouse. It consists of:
1.
2.
Fact table : a very large accumulation of facts such as
sales.
Dimension tables : smaller, generally static information
about the entities involved in the facts.
Example: Star Schema
• Suppose we want to record in a warehouse information
about every car sale, e.g.:
–
–
–
–
serial number,
dealer who sold the car,
date of the sale, and
price charged.
• The fact table is thus:
Sales(serialNo, dealer, day, price)
Example -- Continued
• The dimension tables include information about the
autos, dealers, and days “dimensions”:
Autos(serialNo, model, color)
Dealers(name, city, province)
Days(day, week, month, year)
Days dimension table
probably not stored.
Visualization – Star Schema
Dimension Table (Autos)
Dimension Attrs.
Dimension Table (Dealer)
Dependent Attrs., e.g. price
Fact Table - Sales
Dimension Table (Day)
Dimension Table (etc.)
Data Cubes
• Keys of dimension tables are the dimensions of a
hypercube.
• Example: for the Sales data, the three dimensions are
serialNo, date, and dealer.
• Dependent attributes (e.g., price) appear at the points of
the cube.
Visualization -- Data Cubes
car
price
dealer
date
Slicing and Dicing
• Slice
– focus on particular partitions along (one or more) dimension
i.e., focus on a particular slice of the cube
WHERE clause in SQL
• Dice
– partition the cube into smaller subcubes and aggregated the
points in each
GROUP BY clause in SQL
Slicing and Dicing in SQL
SELECT grouping-attributes and aggregations
FROM fact table joined with (zero or more) dimension tables
WHERE certain attributes are compared with constants /* slicing */
GROUP BY grouping-attributes /* dicing */
Slicing and Dicing Example
• Suppose a particular car model, say ‘Gobi’, isn't selling as
well as anticipated. How to analyze?
• Maybe it’s the color…
• Slice for ‘Gobi.’ Dice for color.
SELECT color, SUM(price)
FROM Sales NATURAL JOIN Autos
WHERE model = 'Gobi'
GROUP BY color;
Slicing and Dicing Example
• Suppose the previous query doesn't tell us much; each color
produces about the same revenue.
• Since the query does not dice for time, we only see the total
over all time for each color.
– We may thus issue a revised query that also partitions time by
month.
SELECT color, month, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model = 'Gobi'
GROUP BY color, month;
Slicing and Dicing Example
• We might discover that red Gobis haven’t sold well recently.
• Does this problem exists at all dealers, or only some dealers
have had low sales of red Gobis?
• Let’s dice on dealer dimension.
SELECT dealer, month, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model = 'Gobi' AND color = 'red‘
GROUP BY month, dealer;
Slicing and Dicing Example
• At this point, we find that the sales per month for red Gobis
are so small that we cannot observe any trends easily.
• We decide it was a mistake to dice by month. A better
partition by years, and look at only the last two years (2007
and 2008).
SELECT dealer, year, SUM(price)
FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
WHERE model = 'Gobi' AND color = 'red' AND
(year = 2007 OR year = 2008)
GROUP BY year, dealer;
Drill-Down and Roll-Up
• Previous examples illustrate two common patterns in
sequences of queries that slice-and-dice the data cube.
• Drill-down is the process of partitioning more finely and/or
focusing on specific values in certain dimensions.
– Each of the example steps except the last is an instance of
drill-down.
• Roll-up is the process of partitioning more coarsely.
– The last step, where we grouped by years instead of months
to eliminate the effect of randomness in the data, is an
example of roll-up.
Marginals --- Data Cube w/Aggregation
car
price
dealer
date
Cube operator
• CUBE(F) is an augmented table for fact table F
– tuples (or points) added in CUBE(F)
• have a value, denoted * to each dimension
– * represents aggregation along that dimension
Example
Sales(serialNo, date, dealer, price)
serialNo dimension not well suited for the cube as summing the
price over all dates, or over all dealers, but keeping the serial
number fixed has no effect.
Sales(model, color, date, dealer, val, cnt)
We replace the serialNo by model and color to which the serial
number connects.
Also, we will have as independent attributes, val for sum of prices
and cnt for number of cars sold.
Example: CUBE(Sale)
• ('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2)
– On May 21, 2001, dealer Friendly Fred sold two red Gobis
for a total of $45,000.
– This tuple is in both Sales and CUBE(Sales).
• ('Gobi', *, '2001-05-21', 'Friendly Fred', 152000, 7)
– On May 21, 2001, Friendly Fred sold seven Gobis of all
colors, for a total price of $152,000.
– Note that this tuple is in CUBE(Sales) but not in Sales.
Example: CUBE(Sale)
• ('Gobi', *, '2001-05-21', *, 2348000, 100)
– On May 21, 2001, there were 100 Gobis sold by all the
dealers, and the total price of those Gobis was $2,348,000.
• ('Gobi', *, *, *, 1339800000, 58000)
– Over all time, dealers, and colors, 58,000 Gobis have been
sold for a total price of $1,339,800,000.
• (*, *, *, *, 3521727000, 198000)
– Total sales of all models in all colors, over all time at all
dealers is 198,000 cars for a total price of $3,521,727,000.
CUBE Helps Answering Queries
SELECT color, AVG(price)
FROM Sales
WHERE model = 'Gobi'
GROUP BY color;
is answered by looking for all tuples of CUBE(Sales) with the form
('Gobi', c, *, *, v, n)
where c is any specific color. v and n will be the sum and number of
sales of Gobis in that color. The average price, is v/n. The answer to
query is the set of (c; v/n) pairs obtained from all ('Gobi', c, *, *, v, n)
tuples.
CUBE in SQL
SELECT model, color, date, dealer, SUM(val) AS v, SUM(cnt) AS n
FROM Sales
GROUP BY model, color, date, dealer
WITH CUBE;
Suppose we materialize this into SalesCube.
Then the previous query is rewritten into:
SELECT color, SUM(v)/SUM(n)
FROM SalesCube
WHERE model = 'Gobi'
AND
date IS NULL AND
dealer IS NULL;
Data Mining
Data mining is a popular term for summarization of big data
sets in useful ways.
Examples:
1.
2.
3.
Clustering Web pages by topic.
Finding characteristics of fraudulent credit-card use.
Finding products that are usually bought together.
Market-Basket Data
• An important form of mining from relational data involves
market baskets = sets of “items” that are purchased
together as a customer leaves a store.
• Summary of basket data is frequent itemsets = sets of
items that often appear together in baskets.
Example: Market Baskets
•
If people often buy hamburger and ketchup together, the
store can:
1.
2.
Put hamburger and ketchup near each other and put
potato chips between.
Run a sale on hamburger and raise the price of ketchup.
Finding Frequent Pairs
• The simplest case is when we only want to find “frequent
pairs” of items.
• Assume data is in a relation Baskets(basket, item).
• The support threshold s is the minimum number of
baskets in which a pair appears before we are interested.
Frequent Pairs in SQL
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;
Throw away pairs of items
that do not appear at least
s times.
Look for two
Basket tuples
with the same
basket and
different items.
First item must
precede second,
so we don’t
count the same
pair twice.
Create a group for
each pair of items
that appears in at
least one basket.
A-Priori Trick – (1)
• Straightforward implementation involves a join of a huge
Baskets table with itself.
• The A-Priori algorithm speeds the query by recognizing
that a pair of items {i, j } cannot have support s unless
both {i } and {j } do.
A-Priori Trick – (2)
• Use a materialized view to hold only
information about frequent items.
INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
Items that
WHERE item IN (
appear in at
least s baskets.
SELECT item FROM Baskets
GROUP BY item
HAVING COUNT(*) >= s
);
A-Priori Algorithm
1.
2.
Materialize the view Baskets1.
Run the obvious query, but on Baskets1 instead of
Baskets.
Notes
•
Computing Baskets1 is cheap, since it doesn’t involve a
join.
•
Baskets1 probably has many fewer tuples than Baskets.
–
Running time shrinks with the square of the number of
tuples involved in the join.
Course Plug
Fall 2009: CSC 485E / CSC 571: Advanced Databases
Spring 2010: SENG 474: Data Mining