ppt - Stanford University

Download Report

Transcript ppt - Stanford University

CS 345:
Topics in Data Warehousing
Thursday, September 30, 2004
Review of Tuesday’s Class
• Roles of Information Technology
– 1. Automate clerical work
– 2. Decision support
• OLAP vs. OLTP
–
–
–
–
Different query characteristics
Different performance requirements
Different data modeling requirements
OLAP combines data from many sources
• High-level course outline
–
–
–
–
Logical Database Design
Query Processing
Physical Database Design
Data Mining
Outline of Today’s Class
• Data integration
• Basic OLAP queries
– Data cubes
– Slice and dice, drill down, roll up
– MOLAP vs. ROLAP
– SQL OLAP Extensions: ROLLUP, CUBE
• Star Schemas
– Facts and Dimensions
Loading the Data Warehouse
Data is periodically
extracted
Data is cleansed and
transformed
Users query the data
warehouse
Source Systems
(OLTP)
Data Staging Area
Data Warehouse
Terminology: ETL
• ETL = Extraction, Transformation, & Load
• Extraction: Get the data out of the source
systems
• Transformation: Convert the data into a useful
format for analysis
• Load: Get the data into the data warehouse
(…and build indexes, materialized views, etc.)
• We will return to this topic in a couple weeks.
Data Integration is Hard
• Data warehouses combine data from multiple sources
• Data must be translated into a consistent format
• Data integration represents ~80% of effort for a typical
data warehouse project!
• Some reasons why it’s hard:
– Metadata is often poor or non-existent
– Data quality is often bad
• Missing or default values
• Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)
– Inconsistent semantics
• What is an airline passenger?
Federated Databases
• An alternative to data warehouses
• Data warehouse
– Create a copy of all the data
– Execute queries against the copy
• Federated database
– Pull data from source systems as needed to answer queries
• “lazy” vs. “eager” data integration
Query
Extraction
Query
Answer
Warehouse
Source
Systems
Rewritten
Queries
Data Warehouse
Answer Mediator
Federated Database
Source
Systems
Warehouses vs. Federation
• Advantages of federated databases:
– No redundant copying of data
– Queries see “real-time” view of evolving data
– More flexible security policy
• Disadvantages of federated databases:
–
–
–
–
Analysis queries place extra load on transactional systems
Query optimization is hard to do well
Historical data may not be available
Complex “wrappers” needed to mediate between analysis server
and source systems
• Data warehouses are much more common in practice
– Better performance
– Lower complexity
– Slightly out-of-date data is acceptable
Two Approaches to Data
Warehousing
• Data mart: like a data warehouse, but smaller and more
focused
• Top-down approach
– First build single unified data warehouse with all enterprise data
– Then create data marts containing specialized subsets of the
data from the warehouse
• Bottom-up approach
– First build a data mart to solve the most pressing problem
– Then build another data mart, then another
– Data warehouse = union of all data marts
• In practice, not much difference between the two
• Our book advocates the bottom-up approach
Data Cube
• Axes of the cube
represent attributes of the
data records
– Generally discrete-valued /
categorical
– e.g. color, month, state
– Called dimensions
• Cells hold aggregated
measurements
– e.g. total $ sales, number
of autos sold
– Called facts
• Real data cubes have >>
3 dimensions
Auto Sales
Red
Blue
WA
OR
CA
Gray
Jul Aug Sep
Slicing and Dicing
Red
Red
Blue
Blue
WA
OR
CA
Gray
Gray
Jul Aug Sep
Jul Aug Sep
Blue
Total
Jul Aug Sep
WA
OR
CA
WA
OR
CA
Blue
Jul Aug Sep
Querying the Data Cube
Number of Autos Sold
• Cross-tabulation
– “Cross-tab” for short
– Report data grouped by 2
dimensions
– Aggregate across other
dimensions
– Include subtotals
• Operations on a cross-tab
– Roll up (further aggregation)
– Drill down (less aggregation)
CA
OR
WA
Total
Jul
45
33
30
108
Aug
50
36
42
128
Sep
38
31
40
109
Total
133
100
112
345
Roll Up and Drill Down
Number of Autos Sold
Number of Autos Sold
CA
OR
WA
Total
Jul
45
33
30
108
Aug
50
36
42
128
Sep
38
31
40
109
100
112
345
Total 133
CA
OR
WA
Total
133
100
112
345
Roll up
by Month
Drill down
by Color
Number of Autos Sold
CA
OR
WA
Total
Red
40
29
40
109
Blue
45
31
37
113
Gray 48
40
35
123
Total 133
100
112
345
“Standard” Data Cube Query
• Measurements
– Which fact(s) should be reported?
• Filters
– What slice(s) of the cube should be used?
• Grouping attributes
– How finely should the cube be diced?
– Each dimension is either:
• (a) A grouping attribute
• (b) Aggregated over (“Rolled up” into a single total)
– n dimensions → 2n sets of grouping attributes
– Aggregation = projection to a lower-dimensional
subspace
Full Data Cube with Subtotals
• Pre-computation of aggregates → fast answers
to OLAP queries
• Ideally, pre-compute all 2n types of subtotals
• Otherwise, perform aggregation as needed
• Coarser-grained totals can be computed from
finer-grained totals
– But not the other way around
Data Cube Lattice
State, Month,
Color
Drill
Down
State,
Month
State,
Color
Month,
Color
State
Month
Color
Total
Roll
Up
MOLAP vs. ROLAP
•
•
•
•
MOLAP = Multidimensional OLAP
Store data cube as multidimensional array
(Usually) pre-compute all aggregates
Advantages:
– Very efficient data access → fast answers
• Disadvantages:
– Doesn’t scale to large numbers of dimensions
– Requires special-purpose data store
Sparsity
• Imagine a data warehouse for Safeway.
• Suppose dimensions are: Customer, Product, Store, Day
• If there are 100,000 customers, 10,000 products, 1,000
stores, and 1,000 days…
• …data cube has 1,000,000,000,000,000 cells!
• Fortunately, most cells are empty.
• A given store doesn’t sell every product on every day.
• A given customer has never visited most of the stores.
• A given customer has never purchased most products.
• Multi-dimensional arrays are not an efficient way to store
sparse data.
MOLAP vs. ROLAP
•
•
•
•
ROLAP = Relational OLAP
Store data cube in relational database
Express queries in SQL
Advantages:
–
–
–
–
Scales well to high dimensionality
Scales well to large data sets
Sparsity is not a problem
Uses well-known, mature technology
• Disadvantages:
– Query performance is slower than MOLAP
– Need to construct explicit indexes
Creating a Cross-tab with SQL
Grouping
Attributes
Measurements
SELECT state, month, SUM(quantity)
FROM sales
GROUP BY state, month
WHERE color = 'Red'
Filters
What about the totals?
• SQL aggregation query
with GROUP BY does not
produce subtotals, totals
• Our cross-tab report is
incomplete.
Number of Autos Sold
CA
OR
WA
Total
Jul
45
33
30
?
Aug
50
36
42
?
Sep
38
31
40
?
Total
?
?
?
?
State
CA
CA
CA
OR
OR
OR
WA
WA
WA
Month
Jul
Aug
Sep
Jul
Aug
Sep
Jul
Aug
Sep
SUM
45
50
38
33
36
31
30
42
40
One solution: a big UNION ALL
Original
Query
State
Subtotals
Month
Subtotals
Overall
Total
SELECT state, month, SUM(quantity)
FROM sales
GROUP BY state, month
WHERE color = 'Red‘
UNION ALL
SELECT state, "ALL", SUM(quantity)
FROM sales
GROUP BY state
WHERE color = 'Red'
UNION ALL
SELECT "ALL", month, SUM(quantity)
FROM sales
GROUP BY month
WHERE color = 'Red‘
UNION ALL
SELECT "ALL", "ALL", SUM(quantity)
FROM sales
WHERE color = 'Red'
A better solution
• “UNION ALL” solution gets cumbersome with
more than 2 grouping attributes
• n grouping attributes → 2n parts in the union
• OLAP extensions added to SQL 99 are more
convenient
– CUBE, ROLLUP
SELECT state, month, SUM(quantity)
FROM sales
GROUP BY CUBE(state, month)
WHERE color = 'Red'
Results of the CUBE query
Notice the use of
NULL for totals
Subtotals at
all levels
State
CA
CA
CA
CA
OR
OR
OR
OR
WA
WA
WA
WA
NULL
NULL
NULL
NULL
Month
Jul
Aug
Sep
NULL
Jul
Aug
Sep
NULL
Jul
Aug
Sep
NULL
Jul
Aug
Sep
NULL
SUM(quantity)
45
50
38
133
33
36
31
100
30
42
40
112
108
128
109
345
ROLLUP vs. CUBE
• CUBE computes entire lattice
• ROLLUP computes one path through lattice
– Order of GROUP BY list matters
– Groups by all prefixes of the GROUP BY list
GROUP BY ROLLUP(A,B,C) GROUP BY CUBE(A,B,C)
•A,B,C
•A,B,C
•(A,B) subtotals
•Subtotals for the following:
•(A) subtotals
(A,B), (A,C), (B,C),
•Total
(A), (B), (C)
•Total
ROLLUP example
SELECT color, month, state, SUM(quantity)
FROM sales
State, Month,
GROUP BY ROLLUP(color,month,state)
Color
State,
Month
State,
Color
Month,
Color
State
Month
Color
Total
Logical Database Design
• Logical database design is the topic for the next
two weeks.
• Logical design vs. physical design:
– Logical design = conceptual organization for the
database
• Create an abstraction for a real-world process
– Physical design = how is the data stored
• Select data structures (tables, indexes, materialized views)
• Organize data structures on disk
• Three main goals for logical design:
– Simplicity
– Expressiveness
– Performance
Goals for Logical Design
• Simplicity
– Users should understand the design
– Data model should match users’ conceptual model
– Queries should be easy and intuitive to write
• Expressiveness
– Include enough information to answer all important
queries
– Include all relevant data (without irrelevant data)
• Performance
– An efficient physical design should be possible
Another perspective…
• From a presentation at SIGMOD*:
• Three design rules:
– Simplicity
– Simplicity
– Simplicity
*by Jeff Byard and Donovan Schneider, Red Brick Systems
Star Schema
Fact table
Date
Promotion
Sales
Store
Product
Dimension tables
Dimension Tables
• Each one corresponds to a real-world object or concept.
– Examples: Customer, Product, Date, Employee, Region, Store,
Promotion, Vendor, Partner, Account, Department
• Properties of dimension tables:
– Contain many descriptive columns
• Dimension tables are wide (dozens of columns)
– Generally don’t have too many rows
• At least in comparison to the fact tables
• Usually < 1 million rows
– Contents are relatively static
• Almost like a lookup table
• Uses of dimension tables
– Filters are based on dimension attributes
– Grouping columns are dimension attributes
– Fact tables are referenced through dimensions
Fact Tables
• Each fact table contains measurements about a process
of interest.
• Each fact row contains two things:
– Numerical measure columns
– Foreign keys to dimension tables
– That’s all!
• Properties of fact tables:
– Very big
• Often millions or billions of rows
– Narrow
• Small number of columns
– Changes often
• New events in the world → new rows in the fact table
• Typically append-only
• Uses of fact tables:
– Measurements are aggregated fact columns.
Comparing Facts and Dimensions
Facts
•
•
•
•
Narrow
Big (many rows)
Numeric
Growing over time
Dimensions
•
•
•
•
Wide
Small (few rows)
Descriptive
Static
Facts contain numbers, dimensions contain labels
Four steps in dimensional modeling
1. Identify the process being modeled.
2. Determine the grain at which facts will be
stored.
3. Choose the dimensions.
4. Identify the numeric measures for the
facts.
Grain of a Fact Table
• Grain of a fact table = the meaning of one fact table row
• Determines the maximum level of detail of the
warehouse
• Example grain statements: (one fact row represents a…)
–
–
–
–
–
Line item from a cash register receipt
Boarding pass to get on a flight
Daily snapshot of inventory level for a product in a warehouse
Sensor reading per minute for a sensor
Student enrolled in a course
• Finer-grained fact tables:
– are more expressive
– have more rows
• Trade-off between performance and expressiveness
– Rule of thumb: Err in favor of expressiveness
– Pre-computed aggregates can solve performance problems
Choosing Dimensions
• Determine a candidate key based on the
grain statement.
– Example: a student enrolled in a course
– (Course, Student, Term) is a candidate key
• Add other relevant dimensions that are
functionally determined by the candidate
key.
– For example, Instructor and Classroom
• Assuming each course has a single instructor!