Data Warehouse

Download Report

Transcript Data Warehouse

Data Warehouse
and the
Star Schema
CSCI 242
©Copyright 2016, David C. Roberts, all rights reserved
Agenda







2


Definition
Why data warehouse
Product History
Processing Star queries
Data warehouse in the enterprise
Data warehouse design
Relevance of normalization
Star schema
Processing the star schema
Definition
Data warehouse: A repository of integrated
information, available for queries and
analysis. Data and information are extracted
from heterogeneous sources as they are
generated
The point is that it’s not used for transaction
processing; that is, it’s read-only. And the
data can come from heterogeneous sources
and it can all be queried in one database.
3
Why Data Warehouse




4
A read lock on a table will prevent any updating of a
table
A long-running analytic operation on all rows of a
table will prevent any updates
Analysis (a.k.a. decision support) can seriously
interfere with updates
Using a duplicate table for analysis, recopied once a
day, allows unlimited analysis and doesn’t interfere
with OLTP.
Data Warehouse vs. OLTP
OLTP
5
DW
Purpose
Automate day-to-day
operations
Analysis
Structure
RDBMS
RMBMS
Data Model
Normalized
Dimensional
Access
SQL
SQL and business analysis
programs
Data
Data that runs the business
Current and historical information
Condition of data
Changing, incomplete
Historical, complete, descriptive
Red Brick



6
Invented data warehouse; they sold a hardware
product with a star schema database
You loaded the Red Brick Warehouse and then
queried it for OLTP
It featured new optimizations for star schemas, was
very fast
Enter Sybase




7
Sybase learned the optimization and
developed their own product.
The Sybase product was a stand-alone
software data warehouse product
It couldn’t do general-purpose database
work, was just a data warehouse
They appear to have copied the Red Brick
idea, without selling hardware
Enter Oracle



8
Oracle, later, also copied the same
optimization
They added a bitmap index to their database
product, and added the star schema
optimization
Now their product could do data warehouse
as well as database
Status Today



9
Oracle dominates the field today
IBM eventually bought Red Brick so still
offers some sort of Red Brick product
Sybase offers their OLTP product, now as an
offering of SAP
So what is this algorithm that is so copied?
PROCESSING STAR QUERIES
10
Optimizing Star Queries



11
Build a bitmap index on each foreign key
column of the fact table
Index is a 2-dimensional array, one column
for each row being indexed, one row per
value of that column
Bitmap indexes are typically much smaller
than b-tree indexes, that can be larger than
the data itself
Bitmap Index Example
12
Query Processing


The typical query is a join of foreign keys of
dimension tables to the fact table
This is processed in two phases:
1.
2.
13
From the fact table, retrieve all rows that are part
of the result, using bitmap indexes
Join the result of the step above to the
dimension tables
Example Query
Find sales and profits from the grocery
departments of stores in the West and
Southwest districts over the last three quarters
14
Example Query
SELECT
store.sales_district,
time.fiscal_period,
SUM(sales.dollar_sales) revenue,
SUM(dollar_sales) - SUM(dollar_cost) income
FROM
sales, store, time, product
WHERE
sales.store_key = store.store_key AND
sales.time_key = time.time_key AND
sales.product_key = product.product_key AND
time.fiscal_period IN ('3Q95', '4Q95', '1Q96') and
product.department = 'Grocery' AND
store.sales_district IN ('San Francisco', 'Los Angeles')
GROUP BY
store.sales_district, time.fiscal_period;
15
Phase 1
Finding the rows in the SALES table (using bitmap indexes):
SELECT ... FROM sales
WHERE
store_key IN (SELECT store_key FROM store WHERE
sales_district IN ('WEST', 'SOUTHWEST')) AND
time_key IN (SELECT time_key FROM time WHERE
quarter IN ('3Q96', '4Q96', '1Q97')) AND
product_key IN (SELECT product_key FROM product WHERE
department = 'GROCERY');
16
Phase 2
Now the fact table is joined to dimension
tables. For dimension tables of small
cardinality, a full-table scan may be used. For
large cardinality, a hash join could be used.
17
The Star Transformation

Use bitmap indexes to retrieve all relevant
rows from the fact table, based on foreign
key values
–

Join this result set to the dimension tables
–
–
18
This happens very fast
If there are many values, a hash join may be used
If there are fewer values, a b-tree driven join may
be used
How DW Fits into the Enterprise
User
OLTP1
Application A
User
Integration
User
Data
Warehouse
Application B
OLTP2
User
User
Integration
Application C
User
19
Extract,
Transform
And Load
Data
Mart
OLTP3
User
Data
Mart
Data
Mart
Data
Mart
Data Warehouse Database Design



20
A conventional database design for data
warehouse would lead to joins on large
amounts of data that would run slowly
The star schema allows for fast processing of
very large quantities of data in the data
warehouse
It also allows for very compact representation
of events that occur many times
A Sample OLTP Schema
customers
orders
order
items
21
products
Transformed to a Star Schema
products
times
dimension
table
dimension
table
sales
fact table
22
customers
channels
dimension
table
dimension
table
Star Schema
Time
Location
Item
Fact Table
Supplier
Customer
23
Fact Table





24
The fact table contains the actual business process
measurements or metrics for a specific event, called facts,
usually numbers.
A fact table represents facts by foreign keys from other tables,
called “dimension” tables
These foreign keys are usually generated keys, in order to save
fact table space
If you are building a DW of monthly sales in dollars, your fact
table will contain monthly sales, one row per month.
If you are building a DW of retail sales, each row of the fact
table might have one row for each item sold.
Fact Table Design
A fact table may contain one or more facts. Usually you
create one fact table per business event. For
example if you want to analyze the sales numbers
and also advertising spending, they are two separate
business processes. So you will create two separate
fact tables, one for sales data and one for
advertising cost data. On the other hand if you want
to track the sales tax in addition to the sales number,
you simply create one more fact column in the Sales
fact table called Tax.
25
Dimension Table



26
Dimension tables have a small number of rows
(compared to fact tables) but a large number of
columns
For the lowest level of granularity of a fact in the fact
table, a dimension table will have one row that gives
all the categories for each value
The dimension table is often all key, so a generated
key is used so that the fact table reference to the
dimension table can be small
27
Time Dimension Schema
28
Column Name
Type
Dim_Id
INTEGER (4)
Month
SMALL INTEGER (2)
Month_Name
VARCHAR (3)
Quarter
SMALL INTEGER (4)
Quarter_Name
VARCHAR (2)
Year
SMALL INTEGER (2)
Time Dimension Data
29
TM _Dim_Id
TM _Month
TM_Month_Name
TM _Quarter
TM_Quarter_N
ame
TM_Year
1001
1
Jan
1
Q1
2003
1002
2
Feb
1
Q1
2003
1003
3
Mar
1
Q1
2003
1004
4
Apr
2
Q2
2003
1005
5
May
2
Q2
2003
Location Dimension Schema
30
Column Name
Type
Dim_Id
INTEGER (4)
Loc_Code
VARCHAR (4)
Name
VARCHAR (50)
State_Name
VARCHAR (20)
Country_Name
VARCHAR (20)
Location Dimension Data
31
Dim_Id
Loc_Code
Name
State_Name
Country_Name
1001
IL01
Chicago Loop
Illinois
USA
1002
IL02
Arlington Hts
Illinois
USA
1003
NY01
Brooklyn
New York
USA
1004
TO01
Toronto
Ontario
Canada
1005
MX01
Mexico City
Distrito Federal
Mexico
Product Data Schema
32
Column Name
Type
Dim_Id
INTEGER (4)
SKU
VARCHAR (10)
Name
VARCHAR (30)
Category
VARCHAR (30)
Product Data
33
Dim_Id
SKU
Name
Category
1001
DOVE6K
Dove Soap 6Pk
Sanitary
1002
MLK66F#
Skim Milk 1 Gal
Dairy
1003
SMKSAL55
Smoked Salmon 6oz
Meat
Categories in Dimension Tables


34
Categories may or may not be hierarchical;
or can be both
Categories provide canned values that can
be given to users for queries
Granularity (Grain) of the Fact Table
The level of detail of the fact table is known as
the grain of the fact table. In this example the
grain of the fact table is monthly
sales number per location per product.
35
Note about Granularity




36
There may be multiple star schemas at
different levels of granularity, especially for
very large data warehouses
The first could be the finest—say, each
transaction such as a sale
The next could be an aggregation, like the
previous example
There could be more levels of aggregation
Design Approach
1. Identify the business process.
In this step you will determine what is your business process that your data
warehouse represents. This process will be the source of your metrics or
measurements.
2. Identify the Grain
You will determine what does one row of fact table mean. In the previous example
you have decided that your grain is 'monthly sales per location per product'. It
might be daily sales or even each sale could be one row.
3. Identify the Dimensions
Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much
as possible and conform to your grain.
37
4. Finally Identify the facts
In this step you will identify what are your measurements (or metrics or facts). The
facts should be numeric and should confirm to the grain defined in step 2.
Monthly Sales Fact Table Schema
38
Field Name
Type
TM_Dim_Id
INTEGER (4)
PR_ Dim_Id
INTEGER (4)
LOC_ Dim_Id
INTEGER (4)
Sales
INTEGER (4)
Monthly Sales Fact Table Data
39
TM_Dim_Id
PR_ Dim_Id
LOC_ Dim_Id
Sales
1001
1001
1003
435677
1002
1002
1001
451121
1003
1001
1003
98765
1001
1004
1001
65432
Data Mart





40
A data mart is a collection of subject areas organized for decision
support based on the needs of a given department. Examples: finance
has their data mart, marketing has theirs, sales has theirs and so on.
Each department generally runs its own data mart. Ownership of the
data mart allows each department to bypass the control that might
coordinate the data found in the different departments.
Each department's data mart is peculiar to and specific to its own
needs. Typically, the database design for a data mart is built around a
star-join structure designed for that department.
The data mart contains only a modicum of historical information and is
granular only to the point that it suits the needs of the department.
The data mart may also include data from outside the organization,
such as purchased normative salary data that might be purchased by
an HR department.
About the Data Mart





41
The structure of the data in the data mart may or may not be
compatible with the structure of data in the data warehouse.
The amount of historical data found in the data mart is different
from the history of the data found in the warehouse. Data
warehouses contain robust amounts of history, while data marts
usually contain modest amounts of history.
The subject areas found in the data mart are only faintly related
to the subject areas found in the data warehouse.
The relationships found in the data mart may not be those
relationships that are found in the data warehouse.
The types of queries satisfied in the data mart are quite
different from those queries found in the data warehouse.
Walmart’s Data Warehouse





42
Half a petabyte in capacity (.5 x 1015 bytes)
World’s largest DW
Tracks 100 million customers buying billions of
products every week
Every sale from every store is transmitted to
Bentonville every night
Walmart has more than 18,000 retail stores, employs
2.2 million, serves 245 million customers every week
Typical Questions



43
How much orange juice did we sell last year,
last month, last week in store X?
What internal factors (position in store,
advertising campaigns...) influence orange
juice sales?
How much orange juice are we going to sell
next week, next month, next year?
44