(WHAT) Dimension Table - NDSU Computer Science

Download Report

Transcript (WHAT) Dimension Table - NDSU Computer Science

What is Data Warehouse?
Defined in many different ways.
 A decision support database that is maintained separately
from the organization’s operational database



Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data warehousing:

The process of constructing and using data warehouses
Subject-Oriented


Organized around major subject(s) or fact(s), such as sales, enrollments, experiments, events.
Focused on modeling and analysis for decision makers, not on daily operations or transactions.
Integrated



Constructed (possibly) by integrating multiple, heterogeneous data sources
That must be cleaned and data integrated

To ensure consistency in naming conventions, encoding structures, attribute measures, etc.
among different data sources (E.g., Hotel price: currency, tax, breakfast covered, etc.).
This step is done at the time the data is moved to the data warehouse.
Time Variant


The time horizon for the DW is significantly longer than that of operational systems.

Operational database: current value data.

Data warehouse data: provide info from a historical perspective (e.g., past 5-10 years)
Every structure in the data warehouse contains an element of time, explicitly or implicitly

But operational data may or may not contain “time element” (always assumed to be the
“current value”)
Non-volatile


A physically separate store of data transformed from the operational environment.
Operational updates of DW data does not occur (every insert is a considered a new item).

Does not require transaction processing, recovery, and concurrency control mechanisms

Requires only two operations in data accessing: initial loading of data and read of data.
Data Mining on a DW?
Data Mining goes into MOUNTAINS of raw data for info gems.
visualization
Pattern Evaluation
and Assay
Data Mining
Task-relevant Data
Data Warehouse: cleaned,
integrated, read-only, periodic,
historical raw database
OLAP
Classification
Clustering
Rule Mining
Loop
backs
Selection
Feature extraction,
tuple selection
Data Cleaning/Integration:
missing data, outliers,
noise, errors
Smart files
April 12, 2017
Data Mining: Concepts and Techniques
4
From Tables and Spreadsheets to Data Cubes


A data warehouse is usually based on a multidimensional data model which
views data in the form of a data cube describing the subject of interest
(e.g., sales)
A data cube allows data to be modeled and viewed in multiple dimensions


Auxiliary dimension tables are added to the central cube for additional
information (e.g., for sales cube,
item (item_name, brand, type)
time (day, week, month, quarter, year), salesman (name, addr, salary)
Fact cube contains measurement(s) (e.g., number_of_sales) and keys
(references) to each of the related dimension tables.
A Sample Data Cube
Each cell contains a sales measurement, e.g., the number of sales
(may contain many other measurements of product-date-country instances)
2Qtr
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
TV
PC
VCR
1Qtr
Date
Total sales by country and date
Rollup (aggregate under +) along product
2Qtr
Date
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
TV
PC
VCR
1Qtr
(e.g., using the aggregate, sum)
Total of all
product sales
by country
and quarter
Rollup along date
(e.g., using the aggregate, sum)
2Qtr
Date
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
TV
PC
VCR
1Qtr
Total annual sales
by country and product
Rollup along country
(e.g., using the aggregate, sum)
2Qtr
3Qtr
4Qtr
U.S.A
Canada
Mexico
Total of all product sales by
product and date
Country
TV
PC
VCR
1Qtr
Date
All rollups
(e.g., using the aggregate, sum)
2Qtr
3Qtr
4Qtr
U.S.A
sales by product, country and quarter
sales by
Canada
product,
country
Mexico
sales by country, date
sales by product, country
sales by date
sales by
country
sales by
by
sales
product
product
Total
sales
Country
TV
PC
VCR
1Qtr
Date
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
product
product,date
date
country
product,country
1-D cuboids
date, country
2-D cuboids
Drilldown on product
Rollup on country
(Sum over country)
3-D(base or fact) cuboid
product, date, country
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: (simplest) A fact cube in the middle (star
center) connected to the dimension tables (star points)

Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake

Fact constellations: Multiple fact cubes share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Example of Star Schema
date
date_key
day
day_of_the_week
month
quarter
year
product
Sales Fact Cube
date_key
product_key
country_key
country
country_key
country_name
country_continent
units_sold
dollars_sold
avg_sales
Measures
product_key
product_name
brand
type
supplier_type
Example of Snowflake Schema
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Cube
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
supplier
supplier_key
supplier_type
location
location_key
street
city_key
city
city_key
city
province_or_street
country
Example of Fact Constellation
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Cube
time_key
item_key
item_name
brand
type
supplier_type
item_key
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
time_key
item_key
shipper_key
from_location
branch_key
branch
Shipping Fact Cube
location
to_location
location_key
street
city
province_or_street
country
dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type
Visualizing a 5-D Data Cube
Sales Volume = box size
(4th dimensions)
Product color = box-color
(5th dimension)
Visualization is very important and can
be done for more than 3 dimensions.
Typical OLAP Operations

Roll up: summarize data


Grouped-by aggregation – dimension generalization, e.g. if date is
initially in months, rollup to quarters (sum groups by quarter) or
Dimension reduction/elimination (e.g., Slide-7: rollup by summing over
all products leaving just country and date)


Drill down: reverse of roll-up


from higher level summary to lower level summary (detailed data) or introducing
new dims
Slice and dice:


aggregating over an entire dimension, eliminating it.
project and select
Pivot (rotate):

reorient (re-order) the cube, for visualization and faster processing.
Partial Rollup: climbing up a concept hierarchy
(instead of eliminating Product altogether by summing over all products,
rollup partially on Product, from (VCR, PC, TV) to computer (includes PC only)
and non-computer (includes VCR + TV)
TV
non-comp
VCR
comp
PC
2Qtr
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
1Qtr
Date
SLICE
e.g., slice off PC
TV
VCR
PC
2Qtr
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
1Qtr
Date
(e.g. dice off PC, the last two quarters, the country Mexico)
1Qtr
TV
VCR
PC
2Qtr
Date
3Qtr
4Qtr
U.S.A
Canada
Mexico
Country
DICE
Pivot/Rotate
4Qtr
1Qtr
2Qtr
Mexico
U.S.A
Canada
Country
primary
Date
3Qtr
4Qtr
TV
PC
VCR
Country
3Qtr
Mexico
1Qtr
Canada
2Qtr
U.S.A
Date
TV
PC
VCR
secondary
tertiary
Some important DWs for our
region: Northern Border Security
Data Warehouse
Category
Activity
Pub
Info
Reported by
Activity0
crossing
1
USBP
Activity1
felony
0
FBI
commerce
Activity2
Activity3
Activity
1
recreation
1
(WHAT)
6’11
220
Mot
farmer
Jill Wade
5’9”
140
Zap
exec
Ahmed Ali
5’8”
190
Fgo
teach
Jose Fox
5’6”
150
GF
labor
Person
(WHO)
a0
a1
a3
Dept of
Commerce
Chamber of
Commerce
Where
l0
l1
a2
p1
p2
p3
time
time
time
time
timee
time
time
time
0
time
time
time
time
e1
time
time
time
time
e2
time
time
time
e3
L
A
T
L
O
N
l0
1
49
91
900
plains
urban
l1
0
49
95
2000
plains
rural
l2
1
49
89
950
river
rural
l3
1
49
90
897
plains
rural
TERRAIN
p0
time
Dimension Table
EL
Dimension Table
Who
What
L Surface
O T/F
C
Location
John Roe
l2
time
time
time
time
l3
time
time
time
time
Class
(WHERE) Dimension Table
Who-What-Where Cube
(with WHEN as an attribute)
for Who-What-Where-When
Border Events (subject)
PUBLIC (Ptree Unfied BioLogical
SubCell-Location
Myta
Ribo
Nucl
Ribo
InformtiCs Data Cube and
Dimension Tables)
Function
apop
meio
mito
apop
StopCodonDensity
.1
.1
.1
.9
PolyA-Tail
1
1
0
0
Organism
Species
Vert
Genome Size
(million bp)
human
Homo sapiens
1
3000
fly
Drosophila
melanogaster
0
185
yeast
Saccharomyces
cerevisiae
0
12.1
Mus
musculus
1
mouse
Organism
Dimension
Table
o3
g0
g2
17, 78 12, 60 Mi, 40 1, 48
17, 40
1
10,
75 1
0 1 0
o1
1 e14,
0 0
1
0
65 0
0
0
o2 16, 760
0 1 9, 45 0 1, 43 0
0
1
0
0
e1
1
1
1
1
t0
0
1
1
e2
1
1
1
e3
0
1
t2
P
I
U
N
V
S
T
R
C
T
Y
S
T
Z
E
D
A
D
S
H
M
N
1
t3
3
2
a
c
h
1
2
2
b
s
h
0
2
4
a
c
a
1
2
4
a
s
a
1
Treatment
Dimension
Table
(MIAME)
(chromosome,length)
g3
o0
1
3000
g1
t1
L
A
B
Gene-Organism
Dimension Table
Gene Dimension Table
0
1
0
1
1
1
0
0
0
0
0
1
0
1
0
1
0
1
Gene-Treatment-Organism Cube
(1 iff that gene from that organism
expresses at a threshold level under that
treatment.)
(subject = experiment)