Data Warehousing - Department of Computer Science

Download Report

Transcript Data Warehousing - Department of Computer Science

COMP207:
Data Mining
COMP207: Data Mining
Dr. M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2010
Data Warehousing
COMP207:
Data Mining
Today's Topics
Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation
Data Warehousing
COMP207:
Data Mining
What is a Data Warehouse?
Most common definition:
“A data warehouse is a subject-oriented, integrated, time-variant
and nonvolatile collection of data in support of management's
decision-making process.” - W. H. Inmon

Corporate focused, assumes a lot of data, and typically sales
related

Data for “Decision Support System” or “Management Support
System”

1996 survey: Return on Investment of 400+%
Data Warehousing: Process of constructing (and using) a data
warehouse
Data Warehousing
COMP207:
Data Mining

Subject-oriented:
 Focused on important subjects, not transactions


Data Warehouse
Concise view with only useful data for decision making
Integrated:
 Constructed from multiple, heterogeneous data sources.
Normally distributed relational databases, not necessarily
same schema.
 Cleaning, pre-processing techniques applied for missing data,
noisy data, inconsistent data (sounds familiar, I hope)
Data Warehousing
COMP207:
Data Mining
Data Warehouse

Time-variant:
 Has different values for the same fields over time.
 Operational database only has current value. Data
Warehouse offers historical values.

Nonvolatile:
 Physically separate store
 Updates not online, but in offline batch mode only
 Read only access required, so no concurrency issues
Data Warehousing
COMP207:
Data Mining
Data Warehouse
Data Warehouses are distinct from:


Distributed DB: Integrated via wrappers/mediators. Far too
slow, semantic integration much more complicated.
Integration done before loading, not at run time.
Operational DB: Only records current value, lots of extra non
useful information.
Different schemas/models, access patterns, users, functions,
even though the data is derived from an operational db.
Data Warehousing
COMP207:
Data Mining
OLAP vs OLTP
OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)
OLAP data typically: historical, consolidated, and multidimensional (eg: product, time, location).
Involves lots of full database scans, across terabytes or more of
data.
Typically aggregation and summarisation functions.
Distinctly different uses to OLTP on the operational database.
Data Warehousing
COMP207:
Data Mining
Data Cubes
Data is normally Multi-Dimensional,
and can be thought of as a cube.
Often: 3 dimensions of time,
location and product.
No need to have just 3
dimensions -- could have one
for cars with make, colour,
price, location, and time
for example.
Image courtesy of IBM OLAP Miner documentation
Data Warehousing
COMP207:
Data Mining



Data Cubes
Can construct many 'cuboids' from the full cube by excluding
dimensions.
In an N dimensional data cube, the cuboid with N dimensions is
the 'base cuboid'. A 0 dimensional cuboid (other than non
existent!) is called the 'apex cuboid'.
Can think of this as a lattice of cuboids...
(Following lattice courtesy of Han & Kamber)
Data Warehousing
COMP207:
Data Mining
Lattice of Cuboids
all
time
item
0-D(apex) cuboid
location supplier
1-D cuboids
time,item
time,location
item,location
time,supplier
time,item,location
location,supplier
2-D cuboids
item,supplier
time,location,supplier
3-D cuboids
time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
Data Warehousing
COMP207:
Data Mining
Multi-dimensional Units
Each dimension can also be thought of in terms of different units.
 Time: decade, year, quarter, month, day, hour
(and week,
which isn't strictly hierarchical with the others!)
 Location: continent, country, state, city, store
 Product: electronics, computer, laptop, dell, inspiron
This is called a “Star-Net” model in data warehousing, and allows
for various operations on the dimensions and the resulting
cuboids.
Data Warehousing
COMP207:
Data Mining
Star-Net Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
TRUCK
ORDER
PRODUCT LINE
Time
Product
ANNUALY
QTRLY
DAILY
PRODUCT ITEM
PRODUCT GROUP
DISTRICT
SALES PERSON
REGION
DISTRICT
COUNTRY
DIVISION
Geography
Promotion
Data Warehousing
Organization
COMP207:
Data Mining

Data Cube Operations
Roll Up: Summarise data by climbing up hierarchy.
Eg. From monthly to quarterly, from Liverpool to England

Drill Down: Opposite of Roll Up
Eg. From computer to laptop, from £100-999 to £100-199

Slice: Remove a dimension by setting a value for it
Eg. location/product where time is Q1,2007

Dice: Restrict cube by setting values for multiple dimensions
Eg. Q1,Q2 / North American cities / 3 products sub cube

Pivot: Rotate the cube (mostly for visualisation)
Data Warehousing
COMP207:
Data Mining



Data Cube Schemas
Star Schema: Single fact table in the middle, with connected set
of dimension tables
(Hence a star)
Snowflake Schema: Some of the dimension tables further
refined into smaller dimension tables
(Hence looks like a snow flake)
Fact Constellation: Multiple fact tables can share dimension
tables
(Hence looks like a collection of star schemas. Also called
Galaxy Schema)
Data Warehousing
COMP207:
Data Mining
Star Schema
Time Dimension
Item Dimension
time_key
day
day_of_week
month
quarter
year
item_key
name
brand
type
supplier_type
Sales Fact Table
time_key
item_key
location_key
units_sold
Loc.n Dimension
location_key
street
city
state
country
continent
Measure (value)
Data Warehousing
COMP207:
Data Mining
Snowflake Schema
Time Dimension
Item Dimension
time_key
day
day_of_week
month
quarter
year
item_key
name
brand
type
supplier_key
Sales Fact Table
time_key
item_key
location_key
units_sold
Loc Dimension
location_key
street
city_key
City Dimension
city_key
city
state
country
Measure (value)
Data Warehousing
COMP207:
Data Mining
Fact Constellation
Time Dimension
Item Dimension
time_key
day
day_of_week
month
quarter
year
item_key
name
brand
type
supplier_key
Sales Fact Table
time_key
item_key
Shipping Table
time_key
item_key
from_key
location_key
units_shipped
units_sold
Loc Dimension
location_key
street
city_key
City Dimension
city_key
city
state
country
Measure (value)
Data Warehousing
COMP207:
Data Mining
OLAP Server Types
ROLAP: Relational OLAP
 Uses relational DBMS to store and manage the warehouse data
 Optimised for non traditional access patterns
 Lots of research into RDBMS to make use of!
MOLAP: Multidimensional OLAP
 Sparse array based storage engine
 Fast access to precomputed data
HOLAP: Hybrid OLAP
 Mixture of both MOLAP and ROLAP
Data Warehousing
COMP207:
Data Mining
Data Warehouse Architecture
(also courtesy of Han & Kamber)
Other
sources
Operational
DBs
Metadata
Extract
Transform
Load
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP
Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine
Data Warehousing
Front-End Tools
COMP207:
Data Mining
Materialisation
In order to compute OLAP queries efficiently, need to materialise some of
the cuboids from the data.

None: Very slow, as need to compute entire cube at run time

Full: Very fast, but requires a LOT of storage space and time to
compute all possible cuboids

Partial: But which ones to materialise? Called an 'iceberg cube',
as only partially materialised and the rest is "below water".
Many cells in a cuboid will be empty, only materialise sections that
contain more values than a minimum threshold.
Data Warehousing
COMP207:
Data Mining

Further Reading
http://en.wikipedia.org/wiki/Data_warehouse
and subsequent links
Data Warehousing